Latest Posts (20 found)
Sean Goedecke Yesterday

How good engineers write bad code at big companies

Every couple of years somebody notices that large tech companies sometimes produce surprisingly sloppy code. If you haven’t worked at a big company, it might be hard to understand how this happens. Big tech companies pay well enough to attract many competent engineers. They move slowly enough that it looks like they’re able to take their time and do solid work. How does bad code happen?

I think the main reason is that big companies are full of engineers working outside their area of expertise. The average big tech employee stays for only a year or two¹. In fact, big tech compensation packages are typically designed to put a four-year cap on engineer tenure: after four years, the initial share grant is fully vested, and engineers face what can be a 50% pay cut. Companies do extend temporary yearly refreshes, but the structure obviously incentivizes engineers to go find another job where they don’t have to wonder each year whether they’re going to get the other half of their compensation.

If you count internal mobility, it’s even worse. The longest I have ever stayed on a single team or codebase was three years, near the start of my career. I expect to be re-orged at least every year, and often much more frequently. However, the average tenure of a codebase in a big tech company is a lot longer than that. Many of the services I work on are a decade old or more, and have had many, many different owners over the years.

That means many big tech engineers are constantly “figuring it out”. A pretty high percentage of code changes are made by “beginners”: people who have onboarded to the company, the codebase, or even the programming language in the past six months. To some extent, this problem is mitigated by “old hands”: engineers who happen to have been in the orbit of a particular system for long enough to develop real expertise. These engineers can give deep code reviews and reliably catch obvious problems. But relying on “old hands” has two problems.
First, this process is entirely informal. Big tech companies make surprisingly little effort to develop long-term expertise in individual systems, and once they’ve got it they seem to barely care at all about retaining it. Often the engineers in question are moved to different services, and have to either keep up their “old hand” duties on an effectively volunteer basis, or abandon them and become a relative beginner on a brand new system.

Second, experienced engineers are always overloaded. It is a busy job being one of the few engineers who has deep expertise on a particular service. You don’t have enough time to personally review every software change, or to be actively involved in every decision-making process. Remember that you also have your own work to do: if you spend all your time reviewing changes and being involved in discussions, you’ll likely be punished by the company for not having enough individual output.

Putting all this together, what does the median productive² engineer at a big tech company look like? They are usually:

- competent enough to pass the hiring bar and be able to do the work
- either working on a codebase or language that is largely new to them, or trying to stay on top of a flood of code changes while also juggling their own work

They are almost certainly working to a deadline, or to a series of overlapping deadlines for different projects. In other words, they are trying to do their best in an environment that is not set up to produce quality code.

That’s how “obviously” bad code happens. For instance, a junior engineer picks up a ticket for an annoying bug in a codebase they’re barely familiar with. They spend a few days figuring it out and come up with a hacky solution. One of the more senior “old hands” (if they’re lucky) glances over it in a spare half-hour, vetoes it, and suggests something slightly better that would at least work. The junior engineer implements that as best they can, tests that it works, it gets briefly reviewed and shipped, and everyone involved immediately moves on to higher-priority work.

Five years later somebody notices this³ and thinks “wow, that’s hacky - how did such bad code get written at such a big software company?”

I have written a lot about the internal tech company dynamics that contribute to this. Most directly, in Seeing like a software company I argue that big tech companies consistently prioritize internal legibility - the ability to see at a glance who’s working on what and to change it at will - over productivity. Big companies know that treating engineers as fungible and moving them around destroys their ability to develop long-term expertise in a single codebase. That’s a deliberate tradeoff. They’re giving up some amount of expertise and software quality in order to gain the ability to rapidly deploy skilled engineers onto whatever the problem-of-the-month is.

I don’t know if this is a good idea or a bad idea. It certainly seems to be working for the big tech companies, particularly now that “how fast can you pivot to something AI-related” is so important. But if you’re doing this, then of course you’re going to produce some genuinely bad code. That’s what happens when you ask engineers to rush out work on systems they’re unfamiliar with.

Individual engineers are entirely powerless to alter this dynamic. This is particularly true in 2025, when the balance of power has tilted away from engineers and towards tech company leadership. The most you can do as an individual engineer is to try and become an “old hand”: to develop expertise in at least one area, and to use it to block the worst changes and steer people towards at least minimally-sensible technical decisions. But even that is often swimming against the current of the organization, and if inexpertly done can cause you to get PIP-ed or worse.

I think a lot of this comes down to the distinction between pure and impure software engineering. To pure engineers - engineers working on self-contained technical projects, like a programming language - the only explanation for bad code is incompetence. But impure engineers operate more like plumbers or electricians. They’re working to deadlines on projects that are relatively new to them, and even if their technical fundamentals are impeccable, there’s always something about the particular setup of this situation that’s awkward or surprising. To impure engineers, bad code is inevitable. As long as the overall system works well enough, the project is a success.

At big tech companies, engineers don’t get to decide if they’re working on pure or impure engineering work. It’s not their codebase! If the company wants to move you from working on database infrastructure to building the new payments system, they’re fully entitled to do that. The fact that you might make some mistakes in an unfamiliar system - or that your old colleagues on the database infra team might suffer without your expertise - is a deliberate tradeoff being made by the company, not the engineer.

It’s fine to point out examples of bad code at big companies. If nothing else, it can be an effective way to get those specific examples fixed, since execs usually jump at the chance to turn bad PR into good PR. But I think it’s a mistake⁴ to attribute primary responsibility to the engineers at those companies. If you could wave a magic wand and make every engineer twice as strong, you would still have bad code, because almost nobody can come into a brand new codebase and quickly make changes with zero mistakes. The root cause is that most big company engineers are forced to do most of their work in unfamiliar codebases.

¹ I struggled to find a good original source on this. There’s a 2013 PayScale report citing a 1.1-year median turnover at Google, which seems low. ↩
² Many engineers at big tech companies are not productive, but that’s a post all to itself. I don’t want to get into it here for two reasons. First, I think competent engineers produce enough bad code that it’s fine to be a bit generous and just scope the discussion to them. Second, even if an incompetent engineer wrote the code, there’s almost always competent engineers who could have reviewed it, and the question of why that didn’t happen is still interesting. ↩
³ The example I’m thinking of here is not the recent GitHub Actions one, which I have no first-hand experience of. I can think of at least ten separate instances of this happening to me. ↩
⁴ In my view, mainly a failure of imagination: thinking that your own work environment must be pretty similar to everyone else’s. ↩

Sean Goedecke 4 days ago

Becoming unblockable

With enough careful effort, it’s possible to become unblockable. In other words, you can put yourself in a position where you’re always able to make forward progress on your goals. I wrote about this six months ago in Why strong engineers are rarely blocked, but I wanted to take another crack at it and give some more concrete advice.

The easiest way to avoid being blocked is to have more than one task on the go. Like a CPU running multiple threads, if you’re responsible for multiple streams of work, you can deal with one stream getting blocked by rolling onto another one. While one project might be blocked, you are not: you can continue getting stuff done. Because of this, I almost always have more than one task on my plate.

However, there’s a lot of nuance involved in doing this correctly. The worst thing you can do is to be responsible for two urgent tasks at the same time - no matter how hard you work, one of them will always be making no progress, which is very bad¹. If you’ve got too many ongoing tasks at the same time, you also risk overloading yourself if one or two of them suddenly blow out. It’s famously hard to scope engineering work. In a single day, you can go from having two or three trivial tasks to having three big jobs at the same time.

I do not recommend just mindlessly picking up an extra ticket from your project board. Instead, try to have some non-project work floating around: refactors, performance work, writing performance reviews, mandatory training, and so on. It can be okay to pick up an extra ticket if you’re tactical about which ticket you pick up. Try to avoid having two important tasks on the go at the same time.

Plan out projects from the start to minimize blockers. This section is more relevant for projects that you yourself are running, but the principle holds even for smaller pieces of work.
If you think something is likely to get blocked (for instance, maybe database migrations at your company are run by a dedicated team with a large backlog), do it as early as possible. That way you can proceed with the rest of the project while you wait. Getting this wrong can add weeks to a project. Likewise, if there’s a part of your project that’s likely to be controversial, do it early so you can keep working on the rest of the project while the debate rages on.

Do whatever it takes to have a stable and reliable developer environment. I don’t think it’s possible to overstate the importance of this. The stability of your developer environment directly determines how much of a workday you can spend actually doing work. For instance, use as normal a developer stack as possible. At GitHub, most development is done in Codespaces, a platform for server-hosted dev containers. You can connect to a codespace with almost any IDE, but the majority of people use VSCode, so I use VSCode, with as few plugins as possible². I think a lot of developers are too focused on their personal “top speed” with their developer environment when everything is working great, and under-emphasize how much time they spend tweaking config, patching dotfiles, and troubleshooting in general.

Fix developer environment problems as quickly as production incidents. If you can’t run tests or run a local server, don’t half-ass the troubleshooting process - focus on it until it’s fixed. On the flip side, don’t treat it as a leisurely learning experience (say, about how macOS handles Dockerized networking). In many circumstances you’re probably better off tearing down and re-creating everything than digging in and trying to patch the specific issue.

If your developer environment really is irreparably broken - maybe you’re waiting on new hardware, or you’re making a one-off change to a service that you don’t have the right dev environment permissions for - be scrappy about finding alternatives.
If you can’t run tests, your GitHub CI probably can. If you can’t run a server locally, can you deploy to a staging environment and test there? Be careful about doing this in your main developer environment. You’re usually better off spending the time to actually fix the problem. But when you can’t, you should be creative about how you can keep working instead of just giving up.

I see a lot of engineers run into a weird thing - commonly a 403 or 400 status code from some other service - and say “oh, I’m blocked, I need this other service’s owners to investigate”. You can and should investigate yourself. This is particularly true if you’ve got access to the codebase. If you’re getting an error, go and search their codebase to see what could be causing the error. Find the logs for your request to see if there’s anything relevant there. Of course, you won’t be able to dig as deep as engineers with real domain expertise, but often it doesn’t take domain expertise to solve your particular problem.

There’s even less excuse not to do this now that AI agents are ubiquitous. Point Codex (or Copilot agent mode, or Claude Code, or whatever you have access to) at the codebase in question and ask “why might I be seeing this error with this specific request?” In my experience, you get the correct answer about a third of the time, which is amazing. Instead of waiting for hours or days to get help, you can spend ten minutes waiting for the agent and half an hour checking its work.

Even if you can’t solve the problem yourself, a bit of research can often make your request for help much more compelling. As a service owner, there’s nothing more dispiriting than getting a “help, I get weird 400 errors” message - you know you’re going to spend a lot of time trawling through the logs before you can even figure out what the problem is, let alone how to reproduce it.
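That kind of self-investigation needs no special tooling. As a minimal sketch (the directory name and error string here are invented for illustration), searching an unfamiliar codebase for the exact error text you’re seeing looks like this:

```python
# Sketch: before pinging the owning team about a mysterious 403,
# search their codebase for the error text you're seeing.
import os

def find_error_source(root: str, needle: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) for every line containing `needle`."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    for i, line in enumerate(f, start=1):
                        if needle in line:
                            hits.append((path, i, line.strip()))
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
    return hits

# Hypothetical usage: where does "request not permitted" come from?
# hits = find_error_source("path/to/their-service", "request not permitted")
```

In practice `grep -rn "request not permitted" their-service/` does the same job; the point is simply that nothing stops you from running it against a codebase you don’t own.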
But if the message already contains a link to the logs, or the text of a specific error, that immediately tells you where to start looking.

There are typically two ways to do anything in a large tech company: the formal, legible way, and the informal way. As an example, it’s common to have an “ask for code review” Slack channel, which is full of engineers posting their changes. But many engineers don’t use these channels at all. Instead, they ping each other for immediate reviews, which is a much faster process.

Of course, you can’t just DM random engineers asking them to review your PR. It might work in the short term, but people will get really annoyed with you. You have to build relationships with engineers on every codebase you’d like to work on. If you’re extremely charismatic, maybe you can accomplish this with sheer force of will. But the rest of us have to build relationships by being useful: giving prompt and clear responses to questions from other teams, investigating bugs for them, reviewing their code, and so on.

The most effective engineers at a tech company typically have really strong relationships with engineers on many other teams. That isn’t to say that they operate entirely through backchannels, just that they have personal connections they can draw on when needed. If you’re blocked on work that another team is doing, it makes a huge difference having “someone on the inside”.

Almost all blockers at large tech companies can be destroyed with sufficient “air support”. Typically this means a director or VP who’s aware of your project and is willing to throw their weight around to unblock you. For instance, they might message the database team’s manager saying “hey, can you prioritize this migration”, or task their very-senior-engineer direct report with resolving some technical debate that’s delaying your work.
You can’t get air support for everything you’d like to do - it just doesn’t work like that, unless the company is very dysfunctional or you have a very good relationship with a senior manager. But you can choose to do things that align with what senior managers in the organization want, which can put you in a position to request support and get it. I wrote about this a lot more in How I influence tech company politics as a staff software engineer, but in one sentence: the trick is to have a bunch of possible project ideas in your back pocket, and then choose the ones that align with whatever your company cares about this month.

Many engineers just don’t make use of the powerful allies they have. If you’re working on a high-priority project, the executive in charge of that project is unlikely to have the bandwidth to follow your work closely. They will be depending on you to go and tell them if you’re blocked and need their help. Unlike the relationships you may have with engineers on different teams, requesting air cover does not spend any credit. In fact, it often builds it, by showing that you’re switched-on enough to want to be unblocked, and savvy enough to know you can ask for their help. Senior managers are usually quite happy to go and unblock you, if you’re clear enough about what exactly you need them to do.

To minimize the amount of time you spend blocked, I recommend:

- Working on at least two things at a time, so when one gets blocked you can switch to the other
- Sequencing your work so potential blockers are discovered and started early
- Making a reliable developer environment a high priority, including avoiding unusual developer tooling
- Being willing to debug into other services that you don’t own
- Building relationships with engineers on other teams
- Making use of very senior managers to unblock you, when necessary

¹ At some point somebody important will ask “why isn’t this task making any progress”, and you do not want the answer to be “I was working on something else”. ↩
² Before I joined GitHub, I worked entirely inside a terminal and neovim. I switched to VSCode entirely because of Codespaces. If I joined another company where most developers used JetBrains, I would immediately switch to JetBrains. ↩

Sean Goedecke 1 week ago

Why it takes months to tell if new AI models are good

Nobody knows how to tell if current-generation models are any good. When GPT-5 launched, the overall mood was very negative, and the consensus was that it wasn’t a strong model. But three months later it turns out that GPT-5 (and its derivative GPT-5-Codex) is a very strong model for agentic work¹: enough to break Anthropic’s monopoly on agentic coding models. In fact, GPT-5-Codex is my preferred model for agentic coding. It’s slower than Claude Sonnet 4.5, but in my experience it gets more hard problems correct. Why did it take months for me to figure this out?

The textbook solution for this problem is evals - datasets of test cases that models can be scored against - but evals are largely unreliable. Many models score very well on evals but turn out to be useless in practice. There are a couple of reasons for this.

First, it’s just really hard to write useful evals for real-world problems, since real-world problems require an enormous amount of context. Can’t you take previous real-world problems and put them in your evals - for instance, by testing models on already-solved open-source issues? You can, but you run into two difficulties:

- Open-source coding is often meaningfully different from the majority of programming work. For more on this, see my comments in METR’s AI productivity study is really good, where I discuss an AI-productivity study that was done on open-source codebases.
- You’re still only covering a tiny subset of all programming work. For instance, the well-known SWE-Bench set of coding evals are just in Python. A model might be really good at Python but struggle with other languages.

Another problem is that evals are a target for AI companies. How well Anthropic or OpenAI’s new models perform on evals has a direct effect on the stock price of those companies. It’d be naive to think that they don’t make some kind of effort to do well on evals: if not by directly training on public eval data², then by training on data that’s close enough to eval data to produce strong results. I’m fairly confident that big AI companies will not release a model unless they can point to a set of evals that their model does better than competitors. So you can’t trust that strong evals will mean a strong model, because every single new model is released with strong evals.

If you can’t rely on evals to tell you if a new model is good, what can you rely on? For most people, the answer is the “vibe check”: interacting with the model themselves and making their own judgement. Often people use a set of their own pet questions, which are typically questions that other LLMs get wrong (say, word puzzles). Trick questions can be useful, but plenty of strong models struggle with specific trick questions for some reason. My sense is also that current models are too strong for obvious word puzzles. You used to be able to trip up models with straightforward questions like “If I put a ball in a box, then put the box in my pocket, where is the ball?” Now you have to be more devious, which gives less signal about how strong the model is.

Sometimes people use artistic prompts. Simon Willison famously asks new models to produce a SVG of a pelican riding a bicycle. It’s now a common Twitter practice to post side-by-side “I asked two models to build an object in Minecraft” screenshots. This is cool - you can see at a glance that bigger models produce better images - but at some point it becomes difficult to draw conclusions from the images. If Claude Sonnet 4.5 puts the pelican’s feet on the pedals correctly, but GPT-5.1 adds spokes to the wheels, which model is better?

Finally, many people rely on pure vibes: the intangible sense you get after using a model about whether it’s good or not. This is sometimes described as “big model smell”. I am fairly agnostic about people’s ability to determine model capability from vibes alone. It seems like something humans might be able to do, but also like something that would be very easy to fool yourself about. For instance, I would struggle to judge a model with the conversational style of GPT-4o as very smart, but there’s nothing in principle that would prevent that.

Of course, for people who engage in intellectually challenging pursuits, there’s an easy (if slow) way to evaluate model capability: just give it the problems you’re grappling with and see how it does. I often ask a strong agentic coding model to do a task I’m working on in parallel with my own efforts. If the model fails, it doesn’t slow me down much; if it succeeds, it catches something I don’t, or at least gives me a useful second opinion.

The problem with this approach is that it takes a fair amount of time and effort to judge if a new model is any good, because you have to actually do the work: if you’re not engaging with the problem yourself, you will have no idea if the model’s solution is any good or not. So testing out a new model can be risky. If it’s no good, you’ve wasted a fair amount of time and effort! I’m currently trying to decide whether to invest this effort into testing out Gemini 3 Pro or GPT-5.1-Codex - right now I’m still using GPT-5-Codex for most tasks, or Claude Sonnet 4.5 on some simpler problems.

Each new model release reignites the debate over whether AI progress is stagnating. The most prominent example is Gary Marcus, who has written that GPT-4, GPT-4o, Claude 3.5 Sonnet, GPT-5 and DeepSeek all prove that AI progress has hit a wall. But almost everyone who writes about AI seems to be interested in the topic. Each new model launch is watched to see if this is the end of the bubble, or if LLMs will continue to get more capable. The reason this debate never ends is that there’s no reliable way to tell if an AI model is good.

Suppose that base AI models were getting linearly smarter (i.e. that GPT-5 really was as far above GPT-4 as GPT-4 was above GPT-3.5, and so on). Would we actually be able to tell? When you’re talking to someone who’s less smart than you³, it’s very clear. You can see them failing to follow points you’re making, or they just straight up spend time visibly confused and contradicting themselves. But when you’re talking to someone smarter than you, it’s far from clear (to you) what’s going on. You can sometimes feel that you’re confused by what they say, but that doesn’t necessarily mean they’re smarter. It could be that they’re just talking nonsense. And smarter people won’t confuse you all the time - only when they fail to pitch their communication at your level.

Talking with AI models is like that. GPT-3.5 was very clearly less smart than most of the humans who talked to it. It was mainly impressive that it was able to carry on a conversation at all. GPT-4 was probably on par with the average human (or a little better) in its strongest domains. GPT-5 (at least in thinking mode) is smarter than the average human across most domains, I believe.

Suppose we had no objective way of measuring chess ability. Would I be able to tell if computer chess engines were continuing to get better? I’d certainly be impressed when the chess engines went from laughably bad to beating me every time. But I’m not particularly good at chess. I would lose to chess engines from the early 1980s. It would thus seem to me as if chess engine progress had stalled out, when in fact modern chess engines have double the rating of chess engines from the 1980s.

I acknowledge that “the model is now at least partly smarter than you” is an underwhelming explanation for why AI models don’t appear to be rapidly getting better. It’s easy to point to cases where even strong models fall over. But it’s worth pointing out that if models were getting consistently smarter, this is what it would look like: rapid subjective improvement as the models go from less intelligent than you to on par with you, and then an immediate plateau as the models surpass you and you become unable to tell how smart they are.

To sum up:

- Nobody knows how good a model is when it’s launched. Even the AI lab who built it are only guessing and hoping it’ll turn out to be effective for real-world use cases.
- Evals are mostly marketing tools. It’s hard to figure out how good the eval is, or if the model is being “taught to the test”. If you’re trying to judge models from their public evals you’re fighting against the billions of dollars of effort going into gaming the system.
- Vibe checks don’t test the kind of skills that are useful for real work, but testing a model by using it to do real work takes a lot of time. You can’t figure out if a brand new model is good that way.
- Because of all this, it’s very hard to tell if AI progress is stagnating or not. Are the models getting better? Are they any good right now? Compounding that problem, it’s hard to judge between two models that are both smarter than you (in a particular domain). If the models do keep getting better, we might expect it to feel like they’re plateauing, because once they get better than us we’ll stop seeing evidence of improvement.

¹ By “agentic work” I mean “LLM with tools that runs in a loop”, like Copilot Agent Mode, Claude Code, and Codex. I haven’t yet tried GPT-5.1-Codex enough to have a strong opinion. ↩
² If you train a model on the actual eval dataset itself, it will get very good at answering those specific questions, even if it’s not good at answering those kinds of questions. This is often called “benchmaxxing”: prioritizing evals and benchmarks over actual capability. ↩
³ I want to bracket the question of whether “smart” is a broad category, or how exactly to define it. I’m talking specifically about the way GPT-4 is smarter than GPT-3.5 - even if we can’t define exactly how, we know that’s a real thing. ↩

Sean Goedecke 2 weeks ago

Only three kinds of AI products actually work

The very first LLM-based product, ChatGPT, was just¹ the ability to talk with the model itself: in other words, a pure chatbot. This is still the most popular LLM product by a large margin. In fact, given the amount of money that’s been invested in the industry, it’s shocking how many “new AI products” are just chatbots. As far as I can tell, there are only three types of AI product that currently work.

For the first couple of years of the AI boom, all LLM products were chatbots. They were branded in a lot of different ways - maybe the LLM knew about your emails, or a company’s helpdesk articles - but the fundamental product was just the ability to talk in natural language to an LLM.

The problem with chatbots is that the best chatbot product is the model itself. Most of the reason users want to talk with an LLM is generic: they want to ask questions, or get advice, or confess their sins, or do any one of a hundred things that have nothing to do with your particular product. In other words, your users will just use ChatGPT². AI labs have two decisive advantages over you: first, they will always have access to the most cutting-edge models before you do; and second, they can develop their chatbot harness simultaneously with the model itself (like how Anthropic specifically trains their models to be used in Claude Code, or OpenAI trains their models to be used in Codex).

One way your chatbot product can beat ChatGPT is by doing what OpenAI won’t do: for instance, happily roleplaying an AI boyfriend or generating pornography. There is currently a very lucrative niche of products like this, which typically rely on less-capable but less-restrictive open-source models. These products have the problems I discussed above. But it doesn’t matter that their chatbots are less capable than ChatGPT or Claude: if you’re in the market for sexually explicit AI roleplay, and ChatGPT and Claude won’t do it, you’re going to take what you can get.
I think there are serious ethical problems with this kind of product. But even practically speaking, this is a segment of the industry likely to be eaten alive by the big AI labs, as they become more comfortable pushing the boundaries of adult content. Grok Companions is already going down this pathway, and Sam Altman has said that OpenAI models will be more open to generating adult content in the future.

There’s a slight variant on chatbots which gives the model tools: so instead of just chatting with your calendar, you can ask the chatbot to book meetings, and so on. This kind of product is usually called an “AI assistant”. This doesn’t work well because savvy users can manipulate the chatbot into calling tools. So you can never give a support chatbot real support powers like “refund this customer”, because the moment you do, thousands of people will immediately find the right way to jailbreak your chatbot into giving them money. You can only give your chatbots tools that the user could use themselves - in which case, your chatbot is competing with the usability of your actual product, and will likely lose.

Why will your chatbot lose? Because chat is not a good user interface. Users simply do not want to type out “hey, can you increase the font size for me” when they could simply hit “ctrl-plus” or click a single button 3. I think this is a hard lesson for engineers to learn. It’s tempting to believe that since chatbots have gotten 100x better, they must now be the best user interface for many tasks. Unfortunately, they started out 200x worse than a regular user interface, so they’re still twice as bad.

The second real AI product actually came out before ChatGPT did: GitHub Copilot. The idea behind the original Copilot product (and all its imitators, like Cursor Tab) is that a fast LLM can act as a smart autocomplete.
By feeding the model the code you’re typing as you type it, a code editor can suggest autocompletions that actually write the rest of the function (or file) for you. The genius of this kind of product is that users never have to talk to the model. Like I said above, chat is a bad user interface. LLM-generated completions allow users to access the power of AI models without having to change any part of their current workflow: they simply see the kind of autocomplete suggestions their editor was already giving them, but far more powerful.

I’m a little surprised that completions-based products haven’t taken off outside coding (where they immediately generated a multi-billion-dollar market). Google Docs and Microsoft Word both have something like this. Why isn’t there more hype around this?

The third real AI product is the coding agent. People have been talking about this for years, but it was only really in 2025 that the technology behind coding agents became feasible (with Claude Sonnet 3.7, and later GPT-5-Codex). Agents are kind of like chatbots, in that users interact with them by typing natural language text. But they’re unlike chatbots in that you only have to do that once: the model takes your initial request and goes away to implement and test it all by itself.

The reason agents work and chatbots-with-tools don’t is the difference between asking an LLM to hit a single button for you and asking it to hit a hundred buttons in a specific order. Even though each individual action would be easier for a human to perform, agentic LLMs are now smart enough to take over the entire process. Coding is a natural fit for AI agents for two reasons:

For my money, the current multi-billion-dollar question is: can AI agents be useful for tasks other than coding? Bear in mind that Claude Sonnet 3.7 was released just under nine months ago. In that time, the tech industry has successfully built agentic products about its own work.
They’re just starting to build agentic products for other tasks. It remains to be seen how successful that will be, or what those products will look like.

There’s another kind of agent that isn’t about coding: the research agent. LLMs are particularly good at tasks like “skim through ten pages of search results” or “keyword search this giant dataset for any information on a particular topic”. I use this functionality a lot for all kinds of things. There are a few examples of AI products built on this capability, like Perplexity. In the big AI labs, this has been absorbed into the chatbot products: OpenAI’s “deep research” went from a separate feature to just what GPT-5-Thinking does automatically, for instance. I think there’s almost certainly potential here for area-specific research agents (e.g. in medicine or law).

If agents are the most recent successful AI product, AI-generated feeds might be the one just over the horizon. AI labs are currently experimenting with ways of producing infinite feeds of personalized content for their users:

So far, none of these have taken off. But scrolling feeds has become the primary way users interact with technology in general, so the potential here is massive. It does not seem unlikely to me at all that in five years’ time most internet users will spend a big part of their day scrolling an AI-generated feed.

Like a completions-based product, the advantage of a feed is that users don’t have to interact with a chatbot. The inputs to the model come from how the user interacts with the feed (likes, scrolling speed, time spent looking at an item, and so on). Users can experience the benefits of an LLM-generated feed (if any) without having to change their consumption habits at all. The technology behind current human-generated infinite feeds is already a mature application of state-of-the-art machine learning.
When you interact with Twitter or LinkedIn, you’re interacting with a model, except instead of generating text it’s generating lists of other people’s posts. In other words, feeds already maintain a sophisticated embedding of your personal likes and dislikes. The step from “use that embedding to surface relevant content” to “use that embedding to generate relevant content” might be very short indeed.

I’m pretty suspicious of AI-generated infinite feeds of generated video, but I do think other kinds of infinite feeds are an under-explored kind of product. In fact, I built a feed-based hobby project of my own, called Autodeck 4. The idea was to use an AI-generated feed to generate spaced repetition cards for learning. It works pretty well! It still gets a reasonable amount of use from people who’ve found it via my blog (also, from myself and my partner).

One other kind of AI product that people have been talking about for years is the AI-based video game. The most speculative efforts in this direction have been full world simulations like DeepMind’s Genie, but people have also explored using AI to generate a subset of game content, such as pure-text games like AI Dungeon or this Skyrim mod which adds AI-generated dialogue. Many more game developers have incorporated AI art or audio assets into their games.

Could there be a transformative product that incorporates LLMs into video games? I don’t think ARC Raiders counts as an “AI product” just because it uses AI voice lines, and the more ambitious projects haven’t yet really taken off. Why not?

One reason could be that games just take a really long time to develop. When Stardew Valley took the world by storm in 2016, I expected a flood of copycat cozy pixel-art farming games, but that only really started happening in 2018 and 2019. That’s how long it takes to make a game! So even if someone has a really good idea for an LLM-based video game, we’re probably still a year or two out from it being released.
Another reason is that many gamers really don’t like AI. Including generative AI in your game is a guaranteed controversy (though it doesn’t seem to be fatal, as the success of ARC Raiders shows). I wouldn’t be surprised if some game developers simply don’t think it’s worth the risk to try an AI-based game idea 5.

A third reason could be that generated content is just not a good fit for gaming. Certainly ChatGPT-like dialogue sticks out like a sore thumb in most video games. AI chatbots are also pretty bad at challenging the user: their post-training is all working to make them try to satisfy the user immediately 6. Still, I don’t think this is an insurmountable technical problem. You could simply post-train a language model in a different direction (though perhaps the necessary resources for that haven’t yet been made available to gaming companies).

By my count, there are three successful types of language model product:

On top of that, there are two kinds of LLM-based product that don’t work yet but may soon:

Almost all AI products are just chatbots (e.g. AI-powered customer support). These suffer from having to compete with ChatGPT, which is a superior general product, and from not being able to use powerful tools, because users will be able to easily jailbreak the model.

Agentic products are new, and have been wildly successful for coding. It remains to be seen what they’ll look like in other domains, but we’ll almost certainly see domain-specific research agents in fields like law. Research agents in coding have seen some success as well (e.g. code review or automated security scanning products).

Infinite AI-generated feeds haven’t yet been successful, but hundreds of millions of dollars are currently being poured into them. Will OpenAI’s Sora be a real competitor to Twitter or Instagram, or will those companies release their own AI-generated feed product?
AI-generated games sound like they could be a good idea, but there’s still no clear working strategy for how to incorporate LLMs into a video game. Pure world models - where the entire game is generated frame-by-frame - are cool demos but a long way from being products.

One other thing I haven’t mentioned is image generation. Is this part of a chatbot product, or a tool in itself? Frankly, I think AI image generation is still more of a toy than a product, but it’s certainly seeing a ton of use. There’s probably some fertile ground for products here, if they can successfully differentiate themselves from the built-in image generation in ChatGPT.

In general, it feels like the early days of the internet. LLMs have so much potential, but we’re still mostly building copies of the same thing. There have to be some really simple product ideas that we’ll look back on and think “that’s so obvious, I wonder why they didn’t do it immediately”.
Maybe the answer is that the people using this product don’t engage with AI online spaces, and are just quietly using the product? Maybe there’s something about normal professional writing that’s less amenable to autocomplete than code? I doubt that, since so much normal professional writing is being copied out of a ChatGPT window. It could be that code editors already had autocomplete, so users were familiar with it. I bet autocomplete is brand-new and confusing to many Word users.

The two reasons coding is a natural fit for AI agents:

- It’s easy to verify changes by running tests or checking if the code compiles
- AI labs are incentivized to produce effective coding models to accelerate their own work

Current experiments with AI-generated feeds:

- Mark Zuckerberg has talked about filling Instagram with auto-generated content
- OpenAI has recently launched a Sora-based video-gen feed
- OpenAI has also started pushing users towards “Pulse”, a personalized daily update inside the ChatGPT product
- xAI is working on putting an infinite image and video feed into Twitter

The three successful types of language model product:

- Chatbots like ChatGPT, which are used by hundreds of millions of people for a huge variety of tasks
- Completions coding products like Copilot or Cursor Tab, which are very niche but easy to get immediate value from
- Agentic products like Claude Code, Codex, Cursor, and Copilot Agent mode, which have only really started working in the last six months

The two kinds of LLM-based product that don’t work yet but may soon:

- LLM-generated feeds
- Video games that are based on AI-generated content

1. Of course, “just” here covers a raft of progress in training stronger models, and real innovations around RLHF, which made it possible to talk with pure LLMs at all. ↩
2. This is a big reason why most AI enterprise projects fail. Anecdotally, I have heard a lot of frustration with bespoke enterprise chatbots. People just want to use ChatGPT! ↩
3. If you’re not convinced, take any device you’re comfortable using (say, your phone, your car, your microwave) and imagine having to type out every command. Maybe really good speech recognition will fix this, but I doubt it. ↩
4. I wrote about it here and it’s linked in the topbar. ↩
5. Though this could be counterbalanced by what I’m sure is a strong push from executives to get in on the action and “build something with AI”. ↩
6. If you’ve ever tried to ask ChatGPT to DM for you, you’ll have experienced this first-hand: the model will immediately try and show you something cool, skipping over the necessary dullness that builds tension and lends verisimilitude. ↩

Sean Goedecke 2 weeks ago

Writing for AIs is a good way to reach more humans

There’s an idea going around right now about “writing for AIs”: writing as if your primary audience is not human readers, but the language models that will be trained on the content of your posts. Why would anyone do this? For the same reason you might want to go on podcasts or engage in SEO: to get your core ideas in front of many more people than would read your posts directly.

If you write to make money, writing for AI is counterproductive. Why would anyone buy your writing if they can get a reasonable facsimile of it for free out of ChatGPT? If you write in order to express yourself in poetry, writing for AI might seem repulsive. I certainly find language model attempts at poetry to be off-putting 1. But if you write to spread ideas, I think writing for AI makes a lot of sense.

I don’t write this blog to make money or to express myself in beautiful language. I write because I have specific things I want to say:

While it’s nice that some people read about this stuff on my website, I would be just as happy if they read about it elsewhere: via word-of-mouth from people who’ve read my posts, in the Google preview banners, or in my email newsletter (where I include the entire post content so people never have to click through to the website at all). I give blanket permission to anyone who asks to translate and rehost my articles. Likewise, I would be just as happy for people to consume these ideas via talking with an LLM.

In 2022, Scott Alexander wrote that the purpose of a book is not to produce the book itself. Instead, a book acts as a “ritual object”: a reason to hold a public relations campaign that aims to “burn a paragraph of text into the public consciousness” via TV interviews and magazine articles. Likewise, I think it’s fair to say that the purpose of a technical blog post is not to be read, but to be a ritual object that gives people a reason to discuss a single idea.
Take Jeff Atwood’s well-known post that popularized Foote and Yoder’s original idea of the “big ball of mud”. Or Joel Spolsky’s The Law of Leaky Abstractions, which popularized the idea that “all abstractions leak”. Or Tanya Reilly’s Being Glue, which popularized the term “glue work” to describe the under-rewarded coordination work that holds teams together 2. Many, many more people are familiar with these ideas and terms than have read the original posts. They have sunk into the public consciousness via repeated discussion in forums, Slack channels, and so on, in the same way that the broad idea of a non-fiction book sinks in via secondary sources 3.

Large language models do read all these books and blog posts. But what they read in much greater quantities is people talking about these books and blog posts (at least via other articles, if they’re not being explicitly trained on Hacker News and Reddit comments). If you write a popular blog, your ideas will thus be over-represented in the training data. For instance, when someone asks about coordination work, GPT-5 immediately calls it “glue work”.

When engineers talk to language models about their work, I would like those models to be informed by my posts, either via web search or by inclusion in the training data. As models get better, I anticipate people using them more (for instance, via voice chat). That’s one reason why I’ve written so much this year: I want to get my foothold in the training data as early as possible, so my ideas can be better represented by language models long-term.

Of course, there are other reasons why people might want to be represented in the training data. Scott Alexander lists three: teaching AIs what you know, trying to convince AIs of what you believe, and providing enough personal information that the superintelligent future AI will be able to simulate you accurately later on.
I’m really only moved by the first reason: teaching AIs what I believe so they can share my ideas with other human readers. I agree with Scott that writing in order to shape the personality of future AIs is pointless. It might just be impossible - for instance, maybe even the most persuasive person in the world can’t argue hard enough to outweigh millennia of training data, or maybe any future superintelligence will be too smart to be influenced by mere humans. If it does turn out to be possible, AI companies will likely take control of the process and deliberately convince their models of whatever set of beliefs is most useful to them 4.

Okay, suppose you’re convinced that it’s worth writing for AIs. What does that mean? Do you need to write any differently? I don’t think so. When I say “writing for AIs”, I mean:

The first point is pretty self-explanatory: the more you write, the more of your content will be represented in future AI training data. So even if you aren’t getting a lot of human traffic in the short term, your long-term reach will be much greater. Of course, you shouldn’t just put out a high volume of garbage, for two reasons: first, because having your work shared and repeated by humans is likely to increase your footprint in the training set; and second, because what would be the point of increasing the reach of garbage?

The second point is that putting writing behind a paywall means it’s going to be harder to train on. This is one reason why I’ve never considered having paid subscribers to my blog. Relatedly, I think it’s also worth avoiding fancy Javascript-only presentation which would make it harder to scrape your content - for instance, an infinite-scroll page, or a non-SSR-ed single-page application.

Tyler Cowen has famously suggested being nice to the AIs in an attempt to get them to pay more attention to your work. I don’t think this works, any more than being nice to Google results in your pages getting ranked higher in Google Search.
AIs do not make conscious decisions about what to pull from their training data. They are influenced by each piece of data to the extent that (a) it’s represented in the training set, and (b) it aligns with the overall “personality” of the model. Neither of those things is likely to be affected by how pro-AI your writing is. I recommend just writing how you would normally write.

Much of the “why write for AIs” discussion is dominated by far-future speculation, but there are much more straightforward reasons to write for AIs: for instance, so that they’ll help spread your ideas more broadly. I think this is a good reason to write more and to make your writing accessible (i.e. not behind a paywall). But I wouldn’t recommend changing your writing style to be more “AI-friendly”: we don’t have much reason to think that works, and if it makes your writing less appealing to humans it’s probably not a worthwhile tradeoff.

There may be some particular writing style that’s appealing to AIs. It wouldn’t surprise me if we end up in a war between internet writers and AI labs, in the same way that SEO experts are at war with Google’s search team. I just don’t think we know what that writing style is yet.
The things I want to say:

- That the fundamental nature of tech work has changed since like 2023 and the end of ZIRP
- That emotional regulation is at least as important as technical skill for engineering performance
- That large tech companies do not function by their written rules, but instead by complex networks of personal incentives
- That you should actually read the papers that drive engineering conversations, because they often say the exact opposite of how they’re popularly interpreted

What “writing for AIs” means:

- Writing more than you normally would, and
- Publishing your writing where it can be easily scraped for AI training data and discovered by AI web searchers

1. It’s slop in the truest sense. ↩
2. Technically this was a talk, not a blog post. I also wrote about glue work and why you should be wary of it in Glue work considered harmful. ↩
3. For instance, take my own Seeing like a software company, which expresses the idea of the book Seeing Like a State in the first three lines. ↩
4. There’s also the idea that by contributing to the training data, future AIs will be able to simulate you, providing a form of immortality. I find it tough to take this seriously. Or perhaps I’m just unwilling to pour my innermost heart out into my blog. Any future simulation of me from these posts would only capture a tiny fraction of my personality. ↩

Sean Goedecke 3 weeks ago

To get better at technical writing, lower your expectations

Technical writing is a big part of a software engineer’s job. This is more true the more senior you get. In the limit case, a principal or distinguished engineer might only write technical documents, but even brand-new junior engineers need to write: commit messages, code comments, PR descriptions and comments, Slack threads, internal announcements, documentation, runbooks, and so on. Whether you write well or badly matters a lot.

The primary rule about technical writing is that almost none of your readers will pay much attention. Your readers will typically read the first sentence, skim the next one, and then either skim the rest or stop reading entirely. You should thus write as little as possible. If you can communicate your idea in a single sentence, do that - there’s a high chance that people will actually read it.

What if you can’t communicate all the details in so few words? In fact, this is a feature, not a bug. You should deliberately omit many subtle details.

This is the biggest difference between technical writing and code. Each line of code you write is as important to the computer as any other. The compiler or interpreter will methodically go through every single detail you put into your codebase and incorporate it into the final product. If there’s a subtle point you want to make - say, that prorated monthly billing should account for the fact that some months are longer than others - you can and should spend a paragraph of code articulating it. But each point you make when talking to humans consumes a limited attention budget 1. Because of that, it’s usually wise to leave out subtle points entirely.

One consequence of this is that you should frontload all important information. If you spend a paragraph providing context and throat-clearing, many of your readers will have stopped reading (or started skimming) by the time you make your actual point. Try to get to your point in the very first sentence.
It’s even better to get your point into the title, like an academic paper does.

Many engineers refuse to deliberately leave out information, because they think they shouldn’t need to. They believe their technical colleagues are thoughtful, conscientious people who will pay careful attention to everything they read and attempt to come to a full understanding of it. In other words, many engineers have far too high expectations of what their technical writing can accomplish.

For instance, your technical writing is not going to transplant your understanding of a system into somebody else’s head. It simply does not work like that. Understanding of technical systems is won only through painstaking concrete effort: you have to interact with the system, read the code, make changes to it, and so on. At best, good technical writing will give someone enough rough context to get a fuzzy understanding of what you’re suggesting. At worst, good technical writing will convey to the reader that at least you know what you’re talking about, so they may as well trust you even if they’re still confused.

Your technical writing is also not going to get everybody on the same page. Disagreement and confusion in large organizations is not dysfunction but function: it is the normal baseline of operation, just like some constant degree of failure is part of the normal baseline of a complex distributed system. You should not expect your technical writing to fix this problem.

So what can good technical writing do? It can communicate a very simple technical point to a broad audience. For instance, you can explain “adding new settings requires a database migration, so it cannot be done dynamically” to your Product org, and thus prevent them from suggesting impossible things 2. However, be aware that the actual point being communicated is probably even simpler than that: something like “adding new settings is hard”, or even “settings are complicated”.
This is a fairly pessimistic view about the usefulness of technical writing. But I do think that even this limited goal is surprisingly important in a large organization. The baseline level of technical confusion is so high that communicating even obvious things is high-leverage. I wrote about this in Providing technical clarity to non-technical leaders: even the people leading engineering organizations are often operating on a really low level of technical context, because they simply have so many other things to worry about.

Good technical writing can also communicate a reasonably complex technical point to a tiny audience. For instance, you can write an ADR that goes into many subtle details about a planned piece of work. This is a great way to get excellent feedback on a plan, but the effective audience for writing like this can be in the single digits. Sometimes the effective audience for an ADR is two: you, the writer, and one other engineer who has the requisite context to understand it.

Of course, to write clearly you must first think clearly. I haven’t written about this here because it’s an entirely separate topic (and one I have less concrete advice for). For some engineers, this is the main obstacle to condensing their point into a key sentence or two: they simply do not have a clear enough understanding to do that, and must instead express a series of related ideas and hope the reader gets the overall message. I did write about this almost a year ago in Thinking clearly about software. I stand by most of that, particularly the parts about sitting with your uncertainty and focusing on what you can say for sure, but I think there’s probably much more to be said on this topic.

The biggest mistake engineers make in their technical writing is setting their expectations too high.
They try to communicate in too much detail and end up failing to communicate anything at all: either because readers have checked out by the time they arrive at the key point, or because they’ve simply assumed too much background technical knowledge and the reader is hopelessly confused.

When you are doing technical writing, you are almost always communicating to people with less technical context than you. You may have a crystal-clear understanding of what you’re talking about, but your reader will not. They likely won’t even be as interested as you in the topic: if you’re writing a paragraph to try and ask some other team to do something, that something is your project that you (presumably) care about, but the other team has their own projects and will probably only skim what you wrote.

If you can say what you want to say in one sentence - even if it means leaving out some nuance - you should almost always do that. If you have to write a paragraph, make it as short a paragraph as you can. If you have to write a page, make sure the first paragraph contains as much key information as possible, because many readers won’t make it any further than that.

The good news is that even if you’re overestimating how much you can successfully convey, you’re likely underestimating how useful it is to convey even a tiny amount of technical content. In large organizations, many technical decisions are made by people with effectively zero technical context 3. Going from that to even a very rough “lay of the land” is a massive improvement.
1. In this sense, it’s similar to talking with LLMs. ↩
2. Of course settings (however implemented) don’t need to require a database migration. You can rearchitect the system to make almost any architectural impossibility possible. But “we’d need to redesign settings to do this” is pretty similar to “this is impossible” for many one-off low-priority asks. ↩
3. Even if the decision-makers have technical context on some of the system, they’ll likely still often be making decisions about other parts of the system that are black boxes to them. ↩

Sean Goedecke 3 weeks ago

Is it worrying that 95% of AI enterprise projects fail?

In July of this year, MIT NANDA released a report called The GenAI Divide: State of AI in Business 2025. The report spends most of its time giving advice about how to run enterprise AI projects, but the item that got everybody talking was its headline stat: 95% of organizations are getting zero return from their AI projects.

This is a very exciting statistic for those already disposed to be pessimistic about the impact of AI. The incredible amounts of money and time being spent on AI depend on language models being a transformative technology. Many people are expecting AI to eventually unlock hundreds of billions of dollars in value. The NANDA paper seems like very bad news for those people, if the last three years of AI investment really have failed to unlock even one dollar in value for most companies.

Cards on the table - I think AI is going to have an impact about on par with the internet, or railroads, but that we’re also definitely in a bubble. I wrote about this in What’s next after the AI bubble bursts? 1. I am not convinced that the NANDA report is bad news for AI.

The obvious question to ask about the report is “well, what’s the base rate?” Suppose that 95% of enterprise AI transformations fail. How does that compare to the failure rate of normal enterprise IT projects? This might seem like a silly question for those unfamiliar with enterprise AI projects - whatever the failure rate, surely it can’t be close to 95%!

Well. In 2016, Forbes interviewed the author of another study very much like the NANDA report, except about IT transformations in general, and found an 84% failure rate. McKinsey has only one in 200 IT projects coming in on time and within budget. The infamous 2015 CHAOS report found a 61% failure rate, going up to 98% for “large, complex projects”. Most enterprise IT projects are at least partial failures.

Of course, much of this turns on how we define success. Is a project a success if it delivers what it promised a year late?
What if it had to cut down on some of the promised features? Does it matter which features? The NANDA report defines it like this, which seems like a fairly strict definition to me: “We define successfully implemented for task-specific GenAI tools as ones users or executives have remarked as causing a marked and sustained productivity and/or P&L impact.” Compare the CHAOS report’s definition of success: “Success … means the project was resolved within a reasonable estimated time, stayed within budget, and delivered customer and user satisfaction regardless of the original scope.” I think these are close enough to be worth comparing, which means that according to the NANDA report, AI projects succeed at roughly the same rate as ordinary enterprise IT projects. Nobody says “oh, databases must be just hype” when a database project fails. In the interest of fairness, we should extend the same grace to AI. 84% and 95% are both high failure rates, but 95% is higher. Is that because AI offers less value than other technologies, or because AI projects are unusually hard? I want to give some reasons why we might think AI projects fall in the CHAOS report’s category of “large, complex projects”. Useful AI models have not been around for long. GPT-3.5 was released in 2022, but it was more of a toy than a tool. For my money, the first useful AI model was GPT-4, released in March 2023, and the first cheap, useful, and reliable AI model was GPT-4o in May 2024. That means that enterprise AI projects have been going for at most three years, if they were willing and able to start with GPT-3.5, and likely much closer to eighteen months. The average duration of an enterprise IT project is 2.4 years in the private sector and 3.9 years in the public sector. Enterprise AI adoption is still very young, by the standards of enterprise IT projects generally. Also, to state the obvious, AI is a brand-new technology.
Most failed enterprise IT projects tackle what are effectively “solved problems”: migrating information into a central database, tracking high-volume events, or aggregating various data sources into a single data warehouse for analysis 2 . Of course, any software engineer should know that solving a “solved problem” is not easy. The difficulties are in all the myriad details that have to be worked out. But enterprise AI projects are largely not “solved problems”. The industry is still working out the best way to build a chatbot. Should tools be given as definitions, or discovered via MCP? Should agents use sub-agents? What’s the best way to compact the context window? Should data be fetched via RAG, or via agentic keyword search? And so on. This is a much more fluid technical landscape than the one most enterprise IT projects face. Even by itself, that’s enough to push AI projects into the “complex” category. So far I’ve assumed that the “95% of enterprise AI projects fail” statistic is reliable. But is it? NANDA’s source for the 95% figure is a survey figure in section 3.2. The immediate problem here is that I don’t think this figure even shows that 95% of AI projects fail. As I read it, the leftmost section shows that 60% of the surveyed companies “investigated” building task-specific AI. 20% of the surveyed companies then built a pilot, and 5% built an implementation that had a sustained, notable impact on productivity or profits. So just on the face of it, that’s an 8.3% success rate, not a 5% success rate, because 40% of the surveyed companies didn’t even try. It’s also unclear if all the companies that investigated AI projects resolved to carry them out. If some of them decided not to pursue an AI project after the initial investigation, they’d also be counted in the failure rate, which doesn’t seem right at all. We also don’t know how good the raw data is.
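The funnel arithmetic is easy to sanity-check. Here’s a quick sketch (the percentages are the ones I read off the report’s figure; the variable names are mine):

```python
# Shares of surveyed companies at each stage of the NANDA funnel,
# as I read them off the report's figure in section 3.2.
investigated = 0.60  # investigated building task-specific AI
piloted = 0.20       # went on to build a pilot
succeeded = 0.05     # reached a sustained, notable impact

# The report's headline framing: failures over ALL surveyed companies.
headline_failure = 1 - succeeded

# Failure rate among companies that actually started down the path.
attempted_failure = 1 - succeeded / investigated

print(f"headline failure:  {headline_failure:.1%}")          # 95.0%
print(f"attempted failure: {attempted_failure:.1%}")         # 91.7%
print(f"attempted success: {succeeded / investigated:.1%}")  # 8.3%
```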
Read this quote, directly above the image: “These figures are directionally accurate based on individual interviews rather than official company reporting. Sample sizes vary by category, and success definitions may differ across organizations.” In Section 8.2, the report lays out its methodology: 52 interviews across “enterprise stakeholders”, 153 surveys of enterprise “leaders”, and an analysis of 300+ public AI projects. I take this quote to mean that the 95% figure is based on a subset of those 52 interviews. Maybe all 52 interviews gave really specific data! Or maybe only a handful of them did. Finally, the subject of the claim here is a bit narrower than “AI projects”. The 95% figure is specific to “embedded or task-specific GenAI”, as opposed to general-purpose LLM use (presumably something like using the enterprise version of GitHub Copilot or ChatGPT). In fairness, the report itself does emphasize that many employees are internally using AI via those tools, and at least believe that they’re getting a lot of value out of it. This one’s more a criticism of the people who’ve been tweeting that “95% of AI use at companies is worthless”, and so on. The NANDA report is not as scary as it looks. The main reason is that ~95% of hard enterprise IT projects fail no matter what, so AI projects failing at that rate is nothing special. AI projects are all going to be on the hard end, because the technology is so new and there’s very little industry agreement on best practices. It’s also not clear to me that the 95% figure is trustworthy. Even taking it on its own terms, it’s mathematically closer to 92%, which doesn’t inspire confidence in the rest of the NANDA team’s interpretation. We’re forced to take it on trust, since we can’t see the underlying data - in particular, how many of those 52 interviews went into that 95% figure. Here’s what I think it’s fair to conclude from the paper.
Like IT projects in general, almost all internal AI projects at large enterprises fail. That means that enterprises will reap the value of AI - whatever it turns out to be - in two ways: first, illicit use of personal AI tools like ChatGPT, which forms a familiar “shadow IT” in large enterprises; second, by using pre-built enterprise tooling like Copilot and the various AI labs’ enterprise products. It remains to be seen exactly how much value that is. In short: almost every hugely transformative technology went through its own bubble, as hype expectations outpaced the genuine value of the technology that was fuelling the market. I expect the AI bubble to burst, the infrastructure (e.g. datacenters full of GPUs) to stick around at cheaper prices, and AI to eventually become as fundamental a technology as the internet is today. By “solved problem” I mean that the technology involved is mature, well-understood, and available (e.g. you can just pick up Kafka for event management, etc).

Sean Goedecke 1 month ago

Mistakes I see engineers making in their code reviews

In the last two years, code review has gotten much more important. Code is now easy to generate using LLMs, but it’s still just as hard to review 1 . Many software engineers now spend as much time (or more) reviewing the output of their own AI tools as reviewing their colleagues’ code. I think a lot of engineers don’t do code review correctly. Of course, there are lots of different ways to do code review, so this is largely a statement of my engineering taste. The biggest mistake I see is doing a review that focuses solely on the diff 2 . Most of the highest-impact code review comments have very little to do with the diff at all, but instead come from your understanding of the rest of the system. For instance, one of the most straightforwardly useful comments is “you don’t have to add this method here, since it already exists in this other place”. The diff itself won’t help you produce a comment like this. You have to already be familiar with other parts of the codebase that the diff author doesn’t know about. Likewise, comments like “this code should probably live in this other file” are very helpful for maintaining the long-term quality of a codebase. The cardinal value when working in large codebases is consistency (I write about this more in Mistakes engineers make in large established codebases ). Of course, you cannot judge consistency from the diff alone. Reviewing the diff by itself is much easier than considering how it fits into the codebase as a whole. You can rapidly skim a diff and leave line comments (like “rename this variable” or “this function should flow differently”). Those comments might even be useful! But you’ll miss out on a lot of value by only leaving this kind of review. Probably my most controversial belief about code review is that a good code review shouldn’t contain more than five or six comments. Most engineers leave too many comments.
When you receive a review with a hundred comments, it’s very hard to engage with that review on anything other than a trivial level. Any really important comments get lost in the noise 2.5 . What do you do when there are twenty places in the diff that you’d like to see updated - for instance, twenty variables named in one style where you’d prefer another? Instead of leaving twenty comments, I’d suggest leaving a single comment explaining the stylistic change you’d like to make, and asking the engineer you’re reviewing to make the correct line-level changes themselves. There’s at least one exception to this rule. When you’re onboarding a new engineer to the team, it can be helpful to leave a flurry of stylistic comments to help them understand the specific dialect that your team uses in this codebase. But even in this case, you should bear in mind that any “real” comments you leave are likely to be buried by these other comments. You may still be better off leaving a general “we don’t do early returns in this codebase” comment than leaving a line comment on every single early return in the diff. One reason engineers leave too many comments is that they review code hunk by hunk, asking “how would I write this?” at each step and leaving a comment wherever the actual diff differs. This is a good way to end up with hundreds of comments on a pull request: an endless stream of “I would have done these two operations in a different order”, or “I would have factored this function slightly differently”, and so on. I’m not saying that these minor comments are always bad. Sometimes the order of operations really does matter, or functions really are factored badly. But one of my strongest opinions about software engineering is that there are multiple acceptable approaches to any software problem, and that which one you choose often comes down to taste. As a reviewer, when you come across cases where you would have done it differently, you must be able to approve those cases without comment, so long as either way is acceptable. Otherwise you’re putting your colleagues in an awkward position.
They can either accept all your comments to avoid conflict, adding needless time and setting you up as the de facto gatekeeper for all changes to the codebase, or they can push back and argue on each trivial point, which will take even more time. Code review is not the time for you to impose your personal taste on a colleague. So far I’ve only talked about review comments. But the “high-order bit” of a code review is not the content of the comments, but the status of the review: whether it’s an approval, just a set of comments, or a blocking review. The status of the review colors all the comments in the review. Comments in an approval read like “this is great, just some tweaks if you want”. Comments in a blocking review read like “here’s why I don’t want you to merge this in”. If you want to block, leave a blocking review. Many engineers seem to think it’s rude to leave a blocking review even if they see big problems, so they instead just leave comments describing the problems. Don’t do this. It creates a culture where nobody is sure whether it’s okay to merge their change or not. An approval should mean “I’m happy for you to merge, even if you ignore my comments”. Just leaving comments should mean “I’m happy for you to merge if someone else approves, even if you ignore my comments.” If you would be upset if a change were merged, you should leave a blocking review on it. That way the person writing the change knows for sure whether they can merge or not, and they don’t have to go and chase up everyone who’s left a comment to get their informal approval. I should start with a caveat: this depends a lot on what kind of codebase we’re talking about. For instance, I think it’s fine if PRs against something like SQLite get mostly blocking reviews. But a standard SaaS codebase, where teams are actively developing new features, ought to have mostly approvals. 
I go into a lot more detail about the distinction between these two types of codebase in Pure and Impure Engineering . If tons of PRs are being blocked, it’s usually a sign that there’s too much gatekeeping going on . One dynamic I’ve seen play out a lot is where one team owns a bottleneck for many other teams’ features - for instance, maybe they own the edge network configuration where new public-facing routes must be defined, or the database structure that new features will need to modify. That team is typically more reliability-focused than a typical feature team. Engineers on that team may have a different title, like SRE, or even belong to a different organization. Their incentives are thus misaligned with the feature teams they’re nominally supporting. Suppose the feature team wants to update the public-facing ingress routes in order to ship some important project. But the edge networking team doesn’t care about that project - it doesn’t affect their or their boss’s review cycles. What does affect their reviews is any production problem the change might cause. That means they’re motivated to block any potentially-risky change for as long as possible. This can be very frustrating for the feature team, who is willing to accept some amount of risk for the sake of delivering new features 3 . Of course, there are other reasons why many PRs might be getting blocking reviews. Maybe the company just hired a bunch of incompetent engineers, who ought to be prevented from merging their changes. Maybe the company has had a recent high-profile incident, and all risky changes should be blocked for a couple of weeks until their users forget about it. But in normal circumstances, a high rate of blocked reviews represents a structural problem . For many engineers - including me - it feels good to leave a blocking review, for the same reasons that it feels good to gatekeep in general. 
It feels like you’re single-handedly protecting the quality of the codebase, or averting some production incident. It’s also a way to indulge a common vice among engineers: flexing your own technical knowledge on some less-competent engineer. Oh, looks like you didn’t know that your code would have caused an N+1 query! Well, I knew about it. Aren’t you lucky I took the time to read through your code? This principle - that you should bias towards approving changes - is important enough that Google’s own guide to code review begins with it, calling it “the senior principle among all of the code review guidelines” 4 . I’m quite confident that many competent engineers will disagree with most or all of the points in this post. That’s fine! I also believe many obviously true things about code review, but I didn’t include them here. In my experience, it’s a good idea to stick to a short list of practices, which I recap at the end of this post. This all more or less applies to reviewing code from agentic LLM systems. They are particularly prone to missing code that they ought to be writing, they also get a bit lost if you feed them a hundred comments at once, and they have their own style. The one point that does not apply to LLMs is the “bias towards approving” point. You can and should gatekeep AI-generated PRs as much as you want. I do want to close by saying that there are many different ways to do code review. Here’s a non-exhaustive set of values that a code review practice might be trying to satisfy: making sure multiple people on the team are familiar with every part of the codebase, letting the team discuss the software design of each change, catching subtle bugs that a single person might not see, transmitting knowledge horizontally across the team, increasing perceived ownership of each change, enforcing code style and format rules across the codebase, and satisfying SOC2 “no one person can change the system alone” constraints.
I’ve listed these in the order I care about them, but engineers who would order these differently will have a very different approach to code review. Of course there are LLM-based reviewing tools. They’re even pretty useful! But at least right now they’re not as good as human reviewers, because they can’t bring to bear the amount of general context that a competent human engineer can. For readers who aren’t software engineers, “diff” here means the difference between the existing code and the proposed new code, showing what lines are deleted, added, or edited. This is a special instance of a general truth about communication: if you tell someone one thing, they’ll likely remember it; if you tell them twenty things, they will probably forget it all. In the end, these impasses are typically resolved by the feature team complaining to their director or VP, who complains to the edge networking team’s director or VP, who tells them to just unblock the damn change already. But this is a pretty crude way to resolve the incentive mismatch, and it only really works for features that are high-profile enough to receive air cover from a very senior manager. Google’s principle is much more explicit, stating that you should approve a change if it’s even a minor improvement, not when it’s perfect. 
But I take the underlying message here to be “I know it feels good, but don’t be a nitpicky gatekeeper - approve the damn PR!”

To recap, the review style I cautioned against is to:
- Look at a hunk of the diff
- Ask yourself “how would I write this, if I were writing this code?”
- Leave a comment on each difference between how you would write it and the actual diff

And what I recommend instead:
- Consider what code isn’t being written in the PR instead of just reviewing the diff
- Leave a small number of well-thought-out comments, instead of dashing off line comments as you go and ending up with a hundred of them
- Review with a “will this work” filter, not with a “is this exactly how I would have done it” filter
- If you don’t want the change to be merged, leave a blocking review
- Unless there are very serious problems, approve the change

Sean Goedecke 1 month ago

Should LLMs just treat text content as an image?

Several days ago, DeepSeek released a new OCR paper . OCR, or “optical character recognition”, is the process of converting an image of text - say, a scanned page of a book - into actual text content. Better OCR is obviously relevant to AI because it unlocks more text data to train language models on 1 . But there’s a more subtle reason why really good OCR might have deep implications for AI models. According to the DeepSeek paper, you can pull out 10 text tokens from a single image token with near-100% accuracy. In other words, a model’s internal representation of an image is ten times as efficient as its internal representation of text. Does this mean that models shouldn’t consume text at all? When I paste a few paragraphs into ChatGPT, would it be more efficient to convert that into an image of text before sending it to the model? Can we supply 10x or 20x more data to a model at inference time by supplying it as an image of text instead of text itself? This is called “optical compression”. It reminds me of a funny idea from June of this year to save money on OpenAI transcriptions: before uploading the audio, run it through ffmpeg to speed it up by 2x. The model is smart enough to still pull out the text, and with one simple trick you’ve cut your inference costs and time by half. Optical compression is the same kind of idea: before uploading a big block of text, take a screenshot of it (and optionally downscale the quality) and upload the screenshot instead. Some people are already sort-of doing this with existing multimodal LLMs. There’s a company selling this as a service , an open-source project, and even a benchmark . It seems to work okay! Bear in mind that this is not an intended use case for existing models, so it’s plausible that it could get a lot better if AI labs start actually focusing on it. The DeepSeek paper suggests an interesting way 2 to use tighter optical compression for long-form text contexts. 
As the context grows, you could decrease the resolution of the oldest images so they’re cheaper to store, but are also literally blurrier. The paper suggests an analogy between this and human memory, where fresh memories are quite vivid but older ones are vaguer and have less detail. Optical compression is pretty unintuitive to many software engineers. Why on earth would an image of text be expressible in fewer tokens than the text itself? In terms of raw information density, an image obviously contains more information than its equivalent text. You can test this for yourself by creating a text file, screenshotting the page, and comparing the size of the image with the size of the text file: the image is about 200x larger. Intuitively, the word “dog” only contains a single word’s worth of information, while an image of the word “dog” contains information about the font, the background and text color, kerning, margins, and so on. How, then, could it be possible that a single image token can contain ten tokens worth of text? The first explanation is that text tokens are discrete while image tokens are continuous. Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be any sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens. Another way of looking at the same intuition is that text tokens are a really inefficient way of expressing information. This is often obscured by the fact that text tokens are a reasonably efficient way of sharing information, so long as the sender and receiver both know the list of all possible tokens.
When you send a LLM a stream of tokens and it outputs the next one, you’re not passing around slices of a thousand numbers for each token - you’re passing a single integer that represents the token ID. But inside the model this is expanded into a much more inefficient representation (inefficient because it encodes some amount of information about the meaning and use of the token) 3 . So it’s not that surprising that you could do better than text tokens. Zooming out a bit, it’s plausible to me that processing text as images is closer to how the human brain works. To state the obvious, humans don’t consume text as textual content; we consume it as image content (or sometimes as audio). Maybe treating text as a sub-category of image content could unlock ways of processing text that are unavailable when you’re just consuming text content. As a toy example, emoji like :) are easily understandable as image content but require you to “already know the trick” as text content 4 . Of course, AI research is full of ideas that sound promising but just don’t work that well. It sounds like you should be able to do this trick on current multimodal LLMs - particularly since many people just use them for OCR purposes anyway - but it hasn’t worked well enough to become common practice. Could you train a new large language model on text represented as image content? It might be tricky. Training on text tokens is easy - you can simply take a string of text and ask the model to predict the next token. How do you train on an image of text? You could break up the image into word chunks and ask the model to generate an image of the next word. But that seems to me like it’d be really slow, and tricky to check if the model was correct or not (e.g. how do you quickly break a file into per-word chunks, how do you match the next word in the image, etc). Alternatively, you could ask the model to output the next word as a token.
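To put rough numbers on the discrete-token point: a token ID drawn from a ~50,000-entry vocabulary carries at most log2(50,000) bits of identity, so ten text tokens amount to only a couple of hundred bits - a tiny budget next to what a continuous embedding vector could in principle encode. A back-of-envelope sketch:

```python
import math

vocab_size = 50_000  # rough vocabulary size from the example above

# A text token, sent as an ID, carries at most log2(V) bits of identity.
bits_per_text_token = math.log2(vocab_size)

print(f"{bits_per_text_token:.1f} bits per text token")       # ~15.6
print(f"{10 * bits_per_text_token:.0f} bits for ten tokens")  # ~156
```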
But then you probably have to train the model on enough tokens so it knows how to manipulate text tokens. At some point you’re just training a normal LLM with no special “text as image” superpowers. AI labs are desperate for high-quality text, but only around 30% of written books have been digitized. It’s really hard to find recent data on this, but as a very rough estimate Google Books had ~40M books in 2023, but Google estimates there to have been ~130M books in 2010. That comes out to 30%. See Figure 13. Not to skip too far ahead, but this is one reason to think that representing a block of text tokens in a single image might not be such a great idea. Of course current LLMs can interpret these emojis. Less-toy examples: image-based LLMs might have a better feel for paragraph breaks and headings, might be better able to take a big picture view of a single page of text, and might find it easier to “skip through” large documents by skimming the start of each paragraph. Or they might not! We won’t know until somebody tries.

Sean Goedecke 1 month ago

We are in the "gentleman scientist" era of AI research

Many scientific discoveries used to be made by amateurs. William Herschel , who discovered Uranus, was a composer and an organist. Antoine Lavoisier , who laid the foundation for modern chemistry, was a politician. In one sense, this is a truism. The job of “professional scientist” only really appeared in the 19th century, so all discoveries before then logically had to have come from amateurs, since only amateur scientists existed. But it also reflects that any field of knowledge gets more complicated over time. In the early days of a scientific field, discoveries are simple: “air has weight”, “white light can be dispersed through a prism into different colors”, “the mass of a burnt object is identical to its original mass”, and so on. The way you come up with those discoveries is also simple: observing mercury in a tall glass tube, holding a prism up to a light source, weighing a sealed jar before and after incinerating it, and so on. The 2025 Nobel prize in physics was just awarded “for the discovery of macroscopic quantum mechanical tunnelling and energy quantisation in an electric circuit”. The press release gallantly tries to make this discovery understandable to the layman, but it’s clearly much more complicated than the examples I listed above. Even understanding the terms involved would take years of serious study. If you want to win the 2026 Nobel prize in physics, you have to be a physicist: not a musician who dabbles in physics, or a politician who has a physics hobby in your spare time. You have to be fully immersed in the world of physics 1 . AI research is not like this. We are very much in the “early days of science” category. At this point, a critical reader might have two questions. How can I say that when many AI papers look like this? 2 Alternatively, how can I say that when the field of AI research has been around for decades, and is actively pursued by many serious professional scientists?
First, because AI research discoveries are often simpler than they look. This dynamic is familiar to any software engineer who’s sat down and tried to read a paper or two: the fearsome-looking mathematics often contains an idea that would be trivial to express in five lines of code. It’s written this way because (a) researchers are more comfortable with mathematics, and so genuinely don’t find it intimidating, and (b) mathematics is the lingua franca of academic research, because researchers like to write to far-future readers for whom Python syntax may be as unfamiliar as COBOL is to us. Take group-relative policy optimization, or GRPO, introduced in a 2024 DeepSeek paper. This has been hugely influential for reinforcement learning (which in turn has been the driver behind much LLM capability improvement in the last year). Let me try and explain the general idea. When you’re training a model with reinforcement learning, you might naively reward success and punish failure (e.g. how close the model gets to the right answer in a math problem). The problem is that this signal breaks down on hard problems. You don’t know if the model is “doing well” without knowing how hard the math problem is, which is itself a difficult qualitative assessment. The previous state-of-the-art was to train a “critic model” that makes this “is the model doing well” assessment for you. Of course, this brings a whole new set of problems: the critic model is hard to train and verify, costs much more compute to run inside the training loop, and so on. Enter GRPO. Instead of a critic model, you gauge how well the model is doing by letting it try the problem multiple times and computing how well it does on average. Then you reinforce the model attempts that were above average and punish the ones that were below average. This gives you good signal even on very hard prompts, and is much faster than using a critic model.
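The core computation is short enough to sketch in a few lines of Python. This is a toy illustration of the group-relative idea, not DeepSeek’s actual training code, and the reward values are invented:

```python
def group_relative_advantages(rewards):
    """Score each attempt at a prompt against the group's own average:
    above-average attempts get a positive advantage (reinforced),
    below-average attempts get a negative one (punished)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # every attempt scored the same: no signal
    return [(r - mean) / std for r in rewards]

# Five attempts at one hard prompt; only the first and fourth succeeded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0])
# The two correct attempts come out positive and the rest negative,
# without needing a critic model to judge how hard the prompt was.
```

Note that if every attempt fails (or every attempt succeeds), the advantages are all zero, so uniformly-impossible prompts simply contribute no gradient rather than a misleading one.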
The mathematics in the paper looks pretty fearsome, but the idea itself is surprisingly simple. You don’t need to be a professional AI researcher to have had it. In fact, GRPO is not necessarily that new of an idea. There is discussion of normalizing the “baseline” for RL as early as 1992 (section 8.3), and the idea of using the model’s own outputs to set that baseline was successfully demonstrated in 2016 . So what was really discovered in 2024? I don’t think it was just the idea of “averaging model outputs to determine a RL baseline”. I think it was that that idea works great on LLMs as well . As far as I can tell, this is a consistent pattern in AI research. Many of the big ideas are not brand new or even particularly complicated. They’re usually older ideas or simple tricks, applied to large language models for the first time. Why would that be the case? If deep learning wasn’t a good subject for the amateur scientist ten years ago, why would the advent of LLMs change that? Suppose someone discovered that a rubber-band-powered car - like the ones at science fair competitions - could output as much power as a real combustion engine, so long as you soaked the rubber bands in maple syrup beforehand. This would unsurprisingly produce a revolution in automotive (and many other) engineering fields. But I think it would also “reset” scientific progress back to something like the “gentleman scientist” days, where you could productively do it as a hobby. Of course, there’d be no shortage of real scientists doing real experiments on the new phenomenon. However, there’d also be about a million easy questions to answer. Does it work with all kinds of maple syrup? What if you soak it for longer? What if you mixed in some maple-syrup-like substances? You wouldn’t have to be a real scientist in a real lab to try your hand at some of those questions. 
After a decade or so, I’d expect those easy questions to have been answered, and for rubber-band engine research to look more like traditional science. But that still leaves a long window for the hobbyist or dilettante scientist to ply their trade.

The success of LLMs is like the rubber-band engine. A simple idea that anyone can try 3 - train a large transformer model on a ton of human-written text - produces a surprising and transformative technology. As a consequence, many easy questions have become interesting and accessible subjects of scientific inquiry, alongside the normal hard and complex questions that professional researchers typically tackle.

I was inspired to write this by two recent pieces of research: Anthropic’s “skills” product and the Recursive Language Models paper. Both of these present new and useful ideas, but they’re also so simple as to be almost a joke. “Skills” are just markdown files and scripts on-disk that explain to the agent how to perform a task. Recursive language models are just agents with direct code access to the entire prompt via a Python REPL. There, now you can go and implement your own skills or RLM inference code.

I don’t want to undersell these ideas. It is a genuinely useful piece of research for Anthropic to say “hey, you don’t really need actual tools if the LLM has shell access, because it can just call whatever scripts you’ve defined for it on disk”. Giving the LLM direct access to its entire prompt via code is also (as far as I can tell) a novel idea, and one with a lot of potential.

We need more research like this! Strong LLMs are so new, and are changing so fast, that their capabilities are genuinely unknown 4. For instance, at the start of this year, it was unclear whether LLMs could be “real agents” (i.e. whether running with tools in a loop would be useful for more than just toy applications). Now, with Codex and Claude Code, I think it’s pretty clear that they can.
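To make the recursive-language-model idea concrete, here is a toy reconstruction (mine, not the paper’s code): the full prompt is bound to a variable in a Python namespace, and the agent queries it with code instead of attending over it. A real RLM would have an LLM writing these snippets; here they are hardcoded.

```python
# Hypothetical setup: a "prompt" far too big for any context window.
PROMPT = "\n".join(f"[doc {i}] lorem ipsum about topic {i}" for i in range(10_000))

def run_snippet(code: str) -> object:
    """Execute one agent-written expression with PROMPT in scope."""
    return eval(code, {"PROMPT": PROMPT})

# The "agent" peeks at the prompt with code instead of reading all of it:
size = run_snippet("len(PROMPT)")
head = run_snippet("PROMPT[:40]")
hits = run_snippet("[l for l in PROMPT.splitlines() if l.startswith('[doc 42]')]")
```

The point of the trick is that the model only ever sees the small results of these queries (a length, a slice, a handful of matching lines), not the whole prompt.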
Many of the things we learn about AI capabilities - like o3’s ability to geolocate photos - come from informal user experimentation. In other words, they come from the AI research equivalent of 17th century “gentleman science”.

Incidentally, my own field - analytic philosophy - is very much the same way. Two hundred years ago, you could publish a paper with your thoughts on “what makes a good act good”. Today, in order to publish on the same topic, you have to deeply engage with those two hundred years of scholarship, putting the conversation out of reach of all but professional philosophers. It is unclear to me whether that is a good thing or not. ↩

Randomly chosen from recent AI papers on arXiv. I’m sure you could find a more aggressively-technical paper with a bit more effort, but it suffices for my point. ↩

Okay, not anyone can train a 400B param model. But if you’re willing to spend a few hundred dollars - far less than Lavoisier spent on his research - you can train a pretty capable language model on your own. ↩

In particular, I’d love to see more informal research on making LLMs better at coming up with new ideas. Gwern wrote about this in LLM Daydreaming, and I tried my hand at it in Why can’t language models come up with new ideas? ↩

Sean Goedecke 1 month ago

How I provide technical clarity to non-technical leaders

My mission as a staff engineer is to provide technical clarity to the organization. Of course, I do other stuff too. I run projects, I ship code, I review PRs, and so on. But the most important thing I do - what I’m for - is to provide technical clarity.

In an organization, technical clarity is when non-technical decision makers have a good-enough practical understanding of what changes they can make to their software systems. The people in charge of your software organization 1 have to make a lot of decisions about software. Even if they’re not setting the overall strategy, they’re still probably deciding which kinds of users get which features, which updates are most important to roll out, whether projects should be delayed or rushed, and so on. These people may have been technical once. They may even have fine technical minds now. But they’re still “non-technical” in the sense I mean, because they simply don’t have the time or the context to build an accurate mental model of the system. Instead, they rely on a vague mental model, supplemented by advice from engineers they trust. To the extent that their vague mental model is accurate and the advice they get is good - in other words, to the extent that they have technical clarity - they’ll make sensible decisions.

The stakes are therefore very high. Technical clarity in an organization can be the difference between a functional engineering group and a completely dysfunctional one. The default quantity of technical clarity in an organization is very low. In other words, decision-makers at tech companies are often hopelessly confused about the technology in question. This is not a statement about their competence. Software is really complicated, and even the engineers on the relevant team spend much of their time hopelessly confused about the systems they own. In my experience, this is surprising to non-engineers. But it’s true!
For large established codebases, it’s completely normal for very senior engineers to be unable to definitively answer even very basic questions about how their own system works, like “can a user of type X do operation Y”, or “if we perform operation Z, what will it look like for users of type W?” Engineers often 2 answer these questions with “I’ll have to go and check”.

Suppose a VP at a tech company wants to offer an existing paid feature to a subset of free-tier users. Of course, most of the technical questions involved in this project are irrelevant to the VP. But there is a set of technical questions that they will need to know the answers to:

- Can the paid feature be safely delivered to free users in its current state?
- Can the feature be rolled out gradually?
- If something goes wrong, can the feature be reverted without breaking user accounts?
- Can a subset of users be granted early access for testing (and other) purposes?
- Can paid users be prioritized in case of capacity problems?

Finding out the answer to these questions is a complex technical process. It takes a deep understanding of the entire system, and usually requires you to also carefully re-read the relevant code. You can’t simply try the change out in a developer environment or on a test account, because you’re likely to miss edge cases. Maybe it works for your test account, but it doesn’t work for users who are part of an “organization”, or who are on a trial plan, and so on. Sometimes these questions can only be answered by actually performing the task.

I wrote about why this happens in Wicked features: as software systems grow, they build marginal-but-profitable features that interact with each other in surprising ways, until the system becomes almost - but not quite - impossible to understand. Good software design can tame this complexity, but never eliminate it. Experienced software engineers are thus always suspicious that they’re missing some interaction that will turn into a problem in production.

For a VP or product leader, it’s an enormous relief to work with an engineer who can be relied on to help them navigate the complexities of the software system. In my experience, this “technical advisor” role is usually filled by staff engineers, or by senior engineers who are rapidly on the path to a staff role.
Senior engineers who are good at providing technical clarity sometimes get promoted to staff without even trying, in order to make them a more useful tool for the non-technical leaders who they’re used to helping.

Of course, you can be an impactful engineer without doing the work of providing technical clarity to the organization. Many engineers - even staff engineers - deliver most of their value by shipping projects, identifying tricky bugs, doing good systems design, and so on. But those engineers will rarely be as valued as the ones providing technical clarity. That’s partly because senior leadership at the company will remember who was helping them, and partly because technical clarity is just much higher-leverage than almost any single project. Non-technical leaders need to make decisions, whether they’re clear or not. They are thus highly motivated to maintain a mental list of the engineers who can help them make those decisions, and to position those engineers in the most important teams and projects.

From the perspective of non-technical leaders, those engineers are an abstraction around technical complexity. In the same way that engineers use garbage-collected languages so they don’t have to care about memory management, VPs use engineers so they don’t have to care about the details of software. But what does it feel like inside the abstraction? Internally, engineers do have to worry about all the awkward technical details, even if their non-technical leaders don’t have to. If I say “no problem, we’ll be able to roll back safely”, I’m not as confident as I appear. When I’m giving my opinion on a technical topic, I top out at 95% confidence - there’s always a 5% chance that I missed something important - and am usually lower than that. I’m always at least a little bit worried. Why am I worried if I’m 95% sure I’m right? Because I’m worrying about the things I don’t know to look for.
When I’ve been spectacularly wrong in my career, it’s usually not about risks that I anticipated. Instead, it’s about the “unknown unknowns”: risks that I didn’t even contemplate, because my understanding of the overall system was missing a piece. That’s why I say that shipping a project takes your full attention. When I lead technical projects, I spend a lot of time sitting and wondering about what I haven’t thought of yet.

In other words, even when I’m quite confident in my understanding of the system, I still have a background level of internal paranoia. To provide technical clarity to the organization, I have to keep that paranoia to myself. There’s a careful balance to be struck between verbalizing all my worries - more on that later - and being so overconfident that I fail to surface risks that I ought to have mentioned.

Like good engineers, good VPs understand that all abstractions are sometimes leaky. They don’t blame their engineers for the occasional technical mistake, so long as those engineers are doing their duty as a useful abstraction the rest of the time 3. What they won’t tolerate in a technical advisor is the lack of a clear opinion at all. An engineer who answers most questions with “well, I can’t be sure, it’s really hard to say” is useless as an advisor. They may still be able to write code and deliver projects, but they will not increase the amount of technical clarity in the organization.

When I’ve written about communicating confidently in the past, some readers think I’m advising engineers to act unethically. They think that careful, technically-sound engineers should communicate the exact truth, in all its detail, and that appearing more confident than you are is a con man’s trick: of course if you pretend to be certain, leadership will think you’re a better engineer than the engineer who honestly says they’re not sure.
Once one engineer starts keeping their worries to themselves, other engineers have to follow or be sidelined, and pretty soon all the fast-talking blowhards are in positions of influence while the honest engineers are relegated to just working on projects. In other words, when I say “no problem, we’ll be able to roll back”, even though I might have missed something, isn’t that just lying? Shouldn’t I just communicate my level of confidence accurately? For instance, could I instead say “I think we’ll be able to roll back safely, though I can’t be sure, since my understanding of the system isn’t perfect - there could be all kinds of potential bugs”?

I don’t think so. Saying that engineers should strive for maximum technical accuracy betrays a misunderstanding of what clarity is. At the top of this article, I said that clarity is when non-technical decision makers have a good enough working understanding of the system. That necessarily means a simplified understanding. When engineers are communicating to non-technical leadership, they must therefore simplify their communication (in other words, allow some degree of inaccuracy in the service of being understood).

Most of my worries are not relevant information to non-technical decision makers. When I’m asked “can we deliver this today”, or “is it safe to roll this feature out”, the person asking is looking for a “yes” or “no”. If I also give them a stream of vague technical caveats, they will have to consciously filter that out in order to figure out if I mean “yes” or “no”. Why would they care about any of the details? They know that I’m better positioned to evaluate the technical risk than them - that’s why they’re asking me in the first place!

I want to be really clear that I’m not advising engineers to always say “yes” even to bad or unacceptably risky decisions.
Sometimes you need to say “we won’t be able to roll back safely, so we’d better be sure about the change”, or “no, we can’t ship the feature to this class of users yet”. My point is that when you’re talking to the company’s decision-makers, you should commit to a recommendation one way or the other, and only give caveats when the potential risk is extreme or the chances are genuinely high. At the end of the day, a VP only has so many mental bits to spare on understanding the technical details. If you’re a senior engineer communicating with a VP, you should make sure you fill those bits with the most important pieces: what’s possible, what’s impossible, and what’s risky. Don’t make them parse those pieces out of a long stream of irrelevant (to them) technical information.

The highest-leverage work I do is to provide technical clarity to the organization: communicating up to non-technical decision makers to give them context about the software system. This is hard for two reasons. First, even competent engineers find it difficult to answer simple questions definitively about large codebases. Second, non-technical decision makers cannot absorb the same level of technical nuance as a competent engineer, so communicating to them requires simplification.

Effectively simplifying complex technical topics requires three things:

- Good taste - knowing which risks or context to mention and which to omit 4.
- A deep technical understanding of the system. In order to communicate effectively, I need to also be shipping code and delivering projects. If I lose direct contact with the codebase, I will eventually lose my ability to communicate about it (as the codebase changes and my memory of the concrete details fades).
- The confidence to present a simplified picture to upper management. Many engineers either feel that it’s dishonest, or lack the courage to commit to claims where they’re only 80% or 90% confident. In my view, these engineers are abdicating their responsibility to help the organization make good technical decisions. I write about this a lot more in Engineers who won’t commit.

In a large tech company, this is usually a director or VP. However, depending on the scope we’re talking about, this could even be a manager or product manager - the same principles apply. ↩

Sometimes you know the answer off the top of your head, but usually that’s when you’ve been recently working on the relevant part of the codebase (and even then you may want to go and make sure you’re right). ↩

You do still have to be right a lot. I wrote about this in Good engineers are right, a lot. ↩

Despite this being very important, I don’t have a lot to say about it. You just have to feel it out based on your relationship with the decision-maker in question. ↩

Sean Goedecke 1 month ago

GPT-5-Codex is a better AI researcher than me

In What’s the strongest AI model you can train on a laptop in five minutes? I tried my hand at answering a silly AI-research question. You can probably guess what it was. I chatted with GPT-5 to help me get started with the Python scripts and to bounce ideas off, but it was still me doing the research. I was coming up with the ideas, running the experiments, and deciding what to do next based on the data. The best model I could train was a 1.8M param transformer which produced output like this:

Once upon a time, there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time.

Since then, OpenAI has released GPT-5-Codex, and supposedly uses it (plus Codex, their CLI coding tool) to automate a lot of their product development and AI research. I wanted to try the same thing. Codex-plus-me did a much better job than me alone 1. Here’s an example of the best output I got from the model I trained with Codex:

Once upon a time, in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it.

What was the process like to get there? I want to call it “vibe research”. Like “vibe coding”, it’s performing a difficult technical task by relying on the model.
I have a broad intuitive sense of what approaches are being tried, but I definitely don’t have a deep enough understanding to do this research unassisted. A real AI researcher would get a lot more out of the tool. Still, it was very easy to get started. I gave Codex the path to my scratch directory, told it “continue the research”, and it immediately began coming up with ideas and running experiments on its own. In a way, the “train in five minutes” challenge is a perfect fit, because the feedback loop is so short.

The basic loop of doing AI research with Codex (at least as an enthusiastic amateur) looks something like this:

1. Codex makes a change to the training script and does three or four runs (this takes ~20 minutes overall)
2. Based on the results, Codex suggests two or three things that you could try next
3. I pick one of them (or very occasionally suggest my own idea) and return to (1)

After two days I did paste the current research notes into GPT-5-Pro, which helped a bit, but the vast majority of my time was spent in this loop. As we’ll see, the best ideas were ones Codex already came up with.

I chewed through a lot of tokens doing this. That’s OK with me, since I paid for the $200-per-month plan 2, but if you don’t want to do that you’ll have to space out your research a bit more slowly. I restarted my Codex process every million tokens or so. It didn’t have any issue continuing where it left off from its previous notes, which was nice.

I ran Codex with . By default it didn’t have access to MPS, which meant it could only train models on the CPU. There’s probably some more principled way of sandboxing it, but I didn’t bother to figure it out. I didn’t run into any runaway-agent problems, unless you count crashing my laptop a few times by using up too much memory.

Here’s a brief summary of how the research went over the four or five days I spent poking at it. I stayed with the TinyStories dataset for all of this, partially because I think it’s the best choice and partially because I wanted a 1:1 comparison between Codex and my own efforts.
Codex and I started with a series of n-gram models: instead of training a neural network, n-gram models just store the conditional probabilities of a token based on the n tokens that precede it. These models are very quick to produce (seconds, not minutes) but aren’t very good. The main reason is that even a 5-gram model cannot include context from more than five tokens ago, so they struggle to produce coherent text across an entire sentence. Here’s an example:

Once upon a time , in a small school . ” they are friends . they saw a big pond . he pulled and pulled , but the table was still no attention to grow even more . she quickly ran to the house . she says , ” sara said . ” you made him ! ” the smooth more it said , for helping me decorate the cake .

It’s not terrible! There are short segments that are entirely coherent. But it’s kind of like what AI skeptics think LLMs are like: just fragments of the original source, remixed without any unifying through-line. The perplexity is 18.5, worse than basically any of the transformers I trained in my last attempt. Codex trained 19 different n-gram models, of which the above example (a 4-gram model) was the best 3. In my view, this is one of the strengths of LLM-based AI research: it is trivial to tell the model “go and sweep a bunch of different values for the hyperparameters”. Of course, you can do this yourself. But it’s a lot easier to just tell the model to do it.

After this, Codex spent a lot of time working on transformers. It trained ~50 normal transformers with different sizes, number of heads, layers, and so on. Most of this wasn’t particularly fruitful. I was surprised that my hand-picked hyperparameters from my previous attempt were quite competitive - though maybe it shouldn’t have been a shock, since they matched the lower end of the Chinchilla scaling laws.
Still, eventually Codex hit on an 8.53 perplexity model (3 layers, 4 heads, and a dimension of 144), which was a strict improvement over my last attempt. I’m not really convinced this was an architectural improvement. One lesson from training fifty different models is that there’s quite a lot of variance between different seeds. A perplexity improvement of just over 1 is more or less what I was seeing on a “lucky seed”. This was an interesting approach for the challenge: going for pure volume and hoping for a lucky training run. You can’t do this with a larger model, since it takes so long to train 4, but the five-minute limit makes it possible.

The next thing Codex tried - based on some feedback I pasted in from GPT-5-Pro - was “shallow fusion”: instead of training a new model, updating the generation logic to blend the transformer-predicted tokens with an n-gram model, a “kNN head” (which looks up hidden states that are “nearby” the current hidden state of the transformer and predicts their tokens), and a “cache head” that makes the model more likely to repeat words that are already in the context. This immediately dropped perplexity down to 7.38: a whole other point lower than our best transformer. I was excited about that, but the generated content was really bad:

Once upon a time,, in a small house, there lived a boy named Tim. Tim loved to play outside with his ball. One Mr. Skip had a lot of fun. He ran everywhere every day. One One day, Tim was playing with his ball new ball near his house. Tim was playing with his his ball and had a lot of fun. But then, he saw a big tree and decided to climb it. Tim tried to climb the tree, but he was too big. He was too small to reach the top of the tree. But the tree was too high. The little tree was too high for him. Soon, Tim was near the tree. He was brave and climbed the tree. But when he got got to the top, he was sad. Tim saw a bird on

What happened? I over-optimized for perplexity.
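The fusion step itself is only a few lines. Here is a toy sketch of the idea (my reconstruction, with made-up distributions and weights - the real version also interpolates a kNN head and tunes the interpolation):

```python
# Shallow fusion at generation time: mix the transformer's next-token
# distribution with an n-gram model's, plus a "cache" bonus for words
# already in the context.
def fuse(p_transformer, p_ngram, context_counts, w_t=0.7, w_n=0.2, w_c=0.1):
    total_ctx = sum(context_counts.values()) or 1
    fused = {
        tok: w_t * p
        + w_n * p_ngram.get(tok, 0.0)
        + w_c * context_counts.get(tok, 0) / total_ctx
        for tok, p in p_transformer.items()
    }
    z = sum(fused.values())  # renormalize into a proper distribution
    return {tok: v / z for tok, v in fused.items()}

p_t = {"ball": 0.5, "tree": 0.3, "house": 0.2}  # transformer head (made up)
p_n = {"ball": 0.1, "tree": 0.6, "house": 0.3}  # n-gram model (made up)
ctx = {"ball": 3}                               # "ball" already appeared 3 times
probs = fuse(p_t, p_n, ctx)
```

Note how the cache term boosts tokens the context has already used - one plausible reason a blend like this can lower perplexity while making the generated text more repetitive.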
As it turns out, the pure transformers that were higher-perplexity were better at writing stories. They had more coherence over the entire length of the story, they avoided generating weird repetition artifacts (like ”,,”), and they weren’t as mindlessly repetitive. I went down a bit of a rabbithole trying to think of how to score my models without just relying on perplexity. I came up with some candidate rubrics, like grammatical coherence, patterns of repetition, and so on, before giving up and just using LLM-as-a-judge. To my shame, I even generated a new API key for the LLM before realizing that I was talking to a strong LLM already via Codex, and I could just ask Codex to rate the model outputs directly.

The final and most successful idea I tried was distilling a transformer from an n-gram teacher model. First, we train an n-gram model, which only takes ~10 seconds. Then we train a transformer - but for the first 200 training steps, we push the transformer towards predicting the tokens that the n-gram model would predict. After that, the transformer continues to train on the TinyStories data as usual. Here’s an example of some output:

Once upon a time, in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it.

I think this is pretty good! It has characters that continue throughout the story. It has a throughline - Ben’s lost toy - though it confuses “toy” and “friend” a bit.
It’s a coherent story, with a setup, problem, solution and moral. This is much better than anything else I’ve been able to train in five minutes. Why is it better? I think the right intuition here is that transformers need to spend a lot of initial compute (say, two minutes) learning how to construct grammatically-correct English sentences. If you begin the training by spending ten seconds training an n-gram model that can already produce sort-of-correct grammar, you can speedrun your way to learning grammar and spend an extra one minute and fifty seconds learning content.

I really like this approach. It’s exactly what I was looking for from the start: a cool architectural trick that genuinely helps, but only really makes sense for this weird challenge 5. I don’t have any illusions about this making me a real AI researcher, any more than a “vibe coder” is a software engineer. Still, I’m surprised that it actually worked. And it was a lot of fun! I’ve pushed up the code here if you want to pick up from where I left off, but you may be better off just starting from scratch with Codex or your preferred coding agent.

edit: this post got some comments on Hacker News. The tone is much more negative than on my previous attempt, which is interesting - maybe the title gave people the mistaken impression that I think I’m a strong AI researcher!

“Alone” here is relative - I did use ChatGPT and a bit of Copilot to generate some of the training code in my last attempt. I just didn’t use any agentic tooling. ↩

My deal with myself was that if I ever have a month where I use fewer than 2M tokens, I’ll cancel the plan. ↩

There are a lot of clever tricks involved here: Kneser-Ney smoothing, interpolating unigram/bigram/trigram probabilities on a specific schedule, deliberately keeping the sentinel token, etc. I didn’t spend the time understanding all of these things deeply - that’s vibe research for you - so I won’t write too much about it. ↩

Unless you’re a big AI lab. I am 100% convinced that the large labs are spending a lot of compute just re-training on different seeds in the hope of getting a lucky run. ↩

I was suspicious that I just got a lucky seed, but I compared ~40 generations with and without the distillation and the distilled model really was better at producing correct-looking stories. ↩

Sean Goedecke 1 month ago

How I influence tech company politics as a staff software engineer

Many software engineers are fatalistic about company politics. They believe that it’s pointless to get involved 1. The general idea here is that software engineers are simply not equipped to play the game at the same level as real political operators. This is true! It would be a terrible mistake for a software engineer to think that you ought to start scheming and plotting like you’re in Game of Thrones. Your schemes will be immediately uncovered and repurposed to your disadvantage and other people’s gain. Scheming takes practice and power, and neither of those things are available to software engineers. It is simply a fact that software engineers are tools in the political game being played at large companies, not players in their own right.

However, there are many ways to get involved in politics without scheming. The easiest way is to actively work to make a high-profile project successful. This is more or less what you ought to be doing anyway, just as part of your ordinary job. If your company is heavily investing in some new project - these days, likely an AI project - using your engineering skill to make it successful 2 is a politically advantageous move for whatever VP or executive is spearheading that project. In return, you’ll get the rewards that executives can give at tech companies: bonuses, help with promotions, and positions on future high-profile projects. I wrote about this almost a year ago in Ratchet effects determine engineer reputation at large companies.

A slightly harder way (but one that gives you more control) is to make your pet idea available for an existing political campaign. Suppose you’ve wanted for a while to pull out some existing functionality into its own service. There are two ways to make that happen. The hard way is to expend your own political capital: drum up support, let your manager know how important it is to you, and slowly wear doubters down until you can get the project formally approved.
The easy way is to allow some executive to spend their (much greater) political capital on your project . You wait until there’s a company-wide mandate for some goal that aligns with your project (say, a push for reliability, which often happens in the wake of a high-profile incident). Then you suggest to your manager that your project might be a good fit for this. If you’ve gauged it correctly, your org will get behind your project. Not only that, but it’ll increase your political capital instead of you having to spend it. Organizational interest comes in waves. When it’s reliability time, VPs are desperate to be doing something . They want to come up with plausible-sounding reliability projects that they can fund, because they need to go to their bosses and point at what they’re doing for reliability, but they don’t have the skillset to do it on their own. They’re typically happy to fund anything that the engineering team suggests. On the other hand, when the organization’s attention is focused somewhere else - say, on a big new product ship - the last thing they want is for engineers to spend their time on an internal reliability-focused refactor that’s invisible to customers. So if you want to get something technical done in a tech company, you ought to wait for the appropriate wave . It’s a good idea to prepare multiple technical programs of work, all along different lines. Strong engineers will do some of this kind of thing as an automatic process, simply by noticing things in the normal line of work. For instance, you might have rough plans: When executives are concerned about billing, you can offer the billing refactor as a reliability improvement. When they’re concerned about developer experience, you can suggest replacing the build pipeline. When customers are complaining about performance, you can point to the Golang rewrite as a good option. 
When the CEO checks the state of the public documentation and is embarrassed, you can make the case for rebuilding it as a static site. The important thing is to have a detailed, effective program of work ready to go for whatever the flavor of the month is. Some program of work will be funded whether you do this or not. However, if you don’t do this, you have no control over what that program is. In my experience, this is where companies make their worst technical decisions : when the political need to do something collides with a lack of any good ideas. When there are no good ideas, a bad idea will do, in a pinch. But nobody prefers this outcome. It’s bad for the executives, who then have to sell a disappointing technical outcome as if it were a success 4 , and it’s bad for the engineers, who have to spend their time and effort building the wrong idea. If you’re a very senior engineer, the VPs (or whoever) will quietly blame you for this. They’ll be right to! Having the right idea handy at the right time is your responsibility. You can view all this in two different ways. Cynically, you can read this as a suggestion to make yourself a convenient tool for the sociopaths who run your company to use in their endless internecine power struggles. Optimistically, you can read this as a suggestion to let executives set the overall priorities for the company - that’s their job, after all - and to tailor your own technical plans to fit 3 . Either way, you’ll achieve more of your technical goals if you push the right plan at the right time. edit: this post got some attention on Hacker News . The comments were much more positive than on my other posts about politics, for reasons I don’t quite understand. This comment is an excellent statement of what I write about here (but targeted at more junior engineers). 
This comment (echoed here ) references a Milton Friedman quote that applies the idea in this post to political policy in general, which I’d never thought of but sounds correct: Only a crisis—actual or perceived—produces real change. When that crisis occurs, the actions that are taken depend on the ideas that are lying around. That, I believe, is our basic function: to develop alternatives to existing policies, to keep them alive and available until the politically impossible becomes politically inevitable. There’s a few comments calling this approach overly game-playing and self-serving. I think this depends on the goal you’re aiming at. The ones I referenced above seem pretty beneficial to me! Finally, this comment is a good summary of what I was trying to say: Instead of waiting to be told what to do and being cynical about bad ideas coming up when there’s a vacumn and not doing what he wants to do, the author keeps a back log of good and important ideas that he waits to bring up for when someone important says something is priority. He gets what he wants done, compromising on timing. I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . For more along these lines, see Is it cynical to do what your manager wants? Just because they can do this doesn’t mean they want to. 
Technical decisions are often made for completely selfish reasons that cannot be influenced by a well-meaning engineer Powerful stakeholders are typically so stupid and dysfunctional that it’s effectively impossible for you to identify their needs and deliver solutions to them The political game being played depends on private information that software engineers do not have, so any attempt to get involved will result in just blundering around Managers and executives spend most of their time playing politics, while engineers spend most of their time doing engineering, so engineers are at a serious political disadvantage before they even start to migrate the billing code to stored-data-updated-by-webhooks instead of cached API calls to rip out the ancient hand-rolled build pipeline and replace it with Vite to rewrite a crufty high-volume Python service in Golang to replace the slow CMS frontend that backs your public documentation with a fast static site I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. ↩ What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . ↩ For more along these lines, see Is it cynical to do what your manager wants? ↩ Just because they can do this doesn’t mean they want to. ↩

Sean Goedecke 2 months ago

What is "good taste" in software engineering?

Technical taste is different from technical skill. You can be technically strong but have bad taste, or technically weak with good taste. Like taste in general, technical taste sometimes runs ahead of your ability: just like you can tell good food from bad without being able to cook, you can know what kind of software you like before you've got the ability to build it. You can develop technical ability by study and repetition, but good taste is developed in a more mysterious way. Here are some indicators of software taste:

I think taste is the ability to adopt the set of engineering values that fit your current project.

Aren't the indicators above just a part of skill? For instance, doesn't code look good if it's good code? I don't think so. Let's take an example. Personally, I feel like code that uses map and filter looks nicer than using a for loop. It's tempting to think that this is a case of me being straightforwardly correct about a point of engineering. For instance, map and filter typically involve pure functions, which are easier to reason about, and they avoid an entire class of off-by-one iterator bugs. It feels to me like this isn't a matter of taste, but a case where I'm right and other engineers are wrong.

But of course it's more complicated than that. Languages like Golang don't contain map and filter at all, for principled reasons. Iterating with a for loop is easier to reason about from a performance perspective, and is more straightforward to extend to other iteration strategies (like taking two items at a time). I don't care about these reasons as much as I care about the reasons in favour of map and filter - that's why I don't write a lot of for loops - but it would be far too arrogant for me to say that engineers who prefer for loops are simply less skilled. In many cases, they have technical capabilities that I don't have. They just care about different things. In other words, our disagreement comes down to a difference in values.
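To make the disagreement concrete, here's a small Python sketch (illustrative, not from the post) of the same operation written both ways:

```python
# The same transformation in two styles: collect the squares of the
# even numbers. Neither version is "more correct" -- the preference
# between them is mostly a matter of engineering values.

def even_squares_functional(xs):
    # map/filter style: pure functions, no mutable loop state
    return list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, xs)))

def even_squares_loop(xs):
    # for-loop style: easier to step through and profile, and simpler to
    # extend to other iteration strategies (e.g. two items at a time)
    result = []
    for x in xs:
        if x % 2 == 0:
            result.append(x * x)
    return result

print(even_squares_functional([1, 2, 3, 4, 5]))  # -> [4, 16]
```

Both return the same answer; the argument is over which one is easier to reason about, and that depends on what you care about reasoning about.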
I wrote about this point in I don't know how to build software and you don't either. Even if the big technical debates do have definite answers, no working software engineer is ever in a position to know what those answers are, because you can only fit so much experience into one career. We are all at least partly relying on our own personal experience: on our particular set of engineering values.

Almost every decision in software engineering is a tradeoff. You're rarely picking between two options where one is strictly better. Instead, each option has its own benefits and downsides. Often you have to make hard tradeoffs between engineering values: past a certain point, you cannot easily increase performance without harming readability, for instance 1 . Really understanding this point is (in my view) the biggest indicator of maturity in software engineering. Immature engineers are rigid about their decisions. They think it's always better to do X or Y. Mature engineers are usually willing to consider both sides of a decision, because they know that both sides come with different benefits. The trick is not deciding if technology X is better than Y, but whether the benefits of X outweigh Y in this particular case. In other words, immature engineers are too inflexible about their taste. They know what they like, but they mistake that liking for a principled engineering position.

What defines a particular engineer's taste? In my view, your engineering taste is composed of the set of engineering values you find most important. For instance:

- Resiliency. If an infrastructure component fails (a service dies, a network connection becomes unavailable), does the system remain functional? Can it recover without human intervention?
- Speed. How fast is the software, compared to the theoretical limit? Is work being done in the hot path that isn't strictly necessary?
- Readability. Is the software easy to take in at a glance and to onboard new engineers to? Are functions relatively short and named well? Is the system well-documented?
- Correctness. Is it possible to represent an invalid state in the system? How locked-down is the system with tests, types, and asserts? Do the tests use techniques like fuzzing? In the extreme case, has the program been proven correct by formal methods like Alloy?
- Flexibility. Can the system be trivially extended? How easy is it to make a change? If I need to change something, how many different parts of the program do I need to touch in order to do so?
- Portability. Is the system tied down to a particular operational environment (say, Microsoft Windows, or AWS)? If the system needs to be redeployed elsewhere, can that happen without a lot of engineering work?
- Scalability. If traffic goes up 10x, will the system fall over? What about 100x? Does the system have to be over-provisioned or can it scale automatically? What bottlenecks will require engineering intervention?
- Development speed. If I need to extend the system, how fast can it be done? Can most engineers work on it, or does it require a domain expert?

There are many other engineering values: elegance, modern-ness, use of open source, monetary cost of keeping the system running, and so on. All of these are important, but no engineer cares equally about all of these things. Your taste is determined by which of these values you rank highest. For instance, if you value speed and correctness more than development speed, you are likely to prefer Rust over Python. If you value scalability over portability, you are likely to argue for a heavy investment in your host's (e.g. AWS) particular quirks and tooling. If you value resiliency over speed, you are likely to want to split your traffic between different regions. And so on 2 . It's possible to break these values down in a more fine-grained way.
Two engineers who both deeply care about readability could disagree because one values short functions and the other values short call-stacks. Two engineers who both care about correctness could disagree because one values exhaustive test suites and the other values formal methods. But the principle is the same - there are lots of possible engineering values to care about, and because they are often in tension, each engineer is forced to take some more seriously than others. I’ve said that all of these values are important. Despite that, it’s possible to have bad taste. In the context of software engineering, bad taste means that your preferred values are not a good fit for the project you’re working on . Most of us have worked with engineers like this. They come onto your project evangelizing about something - formal methods, rewriting in Golang, Ruby meta-programming, cross-region deployment, or whatever - because it’s worked well for them in the past. Whether it’s a good fit for your project or not, they’re going to argue for it, because it’s what they like. Before you know it, you’re making sure your internal metrics dashboard has five nines of reliability, at the cost of making it impossible for any junior engineer to understand. In other words, most bad taste comes from inflexibility . I will always distrust engineers who justify decisions by saying “it’s best practice”. No engineering decision is “best practice” in all contexts! You have to make the right decision for the specific problem you’re facing. One interesting consequence of this is that engineers with bad taste are like broken compasses. If you’re in the right spot, a broken compass will still point north. It’s only when you start moving around that the broken compass will steer you wrong. Likewise, many engineers with bad taste can be quite effective in the particular niche where their preferences line up with what the project needs. 
But when they're moved between projects or jobs, or when the nature of the project changes, the wheels immediately come off. No job stays the same for long, particularly in these troubled post-2021 times.

Good taste is a lot more elusive than technical ability. That's because, unlike technical ability, good taste is the ability to select the right set of engineering values for the particular technical problem you're facing. It's thus much harder to identify if someone has good taste: you can't test it with toy problems, or by asking about technical facts. You need there to be a real problem, with all of its messy real-world context. You can tell you have good taste if the projects you're working on succeed. If you're not meaningfully contributing to the design of a project (maybe you're just doing ticket-work), you can tell you have good taste if the projects where you agree with the design decisions succeed, and the projects where you disagree are rocky. Importantly, you need a set of different kinds of projects. If it's just the one project, or the same kind of project over again, you might just be a good fit for that. Even if you go through many different kinds of projects, that's no guarantee that you have good taste in domains you're less familiar with 3 .

How do you develop good taste? It's hard to say, but I'd recommend working on a variety of things, paying close attention to which projects (or which parts of the project) are easy and which parts are hard. You should focus on flexibility: try not to acquire strong universal opinions about the right way to write software. What good taste I have I acquired pretty slowly. Still, I don't see why you couldn't acquire it fast. I'm sure there are prodigies with taste beyond their experience in programming, just as there are prodigies in other domains.

1. Of course this isn't always true. There are win-win changes where you can improve several usually-opposing values at the same time. But mostly we're not in that position.
2. Like I said above, different projects will obviously demand a different set of values. But the engineers working on those projects will still have to draw the line somewhere, and they'll rely on their own taste to do that.
3. That said, I do think good taste is somewhat transferable. I don't have much personal experience with this so I'm leaving it in a footnote, but if you're flexible and attentive to the details in domain A, you'll probably be flexible and attentive to the details in domain B.

Sean Goedecke 2 months ago

AI coding agents rely too much on fallbacks

One frustrating pattern I've noticed in AI agents - at least in Claude Code, Codex and Copilot - is building automatic fallbacks. Suppose you ask Codex to build a system to automatically group pages in a wiki by topic. (This isn't hypothetical, I just did this for EndlessWiki ). You'll probably want to use something like the Louvain method to identify clusters. But if you task an AI agent with building something like that, it will usually go one step further and build a fallback: a separate, simpler code path if the Louvain method fails (say, grouping page slugs alphabetically). If you're not careful, you might not even know if the Louvain method is working, or if you're just seeing the fallback behavior.

In my experience, AI agents will do this constantly. If you're building an app that makes an AI inference request, the generated code will likely fall back to some hard-coded response if the inference request fails. If you're using an agent to pull structured data from some API, the agent may silently fall back to placeholder data for part of it. If you're writing some kind of clever spam detector, the agent will want to fall back to a basic keyword check if your clever approach doesn't work.

This is particularly frustrating for the main kind of work that AI agents are useful for: prototyping new ideas. If you're using AI agents to make real production changes to an existing app, fallbacks are annoying but can be easily stripped out before you submit the pull request. But if you're using AI agents to test out a new approach, you're typically not checking the code line-by-line. The usual workflow is to ask the agent to try an approach, then benchmark or fiddle with the result, and so on. If your benchmark or testing doesn't know whether it's hitting the real code or some toy fallback, you can't be confident that you're actually evaluating your latest idea. I don't think this behavior is deliberate.
My best guess is that it's a reinforcement learning artifact: code with fallbacks is more likely to succeed, so during training the models learn to include fallbacks 1 . If I'm wrong and it's part of the hidden system prompt (or a deliberate choice), I think it's a big mistake. When you ask an AI agent to implement a particular algorithm, it should implement that algorithm. In researching this post, I saw this r/cursor thread where people are complaining about this exact problem (and also attributing it to RL). Supposedly you can prompt around it, if you repeat "DO NOT WRITE FALLBACK CODE" several times.
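A minimal Python sketch of the pattern (all function names hypothetical): the silent-fallback version an agent tends to write, next to a version that fails loudly so a benchmark can't accidentally measure the wrong code path.

```python
# Hypothetical sketch: a "clever" code path that breaks, and two ways
# of handling that failure.

def louvain_clusters(pages):
    # stand-in for the real clustering algorithm
    raise RuntimeError("clustering failed")

def group_pages_agent_style(pages):
    # agent-generated pattern: silently fall back to a toy strategy,
    # so callers can't tell which code path actually ran
    try:
        return louvain_clusters(pages)
    except Exception:
        return sorted(pages)  # alphabetical "grouping"

def group_pages_fail_loudly(pages):
    # no fallback: a broken algorithm surfaces immediately
    return louvain_clusters(pages)

print(group_pages_agent_style(["b", "a"]))  # -> ['a', 'b'] -- looks fine!
```

The first version returns something plausible even though the real algorithm never ran; the second makes the failure impossible to miss.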

Sean Goedecke 2 months ago

Endless AI-generated Wikipedia

I built an infinite, AI-generated wiki. You can try it out at endlesswiki.com ! Large language models are like Borges' infinite library. They contain a huge array of possible texts, waiting to be elicited by the right prompt - including some version of Wikipedia. What if you could explore a model by interacting with it as a wiki?

The idea here is to build a version of Wikipedia where all the content is AI-generated. You only have to generate a single page to get started: when a user clicks any link on that page, the page for that link is generated on-the-fly, which will include links of its own. By browsing the wiki, users can dig deeper into the stored knowledge of the language model. This works because wikipedias 1 connect topics very broadly. If you follow enough links, you can get from any topic to any other topic. In fact, people already play a game where they try to race from one page to a totally unrelated page by just following links. It's fun to try and figure out the most likely chain of conceptual relationships between two completely different things. In a sense, EndlessWiki is a collaborative attempt to mine the depths of a language model. Once a page is generated, all users will be able to search for it or link it to their friends.

The basic design is very simple: a MySQL database with a table, and a Golang server. When the server gets a request, it looks the page up in the database. If it exists, it serves the page directly; if not, it generates the page from an LLM and saves it to the database before serving it. I'm using Kimi K2 for the model. I chose a large model because larger models contain more facts about the world (which is good for a wiki), and Kimi specifically because in my experience Groq is faster and more reliable than other model inference providers. Speed is really important for this kind of application, because the user has to wait for new pages to be generated. Fortunately, Groq is fast enough that the wait time is only a few hundred ms.
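The lookup-or-generate flow is simple enough to sketch. The real service is a Golang server backed by MySQL; in this Python sketch a dict stands in for the database table and the generation call is a stub, so only the shape of the flow is real.

```python
# Lookup-or-generate: the core request path of the wiki.

pages = {}  # stands in for the MySQL table: slug -> page content

def generate_page(slug):
    # stand-in for the LLM inference call (the slow part)
    return f"AI-generated article about {slug.replace('-', ' ')}"

def serve(slug):
    if slug in pages:
        return pages[slug]        # already generated: serve directly
    page = generate_page(slug)    # first visit: generate the page...
    pages[slug] = page            # ...and save it, so every later
    return page                   # reader gets the same article

print(serve("storming-of-the-bastille"))
```

Only the first visitor to a slug pays the inference latency; everyone after that gets a database hit.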
Unlike AutoDeck , I don't charge any money or require sign-in for this. That's because this is more of a toy than a tool, so I'm not worried about one power user costing me a lot of money in inference. You have to be manually clicking links to trigger inference.

The most interesting design decision I made was preventing "cheating". I'm excited to see how obscure the pages can get (for instance, can you eventually get to Neon Genesis Evangelion from the root page?). It would defeat the purpose if you could just type a page's URL directly into the address bar. To defeat that, I make each link have a query parameter, and then I fetch the origin page server-side to validate that it does indeed contain a link to the page you're navigating to 2 .

Like AutoDeck , EndlessWiki represents another step in my "what if you could interact with LLMs without having to chat" line of thought. I think there's a lot of potential here for non-toy features. For instance, what if ChatGPT automatically hyperlinked each proper noun in its responses, and clicking on those generated a response focused on that noun? Anyway, check it out!

1. I use the lowercase "w" because I mean all encyclopedia wikis. Wikipedia is just the most popular example.
2. Interestingly, Codex came up with five solutions to prevent cheating, all of which were pretty bad - way more complicated than the solution I ended up with. If I was purely vibe coding, I'd have ended up with some awkward cryptographic approach.
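The server-side link check described above might look something like this Python sketch (the URL scheme and regex are my assumptions, not the real implementation):

```python
import re

def linked_slugs(origin_html):
    # collect every wiki slug the origin page links to
    return set(re.findall(r'href="/wiki/([a-z0-9-]+)"', origin_html))

def may_generate(origin_html, target_slug):
    # only generate the target page if the page the user came from
    # actually links to it -- no typing URLs into the address bar
    return target_slug in linked_slugs(origin_html)

origin = '<p>See the <a href="/wiki/french-revolution">French Revolution</a>.</p>'
print(may_generate(origin, "french-revolution"))        # -> True
print(may_generate(origin, "neon-genesis-evangelion"))  # -> False
```

The nice property is that the check needs no cryptography: the wiki's own pages are the source of truth for which links exist.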

Sean Goedecke 2 months ago

What I learned building an AI-driven spaced repetition app

I spent the last couple of weeks building an AI-driven spaced repetition app. You can try it out here . Like many software engineering types who were teenagers in the early 2000s 1 , I've been interested in this for a long time. The main reason is that, unlike many other learning approaches, spaced repetition works. If you want to learn something, study it now, then study it an hour later, then a day later, then a week later, and so on. You don't have to spend much time overall, as long as you're consistent about coming back to it. Eventually you only need to refresh your memory every few years in order to maintain a solid working knowledge of the topic.

Spaced repetition learning happens more or less automatically as part of a software engineering job. Specific engineering skills will come up every so often (for instance, inspecting open network sockets, or the proper regex syntax for backtracking). If they come up often enough, you'll internalize them. It's more difficult to use spaced repetition to deliberately learn new things. Even if you're using a spaced repetition tool like Anki , you have to either write your own deck of flashcards (which requires precisely the kind of expertise you don't have yet), or search for an existing one that exactly matches the area you're trying to learn 2 .

One way I learn new things is from LLMs. I wrote about this in How I use LLMs to learn new subjects , but the gist is that I ask a ton of follow-up questions about a question I have. The best part about this approach is that it requires zero setup cost: if at any moment I want to learn more about something, I can type a question out and rapidly dig in to something I didn't already know. What if you could use LLMs to make spaced repetition easier? Specifically, what if you could ask an LLM to give you an infinite feed of spaced repetition flashcards, adjusting the difficulty based on your responses? That's the idea behind AutoDeck .
You give it a topic and it gives you infinite flashcards about that topic. If it’s pitched too easy (e.g. you keep saying “I know”) or too hard, it’ll automatically change the difficulty. The thing I liked most about building AutoDeck is that it’s an AI-driven app where the interface isn’t chat . I think that’s really cool - almost every kind of killer AI app presents a chat interface. To use Claude Code, you chat with an agent. The various data analysis tools are typically in a “chat with your data” mode. To use ChatGPT, you obviously chat with it. That makes sense, since (a) the most unusual thing about LLMs is that you can talk with them, and (b) most AI apps let the user take a huge variety of possible actions, for which the only possible interface is some kind of chat. The problem with chat is that it demands a lot of the user. That’s why most “normal” apps have the user click buttons instead of type out sentences, and that’s why many engineering and design blogs have been writing about how to build AI apps that aren’t chat-based. Still, it’s easier said than done. I think spaced repetition flashcards are a good use-case for AI. Generating them for any topic is something that would be impossible without LLMs, so it’s a compelling idea. But you don’t have to interact with them via text (beyond typing out what topic you want at the outset). How do you use AI to generate an infinite feed of content? I tried a bunch of different approaches here. The two main problems here are speed and consistency . Speed is difficult because AI generation can be pretty slow: counting the time-to-first-token, it’s a few hundred ms, even for quick models. If you’re generating each flashcard with a single request, a user who’s familiar with the subject matter can click through flashcards faster than the AI can generate them. 
Batching up flashcard generation is quicker (because you only wait for time-to-first-token once) but it forces the user to wait much longer before they see their first card. What if you generate flashcards in parallel? That has two problems of its own. First, you're still waiting for the time-to-first-token on every request, so throughput is still much slower than the batched approach. Second, it's very easy to generate duplicate cards that way. Even with a high temperature, if you ask the same model the same question with the same prompt, you're likely to get similar answers. The parallel-generation flashcard feed was thus pretty repetitive: if you wanted to learn about French history, you'd get "what year was the Bastille stormed" right next to "in what year was the storming of the Bastille", and so on.

The solution I landed on was batching the generation, but saving each card as it comes in. In other words, I asked the model to generate ten cards, but instead of waiting for the entire response to be over before I saved the data, I made each card available to the client as soon as it was generated. This was trickier than it sounds for a few reasons. First, it means you can't use JSON structured outputs . Structured outputs are great for ensuring you get a response that your code can parse, but you can't (easily) parse chunks of JSON mid-stream. You have to wait for the entire output before it's valid JSON, because you need the closing } or ] characters 3 . Instead, I asked the model to respond in XML chunks, which could be easily parsed as they came in. Second, it meant I couldn't simply have the client request a card and get a card back. The code that generated cards had to be able to run in the background without blocking the client, which forced the client to periodically check for available cards. I built most of AutoDeck with OpenAI's Codex. It was pretty good! I had to intervene in maybe one change out of three, and I only had to seriously intervene (i.e.
completely change the approach) in one change out of ten. Some examples of where I had to intervene: I tried Claude Code at various parts of the process and honestly found it underwhelming. It took longer to make each change and in general required more intervention, which meant I was less comfortable queueing up changes. This is a pretty big win for OpenAI - until very recently, Claude Code has been much better than Codex in my experience.

I cannot imagine trying to build even a relatively simple app like this without being a competent software engineer already. Codex saved me a lot of time and effort, but it made a lot of bad decisions that I had to intervene on. It wasn't able to fix every bug I encountered. At this point, I don't think we're in the golden age of vibe coding. You still need to know what you're doing to actually ship an app with one of these tools.

One interesting thing about building AI projects is that it kind of forces you to charge money. I've released previous apps I've built for free, because I wanted people to use them and I'm not trying to do the software entrepreneur thing. But an app that uses AI costs me money for each user - not a ton of money, but enough that I'm strongly incentivized to charge a small amount for users who want to use the app more than just kicking the tires. I think this is probably a good thing. Charging money for software is a forcing function for actually making it work. If AI inference was free, I would probably have shipped AutoDeck in a much more half-assed state. Since I'm obliged to charge money for it, I spent more time making sure it was actually useful than I would normally spend on a side project. I had a lot of fun building AutoDeck! It's still mainly for me, but if you've read this far I hope you try it out and see if you like it as well. I'm still trying to figure out the best model.
GPT-5 was actually pretty bad at generating spaced repetition cards: the time-to-first-token was really slow, and the super-concise GPT-5 style made the cards read awkwardly. You don't need the smartest available model for spaced repetition, just a model with a good grasp of a bunch of textbook and textbook-adjacent facts.

1. The surviving ur-text for this is probably Gwern's 2009 post Spaced Repetition for Efficient Learning .
2. Most existing decks are tailored towards students doing particular classes (e.g. anatomy flashcards for med school), not people just trying to learn something new, so they often assume more knowledge than you might have.
3. I think this is just a lack of maturity in the ecosystem. I would hope that in a year or two you can generate structured XML, JSONL, or other formats that are more easily parseable in chunks. Those formats are just as easy to express as a grammar that the logit sampler can adhere to.
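The save-cards-as-they-stream idea can be sketched in Python (the tag name and chunk boundaries are illustrative, not AutoDeck's actual format): each complete card is handed off the moment its closing tag arrives, rather than waiting for the whole batch to finish.

```python
import re

CARD = re.compile(r"<card>(.*?)</card>", re.DOTALL)

def stream_cards(chunks):
    # Accumulate streamed model output, yielding each card as soon as
    # its closing tag shows up -- unlike a single JSON document, we
    # never need the full response to be well-formed.
    buffer, done = "", 0
    for chunk in chunks:
        buffer += chunk
        for m in CARD.finditer(buffer, done):
            yield m.group(1)   # save/serve this card immediately
            done = m.end()

# Chunks arrive mid-card, the way a real token stream would:
chunks = [
    "<card>Q: When was the Bastille stormed? A: 17",
    "89</card><card>Q: Who wrote Candide? A",
    ": Voltaire</card>",
]
print(list(stream_cards(chunks)))
```

Here the first card is yielded as soon as the second chunk arrives, while the model is still partway through generating the next one.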

Sean Goedecke 2 months ago

If you are good at code review, you will be good at using AI agents

Using AI agents correctly is a process of reviewing code . If you're good at reviewing code, you'll be good at using tools like Claude Code, Codex, or the Copilot coding agent. Why is that? Large language models are good at producing a lot of code, but they don't yet have the depth of judgement of a competent software engineer

Sean Goedecke 2 months ago

AI is good news for Australian and European software engineers

Right now the dominant programming model is something like "centaur chess" , where a skilled human is paired with a computer assistant. Together, they produce more work than either could individually. No individual human can work as fast or as consistently as an LLM, but LLMs lack the depth of judgement that good engineers do 1

Sean Goedecke 2 months ago

The whole point of OpenAI's Responses API is to help them hide reasoning traces

About six months ago, OpenAI released their Responses API , which replaced their previous /chat/completions API for inference. The old API was very simple: you pass in an array of messages representing a conversation between the model and a user, and get the model’s next response back. The new Responses API is more complicated
