Latest Posts (20 found)
Sean Goedecke 3 days ago

How I provide technical clarity to non-technical leaders

My mission as a staff engineer is to provide technical clarity to the organization. Of course, I do other stuff too. I run projects, I ship code, I review PRs, and so on. But the most important thing I do - what I’m for - is to provide technical clarity. In an organization, technical clarity is when non-technical decision makers have a good-enough practical understanding of what changes they can make to their software systems. The people in charge of your software organization 1 have to make a lot of decisions about software. Even if they’re not setting the overall strategy, they’re still probably deciding which kinds of users get which features, which updates are most important to roll out, whether projects should be delayed or rushed, and so on. These people may have been technical once. They may even have fine technical minds now. But they’re still “non-technical” in the sense I mean, because they simply don’t have the time or the context to build an accurate mental model of the system. Instead, they rely on a vague mental model, supplemented by advice from engineers they trust. To the extent that their vague mental model is accurate and the advice they get is good - in other words, to the extent that they have technical clarity - they’ll make sensible decisions. The stakes are therefore very high. Technical clarity in an organization can be the difference between a functional engineering group and a completely dysfunctional one. The default quantity of technical clarity in an organization is very low. In other words, decision-makers at tech companies are often hopelessly confused about the technology in question . This is not a statement about their competence. Software is really complicated , and even the engineers on the relevant team spend much of their time hopelessly confused about the systems they own. In my experience, this is surprising to non-engineers. But it’s true! For large established codebases, it’s completely normal for very senior engineers to be unable to definitively answer even very basic questions about how their own system works, like “can a user of type X do operation Y”, or “if we perform operation Z, what will it look like for users of type W?” Engineers often 2 answer these questions with “I’ll have to go and check”. Suppose a VP at a tech company wants to offer an existing paid feature to a subset of free-tier users. Of course, most of the technical questions involved in this project are irrelevant to the VP. But there is a set of technical questions that they will need to know the answers to: Finding out the answer to these questions is a complex technical process. It takes a deep understanding of the entire system, and usually requires you to also carefully re-read the relevant code. You can’t simply try the change out in a developer environment or on a test account, because you’re likely to miss edge cases. Maybe it works for your test account, but it doesn’t work for users who are part of an “organization”, or who are on a trial plan, and so on. Sometimes they can only be answered by actually performing the task. I wrote about why this happens in Wicked features : as software systems grow, they build marginal-but-profitable features that interact with each other in surprising ways, until the system becomes almost - but not quite - impossible to understand. Good software design can tame this complexity, but never eliminate it. Experienced software engineers are thus always suspicious that they’re missing some interaction that will turn into a problem in production. For a VP or product leader, it’s an enormous relief to work with an engineer who can be relied on to help them navigate the complexities of the software system. In my experience, this “technical advisor” role is usually filled by staff engineers, or by senior engineers who are rapidly on the path to a staff role. Senior engineers who are good at providing technical clarity sometimes get promoted to staff without even trying, in order to make them a more useful tool for the non-technical leaders who they’re used to helping. Of course, you can be an impactful engineer without doing the work of providing technical clarity to the organization. Many engineers - even staff engineers - deliver most of their value by shipping projects, identifying tricky bugs, doing good systems design , and so on. But those engineers will rarely be as valued as the ones providing technical clarity. That’s partly because senior leadership at the company will remember who was helping them, and partly because technical clarity is just much higher-leverage than almost any single project. Non-technical leaders need to make decisions, whether they’re clear or not. They are thus highly motivated to maintain a mental list of the engineers who can help them make those decisions, and to position those engineers in the most important teams and projects. From the perspective of non-technical leaders, those engineers are an abstraction around technical complexity . In the same way that engineers use garbage-collected languages so they don’t have to care about memory management, VPs use engineers so they don’t have to care about the details of software. But what does it feel like inside the abstraction ? Internally, engineers do have to worry about all the awkward technical details, even if their non-technical leaders don’t have to. If I say “no problem, we’ll be able to roll back safely”, I’m not as confident as I appear. When I’m giving my opinion on a technical topic, I top out at 95% confidence - there’s always a 5% chance that I missed something important - and am usually lower than that. I’m always at least a little bit worried. Why am I worried if I’m 95% sure I’m right? Because I’m worrying about the things I don’t know to look for. When I’ve been spectacularly wrong in my career, it’s usually not about risks that I anticipated. Instead, it’s about the “unknown unknowns”: risks that I didn’t even contemplate, because my understanding of the overall system was missing a piece. That’s why I say that shipping a project takes your full attention . When I lead technical projects, I spend a lot of time sitting and wondering about what I haven’t thought of yet. In other words, even when I’m quite confident in my understanding of the system, I still have a background level of internal paranoia. To provide technical clarity to the organization, I have to keep that paranoia to myself. There’s a careful balance to be struck between verbalizing all my worries - more on that later - and between being so overconfident that I fail to surface risks that I ought to have mentioned. Like good engineers, good VPs understand that all abstractions are sometimes leaky . They don’t blame their engineers for the occasional technical mistake, so long as those engineers are doing their duty as a useful abstraction the rest of the time 3 . What they won’t tolerate in a technical advisor is the lack of a clear opinion at all . An engineer who answers most questions with “well, I can’t be sure, it’s really hard to say” is useless as an advisor. They may still be able to write code and deliver projects, but they will not increase the amount of technical clarity in the organization. When I’ve written about communicating confidently in the past, some readers think I’m advising engineers to act unethically. They think that careful, technically-sound engineers should communicate the exact truth, in all its detail, and that appearing more confident than you are is a con man’s trick: of course if you pretend to be certain, leadership will think you’re a better engineer than the engineer who honestly says they’re not sure. Once one engineer starts keeping their worries to themself, other engineers have to follow or be sidelined, and pretty soon all the fast-talking blowhards are in positions of influence while the honest engineers are relegated to just working on projects. In other words, when I say “no problem, we’ll be able to roll back”, even though I might have missed something, isn’t that just lying? Shouldn’t I just communicate my level of confidence accurately? For instance, could I instead say “I think we’ll be able to roll back safely, though I can’t be sure, since my understanding of the system isn’t perfect - there could be all kinds of potential bugs”? I don’t think so. Saying that engineers should strive for maximum technical accuracy betrays a misunderstanding of what clarity is . At the top of this article, I said that clarity is when non-technical decision makers have a good enough working understanding of the system. That necessarily means a simplified understanding. When engineers are communicating to non-technical leadership, they must therefore simplify their communication (in other words, allow some degree of inaccuracy in the service of being understood). Most of my worries are not relevant information to non-technical decision makers . When I’m asked “can we deliver this today”, or “is it safe to roll this feature out”, the person asking is looking for a “yes” or “no”. If I also give them a stream of vague technical caveats, they will have to consciously filter that out in order to figure out if I mean “yes” or “no”. Why would they care about any of the details? They know that I’m better positioned to evaluate the technical risk than them - that’s why they’re asking me in the first place! I want to be really clear that I’m not advising engineers to always say “yes” even to bad or unacceptably risky decisions. Sometimes you need to say “we won’t be able to roll back safely, so we’d better be sure about the change”, or “no, we can’t ship the feature to this class of users yet”. My point is that when you’re talking to the company’s decision-makers, you should commit to a recommendation one way or the other , and only give caveats when the potential risk is extreme or the chances are genuinely high. At the end of the day, a VP only has so many mental bits to spare on understanding the technical details. If you’re a senior engineer communicating with a VP, you should make sure you fill those bits with the most important pieces: what’s possible, what’s impossible, and what’s risky. Don’t make them parse those pieces out of a long stream of irrelevant (to them) technical information. The highest-leverage work I do is to provide technical clarity to the organization: communicating up to non-technical decision makers to give them context about the software system. This is hard for two reasons. First, even competent engineers find it difficult to answer simple questions definitively about large codebases. Second, non-technical decision makers cannot absorb the same level of technical nuance as a competent engineer, so communicating to them requires simplification . Effectively simplifying complex technical topics requires three things: In a large tech company, this is usually a director or VP. However, depending on the scope we’re talking about, this could even be a manager or product manager - the same principles apply. Sometimes you know the answer off the top of your head, but usually that’s when you’ve been recently working on the relevant part of the codebase (and even then you may want to go and make sure you’re right). You do still have to be right a lot. I wrote about this in Good engineers are right, a lot . Despite this being very important, I don’t have a lot to say about it. You just have to feel it out based on your relationship with the decision-maker in question. Can the paid feature be safely delivered to free users in its current state? Can the feature be rolled out gradually? If something goes wrong, can the feature be reverted without breaking user accounts? Can a subset of users be granted early access for testing (and other) purposes? Can paid users be prioritized in case of capacity problems? Good taste - knowing which risks or context to mention and which to omit 4 . A deep technical understanding of the system. In order to communicate effectively, I need to also be shipping code and delivering projects. If I lose direct contact with the codebase, I will eventually lose my ability to communicate about it (as the codebase changes and my memory of the concrete details fades). The confidence to present a simplified picture to upper management. Many engineers either feel that it’s dishonest, or lack the courage to commit to claims where they’re only 80% or 90% confident. In my view, these engineers are abdicating their responsibility to help the organization make good technical decisions. I write about this a lot more in Engineers who won’t commit . In a large tech company, this is usually a director or VP. However, depending on the scope we’re talking about, this could even be a manager or product manager - the same principles apply. ↩ Sometimes you know the answer off the top of your head, but usually that’s when you’ve been recently working on the relevant part of the codebase (and even then you may want to go and make sure you’re right). ↩ You do still have to be right a lot. I wrote about this in Good engineers are right, a lot . ↩ Despite this being very important, I don’t have a lot to say about it. You just have to feel it out based on your relationship with the decision-maker in question. ↩

0 views
Sean Goedecke 1 weeks ago

GPT-5-Codex is a better AI researcher than me

In What’s the strongest AI model you can train on a laptop in five minutes? I tried my hand at answering a silly AI-research question. You can probably guess what it was. I chatted with GPT-5 to help me get started with the Python scripts and to bounce ideas off, but it was still me doing the research. I was coming up with the ideas, running the experiments, and deciding what to do next based on the data. The best model I could train was a 1.8M param transformer which produced output like this: Once upon a time , there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time. Since then, OpenAI has released GPT-5-codex, and supposedly uses it (plus Codex, their CLI coding tool) to automate a lot of their product development and AI research. I wanted to try the same thing. Codex-plus-me did a much better job than me alone 1 . Here’s an example of the best output I got from the model I trained with Codex: Once upon a time , in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it. What was the process like to get there? I want to call it “vibe research”. Like “vibe coding”, it’s performing a difficult technical task by relying on the model. I have a broad intuitive sense of what approaches are being tried, but I definitely don’t have a deep enough understanding to do this research unassisted. A real AI researcher would get a lot more out of the tool. Still, it was very easy to get started. I gave Codex the path to my scratch directory, told it “continue the research”, and it immediately began coming up with ideas and running experiments on its own. In a way, the “train in five minutes” challenge is a perfect fit, because the feedback loop is so short. The basic loop of doing AI research with Codex (at least as an enthusiastic amateur) looks something like this: After two days I did paste the current research notes into GPT-5-Pro, which helped a bit, but the vast majority of my time was spent in this loop. As we’ll see, the best ideas were ones Codex already came up with. I chewed through a lot of tokens doing this. That’s OK with me, since I paid for the $200-per-month plan 2 , but if you don’t want to do that you’ll have to space out your research a bit more slowly. I restarted my Codex process every million tokens or so. It didn’t have any issue continuing where it left off from its previous notes, which was nice. I ran Codex with . By default it didn’t have access to MPS, which meant it could only train models on the CPU. There’s probably some more principled way of sandboxing it, but I didn’t bother to figure it out. I didn’t run into any runaway-agent problems, unless you count crashing my laptop a few times by using up too much memory. Here’s a brief summary of how the research went over the four or five days I spent poking at it. I stayed with the TinyStories dataset for all of this, partially because I think it’s the best choice and partially because I wanted a 1:1 comparison between Codex and my own efforts. Codex and I started with a series of n-gram models: instead of training a neural network, n-gram models just store the conditional probabilities of a token based on the n tokens that precede it. These models are very quick to produce (seconds, not minutes) but aren’t very good. The main reason is that even a 5-gram model cannot include context from more than five tokens ago, so they struggle to produce coherent text across an entire sentence. Here’s an example: Once upon a time , in a small school . ” they are friends . they saw a big pond . he pulled and pulled , but the table was still no attention to grow even more . she quickly ran to the house . she says , ” sara said . ” you made him ! ” the smooth more it said , for helping me decorate the cake . It’s not terrible ! There are short segments that are entirely coherent. But it’s kind of like what AI skeptics think LLMs are like: just fragments of the original source, remixed without any unifying through-line. The perplexity is 18.5, worse than basically any of the transformers I trained in my last attempt. Codex trained 19 different n-gram models, of which the above example (a 4-gram model) was the best 3 . In my view, this is one of the strengths of LLM-based AI research: it is trivial to tell the model “go and sweep a bunch of different values for the hyperparameters” . Of course, you can do this yourself. But it’s a lot easier to just tell the model to do it. After this, Codex spent a lot of time working on transformers. It trained ~50 normal transformers with different sizes, number of heads, layers, and so on. Most of this wasn’t particularly fruitful. I was surprised that my hand-picked hyperparameters from my previous attempt were quite competitive - though maybe it shouldn’t have been a shock, since they matched the lower end of the Chinchilla scaling laws. Still, eventually Codex hit on a 8.53 perplexity model (3 layers, 4 heads, and a dimension of 144), which was a strict improvement over my last attempt. I’m not really convinced this was an architectural improvement. One lesson from training fifty different models is that there’s quite a lot of variance between different seeds. A perplexity improvement of just over 1 is more or less what I was seeing on a “lucky seed”. This was an interesting approach for the challenge: going for pure volume and hoping for a lucky training run. You can’t do this with a larger model, since it takes so long to train 4 , but the five-minute limit makes it possible. The next thing Codex tried - based on some feedback I pasted in from GPT-5-Pro - was “shallow fusion” : instead of training a new model, updating the generation logic to blend the transformer-predicted tokens with a n-gram model, a “kNN head” (which looks up hidden states that are “nearby” the current hidden state of the transformer and predicts their tokens), and a “cache head” that makes the model more likely to repeat words that are already in the context. This immediately dropped perplexity down to 7.38 : a whole other point lower than our best transformer. I was excited about that, but the generated content was really bad: Once upon a time,, in a small house, there lived a boy named Tim. Tim loved to play outside with his ball. One Mr. Skip had a lot of fun. He ran everywhere every day. One One day, Tim was playing with his ball new ball near his house. Tim was playing with his his ball and had a lot of fun. But then, he saw a big tree and decided to climb it. Tim tried to climb the tree, but he was too big. He was too small to reach the top of the tree. But the tree was too high. The little tree was too high for him. Soon, Tim was near the tree. He was brave and climbed the tree. But when he got got to the top, he was sad. Tim saw a bird on What happened? I over-optimized for perplexity. As it turns out, the pure transformers that were higher-perplexity were better at writing stories. They had more coherence over the entire length of the story, they avoided generating weird repetition artifacts (like ”,,”), and they weren’t as mindlessly repetitive. I went down a bit of a rabbithole trying to think of how to score my models without just relying on perplexity. I came up with some candidate rubrics, like grammatical coherence, patterns of repetition, and so on, before giving up and just using LLM-as-a-judge. To my shame, I even generated a new API key for the LLM before realizing that I was talking to a strong LLM already via Codex, and I could just ask Codex to rate the model outputs directly. The final and most successful idea I tried was distilling a transformer from a n-gram teacher model . First, we train a n-gram model, which only takes ~10 seconds. Then we train a transformer - but for the first 200 training steps, we push the transformer towards predicting the tokens that the n-gram model would predict. After that, the transformer continues to train on the TinyStories data as usual. Here’s an example of some output: Once upon a time, in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it. I think this is pretty good! It has characters that continue throughout the story. It has a throughline - Ben’s lost toy - though it confuses “toy” and “friend” a bit. It’s a coherent story, with a setup, problem, solution and moral. This is much better than anything else I’ve been able to train in five minutes. Why is it better? I think the right intuition here is that transformers need to spend a lot of initial compute (say, two minutes) learning how to construct grammatically-correct English sentences. If you begin the training by spending ten seconds training a n-gram model that can already produce sort-of-correct grammar, you can speedrun your way to learning grammar and spend an extra one minute and fifty seconds learning content. I really like this approach. It’s exactly what I was looking for from the start: a cool architectural trick that genuinely helps, but only really makes sense for this weird challenge 5 . I don’t have any illusions about this making me a real AI researcher, any more than a “vibe coder” is a software engineer. Still, I’m surprised that it actually worked. And it was a lot of fun! I’ve pushed up the code here if you want to pick up from where I left off, but you may be better off just starting from scratch with Codex or your preferred coding agent. edit: this post got some comments on Hacker News . The tone is much more negative than on my previous attempt, which is interesting - maybe the title gave people the mistaken impression that I think I’m a strong AI researcher! “Alone” here is relative - I did use ChatGPT and a bit of Copilot to generate some of the training code in my last attempt. I just didn’t use any agentic tooling. My deal with myself was that if I ever have a month where I use fewer than 2M tokens, I’ll cancel the plan. There are a lot of clever tricks involved here: Kneser-Ney smoothing, interpolating unigram/bigram/trigram probabilities on a specific schedule, deliberately keeping the sentinel token, etc. I didn’t spend the time understanding all of these things deeply - that’s vibe research for you - so I won’t write too much about it. Unless you’re a big AI lab. I am 100% convinced that the large labs are spending a lot of compute just re-training on different seeds in the hope of getting a lucky run. I was suspicious that I just got a lucky seed, but I compared ~40 generations with and without the distillation and the distilled model really was better at producing correct-looking stories. Codex makes a change to the training script and does three or four runs (this takes ~20 minutes overall) Based on the results, Codex suggests two or three things that you could try next I pick one of them (or very occasionally suggest my own idea) and return to (1). “Alone” here is relative - I did use ChatGPT and a bit of Copilot to generate some of the training code in my last attempt. I just didn’t use any agentic tooling. ↩ My deal with myself was that if I ever have a month where I use fewer than 2M tokens, I’ll cancel the plan. ↩ There are a lot of clever tricks involved here: Kneser-Ney smoothing, interpolating unigram/bigram/trigram probabilities on a specific schedule, deliberately keeping the sentinel token, etc. I didn’t spend the time understanding all of these things deeply - that’s vibe research for you - so I won’t write too much about it. ↩ Unless you’re a big AI lab. I am 100% convinced that the large labs are spending a lot of compute just re-training on different seeds in the hope of getting a lucky run. ↩ I was suspicious that I just got a lucky seed, but I compared ~40 generations with and without the distillation and the distilled model really was better at producing correct-looking stories. ↩

0 views
Sean Goedecke 1 weeks ago

How I influence tech company politics as a staff software engineer

Many software engineers are fatalistic about company politics. They believe that it’s pointless to get involved, because 1 : The general idea here is that software engineers are simply not equipped to play the game at the same level as real political operators . This is true! It would be a terrible mistake for a software engineer to think that you ought to start scheming and plotting like you’re in Game of Thrones . Your schemes will be immediately uncovered and repurposed to your disadvantage and other people’s gain. Scheming takes practice and power, and neither of those things are available to software engineers. It is simply a fact that software engineers are tools in the political game being played at large companies, not players in their own right. However, there are many ways to get involved in politics without scheming. The easiest way is to actively work to make a high-profile project successful . This is more or less what you ought to be doing anyway, just as part of your ordinary job. If your company is heavily investing in some new project - these days, likely an AI project - using your engineering skill to make it successful 2 is a politically advantageous move for whatever VP or executive is spearheading that project. In return, you’ll get the rewards that executives can give at tech companies: bonuses, help with promotions, and positions on future high-profile projects. I wrote about this almost a year ago in Ratchet effects determine engineer reputation at large companies . A slightly harder way (but one that gives you more control) is to make your pet idea available for an existing political campaign . Suppose you’ve wanted for a while to pull out some existing functionality into its own service. There are two ways to make that happen. The hard way is to expend your own political capital: drum up support, let your manager know how important it is to you, and slowly wear doubters down until you can get the project formally approved. The easy way is to allow some executive to spend their (much greater) political capital on your project . You wait until there’s a company-wide mandate for some goal that aligns with your project (say, a push for reliability, which often happens in the wake of a high-profile incident). Then you suggest to your manager that your project might be a good fit for this. If you’ve gauged it correctly, your org will get behind your project. Not only that, but it’ll increase your political capital instead of you having to spend it. Organizational interest comes in waves. When it’s reliability time, VPs are desperate to be doing something . They want to come up with plausible-sounding reliability projects that they can fund, because they need to go to their bosses and point at what they’re doing for reliability, but they don’t have the skillset to do it on their own. They’re typically happy to fund anything that the engineering team suggests. On the other hand, when the organization’s attention is focused somewhere else - say, on a big new product ship - the last thing they want is for engineers to spend their time on an internal reliability-focused refactor that’s invisible to customers. So if you want to get something technical done in a tech company, you ought to wait for the appropriate wave . It’s a good idea to prepare multiple technical programs of work, all along different lines. Strong engineers will do some of this kind of thing as an automatic process, simply by noticing things in the normal line of work. For instance, you might have rough plans: When executives are concerned about billing, you can offer the billing refactor as a reliability improvement. When they’re concerned about developer experience, you can suggest replacing the build pipeline. When customers are complaining about performance, you can point to the Golang rewrite as a good option. When the CEO checks the state of the public documentation and is embarrassed, you can make the case for rebuilding it as a static site. The important thing is to have a detailed, effective program of work ready to go for whatever the flavor of the month is. Some program of work will be funded whether you do this or not. However, if you don’t do this, you have no control over what that program is. In my experience, this is where companies make their worst technical decisions : when the political need to do something collides with a lack of any good ideas. When there are no good ideas, a bad idea will do, in a pinch. But nobody prefers this outcome. It’s bad for the executives, who then have to sell a disappointing technical outcome as if it were a success 4 , and it’s bad for the engineers, who have to spend their time and effort building the wrong idea. If you’re a very senior engineer, the VPs (or whoever) will quietly blame you for this. They’ll be right to! Having the right idea handy at the right time is your responsibility. You can view all this in two different ways. Cynically, you can read this as a suggestion to make yourself a convenient tool for the sociopaths who run your company to use in their endless internecine power struggles. Optimistically, you can read this as a suggestion to let executives set the overall priorities for the company - that’s their job, after all - and to tailor your own technical plans to fit 3 . Either way, you’ll achieve more of your technical goals if you push the right plan at the right time. edit: this post got some attention on Hacker News . The comments were much more positive than on my other posts about politics, for reasons I don’t quite understand. This comment is an excellent statement of what I write about here (but targeted at more junior engineers). This comment (echoed here ) references a Milton Friedman quote that applies the idea in this post to political policy in general, which I’d never thought of but sounds correct: Only a crisis—actual or perceived—produces real change. When that crisis occurs, the actions that are taken depend on the ideas that are lying around. That, I believe, is our basic function: to develop alternatives to existing policies, to keep them alive and available until the politically impossible becomes politically inevitable. There’s a few comments calling this approach overly game-playing and self-serving. I think this depends on the goal you’re aiming at. The ones I referenced above seem pretty beneficial to me! Finally, this comment is a good summary of what I was trying to say: Instead of waiting to be told what to do and being cynical about bad ideas coming up when there’s a vacumn and not doing what he wants to do, the author keeps a back log of good and important ideas that he waits to bring up for when someone important says something is priority. He gets what he wants done, compromising on timing. I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . For more along these lines, see Is it cynical to do what your manager wants? Just because they can do this doesn’t mean they want to. Technical decisions are often made for completely selfish reasons that cannot be influenced by a well-meaning engineer Powerful stakeholders are typically so stupid and dysfunctional that it’s effectively impossible for you to identify their needs and deliver solutions to them The political game being played depends on private information that software engineers do not have, so any attempt to get involved will result in just blundering around Managers and executives spend most of their time playing politics, while engineers spend most of their time doing engineering, so engineers are at a serious political disadvantage before they even start to migrate the billing code to stored-data-updated-by-webhooks instead of cached API calls to rip out the ancient hand-rolled build pipeline and replace it with Vite to rewrite a crufty high-volume Python service in Golang to replace the slow CMS frontend that backs your public documentation with a fast static site I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. ↩ What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . ↩ For more along these lines, see Is it cynical to do what your manager wants? ↩ Just because they can do this doesn’t mean they want to. ↩

0 views
Sean Goedecke 2 weeks ago

What is "good taste" in software engineering?

Technical taste is different from technical skill. You can be technically strong but have bad taste, or technically weak with good taste. Like taste in general, technical taste sometimes runs ahead of your ability: just like you can tell good food from bad without being able to cook, you can know what kind of software you like before you’ve got the ability to build it. You can develop technical ability by study and repetition, but good taste is developed in a more mysterious way. Here are some indicators of software taste: I think taste is the ability to adopt the set of engineering values that fit your current project . Aren’t the indicators above just a part of skill? For instance, doesn’t code look good if it’s good code ? I don’t think so. Let’s take an example. Personally, I feel like code that uses map and filter looks nicer than using a for loop. It’s tempting to think that this is a case of me being straightforwardly correct about a point of engineering. For instance, map and filter typically involve pure functions, which are easier to reason about, and they avoid an entire class of off-by-one iterator bugs. It feels to me like this isn’t a matter of taste, but a case where I’m right and other engineers are wrong. But of course it’s more complicated than that. Languages like Golang don’t contain map and filter at all, for principled reasons. Iterating with a for loop is easier to reason about from a performance perspective, and is more straightforward to extend to other iteration strategies (like taking two items at a time). I don’t care about these reasons as much as I care about the reasons in favour of map and filter - that’s why I don’t write a lot of for loops - but it would be far too arrogant for me to say that engineers who prefer for loops are simply less skilled. In many cases, they have technical capabilites that I don’t have. They just care about different things. In other words, our disagreement comes down to a difference in values . I wrote about this point in I don’t know how to build software and you don’t either . Even if the big technical debates do have definite answers, no working software engineer is ever in a position to know what those answers are, because you can only fit so much experience into one career. We are all at least partly relying on our own personal experience: on our particular set of engineering values. Almost every decision in software engineering is a tradeoff. You’re rarely picking between two options where one is strictly better. Instead, each option has its own benefits and downsides. Often you have to make hard tradeoffs between engineering values : past a certain point, you cannot easily increase performance without harming readability, for instance 1 . Really understanding this point is (in my view) the biggest indicator of maturity in software engineering. Immature engineers are rigid about their decisions. They think it’s always better to do X or Y. Mature engineers are usually willing to consider both sides of a decision, because they know that both sides come with different benefits. The trick is not deciding if technology X is better than Y, but whether the benefits of X outweigh Y in this particular case . In other words, immature engineers are too inflexible about their taste . They know what they like, but they mistake that liking for a principled engineering position. What defines a particular engineer’s taste? In my view, your engineering taste is composed of the set of engineering values you find most important . For instance: Resiliency . If an infrastructure component fails (a service dies, a network connection becomes unavailable), does the system remain functional? Can it recover without human intervention? Speed . How fast is the software, compared to the theoretical limit? Is work being done in the hot path that isn’t strictly necessary? Readability . Is the software easy to take in at a glance and to onboard new engineers to? Are functions relatively short and named well? Is the system well-documented? Correctness . Is it possible to represent an invalid state in the system? How locked-down is the system with tests, types, and asserts? Do the tests use techniques like fuzzing? In the extreme case, has the program been proven correct by formal methods like Alloy ? Flexibility . Can the system be trivially extended? How easy is it to make a change? If I need to change something, how many different parts of the program do I need to touch in order to do so? Portability . Is the system tied down to a particular operational environment (say, Microsoft Windows, or AWS)? If the system needs to be redeployed elsewhere, can that happen without a lot of engineering work? Scalability . If traffic goes up 10x, will the system fall over? What about 100x? Does the system have to be over-provisioned or can it scale automatically? What bottlenecks will require engineering intervention? Development speed . If I need to extend the system, how fast can it be done? Can most engineers work on it, or does it require a domain expert? There are many other engineering values: elegance, modern-ness, use of open source, monetary cost of keeping the system running, and so on. All of these are important, but no engineer cares equally about all of these things. Your taste is determined by which of these values you rank highest. For instance, if you value speed and correctness more than development speed, you are likely to prefer Rust over Python. If you value scalability over portability, you are likely to argue for a heavy investment in your host’s (e.g. AWS) particular quirks and tooling. If you value resiliency over speed, you are likely to want to split your traffic between different regions. And so on 2 . It’s possible to break these values down in a more fine-grained way. Two engineers who both deeply care about readability could disagree because one values short functions and the other values short call-stacks. Two engineers who both care about correctness could disagree because one values exhaustive test suites and the other values formal methods. But the principle is the same - there are lots of possible engineering values to care about, and because they are often in tension, each engineer is forced to take some more seriously than others. I’ve said that all of these values are important. Despite that, it’s possible to have bad taste. In the context of software engineering, bad taste means that your preferred values are not a good fit for the project you’re working on . Most of us have worked with engineers like this. They come onto your project evangelizing about something - formal methods, rewriting in Golang, Ruby meta-programming, cross-region deployment, or whatever - because it’s worked well for them in the past. Whether it’s a good fit for your project or not, they’re going to argue for it, because it’s what they like. Before you know it, you’re making sure your internal metrics dashboard has five nines of reliability, at the cost of making it impossible for any junior engineer to understand. In other words, most bad taste comes from inflexibility . I will always distrust engineers who justify decisions by saying “it’s best practice”. No engineering decision is “best practice” in all contexts! You have to make the right decision for the specific problem you’re facing. One interesting consequence of this is that engineers with bad taste are like broken compasses. If you’re in the right spot, a broken compass will still point north. It’s only when you start moving around that the broken compass will steer you wrong. Likewise, many engineers with bad taste can be quite effective in the particular niche where their preferences line up with what the project needs. But when they’re moved between projects or jobs, or when the nature of the project changes, the wheels immediately come off. No job stays the same for long, particularly in these troubled post-2021 times . Good taste is a lot more elusive than technical ability. That’s because, unlike technical ability, good taste is the ability to select the right set of engineering values for the particular technical problem you’re facing . It’s thus much harder to identify if someone has good taste: you can’t test it with toy problems, or by asking about technical facts. You need there to be a real problem, with all of its messy real-world context. You can tell you have good taste if the projects you’re working on succeed. If you’re not meaningfully contributing to the design of a project (maybe you’re just doing ticket-work), you can tell you have good taste if the projects where you agree with the design decisions succeed, and the projects where you disagree are rocky. Importantly, you need a set of different kinds of projects. If it’s just the one project, or the same kind of project over again, you might just be a good fit for that. Even if you go through many different kinds of projects, that’s no guarantee that you have good taste in domains you’re less familiar with 3 . How do you develop good taste? It’s hard to say, but I’d recommend working on a variety of things, paying close attention to which projects (or which parts of the project) are easy and which parts are hard. You should focus on flexibility : try not to acquire strong universal opinions about the right way to write software. What good taste I have I acquired pretty slowly. Still, I don’t see why you couldn’t acquire it fast. I’m sure there are prodigies with taste beyond their experience in programming, just as there are prodigies in other domains. Of course this isn’t always true. There are win-win changes where you can improve several usually-opposing values at the same time. But mostly we’re not in that position. Like I said above, different projects will obviously demand a different set of values. But the engineers working on those projects will still have to draw the line somewhere, and they’ll rely on their own taste to do that. That said, I do think good taste is somewhat transferable. I don’t have much personal experience with this so I’m leaving it in a footnote, but if you’re flexible and attentive to the details in domain A, you’ll probably be flexible and attentive to the details in domain B.

1 views
Sean Goedecke 2 weeks ago

AI coding agents rely too much on fallbacks

One frustrating pattern I’ve noticed in AI agents - at least in Claude Code, Codex and Copilot - is building automatic fallbacks . Suppose you ask Codex to build a system to automatically group pages in a wiki by topic. (This isn’t hypothetical, I just did this for EndlessWiki ). You’ll probably want to use something like the Louvain method to identify clusters. But if you task an AI agent with building something like that, it usually will go one step further, and build a fallback: a separate, simpler code path if the Louvain method fails (say, grouping page slugs alphabetically). If you’re not careful, you might not even know if the Louvain method is working, or if you’re just seeing the fallback behavior. In my experience, AI agents will do this constantly . If you’re building an app that makes an AI inference request, the generated code will likely fallback to some hard-coded response if the inference request fails. If you’re using an agent to pull structured data from some API, the agent may silently fallback to placeholder data for part of it. If you’re writing some kind of clever spam detector, the agent will want to fall back to a basic keyword check if your clever approach doesn’t work. This is particularly frustrating for the main kind of work that AI agents are useful for: prototyping new ideas. If you’re using AI agents to make real production changes to an existing app, fallbacks are annoying but can be easily stripped out before you submit the pull request. But if you’re using AI agents to test out a new approach, you’re typically not checking the code line-by-line. The usual workflow is to ask the agent to try an approach, then benchmark or fiddle with the result, and so on. If your benchmark or testing doesn’t know whether it’s hitting the real code or some toy fallback, you can’t be confident that you’re actually evaluating your latest idea. I don’t think this behavior is deliberate. My best guess is that it’s a reinforcement learning artifact: code with fallbacks is more likely to succeed, so during training the models are learning to include fallback 1 . If I’m wrong and it’s part of the hidden system prompt (or a deliberate choice), I think it’s a big mistake. When you ask an AI agent to implement a particular algorithm, it should implement that algorithm. In researching this post, I saw this r/cursor thread where people are complaining about this exact problem (and also attributing it to RL). Supposedly you can prompt around it, if you repeat “DO NOT WRITE FALLBACK CODE” several times.

1 views
Sean Goedecke 2 weeks ago

Endless AI-generated Wikipedia

I built an infinite, AI-generated wiki. You can try it out at endlesswiki.com ! Large language models are like Borges’ infinite library . They contain a huge array of possible texts, waiting to be elicited by the right prompt - including some version of Wikipedia. What if you could explore a model by interacting with it as a wiki? The idea here is to build a version of Wikipedia where all the content is AI-generated. You only have to generate a single page to get started: when a user clicks any link on that page, the page for that link is generated on-the-fly, which will include links of its own. By browsing the wiki, users can dig deeper into the stored knowledge of the language model. This works because wikipedias 1 connect topics very broadly. If you follow enough links, you can get from any topic to any other topic. In fact, people already play a game where they try to race from one page to a totally unrelated page by just following links. It’s fun to try and figure out the most likely chain of conceptual relationships between two completely different things. In a sense, EndlessWiki is a collaborative attempt to mine the depths of a language model. Once a page is generated, all users will be able to search for it or link it to their friends. The basic design is very simple: a MySQL database with a table, and a Golang server. When the server gets a request, it looks up in the database. If it exists, it serves the page directly; if not, it generates the page from a LLM and saves it to the database before serving it. I’m using Kimi K2 for the model. I chose a large model because larger models contain more facts about the world (which is good for a wiki), and Kimi specifically because in my experience Groq is faster and more reliable than other model inference providers. Speed is really important for this kind of application, because the user has to wait for new pages to be generated. Fortunately, Groq is fast enough that the wait time is only a few hundred ms. Unlike AutoDeck , I don’t charge any money or require sign-in for this. That’s because this is more of a toy than a tool, so I’m not worried about one power user costing me a lot of money in inference. You have to be manually clicking links to trigger inference. The most interesting design decision I made was preventing “cheating”. I’m excited to see how obscure the pages can get (for instance, can you get to eventually get to Neon Genesis Evangelion from the root page?) It would defeat the purpose if you could just manually go to in the address bar. To defeat that, I make each link have a query parameter, and then I fetch the origin page server-side to validate that it does indeed contain a link to the page you’re navigating to 2 . Like AutoDeck , EndlessWiki represents another step in my “what if you could interact with LLMs without having to chat” line of thought. I think there’s a lot of potential here for non-toy features. For instance, what if ChatGPT automatically hyperlinked each proper noun in its responses, and clicking on those generated a response focused on that noun? Anyway, check it out! I use the lowercase “w” because I mean all encyclopedia wikis. Wikipedia is just the most popular example. Interestingly, Codex came up with five solutions to prevent cheating, all of which were pretty bad - way more complicated than the solution I ended up with. If I was purely vibe coding, I’d have ended up with some awkward cryptographic approach.

5 views
Sean Goedecke 3 weeks ago

What I learned building an AI-driven spaced repetition app

I spent the last couple of weeks building an AI-driven spaced repetition app. You can try it out here . Like many software engineering types who were teenagers in the early 2000s 1 , I’ve been interested in this for a long time. The main reason is that, unlike many other learning approaches, spaced repetition works . If you want to learn something, study it now, then study it an hour later, then a day later, then a week later, and so on. You don’t have to spend much time overall, as long as you’re consistent about coming back to it. Eventually you only need to refresh your memory every few years in order to maintain a solid working knowledge of the topic. Spaced repetition learning happens more or less automatically as part of a software engineering job. Specific engineering skills will come up every so often (for instance, using to inspect open network sockets, or the proper regex syntax for backtracking). If they come up often enough, you’ll internalize them. It’s more difficult to use spaced repetition to deliberately learn new things. Even if you’re using a spaced repetition tool like Anki , you have to either write your own deck of flashcards (which requires precisely the kind of expertise you don’t have yet), or search for an existing one that exactly matches the area you’re trying to learn 2 . One way I learn new things is from LLMs. I wrote about this in How I use LLMs to learn new subjects , but the gist is that I ask a ton of follow-up questions about a question I have. The best part about this approach is that it requires zero setup cost: if at any moment I want to learn more about something, I can type a question out and rapidly dig in to something I didn’t already know. What if you could use LLMs to make spaced repetition easier? Specifically, what if you could ask a LLM to give you an infinite feed of spaced repetition flashcards, adjusting the difficulty based on your responses? That’s the idea behind AutoDeck . You give it a topic and it gives you infinite flashcards about that topic. If it’s pitched too easy (e.g. you keep saying “I know”) or too hard, it’ll automatically change the difficulty. The thing I liked most about building AutoDeck is that it’s an AI-driven app where the interface isn’t chat . I think that’s really cool - almost every kind of killer AI app presents a chat interface. To use Claude Code, you chat with an agent. The various data analysis tools are typically in a “chat with your data” mode. To use ChatGPT, you obviously chat with it. That makes sense, since (a) the most unusual thing about LLMs is that you can talk with them, and (b) most AI apps let the user take a huge variety of possible actions, for which the only possible interface is some kind of chat. The problem with chat is that it demands a lot of the user. That’s why most “normal” apps have the user click buttons instead of type out sentences, and that’s why many engineering and design blogs have been writing about how to build AI apps that aren’t chat-based. Still, it’s easier said than done. I think spaced repetition flashcards are a good use-case for AI. Generating them for any topic is something that would be impossible without LLMs, so it’s a compelling idea. But you don’t have to interact with them via text (beyond typing out what topic you want at the outset). How do you use AI to generate an infinite feed of content? I tried a bunch of different approaches here. The two main problems here are speed and consistency . Speed is difficult because AI generation can be pretty slow: counting the time-to-first-token, it’s a few hundred ms, even for quick models. If you’re generating each flashcard with a single request, a user who’s familiar with the subject matter can click through flashcards faster than the AI can generate them. Batching up flashcard generation is quicker (because you only wait for time-to-first-token once) but it forces the user to wait much longer before they see their first card. What if you generate flashcards in parallel? That has two problems of its own. First, you’re still waiting for the time-to-first-token on every request, so throughput is still much slower than the batched approach. Second, it’s very easy to generate duplicate cards that way. Even with a high temperature, if you ask the same model the same question with the same prompt, you’re likely to get similar answers. The parallel-generation flashcard feed was thus pretty repetitive: if you wanted to learn about French history, you’d get “what year was the Bastille stormed” right next to “in what year was the storming of the Bastille”, and so on. The solution I landed on was batching the generation, but saving each card as it comes in . In other words, I asked the model to generate ten cards, but instead of waiting for the entire response to be over before I saved the data, I made each card available to the client as soon as it was generated. This was trickier than it sounds for a few reasons. First, it means you can’t use JSON structured outputs . Structured outputs are great for ensuring you get a response that your code can parse, but you can’t (easily) parse chunks of JSON mid-stream. You have to wait for the entire output before it’s valid JSON, because you need the closing or characters 3 . Instead, I asked the model to respond in XML chunks, which could be easily parsed as they came in. Second, it meant I couldn’t simply have the client request a card and get a card back. The code that generated cards had to be able to run in the background without blocking the client, which forced the client to periodically check for available cards. I built most of AutoDeck with OpenAI’s Codex. It was pretty good! I had to intervene in maybe one change out of three, and I only had to seriously intervene (i.e. completely change the approach) in one change out of ten. Some examples of where I had to intervene: I tried Claude Code at various parts of the process and honestly found it underwhelming. It took longer to make each change and in general required more intervention, which meant I was less comfortable queueing up changes. This is a pretty big win for OpenAI - until very recently, Claude Code has been much better than Codex in my experience. I cannot imagine trying to build even a relatively simple app like this without being a competent software engineer already. Codex saved me a lot of time and effort, but it made a lot of bad decisions that I had to intervene. It wasn’t able to fix every bug I encountered. At this point, I don’t think we’re in the golden age of vibe coding. You still need to know what you’re doing to actually ship an app with one of these tools. One interesting thing about building AI projects is that it kind of forces you to charge money. I’ve released previous apps I’ve built for free, because I wanted people to use them and I’m not trying to do the software entepreneur thing. But an app that uses AI costs me money for each user - not a ton of money, but enough that I’m strongly incentivized to charge a small amount for users who want to use the app more than just kicking the tires. I think this is probably a good thing. Charging money for software is a forcing function for actually making it work. If AI inference was free, I would probably have shipped AutoDeck in a much more half-assed state. Since I’m obliged to charge money for it, I spent more time making sure it was actually useful than I would normally spend on a side project. I had a lot of fun building AutoDeck! It’s still mainly for me, but if you’ve read this far I hope you try it out and see if you like it as well. I’m still trying to figure out the best model. GPT-5 was actually pretty bad at generating spaced repetition cards: the time-to-first-token was really slow, and the super-concise GPT-5 style made the cards read awkwardly. You don’t need the smartest available model for spaced repetition, just a model with a good grasp of a bunch of textbook and textbook-adjacent facts. The surviving ur-text for this is probably Gwern’s 2009 post Spaced Repetition for Efficient Learning . Most existing decks are tailored towards students doing particular classes (e.g. anatomy flashcards for med school), not people just trying to learn something new, so they often assume more knowledge than you might have. I think this is just a lack of maturity in the ecosystem. I would hope that in a year or two you can generate structured XML, JSONL, or other formats that are more easily parseable in chunks. Those formats are just as easy to express as a grammar that the logit sampler can adhere to.

0 views
Sean Goedecke 3 weeks ago

If you are good at code review, you will be good at using AI agents

Using AI agents correctly is a process of reviewing code . If you’re good at reviewing code, you’ll be good at using tools like Claude Code, Codex, or the Copilot coding agent. Why is that. Large language models are good at producing a lot of code, but they don’t yet have the depth of judgement of a competent software engineer

1 views
Sean Goedecke 3 weeks ago

AI is good news for Australian and European software engineers

Right now the dominant programing model is something like “centaur chess” , where a skilled human is paired with a computer assistant. Together, they produce more work than either could individually. No individual human can work as fast or as consistently as a LLM, but LLMs lack the depth of judgement that good engineers do 1

0 views
Sean Goedecke 1 months ago

The whole point of OpenAI's Responses API is to help them hide reasoning traces

About six months ago, OpenAI released their Responses API , which replaced their previous /chat/completions API for inference. The old API was very simple: you pass in an array of messages representing a conversation between the model and a user, and get the model’s next response back. The new Responses API is more complicated

0 views
Sean Goedecke 1 months ago

'Make invalid states unrepresentable' considered harmful

One of the most controversial things I believe about good software design is that your code should be more flexible than your domain model . This is in direct opposition to a lot of popular design advice, which is all about binding your code to your domain model as tightly as possible. For instance, a popular principle for good software design is to make invalid states unrepresentable

0 views
Sean Goedecke 1 months ago

An unofficial FAQ for Stripe's new "Tempo" blockchain

Stripe just announced Tempo , a “L1 blockchain” for “stablecoin payments”. What does any of this mean. In 2021, I was interested enough in blockchain to write a simple explainer and a technical description of Bitcoin specifically . But I’ve never been a blockchain fan . Both my old and new “what kind of work I want” posts state that I’m ethically opposed to proof-of-work blockchain

0 views
Sean Goedecke 1 months ago

Seeing like a software company

The big idea of James C. Scott’s Seeing Like A State can be expressed in three points: By “legible”, I mean work that is predictable, well-estimated, has a paper trail, and doesn’t depend on any contingent factors (like the availability of specific people). Quarterly planning, OKRs, and Jira all exist to make work legible

0 views
Sean Goedecke 1 months ago

Do the simplest thing that could possibly work

When designing software systems, do the simplest thing that could possibly work. It’s surprising how far you can take this piece of advice. I genuinely think you can do this all the time . You can follow this approach for fixing bugs, for maintaining existing systems, and for architecting new ones

0 views
Sean Goedecke 1 months ago

Finding the low-hanging fruit

Suppose your job is to pick fruit in a giant orchard. The orchard covers several hills and valleys, and is big enough that you’d need a few weeks to walk all the way around the edge. What should you do first

0 views
Sean Goedecke 1 months ago

Everything I know about good API design

Most of what modern software engineers do 1 involves APIs: public interfaces for communicating with a program, like this one from Twilio. I’ve spent a lot of time working with APIs, both building and using them

0 views
Sean Goedecke 1 months ago

Don't feed me AI slop

In the early days of any new technology, the relevant social norms are still being workshopped. For mobile phones, that meant collectively deciding when and where phones should be on silent mode 1 . For instant messaging, that meant jumping right into the request instead of trying to do small talk first. What are the social norms we’re working out for AI right now

0 views
Sean Goedecke 2 months ago

Is chain-of-thought AI reasoning a mirage?

Reading research papers and articles about chain-of-thought reasoning 1 makes me frustrated. There are many interesting questions to ask about chain-of-thought: how accurately it reflects the actual process going on, why training it “from scratch” often produces chains that switch fluidly between multiple languages, and so on

0 views
Sean Goedecke 2 months ago

The famous "bottomless pit" AI greentext is fake

Many people believe 1 this is the best piece of art or humour that AI has ever produced: This was generated three years ago by GPT-3. It’s notable by itself that the best piece of AI art might have been produced by the least capable AI model

0 views
Sean Goedecke 2 months ago

What's the strongest AI model you can train on a laptop in five minutes?

What’s the strongest model I can train on my MacBook Pro 1 in five minutes. I’ll give the answer upfront: the best 5-minute model I could train was a ~1. 8M-param GPT-style transformer trained on ~20M TinyStories tokens, reaching ~9. 6 perplexity on a held-out split. Here’s an example of the output, with the prompt bolded: Once upon a time , there was a little boy named Tim

0 views