Latest Posts (20 found)
Sean Goedecke 5 days ago

Is it worrying that 95% of AI enterprise projects fail?

In July of this year, MIT NANDA released a report called The GenAI Divide: State of AI in Business 2025 . The report spends most of its time giving advice about how to run enterprises AI projects, but the item that got everybody talking was its headline stat: 95% of organizations are getting zero return from their AI projects . This is a very exciting statistic for those already disposed to be pessimistic about the impact of AI. The incredible amounts of money and time being spent on AI depend on language models being a transformative technology. Many people are expecting AI to eventually unlock hundreds of billions of dollars in value. The NANDA paper seems like very bad news for those people, if the last three years of AI investment really has failed to unlock even one dollar in value for most companies. Cards on the table - I think AI is going to have an impact about on-par with the internet, or railroads, but that we’re also definitely in a bubble. I wrote about this in What’s next after the AI bubble bursts? 1 . I am not convinced that the NANDA report is bad news for AI . The obvious question to ask about the report is “well, what’s the base rate?” Suppose that 95% of enterprise AI transformations fail. How does that compare to the failure rate of normal enterprise IT projects? This might seem like a silly question for those unfamiliar with enterprise AI projects - whatever the failure rate, surely it can’t be close to 95%! Well. In 2016, Forbes interviewed the author of another study very much like the NANDA report, except about IT transformations in general, and found an 84% failure rate. McKinsey has only one in 200 IT projects coming in on time and within budget. The infamous 2015 CHAOS report found an 61% failure rate, going up to 98% for “large, complex projects”. Most enterprise IT projects are at least partial failures. Of course, much of this turns on how we define success. Is a project a success if it delivers what it promised a year late? What if it had to cut down on some of the promised features? Does it matter which features? The NANDA report defines it like this, which seems like a fairly strict definition to me: We define successfully implemented for task-specific GenAI tools as ones users or executives have remarked as causing a marked and sustained productivity and/or P&L impact. Compare the CHAOS report’s definition of success: Success … means the project was resolved within a reasonable estimated time, stayed within budget, and delivered customer and user satisfaction regardless of the original scope. I think these are close enough to be worth comparing, which means that according to the NANDA report, AI projects succeed at roughly the same rate as ordinary enterprise IT projects . Nobody says “oh, databases must be just hype” when a database project fails. In the interest of fairness, we should extend the same grace to AI. 81% and 95% are both high failure rates, but 95% is higher. Is that because AI offers less value than other technologies, or because AI projects are unusually hard? I want to give some reasons why we might think AI projects fall in the CHAOS report’s category of “large, complex projects”. Useful AI models have not been around for long. GPT-3.5 was released in 2022, but it was more of a toy than a tool. For my money, the first useful AI model was GPT-4, released in March 2023, and the first cheap, useful, and reliable AI model was GPT-4o in May 2024. That means that enterprise AI projects have been going at most three years, if they were willing and able to start with GPT-3.5, and likely much closer to eighteen months. The average duration of an enterprise IT project is 2.4 years in the private sector and 3.9 years in the public sector. Enterprise AI adoption is still very young, by the standards of their other IT projects. Also, to state the obvious, AI is a brand-new technology. Most failed enterprise IT projects are effectively “solved problems”: like migrating information into a central database, or tracking high-volume events, or aggregating various data sources into a single data warehouse for analysis 2 . Of course, any software engineer should know that solving a “solved problem” is not easy. The difficulties are in all the myriad details that have to be worked out. But enterprise AI projects are largely not “solved problems”. The industry is still working out the best way to build a chatbot. Should tools be given as definitions, or discovered via MCP? Should agents use sub-agents? What’s the best way to compact the context window? Should data be fetched via RAG, or via agentic keyword search? And so on. This is a much more fluid technical landscape than most enterprise AI projects. Even by itself, that’s enough to push AI projects into the “complex” category. So far I’ve assumed that the “95% of enterprise AI projects fail” statistic is reliable. Should we? NANDA’s source for the 95% figure is a survey in section 3.2: The immediate problem here is that I don’t think this figure even shows that 95% of AI projects fail . As I read it, the leftmost section shows that 60% of the surveyed companies “investigated” building task-specific AI. 20% of the surveyed companies then built a pilot, and 5% built an implementation that had a sustained, notable impact on productivity or profits. So just on the face of it, that’s an 8.3% success rate, not a 5% success rate, because 40% of the surveyed companies didn’t even try . It’s also unclear if all the companies that investigated AI projects resolved to carry them out. If some of them decided not to pursue an AI project after the initial investigation, they’d also be counted in the failure rate, which doesn’t seem right at all. We also don’t know how good the raw data is. Read this quote, directly above the image: These figures are directionally accurate based on individual interviews rather than official company reporting. Sample sizes vary by category, and success definitions may differ across organizations. In Section 8.2, the report lays out its methodology: 52 interviews across “enterprise stakeholders”, 153 surveys of enterprise “leaders”, and an analysis of 300+ public AI projects. I take this quote to mean that the 95% figure is based on a subset of those 52 interviews. Maybe all 52 interviews gave really specific data! Or maybe only a handful of them did. Finally, the subject of the claim here is a bit narrower than “AI projects” . The 95% figure is specific to “embedded or task-specific GenAI”, as opposed to general purpose LLM use (presumably something like using the enterprise version of GitHub Copilot or ChatGPT). In fairness to the NANDA report, the content of the report does emphasize that many employees are internally using AI via those tools, and at least believe that they’re getting a lot of value out of it. This one’s more a criticism of the people who’ve been tweeting that “95% of AI use at companies is worthless”, and so on. The NANDA report is not as scary as it looks. The main reason is that ~95% of hard enterprise IT projects fail no matter what, so AI projects failing at that rate is nothing special . AI projects are all going to be on the hard end, because the technology is so new and there’s very little industry agreement on best practices. It’s also not clear to me that the 95% figure is trustworthy. Even taking it on its own terms, it’s mathematically closer to 92%, which doesn’t inspire confidence in the rest of the NANDA team’s interpretation. We’re forced to take it on trust, since we can’t see the underlying data - in particular, how many of those 52 interviews went into that 95% figure. Here’s what I think it’s fair to conclude from the paper. Like IT projects in general, almost all internal AI projects at large enterprises fail. That means that enterprises will reap the value of AI - whatever it turns out to be - in two ways: first, illicit use of personal AI tools like ChatGPT, which forms a familiar “shadow IT” in large enterprises; second, by using pre-built enterprise tooling like Copilot and the various AI labs’ enterprise products. It remains to be seen exactly how much value that is. In short: almost every hugely transformative technology went through its own bubble, as hype expectations outpaced the genuine value of the technology that was fuelling the market. I expect the AI bubble to burst, the infrastructure (e.g. datacenters full of GPUs) to stick around at cheaper prices, and AI to eventually become as fundamental a technology as the internet is today. By “solved problem” I mean that the technology involved is mature, well-understood, and available (e.g. you can just pick up Kafka for event management, etc). In short: almost every hugely transformative technology went through its own bubble, as hype expectations outpaced the genuine value of the technology that was fuelling the market. I expect the AI bubble to burst, the infrastructure (e.g. datacenters full of GPUs) to stick around at cheaper prices, and AI to eventually become as fundamental a technology as the internet is today. ↩ By “solved problem” I mean that the technology involved is mature, well-understood, and available (e.g. you can just pick up Kafka for event management, etc). ↩

0 views
Sean Goedecke 2 weeks ago

Mistakes I see engineers making in their code reviews

In the last two years, code review has gotten much more important. Code is now easy to generate using LLMs, but it’s still just as hard to review 1 . Many software engineers now spend as much (or more) time reviewing the output of their own AI tools than their colleagues’ code. I think a lot of engineers don’t do code review correctly. Of course, there are lots of different ways to do code review, so this is largely a statement of my engineering taste . The biggest mistake I see is doing a review that focuses solely on the diff 2 . Most of the highest-impact code review comments have very little to do with the diff at all, but instead come from your understanding of the rest of the system. For instance, one of the most straightforwardly useful comments is “you don’t have to add this method here, since it already exists in this other place”. The diff itself won’t help you produce a comment like this. You have to already be familiar with other parts of the codebase that the diff author doesn’t know about. Likewise, comments like “this code should probably live in this other file” are very helpful for maintaining the long-term quality of a codebase. The cardinal value when working in large codebases is consistency (I write about this more in Mistakes engineers make in large established codebases ). Of course, you cannot judge consistency from the diff alone. Reviewing the diff by itself is much easier than considering how it fits into the codebase as a whole. You can rapidly skim a diff and leave line comments (like “rename this variable” or “this function should flow differently”). Those comments might even be useful! But you’ll miss out on a lot of value by only leaving this kind of review. Probably my most controversial belief about code review is that a good code review shouldn’t contain more than five or six comments . Most engineers leave too many comments. When you receive a review with a hundred comments, it’s very hard to engage with that review on anything other than a trivial level. Any really important comments get lost in the noise 2.5 . What do you do when there are twenty places in the diff that you’d like to see updated - for instance, twenty instances of variables instead of ? Instead of leaving twenty comments, I’d suggest leaving a single comment explaining the stylistic change you’d like to make, and asking the engineer you’re reviewing to make the correct line-level changes themselves. There’s at least one exception to this rule. When you’re onboarding a new engineer to the team, it can be helpful to leave a flurry of stylistic comments to help them understand the specific dialect that your team uses in this codebase. But even in this case, you should bear in mind that any “real” comments you leave are likely to be buried by these other comments. You may still be better off leaving a general “we don’t do early returns in this codebase” comment than leaving a line comment on every single early return in the diff. One reason engineers leave too many comments is that they review code like this: This is a good way to end up with hundreds of comments on a pull request: an endless stream of “I would have done these two operations in a different order”, or “I would have factored this function slightly differently”, and so on. I’m not saying that these minor comments are always bad. Sometimes the order of operations really does matter, or functions really are factored badly. But one of my strongest opinions about software engineering is that there are multiple acceptable approaches to any software problem , and that which one you choose often comes down to taste . As a reviewer, when you come across cases where you would have done it differently, you must be able to approve those cases without comment, so long as either way is acceptable. Otherwise you’re putting your colleagues in an awkward position. They can either accept all your comments to avoid conflict, adding needless time and setting you up as the de facto gatekeeper for all changes to the codebase, or they can push back and argue on each trivial point, which will take even more time. Code review is not the time for you to impose your personal taste on a colleague. So far I’ve only talked about review comments. But the “high-order bit” of a code review is not the content of the comments, but the status of the review: whether it’s an approval, just a set of comments, or a blocking review. The status of the review colors all the comments in the review. Comments in an approval read like “this is great, just some tweaks if you want”. Comments in a blocking review read like “here’s why I don’t want you to merge this in”. If you want to block, leave a blocking review. Many engineers seem to think it’s rude to leave a blocking review even if they see big problems, so they instead just leave comments describing the problems. Don’t do this. It creates a culture where nobody is sure whether it’s okay to merge their change or not. An approval should mean “I’m happy for you to merge, even if you ignore my comments”. Just leaving comments should mean “I’m happy for you to merge if someone else approves, even if you ignore my comments.” If you would be upset if a change were merged, you should leave a blocking review on it. That way the person writing the change knows for sure whether they can merge or not, and they don’t have to go and chase up everyone who’s left a comment to get their informal approval. I should start with a caveat: this depends a lot on what kind of codebase we’re talking about. For instance, I think it’s fine if PRs against something like SQLite get mostly blocking reviews. But a standard SaaS codebase, where teams are actively developing new features, ought to have mostly approvals. I go into a lot more detail about the distinction between these two types of codebase in Pure and Impure Engineering . If tons of PRs are being blocked, it’s usually a sign that there’s too much gatekeeping going on . One dynamic I’ve seen play out a lot is where one team owns a bottleneck for many other teams’ features - for instance, maybe they own the edge network configuration where new public-facing routes must be defined, or the database structure that new features will need to modify. That team is typically more reliability-focused than a typical feature team. Engineers on that team may have a different title, like SRE, or even belong to a different organization. Their incentives are thus misaligned with the feature teams they’re nominally supporting. Suppose the feature team wants to update the public-facing ingress routes in order to ship some important project. But the edge networking team doesn’t care about that project - it doesn’t affect their or their boss’s review cycles. What does affect their reviews is any production problem the change might cause. That means they’re motivated to block any potentially-risky change for as long as possible. This can be very frustrating for the feature team, who is willing to accept some amount of risk for the sake of delivering new features 3 . Of course, there are other reasons why many PRs might be getting blocking reviews. Maybe the company just hired a bunch of incompetent engineers, who ought to be prevented from merging their changes. Maybe the company has had a recent high-profile incident, and all risky changes should be blocked for a couple of weeks until their users forget about it. But in normal circumstances, a high rate of blocked reviews represents a structural problem . For many engineers - including me - it feels good to leave a blocking review, for the same reasons that it feels good to gatekeep in general. It feels like you’re single-handedly protecting the quality of the codebase, or averting some production incident. It’s also a way to indulge a common vice among engineers: flexing your own technical knowledge on some less-competent engineer. Oh, looks like you didn’t know that your code would have caused an N+1 query! Well, I knew about it. Aren’t you lucky I took the time to read through your code? This principle - that you should bias towards approving changes - is important enough that Google’s own guide to code review begins with it, calling it ” the senior principle among all of the code review guidelines” 4 . I’m quite confident that many competent engineers will disagree with most or all of the points in this post. That’s fine! I also believe many obviously true things about code review, but I didn’t include them here. In my experience, it’s a good idea to: This all more or less applies to reviewing code from agentic LLM systems. They are particularly prone to missing code that they ought to be writing, they also get a bit lost if you feed them a hundred comments at once, and they have their own style. The one point that does not apply to LLMs is the “bias towards approving” point. You can and should gatekeep AI-generated PRs as much as you want. I do want to close by saying that there are many different ways to do code review . Here’s a non-exhaustive set of values that a code review practice might be trying to satisfy: making sure multiple people on the team are familiar with every part of the codebase, letting the team discuss the software design of each change, catching subtle bugs that a single person might not see, transmitting knowledge horizontally across the team, increasing perceived ownership of each change, enforcing code style and format rules across the codebase, and satisfying SOC2 “no one person can change the system alone” constraints. I’ve listed these in the order I care about them, but engineers who would order these differently will have a very different approach to code review. Of course there are LLM-based reviewing tools. They’re even pretty useful! But at least right now they’re not as good as human reviewers, because they can’t bring to bear the amount of general context that a competent human engineer can. For readers who aren’t software engineers, “diff” here means the difference between the existing code and the proposed new code, showing what lines are deleted, added, or edited. This is a special instance of a general truth about communication: if you tell someone one thing, they’ll likely remember it; if you tell them twenty things, they will probably forget it all. In the end, these impasses are typically resolved by the feature team complaining to their director or VP, who complains to the edge networking team’s director or VP, who tells them to just unblock the damn change already. But this is a pretty crude way to resolve the incentive mismatch, and it only really works for features that are high-profile enough to receive air cover from a very senior manager. Google’s principle is much more explicit, stating that you should approve a change if it’s even a minor improvement, not when it’s perfect. But I take the underlying message here to be “I know it feels good, but don’t be a nitpicky gatekeeper - approve the damn PR!” Look at a hunk of the diff Ask themselves “how would I write this, if I were writing this code?” Leave a comment with each difference between how they would write it and the actual diff Consider what code isn’t being written in the PR instead of just reviewing the diff Leave a small number of well-thought-out comments, instead of dashing off line comments as you go and ending up with a hundred of them Review with a “will this work” filter, not with a “is this exactly how I would have done it” filter If you don’t want the change to be merged, leave a blocking review Unless there are very serious problems, approve the change Of course there are LLM-based reviewing tools. They’re even pretty useful! But at least right now they’re not as good as human reviewers, because they can’t bring to bear the amount of general context that a competent human engineer can. ↩ For readers who aren’t software engineers, “diff” here means the difference between the existing code and the proposed new code, showing what lines are deleted, added, or edited. ↩ This is a special instance of a general truth about communication: if you tell someone one thing, they’ll likely remember it; if you tell them twenty things, they will probably forget it all. ↩ In the end, these impasses are typically resolved by the feature team complaining to their director or VP, who complains to the edge networking team’s director or VP, who tells them to just unblock the damn change already. But this is a pretty crude way to resolve the incentive mismatch, and it only really works for features that are high-profile enough to receive air cover from a very senior manager. ↩ Google’s principle is much more explicit, stating that you should approve a change if it’s even a minor improvement, not when it’s perfect. But I take the underlying message here to be “I know it feels good, but don’t be a nitpicky gatekeeper - approve the damn PR!” ↩

0 views
Sean Goedecke 2 weeks ago

Should LLMs just treat text content as an image?

Several days ago, DeepSeek released a new OCR paper . OCR, or “optical character recognition”, is the process of converting an image of text - say, a scanned page of a book - into actual text content. Better OCR is obviously relevant to AI because it unlocks more text data to train language models on 1 . But there’s a more subtle reason why really good OCR might have deep implications for AI models. According to the DeepSeek paper, you can pull out 10 text tokens from a single image token with near-100% accuracy. In other words, a model’s internal representation of an image is ten times as efficient as its internal representation of text. Does this mean that models shouldn’t consume text at all? When I paste a few paragraphs into ChatGPT, would it be more efficient to convert that into an image of text before sending it to the model? Can we supply 10x or 20x more data to a model at inference time by supplying it as an image of text instead of text itself? This is called “optical compression”. It reminds me of a funny idea from June of this year to save money on OpenAI transcriptions: before uploading the audio, run it through ffmpeg to speed it up by 2x. The model is smart enough to still pull out the text, and with one simple trick you’ve cut your inference costs and time by half. Optical compression is the same kind of idea: before uploading a big block of text, take a screenshot of it (and optionally downscale the quality) and upload the screenshot instead. Some people are already sort-of doing this with existing multimodal LLMs. There’s a company selling this as a service , an open-source project, and even a benchmark . It seems to work okay! Bear in mind that this is not an intended use case for existing models, so it’s plausible that it could get a lot better if AI labs start actually focusing on it. The DeepSeek paper suggests an interesting way 2 to use tighter optical compression for long-form text contexts. As the context grows, you could decrease the resolution of the oldest images so they’re cheaper to store, but are also literally blurrier. The paper suggests an analogy between this and human memory, where fresh memories are quite vivid but older ones are vaguer and have less detail. Optical compression is pretty unintuitive to many software engineers. Why on earth would an image of text be expressible in fewer tokens than the text itself? In terms of raw information density, an image obviously contains more information than its equivalent text. You can test this for yourself by creating a text file, screenshotting the page, and comparing the size of the image with the size of the text file: the image is about 200x larger. Intuitively, the word “dog” only contains a single word’s worth of information, while an image of the word “dog” contains information about the font, the background and text color, kerning, margins, and so on. How, then, could it be possible that a single image token can contain ten tokens worth of text? The first explanation is that text tokens are discrete while image tokens are continuous . Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens. Another way of looking at the same intuition is that text tokens are a really inefficient way of expressing information . This is often obscured by the fact that text tokens are a reasonably efficient way of sharing information, so long as the sender and receiver both know the list of all possible tokens. When you send a LLM a stream of tokens and it outputs the next one, you’re not passing around slices of a thousand numbers for each token - you’re passing a single integer that represents the token ID. But inside the model this is expanded into a much more inefficient representation (inefficient because it encodes some amount of information about the meaning and use of the token) 3 . So it’s not that surprising that you could do better than text tokens. Zooming out a bit, it’s plausible to me that processing text as images is closer to how the human brain works . To state the obvious, humans don’t consume text as textual content; we consume it as image content (or sometimes as audio). Maybe treating text as a sub-category of image content could unlock ways of processing text that are unavailable when you’re just consuming text content. As a toy example, emoji like :) are easily-understandable as image content but require you to “already know the trick” as text content 4 . Of course, AI research is full of ideas that sounds promising but just don’t work that well. It sounds like you should be able to do this trick on current multimodal LLMs - particularly since many people just use them for OCR purposes anyway - but it hasn’t worked well enough to become common practice. Could you train a new large language model on text represented as image content? It might be tricky. Training on text tokens is easy - you can simply take a string of text and ask the model to predict the next token. How do you train on an image of text? You could break up the image into word chunks and ask the model to generate an image of the next word. But that seems to me like it’d be really slow, and tricky to check if the model was correct or not (e.g. how do you quickly break a file into per-word chunks, how do you match the next word in the image, etc). Alternatively, you could ask the model to output the next word as a token. But then you probably have to train the model on enough tokens so it knows how to manipulate text tokens. At some point you’re just training a normal LLM with no special “text as image” superpowers. AI labs are desperate for high-quality text, but only around 30% of written books have been digitized. It’s really hard to find recent data on this, but as a very rough estimate Google Books had ~40M books in 2023, but Google estimates there to have been ~130M books in 2010. That comes out to 30%. See Figure 13. Not to skip too far ahead, but this is one reason to think that representing a block of text tokens in a single image might not be such a great idea. Of course current LLMs can interpret these emojis. Less-toy examples: image-based LLMs might have a better feel for paragraph breaks and headings, might be better able to take a big picture view of a single page of text, and might find it easier to “skip through” large documents by skimming the start of each paragraph. Or they might not! We won’t know until somebody tries. AI labs are desperate for high-quality text, but only around 30% of written books have been digitized. It’s really hard to find recent data on this, but as a very rough estimate Google Books had ~40M books in 2023, but Google estimates there to have been ~130M books in 2010. That comes out to 30%. ↩ See Figure 13. ↩ Not to skip too far ahead, but this is one reason to think that representing a block of text tokens in a single image might not be such a great idea. ↩ Of course current LLMs can interpret these emojis. Less-toy examples: image-based LLMs might have a better feel for paragraph breaks and headings, might be better able to take a big picture view of a single page of text, and might find it easier to “skip through” large documents by skimming the start of each paragraph. Or they might not! We won’t know until somebody tries. ↩

0 views
Sean Goedecke 3 weeks ago

We are in the "gentleman scientist" era of AI research

Many scientific discoveries used to be made by amateurs. William Herschel , who discovered Uranus, was a composer and an organist. Antoine Lavoisier , who laid the foundation for modern chemistry, was a politician. In one sense, this is a truism. The job of “professional scientist” only really appeared in the 19th century, so all discoveries before then logically had to have come from amateurs, since only amateur scientists existed. But it also reflects that any field of knowledge gets more complicated over time . In the early days of a scientific field, discoveries are simple: “air has weight”, “white light can be dispersed through a prism into different colors”, “the mass of a burnt object is identical to its original mass”, and so on. The way you come up with those discoveries is also simple: observing mercury in a tall glass tube, holding a prism up to a light source, weighing a sealed jar before and after incinerating it, and so on. The 2025 Nobel prize in physics was just awarded “for the discovery of macroscopic quantum mechanical tunnelling and energy quantisation in an electric circuit”. The press release gallantly tries to make this discovery understandable to the layman, but it’s clearly much more complicated than the examples I listed above. Even understanding the terms involved would take years of serious study. If you wanted to win the 2026 Nobel prize in physics, you have to be a physicist : not a musician who dabbles in physics, or a politician who has a physics hobby in your spare time. You have to be fully immersed in the world of physics 1 . AI research is not like this. We are very much in the “early days of science” category. At this point, a critical reader might have two questions. How can I say that when many AI papers look like this ? 2 Alternatively, how can I say that when the field of AI research has been around for decades, and is actively pursued by many serious professional scientists? First, because AI research discoveries are often simpler than they look . This dynamic is familiar to any software engineer who’s sat down and tried to read a paper or two: the fearsome-looking mathematics often contains an idea that would be trivial to express in five lines of code. It’s written this way because (a) researchers are more comfortable with mathematics, and so genuinely don’t find it intimidating, and (b) mathematics is the lingua franca of academic research, because researchers like to write to far-future readers for whom Python syntax may be as unfamiliar as COBOL is to us. Take group-relative policy optimization, or GRPO, introduced in a 2024 DeepSeek paper . This has been hugely influential for reinforcement learning (which in turn has been the driver behind much LLM capability improvement in the last year). Let me try and explain the general idea. When you’re training a model with reinforcement learning, you might naively reward success and punish failure (e.g. how close the model gets to the right answer in a math problem). The problem is that this signal breaks down on hard problems. You don’t know if the model is “doing well” without knowing how hard the math problem is, which is itself a difficult qualitative assessment. The previous state-of-the art was to train a “critic model” that makes this “is the model doing well” assessment for you. Of course, this brings a whole new set of problems: the critic model is hard to train and verify, costs much more compute to run inside the training loop, and so on. Enter GRPO. Instead of a critic model, you gauge how well the model is doing by letting it try the problem multiple times and computing how well it does on average . Then you reinforce the model attempts that were above average and punish the ones that were below average. This gives you good signal even on very hard prompts, and is much faster than using a critic model. The mathematics in the paper looks pretty fearsome, but the idea itself is surprisingly simple. You don’t need to be a professional AI researcher to have had it. In fact, GRPO is not necessarily that new of an idea. There is discussion of normalizing the “baseline” for RL as early as 1992 (section 8.3), and the idea of using the model’s own outputs to set that baseline was successfully demonstrated in 2016 . So what was really discovered in 2024? I don’t think it was just the idea of “averaging model outputs to determine a RL baseline”. I think it was that that idea works great on LLMs as well . As far as I can tell, this is a consistent pattern in AI research. Many of the big ideas are not brand new or even particularly complicated. They’re usually older ideas or simple tricks, applied to large language models for the first time. Why would that be the case? If deep learning wasn’t a good subject for the amateur scientist ten years ago, why would the advent of LLMs change that? Suppose someone discovered that a rubber-band-powered car - like the ones at science fair competitions - could output as much power as a real combustion engine, so long as you soaked the rubber bands in maple syrup beforehand. This would unsurprisingly produce a revolution in automotive (and many other) engineering fields. But I think it would also “reset” scientific progress back to something like the “gentleman scientist” days, where you could productively do it as a hobby. Of course, there’d be no shortage of real scientists doing real experiments on the new phenomenon. However, there’d also be about a million easy questions to answer. Does it work with all kinds of maple syrup? What if you soak it for longer? What if you mixed in some maple-syrup-like substances? You wouldn’t have to be a real scientist in a real lab to try your hand at some of those questions. After a decade or so, I’d expect those easy questions to have been answered, and for rubber-band engine research to look more like traditional science. But that still leaves a long window for the hobbyist or dilettante scientist to ply their trade. The success of LLMs is like the rubber-band engine. A simple idea that anyone can try 3 - train a large transformer model on a ton of human-written text - produces a surprising and transformative technology. As a consequence, many easy questions have become interesting and accessible subjects of scientific inquiry, alongside the normal hard and complex questions that professional researchers typically tackle. I was inspired to write this by two recent pieces of research: Anthropic’s “skills” product and the Recursive Language Models paper . Both of these present new and useful ideas, but they’re also so simple as to be almost a joke. “Skills” are just markdown files and scripts on-disk that explain to the agent how to perform a task. Recursive language models are just agents with direct code access to the entire prompt via a Python REPL. There, now you can go and implement your own skills or RLM inference code. I don’t want to undersell these ideas. It is a genuinely useful piece of research for Anthropic to say “hey, you don’t really need actual tools if the LLM has shell access, because it can just call whatever scripts you’ve defined for it on disk”. Giving the LLM direct access to its entire prompt via code is also (as far as I can tell) a novel idea, and one with a lot of potential. We need more research like this! Strong LLMs are so new, and are changing so fast, that their capabilities are genuinely unknown 4 . For instance, at the start of this year, it was unclear whether LLMs could be “real agents” (i.e. whether running with tools in a loop would be useful for more than just toy applications). Now, with Codex and Claude Code, I think it’s pretty clear that they can. Many of the things we learn about AI capabilities - like o3’s ability to geolocate photos - come from informal user experimentation. In other words, they come from the AI research equivalent of 17th century “gentleman science”. Incidentally, my own field - analytic philosophy - is very much the same way. Two hundred years ago, you could publish a paper with your thoughts on “what makes a good act good”. Today, in order to publish on the same topic, you have to deeply engage with those two hundred years of scholarship, putting the conversation out of reach of all but professional philosophers. It is unclear to me whether that is a good thing or not. Randomly chosen from recent AI papers on arXiv . I’m sure you could find a more aggressively-technical paper with a bit more effort, but it suffices for my point. Okay, not anyone can train a 400B param model. But if you’re willing to spend a few hundred dollars - far less than Lavoisier spent on his research - you can train a pretty capable language model on your own. In particular, I’d love to see more informal research on making LLMs better at coming up with new ideas. Gwern wrote about this in LLM Daydreaming , and I tried my hand at it in Why can’t language models come up with new ideas? . Incidentally, my own field - analytic philosophy - is very much the same way. Two hundred years ago, you could publish a paper with your thoughts on “what makes a good act good”. Today, in order to publish on the same topic, you have to deeply engage with those two hundred years of scholarship, putting the conversation out of reach of all but professional philosophers. It is unclear to me whether that is a good thing or not. ↩ Randomly chosen from recent AI papers on arXiv . I’m sure you could find a more aggressively-technical paper with a bit more effort, but it suffices for my point. ↩ Okay, not anyone can train a 400B param model. But if you’re willing to spend a few hundred dollars - far less than Lavoisier spent on his research - you can train a pretty capable language model on your own. ↩ In particular, I’d love to see more informal research on making LLMs better at coming up with new ideas. Gwern wrote about this in LLM Daydreaming , and I tried my hand at it in Why can’t language models come up with new ideas? . ↩

0 views
Sean Goedecke 3 weeks ago

How I provide technical clarity to non-technical leaders

My mission as a staff engineer is to provide technical clarity to the organization. Of course, I do other stuff too. I run projects, I ship code, I review PRs, and so on. But the most important thing I do - what I’m for - is to provide technical clarity. In an organization, technical clarity is when non-technical decision makers have a good-enough practical understanding of what changes they can make to their software systems. The people in charge of your software organization 1 have to make a lot of decisions about software. Even if they’re not setting the overall strategy, they’re still probably deciding which kinds of users get which features, which updates are most important to roll out, whether projects should be delayed or rushed, and so on. These people may have been technical once. They may even have fine technical minds now. But they’re still “non-technical” in the sense I mean, because they simply don’t have the time or the context to build an accurate mental model of the system. Instead, they rely on a vague mental model, supplemented by advice from engineers they trust. To the extent that their vague mental model is accurate and the advice they get is good - in other words, to the extent that they have technical clarity - they’ll make sensible decisions. The stakes are therefore very high. Technical clarity in an organization can be the difference between a functional engineering group and a completely dysfunctional one. The default quantity of technical clarity in an organization is very low. In other words, decision-makers at tech companies are often hopelessly confused about the technology in question . This is not a statement about their competence. Software is really complicated , and even the engineers on the relevant team spend much of their time hopelessly confused about the systems they own. In my experience, this is surprising to non-engineers. But it’s true! For large established codebases, it’s completely normal for very senior engineers to be unable to definitively answer even very basic questions about how their own system works, like “can a user of type X do operation Y”, or “if we perform operation Z, what will it look like for users of type W?” Engineers often 2 answer these questions with “I’ll have to go and check”. Suppose a VP at a tech company wants to offer an existing paid feature to a subset of free-tier users. Of course, most of the technical questions involved in this project are irrelevant to the VP. But there is a set of technical questions that they will need to know the answers to: Finding out the answer to these questions is a complex technical process. It takes a deep understanding of the entire system, and usually requires you to also carefully re-read the relevant code. You can’t simply try the change out in a developer environment or on a test account, because you’re likely to miss edge cases. Maybe it works for your test account, but it doesn’t work for users who are part of an “organization”, or who are on a trial plan, and so on. Sometimes they can only be answered by actually performing the task. I wrote about why this happens in Wicked features : as software systems grow, they build marginal-but-profitable features that interact with each other in surprising ways, until the system becomes almost - but not quite - impossible to understand. Good software design can tame this complexity, but never eliminate it. Experienced software engineers are thus always suspicious that they’re missing some interaction that will turn into a problem in production. For a VP or product leader, it’s an enormous relief to work with an engineer who can be relied on to help them navigate the complexities of the software system. In my experience, this “technical advisor” role is usually filled by staff engineers, or by senior engineers who are rapidly on the path to a staff role. Senior engineers who are good at providing technical clarity sometimes get promoted to staff without even trying, in order to make them a more useful tool for the non-technical leaders who they’re used to helping. Of course, you can be an impactful engineer without doing the work of providing technical clarity to the organization. Many engineers - even staff engineers - deliver most of their value by shipping projects, identifying tricky bugs, doing good systems design , and so on. But those engineers will rarely be as valued as the ones providing technical clarity. That’s partly because senior leadership at the company will remember who was helping them, and partly because technical clarity is just much higher-leverage than almost any single project. Non-technical leaders need to make decisions, whether they’re clear or not. They are thus highly motivated to maintain a mental list of the engineers who can help them make those decisions, and to position those engineers in the most important teams and projects. From the perspective of non-technical leaders, those engineers are an abstraction around technical complexity . In the same way that engineers use garbage-collected languages so they don’t have to care about memory management, VPs use engineers so they don’t have to care about the details of software. But what does it feel like inside the abstraction ? Internally, engineers do have to worry about all the awkward technical details, even if their non-technical leaders don’t have to. If I say “no problem, we’ll be able to roll back safely”, I’m not as confident as I appear. When I’m giving my opinion on a technical topic, I top out at 95% confidence - there’s always a 5% chance that I missed something important - and am usually lower than that. I’m always at least a little bit worried. Why am I worried if I’m 95% sure I’m right? Because I’m worrying about the things I don’t know to look for. When I’ve been spectacularly wrong in my career, it’s usually not about risks that I anticipated. Instead, it’s about the “unknown unknowns”: risks that I didn’t even contemplate, because my understanding of the overall system was missing a piece. That’s why I say that shipping a project takes your full attention . When I lead technical projects, I spend a lot of time sitting and wondering about what I haven’t thought of yet. In other words, even when I’m quite confident in my understanding of the system, I still have a background level of internal paranoia. To provide technical clarity to the organization, I have to keep that paranoia to myself. There’s a careful balance to be struck between verbalizing all my worries - more on that later - and between being so overconfident that I fail to surface risks that I ought to have mentioned. Like good engineers, good VPs understand that all abstractions are sometimes leaky . They don’t blame their engineers for the occasional technical mistake, so long as those engineers are doing their duty as a useful abstraction the rest of the time 3 . What they won’t tolerate in a technical advisor is the lack of a clear opinion at all . An engineer who answers most questions with “well, I can’t be sure, it’s really hard to say” is useless as an advisor. They may still be able to write code and deliver projects, but they will not increase the amount of technical clarity in the organization. When I’ve written about communicating confidently in the past, some readers think I’m advising engineers to act unethically. They think that careful, technically-sound engineers should communicate the exact truth, in all its detail, and that appearing more confident than you are is a con man’s trick: of course if you pretend to be certain, leadership will think you’re a better engineer than the engineer who honestly says they’re not sure. Once one engineer starts keeping their worries to themself, other engineers have to follow or be sidelined, and pretty soon all the fast-talking blowhards are in positions of influence while the honest engineers are relegated to just working on projects. In other words, when I say “no problem, we’ll be able to roll back”, even though I might have missed something, isn’t that just lying? Shouldn’t I just communicate my level of confidence accurately? For instance, could I instead say “I think we’ll be able to roll back safely, though I can’t be sure, since my understanding of the system isn’t perfect - there could be all kinds of potential bugs”? I don’t think so. Saying that engineers should strive for maximum technical accuracy betrays a misunderstanding of what clarity is . At the top of this article, I said that clarity is when non-technical decision makers have a good enough working understanding of the system. That necessarily means a simplified understanding. When engineers are communicating to non-technical leadership, they must therefore simplify their communication (in other words, allow some degree of inaccuracy in the service of being understood). Most of my worries are not relevant information to non-technical decision makers . When I’m asked “can we deliver this today”, or “is it safe to roll this feature out”, the person asking is looking for a “yes” or “no”. If I also give them a stream of vague technical caveats, they will have to consciously filter that out in order to figure out if I mean “yes” or “no”. Why would they care about any of the details? They know that I’m better positioned to evaluate the technical risk than them - that’s why they’re asking me in the first place! I want to be really clear that I’m not advising engineers to always say “yes” even to bad or unacceptably risky decisions. Sometimes you need to say “we won’t be able to roll back safely, so we’d better be sure about the change”, or “no, we can’t ship the feature to this class of users yet”. My point is that when you’re talking to the company’s decision-makers, you should commit to a recommendation one way or the other , and only give caveats when the potential risk is extreme or the chances are genuinely high. At the end of the day, a VP only has so many mental bits to spare on understanding the technical details. If you’re a senior engineer communicating with a VP, you should make sure you fill those bits with the most important pieces: what’s possible, what’s impossible, and what’s risky. Don’t make them parse those pieces out of a long stream of irrelevant (to them) technical information. The highest-leverage work I do is to provide technical clarity to the organization: communicating up to non-technical decision makers to give them context about the software system. This is hard for two reasons. First, even competent engineers find it difficult to answer simple questions definitively about large codebases. Second, non-technical decision makers cannot absorb the same level of technical nuance as a competent engineer, so communicating to them requires simplification . Effectively simplifying complex technical topics requires three things: In a large tech company, this is usually a director or VP. However, depending on the scope we’re talking about, this could even be a manager or product manager - the same principles apply. Sometimes you know the answer off the top of your head, but usually that’s when you’ve been recently working on the relevant part of the codebase (and even then you may want to go and make sure you’re right). You do still have to be right a lot. I wrote about this in Good engineers are right, a lot . Despite this being very important, I don’t have a lot to say about it. You just have to feel it out based on your relationship with the decision-maker in question. Can the paid feature be safely delivered to free users in its current state? Can the feature be rolled out gradually? If something goes wrong, can the feature be reverted without breaking user accounts? Can a subset of users be granted early access for testing (and other) purposes? Can paid users be prioritized in case of capacity problems? Good taste - knowing which risks or context to mention and which to omit 4 . A deep technical understanding of the system. In order to communicate effectively, I need to also be shipping code and delivering projects. If I lose direct contact with the codebase, I will eventually lose my ability to communicate about it (as the codebase changes and my memory of the concrete details fades). The confidence to present a simplified picture to upper management. Many engineers either feel that it’s dishonest, or lack the courage to commit to claims where they’re only 80% or 90% confident. In my view, these engineers are abdicating their responsibility to help the organization make good technical decisions. I write about this a lot more in Engineers who won’t commit . In a large tech company, this is usually a director or VP. However, depending on the scope we’re talking about, this could even be a manager or product manager - the same principles apply. ↩ Sometimes you know the answer off the top of your head, but usually that’s when you’ve been recently working on the relevant part of the codebase (and even then you may want to go and make sure you’re right). ↩ You do still have to be right a lot. I wrote about this in Good engineers are right, a lot . ↩ Despite this being very important, I don’t have a lot to say about it. You just have to feel it out based on your relationship with the decision-maker in question. ↩

0 views
Sean Goedecke 1 months ago

GPT-5-Codex is a better AI researcher than me

In What’s the strongest AI model you can train on a laptop in five minutes? I tried my hand at answering a silly AI-research question. You can probably guess what it was. I chatted with GPT-5 to help me get started with the Python scripts and to bounce ideas off, but it was still me doing the research. I was coming up with the ideas, running the experiments, and deciding what to do next based on the data. The best model I could train was a 1.8M param transformer which produced output like this: Once upon a time , there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time. Since then, OpenAI has released GPT-5-codex, and supposedly uses it (plus Codex, their CLI coding tool) to automate a lot of their product development and AI research. I wanted to try the same thing. Codex-plus-me did a much better job than me alone 1 . Here’s an example of the best output I got from the model I trained with Codex: Once upon a time , in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it. What was the process like to get there? I want to call it “vibe research”. Like “vibe coding”, it’s performing a difficult technical task by relying on the model. I have a broad intuitive sense of what approaches are being tried, but I definitely don’t have a deep enough understanding to do this research unassisted. A real AI researcher would get a lot more out of the tool. Still, it was very easy to get started. I gave Codex the path to my scratch directory, told it “continue the research”, and it immediately began coming up with ideas and running experiments on its own. In a way, the “train in five minutes” challenge is a perfect fit, because the feedback loop is so short. The basic loop of doing AI research with Codex (at least as an enthusiastic amateur) looks something like this: After two days I did paste the current research notes into GPT-5-Pro, which helped a bit, but the vast majority of my time was spent in this loop. As we’ll see, the best ideas were ones Codex already came up with. I chewed through a lot of tokens doing this. That’s OK with me, since I paid for the $200-per-month plan 2 , but if you don’t want to do that you’ll have to space out your research a bit more slowly. I restarted my Codex process every million tokens or so. It didn’t have any issue continuing where it left off from its previous notes, which was nice. I ran Codex with . By default it didn’t have access to MPS, which meant it could only train models on the CPU. There’s probably some more principled way of sandboxing it, but I didn’t bother to figure it out. I didn’t run into any runaway-agent problems, unless you count crashing my laptop a few times by using up too much memory. Here’s a brief summary of how the research went over the four or five days I spent poking at it. I stayed with the TinyStories dataset for all of this, partially because I think it’s the best choice and partially because I wanted a 1:1 comparison between Codex and my own efforts. Codex and I started with a series of n-gram models: instead of training a neural network, n-gram models just store the conditional probabilities of a token based on the n tokens that precede it. These models are very quick to produce (seconds, not minutes) but aren’t very good. The main reason is that even a 5-gram model cannot include context from more than five tokens ago, so they struggle to produce coherent text across an entire sentence. Here’s an example: Once upon a time , in a small school . ” they are friends . they saw a big pond . he pulled and pulled , but the table was still no attention to grow even more . she quickly ran to the house . she says , ” sara said . ” you made him ! ” the smooth more it said , for helping me decorate the cake . It’s not terrible ! There are short segments that are entirely coherent. But it’s kind of like what AI skeptics think LLMs are like: just fragments of the original source, remixed without any unifying through-line. The perplexity is 18.5, worse than basically any of the transformers I trained in my last attempt. Codex trained 19 different n-gram models, of which the above example (a 4-gram model) was the best 3 . In my view, this is one of the strengths of LLM-based AI research: it is trivial to tell the model “go and sweep a bunch of different values for the hyperparameters” . Of course, you can do this yourself. But it’s a lot easier to just tell the model to do it. After this, Codex spent a lot of time working on transformers. It trained ~50 normal transformers with different sizes, number of heads, layers, and so on. Most of this wasn’t particularly fruitful. I was surprised that my hand-picked hyperparameters from my previous attempt were quite competitive - though maybe it shouldn’t have been a shock, since they matched the lower end of the Chinchilla scaling laws. Still, eventually Codex hit on a 8.53 perplexity model (3 layers, 4 heads, and a dimension of 144), which was a strict improvement over my last attempt. I’m not really convinced this was an architectural improvement. One lesson from training fifty different models is that there’s quite a lot of variance between different seeds. A perplexity improvement of just over 1 is more or less what I was seeing on a “lucky seed”. This was an interesting approach for the challenge: going for pure volume and hoping for a lucky training run. You can’t do this with a larger model, since it takes so long to train 4 , but the five-minute limit makes it possible. The next thing Codex tried - based on some feedback I pasted in from GPT-5-Pro - was “shallow fusion” : instead of training a new model, updating the generation logic to blend the transformer-predicted tokens with a n-gram model, a “kNN head” (which looks up hidden states that are “nearby” the current hidden state of the transformer and predicts their tokens), and a “cache head” that makes the model more likely to repeat words that are already in the context. This immediately dropped perplexity down to 7.38 : a whole other point lower than our best transformer. I was excited about that, but the generated content was really bad: Once upon a time,, in a small house, there lived a boy named Tim. Tim loved to play outside with his ball. One Mr. Skip had a lot of fun. He ran everywhere every day. One One day, Tim was playing with his ball new ball near his house. Tim was playing with his his ball and had a lot of fun. But then, he saw a big tree and decided to climb it. Tim tried to climb the tree, but he was too big. He was too small to reach the top of the tree. But the tree was too high. The little tree was too high for him. Soon, Tim was near the tree. He was brave and climbed the tree. But when he got got to the top, he was sad. Tim saw a bird on What happened? I over-optimized for perplexity. As it turns out, the pure transformers that were higher-perplexity were better at writing stories. They had more coherence over the entire length of the story, they avoided generating weird repetition artifacts (like ”,,”), and they weren’t as mindlessly repetitive. I went down a bit of a rabbithole trying to think of how to score my models without just relying on perplexity. I came up with some candidate rubrics, like grammatical coherence, patterns of repetition, and so on, before giving up and just using LLM-as-a-judge. To my shame, I even generated a new API key for the LLM before realizing that I was talking to a strong LLM already via Codex, and I could just ask Codex to rate the model outputs directly. The final and most successful idea I tried was distilling a transformer from a n-gram teacher model . First, we train a n-gram model, which only takes ~10 seconds. Then we train a transformer - but for the first 200 training steps, we push the transformer towards predicting the tokens that the n-gram model would predict. After that, the transformer continues to train on the TinyStories data as usual. Here’s an example of some output: Once upon a time, in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it. I think this is pretty good! It has characters that continue throughout the story. It has a throughline - Ben’s lost toy - though it confuses “toy” and “friend” a bit. It’s a coherent story, with a setup, problem, solution and moral. This is much better than anything else I’ve been able to train in five minutes. Why is it better? I think the right intuition here is that transformers need to spend a lot of initial compute (say, two minutes) learning how to construct grammatically-correct English sentences. If you begin the training by spending ten seconds training a n-gram model that can already produce sort-of-correct grammar, you can speedrun your way to learning grammar and spend an extra one minute and fifty seconds learning content. I really like this approach. It’s exactly what I was looking for from the start: a cool architectural trick that genuinely helps, but only really makes sense for this weird challenge 5 . I don’t have any illusions about this making me a real AI researcher, any more than a “vibe coder” is a software engineer. Still, I’m surprised that it actually worked. And it was a lot of fun! I’ve pushed up the code here if you want to pick up from where I left off, but you may be better off just starting from scratch with Codex or your preferred coding agent. edit: this post got some comments on Hacker News . The tone is much more negative than on my previous attempt, which is interesting - maybe the title gave people the mistaken impression that I think I’m a strong AI researcher! “Alone” here is relative - I did use ChatGPT and a bit of Copilot to generate some of the training code in my last attempt. I just didn’t use any agentic tooling. My deal with myself was that if I ever have a month where I use fewer than 2M tokens, I’ll cancel the plan. There are a lot of clever tricks involved here: Kneser-Ney smoothing, interpolating unigram/bigram/trigram probabilities on a specific schedule, deliberately keeping the sentinel token, etc. I didn’t spend the time understanding all of these things deeply - that’s vibe research for you - so I won’t write too much about it. Unless you’re a big AI lab. I am 100% convinced that the large labs are spending a lot of compute just re-training on different seeds in the hope of getting a lucky run. I was suspicious that I just got a lucky seed, but I compared ~40 generations with and without the distillation and the distilled model really was better at producing correct-looking stories. Codex makes a change to the training script and does three or four runs (this takes ~20 minutes overall) Based on the results, Codex suggests two or three things that you could try next I pick one of them (or very occasionally suggest my own idea) and return to (1). “Alone” here is relative - I did use ChatGPT and a bit of Copilot to generate some of the training code in my last attempt. I just didn’t use any agentic tooling. ↩ My deal with myself was that if I ever have a month where I use fewer than 2M tokens, I’ll cancel the plan. ↩ There are a lot of clever tricks involved here: Kneser-Ney smoothing, interpolating unigram/bigram/trigram probabilities on a specific schedule, deliberately keeping the sentinel token, etc. I didn’t spend the time understanding all of these things deeply - that’s vibe research for you - so I won’t write too much about it. ↩ Unless you’re a big AI lab. I am 100% convinced that the large labs are spending a lot of compute just re-training on different seeds in the hope of getting a lucky run. ↩ I was suspicious that I just got a lucky seed, but I compared ~40 generations with and without the distillation and the distilled model really was better at producing correct-looking stories. ↩

0 views
Sean Goedecke 1 months ago

How I influence tech company politics as a staff software engineer

Many software engineers are fatalistic about company politics. They believe that it’s pointless to get involved, because 1 : The general idea here is that software engineers are simply not equipped to play the game at the same level as real political operators . This is true! It would be a terrible mistake for a software engineer to think that you ought to start scheming and plotting like you’re in Game of Thrones . Your schemes will be immediately uncovered and repurposed to your disadvantage and other people’s gain. Scheming takes practice and power, and neither of those things are available to software engineers. It is simply a fact that software engineers are tools in the political game being played at large companies, not players in their own right. However, there are many ways to get involved in politics without scheming. The easiest way is to actively work to make a high-profile project successful . This is more or less what you ought to be doing anyway, just as part of your ordinary job. If your company is heavily investing in some new project - these days, likely an AI project - using your engineering skill to make it successful 2 is a politically advantageous move for whatever VP or executive is spearheading that project. In return, you’ll get the rewards that executives can give at tech companies: bonuses, help with promotions, and positions on future high-profile projects. I wrote about this almost a year ago in Ratchet effects determine engineer reputation at large companies . A slightly harder way (but one that gives you more control) is to make your pet idea available for an existing political campaign . Suppose you’ve wanted for a while to pull out some existing functionality into its own service. There are two ways to make that happen. The hard way is to expend your own political capital: drum up support, let your manager know how important it is to you, and slowly wear doubters down until you can get the project formally approved. The easy way is to allow some executive to spend their (much greater) political capital on your project . You wait until there’s a company-wide mandate for some goal that aligns with your project (say, a push for reliability, which often happens in the wake of a high-profile incident). Then you suggest to your manager that your project might be a good fit for this. If you’ve gauged it correctly, your org will get behind your project. Not only that, but it’ll increase your political capital instead of you having to spend it. Organizational interest comes in waves. When it’s reliability time, VPs are desperate to be doing something . They want to come up with plausible-sounding reliability projects that they can fund, because they need to go to their bosses and point at what they’re doing for reliability, but they don’t have the skillset to do it on their own. They’re typically happy to fund anything that the engineering team suggests. On the other hand, when the organization’s attention is focused somewhere else - say, on a big new product ship - the last thing they want is for engineers to spend their time on an internal reliability-focused refactor that’s invisible to customers. So if you want to get something technical done in a tech company, you ought to wait for the appropriate wave . It’s a good idea to prepare multiple technical programs of work, all along different lines. Strong engineers will do some of this kind of thing as an automatic process, simply by noticing things in the normal line of work. For instance, you might have rough plans: When executives are concerned about billing, you can offer the billing refactor as a reliability improvement. When they’re concerned about developer experience, you can suggest replacing the build pipeline. When customers are complaining about performance, you can point to the Golang rewrite as a good option. When the CEO checks the state of the public documentation and is embarrassed, you can make the case for rebuilding it as a static site. The important thing is to have a detailed, effective program of work ready to go for whatever the flavor of the month is. Some program of work will be funded whether you do this or not. However, if you don’t do this, you have no control over what that program is. In my experience, this is where companies make their worst technical decisions : when the political need to do something collides with a lack of any good ideas. When there are no good ideas, a bad idea will do, in a pinch. But nobody prefers this outcome. It’s bad for the executives, who then have to sell a disappointing technical outcome as if it were a success 4 , and it’s bad for the engineers, who have to spend their time and effort building the wrong idea. If you’re a very senior engineer, the VPs (or whoever) will quietly blame you for this. They’ll be right to! Having the right idea handy at the right time is your responsibility. You can view all this in two different ways. Cynically, you can read this as a suggestion to make yourself a convenient tool for the sociopaths who run your company to use in their endless internecine power struggles. Optimistically, you can read this as a suggestion to let executives set the overall priorities for the company - that’s their job, after all - and to tailor your own technical plans to fit 3 . Either way, you’ll achieve more of your technical goals if you push the right plan at the right time. edit: this post got some attention on Hacker News . The comments were much more positive than on my other posts about politics, for reasons I don’t quite understand. This comment is an excellent statement of what I write about here (but targeted at more junior engineers). This comment (echoed here ) references a Milton Friedman quote that applies the idea in this post to political policy in general, which I’d never thought of but sounds correct: Only a crisis—actual or perceived—produces real change. When that crisis occurs, the actions that are taken depend on the ideas that are lying around. That, I believe, is our basic function: to develop alternatives to existing policies, to keep them alive and available until the politically impossible becomes politically inevitable. There’s a few comments calling this approach overly game-playing and self-serving. I think this depends on the goal you’re aiming at. The ones I referenced above seem pretty beneficial to me! Finally, this comment is a good summary of what I was trying to say: Instead of waiting to be told what to do and being cynical about bad ideas coming up when there’s a vacumn and not doing what he wants to do, the author keeps a back log of good and important ideas that he waits to bring up for when someone important says something is priority. He gets what he wants done, compromising on timing. I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . For more along these lines, see Is it cynical to do what your manager wants? Just because they can do this doesn’t mean they want to. Technical decisions are often made for completely selfish reasons that cannot be influenced by a well-meaning engineer Powerful stakeholders are typically so stupid and dysfunctional that it’s effectively impossible for you to identify their needs and deliver solutions to them The political game being played depends on private information that software engineers do not have, so any attempt to get involved will result in just blundering around Managers and executives spend most of their time playing politics, while engineers spend most of their time doing engineering, so engineers are at a serious political disadvantage before they even start to migrate the billing code to stored-data-updated-by-webhooks instead of cached API calls to rip out the ancient hand-rolled build pipeline and replace it with Vite to rewrite a crufty high-volume Python service in Golang to replace the slow CMS frontend that backs your public documentation with a fast static site I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. ↩ What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . ↩ For more along these lines, see Is it cynical to do what your manager wants? ↩ Just because they can do this doesn’t mean they want to. ↩

0 views
Sean Goedecke 1 months ago

What is "good taste" in software engineering?

Technical taste is different from technical skill. You can be technically strong but have bad taste, or technically weak with good taste. Like taste in general, technical taste sometimes runs ahead of your ability: just like you can tell good food from bad without being able to cook, you can know what kind of software you like before you’ve got the ability to build it. You can develop technical ability by study and repetition, but good taste is developed in a more mysterious way. Here are some indicators of software taste: I think taste is the ability to adopt the set of engineering values that fit your current project . Aren’t the indicators above just a part of skill? For instance, doesn’t code look good if it’s good code ? I don’t think so. Let’s take an example. Personally, I feel like code that uses map and filter looks nicer than using a for loop. It’s tempting to think that this is a case of me being straightforwardly correct about a point of engineering. For instance, map and filter typically involve pure functions, which are easier to reason about, and they avoid an entire class of off-by-one iterator bugs. It feels to me like this isn’t a matter of taste, but a case where I’m right and other engineers are wrong. But of course it’s more complicated than that. Languages like Golang don’t contain map and filter at all, for principled reasons. Iterating with a for loop is easier to reason about from a performance perspective, and is more straightforward to extend to other iteration strategies (like taking two items at a time). I don’t care about these reasons as much as I care about the reasons in favour of map and filter - that’s why I don’t write a lot of for loops - but it would be far too arrogant for me to say that engineers who prefer for loops are simply less skilled. In many cases, they have technical capabilites that I don’t have. They just care about different things. In other words, our disagreement comes down to a difference in values . I wrote about this point in I don’t know how to build software and you don’t either . Even if the big technical debates do have definite answers, no working software engineer is ever in a position to know what those answers are, because you can only fit so much experience into one career. We are all at least partly relying on our own personal experience: on our particular set of engineering values. Almost every decision in software engineering is a tradeoff. You’re rarely picking between two options where one is strictly better. Instead, each option has its own benefits and downsides. Often you have to make hard tradeoffs between engineering values : past a certain point, you cannot easily increase performance without harming readability, for instance 1 . Really understanding this point is (in my view) the biggest indicator of maturity in software engineering. Immature engineers are rigid about their decisions. They think it’s always better to do X or Y. Mature engineers are usually willing to consider both sides of a decision, because they know that both sides come with different benefits. The trick is not deciding if technology X is better than Y, but whether the benefits of X outweigh Y in this particular case . In other words, immature engineers are too inflexible about their taste . They know what they like, but they mistake that liking for a principled engineering position. What defines a particular engineer’s taste? In my view, your engineering taste is composed of the set of engineering values you find most important . For instance: Resiliency . If an infrastructure component fails (a service dies, a network connection becomes unavailable), does the system remain functional? Can it recover without human intervention? Speed . How fast is the software, compared to the theoretical limit? Is work being done in the hot path that isn’t strictly necessary? Readability . Is the software easy to take in at a glance and to onboard new engineers to? Are functions relatively short and named well? Is the system well-documented? Correctness . Is it possible to represent an invalid state in the system? How locked-down is the system with tests, types, and asserts? Do the tests use techniques like fuzzing? In the extreme case, has the program been proven correct by formal methods like Alloy ? Flexibility . Can the system be trivially extended? How easy is it to make a change? If I need to change something, how many different parts of the program do I need to touch in order to do so? Portability . Is the system tied down to a particular operational environment (say, Microsoft Windows, or AWS)? If the system needs to be redeployed elsewhere, can that happen without a lot of engineering work? Scalability . If traffic goes up 10x, will the system fall over? What about 100x? Does the system have to be over-provisioned or can it scale automatically? What bottlenecks will require engineering intervention? Development speed . If I need to extend the system, how fast can it be done? Can most engineers work on it, or does it require a domain expert? There are many other engineering values: elegance, modern-ness, use of open source, monetary cost of keeping the system running, and so on. All of these are important, but no engineer cares equally about all of these things. Your taste is determined by which of these values you rank highest. For instance, if you value speed and correctness more than development speed, you are likely to prefer Rust over Python. If you value scalability over portability, you are likely to argue for a heavy investment in your host’s (e.g. AWS) particular quirks and tooling. If you value resiliency over speed, you are likely to want to split your traffic between different regions. And so on 2 . It’s possible to break these values down in a more fine-grained way. Two engineers who both deeply care about readability could disagree because one values short functions and the other values short call-stacks. Two engineers who both care about correctness could disagree because one values exhaustive test suites and the other values formal methods. But the principle is the same - there are lots of possible engineering values to care about, and because they are often in tension, each engineer is forced to take some more seriously than others. I’ve said that all of these values are important. Despite that, it’s possible to have bad taste. In the context of software engineering, bad taste means that your preferred values are not a good fit for the project you’re working on . Most of us have worked with engineers like this. They come onto your project evangelizing about something - formal methods, rewriting in Golang, Ruby meta-programming, cross-region deployment, or whatever - because it’s worked well for them in the past. Whether it’s a good fit for your project or not, they’re going to argue for it, because it’s what they like. Before you know it, you’re making sure your internal metrics dashboard has five nines of reliability, at the cost of making it impossible for any junior engineer to understand. In other words, most bad taste comes from inflexibility . I will always distrust engineers who justify decisions by saying “it’s best practice”. No engineering decision is “best practice” in all contexts! You have to make the right decision for the specific problem you’re facing. One interesting consequence of this is that engineers with bad taste are like broken compasses. If you’re in the right spot, a broken compass will still point north. It’s only when you start moving around that the broken compass will steer you wrong. Likewise, many engineers with bad taste can be quite effective in the particular niche where their preferences line up with what the project needs. But when they’re moved between projects or jobs, or when the nature of the project changes, the wheels immediately come off. No job stays the same for long, particularly in these troubled post-2021 times . Good taste is a lot more elusive than technical ability. That’s because, unlike technical ability, good taste is the ability to select the right set of engineering values for the particular technical problem you’re facing . It’s thus much harder to identify if someone has good taste: you can’t test it with toy problems, or by asking about technical facts. You need there to be a real problem, with all of its messy real-world context. You can tell you have good taste if the projects you’re working on succeed. If you’re not meaningfully contributing to the design of a project (maybe you’re just doing ticket-work), you can tell you have good taste if the projects where you agree with the design decisions succeed, and the projects where you disagree are rocky. Importantly, you need a set of different kinds of projects. If it’s just the one project, or the same kind of project over again, you might just be a good fit for that. Even if you go through many different kinds of projects, that’s no guarantee that you have good taste in domains you’re less familiar with 3 . How do you develop good taste? It’s hard to say, but I’d recommend working on a variety of things, paying close attention to which projects (or which parts of the project) are easy and which parts are hard. You should focus on flexibility : try not to acquire strong universal opinions about the right way to write software. What good taste I have I acquired pretty slowly. Still, I don’t see why you couldn’t acquire it fast. I’m sure there are prodigies with taste beyond their experience in programming, just as there are prodigies in other domains. Of course this isn’t always true. There are win-win changes where you can improve several usually-opposing values at the same time. But mostly we’re not in that position. Like I said above, different projects will obviously demand a different set of values. But the engineers working on those projects will still have to draw the line somewhere, and they’ll rely on their own taste to do that. That said, I do think good taste is somewhat transferable. I don’t have much personal experience with this so I’m leaving it in a footnote, but if you’re flexible and attentive to the details in domain A, you’ll probably be flexible and attentive to the details in domain B.

1 views
Sean Goedecke 1 months ago

AI coding agents rely too much on fallbacks

One frustrating pattern I’ve noticed in AI agents - at least in Claude Code, Codex and Copilot - is building automatic fallbacks . Suppose you ask Codex to build a system to automatically group pages in a wiki by topic. (This isn’t hypothetical, I just did this for EndlessWiki ). You’ll probably want to use something like the Louvain method to identify clusters. But if you task an AI agent with building something like that, it usually will go one step further, and build a fallback: a separate, simpler code path if the Louvain method fails (say, grouping page slugs alphabetically). If you’re not careful, you might not even know if the Louvain method is working, or if you’re just seeing the fallback behavior. In my experience, AI agents will do this constantly . If you’re building an app that makes an AI inference request, the generated code will likely fallback to some hard-coded response if the inference request fails. If you’re using an agent to pull structured data from some API, the agent may silently fallback to placeholder data for part of it. If you’re writing some kind of clever spam detector, the agent will want to fall back to a basic keyword check if your clever approach doesn’t work. This is particularly frustrating for the main kind of work that AI agents are useful for: prototyping new ideas. If you’re using AI agents to make real production changes to an existing app, fallbacks are annoying but can be easily stripped out before you submit the pull request. But if you’re using AI agents to test out a new approach, you’re typically not checking the code line-by-line. The usual workflow is to ask the agent to try an approach, then benchmark or fiddle with the result, and so on. If your benchmark or testing doesn’t know whether it’s hitting the real code or some toy fallback, you can’t be confident that you’re actually evaluating your latest idea. I don’t think this behavior is deliberate. My best guess is that it’s a reinforcement learning artifact: code with fallbacks is more likely to succeed, so during training the models are learning to include fallback 1 . If I’m wrong and it’s part of the hidden system prompt (or a deliberate choice), I think it’s a big mistake. When you ask an AI agent to implement a particular algorithm, it should implement that algorithm. In researching this post, I saw this r/cursor thread where people are complaining about this exact problem (and also attributing it to RL). Supposedly you can prompt around it, if you repeat “DO NOT WRITE FALLBACK CODE” several times.

1 views
Sean Goedecke 1 months ago

Endless AI-generated Wikipedia

I built an infinite, AI-generated wiki. You can try it out at endlesswiki.com ! Large language models are like Borges’ infinite library . They contain a huge array of possible texts, waiting to be elicited by the right prompt - including some version of Wikipedia. What if you could explore a model by interacting with it as a wiki? The idea here is to build a version of Wikipedia where all the content is AI-generated. You only have to generate a single page to get started: when a user clicks any link on that page, the page for that link is generated on-the-fly, which will include links of its own. By browsing the wiki, users can dig deeper into the stored knowledge of the language model. This works because wikipedias 1 connect topics very broadly. If you follow enough links, you can get from any topic to any other topic. In fact, people already play a game where they try to race from one page to a totally unrelated page by just following links. It’s fun to try and figure out the most likely chain of conceptual relationships between two completely different things. In a sense, EndlessWiki is a collaborative attempt to mine the depths of a language model. Once a page is generated, all users will be able to search for it or link it to their friends. The basic design is very simple: a MySQL database with a table, and a Golang server. When the server gets a request, it looks up in the database. If it exists, it serves the page directly; if not, it generates the page from a LLM and saves it to the database before serving it. I’m using Kimi K2 for the model. I chose a large model because larger models contain more facts about the world (which is good for a wiki), and Kimi specifically because in my experience Groq is faster and more reliable than other model inference providers. Speed is really important for this kind of application, because the user has to wait for new pages to be generated. Fortunately, Groq is fast enough that the wait time is only a few hundred ms. Unlike AutoDeck , I don’t charge any money or require sign-in for this. That’s because this is more of a toy than a tool, so I’m not worried about one power user costing me a lot of money in inference. You have to be manually clicking links to trigger inference. The most interesting design decision I made was preventing “cheating”. I’m excited to see how obscure the pages can get (for instance, can you get to eventually get to Neon Genesis Evangelion from the root page?) It would defeat the purpose if you could just manually go to in the address bar. To defeat that, I make each link have a query parameter, and then I fetch the origin page server-side to validate that it does indeed contain a link to the page you’re navigating to 2 . Like AutoDeck , EndlessWiki represents another step in my “what if you could interact with LLMs without having to chat” line of thought. I think there’s a lot of potential here for non-toy features. For instance, what if ChatGPT automatically hyperlinked each proper noun in its responses, and clicking on those generated a response focused on that noun? Anyway, check it out! I use the lowercase “w” because I mean all encyclopedia wikis. Wikipedia is just the most popular example. Interestingly, Codex came up with five solutions to prevent cheating, all of which were pretty bad - way more complicated than the solution I ended up with. If I was purely vibe coding, I’d have ended up with some awkward cryptographic approach.

5 views
Sean Goedecke 1 months ago

What I learned building an AI-driven spaced repetition app

I spent the last couple of weeks building an AI-driven spaced repetition app. You can try it out here . Like many software engineering types who were teenagers in the early 2000s 1 , I’ve been interested in this for a long time. The main reason is that, unlike many other learning approaches, spaced repetition works . If you want to learn something, study it now, then study it an hour later, then a day later, then a week later, and so on. You don’t have to spend much time overall, as long as you’re consistent about coming back to it. Eventually you only need to refresh your memory every few years in order to maintain a solid working knowledge of the topic. Spaced repetition learning happens more or less automatically as part of a software engineering job. Specific engineering skills will come up every so often (for instance, using to inspect open network sockets, or the proper regex syntax for backtracking). If they come up often enough, you’ll internalize them. It’s more difficult to use spaced repetition to deliberately learn new things. Even if you’re using a spaced repetition tool like Anki , you have to either write your own deck of flashcards (which requires precisely the kind of expertise you don’t have yet), or search for an existing one that exactly matches the area you’re trying to learn 2 . One way I learn new things is from LLMs. I wrote about this in How I use LLMs to learn new subjects , but the gist is that I ask a ton of follow-up questions about a question I have. The best part about this approach is that it requires zero setup cost: if at any moment I want to learn more about something, I can type a question out and rapidly dig in to something I didn’t already know. What if you could use LLMs to make spaced repetition easier? Specifically, what if you could ask a LLM to give you an infinite feed of spaced repetition flashcards, adjusting the difficulty based on your responses? That’s the idea behind AutoDeck . You give it a topic and it gives you infinite flashcards about that topic. If it’s pitched too easy (e.g. you keep saying “I know”) or too hard, it’ll automatically change the difficulty. The thing I liked most about building AutoDeck is that it’s an AI-driven app where the interface isn’t chat . I think that’s really cool - almost every kind of killer AI app presents a chat interface. To use Claude Code, you chat with an agent. The various data analysis tools are typically in a “chat with your data” mode. To use ChatGPT, you obviously chat with it. That makes sense, since (a) the most unusual thing about LLMs is that you can talk with them, and (b) most AI apps let the user take a huge variety of possible actions, for which the only possible interface is some kind of chat. The problem with chat is that it demands a lot of the user. That’s why most “normal” apps have the user click buttons instead of type out sentences, and that’s why many engineering and design blogs have been writing about how to build AI apps that aren’t chat-based. Still, it’s easier said than done. I think spaced repetition flashcards are a good use-case for AI. Generating them for any topic is something that would be impossible without LLMs, so it’s a compelling idea. But you don’t have to interact with them via text (beyond typing out what topic you want at the outset). How do you use AI to generate an infinite feed of content? I tried a bunch of different approaches here. The two main problems here are speed and consistency . Speed is difficult because AI generation can be pretty slow: counting the time-to-first-token, it’s a few hundred ms, even for quick models. If you’re generating each flashcard with a single request, a user who’s familiar with the subject matter can click through flashcards faster than the AI can generate them. Batching up flashcard generation is quicker (because you only wait for time-to-first-token once) but it forces the user to wait much longer before they see their first card. What if you generate flashcards in parallel? That has two problems of its own. First, you’re still waiting for the time-to-first-token on every request, so throughput is still much slower than the batched approach. Second, it’s very easy to generate duplicate cards that way. Even with a high temperature, if you ask the same model the same question with the same prompt, you’re likely to get similar answers. The parallel-generation flashcard feed was thus pretty repetitive: if you wanted to learn about French history, you’d get “what year was the Bastille stormed” right next to “in what year was the storming of the Bastille”, and so on. The solution I landed on was batching the generation, but saving each card as it comes in . In other words, I asked the model to generate ten cards, but instead of waiting for the entire response to be over before I saved the data, I made each card available to the client as soon as it was generated. This was trickier than it sounds for a few reasons. First, it means you can’t use JSON structured outputs . Structured outputs are great for ensuring you get a response that your code can parse, but you can’t (easily) parse chunks of JSON mid-stream. You have to wait for the entire output before it’s valid JSON, because you need the closing or characters 3 . Instead, I asked the model to respond in XML chunks, which could be easily parsed as they came in. Second, it meant I couldn’t simply have the client request a card and get a card back. The code that generated cards had to be able to run in the background without blocking the client, which forced the client to periodically check for available cards. I built most of AutoDeck with OpenAI’s Codex. It was pretty good! I had to intervene in maybe one change out of three, and I only had to seriously intervene (i.e. completely change the approach) in one change out of ten. Some examples of where I had to intervene: I tried Claude Code at various parts of the process and honestly found it underwhelming. It took longer to make each change and in general required more intervention, which meant I was less comfortable queueing up changes. This is a pretty big win for OpenAI - until very recently, Claude Code has been much better than Codex in my experience. I cannot imagine trying to build even a relatively simple app like this without being a competent software engineer already. Codex saved me a lot of time and effort, but it made a lot of bad decisions that I had to intervene. It wasn’t able to fix every bug I encountered. At this point, I don’t think we’re in the golden age of vibe coding. You still need to know what you’re doing to actually ship an app with one of these tools. One interesting thing about building AI projects is that it kind of forces you to charge money. I’ve released previous apps I’ve built for free, because I wanted people to use them and I’m not trying to do the software entepreneur thing. But an app that uses AI costs me money for each user - not a ton of money, but enough that I’m strongly incentivized to charge a small amount for users who want to use the app more than just kicking the tires. I think this is probably a good thing. Charging money for software is a forcing function for actually making it work. If AI inference was free, I would probably have shipped AutoDeck in a much more half-assed state. Since I’m obliged to charge money for it, I spent more time making sure it was actually useful than I would normally spend on a side project. I had a lot of fun building AutoDeck! It’s still mainly for me, but if you’ve read this far I hope you try it out and see if you like it as well. I’m still trying to figure out the best model. GPT-5 was actually pretty bad at generating spaced repetition cards: the time-to-first-token was really slow, and the super-concise GPT-5 style made the cards read awkwardly. You don’t need the smartest available model for spaced repetition, just a model with a good grasp of a bunch of textbook and textbook-adjacent facts. The surviving ur-text for this is probably Gwern’s 2009 post Spaced Repetition for Efficient Learning . Most existing decks are tailored towards students doing particular classes (e.g. anatomy flashcards for med school), not people just trying to learn something new, so they often assume more knowledge than you might have. I think this is just a lack of maturity in the ecosystem. I would hope that in a year or two you can generate structured XML, JSONL, or other formats that are more easily parseable in chunks. Those formats are just as easy to express as a grammar that the logit sampler can adhere to.

0 views
Sean Goedecke 1 months ago

If you are good at code review, you will be good at using AI agents

Using AI agents correctly is a process of reviewing code . If you’re good at reviewing code, you’ll be good at using tools like Claude Code, Codex, or the Copilot coding agent. Why is that. Large language models are good at producing a lot of code, but they don’t yet have the depth of judgement of a competent software engineer

1 views
Sean Goedecke 1 months ago

AI is good news for Australian and European software engineers

Right now the dominant programing model is something like “centaur chess” , where a skilled human is paired with a computer assistant. Together, they produce more work than either could individually. No individual human can work as fast or as consistently as a LLM, but LLMs lack the depth of judgement that good engineers do 1

0 views
Sean Goedecke 2 months ago

The whole point of OpenAI's Responses API is to help them hide reasoning traces

About six months ago, OpenAI released their Responses API , which replaced their previous /chat/completions API for inference. The old API was very simple: you pass in an array of messages representing a conversation between the model and a user, and get the model’s next response back. The new Responses API is more complicated

0 views
Sean Goedecke 2 months ago

'Make invalid states unrepresentable' considered harmful

One of the most controversial things I believe about good software design is that your code should be more flexible than your domain model . This is in direct opposition to a lot of popular design advice, which is all about binding your code to your domain model as tightly as possible. For instance, a popular principle for good software design is to make invalid states unrepresentable

0 views
Sean Goedecke 2 months ago

An unofficial FAQ for Stripe's new "Tempo" blockchain

Stripe just announced Tempo , a “L1 blockchain” for “stablecoin payments”. What does any of this mean. In 2021, I was interested enough in blockchain to write a simple explainer and a technical description of Bitcoin specifically . But I’ve never been a blockchain fan . Both my old and new “what kind of work I want” posts state that I’m ethically opposed to proof-of-work blockchain

0 views
Sean Goedecke 2 months ago

Seeing like a software company

The big idea of James C. Scott’s Seeing Like A State can be expressed in three points: By “legible”, I mean work that is predictable, well-estimated, has a paper trail, and doesn’t depend on any contingent factors (like the availability of specific people). Quarterly planning, OKRs, and Jira all exist to make work legible

0 views
Sean Goedecke 2 months ago

Do the simplest thing that could possibly work

When designing software systems, do the simplest thing that could possibly work. It’s surprising how far you can take this piece of advice. I genuinely think you can do this all the time . You can follow this approach for fixing bugs, for maintaining existing systems, and for architecting new ones

0 views
Sean Goedecke 2 months ago

Finding the low-hanging fruit

Suppose your job is to pick fruit in a giant orchard. The orchard covers several hills and valleys, and is big enough that you’d need a few weeks to walk all the way around the edge. What should you do first

0 views
Sean Goedecke 2 months ago

Everything I know about good API design

Most of what modern software engineers do 1 involves APIs: public interfaces for communicating with a program, like this one from Twilio. I’ve spent a lot of time working with APIs, both building and using them

0 views