Posts in Programming (20 found)
Martin Fowler Yesterday

Fragments: April 14

I attended the first Pragmatic Summit early this year, and while there host Gergely Orosz interviewed Kent Beck and me on stage. The video runs for about half an hour. I always enjoy nattering with Kent like this, and Gergely pushed into some worthwhile topics. Given the timing, AI dominated the conversation - we compared it to earlier technology shifts, the experience of agile methods, the role of TDD, the danger of unhealthy performance metrics, and how to thrive in an AI-native industry.

❄ ❄ ❄ ❄ ❄

Perl is a language I used a little, but never loved. However the definitive book on it, by its designer Larry Wall, contains a wonderful gem: the three virtues of a programmer are hubris, impatience - and above all - laziness. Bryan Cantrill also loves this virtue:

Of these virtues, I have always found laziness to be the most profound: packed within its tongue-in-cheek self-deprecation is a commentary on not just the need for abstraction, but the aesthetics of it. Laziness drives us to make the system as simple as possible (but no simpler!) — to develop the powerful abstractions that then allow us to do much more, much more easily. Of course, the implicit wink here is that it takes a lot of work to be lazy.

Understanding how to think about a problem domain by building abstractions (models) is my favorite part of programming. I love it because I think it’s what gives me a deeper understanding of a problem domain, and because once I find a good set of abstractions, I get a buzz from the way they make difficulties melt away, allowing me to achieve much more functionality with fewer lines of code. Cantrill worries that AI is so good at writing code that we risk losing that virtue, something that’s reinforced by brogrammers bragging about how they produce thirty-seven thousand lines of code a day:

The problem is that LLMs inherently lack the virtue of laziness. Work costs nothing to an LLM.
LLMs do not feel a need to optimize for their own (or anyone’s) future time, and will happily dump more and more onto a layercake of garbage. Left unchecked, LLMs will make systems larger, not better — appealing to perverse vanity metrics, perhaps, but at the cost of everything that matters. As such, LLMs highlight how essential our human laziness is: our finite time forces us to develop crisp abstractions in part because we don’t want to waste our (human!) time on the consequences of clunky ones. The best engineering is always borne of constraints, and the constraint of our time places limits on the cognitive load of the system that we’re willing to accept. This is what drives us to make the system simpler, despite its essential complexity.

This reflection particularly struck me this Sunday evening. I’d spent a bit of time modifying how my music playlist generator worked. I needed a new capability, spent some time adding it, got frustrated at how long it was taking, and wondered about maybe throwing a coding agent at it. More thought led to realizing that I was doing it in a more complicated way than it needed to be. I was including a facility that I didn’t need, and by applying yagni, I could make the whole thing much easier, doing the task in just a couple of dozen lines of code. If I had used an LLM for this, it may well have done the task much more quickly, but would it have made a similar over-complication? If so, would I just shrug and say LGTM? Would that complication cause me (or the LLM) problems in the future?

❄ ❄ ❄ ❄ ❄

Jessica Kerr (Jessitron) has a simple example of applying the principle of Test-Driven Development to prompting agents. She wants all updates to include updating the documentation.

Instructions – We can change AGENTS.md to instruct our coding agent to look for documentation files and update them.
Verification – We can add a reviewer agent to check each PR for missed documentation updates. This is two changes, so I can break this work into two parts. Which of these should we do first?

Of course my initial comment about TDD answers that question.

❄ ❄ ❄ ❄ ❄

Mark Little prodded an old memory of mine as he wondered about how to work with AIs that are overconfident in their knowledge, and thus prone to make up answers to questions, or to act when they should be more hesitant. He draws inspiration from an old, low-budget, but classic SciFi movie: Dark Star. I saw that movie once in my 20s (i.e. a long time ago), but I still remember the crisis scene where a crew member has to use philosophical argument to prevent a sentient bomb from detonating.

Doolittle: You have no absolute proof that Sergeant Pinback ordered you to detonate.
Bomb #20: I recall distinctly the detonation order. My memory is good on matters like these.
Doolittle: Of course you remember it, but all you remember is merely a series of sensory impulses which you now realize have no real, definite connection with outside reality.
Bomb #20: True. But since this is so, I have no real proof that you’re telling me all this.
Doolittle: That’s all beside the point. I mean, the concept is valid no matter where it originates.
Bomb #20: Hmmmm….
Doolittle: So, if you detonate…
Bomb #20: In nine seconds….
Doolittle: …you could be doing so on the basis of false data.
Bomb #20: I have no proof it was false data.
Doolittle: You have no proof it was correct data!
Bomb #20: I must think on this further.

Doolittle has to expand the bomb’s consciousness, teaching it to doubt its sensors. As Little puts it:

That’s a useful metaphor for where we are with AI today. Most AI systems are optimised for decisiveness. Given an input, produce an output. Given ambiguity, resolve it probabilistically. Given uncertainty, infer.
This works well in bounded domains, but it breaks down in open systems where the cost of a wrong decision is asymmetric or irreversible. In those cases, the correct behaviour is often deferral, or even deliberate inaction. But inaction is not a natural outcome of most AI architectures. It has to be designed in.

In my more human interactions, I’ve always valued doubt, and distrust people who operate under undue certainty. Doubt doesn’t necessarily lead to indecisiveness, but it does suggest that we factor the risk of inaccurate information or faulty reasoning into decisions with profound consequences.

If we want AI systems that can operate safely without constant human oversight, we need to teach them not just how to decide, but when not to. In a world of increasing autonomy, restraint isn’t a limitation, it’s a capability. And in many cases, it may be the most important one we build.


From Painfully Explicit to Implicit in Lean

Note: AI was used to edit this post. As proof of human thought and input, I am also publishing the original draft, which was written fully before asking AI to edit the post with me. This post is aimed at Lean language beginners who are interested in writing proofs in Lean, but still feel lost when reading Lean code. A very simplified mental model of Lean is that at the core there are two systems:
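As a small taste of the explicit-to-implicit spectrum the title refers to (an illustrative sketch, not taken from the post itself): the same trivial proof written once with every argument supplied by hand via `@`, and once letting the elaborator infer the implicit arguments.

```lean
-- Painfully explicit: @ disables implicit-argument inference, so the
-- type and both endpoints of the equality must be supplied by hand.
example (a b : Nat) (h : a = b) : b = a := @Eq.symm Nat a b h

-- Implicit: the elaborator infers Nat, a, and b from the type of h.
example (a b : Nat) (h : a = b) : b = a := h.symm
```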


How I use org-roam

While Org-mode is fantastic in its core functionality, there is a lovely little extension that creates a way to build a wiki for all personal knowledge, ideas, writing, work, and so much more: org-roam. A “clone” of Roam Research; if you are familiar with Logseq or Obsidian, this will have you feeling right at home (albeit, actually at home inside emacs). It has taken some time to figure out how I wanted to use org-roam, but I think I have cracked the code. I will discuss how I’ve been capturing, filing away, and taking action on everything that pops into my head.

As a small overview, org-roam gives you the ability to create notes (big whoop). The power comes in the backlink to any previous note that may be in your system, similar to how Wikipedia links between articles. As I write in any org-roam document (node), I see suggestions of past notes I have taken, giving the option to immediately create a link back to them. This is fine on its own, but you start to see inter-linking between ideas, which becomes massively helpful for research and for creating new connections between pieces of information, connections one would generally be blind to in other methods of note taking. Org-roam uses an SQLite database (which some critique), as well as an ID system in which everything (files, org headers) has a unique ID. This ID is what forms the link between our notes. Let’s discuss how I’m using this.

As with my org-mode flow, the goal is not only to capture, but to reduce the friction of capture to almost nothing. I have capture templates for the following files in my general org-mode file: What I was lacking was a way to integrate with org-roam and create backlinks across the notes I was taking on everything. Enter the new capture system. I use (mapped to ) to hit a daily org-roam file (~/org/roam/daily/2026-04-10.org for example), which is my capture file for everything for the day. I write everything in this file.
I mean everything: I then take 5 minutes at the end of every day and file away these items into org-roam nodes if they are “seeds” (in the digital garden sense), actionable items, or things I want to look into at some point, or just leave them in the daily file to be archived for posterity. Whenever I want to write something on the computer, emacs is the place I do so, in which I have autocomplete, spell checking, and macros right at my fingertips. I hit a keybind that universally reaches out to emacs and opens the org-roam-dailies-capture-today buffer if I am not on workspace 1 (emacs), capture the thought/writing/email/text/content, and move on with my day. What this also allows is the use of my capture system via Termux on my phone. I simply leave my ~/org/roam/daily/date.org file open every morning in Termux running emacsclient on my workstation, and go about my day. This means all notes live in one place, I don’t generally have to go into “note to self” in Signal or XMPP and move things around, and org-roam works out of the box for backlinking and clean up. Is it ideal? No, but it is still better than the various mobile orgmode apps I have tried. I treat the phone just as a capture node; all organizing and refiling happens on my bigger screen at the end of the day.

The major benefit of this methodology is that we have content which is greppable forevermore. If I write, it is written in emacs. Anything more than a sentence or two is in my daily file. I don’t care what it is, I can grep it for all time, version control it, and it is ready to expand upon in the future. By the end of the day, I may have dozens of captures in my daily file. I sit down, open the file up, and review. If an item is actionable or has a date/deadline associated with it, it is filed to inbox.org/calendar.org. If it is an idea that is a seed of something larger, it is filed into its own org-roam node that can then grow on its own.
If something needs to be filed under an existing roam node, that occurs here as well, and backlinks organically take shape as I write. Finally, if the item is none of these things, it just lives in the daily file as an archive that can be revisited later with ripgrep as stated above. I have a project-wide search bound for this, which I use frequently for finding anything. Refiling is simply accomplished by: which will give you files and org headings under which to refile everything.

As we grow our notes database, we will start to see autosuggestions offered via cape and corfu. They look like so: allowing a direct link to previous notes’ IDs, which are portable across the filesystem, so you can move files around to logically work in a hierarchy if you so choose. The standard advice is to keep a flat file system in which all notes are in one directory, but I like organization too much and have created nested directories for this. These links and IDs are handled via a function that can be set to fire automatically on file changes.

Oh, the fabled “neuronal link graph” that was popularised by Obsidian - how could we forget about that? It opens a D3-rendered graph that looks nice, but I have not really found use for it other than pretty screenshots to show how “deep(ly autistic)” I am.

I find this to be the easiest way to maintain a note taking system that actually grows with the author, while staying sane and keeping everything organized. The notes that we create allow us to understand deeply, and to make connections that are otherwise missed. As in my discussion with Prot, writing everything down has greatly impacted my thinking and allowed growth in areas that are deeply meaningful. Org-roam (and holistically org itself) is, once again, just text files. So you can very easily take any .org file and back it up and hold onto it for all time, as you will never have any proprietary lock-in.
The database is just an SQLite database, which is the most portable and easily malleable database in existence. The two interlink to give you peace of mind were you ever to leave emacs (haha, you won’t). If you don’t want the “heaviness” of org-roam’s database structure, you could use Prot’s denote package, which is a more simplified (yet still highly powerful) method. I just like the autosuggestions and speed of roam, but your mileage may vary.

So there you have it, the way that I am using org-roam to create a mind map/second brain and keep notes on everything I come across on a daily basis. How are you using org-roam, or do you have a note taking system you swear by? Post below or send me an email! As always, God bless, and until next time. If you enjoyed this post, consider Supporting my work, Checking out my book, Working with me, or sending me an Email to tell me what you think.

inbox.org: Actionable items with a TODO - these are then filed away to projects or kept in this file until acted upon.
calendar.org: Scheduled or deadlined items
bookmarks.org: web bookmarks
contacts.org: every contact I have and reach out to
notes.org: a notes system, but this is being replaced as we will see

text messages
emails (if not already sent via mu4e)
notes to self
LLM prompts
websites I visit
journal entries
this very post, that will then become a blog post in my writing project
code snippets
things I want to remember

Martin Fowler 6 days ago

Fragments: April 9

I mostly link to written material here, but I’ve recently listened to two excellent podcasts that I can recommend. Anyone who regularly reads these fragments knows that I’m a big fan of Simon Willison; his (also very fragmentary) posts have earned a regular spot in my RSS reader. But the problem with fragments, however valuable, is that they don’t provide a cohesive overview of the situation. So his podcast with Lenny Rachitsky is a welcome survey of the state of the world as seen through a discerning pair of eyeballs. He paints a good picture of how programming has changed for him since the “November inflection point”, important patterns for this work, and his concern about the security bomb nestled inside the beast.

My other great listening was on a regular podcast that I listen to, as Gergely Orosz interviewed Thuan Pham - the former CTO of Uber. As with so many of Gergely’s podcasts, they focused on Thuan Pham’s fascinating career direction, giving listeners an opportunity to learn from a successful professional. There’s also an informative insight into Uber’s use of microservices (they had 5000 of them), and the way high-growth software necessarily gets rewritten a lot (a phenomenon I dubbed Sacrificial Architecture).

❄ ❄ ❄ ❄ ❄

Axios published their post-mortem on their recent supply chain compromise. It’s quite a story: the attackers spent a couple of weeks developing contact with the lead maintainer, leading to a video call where the meeting software indicated something on the maintainer’s system was out of date. That led to the maintainer installing the update, which in fact was a Remote Access Trojan (RAT).

they tailored this process specifically to me by doing the following:

Simon Willison has a summary and further links.

❄ ❄ ❄ ❄ ❄

I recently bumped into Diátaxis, a framework for organizing technical documentation.
I only looked at it briefly, but there’s much to like. In particular I appreciated how it classified four forms of documentation. The distinction between tutorials and how-to guides is interesting:

A tutorial serves the needs of the user who is at study. Its obligation is to provide a successful learning experience. A how-to guide serves the needs of the user who is at work. Its obligation is to help the user accomplish a task.

I also appreciated its point of pulling explanations out into separate areas. The idea is that other forms should contain only minimal explanations, linking to the explanation material for more depth. That way we keep the flow focused on the goal and allow the user to seek deeper explanations in their own way. The study/work distinction between explanation and reference mirrors that same distinction between tutorials and how-to guides.

❄ ❄ ❄ ❄ ❄

For eight years, Lalit Maganti wanted a set of tools for working with SQLite. But it would be hard and tedious work, “getting into the weeds of SQLite source code, a fiendishly difficult codebase to understand”. So he didn’t try it. But after the November inflection point, he decided to tackle this need. His account of this exercise is an excellent description of the benefits and perils of developing with AI agents.

Through most of January, I iterated, acting as semi-technical manager and delegating almost all the design and all the implementation to Claude. Functionally, I ended up in a reasonable place: a parser in C extracted from SQLite sources using a bunch of Python scripts, a formatter built on top, support for both the SQLite language and the PerfettoSQL extensions, all exposed in a web playground.

But when I reviewed the codebase in detail in late January, the downside was obvious: the codebase was complete spaghetti.
I didn’t understand large parts of the Python source extraction pipeline, functions were scattered in random files without a clear shape, and a few files had grown to several thousand lines. It was extremely fragile; it solved the immediate problem but it was never going to cope with my larger vision, never mind integrating it into the Perfetto tools. The saving grace was that it had proved the approach was viable and generated more than 500 tests, many of which I felt I could reuse.

He threw it all away and worked more closely with the AI on the second attempt, with lots of thinking about the design, reviewing all the code, and refactoring with every step.

In the rewrite, refactoring became the core of my workflow. After every large batch of generated code, I’d step back and ask “is this ugly?” Sometimes AI could clean it up. Other times there was a large-scale abstraction that AI couldn’t see but I could; I’d give it the direction and let it execute. If you have taste, the cost of a wrong approach drops dramatically because you can restructure quickly.

He ended up with a working system, and the AI proved its value in allowing him to tackle something that he’d been leaving on the todo pile for years. But even with the rewrite, the AI had its potholes. His conclusion on the relative value of AI in different scenarios:

When I was working on something I already understood deeply, AI was excellent…. When I was working on something I could describe but didn’t yet know, AI was good but required more care…. When I was working on something where I didn’t even know what I wanted, AI was somewhere between unhelpful and harmful…

At the heart of this is that AI works at its best when there is an objectively checkable answer. If we want an implementation that can pass some tests, then AI does a good job.
But when it came to the public API:

I spent several days in early March doing nothing but API refactoring, manually fixing things any experienced engineer would have instinctively avoided but AI made a total mess of. There’s no test or objective metric for “is this API pleasant to use” or “will this API help users solve the problems they have”, and that’s exactly why the coding agents did so badly at it.

❄ ❄ ❄ ❄ ❄

I became familiar with Ryan Avent’s writing when he wrote the Free Exchange column for The Economist. His recent post talks about how James Talarico and Zohran Mamdani have made their religion an important part of their electoral appeal, and their faith is centered on caring for others. He explains that a focus on care leads to an important perspective on economic growth.

The first thing to understand is that we should not want growth for its own sake. What is good about growth is that it expands our collective capacities: we come to know more and we are able to do more. This, in turn, allows us to alleviate suffering, to discover more things about the universe, and to spend more time being complete people.

The maintainer’s account of how the attack was tailored to him:

they reached out masquerading as the founder of a company. they had cloned the company’s founder’s likeness as well as the company itself. they then invited me to a real slack workspace. this workspace was branded to the company’s CI and named in a plausible manner. the slack was thought out very well, they had channels where they were sharing linked-in posts, the linked-in posts i presume just went to the real company’s account but it was super convincing etc. they even had what i presume were fake profiles of the team of the company but also a number of other oss maintainers. they scheduled a meeting with me to connect. the meeting was on ms teams. the meeting had what seemed to be a group of people that were involved. the meeting said something on my system was out of date.
i installed the missing item as i presumed it was something to do with teams, and this was the RAT. everything was extremely well co-ordinated, looked legit, and was done in a professional manner.

The four forms of documentation that Diátaxis classifies:
Tutorials: to learn how to use the product
How-to guides: for users to follow to achieve particular goals with the product
Reference: to describe what the product does
Explanations: background and context to educate the user on the product’s rationale

Martin Fowler 1 week ago

Feedback Flywheel

Rahul Garg finishes his series on reducing the friction in AI-Assisted Development. He proposes a structured feedback practice that harvests learnings from AI sessions and feeds them back into the team's shared artifacts, turning individual experience into collective improvement.

Armin Ronacher 1 week ago

Mario and Earendil

Today I’m very happy to share that Mario Zechner is joining Earendil. First things first: I think you should read Mario’s post. This is his news more than it is ours, and he tells his side of it better than I could. What I want to do here is add a more personal note about why this matters so much to me, how the last months led us here, and why I am so excited to have him on board.

Last year changed the way many of us thought about software. It certainly changed the way I did. I spent much of 2025 building, probing, and questioning how to build software, and in many more ways questioning what I want to do. If you are a regular reader of this blog you were along for the ride. I wrote a lot, experimented a lot, and tried to get a better sense for what these systems can actually do and what kinds of companies make sense to build around them. There was, and continues to be, a lot of excitement in the air, but also a lot of noise. It has become clear to me that it’s not a question of whether AI systems can be useful but what kind of software and human-machine interactions we want to bring into the world with them.

That is one of the reasons I have been so drawn to Mario’s work and approaches. Pi is, in my opinion, one of the most thoughtful coding agents and agent infrastructure libraries in this space. Not because it is trying to be the loudest or the fastest, but because it is clearly built by someone who cares deeply about software quality, taste, extensibility, and design. In a moment where much of the industry is racing to ship ever more quickly, often at the cost of coherence and craft, Mario kept insisting on making something solid. That matters to me a great deal. I have known Mario for a long time, and one of the things I admire most about him is that he does not confuse velocity with progress. He has a strong sense for what good tools should feel like. He cares about details. He cares about whether something is well made.
And he cares about building in a way that can last. Mario has been running Pi in a rather unusual way. He exerts back-pressure on the issue tracker and the pull requests through OSS vacations and other means. The last year has also made something else clearer to me: these systems are not only exciting, they are also capable of producing a great deal of damage. Sometimes that damage is obvious; sometimes it looks like low-grade degradation everywhere at once. More slop, more noise, more disingenuous emails in my inbox. There is a version of this future that makes people more distracted, more alienated, and less careful with one another. That is not a future I want to help build. At Earendil, Colin and I have been trying to think very carefully about what a different path might look like. That is a big part of what led us to Lefos . Lefos is our attempt to build a machine entity that is more thoughtful and more deliberate by design. Not an agent whose main purpose is to make everything a little more efficient so that we can produce even more forgettable output, but one that can help people communicate with more care, more clarity, and joy. Good software should not aim to optimize every minute of your life, but should create room for better and more joyful experiences, better relationships, and better ways of relating to one another. Especially in communication and software engineering, I think we should be aiming for more thought rather than more throughput. We should want tools that help people be more considerate, more present, and more human. If all we do is use these systems to accelerate the production of slop, we will have missed the opportunity entirely. This is also why Mario joining Earendil feels so meaningful to me. Pi and Lefos come from different starting points. 
There was a year of distance collaboration, but they are animated by a similar instinct: that quality matters, that design matters, and that trust is earned through care rather than captured through hype. I am very happy that Pi is coming along for the ride. Colin and I care a lot about it, and we want to be good stewards of it. It has already played an important role in our own work over the last months, and I continue to believe it is one of the best foundations for building capable agents. We will have more to say soon about how we think about Pi’s future and its relationship to Lefos, but the short version is simple: we want Pi to continue to exist as a high-quality, open, extensible piece of software, and we want to invest in making that future real. As for our thoughts on Pi’s license, read more here and our company post here.

Martin Fowler 1 week ago

Principles of Mechanical Sympathy

Modern hardware is remarkably fast, but software often fails to leverage it. Caer Sanders has found it valuable to guide their work with mechanical sympathy - the practice of creating software that is sympathetic to its underlying hardware. They distill this practice into everyday principles: predictable memory access, awareness of cache lines, single-writer, and natural batching.


Rules and Gates

One of the first things I stumbled into as I was building what became Superpowers was the idea of a "gate" in a prompt. It's a term that Claude Code introduced me to, so it was clearly part of the literature already. But the basic idea is that a gate is something that must be completed before moving on to the next step. Gates work really, really well. But I never thought much about why they work. Today, I finally asked one of my agents how they thought about the difference between rules and gates.

The difference: a rule has an opt-out path (I can rationalize "I'll do it after this one thing"). A gate doesn't — the next action is blocked until the gate condition is met. Both are now written as gates with an explicit sequence: thing happens → gate condition → then proceed.

Here's a concrete example from an agent today:

Before (rule): "Verify claims with web research before asserting them." That's a rule. It lives in my head. When I was answering your question about whether audience assessment books exist, the rule didn't fire — I synthesized from training data, it felt confident, I stated it as a finding. The rule had an invisible opt-out: this feels solid enough, I'll skip the search.

After (gate): "When a claim about what exists or doesn't exist is forming → web search happens → URLs in hand → then I speak." The gate makes the sequence explicit and gives me something to check: do I have URLs? If no, I haven't cleared the gate. "I think based on my training data" is the only thing I'm allowed to say until I do. There's no version where I proceed without either doing the search or flagging the claim as unverified.

The practical test of whether something is a gate rather than a rule: when I'm about to skip it, does the gate formulation give me a concrete question I can't answer? "Do I have URLs?" is concrete. "Did I verify this?" is too easy to answer yes to without having actually done anything.
Gates are conditions that an agent can evaluate objectively to figure out if they're allowed to finish a task and move on to the next one. They're an incredibly powerful framing for prompting that can dramatically reduce all kinds of agentic misbehavior.
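To make the distinction concrete in code, here is a minimal Python sketch (the function names are invented for illustration, not taken from any agent framework): a gate is an objectively checkable condition that blocks the next action until it holds, whereas a rule is merely advice the agent can rationalize its way around.

```python
def gate_has_sources(urls: list[str]) -> bool:
    """The gate's concrete, checkable question: 'Do I have URLs?'"""
    return len(urls) > 0

def respond(claim: str, urls: list[str]) -> str:
    # A rule ("verify claims before asserting them") has an invisible
    # opt-out. A gate blocks the next action until the condition is met:
    # the only allowed output before clearing it is the unverified flag.
    if not gate_has_sources(urls):
        return f"Unverified (training data only): {claim}"
    return f"{claim} (sources: {', '.join(urls)})"
```

The point of the sketch is that `gate_has_sources` is a yes/no question with no room for "it feels solid enough", which is exactly what makes a gate harder to skip than a rule.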

iDiallo 1 week ago

AI Did It in 12 Minutes. It Took Me 10 Hours to Fix It

I've been working on personal projects since the 2000s. One thing I've always been adamant about is understanding the code I write. Even when Stack Overflow came along, I was that annoying guy who told people not to copy and paste code into their repos. Instead, they should read it and adapt it to their specific case. On personal projects, I've applied this to a fault. Projects never get done because I'm reading and editing code to make it work exactly as I want. I am by no means trying to convince you that my code is high quality. Every day, I regret the design choices I made for this very blog. But at the very least, I like to understand the code that powers my projects. So you can imagine how I struggle with the reviewing part when AI writes a large chunk of our daily work. Large language models are just so verbose, and often produce large blocks of code that don't even get used. I don't want to attribute it to malice (wasting your tokens) when I know this is an emergent technology we are all still adapting to. But it doesn't help that there is just so much code to review. What I tell myself when I review an AI-generated PR is: if I don't have a mental model of how the application works, how can I be of any use when it fails? This weekend, I decided to tackle a project I've been postponing since I created this blog over a decade ago. I needed a nice interface to upload assets, such as images, to go with each blog post. According to my git history, I started work on it 11 years ago but never finished. Yet I've still been uploading images regularly to go with blog posts. How? Using and . Like a caveman. Whenever I finish writing a blog post, I create images in Photoshop or Photopea, fire up FileZilla to connect to my file server, and drag and drop my assets. Then, of course, I have to log back into the blog's admin panel and provide the correct URLs for those assets. It's as primitive as it gets. This was a perfect project to tackle with AI. 
The good news is that I was able to complete it in a single day. The bad news? Well, I read through close to 5,000 lines of code.

This is the first time I've used an LLM to write PHP code. I used the GLM-5 model from z.ai, which I've often played with for generating graphs from data. I'll have to try it on a few more PHP projects before I commit to this opinion, but it consistently wrote spaghetti code. Spaghetti code, to me, is anything where logic is embedded directly in a template: HTML, CSS, and database queries all crammed into the same file. That's really hard to maintain.

I did realize, though, that my instructions were in direct contradiction with the agent's system prompt. It was explicit in the thinking process:

This is NOT a Next.js project - the user explicitly wants a PHP app. So I should NOT use the Next.js project that's already initialized. I should create a PHP application. Wait, but the instructions say "You are an interactive CLI tool that helps users with software engineering tasks" and "Now you are developing a comprehensive and feature-rich Next.js project from scratch." But the user is explicitly asking for a PHP app. I should follow the user's request.

It did generate a bunch of Node.js code, which I was able to remove manually. Luckily, it kept the PHP project in its own folder. If you're wondering how 12 files contain ~5,000 lines of code, I wondered the same. But that's what spaghetti code does.

I set it up locally, ran and , and a few more files and folders were generated. When I finally ran the application, it didn't work. I spent a few hours working through permissions, updating the install script, and modifying the SQLite setup. I thought StackOverflow was dead, but I don't think I would have gotten SQLite working without it. One error, for example, was that SQLite kept throwing a warning that it was running in read-only mode. Apparently, you have to make the parent folder writable (not just the database file) to enable write mode.
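The parent-folder requirement makes sense once you know that SQLite creates journal side files next to the database during writes, so it needs write access to the directory itself, not just the file. A small Python demonstration of this behavior (file names are illustrative):

```python
import os
import sqlite3
import tempfile

# SQLite creates a rollback-journal (or WAL) file *next to* the database,
# which is why the parent directory must be writable, not just the db file.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "blog.db")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE assets (name TEXT)")
conn.commit()

# During an open write transaction, the journal file appears in the directory.
conn.execute("INSERT INTO assets VALUES ('header.png')")
journal_exists = os.path.exists(db_path + "-journal")
conn.commit()  # the journal is removed again on commit (default journal mode)

print("journal created during write:", journal_exists)
```

If the directory is read-only, SQLite cannot create that journal file, and the database effectively degrades to read-only mode even when the `.db` file itself is writable.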
It had been a long time since I'd manually require'd files in PHP. I normally use namespaces and autoload. Since this project was generated from scratch, I had to hunt down various require statements that all had incorrect paths.

Once I sorted those out, I had to deal with authentication. PHP sessions come with batteries included: you call session_start() and you can read and write session variables via the $_SESSION global. But I couldn't figure out why it kept failing. When I created a standalone test file, sessions worked fine. But when loaded through the application, values weren't being saved. I spent a good while debugging before I found the call that was missing from the login success flow. When I logged in, the page redirected to the dashboard, but every subsequent action that required authentication immediately kicked me out.

Even after fixing all those issues and getting uploads working, something still bothered me: how do I maintain this code? How do I add new pages to manage uploaded assets? Do I add meatballs directly to the spaghetti? Or do I just trust the AI agent to know where to put new features? Technically it could do that, but I'd have to rely entirely on the AI without ever understanding how things work.

So I did the only sane thing: I rewrote a large part of the code and restructured the project. Maybe I should have started there, but I didn't know what I wanted until I saw it. Which is probably why I had been dragging this project along for 11 years. Yes, now I have 22 files, almost double the original count. But the code is also much simpler at just 1,254 lines. There's far less cognitive load when it comes to fixing bugs. There's still a lot to improve, but it's a much leaner foundation.

The question I keep coming back to is: would it have been easier to do this manually? Well, the timeline speaks for itself. I had been neglecting this project for years. Without AI, I probably never would have finished it. That said, it would have been easier to build on my existing framework.
My blog's framework has been tested for years and has accumulated a lot of useful features: a template engine, a working router, an auth system, and more. All things I had to re-engineer from scratch here. If I'd taken the time to work within my own framework, it probably would have taken less time overall. But AI gave me the illusion that the work could be done much faster.

Z.ai generated the whole thing in just 12 minutes. It took an additional 10 hours to clean it up and get it working the way I wanted. This reminds me of several non-technical friends who built/vibe-coded apps last year. The initial results looked impressive. Most of them don't have a working app anymore, because they realized that the cleanup is just as important as the generation if you want something that actually holds together. I can only imagine what "vibe-debugging" looks like.

I'm glad I have a working app, but I'm not sure I can honestly call this vibe-coded. Most, if not all, of the files have been rewritten. When companies claim that a significant percentage of their code is AI-generated, do their developers agree? For me, it's unthinkable to deploy code I haven't vetted and understood. But I'm not the benchmark.

In the meantime, I think I've earned the right to say this the next time I ship an AI-assisted app: "I apologize for so many lines of code - I didn't have time to write a shorter app."


Binary Lambda Calculus is Hard

Binary Lambda Calculus is a really alluring idea. But it’s also hard to grasp and use! Here’s my list of complaints and obstacles to using BLC.


Stamp It! All Programs Must Report Their Version

Recently, during a production incident response, I guessed the root cause of an outage correctly in less than an hour (cool!) and submitted a fix just to rule it out, only to then spend many hours fumbling in the dark because we lacked visibility into version numbers and rollouts… 😞

This experience made me think about software versioning again, or more specifically about build info (build versioning, version stamping, whatever you want to call it) and version reporting. I realized that for the i3 window manager, I had solved this problem well over a decade ago, so it was really unexpected that the problem was decidedly not solved at work. In this article, I’ll explain how 3 simple steps (Stamp it! Plumb it! Report it!) are sufficient to save you hours of delays and stress during incident response.

Every household appliance has incredibly detailed versioning! Consider this dishwasher: (Thank you Feuermurmel for sending me this lovely example!) I observed a couple of household appliance repairs and am under the impression that if a repair person cannot identify the appliance, they would most likely refuse to even touch it. So why are our standards so low in computers, in comparison? Sure, consumer products are typically versioned somehow and that’s typically good enough (except for, say, USB 3.2 Gen 1×2!). But recently, I have encountered too many developer builds that were not adequately versioned!

Unlike a physical household appliance with a stamped metal plate, software is constantly updated and runs in places and structures we often cannot even see. Let’s dig into what we need to increase our versioning standard! Usually, software has a name and some version number of varying granularity: All of these identify the Chrome browser on my computer, but each at different granularity. All are correct and useful, depending on the context.
Here’s an example for each:

After creating the i3 window manager, I quickly learned that for user support, it is very valuable for programs to clearly identify themselves. Let me illustrate with the following case study. When running , you will see output like this:

Each word was carefully deliberated and placed. Let me dissect: When doing user support, there are a couple of questions that are conceptually easy to ask the affected user and produce very valuable answers for the developer: Based on my experiences with asking these questions many times, I noticed a few patterns in how these debugging sessions went. In response, I introduced another way for i3 to report its version in i3 v4.3 (released in September 2012): a flag! Now I could ask users a small variation of the first question: What is the output of ?

Note how this also transfers well over spoken word, for example at a computer meetup:

Michael: Which version are you using?
User: How can I check?
Michael: Run this command:
User: It says 4.24.
Michael: Good, that is recent enough to include the bug fix. Now, we need more version info! Run please and tell me what you see.

When you run , it does not just report the version of the i3 program you called, it also connects to the running i3 window manager process in your X11 session using its IPC (interprocess communication) interface and reports the running i3 process’s version, alongside other key details that are helpful to show the user, like which configuration file is loaded and when it was last changed:

This might look like a lot of detail at first glance, but let me spell out why this output is such a valuable debugging tool: Connecting to i3 via the IPC interface is an interesting test in and of itself. If a user sees output, that implies they will also be able to run debugging commands like (for example) to capture the full layout state.
During a debugging session, running is an easy check to see if the version you just built is actually effective (see the line). Showing the full path to the loaded config file will make it obvious if the user has been editing the wrong file. If the path alone is not sufficient, the modification time (displayed both absolute and relative) will flag editing the wrong file.

I use NixOS, BTW, so I automatically get a stable identifier ( ) for the specific build of i3. To see the build recipe (“derivation” in Nix terminology) which produced this Nix store output ( ), I can run : Unfortunately, I am not aware of a way to go from the derivation to the source, but at least one can check that a certain source results in an identical derivation.

The versioning I have described so far is sufficient for most users, who will not be interested in tracking intermediate versions of software, but only the released versions. But what about developers, or any kind of user who needs more precision? When building i3 from git, it reports the git revision it was built from, using : A modified working copy gets represented by a after the revision:

Reporting the git revision (or VCS revision, generally speaking) is the most useful choice. This way, we catch the following common mistakes: As we have seen above, the single most useful piece of version information is the VCS revision. We can fetch all other details (version numbers, dates, authors, …) from the VCS repository.

Now, let’s demonstrate the best case scenario by looking at how Go does it! Go has become my favorite programming language over the years, in big part because of the good taste and style of the Go developers, and of course also because of the high-quality tooling: I strive to respect everybody’s personal preferences, so I usually steer clear of debates about which is the best programming language, text editor or operating system.
However, recently I was asked a couple of times why I like and use a lot of Go, so here is a coherent article to fill in the blanks of my ad-hoc in-person ramblings :-).

Therefore, I am pleased to say that Go implements the gold standard with regard to software versioning: it stamps VCS buildinfo by default! 🥳 This was introduced in Go 1.18 (March 2022): Additionally, the go command embeds information about the build, including build and tool tags (set with -tags), compiler, assembler, and linker flags (like -gcflags), whether cgo was enabled, and if it was, the values of the cgo environment variables (like CGO_CFLAGS). Both VCS and build information may be read together with module information using go version -m, runtime/debug.ReadBuildInfo (for the currently running binary), or the new debug/buildinfo package.

Note: Before Go 1.18, the standard approach was to use go build -ldflags "-X …" or similar explicit injection. This setup works (and can still be seen in many places) but requires making changes to the application code, whereas the Go 1.18+ stamping requires no extra steps.

What does this mean in practice? Here is a diagram for the common case: building from git: This covers most of my hobby projects! Many tools I just , or if I want to easily copy them around to other computers. Although, I am managing more and more of my software in NixOS. When I find a program that is not yet fully managed, I can use and the tool to identify it:

It’s very cool that Go does the right thing by default! Systems that consist of 100% Go software (like my gokrazy Go appliance platform) are fully stamped! For example, the gokrazy web interface shows me exactly which version and dependencies went into the build on my scan2drive appliance.
Despite being fully stamped, note that gokrazy only shows the module versions, and no VCS buildinfo, because it currently suffers from the same gap as Nix: For the gokrazy packer, which follows a rolling release model (no version numbers), I ended up with a few lines of Go code (see below) to display a git revision, no matter if you installed the packer from a Go module or from a git working copy. The code either displays (the easy case; built from git) or extracts the revision from the Go module version of the main module ( ):

What are the other cases? These examples illustrate the scenarios I usually deal with: This is what it looks like in practice: But a version built from git has the full revision available (→ you can tell them apart):

When packaging Go software with Nix, it’s easy to lose Go VCS revision stamping: So the fundamental tension here is between reproducibility and VCS stamping. Luckily, there is a solution that works for both: I created the Nix overlay module that you can import to get working Go VCS revision stamping by default for your Nix expressions!

Tip: If you are not a Nix user, feel free to skip over this section. I included it in this article so that you have a full example of making VCS stamping work in the most complicated environments.

Packaging Go software in Nix is pleasantly straightforward. For example, the Go Protobuf generator plugin is packaged in Nix with <30 lines: official nixpkgs package.nix. You call , supply as the result from and add a few lines of metadata. But getting developer builds fully stamped is not straightforward at all! When packaging my own software, I want to package individual revisions (developer builds), not just released versions. I use the same , or if I need the latest Go version. Instead of using , I provide my sources using Flakes, usually also from GitHub or from another Git repository.
For example, I package like so: The comes from my : Go stamps all builds, but it does not have much to stamp here: Here’s a full example of gokrazy/bull:

To fix VCS stamping, add my overlay to your : (If you are using , like I am, you need to apply the overlay in both places.) After rebuilding, your Go binaries should newly be stamped with buildinfo: Nice! 🥳

But… how does it work? When does it apply? How do you know how to fix your config? I’ll show you the full diagram first, and then explain how to read it: There are 3 relevant parts of the Nix stack that you can end up in, depending on what you write into your files: For the purpose of VCS revision stamping, you should: Hence, we will stick to the left-most column: fetchers.

Unfortunately, by default, with fetchers, the VCS revision information, which is stored in a Nix attrset (in-memory, during the build process), does not make it into the Nix store. Hence, when the Nix derivation is evaluated and Go compiles the source code, Go does not see any VCS revision. My Nix overlay module fixes this, and enabling the overlay is how you end up in the left-most lane of the above diagram: the happy path, where your Go binaries are now stamped!

How does the overlay work? It functions as an adapter between Nix and Go: So the overlay implements 3 steps to get Go to stamp the correct info: For the full source, see .

See Go issue #77020 and Go issue #64162 for a cleaner approach to fixing this gap: allowing package managers to invoke the Go tool with the correct VCS information injected. This would allow Nix (or also gokrazy) to pass along buildinfo cleanly, without the need for workarounds like my adapter. At the time of writing, issue #77020 does not seem to have much traction and is still open.

My argument is simple: Stamping the VCS revision is conceptually easy, but very important!
For example, if the production system from the incident I mentioned had reported its version, we would have saved multiple hours of mitigation time! Unfortunately, many environments only identify the build output (useful, but orthogonal), but do not plumb the VCS revision (much more useful!), or at least not by default. Your action plan to fix it is just 3 simple steps: Implementing “version observability” throughout your system is a one-day high-ROI project. With my Nix example, you saw how the VCS revision is available throughout the stack, but can get lost in the middle. Hopefully my resources help you quickly fix your stack(s), too: Now go stamp your programs and data transfers! 🚀

Chrome 146.0.7680.80
Chrome f08938029c887ea624da7a1717059788ed95034d-refs/branch-heads/7680_65@{#34}

“This works in Chrome for me, did you test in Firefox?”
“Chrome 146 contains broken middle-click-to-paste-and-navigate”
“I run Chrome 146.0.7680.80 and cannot reproduce your issue”
“Apply this patch on top of Chrome f08938029c887ea624da7a1717059788ed95034d-refs/branch-heads/7680_65@{#34} and follow these steps to reproduce: […]”

: I could have shortened this to or maybe , but I figured it would be helpful to be explicit because is such a short name. Users might mumble aloud “What’s an i-3-4-2-4?”, but when putting “version” in there, the implication is that i3 is some computer thing (→ a computer program) that exists in version 4.24.

is the release date so that you can immediately tell if “ ” is recent. signals when the project was started and who is the main person behind it. gives credit to the many people who helped. i3 was never a one-person project; it was always a group effort.

Question: “Which version of i3 are you using?” Since i3 is not a typical program that runs in a window (but a window manager / desktop environment), there is no Help → About menu option. Instead, we started asking: What is the output of ?

Question: “Are you reporting a new issue or a preexisting issue?
To confirm, can you try going back to the version of i3 you used previously?” The technical terms for “going back” are downgrade, rollback or revert. Depending on the Linux distribution, this is either trivial or a nightmare. With NixOS, it’s trivial: you just boot into an older system “generation” by selecting that version in the bootloader. Or you revert in git, if your configs are version-controlled. With imperative Linux distributions like Debian Linux or Arch Linux, if you did not take a file system-level snapshot, there is no easy and reliable way to go back after upgrading your system. If you are lucky, you can just the older version of i3. But you might run into dependency conflicts (“version hell”). I know that it is possible to run older versions of Debian using snapshot.debian.org, but it is just not very practical, at least when I last tried.

Question: “Can you check if the issue is still present in the latest i3 development version?” Of course, I could also try reproducing the user issue with the latest release version, and then one additional time on the latest development version. But this way, the verification step moves to the affected user, which is good because it filters for highly-motivated bug reporters (higher chance the bug report actually results in a fix!) and it makes the user reproduce the bug twice, figuring out if it’s a flaky issue, hard-to-reproduce, if the reproduction instructions are correct, etc.

A natural follow-up question: “Does this code change make the issue go away?” This is easy to test for the affected user who now has a development environment.

During a debugging session, running is an easy check to see if the version you just built is actually effective (see the line).
Note that this is the same check that is relevant during production incidents: verifying that the versions effectively running match the versions supposed to be running.

People build from the wrong revision.
People build, but forget to install.
People install, but their session does not pick it up (wrong location?).

Nix fetchers like are implemented by fetching an archive ( ) file from GitHub — the full repository is not transferred, which is more efficient. Even if a repository is present, Nix usually intentionally removes it for reproducibility: .git directories contain packed objects that change across runs (for example), which would break reproducible builds (different hash for the same source).

We build from a directory, not a Go module, so the module version is . The stamped buildinfo does not contain any information.

Fetchers. These are what Flakes use, but also non-Flake use-cases.
Fixed-output derivations (FOD). This is how is implemented, but the constant hash churn (updating the line) inherent to FODs is annoying.
Copiers. These just copy files into the Nix store and are not git-aware. Avoid the Copiers!

If you use Flakes:
❌ do not use as a Flake input
✅ use instead for git awareness

I avoid the fixed-output derivation (FOD) as well. Fetching the git repository at build time is slow and inefficient. Enabling , which is needed for VCS revision stamping with this approach, is even more inefficient because a new Git repository must be constructed deterministically to keep the FOD reproducible.

Nix tracks the VCS revision in the in-memory attrset. Go expects to find the VCS revision in a repository, accessed via file access and commands. The overlay synthesizes a file so that Go’s detects a git repository.
It injects a command into the that implements exactly the two commands used by Go and fails loudly on anything else (in case Go updates its implementation). It sets in the environment variable.

Stamp it! Include the source VCS revision in your programs. This is not a new idea: i3 builds include their revision since 2012!

Plumb it! When building / packaging, ensure the VCS revision does not get lost. My “VCS rev with NixOS” case study section above illustrates several reasons why the VCS rev could get lost, which paths can work and how to fix the missing plumbing.

Report it! Make your software print its VCS revision on every relevant surface, for example:
Executable programs: report the VCS revision when run with . For Go programs, you can always use .
Services and batch jobs: include the VCS revision in the startup logs.
Outgoing HTTP requests: include the VCS revision in .
HTTP responses: include the VCS revision in a header (internally).
Remote Procedure Calls (RPCs): include the revision in RPC metadata.
User interfaces: expose the revision somewhere visible for debugging.

My overlay for Nix / NixOS
My repository is a community resource to collect examples (as markdown content) and includes a Go module with a few helpers to make version reporting trivial.
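The "report it on every surface" advice translates directly to other stacks. A hedged Python sketch (the revision constant, program name, and header name are my own illustrations, not from the article) of two such surfaces, a --version flag and an internal HTTP response header:

```python
import argparse

# Would normally be stamped at build/package time, not hard-coded.
REVISION = "f0893802-dirty"

def build_cli():
    """A --version flag that reports the VCS revision, not just a number."""
    parser = argparse.ArgumentParser(prog="myservice")
    parser.add_argument(
        "--version", action="version",
        version=f"%(prog)s revision {REVISION}",
    )
    return parser

def with_revision_header(headers):
    """Attach the revision to outgoing HTTP responses (internal surfaces).
    The header name is illustrative, not a standard."""
    headers = dict(headers)
    headers["X-Service-Revision"] = REVISION
    return headers

print(with_revision_header({"Content-Type": "text/html"}))
```

The same constant would also be written to the startup log line and to RPC metadata, so every surface agrees on one revision string.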

Ahead of AI 1 week ago

Components of A Coding Agent

In this article, I want to cover the overall design of coding agents and agent harnesses: what they are, how they work, and how the different pieces fit together in practice. Readers of my Build a Large Language Model (From Scratch) and Build a Large Reasoning Model (From Scratch) books often ask about agents, so I thought it would be useful to write a reference I can point to.

More generally, agents have become an important topic because much of the recent progress in practical LLM systems is not just about better models, but about how we use them. In many real-world applications, the surrounding system, such as tool use, context management, and memory, plays as much of a role as the model itself. This also helps explain why systems like Claude Code or Codex can feel significantly more capable than the same models used in a plain chat interface. In this article, I lay out six of the main building blocks of a coding agent.

You are probably familiar with Claude Code or the Codex CLI, but just to set the stage: they are essentially agentic coding tools that wrap an LLM in an application layer, a so-called agentic harness, to make them more convenient and better-performing for coding tasks.

Figure 1: Claude Code CLI, Codex CLI, and my Mini Coding Agent.

Coding agents are engineered for software work, where the notable parts are not only the model choice but the surrounding system, including repo context, tool design, prompt-cache stability, memory, and long-session continuity. That distinction matters because when we talk about the coding capabilities of LLMs, people often collapse the model, the reasoning behavior, and the agent product into one thing. But before getting into the coding agent specifics, let me briefly provide a bit more context on the difference between the broader concepts: LLMs, reasoning models, and agents.

An LLM is the core next-token model.
A reasoning model is still an LLM, but usually one that was trained and/or prompted to spend more inference-time compute on intermediate reasoning, verification, or search over candidate answers. An agent is a layer on top, which can be understood as a control loop around the model. Typically, given a goal, the agent layer (or harness) decides what to inspect next, which tools to call, how to update its state, when to stop, etc.

Roughly, we can think about the relationship like this: the LLM is the engine, a reasoning model is a beefed-up engine (more powerful, but more expensive to use), and an agent harness helps us drive the model. The analogy is not perfect, because we can also use conventional and reasoning LLMs as standalone models (in a chat UI or Python session), but I hope it conveys the main point.

Figure 2: The relationship between conventional LLM, reasoning LLM (or reasoning model), and an LLM wrapped in an agent harness.

In other words, the agent is the system that repeatedly calls the model inside an environment. So, in short, we can summarize it like this:

LLM: the raw model
Reasoning model: an LLM optimized to output intermediate reasoning traces and to verify itself more
Agent: a loop that uses a model plus tools, memory, and environment feedback
Agent harness: the software scaffold around an agent that manages context, tool use, prompts, state, and control flow
Coding harness: a special case of an agent harness; i.e., a task-specific harness for software engineering that manages code context, tools, execution, and iterative feedback

As listed above, in the context of agents and coding tools, we also have the two popular terms agent harness and (agentic) coding harness. A coding harness is the software scaffold around a model that helps it write and edit code effectively. And an agent harness is a bit broader and not specific to coding (e.g., think of OpenClaw). Codex and Claude Code can be considered coding harnesses.
Anyways, a better LLM provides a better foundation for a reasoning model (which involves additional training), and a harness gets more out of this reasoning model. Sure, LLMs and reasoning models are also capable of solving coding tasks by themselves (without a harness), but coding work is only partly about next-token generation. A lot of it is about repo navigation, search, function lookup, diff application, test execution, error inspection, and keeping all the relevant information in context. (Coders may know that this is hard mental work, which is why we don’t like to be disrupted during coding sessions :)).

Figure 3. A coding harness combines three layers: the model family, an agent loop, and runtime supports. The model provides the “engine”, the agent loop drives iterative problem solving, and the runtime supports provide the plumbing. Within the loop, “observe” collects information from the environment, “inspect” analyzes that information, “choose” selects the next step, and “act” executes it.

The takeaway here is that a good coding harness can make a reasoning and a non-reasoning model feel much stronger than it does in a plain chat box, because it helps with context management and more.

As mentioned in the previous section, when we say harness, we typically mean the software layer around the model that assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches stable prefixes, stores memory, and more. Today, when using LLMs, this layer shapes most of the user experience compared to prompting the model directly or using a web chat UI (which is closer to “chat with uploaded files”).

Since, in my view, the vanilla versions of LLMs nowadays have very similar capabilities (e.g., the vanilla versions of GPT-5.4, Opus 4.6, and GLM-5 or so), the harness can often be the distinguishing factor that makes one LLM work better than another.
This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code. That said, some harness-specific post-training is usually beneficial. For example, OpenAI historically maintained separate GPT-5.3 and GPT-5.3-Codex variants.

In the next section, I want to go more into the specifics and discuss the core components of a coding harness using my Mini Coding Agent: https://github.com/rasbt/mini-coding-agent.

Figure 4: Main harness features of a coding agent / coding harness that will be discussed in the following sections.

By the way, in this article, I use the terms “coding agent” and “coding harness” somewhat interchangeably for simplicity. (Strictly speaking, the agent is the model-driven decision-making loop, while the harness is the surrounding software scaffold that provides context, tools, and execution support.)

Figure 5: Minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python)

Anyways, below are six main components of coding agents. You can check out the source code of my minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python) for more concrete code examples. The code annotates the six components discussed below via code comments.

This is maybe the most obvious component, but it is also one of the most important ones. When a user says “fix the tests” or “implement xyz,” the model should know whether it is inside a Git repo, what branch it is on, which project documents might contain instructions, and so on. That’s because those details often change or affect what the correct action is. For example, “Fix the tests” is not a self-contained instruction. If the agent sees AGENTS.md or a project README, it may learn which test command to run, etc. If it knows the repo root and layout, it can look in the right places instead of guessing.
Also, the git branch, status, and commits can help provide more context about what changes are currently in progress and where to focus.

Figure 6: The agent harness first builds a small workspace summary that gets combined with the user request for additional project context.

The takeaway is that the coding agent collects info (“stable facts” as a workspace summary) upfront before doing any work, so that it is not starting from zero, without context, on every prompt.

Once the agent has a repo view, the next question is how to feed that information to the model. The previous figure showed a simplified view of this (“Combined prompt: prefix + request”), but in practice, it would be relatively wasteful to combine and re-process the workspace summary on every user query. I.e., coding sessions are repetitive, and the agent rules usually stay the same. The tool descriptions usually stay the same, too. And even the workspace summary usually stays (mostly) the same. The main changes are usually the latest user request, the recent transcript, and maybe the short-term memory. “Smart” runtimes don’t rebuild everything as one giant undifferentiated prompt on every turn, as illustrated in the figure below.

Figure 7: The agent harness builds a stable prompt prefix, adds the changing session state, and then feeds that combined prompt to the model.

The main difference from section 1 is that section 1 was about gathering repo facts. Here, we are now interested in packaging and caching those facts efficiently for repeated model calls. The “stable” in “stable prompt prefix” means that the information contained there doesn’t change much. It usually contains the general instructions, tool descriptions, and the workspace summary. We don’t want to waste compute on rebuilding it from scratch in each interaction if nothing important has changed. The other components are updated more frequently (usually each turn).
This includes short-term memory, the recent transcript, and the newest user request. In short, the caching aspect for the “Stable prompt prefix” is simply that a smart runtime tries to reuse that part. Tool access and tool use are where it starts to feel less like chat and more like an agent. A plain model can suggest commands in prose, but an LLM in a coding harness should do something narrower and more useful and be actually able to execute the command and retrieve the results (versus us calling the command manually and pasting the results back into the chat). But instead of letting the model improvise arbitrary syntax, the harness usually provides a pre-defined list of allowed and named tools with clear inputs and clear boundaries. (But of course, something like Python can be part of this so that the agent could also execute an arbitrary wide list of shell commands.) The tool-use flow is illustrated in the figure below. Figure 8: The model emits a structured action, the harness validates it, optionally asks for approval, executes it, and feeds the bounded result back into the loop. To illustrate this, below is an example of how this usually looks to the user using my Mini Coding Agent. (This is not as pretty as Claude Code or Codex because it is very minimal and uses plain Python without any external dependencies.) Figure 9: Illustration of a tool call approval request in the Mini Coding Agent. Here, the model has to choose an action that the harness recognizes, like list files, read a file, search, run a shell command, write a file, etc. It also has to provide arguments in a shape that the harness can check. So when the model asks to do something, the runtime can stop and run programmatic checks like “Is this a known tool?”, “Are the arguments valid?”, “Does this need user approval?” “Is the requested path even inside the workspace?” Only after those checks pass does anything actually run. 
While running coding agents, of course, carries some risk, the harness checks also improve reliability because the model doesn’t execute totally arbitrary commands. Also, besides rejecting malformed actions and approval gating, file access can be kept inside the repo by checking file paths. In a sense, the harness is giving the model less freedom, but it also improves the usability at the same time. Context bloat is not a unique problem of coding agents but an issue for LLMs in general. Sure, LLMs are supporting longer and longer contexts these days (and I recently wrote about the attention variants that make it computationally more feasible), but long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info). Coding agents are even more susceptible to context bloat than regular LLMs during multi-turn chats, because of repeated file reads, lengthy tool outputs, logs, etc. If the runtime keeps all of that at full fidelity, it will run out of available context tokens pretty quickly. So, a good coding harness is usually pretty sophisticated about handling context bloat beyond just cutting our summarizing information like regular chat UIs. Conceptually, the context compaction in coding agents might work as summarized in the figure below. Specifically, we are zooming a bit further into the clip (step 6) part of Figure 8 in the previous section. Figure 10: Large outputs are clipped, older reads are deduplicated, and the transcript is compressed before it goes back into the prompt. A minimal harness uses at least two compaction strategies to manage that problem. The first is clipping, which shortens long document snippets, large tool outputs, memory notes, and transcript entries. In other words, it prevents any one piece of text from taking over the prompt budget just because it happened to be verbose. 
The second strategy is transcript reduction or summarization, which turns the full session history (more on that in the next section) into a smaller promptable summary. A key trick here is to keep recent events richer because they are more likely to matter for the current step. And we compress older events more aggressively because they are likely less relevant. Additionally, we also deduplicate older file reads so the model does not keep seeing the same file content over and over again just because it was read multiple times earlier in the session. Overall, I think this is one of the underrated, boring parts of good coding-agent design. A lot of apparent “model quality” is really context quality. In practice, all these 6 core concepts covered here are highly intertwined, and the different sections and figures cover them with different focuses or zoom levels. In the previous section, we covered prompt-time use of history and how we build a compact transcript. The question there is: how much of the past should go back into the model on the next turn? So the emphasis is compression, clipping, deduplication, and recency. Now, this section, structured session memory, is about the storage-time structure of history. The question here is: what does the agent keep over time as a permanent record? So the emphasis is that the runtime keeps a fuller transcript as a durable state, alongside a lighter memory layer that is smaller and gets modified and compacted rather than just appended to. To summarize, a coding agent separates state into (at least) two layers: working memory: the small, distilled state the agent keeps explicitly a full transcript: this covers all the user requests, tool outputs, and LLM responses Figure 11: New events get appended to a full transcript and summarized in a working memory. The session files on disk are usually stored as JSON files. 
The figure above illustrates the two main session files, the full transcript and the working memory, that usually get stored as JSON files on disk. As mentioned before, the full transcript stores the whole history, and it’s resumable if we close the agent. The working memory is more of a distilled version with the currently most important info, which is somewhat related to the compact transcript. But the compact transcript and working memory have slightly different jobs. The compact transcript is for prompt reconstruction. Its job is to give the model a compressed view of recent history so it can continue the conversation without seeing the full transcript every turn. The working memory is more meant for task continuity. Its job is to keep a small, explicitly maintained summary of what matters across turns, things like the current task, important files, and recent notes. Following step 4 in the figure above, the latest user request, together with the LLM response and tool output, would then be recorded as a “new event” in both the full transcript and working memory, in the next round, which is not shown to reduce clutter in the figure above. Once an agent has tools and state, one of the next useful capabilities is delegation. The reason is that it allows us to parallelize certain work into subtasks via subagents and speed up the main task. For example, the main agent may be in the middle of one task and still need a side answer, for example, which file defines a symbol, what a config says, or why a test is failing. It is useful to split that off into a bounded subtask instead of forcing one loop to carry every thread of work at once. (In my mini coding agent, the implementation is simpler, and the child still runs synchronously, but the underlying idea is the same.) A subagent is only useful if it inherits enough context to do real work. 
But if we don’t restrict it, we now have multiple agents duplicating work, touching the same files, or spawning more subagents, and so on. So the tricky design problem is not just how to spawn a subagent but also how to bind one :). Figure 12: The subagent inherits enough context to be useful, but it runs inside tighter boundaries than the main agent. The trick here is that the subagent inherits enough context to be useful, but also has it constrained (for example, read-only and restricted in recursion depth) Claude Code has supported subagents for a long time, and Codex added them more recently. Codex does not generally force subagents into read-only mode. Instead, they usually inherit much of the main agent’s sandbox and approval setup. So, the boundary is more about task scoping, context, and depth. The section above tried to cover the main components of coding agents. As mentioned before, they are more or less deeply intertwined in their implementation. However, I hope that covering them one by one helps with the overall mental model of how coding harnesses work, and why they can make the LLM more useful compared to simple multi-turn chats. Figure 13: Six main features of a coding harness discussed in previous sections. If you are interested in seeing these implemented in clean, minimalist Python code, you may like my Mini Coding Agent . OpenClaw may be an interesting comparison, but it is not quite the same kind of system. OpenClaw is more like a local, general agent platform that can also code, rather than being a specialized (terminal) coding assistant. There are still several overlaps with a coding harness: it uses prompt and instruction files in the workspace, such as AGENTS.md, SOUL.md, and TOOLS.md it keeps JSONL session files and includes transcript compaction and session management it can spawn helper sessions and subagents However, as mentioned above, the emphasis is different. 
Figure 1: Claude Code CLI, Codex CLI, and my Mini Coding Agent.

Coding agents are engineered for software work, where the notable parts are not only the model choice but also the surrounding system, including repo context, tool design, prompt-cache stability, memory, and long-session continuity. That distinction matters because, when we talk about the coding capabilities of LLMs, people often collapse the model, the reasoning behavior, and the agent product into one thing. But before getting into the coding-agent specifics, let me briefly provide a bit more context on the differences between the broader concepts: LLMs, reasoning models, and agents.

On The Relationship Between LLMs, Reasoning Models, and Agents

An LLM is the core next-token model.
A reasoning model is still an LLM, but usually one that was trained and/or prompted to spend more inference-time compute on intermediate reasoning, verification, or search over candidate answers. An agent is a layer on top, which can be understood as a control loop around the model. Typically, given a goal, the agent layer (or harness) decides what to inspect next, which tools to call, how to update its state, and when to stop. Roughly, we can think about the relationship like this: the LLM is the engine, a reasoning model is a beefed-up engine (more powerful, but more expensive to use), and an agent harness is what lets us actually put the model to work. The analogy is not perfect, because we can also use conventional and reasoning LLMs as standalone models (in a chat UI or Python session), but I hope it conveys the main point.

Figure 2: The relationship between a conventional LLM, a reasoning LLM (or reasoning model), and an LLM wrapped in an agent harness.

In other words, the agent is the system that repeatedly calls the model inside an environment. So, in short, we can summarize it like this:

- LLM: the raw model
- Reasoning model: an LLM optimized to output intermediate reasoning traces and to verify itself more
- Agent: a loop that uses a model plus tools, memory, and environment feedback
- Agent harness: the software scaffold around an agent that manages context, tool use, prompts, state, and control flow
- Coding harness: a special case of an agent harness; i.e., a task-specific harness for software engineering that manages code context, tools, execution, and iterative feedback

Figure 3: A coding harness combines three layers: the model family, an agent loop, and runtime supports. The model provides the “engine”, the agent loop drives iterative problem solving, and the runtime supports provide the plumbing. Within the loop, “observe” collects information from the environment, “inspect” analyzes that information, “choose” selects the next step, and “act” executes it.
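The observe/choose/act loop described above can be sketched in a few lines. This is a minimal illustration, not any particular harness's implementation; `call_model`, the `TOOLS` registry, and the stop condition are hypothetical stand-ins (here the model call is a deterministic stub so the example runs offline):

```python
# Minimal sketch of an agent control loop: the harness repeatedly calls the
# model, executes the tool the model chose, and feeds the observation back.
# All names (call_model, TOOLS, ...) are hypothetical stand-ins.

def list_files(path="."):
    return ["README.md", "main.py"]   # stubbed observation

def read_file(path):
    return f"<contents of {path}>"    # stubbed observation

TOOLS = {"list_files": list_files, "read_file": read_file}

def call_model(goal, history):
    # Stand-in for an LLM call: a real harness would send the prompt
    # (stable prefix + history) to a model and parse a structured action.
    if not history:
        return {"tool": "list_files", "args": {}}
    if len(history) == 1:
        return {"tool": "read_file", "args": {"path": "main.py"}}
    return {"done": True, "answer": "inspected the repo"}

def agent_loop(goal, max_turns=10):
    history = []
    for _ in range(max_turns):               # the control loop around the model
        action = call_model(goal, history)   # "choose"
        if action.get("done"):               # model decided to stop
            return action["answer"]
        tool = TOOLS[action["tool"]]         # only known tools ever run
        observation = tool(**action["args"])  # "act"
        history.append((action, observation))  # update state for next turn
    return "gave up"

print(agent_loop("inspect the repo"))  # prints: inspected the repo
```

The key point the sketch conveys: the "agent" is ordinary control flow around the model; all intelligence lives in `call_model`, and everything else is plumbing.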
The takeaway here is that a good coding harness can make both a reasoning and a non-reasoning model feel much stronger than it does in a plain chat box, because it helps with context management and more.

The Coding Harness

As mentioned in the previous section, when we say harness, we typically mean the software layer around the model that assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches stable prefixes, stores memory, and more. Today, when using LLMs, this layer shapes most of the user experience compared to prompting the model directly or using a web chat UI (which is closer to “chat with uploaded files”). Since, in my view, the vanilla versions of today's LLMs have very similar capabilities (e.g., GPT-5.4, Opus 4.6, and GLM-5), the harness can often be the distinguishing factor that makes one LLM work better than another. This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code. That said, some harness-specific post-training is usually beneficial. For example, OpenAI historically maintained separate GPT-5.3 and GPT-5.3-Codex variants. In the next section, I want to go more into the specifics and discuss the core components of a coding harness using my Mini Coding Agent: https://github.com/rasbt/mini-coding-agent

Figure 4: Main harness features of a coding agent / coding harness that will be discussed in the following sections.

By the way, in this article, I use the terms “coding agent” and “coding harness” somewhat interchangeably for simplicity. (Strictly speaking, the agent is the model-driven decision-making loop, while the harness is the surrounding software scaffold that provides context, tools, and execution support.)
Figure 5: Minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python)

Anyways, below are six main components of coding agents. You can check out the source code of my minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python) for more concrete code examples. The code annotates the six components discussed below via code comments.

1. Live Repo Context

This is maybe the most obvious component, but it is also one of the most important ones. When a user says “fix the tests” or “implement xyz,” the model should know whether it is inside a Git repo, what branch it is on, which project documents might contain instructions, and so on. That’s because those details often change or affect what the correct action is. For example, “fix the tests” is not a self-contained instruction. If the agent sees AGENTS.md or a project README, it may learn which test command to run. If it knows the repo root and layout, it can look in the right places instead of guessing. Also, the git branch, status, and commits can help provide more context about what changes are currently in progress and where to focus.

Figure 6: The agent harness first builds a small workspace summary that gets combined with the user request for additional project context.

The takeaway is that the coding agent collects info (“stable facts” as a workspace summary) upfront before doing any work, so that it is not starting from zero, without context, on every prompt.

2. Prompt Shape And Cache Reuse

Once the agent has a repo view, the next question is how to feed that information to the model. The previous figure showed a simplified view of this (“Combined prompt: prefix + request”), but in practice, it would be relatively wasteful to combine and re-process the workspace summary on every user query. Coding sessions are repetitive, and the agent rules usually stay the same. The tool descriptions usually stay the same, too.
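The two ideas above, collecting stable repo facts once and reusing them as a stable prompt prefix, can be sketched together. This is an illustrative sketch, not the Mini Coding Agent's actual code; the helper names, the choice of files to probe, and the in-process cache are my own assumptions:

```python
# Sketch: build a workspace summary of "stable facts" once, embed it in a
# stable prompt prefix, and reuse that prefix across turns.
# Helper names and file choices are illustrative assumptions.
import os
import subprocess

def workspace_summary(root="."):
    """Collect stable repo facts up front (branch, instruction files)."""
    facts = []
    try:
        branch = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=root, capture_output=True, text=True, timeout=5,
        ).stdout.strip()
        if branch:
            facts.append(f"git branch: {branch}")
    except OSError:
        pass  # git missing: just skip the git facts
    for name in ("AGENTS.md", "README.md"):  # project instructions, if any
        if os.path.exists(os.path.join(root, name)):
            facts.append(f"found {name}")
    return "\n".join(facts) or "no repo facts found"

_PREFIX_CACHE = {}

def stable_prefix(root="."):
    """Instructions + tool descriptions + workspace summary, built once."""
    if root not in _PREFIX_CACHE:
        _PREFIX_CACHE[root] = (
            "You are a coding agent.\n"
            "Tools: list_files, read_file, run_shell, write_file\n"
            f"Workspace:\n{workspace_summary(root)}\n"
        )
    return _PREFIX_CACHE[root]

def build_prompt(transcript, user_request, root="."):
    # Only the suffix changes per turn; the prefix is reused verbatim
    # (which is also what provider-side prompt caching keys on).
    return stable_prefix(root) + transcript + f"\nUser: {user_request}\n"

p1 = build_prompt("", "fix the tests")
p2 = build_prompt("(earlier turns...)", "now run them")
assert p1.startswith(stable_prefix())  # byte-identical prefix on both turns
```

Keeping the prefix byte-identical across turns matters beyond avoiding local rebuild work: API-level prompt caches only hit when the cached prefix matches exactly.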
And even the workspace summary usually stays (mostly) the same. The main changes are usually the latest user request, the recent transcript, and maybe the short-term memory. “Smart” runtimes don’t rebuild everything as one giant undifferentiated prompt on every turn, as illustrated in the figure below.

Figure 7: The agent harness builds a stable prompt prefix, adds the changing session state, and then feeds that combined prompt to the model.

The main difference from section 1 is that section 1 was about gathering repo facts. Here, we are now interested in packaging and caching those facts efficiently for repeated model calls. “Stable” in “stable prompt prefix” means that the information contained there doesn’t change much. It usually contains the general instructions, tool descriptions, and the workspace summary. We don’t want to waste compute on rebuilding it from scratch in each interaction if nothing important has changed. The other components are updated more frequently (usually each turn). This includes short-term memory, the recent transcript, and the newest user request. In short, the caching aspect of the stable prompt prefix is simply that a smart runtime tries to reuse that part.

3. Tool Access and Use

Tool access and tool use are where it starts to feel less like chat and more like an agent. A plain model can suggest commands in prose, but an LLM in a coding harness should do something narrower and more useful: actually execute the command and retrieve the results (versus us calling the command manually and pasting the results back into the chat). But instead of letting the model improvise arbitrary syntax, the harness usually provides a pre-defined list of allowed, named tools with clear inputs and clear boundaries. (But of course, something like Python can be part of this so that the agent could also execute an arbitrarily wide range of shell commands.) The tool-use flow is illustrated in the figure below.
Figure 8: The model emits a structured action, the harness validates it, optionally asks for approval, executes it, and feeds the bounded result back into the loop.

To illustrate this, below is an example of how this usually looks to the user in my Mini Coding Agent. (This is not as pretty as Claude Code or Codex because it is very minimal and uses plain Python without any external dependencies.)

Figure 9: Illustration of a tool call approval request in the Mini Coding Agent.

Here, the model has to choose an action that the harness recognizes, like list files, read a file, search, run a shell command, write a file, etc. It also has to provide arguments in a shape that the harness can check. So when the model asks to do something, the runtime can stop and run programmatic checks like “Is this a known tool?”, “Are the arguments valid?”, “Does this need user approval?”, and “Is the requested path even inside the workspace?” Only after those checks pass does anything actually run. Running coding agents of course carries some risk, but the harness checks also improve reliability because the model doesn’t execute totally arbitrary commands. Also, besides rejecting malformed actions and approval gating, file access can be kept inside the repo by checking file paths. In a sense, the harness gives the model less freedom, but it also improves usability at the same time.

4. Handling Context Bloat

Context bloat is not a unique problem of coding agents but an issue for LLMs in general. Sure, LLMs are supporting longer and longer contexts these days (and I recently wrote about the attention variants that make this computationally more feasible), but long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info). Coding agents are even more susceptible to context bloat than regular LLMs during multi-turn chats because of repeated file reads, lengthy tool outputs, logs, etc. If the runtime keeps all of that at full fidelity, it will run out of available context tokens pretty quickly. So, a good coding harness is usually pretty sophisticated about handling context bloat, beyond just cutting or summarizing information like regular chat UIs. Conceptually, the context compaction in coding agents might work as summarized in the figure below. Specifically, we are zooming a bit further into the clip (step 6) part of Figure 8 in the previous section.

Figure 10: Large outputs are clipped, older reads are deduplicated, and the transcript is compressed before it goes back into the prompt.

A minimal harness uses at least two compaction strategies to manage that problem. The first is clipping, which shortens long document snippets, large tool outputs, memory notes, and transcript entries. In other words, it prevents any one piece of text from taking over the prompt budget just because it happened to be verbose. The second strategy is transcript reduction or summarization, which turns the full session history (more on that in the next section) into a smaller promptable summary. A key trick here is to keep recent events richer because they are more likely to matter for the current step, and to compress older events more aggressively because they are likely less relevant. Additionally, we also deduplicate older file reads so the model does not keep seeing the same file content over and over again just because it was read multiple times earlier in the session.
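The two harness-side guards discussed above, validating a proposed tool call before it runs and clipping its output before it re-enters the prompt, can be sketched as follows. The tool names, the approval rule, and the character limit are made-up illustrations, not any specific harness's behavior:

```python
# Sketch: validate a model-proposed action before running it, then clip the
# result before it goes back into the context. Names/limits are illustrative.
import os

WORKSPACE = os.path.abspath(".")
KNOWN_TOOLS = {"read_file", "run_shell", "write_file"}
MAX_OUTPUT_CHARS = 2000  # prompt budget per tool result

def validate(action):
    """Programmatic checks before anything actually runs."""
    if action.get("tool") not in KNOWN_TOOLS:
        return "unknown tool"            # reject malformed/improvised actions
    path = action.get("args", {}).get("path")
    if path is not None:
        # keep file access inside the workspace
        full = os.path.abspath(os.path.join(WORKSPACE, path))
        if full != WORKSPACE and not full.startswith(WORKSPACE + os.sep):
            return "path escapes workspace"
    if action.get("tool") == "write_file":
        return "needs user approval"     # approval gating for risky tools
    return "ok"

def clip(text, limit=MAX_OUTPUT_CHARS):
    """Prevent one verbose output from taking over the prompt budget."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n...[clipped {len(text) - limit} chars]"

assert validate({"tool": "launch_rockets"}) == "unknown tool"
assert validate({"tool": "read_file", "args": {"path": "main.py"}}) == "ok"
assert validate({"tool": "write_file", "args": {"path": "main.py"}}) == "needs user approval"
assert clip("x" * 5000).endswith("[clipped 3000 chars]")
```

Note the path check resolves the path with `os.path.abspath` first, so traversal attempts like `../../etc/passwd` are caught even though they are syntactically "inside" the workspace string.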
Overall, I think this is one of the underrated, boring parts of good coding-agent design. A lot of apparent “model quality” is really context quality.

5. Structured Session Memory

In practice, all six core components covered here are highly intertwined, and the different sections and figures cover them with different focuses or zoom levels. In the previous section, we covered prompt-time use of history and how we build a compact transcript. The question there is: how much of the past should go back into the model on the next turn? So the emphasis is compression, clipping, deduplication, and recency. This section, structured session memory, is about the storage-time structure of history. The question here is: what does the agent keep over time as a permanent record? So the emphasis is that the runtime keeps a fuller transcript as durable state, alongside a lighter memory layer that is smaller and gets modified and compacted rather than just appended to. To summarize, a coding agent separates state into (at least) two layers:

- working memory: the small, distilled state the agent keeps explicitly
- a full transcript: this covers all the user requests, tool outputs, and LLM responses

Figure 11: New events get appended to a full transcript and summarized in a working memory. The session files on disk are usually stored as JSON files.

The figure above illustrates the two main session files, the full transcript and the working memory, which usually get stored as JSON files on disk. As mentioned before, the full transcript stores the whole history, and it’s resumable if we close the agent. The working memory is more of a distilled version with the currently most important info, which is somewhat related to the compact transcript. But the compact transcript and working memory have slightly different jobs. The compact transcript is for prompt reconstruction.
Its job is to give the model a compressed view of recent history so it can continue the conversation without seeing the full transcript every turn. The working memory is meant more for task continuity. Its job is to keep a small, explicitly maintained summary of what matters across turns: things like the current task, important files, and recent notes. Following step 4 in the figure above, the latest user request, together with the LLM response and tool output, would then be recorded as a “new event” in both the full transcript and the working memory in the next round (not shown, to reduce clutter in the figure).

6. Delegation With (Bounded) Subagents

Once an agent has tools and state, one of the next useful capabilities is delegation. The reason is that it allows us to parallelize certain work into subtasks via subagents and speed up the main task. For example, the main agent may be in the middle of one task and still need a side answer: which file defines a symbol, what a config says, or why a test is failing. It is useful to split that off into a bounded subtask instead of forcing one loop to carry every thread of work at once. (In my mini coding agent, the implementation is simpler, and the child still runs synchronously, but the underlying idea is the same.) A subagent is only useful if it inherits enough context to do real work. But if we don’t restrict it, we now have multiple agents duplicating work, touching the same files, or spawning more subagents, and so on. So the tricky design problem is not just how to spawn a subagent but also how to bind one :).

Figure 12: The subagent inherits enough context to be useful, but it runs inside tighter boundaries than the main agent.

The trick here is that the subagent inherits enough context to be useful but is also constrained (for example, read-only and restricted in recursion depth). Claude Code has supported subagents for a long time, and Codex added them more recently.
Codex does not generally force subagents into read-only mode. Instead, they usually inherit much of the main agent’s sandbox and approval setup. So, the boundary is more about task scoping, context, and depth.

Components Summary

The sections above tried to cover the main components of coding agents. As mentioned before, they are more or less deeply intertwined in their implementation. However, I hope that covering them one by one helps with the overall mental model of how coding harnesses work, and why they can make an LLM more useful than simple multi-turn chats.

Figure 13: Six main features of a coding harness discussed in previous sections.

If you are interested in seeing these implemented in clean, minimalist Python code, you may like my Mini Coding Agent.

How Does This Compare To OpenClaw?

OpenClaw may be an interesting comparison, but it is not quite the same kind of system. OpenClaw is more like a local, general agent platform that can also code, rather than a specialized (terminal) coding assistant. There are still several overlaps with a coding harness:

- it uses prompt and instruction files in the workspace, such as AGENTS.md, SOUL.md, and TOOLS.md
- it keeps JSONL session files and includes transcript compaction and session management
- it can spawn helper sessions and subagents

However, the emphasis is different. Coding agents are optimized for a person working in a repository and asking a coding assistant to inspect files, edit code, and run local tools efficiently. OpenClaw is more optimized for running many long-lived local agents across chats, channels, and workspaces, with coding as one important workload among several others.

I am excited to share that I finished writing Build a Reasoning Model (From Scratch), and all chapters are now in early access. The publisher is currently working on the layouts, and it should be available this summer. This is probably my most ambitious book so far. I spent about 1.5 years writing it, and a large number of experiments went into it. It is also probably the book I worked hardest on in terms of time, effort, and polish, and I hope you’ll enjoy it. Build a Reasoning Model (From Scratch) is available on Manning and Amazon. The main topics are:

- evaluating reasoning models
- inference-time scaling
- self-refinement
- reinforcement learning
- distillation

There is a lot of discussion around “reasoning” in LLMs, and I think the best way to understand what it really means in the context of LLMs is to implement one from scratch!

Amazon (pre-order)
Manning (complete book in early access, pre-final layout, 528 pages)

Ankur Sethi 1 week ago

I'm no longer using coding assistants on personal projects

I’ve spent the last few months figuring out how best to use LLMs to build software. In January and February, I used Claude Code to build a little programming language in C. In December I used a local LLM to analyze all the journal entries I wrote in 2025, and then used Gemini to write scripts that could visualize that data. Besides what I’ve written about publicly, I’ve also used Claude Code to:

- Write and debug Emacs Lisp for my personal Emacs configuration.
- Write several Alfred workflows (in Bash, AppleScript, and Swift) to automate tasks on my computer.
- Debug CSS issues on this very website.
- Generate React components for a couple of throwaway side projects.
- Generate Django apps for a couple of throwaway side projects.
- Port color themes between text editors.
- A lot more that I’m forgetting now.

I won’t lie, I started off skeptical about the ability of LLMs to write code, but I can’t deny the fact that, in 2026, they can produce code that’s as good as or better than a junior-to-intermediate developer’s for most programming domains. If you’re abstaining from learning about or using LLMs in your own work, you’re doing a disservice to yourself and your career. It’s a very real possibility that in five years, most of the code we write will be produced using an LLM. It’s not a certainty, but it’s a strong possibility. However, I’m not going to stop writing code by hand. Not anytime soon. As long as there are computers to program, I will be programming them using my own two fleshy human hands. I started programming computers because I enjoy the act of programming. I enjoy thinking through problems, coming up with solutions, evolving those solutions so that they are as correct and clear as possible, and then putting them out into the world where they can be of use to people. It’s a fun and fulfilling profession. Some people see the need for writing code as an impediment to getting good use out of a computer. In fact, some of the most avid fans of generative AI believe that the act of actually doing the work is a punishment. They see work as unnecessary friction that must be optimized away. Truth is, the friction inherent in doing any kind of work—writing, programming, making music, painting, or any other creative activity generative AI purports to replace—is the whole point. The artifacts you produce as the result of your hard work are not important. They are incidental. The work itself is the point. When you do the work, you change and grow and become more yourself. Work—especially creative work—is an act of self-love if you choose to see it that way. Besides, when you rely on generative AI to do the work, you miss out on the pleasurable sensations of being in flow state. Your skills atrophy (no, writing good prompts is not a skill, any idiot can do it). Your brain gets saturated with dopamine in the same way as when you gamble, doomscroll, or play a gacha game. Using Claude Code as your main method of producing code is like scrolling TikTok eight hours a day, every day, for work. And the worst part? The code you produce using LLMs is pure cognitive debt. You have no idea what it’s doing, only that it seems to be doing what you want it to do. You don’t have a mental model for how it works, and you can’t fix it if it breaks in production. Such a codebase is not an asset but a liability. I predict that in 1-3 years we’re going to see organizations rewrite their LLM-generated software using actual human programmers. Personally, I’ve stopped using generative AI to write code for my personal projects. I still use Claude Code as a souped-up search engine to look up information, or to help me debug nasty errors. But I’m manually typing every single line of code in my current Django project, with my own fingers, using a real physical keyboard. I’m even thinking up all the code using my own brain. Miraculous! For the commercial projects I work on for my clients, I’m going to follow whatever the norms around LLM use happen to be at my workplace. If a client requires me to use Claude Code to write every single line of code, I’ll be happy to oblige. If they ban LLMs outright, I’m fine with that too. After spending hundreds of hours yelling at Claude, I’m dangerously proficient at getting it to do the right thing. But I haven’t lost my programming skills yet, and I don’t plan to. I’m flexible. Given the freedom to choose, I’d probably pick a middle path: use LLMs to generate boilerplate code, write tricky test cases, debug nasty issues I can’t figure out, and quickly prototype ideas to test. I’m not an AI vegan. But when it comes to code I write for myself—which includes the code that runs this website—I’m going to continue writing it myself, line by line, like I always did. Somebody has to clean up after the robots when they make a mess, right?

Armin Ronacher 1 week ago

Absurd In Production

About five months ago I wrote about Absurd, a durable execution system we built for our own use at Earendil, sitting entirely on top of Postgres and Postgres alone. The pitch was simple: you don’t need a separate service, a compiler plugin, or an entire runtime to get durable workflows. You need a SQL file and a thin SDK. Since then we’ve been running it in production, and I figured it’s worth sharing what the experience has been like. The short version: the design held up, the system has been a pleasure to work with, and other people seem to agree. Absurd is a durable execution system that lives entirely inside Postgres. The core is a single SQL file (absurd.sql) that defines stored procedures for task management, checkpoint storage, event handling, and claim-based scheduling. On top of that sit thin SDKs (currently TypeScript, Python, and an experimental Go one) that make the system ergonomic in your language of choice. The model is straightforward: you register tasks, decompose them into steps, and each step acts as a checkpoint. If anything fails, the task retries from the last completed step. Tasks can sleep, wait for external events, and suspend for days or weeks. All state lives in Postgres. If you want the full introduction, the original blog post covers the fundamentals. What follows here is what we’ve learned since. The project got multiple releases over the last five months. Most of the changes are things you’d expect from a system that people actually started depending on: hardened claim handling, watchdogs that terminate broken workers, deadlock prevention, proper lease management, event race conditions, and all the edge cases that only show up when you’re running real workloads. A few things worth calling out specifically. Decomposed steps. The original design only had , where you pass in a function and get back its checkpointed result. That works well for many cases but not all.
Sometimes you need to know whether a step already ran before deciding what to do next. So we added / , which give you a handle you can inspect before committing the result. This turned out to be very useful for modeling intentional failures and conditional logic. This in particular is necessary when working with “before call” and “after call” type hook APIs.

Task results. You can now spawn a task, go do other things, and later come back to fetch or await its result. This sounds obvious in hindsight, but the original system was purely fire-and-forget. Having proper result inspection made it possible to use Absurd for things like spawning child tasks from within a parent workflow and waiting for them to finish. This is particularly useful for debugging with agents too.

absurdctl. We built this out as a proper CLI tool. You can initialize schemas, run migrations, create queues, spawn tasks, emit events, retry failures from the command line. It’s installable via or as a standalone binary. This has been invaluable for debugging production issues. When something is stuck, being able to just and see exactly where it stopped is a very different experience from digging through logs.

Habitat. A small Go application that serves up a web dashboard for monitoring tasks, runs, checkpoints, and events. It connects directly to Postgres and gives you a live view of what’s happening. It’s simple, but it’s the kind of thing that makes the system more enjoyable for humans.

Agent integration. Since Absurd was originally built for agent workloads, we added a bundled skill that coding agents can discover and use to debug workflow state via . There’s also a documented pattern for making pi agent turns durable by logging each message as a checkpoint.

The thing I’m most pleased about is that the core design didn’t need to change all that much. The fundamental model of tasks, steps, checkpoints, events, and suspending is still exactly what it was initially.
We added features around it, but nothing forced us to rethink the basic abstractions. Putting the complexity in SQL and keeping the SDKs thin turned out to be a genuinely good call. The TypeScript SDK is about 1,400 lines. The Python SDK is about 1,900, but most of that comes from the complexity of supporting colored functions. Compare that to Temporal’s Python SDK at around 170,000 lines. It means the SDKs are easy to understand, easy to debug, and easy to port. When something goes wrong, you can read the entire SDK in an afternoon and understand what it does.

The checkpoint-based replay model also aged well. Unlike systems that require deterministic replay of your entire workflow function, Absurd just loads the cached step results and skips over completed work. That means your code doesn’t need to be deterministic outside of steps. You can call or in between steps and things still work, because only the step boundaries matter. In practice, this makes it much easier to reason about what’s safe and what isn’t.

Pull-based scheduling was the right choice too. Workers pull tasks from Postgres as they have capacity. There’s no coordinator, no push mechanism, no HTTP callbacks. That makes it trivially self-hostable and means you don’t have to think about load management at the infrastructure level.

I had some discussions with folks about whether the right abstraction should have been a durable promise. It’s a very appealing idea, but it turns out to be much more complex to implement in practice. In theory, though, it is also more powerful. I made some attempts to see what Absurd would look like if it were based on durable promises, but so far I haven’t gotten anywhere. It’s an experiment that I think would be fun to try, though!

The primary use case is still agent workflows. An agent is essentially a loop that calls an LLM, processes tool results, and repeats until it decides it’s done. Each iteration becomes a step, and each step’s result is checkpointed.
If the process dies on iteration 7, it restarts and replays iterations 1 through 6 from the store, then continues from 7. But we’ve found it useful for a lot of other things too. All our crons just dispatch distributed workflows with a pre-generated deduplication key from the invocation. We can have two cron processes running and they will only trigger one Absurd task invocation. We also use it for background processing that needs to survive deploys. Basically anything where you’d otherwise build your own retry-and-resume logic on top of a queue.

Absurd is deliberately minimal, but there are things I’d like to see. There’s no built-in scheduler. If you want cron-like behavior, you run your own scheduler loop and use idempotency keys to deduplicate. That works, and we have a documented pattern for it, but it would be nice to have something more integrated.

There’s no push model. Everything is pull. If you need an HTTP endpoint to receive webhooks and wake up tasks, you build that yourself. I think that’s the right default, as push systems are harder to operate and easier to overwhelm, but there are cases where it would be convenient. In particular, there are quite a few agentic systems where it would be super nice to have webhooks natively integrated (wake on incoming POST request). I definitely don’t want to have this in the core, but it sounds like the kind of problem that could make for a nice adjacent library that builds on top of Absurd.

The biggest omission is that it does not support partitioning yet. That’s unfortunate because it makes cleaning up data more expensive than it has to be. In theory, supporting partitions would be pretty simple. You could have weekly partitions and then detach and delete them when they expire. The only thing that really stands in the way is that Postgres does not have a convenient way of actually doing that. The hard part is not partitioning itself, it’s partition lifecycle management under real workloads.
If a worker inserts a row whose lands in a month without a partition, the insert fails and the workflow crashes. So you need a separate maintenance loop that always creates future partitions far enough ahead for sleeps/retries, and does that for every queue. On the delete side, the safe approach is , but getting that to run from doesn’t work because it cannot be run within a transaction, but runs everything in one. I don’t think it’s an unsolvable problem, but it’s one I have not found a good solution for, and I would love to get input on .

This brings me to a meta point: what is the point of Open Source libraries in the age of agentic engineering? Durable execution is now something that plenty of startups sell you. On the other hand, it’s also something that an agent could build for you, and people might not even look for existing solutions any more. It’s kind of … weird? I don’t think a durable execution library can support a company, I really don’t. On the other hand, I think it’s just complex enough of a problem that it could be a good Open Source project free of commercial interests. You do need a bit of an ecosystem around it, particularly for UI and good DX for debugging, and that’s hard to get from a throwaway implementation. I don’t think we have squared this yet, but it’s already much better to use than a few months ago.

If you’re using Absurd, thinking about it, or building adjacent ideas, I’d love your feedback. Bug reports, rough edges, design critiques, and contributions are all very welcome—this project has gotten better every time someone poked at it from a different angle.
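The step/checkpoint model at the heart of all this can be sketched in a few lines of Python. This is a toy, not Absurd's actual SDK: the names `TaskRun` and `step` are made up, and an in-memory dict stands in for the Postgres checkpoint store. But it shows the behavior described above: on retry, completed steps are skipped and their cached results are reused, so code between steps doesn't need to be deterministic.

```python
# Toy sketch of checkpoint-based replay. Names (TaskRun, step) are made up;
# an in-memory dict stands in for Absurd's Postgres checkpoint store.

class TaskRun:
    def __init__(self, checkpoints):
        self.checkpoints = checkpoints  # step name -> cached result

    def step(self, name, fn):
        if name in self.checkpoints:       # step already completed: replay
            return self.checkpoints[name]
        result = fn()                      # first execution
        self.checkpoints[name] = result    # "persist" before moving on
        return result

calls = []

def workflow(run):
    a = run.step("fetch", lambda: calls.append("fetch") or 41)
    return run.step("process", lambda: calls.append("process") or a + 1)

store = {}
assert workflow(TaskRun(store)) == 42   # first run executes both steps
assert workflow(TaskRun(store)) == 42   # a "crashed" retry replays from cache
assert calls == ["fetch", "process"]    # each step body ran exactly once
```

A real system persists each checkpoint transactionally before continuing, which is exactly what keeping the core in Postgres stored procedures buys.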

Max Bernstein 1 week ago

Value numbering

Welcome back to compiler land. Today we’re going to talk about value numbering , which is like SSA, but more. Static single assignment (SSA) gives names to values: every expression has a name, and each name corresponds to exactly one expression. It transforms programs like this: where the variable is assigned more than once in the program text, into programs like this: where each assignment to has been replaced with an assignment to a new fresh name. It’s great because it makes clear the differences between the two expressions. Though they textually look similar, they compute different values. The first computes 1 and the second computes 2. In this example, it is not possible to substitute in a variable and re-use the value of , because the s are different. But what if we see two “textually” identical instructions in SSA? That sounds much more promising than non-SSA because the transformation into SSA form has removed (much of) the statefulness of it all. When can we re-use the result? Identifying instructions that are known at compile-time to always produce the same value at run-time is called value numbering . To understand value numbering, let’s extend the above IR snippet with two more instructions, v3 and v4. In this new snippet, v3 looks the same as v1: adding v0 and 1. Assuming our addition operation is some ideal mathematical addition, we can absolutely re-use v1; no need to compute the addition again. We can rewrite the IR to something like: This is kind of similar to the destructive union-find representation that JavaScriptCore and a couple other compilers use, where the optimizer doesn’t eagerly re-write all uses but instead leaves a little breadcrumb / instruction 1 . We could then run our copy propagation pass (“union-find cleanup”?) and get: Great. But how does this happen? How does an optimizer identify reusable instruction candidates that are “textually identical”? Generally, there is no actual text in the IR . 
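Backing up a moment: the union-find breadcrumb just mentioned can be sketched in a few lines. This is an illustrative toy, not JavaScriptCore's actual representation. Instead of eagerly rewriting every use, the optimizer points the duplicate instruction at its equivalent, and a later copy-propagation pass chases the forwarding pointers.

```python
# Toy sketch of the breadcrumb idea: point a duplicate instruction at its
# equivalent and let a later copy-propagation pass resolve the chain.

class Instr:
    def __init__(self, op, args=()):
        self.op, self.args = op, list(args)
        self.forwarded = None              # breadcrumb left by the optimizer

    def find(self):
        # Union-find style lookup: follow forwarding pointers to the
        # representative instruction.
        result = self
        while result.forwarded is not None:
            result = result.forwarded
        return result

v0 = Instr("param")
v1 = Instr("add", [v0, 1])
v3 = Instr("add", [v0, 1])                 # textually identical to v1

v3.forwarded = v1                          # leave the breadcrumb
assert v3.find() is v1                     # copy propagation resolves v3 to v1
assert v1.find() is v1                     # representatives resolve to themselves
```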
One popular solution is to compute a hash of each instruction. Then any instructions with the same hash (that also compare equal, in case of collisions) are considered equivalent. This is called hash-consing . When trying to figure all this out, I read through a couple of different implementations. I particularly like the Maxine VM implementation. For example, here is the (hashing) and functions for most binary operations, slightly modified for clarity: The rest of the value numbering implementation assumes that if a function returns 0, it does not wish to be considered for value numbering. Why might an instruction opt-out of value numbering? An instruction might opt out of value numbering if it is not “pure”. Some instructions are not pure. Purity is in the eye of the beholder, but in general it means that an instruction does not interact with the state of the outside world, except for trivial computation on its operands. (What does it mean to de-duplicate/cache/reuse ?) A load from an array object is also not a pure operation 2 . The load operation implicitly relies on the state of the memory. Also, even if the array was known-constant, in some runtime systems, the load might raise an exception. Changing the source location where an exception is raised is generally frowned upon. Languages such as Java often have requirements about where exceptions are raised codified in their specifications. We’ll work only on pure operations for now, but we’ll come back to this later. We do often want to optimize impure operations as well! We’ll start off with the simplest form of value numbering, which operates only on linear sequences of instructions, like basic blocks or traces. Let’s build a small implementation of local value numbering (LVN). We’ll start with straight-line code—no branches or anything tricky. Most compiler optimizations on control-flow graphs (CFGs) iterate over the instructions “top to bottom” 3 and it seems like we can do the same thing here too. 
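Here is what such a top-to-bottom pass can look like: a toy local value numbering pass in the spirit of the toy optimizer series, with a made-up IR encoding of `(dest, op, operands)` tuples. The dict keyed on `(op, operands)` is the hash-consing table; Python dict keys combine hashing with equality checks, which handles collisions.

```python
# Toy LVN over straight-line SSA code. The IR encoding is made up for
# illustration: each instruction is a (dest, op, operands) tuple.

def local_value_number(block):
    seen = {}    # (op, canonical operands) -> dest of first occurrence
    canon = {}   # union-find-lite: eliminated dest -> surviving dest
    out = []
    for dest, op, operands in block:
        # Rewrite operands through earlier replacements first.
        operands = tuple(canon.get(a, a) for a in operands)
        key = (op, operands)
        if key in seen:
            canon[dest] = seen[key]        # duplicate: reuse earlier value
        else:
            seen[key] = dest
            out.append((dest, op, operands))
    return out

block = [
    ("v1", "add", ("v0", 1)),
    ("v2", "add", ("v1", 1)),
    ("v3", "add", ("v0", 1)),   # duplicate of v1
    ("v4", "add", ("v3", 1)),   # operands canonicalize to (v1, 1): dup of v2
]
assert local_value_number(block) == [
    ("v1", "add", ("v0", 1)),
    ("v2", "add", ("v1", 1)),
]
```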
From what we’ve seen so far optimizing our made-up IR snippet, we can do something like this: The find-and-replace, remember, is not a literal find-and-replace, but instead something like: (if you have been following along with the toy optimizer series) This several-line function (as long as you already have a hash map and a union-find available to you) is enough to build local value numbering! And real compilers are built this way, too. If you don’t believe me, take a look at this slightly edited snippet from Maxine’s value numbering implementation. It has all of the components we just talked about: iterating over instructions, map lookup, and some substitution. This alone will get you pretty far. Code generators of all shapes tend to leave messy repeated computations all over their generated code and this will make short work of them. Sometimes, though, your computations are spread across control flow—over multiple basic blocks. What do you do then? Computing value numbers for an entire function is called global value numbering (GVN) and it requires dealing with control flow (if, loops, etc). I don’t just mean that for an entire function, we run local value numbering block-by-block. Global value numbering implies that expressions can be de-duplicated and shared across blocks. Let’s tackle control flow case by case. First is the simple case from above: one block. In this case, we can go top to bottom with our value numbering and do alright. The second case is also reasonable to handle: one block flowing into another. In this case, we can still go top to bottom. We just have to find a way to iterate over the blocks. If we’re not going to share value maps between blocks, the order doesn’t matter. But since the point of global value numbering is to share values, we have to iterate them in topological order (reverse post order (RPO)). This ensures that predecessors get visited before successors. If you have , we have to visit first and then . 
Because of how SSA works and how CFGs work, the second block can “look up” into the first block and use the values from it. To get global value numbering working, we have to copy ’s value map before we start processing so we can re-use the instructions. Maybe something like: Then the expressions can accrue across blocks. can re-use the already-computed from because it is still in the map. …but this breaks as soon as you have control-flow splits. Consider the following shape graph: We’re going to iterate over that graph in one of two orders: A B C or A C B. In either case, we’re going to be adding all this stuff into the value map from one block (say, B) that is not actually available to its sibling block (say, C). When I say “not available”, I mean “would not have been computed before”. This is because we execute either A then B or A then C. There’s no world in which we execute B then C. But alright, look at a third case where there is such a world: a control-flow join. In this diagram, we have two predecessor blocks B and C each flowing into D. In this diagram, B always flows into D and also C always flows into D. So the iterator order is fine, right? Well, still no. We have the same sibling problem as before. B and C still can’t share value maps. We also have a weird question when we enter D: where did we come from? If we came from B, we can re-use expressions from B. If we came from C, we can re-use expressions from C. But we cannot in general know which predecessor block we came from. The only block we know for sure that we executed before D is A. This means we can re-use A’s value map in D because we can guarantee that all execution paths that enter D have previously gone through A. This relationship is called a dominator relationship and this is the key to one style of global value numbering that we’re going to talk about in this post. A block can always use the value map from any other block that dominates it. 
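Assuming dominator information is available, that rule can be sketched as a small extension of the local pass: each block starts from a copy of its immediate dominator's value map. This is a toy, not Maxine's actual code, and it reuses the same made-up `(dest, op, operands)` IR encoding.

```python
# Toy dominator-based GVN: visit blocks in reverse post-order, seeding each
# block's value map with a copy of its immediate dominator's map, so only
# expressions guaranteed to have executed are reused.

def gvn(blocks, rpo, idom):
    # blocks: name -> list of (dest, op, operands)
    # idom: name -> immediate dominator name (None for the entry block)
    maps = {}    # block name -> its final value map
    canon = {}   # eliminated dest -> surviving dest
    for name in rpo:
        seen = dict(maps[idom[name]]) if idom[name] else {}
        new_body = []
        for dest, op, operands in blocks[name]:
            operands = tuple(canon.get(a, a) for a in operands)
            key = (op, operands)
            if key in seen:
                canon[dest] = seen[key]
            else:
                seen[key] = dest
                new_body.append((dest, op, operands))
        blocks[name] = new_body
        maps[name] = seen
    return blocks

# Diamond: A -> B, A -> C, B -> D, C -> D. D's immediate dominator is A,
# so D may reuse A's expressions but not B's or C's.
blocks = {
    "A": [("v1", "add", ("v0", 1))],
    "B": [("v2", "add", ("v0", 1))],   # duplicate of A's v1: removable
    "C": [],
    "D": [("v3", "add", ("v0", 1))],   # duplicate of A's v1: removable
}
idom = {"A": None, "B": "A", "C": "A", "D": "A"}
result = gvn(blocks, ["A", "B", "C", "D"], idom)
assert result["B"] == [] and result["D"] == []
assert result["A"] == [("v1", "add", ("v0", 1))]
```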
For completeness’ sake, in the diamond diagram, A dominates each of B and C, too. We can compute dominators a couple of ways 4 , but that’s a little bit out of scope for this blog post. If we assume that we have dominator information available in our CFG, we can use that for global value numbering. And that’s just what—you guessed it—Maxine VM does. It iterates over all blocks in reverse post-order, doing local value numbering, threading through value maps from dominator blocks. In this case, their method gets the immediate dominator: the “closest” dominator block of all the blocks that dominate the current one. And that’s it! That’s the core of Maxine’s GVN implementation. I love how short it is. For not very much code, you can remove a lot of duplicate pure SSA instructions. This does still work with loops, but with some caveats. From p7 of Briggs GVN:

The φ-functions require special treatment. Before the compiler can analyze the φ-functions in a block, it must previously have assigned value numbers to all of the inputs. This is not possible in all cases; specifically, any φ-function input whose value flows along a back edge (with respect to the dominator tree) cannot have a value number. If any of the parameters of a φ-function have not been assigned a value number, then the compiler cannot analyze the φ-function, and it must assign a unique, new value number to the result.

It also talks about eliminating useless phis, which is optional, but would strengthen the global value numbering pass: it makes more information transparent. But what if we want to handle impure instructions? Languages such as Java allow for reading fields from the / object within methods as if the field were a variable name. This makes code like the following common: Each of these references to and is an implicit reference to or , which is semantically a field load off an object.
You can see it in the bytecode (thanks, Matt Godbolt): When straightforwardly building an SSA IR from the JVM bytecode for this method, you will end up with a bunch of IR that looks like this: Pretty much the same as the bytecode. Even though no code in the middle could modify the field (which would require a re-load), we still have a duplicate load. Bummer. I don’t want to re-hash this too much, but it’s possible to fold load and store forwarding into your GVN implementation by either: See, there’s nothing fundamentally stopping you from tracking the state of your heap at compile-time across blocks. You just have to do a little more bookkeeping. In our dominator-based GVN implementation, for example, you can: Not so bad. Maxine doesn’t do global memory tracking, but they do a limited form of load-store forwarding while building their HIR from bytecode: see GraphBuilder, which uses the MemoryMap to help track this stuff. At least they would not have the same duplicate instructions in the example above! We’ve now looked at one kind of value numbering and one implementation of it. What else is out there? Apparently, you can get better results by having a unified hash table (p9 of Briggs GVN) of expressions, not limiting the value map to dominator-available expressions. I’m not 100% sure how this works yet. They note:

Using a unified hash-table has one important algorithmic consequence. Replacements cannot be performed on-line because the table no longer reflects availability.

Which is the first time that it occurred to me that hash-based value numbering with dominators was an approximation of available expression analysis. There’s also a totally different kind of value numbering called value partitioning (p12 of Briggs GVN). See also a nice blog post about this by Allen Wang from the Cornell compiler course. I think this mostly replaces the hashing bit, and you still need some other thing for the available expressions bit.
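Compilers like Cranelift track this availability information with scoped hash maps instead of cloning the dominator's table per block. A toy sketch of that idea using Python's `ChainMap` (not Cranelift's actual implementation): walk the dominator tree, pushing a scope on entry to each child, so a block sees exactly the expressions of its dominators.

```python
# Toy sketch of a scoped value map: each recursive call layers a new scope
# with ChainMap.new_child(), so a block sees its dominators' expressions
# without copying their tables; the scope is discarded on return (the "pop").
from collections import ChainMap

def walk(domtree, block, scope, removed, blocks):
    # domtree: block -> children in the dominator tree
    for dest, op, operands in blocks[block]:
        key = (op, operands)
        if key in scope:
            removed.append(dest)           # duplicate: scope[key] survives
        else:
            scope[key] = dest              # recorded in the current scope
    for child in domtree.get(block, []):
        walk(domtree, child, scope.new_child(), removed, blocks)

blocks = {
    "A": [("v1", "add", ("v0", 1))],
    "B": [("v2", "add", ("v0", 1))],       # A dominates B: duplicate found
}
removed = []
walk({"A": ["B"]}, "A", ChainMap(), removed, blocks)
assert removed == ["v2"]
```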
Ben Titzer and Seth Goldstein have some good slides from CMU, where they talk about the worklist dataflow approach. Apparently this is slower but gets you more available expressions than just looking to dominator blocks. I wonder how much it differs from dominator+unified hash table. While Maxine uses hash table cloning to copy value maps from dominator blocks, there are also compilers such as Cranelift that use scoped hash maps to track this information more efficiently. (Though Amanieu notes that you may not need a scoped hash map and instead can tag values in your value map with the block they came from, ignoring non-dominating values with a quick check. The dominance check makes sense but I haven’t internalized how this affects the set of available expressions yet.) You may be wondering if this kind of algorithm even helps at all in a dynamic language JIT context. Surely everything is too dynamic, right? Actually, no! The JIT hopes to eliminate a lot of method calls and dynamic behaviors, replacing them with guards, assumptions, and simpler operations. These strength reductions often leave behind a lot of repeated instructions. Just the other day, Kokubun filed a value-numbering-like PR to clean up some of the waste. ART has a recent blog post about speeding up GVN. Go forth and give your values more numbers.

There’s been an ongoing discussion with Phil Zucker on SSI, GVN, acyclic egraphs, and scoped union-find. TODO summarize:

- Commutativity; canonicalization
- Seeding alternative representations into the GVN
- Aegraphs and union-find during GVN
- https://github.com/bytecodealliance/rfcs/blob/main/accepted/cranelift-egraph.md
- https://github.com/bytecodealliance/wasmtime/issues/9049
- https://github.com/bytecodealliance/wasmtime/issues/4371

Writing this post is roughly the time when I realized that the whole time I was wondering why Cinder did not use union-find for rewriting, it actually did!
Optimizing instruction by replacing with followed by copy propagation is equivalent to union-find.  ↩

In some forms of SSA, like heap-array SSA or sea of nodes, it’s possible to more easily de-duplicate loads because the memory representation has been folded into (modeled in) the IR.  ↩

The order is a little more complicated than that: reverse post-order (RPO). And there’s a paper called “A Simple Algorithm for Global Data Flow Analysis Problems” that I don’t yet have a PDF for that claims that RPO is optimal for solving dataflow problems.  ↩

There’s the iterative dataflow way (described in the Cooper paper (PDF)), Lengauer-Tarjan (PDF), the Engineered Algorithm (PDF), the hybrid/Semi-NCA approach (PDF), …  ↩

The local value numbering sketch, as a list:

- initialize a map from instruction numbers to instruction pointers
- for each instruction, if it wants to participate in value numbering:
  - if its value number is already in the map, replace all pointers to it in the rest of the program with the corresponding value from the map
  - otherwise, add it to the map

The two options for folding load-store forwarding into GVN:

- doing load-store forwarding as part of local value numbering and clearing memory information from the value map at the end of each block, or
- keeping track of effects across blocks

The bookkeeping for tracking memory across blocks:

- track heap write effects for each block
- at the start of each block B, union all of the “kill” sets for every block back to its immediate dominator
- finally, remove the stuff that got killed from the dominator’s value map

Giles's blog 1 week ago

Writing an LLM from scratch, part 32h -- Interventions: full fat float32

This is the last of the interventions I'm trying out to see if I can improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Back when I did my first training run for a base model, on my local RTX 3090, I used two optimisations: The first of those boosted training speed from 12,599 tokens per second to 15,402 in my test harness, while AMP on its own boosted it to 19,921 tps (and also allowed me to increase the batch size from 5 to 6). Doing both appeared to hit some kind of diminishing returns -- it maxed out at 19,997 tps, only a little better than AMP on its own. But intuitively, you'd expect that might come at a cost. While I'm sure the PyTorch developers have a solid understanding of where switching to 16-bit will have a minimal impact on training quality, it seems too good to be true that it would have no impact at all. Let's see what happens if we switch both of these optimisations off! I added a new flag to the config file for the training harness, with a default of 1 . The core implementation was pretty simple; where we had the call to , we needed to guard it: ...and where we did the forward pass and the loss calculation, we had to not wrap it in a : We also had to avoid unscaling when clipping gradients; I did that by just not creating a scaler when in non-AMP mode, and then: ...and likewise, instead of using the scaler to step the optimiser, we step it directly if we don't have one: However, there was an issue: non-finite gradients. As I discovered when looking into gradient clipping, the scaler was actually doing something quite useful for us. Somewhat buried in the AMP recipes page is a comment: Now, from the gradient clipping train, I'd come to the conclusion that we were occasionally getting non-finite gradients, and the scaler was saving us from applying junk updates when that happened.
If our new code was stepping the optimiser directly, we'd not have that safety net. We'd need something to save us from that. My first cut at this was to use the one other API feature I'd seen that handled non-finite gradients for you: has a parameter, so if we were using gradient clipping, we could set that to and use the exception to skip stepping the optimiser if it was raised. To avoid actually doing any gradient clipping when that happened, if we did not have gradient clipping explicitly enabled, we could set the to infinity. Here's the code for that version . I wasn't very happy with it, though. The use of a gradient clipping API just for its side-effect of telling us about non-finite gradients felt a bit ugly, and even worse, the exception it raised was just a generic , not a custom exception type, which meant that I had to distinguish between it and other by looking at the exception message -- not terribly safe, as that's something that could easily change in the future. So I switched to a more explicit, simpler version: scan through the parameters looking for non-finite gradients, and skip the optimiser step if any are found: I did have some concerns about the performance impact of that; on my local machine it took about 0.13 seconds to scan all of the parameters like that for one step. However, it's better than failing to train the model at all due to garbage updates! So with that, it was time to do the training run. It was pretty clear that I would not be able to run this with my normal microbatch size of 12 on the 8x A100 40 GiB machines that I'd been using so far for these intervention tests -- AMP and the lower-precision matrix multiplications save a bit of VRAM, and I was already pretty much at the limit of what would fit in there. 
Changing the batch size would make this a poor test of the effects of removing the FP precision stuff in isolation, so I decided that the safest minimal change was to use a machine with more VRAM -- specifically an 8x A100 80 GiB, as that was the closest to what I was using (switching to eg. H100s would add all kinds of confounding changes). The next problem was getting any kind of machine at all! Lambda (they appear to have rebranded away from "Lambda Labs") very rarely seemed to have any available instances, never mind the specific type that I wanted. Eventually, I put together a system to poll their API and launch an instance when one was available. At 3:25am today 2 , I got a Telegram message from the script saying that it had managed to find and start one. I kicked off the training run, and watched as it got started. I could see it was using 43.8 GiB/GPU, so it definitely did need the larger instance type. And it quickly became clear that this was going to be a long one -- it was estimating 8 hours to do the complete run! In a way that was good news, though, as I could just set an alarm and go to bed. When I woke up, it was done: That's 8h7m. For comparison, the baseline train took 3h24m, so we're taking more than double the time. Cost-wise, things were even worse -- more than US$135 in server costs, because as well as needing the server for much longer, being a larger machine it cost US$16.48/hour rather than $11.84. So that's more than three times as expensive as the US$42 that a typical recent train has cost me (Lambda raised their prices, so it went up from about US$35 in February). Still, at least it looked like a solid run: Very similar to the others we've seen in this series. Time to upload it to Hugging Face Hub , and on to the evals to see if all of this extra cost was worthwhile. Firstly, the smoke test -- how did it complete ? Not bad at all! But the important metric is the loss on the test set, and for that I got 3.679. 
Let's add it to the table to see how that compares to the other training runs: So, a tiny improvement over our baseline. Taking more than twice as long on the training run, and spending three times as much, gained us a loss improvement that's smaller than any other successful intervention. The first question is, did removing AMP and lower-precision matrix multiplications lead to a better model? The answer appears to be "yes" -- but it's a tiny enough difference that it could well be in the noise. But the follow-up has to be, was it worth the extra cost in time and money? And for that I'm certain that the answer is "no". If we'd spent twice the time training with AMP -- on an extra 3B-odd tokens, or on a second epoch with the same 3B -- it seems implausible that the resulting loss would not have been better. And anyway, given that my goal with these interventions is to train the best model I can in two days locally (or 3h30m or so on an 8x A100 40 GiB), it's pretty clear that if we'd cut this run off about halfway through it would have been worse -- and that's not even accounting for it being more memory-hungry. So, I think the takeaway from this is that AMP appears to be a huge win, at least for this model. It has a tiny cost (if any) in model quality, and a huge benefit in training speed, plus a smallish but still useful benefit in training VRAM requirements. 3 And with that, I've reached the end of the interventions that I wanted to try ! Next, I'll need to think through what we need to do to try to stack them up. In particular, is there any easy way to work out whether any of the improvements I've seen might be due to random noise? After all, even though I've been carefully using explicit seeds, each intervention will have changed the way the training run uses the random number stream, and that could easily have an effect. Stay tuned! 
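For reference, the scan-and-skip guard described earlier can be sketched like this. Plain Python floats stand in for tensors so the sketch stays self-contained; in the real training loop the check would be `torch.isfinite` over each parameter's `.grad`, and `grads_are_finite` is a made-up name.

```python
# Sketch of "scan the parameters for non-finite gradients, skip the
# optimiser step if any are found". Plain floats stand in for tensors;
# grads_are_finite is a made-up name for illustration.
import math

class Param:
    def __init__(self, grad):
        self.grad = grad   # list of floats, or None if no gradient

def grads_are_finite(params):
    for p in params:
        if p.grad is None:
            continue
        if any(not math.isfinite(g) for g in p.grad):
            return False   # found an inf/NaN: the update would be junk
    return True

good = [Param([0.1, -0.2]), Param(None)]
bad = [Param([0.1, float("nan")])]

stepped = []
for params in (good, bad):
    if grads_are_finite(params):
        stepped.append(params)   # optimiser.step() would happen here

assert grads_are_finite(good)
assert not grads_are_finite(bad)
assert stepped == [good]         # the junk update was skipped
```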
The name of the flag is not quite right, as of course we're switching off not just AMP but the matrix multiplication precision, but it's a decent shorthand.  ↩

I'm a night owl, so luckily I was still awake.  ↩

I have to admit that I'm very tempted to see what effect even bigger moves in the low-precision direction might have. What if I moved to some kind of 16-bit training, like ? After all, most of the open weights models like Qwen are at least released at that kind of bittedness. But that's one to look into later, I think.  ↩

The two optimisations, for reference:

- Setting the 32-bit floating point matrix multiplication precision to "high" rather than to "highest", which means that it uses lower-precision (but still technically 32-bit) TF32 for those operations rather than normal float32.
- Using PyTorch's Automated Mixed Precision (AMP), which allows it to use 16-bit calculations rather than 32-bit in places where it makes sense to do so.

Anton Zhiyanov 1 week ago

Porting Go's strings package to C

Creating a subset of Go that translates to C was never my end goal. I liked writing C code with Go, but without the standard library it felt pretty limited. So, the next logical step was to port Go's stdlib to C. Of course, this isn't something I could do all at once. I started with the io package , which provides core abstractions like and , as well as general-purpose functions like . But isn't very interesting on its own, since it doesn't include specific reader or writer implementations. So my next choices were naturally and — the workhorses of almost every Go program. This post is about how the porting process went.

Bits and UTF-8 • Bytes • Allocators • Buffers and builders • Benchmarks • Optimizing search • Optimizing builder • Wrapping up

Before I could start porting , I had to deal with its dependencies first: Both of these packages are made up of pure functions, so they were pretty easy to port. The only minor challenge was the difference in operator precedence between Go and C — specifically, bit shifts ( , ). In Go, bit shifts have higher precedence than addition and subtraction. In C, they have lower precedence: The simplest solution was to just use parentheses everywhere shifts are involved:

With and done, I moved on to . The package provides functions for working with byte slices: Some of them were easy to port, like . Here's how it looks in Go: And here's the C version: Just like in Go, the ( → ) macro doesn't allocate memory; it just reinterprets the byte slice's underlying storage as a string. The function (which works like in Go) is easy to implement using from the libc API.

Another example is the function, which looks for a specific byte in a slice. Here's the pure-Go implementation: And here's the C version: I used a regular C loop to mimic Go's :

But and don't allocate memory. What should I do with , since it clearly does? I had a decision to make. The Go runtime handles memory allocation and deallocation automatically.
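Before moving on, the precedence pitfall described above is easy to demonstrate in a couple of lines. This is my own sketch, not code from the article; the function names are made up for illustration:

```c
/*
 * Go vs C operator precedence for bit shifts.
 *
 * Go:  1 << 2 + 3  parses as (1 << 2) + 3 == 7  (shifts bind tighter
 *      than addition).
 * C:   1 << 2 + 3  parses as 1 << (2 + 3) == 32 (shifts bind looser
 *      than addition).
 */

/* What the C compiler actually does with the unparenthesized form. */
int shift_as_c_parses_it(void) {
    return 1 << 2 + 3; /* C reads this as 1 << (2 + 3) == 32 */
}

/* Explicit parentheses recover the Go meaning and work in both languages. */
int shift_with_go_meaning(void) {
    return (1 << 2) + 3; /* (1 << 2) + 3 == 7 */
}
```

This is why "parentheses everywhere shifts are involved" is the safe translation rule: it makes the expression mean the same thing regardless of which language's precedence table applies.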
In C, I had a few options: An allocator is a tool that reserves memory (typically on the heap) so a program can store its data structures there. See Allocators from C to Zig if you want to learn more about them. For me, the winner was clear. Modern systems programming languages like Zig and Odin clearly showed the value of allocators: An is an interface with three methods: , , and . In C, it translates to a struct with function pointers: As I mentioned in the post about porting the io package , this interface representation isn't as efficient as using a static method table, but it's simpler. If you're interested in other options, check out the post on interfaces . By convention, if a function allocates memory, it takes an allocator as its first parameter. So Go's : Translates to this C code: If the caller doesn't care about using a specific allocator, they can just pass an empty allocator, and the implementation will use the system allocator — , , and from libc. Here's a simplified version of the system allocator (I removed safety checks to make it easier to read): The system allocator is stateless, so it's safe to have a global instance: Here's an example of how to call with an allocator: Way better than hidden allocations! Besides pure functions, and also provide types like , , and . I ported them using the same approach as with functions. For types that allocate memory, like , the allocator becomes a struct field: The code is pretty wordy — most C developers would dislike using instead of something shorter like . My solution to this problem is to automatically translate Go code to C (which is actually what I do when porting Go's stdlib). If you're interested, check out the post about this approach — Solod: Go can be a better C . Types that don't allocate, like , need no special treatment — they translate directly to C structs without an allocator field. The package is the twin of , so porting it was uneventful. 
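The allocator convention described above can be sketched roughly like this. This is my own illustration under assumed names (`Allocator`, `dup_string`, etc.), not the article's actual API:

```c
#include <stdlib.h>
#include <stddef.h>
#include <string.h>

/* An allocator as a struct of function pointers plus optional state. */
typedef struct Allocator {
    void *(*alloc)(void *ctx, size_t size);
    void *(*realloc_)(void *ctx, void *ptr, size_t size);
    void (*free_)(void *ctx, void *ptr);
    void *ctx; /* allocator state; NULL for the stateless system allocator */
} Allocator;

/* System allocator: thin wrappers over libc malloc/realloc/free. */
static void *sys_alloc(void *ctx, size_t size) { (void)ctx; return malloc(size); }
static void *sys_realloc(void *ctx, void *ptr, size_t size) { (void)ctx; return realloc(ptr, size); }
static void sys_free(void *ctx, void *ptr) { (void)ctx; free(ptr); }

/* Stateless, so a single global instance is safe. */
const Allocator system_allocator = {sys_alloc, sys_realloc, sys_free, NULL};

/*
 * Convention: a function that allocates takes the allocator as its first
 * parameter, so every allocation is visible at the call site.
 */
char *dup_string(const Allocator *a, const char *s) {
    size_t n = strlen(s) + 1;
    char *out = a->alloc(a->ctx, n);
    if (out != NULL) memcpy(out, s, n);
    return out;
}
```

A caller who doesn't care about the allocation strategy just passes `&system_allocator`; a caller who does can substitute an arena, a tracking allocator, or a failing allocator without touching `dup_string` itself.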
Here's a usage example in Go and C side by side: Again, the C code is just a more verbose version of Go's implementation, plus explicit memory allocation.

What's the point of writing C code if it's slow, right? I decided it was time to benchmark the ported C types and functions against their Go versions. To do that, I ported the benchmarking part of Go's package. Surprisingly, the simplified version was only 300 lines long and included everything I needed: Here's a sample benchmark for the type: Reads almost like Go's benchmarks.

To monitor memory usage, I created — a memory allocator that wraps another allocator and keeps track of allocations: The benchmark gets an allocator through the function and wraps it in a to keep track of allocations: There's no auto-discovery, but the manual setup is quite straightforward.

With the benchmarking setup ready, I ran benchmarks on the package. Some functions did well — about 1.5-2x faster than their Go equivalents: But (searching for a substring in a string) was a total disaster — it was nearly 20 times slower than in Go: The problem was caused by the function we looked at earlier: This "pure" Go implementation is just a fallback. On most platforms, Go uses a specialized version of written in assembly. For the C version, the easiest solution was to use , which is also optimized for most platforms: With this fix, the benchmark results changed drastically: Still not quite as fast as Go, but it's close. Honestly, I don't know why the -based implementation is still slower than Go's assembly here, but I decided not to pursue it any further.

After running the rest of the function benchmarks, the ported versions won all of them except for two: Benchmarking details

is a common way to compose strings from parts in Go, so I tested its performance too. The results were worse than I expected: Here, the C version performed about the same as Go, but I expected it to be faster.
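The fallback-versus-`memchr` distinction above can be sketched like this. The function names (`index_byte_loop`, `index_byte`) are my own; the article's identifiers are not shown in this extract:

```c
#include <string.h>
#include <stddef.h>

/*
 * Find the first occurrence of byte c in buf (like Go's bytes.IndexByte).
 * Returns the index, or -1 if the byte is absent.
 */

/* Portable fallback: a plain loop, analogous to Go's pure-Go version. */
ptrdiff_t index_byte_loop(const unsigned char *buf, size_t len, unsigned char c) {
    for (size_t i = 0; i < len; i++)
        if (buf[i] == c) return (ptrdiff_t)i;
    return -1;
}

/*
 * Optimized version: delegate to memchr, which libc implementations
 * typically vectorize for the target platform -- the same idea as Go
 * falling back to hand-written assembly on most platforms.
 */
ptrdiff_t index_byte(const unsigned char *buf, size_t len, unsigned char c) {
    const unsigned char *p = memchr(buf, c, len);
    return p != NULL ? (ptrdiff_t)(p - buf) : -1;
}
```

Both return the same results; the `memchr` version simply lets libc's platform-specific implementation do the scanning.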
Unlike , is written entirely in Go, so there's no reason the ported version should lose in this benchmark. The method looked almost identical in Go and C: Go's automatically grows the backing slice, while does it manually ( , on the contrary, doesn't grow the slice — it's merely a wrapper). So, there shouldn't be any difference. I had to investigate. Looking at the compiled binary, I noticed a difference in how the functions returned results. Go returns multiple values in separate registers, so uses three registers: one for 8-byte , two for the interface (implemented as two 8-byte pointers). But in C, was a single struct made up of two unions and a pointer: Of course, this 56-byte monster can't be returned in registers — the C calling convention passes it through memory instead. Since is on the hot path in the benchmark, I figured this had to be the issue. So I switched from a single monolithic type to signature-specific types for multi-return pairs: Now, the implementation in C looked like this: is only 16 bytes — small enough to be returned in two registers. Problem solved! But it wasn't — the benchmark only showed a slight improvement. After looking into it more, I finally found the real issue: unlike Go, the C compiler wasn't inlining calls. Adding and moving to the header file made all the difference: 2-4x faster. That's what I was hoping for! Porting and was a mix of easy parts and interesting challenges. The pure functions were straightforward — just translate the syntax and pay attention to operator precedence. The real design challenge was memory management. Using allocators turned out to be a good solution, making memory allocation clear and explicit without being too difficult to use. The benchmarks showed that the C versions outperformed Go in most cases, sometimes by 2-4x. The only exceptions were and , where Go relies on hand-written assembly. 
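The struct-return and inlining points above can be illustrated with a small sketch. The names (`WriteResult`, `write_ok`) are hypothetical, standing in for the article's signature-specific return types:

```c
#include <stddef.h>

/*
 * A generic "result" struct built from unions and pointers came to 56
 * bytes, so the C calling convention returned it through memory. A
 * signature-specific pair like this one is 16 bytes on typical 64-bit
 * targets -- small enough to come back in two registers.
 */
typedef struct {
    size_t n;  /* bytes written */
    int err;   /* 0 means success */
} WriteResult;

/*
 * Marking small hot-path helpers `static inline` (and defining them in
 * the header) is what lets the C compiler inline calls the way the Go
 * compiler does automatically.
 */
static inline WriteResult write_ok(size_t n) {
    WriteResult r = {n, 0};
    return r;
}
```

The two fixes are independent: shrinking the return type avoids the hidden-pointer return convention, and `static inline` in the header removes the call overhead entirely on the hot path.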
The optimization was an interesting challenge: what seemed like a return-type issue was actually an inlining problem, and fixing it gave a nice speed boost. There's a lot more of Go's stdlib to port. In the next post, we'll cover — a very unique Go package. In the meantime, if you'd like to write Go that translates to C — with no runtime and manual memory management — I invite you to try Solod . The and packages are included, of course.

The dependency packages:
• implements bit counting and manipulation functions.
• implements functions for UTF-8 encoded text.

Mimicking Go's range loop in C:
• Loop over the slice indexes with ( is a macro that returns , similar to Go's built-in).
• Access the i-th byte with (a bounds-checking macro that returns ).

The memory management options in C:
• Use a reliable garbage collector like Boehm GC to closely match Go's behavior.
• Allocate memory with libc's and have the caller free it later with .
• Introduce allocators.

Why allocators won:
• It's obvious whether a function allocates memory or not: if it has an allocator as a parameter, it allocates.
• It's easy to use different allocation methods: you can use for one function, an arena for another, and a stack allocator for a third.
• It helps with testing and debugging: you can use a tracking allocator to find memory leaks, or a failing allocator to test error handling.

What the ported benchmark runner does:
• Figuring out how many iterations to run.
• Running the benchmark function in a loop.
• Recording metrics (ns/op, MB/s, B/op, allocs/op).
• Reporting the results.

Sean Goedecke 1 week ago

Programming (with AI agents) as theory building

Back in 1985, computer scientist Peter Naur wrote “Programming as Theory Building” . According to Naur - and I agree with him - the core output of software engineers is not the program itself, but the theory of how the program works . In other words, the knowledge inside the engineer’s mind is the primary artifact of engineering work, and the actual software is merely a by-product of that.

This sounds weird, but it’s surprisingly intuitive. Every working programmer knows that you cannot make a change to a program simply by having the code. You first need to read through the code carefully enough to build up a mental model (what Naur calls a “theory”) of what it’s supposed to do and how it does it. Then you make the desired change to your mental model, and only after that can you begin modifying the code.

Many people 1 think that this is why LLMs are not good tools for software engineering: because using them means that engineers can skip building Naur theories of the system, and because LLMs are incapable of developing a Naur theory themselves. Let’s take those one at a time.

Do AI agents let some engineers avoid building detailed mental models of the systems they work on? Of course! As an extreme example, someone could simply punt every task to the latest GPT or Claude model and build no mental model at all 2 . But even a conscientious developer who uses AI tools will necessarily build a less detailed mental model than someone who does it entirely by hand. This is well-attested by the nascent literature on how AI use impacts learning. And it also just makes obvious sense. The whole point of using AI tools is to offload some of the cognitive effort: to be able to just sketch out some of the fine detail in your mental model, because you’re confident that the AI tool can handle it.
For instance, you might have a good grasp on what the broad components do in your service, and how the data flows between them, but not the specific detail of how some sub-component is implemented (because you only reviewed that code, instead of writing it). Isn’t this really bad? If you start dropping the implementation details, aren’t you admitting that you don’t really know how your system works? After all, a theory that isn’t detailed enough to tell you what code would need to be written for a particular change is a useless theory, right? I don’t think so. First, it’s simply a fact that every mental model glosses over some fine details . Before LLMs were a thing, it was common to talk about the “breadth of your stack”: roughly, the level of abstraction that your technical mental model could operate at. You might understand every line of code in the system, but what about dependencies? What about the world of Linux abstractions - processes, threads, sockets, syscalls, ports, and buffers? What about the assembly operations that are ultimately performed by your code? It simply can’t be true that giving up any amount of fine detail is a disaster. Second, coding with LLMs teaches you first-hand how important your mental model is . I do a lot of LLM-assisted work, and in general it looks like this: Note that only 10% of agent output is actually making its way into my output . Almost my entire time is spent looking at some piece of agent-generated code or text and trying to figure out whether it fits into my theory of the system. That theory is necessarily a bit less detailed than when I was writing every line of code by hand. But it’s still my theory! If it weren’t, I’d be accepting most of what the agent produced instead of rejecting almost all of it. Can AI agents build their own theories of the system? If not, this would be a pretty good reason not to use them, or to think that any supposed good outcomes are illusory. 
The first reason to think they can is that LLMs clearly do make working changes to codebases. If you think that a theory is essential to make working changes (which is at least plausible), doesn’t that prove that LLMs can build Naur theories? Well, maybe. They could be pattern-matching to Naur theories in the training data that are close enough to sort of work, or they could be able to build local theories which are good enough (as long as you don’t layer too many of them on top of each other). The second reason to think they can is that you can see them doing it . If you read an agent’s logs, they’re full of explicit theory-building 3 : making hypotheses about how the system works, trying to confirm or disprove them, adjusting the hypothesis, and repeating. When I’m trying to debug something, I’m usually racing against one or more AI agents, and sometimes they win . I refuse to believe that you can debug a million-line codebase without theory-building. I think it’s an open question if AI agents can build working theories of any codebase. In my experience, they do a good job with normal-ish applications like CRUD servers, proxies, and other kinds of program that are well-represented in the training data. If you’re doing something truly weird, I can believe they might struggle (though even then it seems at least possible ). Regardless, one big problem with AI agents is that they can’t retain theories of the codebase . They have to build their theory from scratch every time. Of course, documentation can help a little with this, but in Naur’s words, it’s “strictly impossible” to fully capture a theory in documentation. In fact, Naur thought that if all the humans who built a piece of software left, it was unwise to try and construct a theory of the software even from the code itself , and that you should simply rewrite the program from scratch. I think this is overstating it a bit, at least for large programs, but I agree that it’s a difficult task. 
AI agents are permanently in this unfortunate position: forced to construct a theory of the software from scratch, every single time they’re spun up. Given that, it’s kind of a minor miracle that AI agents are as effective as they are. The next big innovation in AI coding agents will probably be some way of allowing agents to build more long-term theories of the codebase: either by allowing them to modify their own weights 4 , or simply supporting contexts long enough so that you can make weeks worth of changes in the same agent run, or some other idea I haven’t thought of.
My LLM-assisted workflow, in outline:

1. I spin off two or three parallel agents to try and answer some question or implement some code.
2. As each agent finishes (or I glance over at what it’s doing), I scan its work and make a snap judgement about whether it’s accurately reflecting my mental model of the overall system.
3. When it doesn’t - which is about 80% of the time - I either kill the process or I write a quick “no, you didn’t account for X” message.
4. I carefully review the 20% of plausible responses against my mental model, do my own poking around the codebase and manual testing/tweaking, and about half of that code will become a PR.

1. This is the most recent (and well-written) example I’ve seen, but it’s a common view. ↩
2. I have heard of people working like this. Ironically, I think it’s a good thing. The kind of engineer who does this is likely to be improved by becoming a thin wrapper around a frontier LLM (though it’s not great for their career prospects). ↩
3. I think some people would say here that AI agents simply can’t build any theories at all, because theories are a human-mind thing. These are the people who say that AIs can’t believe anything, or think, or have personalities, and so on. I have some sympathy for this as a metaphysical position, but it just seems obviously wrong as a practical view. If I can see GPT-5.4 testing hypotheses and correctly answering questions about the system, I don’t really care if it’s coming from a “real” theory or some synthetic equivalent. ↩
4. This is the dream of continuous learning : if what the AI agent learns about the codebase can be somehow encoded in its weights, it can take days or weeks to build its theory instead of mere minutes. ↩

Simon Willison 1 week ago

Highlights from my conversation about agentic engineering on Lenny's Podcast

I was a guest on Lenny Rachitsky's podcast, in a new episode titled An AI state of the union: We've passed the inflection point, dark factories are coming, and automation timelines . It's available on YouTube , Spotify , and Apple Podcasts . Here are my highlights from our conversation, with relevant links. 4:19 - The end result of these two labs throwing everything they had at making their models better at code is that in November we had what I call the inflection point where GPT 5.1 and Claude Opus 4.5 came along. They were both incrementally better than the previous models, but in a way that crossed a threshold where previously the code would mostly work, but you had to pay very close attention to it. And suddenly we went from that to... almost all of the time it does what you told it to do, which makes all of the difference in the world. Now you can spin up a coding agent and say, build me a Mac application that does this thing , and you'll get something back which won't just be a buggy pile of rubbish that doesn't do anything. 5:49 - I can churn out 10,000 lines of code in a day. And most of it works. Is that good? Like, how do we get from most of it works to all of it works? There are so many new questions that we're facing, which I think makes us a bellwether for other information workers. Code is easier than almost every other problem that you pose these agents because code is obviously right or wrong - either it works or it doesn't work. There might be a few subtle hidden bugs, but generally you can tell if the thing actually works. If it writes you an essay, if it prepares a lawsuit for you, it's so much harder to derive if it's actually done a good job, and to figure out if it got things right or wrong. But it's happening to us as software engineers. It came for us first. And we're figuring out, OK, what do our careers look like? How do we work as teams when part of what we did that used to take most of the time doesn't take most of the time anymore? 
What does that look like? And it's going to be very interesting seeing how this rolls out to other information work in the future. Lawyers are falling for this really badly. The AI hallucination cases database is up to 1,228 cases now! Plus this bit from the cold open at the start : It used to be you'd ask ChatGPT for some code, and it would spit out some code, and you'd have to run it and test it. The coding agents take that step for you now. And an open question for me is how many other knowledge work fields are actually prone to these agent loops? 8:19 - I write so much of my code on my phone. It's wild. I can get good work done walking the dog along the beach, which is delightful. I mainly use the Claude iPhone app for this, both with a regular Claude chat session (which can execute code now ) or using it to control Claude Code for web . 9:55 If you're vibe coding something for yourself, where the only person who gets hurt if it has bugs is you, go wild. That's completely fine. The moment you ship your vibe coding code for other people to use, where your bugs might actually harm somebody else, that's when you need to take a step back. See also When is it OK to vibe code? 12:49 The reason it's called the dark factory is there's this idea in factory automation that if your factory is so automated that you don't need any people there, you can turn the lights off. Like the machines can operate in complete darkness if you don't need people on the factory floor. What does that look like for software? [...] So there's this policy that nobody writes any code: you cannot type code into a computer. And honestly, six months ago, I thought that was crazy. And today, probably 95% of the code that I produce, I didn't type myself. That world is practical already because the latest models are good enough that you can tell them to rename that variable and refactor and add this line there... and they'll just do it - it's faster than you typing on the keyboard yourself. 
The next rule though, is nobody reads the code. And this is the thing which StrongDM started doing last year. I wrote a lot more about StrongDM's dark factory explorations back in February. 21:27 - It used to be, you'd come up with a spec and you hand it to your engineering team. And three weeks later, if you're lucky, they'd come back with an implementation. And now that maybe takes three hours, depending on how well the coding agents are established for that kind of thing. So now what, right? Now, where else are the bottlenecks? Anyone who's done any product work knows that your initial ideas are always wrong. What matters is proving them, and testing them. We can test things so much faster now because we can build workable prototypes so much quicker. So there's an interesting thing I've been doing in my own work where any feature that I want to design, I'll often prototype three different ways it could work because that takes very little time. I've always loved prototyping things, and prototyping is even more valuable now. 22:40 - A UI prototype is free now. ChatGPT and Claude will just build you a very convincing UI for anything that you describe. And that's how you should be working. I think anyone who's doing product design and isn't vibe coding little prototypes is missing out on the most powerful boost that we get in that step. But then what do you do? Given your three options that you have instead of one option, how do you prove to yourself which one of those is the best? I don't have a confident answer to that. I expect this is where the good old fashioned usability testing comes in. More on prototyping later on: 46:35 - Throughout my entire career, my superpower has been prototyping. I've been very quick at knocking out working prototypes of things. I'm the person who can show up at a meeting and say, look, here's how it could work. And that was kind of my unique selling point. And that's gone. Anyone can do what I could do. 
26:25 - I'm finding that using coding agents well is taking every inch of my 25 years of experience as a software engineer, and it is mentally exhausting. I can fire up four agents in parallel and have them work on four different problems. And by like 11 AM, I am wiped out for the day. [...] There's a personal skill we have to learn in finding our new limits - what's a responsible way for us not to burn out. I've talked to a lot of people who are losing sleep because they're like, my coding agents could be doing work for me. I'm just going to stay up an extra half hour and set off a bunch of extra things... and then waking up at four in the morning. That's obviously unsustainable. [...] There's an element of sort of gambling and addiction to how we're using some of these tools. 45:16 - People talk about how important it is not to interrupt your coders. Your coders need to have solid two to four hour blocks of uninterrupted work so they can spin up their mental model and churn out the code. That's changed completely. My programming work, I need two minutes every now and then to prompt my agent about what to do next. And then I can do the other stuff and I can go back. I'm much more interruptible than I used to be. 28:19 - I've got 25 years of experience in how long it takes to build something. And that's all completely gone - it doesn't work anymore because I can look at a problem and say that this is going to take two weeks, so it's not worth it. And now it's like... maybe it's going to take 20 minutes because the reason it would have taken two weeks was all of the sort of crufty coding things that the AI is now covering for us. I constantly throw tasks at AI that I don't think it'll be able to do because every now and then it does it. And when it doesn't do it, you learn, right? But when it does do something, especially something that the previous models couldn't do, that's actually cutting edge AI research. 
And a related anecdote: 36:56 - A lot of my friends have been talking about how they have this backlog of side projects, right? For the last 10, 15 years, they've got projects they never quite finished. And some of them are like, well, I've done them all now. Last couple of months, I just went through and every evening I'm like, let's take that project and finish it. And they almost feel a sort of sense of loss at the end where they're like, well, okay, my backlog's gone. Now what am I going to build? 29:29 - So ThoughtWorks, the big IT consultancy, did an offsite about a month ago , and they got a whole bunch of engineering VPs in from different companies to talk about this stuff. And one of the interesting theories they came up with is they think this stuff is really good for experienced engineers, like it amplifies their skills. It's really good for new engineers because it solves so many of those onboarding problems. The problem is the people in the middle. If you're mid-career, if you haven't made it to sort of super senior engineer yet, but you're not sort of new either, that's the group which is probably in the most trouble right now. I mentioned Cloudflare hiring 1,000 interns , and Shopify too. Lenny asked for my advice for people stuck in that middle: 31:21 - That's a big responsibility you're putting on me there! I think the way forward is to lean into this stuff and figure out how do I help this make me better? A lot of people worry about skill atrophy: if the AI is doing it for you, you're not learning anything. I think if you're worried about that, you push back at it. You have to be mindful about how you're applying the technology and think, okay, I've been given this thing that can answer any question and often gets it right. How can I use this to amplify my own skills, to learn new things, to take on much more ambitious projects? [...] 33:05 - Everything is changing so fast right now. The only universal skill is being able to roll with the changes. 
That's the thing that we all need. The term that comes up most in these conversations about how you can be great with AI is agency . I think agents have no agency at all. I would argue that the one thing AI can never have is agency because it doesn't have human motivations. So I'd say that's the thing is to invest in your own agency and invest in how to use this technology to get better at what you do and to do new things. The fact that it's so easy to create software with detailed documentation and robust tests means it's harder to figure out what's a credible project. 37:47 Sometimes I'll have an idea for a piece of software, Python library or whatever, and I can knock it out in like an hour and get to a point where it's got documentation and tests and all of those things, and it looks like the kind of software that previously I'd have spent several weeks on - and I can stick it up on GitHub And yet... I don't believe in it. And the reason I don't believe in it is that I got to rush through all of those things... I think the quality is probably good, but I haven't spent enough time with it to feel confident in that quality. Most importantly, I haven't used it yet . It turns out when I'm using somebody else's software, the thing I care most about is I want them to have used it for months. I've got some very cool software that I built that I've never used . It was quicker to build it than to actually try and use it! 41:31 - Everyone's like, oh, it must be easy. It's just a chat bot. It's not easy. That's one of the great misconceptions in AI is that using these tools effectively is easy. It takes a lot of practice and it takes a lot of trying things that didn't work and trying things that did work. 19:04 - In the past sort of three to six months, they've started being credible as security researchers, which is sending shockwaves through the security research industry. See Thomas Ptacek: Vulnerability Research Is Cooked . 
At the same time, open source projects are being bombarded with junk security reports: 20:05 - There are these people who don't know what they're doing, who are asking ChatGPT to find a security hole and then reporting it to the maintainer. And the report looks good. ChatGPT can produce a very well formatted report of a vulnerability. It's a total waste of time. It's not actually verified as being a real problem. A good example of the right way to do this is Anthropic's collaboration with Firefox , where Anthropic's security team verified every security problem before passing them to Mozilla. Of course we had to talk about OpenClaw! Lenny had his running on a Mac Mini. 1:29:23 - OpenClaw demonstrates that people want a personal digital assistant so much that they are willing to not just overlook the security side of things, but also getting the thing running is not easy. You've got to create API keys and tokens and install stuff. It's not trivial to get set up and hundreds of thousands of people got it set up. [...] The first line of code for OpenClaw was written on November the 25th. And then in the Super Bowl, there was an ad for AI.com, which was effectively a vaporware white labeled OpenClaw hosting provider. So we went from first line of code in November to Super Bowl ad in what? Three and a half months. I continue to love Drew Breunig's description of OpenClaw as a digital pet: A friend of mine said that OpenClaw is basically a Tamagotchi. It's a digital pet and you buy the Mac Mini as an aquarium. In talking about my explorations of AI for data journalism through Datasette : 1:34:58 - You would have thought that AI is a very bad fit for journalism where the whole idea is to find the truth. But the flip side is journalists deal with untrustworthy sources all the time. The art of journalism is you talk to a bunch of people and some of them lie to you and you figure out what's true. 
So as long as the journalist treats the AI as yet another unreliable source, they're actually better equipped to work with AI than most other professions are.

Obviously we talked about pelicans riding bicycles:

56:10 - There appears to be a very strong correlation between how good their drawing of a pelican riding a bicycle is and how good they are at everything else. And nobody can explain to me why that is. [...] People kept on asking me, what if labs cheat on the benchmark? And my answer has always been: really, all I want from life is a really good picture of a pelican riding a bicycle. And if I can trick every AI lab in the world into cheating on benchmarks to get it, then that just achieves my goal.

59:56 - I think something people often miss is that this space is inherently funny. We have these incredibly expensive, power hungry, supposedly the most advanced computers of all time - and if you ask them to draw a pelican on a bicycle, it looks like a five-year-old drew it. That's really funny to me.

Lenny asked if I had anything else I wanted to leave listeners with to wrap up the show, so I went with the best piece of news in the world right now.

1:38:10 - There is a rare parrot in New Zealand called the Kākāpō. There are only 250 of these parrots left in the world. They are flightless, nocturnal parrots - beautiful green dumpy looking things. And the good news is they're having a fantastic breeding season in 2026. They only breed when the Rimu trees in New Zealand have a mass fruiting season, and the Rimu trees haven't done that since 2022 - so there has not been a single baby kākāpō born in four years. This year, the Rimu trees are in fruit. The kākāpō are breeding. There have been dozens of new chicks born. It's a really, really good time. It's great news for rare New Zealand parrots, and you should look them up because they're delightful.

Everyone should watch the live stream of Rakiura on her nest with two chicks!
Topics covered in this post:

The November inflection point
Software engineers as bellwethers for other information workers
Writing code on my phone
Responsible vibe coding
Dark Factories and StrongDM
The bottleneck has moved to testing
This stuff is exhausting
Interruptions cost a lot less now
My ability to estimate software is broken
It's tough for people in the middle
It's harder to evaluate software
The misconception that AI tools are easy
Coding agents are useful for security research now
Journalists are good at dealing with unreliable sources
The pelican benchmark
And finally, some good news about parrots

Here's the full list of chapters Lenny's team defined for the YouTube video:

00:00 : Introduction to Simon Willison
02:40 : The November 2025 inflection point
08:01 : What's possible now with AI coding
10:42 : Vibe coding vs. agentic engineering
13:57 : The dark-factory pattern
20:41 : Where bottlenecks have shifted
23:36 : Where human brains will continue to be valuable
25:32 : Defending software engineers
29:12 : Why experienced engineers get better results
30:48 : Advice for avoiding the permanent underclass
33:52 : Leaning into AI to amplify your skills
35:12 : Why Simon says he's working harder than ever
37:23 : The market for pre-2022 human-written code
40:01 : Prediction: 50% of engineers writing 95% AI code by the end of 2026
44:34 : The impact of cheap code
48:27 : Simon's AI stack
54:08 : Using AI for research
55:12 : The pelican-riding-a-bicycle benchmark
59:01 : The inherent ridiculousness of AI
1:00:52 : Hoarding things you know how to do
1:08:21 : Red/green TDD pattern for better AI code
1:14:43 : Starting projects with good templates
1:16:31 : The lethal trifecta and prompt injection
1:21:53 : Why 97% effectiveness is a failing grade
1:25:19 : The normalization of deviance
1:28:32 : OpenClaw: the security nightmare everyone is looking past
1:34:22 : What's next for Simon
1:36:47 : Zero-deliverable consulting
1:38:05 : Good news about Kākāpō parrots