Armin Ronacher 2 weeks ago

The Final Bottleneck

Historically, writing code was slower than reviewing code. It might not have felt that way, because code reviews sat in queues until someone got around to picking them up. But if you compare the actual acts themselves, creation was usually the more expensive part. In teams where people both wrote and reviewed code, it never felt like “we should probably program slower.”

So when more and more people tell me they no longer know what code is in their own codebase, I feel like something is very wrong here and it’s time to reflect.

Software engineers often believe that if we make the bathtub bigger, overflow disappears. It doesn’t. OpenClaw right now has north of 2,500 pull requests open. That’s a big bathtub.

Anyone who has worked with queues knows this: if input grows faster than throughput, you have an accumulating failure. At that point, backpressure and load shedding are the only things that keep the system able to operate.

If you have ever been in a Starbucks overwhelmed by mobile orders, you know the feeling. The in-store experience breaks down. You no longer know how many orders are ahead of you. There is no clear line, no reliable wait estimate, and often no real cancellation path unless you escalate and make noise.

That is what many AI-adjacent open source projects feel like right now. And increasingly, that is what a lot of internal company projects feel like in “AI-first” engineering teams, and that’s not sustainable. You can’t triage, you can’t review, and many of the PRs cannot be merged after a certain point because they are too far out of date. And the creator might have lost the motivation to actually get them merged.

There is huge excitement about newfound delivery speed, but in private conversations, I keep hearing the same second sentence: people are also confused about how to keep up with the pace they themselves created.

Humanity has been here before. Many times over.
We already talk about the Luddites a lot in the context of AI, but it’s interesting to see what led up to it. Mark Cartwright wrote a great article about the textile industry in Britain during the industrial revolution. At its core was a simple idea: whenever a bottleneck was removed, innovation happened downstream from that. Weaving sped up? Yarn became the constraint. Faster spinning? Fibre needed to be improved to support the new speeds until finally the demand for cotton went up and that had to be automated too. We saw the same thing in shipping that led to modern automated ports and containerization. As software engineers we have been here too. Assembly did not scale to larger engineering teams, and we had to invent higher level languages. A lot of what programming languages and software development frameworks did was allow us to write code faster and to scale to larger code bases. What it did not do up to this point was take away the core skill of engineering. While it’s definitely easier to write C than assembly, many of the core problems are the same. Memory latency still matters, physics are still our ultimate bottleneck, algorithmic complexity still makes or breaks software at scale. When one part of the pipeline becomes dramatically faster, you need to throttle input. Pi is a great example of this. PRs are auto closed unless people are trusted. It takes OSS vacations . That’s one option: you just throttle the inflow. You push against your newfound powers until you can handle them. But what if the speed continues to increase? What downstream of writing code do we have to speed up? Sure, the pull request review clearly turns into the bottleneck. But it cannot really be automated. If the machine writes the code, the machine better review the code at the same time. So what ultimately comes up for human review would already have passed the most critical possible review of the most capable machine. What else is in the way? 
If we continue with the fundamental belief that machines cannot be accountable, then humans need to be able to understand the output of the machine. And the machine will ship relentlessly. Support tickets of customers will go straight to machines to implement improvements and fixes, for other machines to review, for humans to rubber stamp in the morning. A lot of this sounds both unappealing and reminiscent of the textile industry. The individual weaver no longer carried responsibility for a bad piece of cloth. If it was bad, it became the responsibility of the factory as a whole and it was just replaced outright. As we’re entering the phase of single-use plastic software, we might be moving the whole layer of responsibility elsewhere. But to me it still feels different. Maybe that’s because my lowly brain can’t comprehend the change we are going through, and future generations will just laugh about our challenges. It feels different to me, because what I see taking place in some Open Source projects, in some companies and teams feels deeply wrong and unsustainable. Even Steve Yegge himself now casts doubts about the sustainability of the ever-increasing pace of code creation. So what if we need to give in? What if we need to pave the way for this new type of engineering to become the standard? What affordances will we have to create to make it work? I for one do not know. I’m looking at this with fascination and bewilderment and trying to make sense of it. Because it is not the final bottleneck. We will find ways to take responsibility for what we ship, because society will demand it. Non-sentient machines will never be able to carry responsibility, and it looks like we will need to deal with this problem before machines achieve this status. Regardless of how bizarre they appear to act already. I too am the bottleneck now . But you know what? Two years ago, I too was the bottleneck. I was the bottleneck all along. The machine did not really change that. 
And for as long as I carry responsibilities and am accountable, this will remain true. If we manage to push accountability upwards, it might change, but so far, how that would happen is not clear.

Armin Ronacher 3 weeks ago

A Language For Agents

Last year I first started thinking about what the future of programming languages might look like now that agentic engineering is a growing thing. Initially I felt that the enormous corpus of pre-existing code would cement existing languages in place but now I’m starting to think the opposite is true. Here I want to outline my thinking on why we are going to see more new programming languages and why there is quite a bit of space for interesting innovation. And just in case someone wants to start building one, here are some of my thoughts on what we should aim for! Does an agent perform dramatically better on a language that it has in its weights? Obviously yes. But there are less obvious factors that affect how good an agent is at programming in a language: how good the tooling around it is and how much churn there is. Zig seems underrepresented in the weights (at least in the models I’ve used) and also changing quickly. That combination is not optimal, but it’s still passable: you can program even in the upcoming Zig version if you point the agent at the right documentation. But it’s not great. On the other hand, some languages are well represented in the weights but agents still don’t succeed as much because of tooling choices. Swift is a good example: in my experience the tooling around building a Mac or iOS application can be so painful that agents struggle to navigate it. Also not great. So, just because it exists doesn’t mean the agent succeeds and just because it’s new also doesn’t mean that the agent is going to struggle. I’m convinced that you can build yourself up to a new language if you don’t want to depart everywhere all at once. The biggest reason new languages might work is that the cost of coding is going down dramatically. The result is the breadth of an ecosystem matters less. I’m now routinely reaching for JavaScript in places where I would have used Python. 
Not because I love it or the ecosystem is better, but because the agent does much better with TypeScript.

The way to think about this: if important functionality is missing in my language of choice, I just point the agent at a library from a different language and have it build a port. As a concrete example, I recently built an Ethernet driver in JavaScript to implement the host controller for our sandbox. Implementations exist in Rust, C, and Go, but I wanted something pluggable and customizable in JavaScript. It was easier to have the agent reimplement it than to make the build system and distribution work against a native binding.

New languages will work if their value proposition is strong enough and they evolve with knowledge of how LLMs train. People will adopt them despite their being underrepresented in the weights. And if they are designed to work well with agents, then they might be designed around familiar syntax that is already known to work well.

So why would we want a new language at all? The reason this is interesting to think about is that many of today’s languages were designed with the assumption that punching keys is laborious, so we traded certain things for brevity. As an example, many languages, particularly modern ones, lean heavily on type inference so that you don’t have to write out types. The downside is that you now need an LSP or the resulting compiler error messages to figure out what the type of an expression is. Agents struggle with this too, and it’s also frustrating in pull request review, where complex operations can make it very hard to figure out what the types actually are. Fully dynamic languages are even worse in that regard.

The cost of writing code is going down, but because we are also producing more of it, understanding what the code does is becoming more important. We might actually want more code to be written if it means there is less ambiguity when we perform a review.
I also want to point out that we are heading towards a world where some code is never seen by a human and is only consumed by machines. Even in that case, we still want to give an indication to a user, who is potentially a non-programmer, about what is going on. We want to be able to explain to a user what the code will do without going into the details of how. So the case for a new language comes down to: given the fundamental changes in who is programming and what the cost of code is, we should at least consider one. It’s tricky to say what an agent wants because agents will lie to you and they are influenced by all the code they’ve seen. But one way to estimate how they are doing is to look at how many changes they have to perform on files and how many iterations they need for common tasks. There are some things I’ve found that I think will be true for a while. The language server protocol lets an IDE infer information about what’s under the cursor or what should be autocompleted based on semantic knowledge of the codebase. It’s a great system, but it comes at one specific cost that is tricky for agents: the LSP has to be running. There are situations when an agent just won’t run the LSP — not because of technical limitations, but because it’s also lazy and will skip that step if it doesn’t have to. If you give it an example from documentation, there is no easy way to run the LSP because it’s a snippet that might not even be complete. If you point it at a GitHub repository and it pulls down individual files, it will just look at the code. It won’t set up an LSP for type information. A language that doesn’t split into two separate experiences (with-LSP and without-LSP) will be beneficial to agents because it gives them one unified way of working across many more situations. It pains me as a Python developer to say this, but whitespace-based indentation is a problem. 
The underlying token efficiency of getting whitespace right is tricky, and a language with significant whitespace is harder for an LLM to work with. This is particularly noticeable if you try to make an LLM do surgical changes without an assisted tool. Quite often they will intentionally disregard whitespace, add markers to enable or disable code and then rely on a code formatter to clean up indentation later. On the other hand, braces that are not separated by whitespace can cause issues too. Depending on the tokenizer, runs of closing parentheses can end up split into tokens in surprising ways (a bit like the “strawberry” counting problem), and it’s easy for an LLM to get Lisp or Scheme wrong because it loses track of how many closing parentheses it has already emitted or is looking at. Fixable with future LLMs? Sure, but also something that was hard for humans to get right too without tooling. Readers of this blog might know that I’m a huge believer in async locals and flow execution context — basically the ability to carry data through every invocation that might only be needed many layers down the call chain. Working at an observability company has really driven home the importance of this for me. The challenge is that anything that flows implicitly might not be configured. Take for instance the current time. You might want to implicitly pass a timer to all functions. But what if a timer is not configured and all of a sudden a new dependency appears? Passing all of it explicitly is tedious for both humans and agents and bad shortcuts will be made. One thing I’ve experimented with is having effect markers on functions that are added through a code formatting step. A function can declare that it needs the current time or the database, but if it doesn’t mark this explicitly, it’s essentially a linting warning that auto-formatting fixes. 
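To make the effect-marker idea a bit more tangible, here is a minimal Python sketch. It is only a runtime approximation: the post describes markers that a formatter and linter would add and propagate statically, whereas this version checks declared effects when the function is called. Every name in it (`needs`, `provide`, `effect`, `MissingEffect`) is hypothetical and not taken from any existing library or from the author's experiments.

```python
import contextvars
import functools
from contextlib import contextmanager
from datetime import datetime

# Ambient effects available to the current call chain (hypothetical design).
_effects = contextvars.ContextVar("effects", default={})


class MissingEffect(RuntimeError):
    """Raised when a function runs without an effect it declared."""


def needs(*names):
    """Declare which effects a function requires; checked at call time."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            missing = [n for n in names if n not in _effects.get()]
            if missing:
                raise MissingEffect(f"{fn.__name__} needs: {', '.join(missing)}")
            return fn(*args, **kwargs)
        wrapper.effects = names  # introspectable, e.g. by a linter
        return wrapper
    return decorate


@contextmanager
def provide(**effects):
    """Supply effects for everything called inside the with-block."""
    token = _effects.set({**_effects.get(), **effects})
    try:
        yield
    finally:
        _effects.reset(token)


def effect(name):
    """Look up an effect supplied by an enclosing provide() block."""
    return _effects.get()[name]


@needs("time")
def greeting():
    now = effect("time")()
    return "good morning" if now.hour < 12 else "good afternoon"


# A test supplies a fixed clock instead of patching global state.
with provide(time=lambda: datetime(2024, 1, 1, 9, 0)):
    print(greeting())
```

Calling `greeting()` outside a `provide(...)` block raises `MissingEffect` with a message naming the missing effect, which is the kind of error text a test author (human or agent) can act on directly.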
The LLM can start using something like the current time in a function and any existing caller gets the warning; formatting propagates the annotation. This is nice because when the LLM builds a test, it can precisely mock out these side effects: it understands from the error messages what it has to supply.

Agents struggle with exceptions; they are afraid of them. I’m not sure to what degree this is solvable with RL (Reinforcement Learning), but right now agents will try to catch everything they can, log it, and do a pretty poor recovery. Given how little information is actually available about error paths, that makes sense. Checked exceptions are one approach, but they propagate all the way up the call chain and don’t dramatically improve things. Even if they end up as hints where a linter tracks which errors can fly by, there are still many call sites that need adjusting. And like the auto-propagation proposed for context data, it might not be the right solution. Maybe the right approach is to go more in on typed results, but that’s still tricky for composability without a type and object system that supports it.

The general approach agents use today to read files into memory is line-based, which means they often pick chunks that span multi-line strings. One easy way to see this fall apart: have an agent work on a 2000-line file that also contains long embedded code strings, basically a code generator. The agent will sometimes edit within a multi-line string assuming it’s the real code when it’s actually just embedded code in a multi-line string. For multi-line strings, the only language I’m aware of with a good solution is Zig, but its prefix-based syntax is pretty foreign to most people.

Reformatting also often causes constructs to move to different lines. In many languages, trailing commas in lists are either not supported (JSON) or not customary.
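The trailing-comma point is easy to check with a small experiment. The sketch below uses Python's standard `difflib` to count how many lines change when one element is appended to a multi-line list, with and without a trailing comma. The sample snippets and the `changed_lines` helper are made up for illustration.

```python
import difflib


def changed_lines(before, after):
    """Return only the added and removed lines of a unified diff."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++ ", "--- "))
    ]


# Without a trailing comma, appending an item also touches the previous line.
NO_TRAILING = 'items = [\n    "alpha",\n    "beta"\n]'
NO_TRAILING_ADDED = 'items = [\n    "alpha",\n    "beta",\n    "gamma"\n]'

# With a trailing comma, appending an item is a pure one-line insertion.
WITH_TRAILING = 'items = [\n    "alpha",\n    "beta",\n]'
WITH_TRAILING_ADDED = 'items = [\n    "alpha",\n    "beta",\n    "gamma",\n]'

for label, before, after in [
    ("no trailing comma", NO_TRAILING, NO_TRAILING_ADDED),
    ("trailing comma", WITH_TRAILING, WITH_TRAILING_ADDED),
]:
    lines = changed_lines(before, after)
    print(f"{label}: {len(lines)} changed line(s)")
    for line in lines:
        print("   ", line)
```

In the first case the diff rewrites the `"beta"` line just to add a comma, so an unrelated line shows up in review and in any line-based edit an agent attempts. In the second case only the new line appears.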
If you want diff stability, you’d aim for a syntax that requires less reformatting and mostly avoids multi-line constructs. What’s really nice about Go is that you mostly cannot import symbols from another package into scope without every use being prefixed with the package name. Eg: instead of . There are escape hatches (import aliases and dot-imports), but they’re relatively rare and usually frowned upon. That dramatically helps an agent understand what it’s looking at. In general, making code findable through the most basic tools is great — it works with external files that aren’t indexed, and it means fewer false positives for large-scale automation driven by code generated on the fly (eg: , invocations). Much of what I’ve said boils down to: agents really like local reasoning. They want it to work in parts because they often work with just a few loaded files in context and don’t have much spatial awareness of the codebase. They rely on external tooling like grep to find things, and anything that’s hard to grep or that hides information elsewhere is tricky. What makes agents fail or succeed in many languages is just how good the build tools are. Many languages make it very hard to determine what actually needs to rebuild or be retested because there are too many cross-references. Go is really good here: it forbids circular dependencies between packages (import cycles), packages have a clear layout, and test results are cached. Agents often struggle with macros. It was already pretty clear that humans struggle with macros too, but the argument for them was mostly that code generation was a good way to have less code to write. Since that is less of a concern now, we should aim for languages with less dependence on macros. There’s a separate question about generics and comptime . I think they fare somewhat better because they mostly generate the same structure with different placeholders and it’s much easier for an agent to understand that. 
Related to greppability: agents often struggle to understand barrel files and they don’t like them. Not being able to quickly figure out where a class or function comes from leads to imports from the wrong place, or missing things entirely and wasting context by reading too many files. A one-to-one mapping from where something is declared to where it’s imported from is great. And it does not have to be overly strict either. Go kind of goes this way, but not too extreme. Any file within a directory can define a function, which isn’t optimal, but it’s quick enough to find and you don’t need to search too far. It works because packages are forced to be small enough to find everything with grep. The worst case is free re-exports all over the place that completely decouple the implementation from any trivially reconstructable location on disk. Or worse: aliasing. Agents often hate it when aliases are involved. In fact, you can get them to even complain about it in thinking blocks if you let them refactor something that uses lots of aliases. Ideally a language encourages good naming and discourages aliasing at import time as a result. Nobody likes flaky tests, but agents even less so. Ironic given how particularly good agents are at creating flaky tests in the first place. That’s because agents currently love to mock and most languages do not support mocking well. So many tests end up accidentally not being concurrency safe or depend on development environment state that then diverges in CI or production. Most programming languages and frameworks make it much easier to write flaky tests than non-flaky ones. That’s because they encourage indeterminism everywhere. In an ideal world the agent has one command, that lints and compiles and it tells the agent if all worked out fine. Maybe another command to run all tests that need running. In practice most environments don’t work like this. For instance in TypeScript you can often run the code even though it fails type checks . 
That can gaslight the agent. Likewise different bundler setups can cause one thing to succeed just for a slightly different setup in CI to fail later. The more uniform the tooling the better. Ideally it either runs or doesn’t and there is mechanical fixing for as many linting failures as possible so that the agent does not have to do it by hand. I think we will. We are writing more software now than we ever have — more websites, more open source projects, more of everything. Even if the ratio of new languages stays the same, the absolute number will go up. But I also truly believe that many more people will be willing to rethink the foundations of software engineering and the languages we work with. That’s because while for some years it has felt you need to build a lot of infrastructure for a language to take off, now you can target a rather narrow use case: make sure the agent is happy and extend from there to the human. I just hope we see two things. First, some outsider art: people who haven’t built languages before trying their hand at it and showing us new things. Second, a much more deliberate effort to document what works and what doesn’t from first principles. We have actually learned a lot about what makes good languages and how to scale software engineering to large teams. Yet, finding it written down, as a consumable overview of good and bad language design, is very hard to come by. Too much of it has been shaped by opinion on rather pointless things instead of hard facts. Now though, we are slowly getting to the point where facts matter more, because you can actually measure what works by seeing how well agents perform with it. No human wants to be subject to surveys, but agents don’t care . We can see how successful they are and where they are struggling.

Armin Ronacher 1 month ago

Pi: The Minimal Agent Within OpenClaw

If you haven’t been living under a rock, you will have noticed this week that a project of my friend Peter went viral on the internet. It went by many names. The most recent one is OpenClaw but in the news you might have encountered it as ClawdBot or MoltBot depending on when you read about it. It is an agent connected to a communication channel of your choice that just runs code.

What you might be less familiar with is that what’s under the hood of OpenClaw is a little coding agent called Pi. And Pi happens to be, at this point, the coding agent that I use almost exclusively. Over the last few weeks I became more and more of a shill for the little agent. After I gave a talk on this recently, I realized that I had not actually written about Pi on this blog yet, so I want to give some context on why I’m obsessed with it, and how it relates to OpenClaw.

Pi is written by Mario Zechner and unlike Peter, who aims for “sci-fi with a touch of madness,” [1] Mario is very grounded. Despite the differences in approach, both OpenClaw and Pi follow the same idea: LLMs are really good at writing and running code, so embrace this. In some ways I think that’s not an accident because Peter got Mario and me hooked on this idea, and on agents, last year.

So Pi is a coding agent. And there are many coding agents. Really, I think you can pick effectively any one off the shelf at this point and you will be able to experience what it’s like to do agentic programming. In reviews on this blog I’ve talked positively about AMP, and one of the reasons I resonated so much with AMP is that it really felt like a product built by people who had both gotten addicted to agentic programming and tried a few different things to see which ones work, rather than just building a fancy UI around it.

Pi is interesting to me for two main reasons: it has a tiny core, and it makes up for that core with an extension system that lets extensions persist state into sessions. And a little bonus: Pi itself is written like excellent software.
It doesn’t flicker, it doesn’t consume a lot of memory, it doesn’t randomly break, it is very reliable and it is written by someone who takes great care of what goes into the software. Pi also is a collection of little components that you can build your own agent on top. That’s how OpenClaw is built, and that’s also how I built my own little Telegram bot and how Mario built his mom . If you want to build your own agent, connected to something, Pi when pointed to itself and mom, will conjure one up for you. And in order to understand what’s in Pi, it’s even more important to understand what’s not in Pi, why it’s not in Pi and more importantly: why it won’t be in Pi. The most obvious omission is support for MCP. There is no MCP support in it. While you could build an extension for it, you can also do what OpenClaw does to support MCP which is to use mcporter . mcporter exposes MCP calls via a CLI interface or TypeScript bindings and maybe your agent can do something with it. Or not, I don’t know :) And this is not a lazy omission. This is from the philosophy of how Pi works. Pi’s entire idea is that if you want the agent to do something that it doesn’t do yet, you don’t go and download an extension or a skill or something like this. You ask the agent to extend itself. It celebrates the idea of code writing and running code. That’s not to say that you cannot download extensions. It is very much supported. But instead of necessarily encouraging you to download someone else’s extension, you can also point your agent to an already existing extension, say like, build it like the thing you see over there, but make these changes to it that you like. When you look at what Pi and by extension OpenClaw are doing, there is an example of software that is malleable like clay. And this sets certain requirements for the underlying architecture of it that are actually in many ways setting certain constraints on the system that really need to go into the core design. 
So for instance, Pi’s underlying AI SDK is written so that a session can really contain many different messages from many different model providers. It recognizes that the portability of sessions is somewhat limited between model providers and so it doesn’t lean in too much into any model-provider-specific feature set that cannot be transferred to another. The second is that in addition to the model messages it maintains custom messages in the session files which can be used by extensions to store state or by the system itself to maintain information that either not at all is sent to the AI or only parts of it. Because this system exists and extension state can also be persisted to disk, it has built-in hot reloading so that the agent can write code, reload, test it and go in a loop until your extension actually is functional. It also ships with documentation and examples that the agent itself can use to extend itself. Even better: sessions in Pi are trees. You can branch and navigate within a session which opens up all kinds of interesting opportunities such as enabling workflows for making a side-quest to fix a broken agent tool without wasting context in the main session. After the tool is fixed, I can rewind the session back to earlier and Pi summarizes what has happened on the other branch. This all matters because for instance if you consider how MCP works, on most model providers, tools for MCP, like any tool for the LLM, need to be loaded into the system context or the tool section thereof on session start. That makes it very hard to impossible to fully reload what tools can do without trashing the complete cache or confusing the AI about how prior invocations work differently. An extension in Pi can register a tool to be available to the LLM to call and every once in a while I find this useful. For instance, despite my criticism of how Beads is implemented, I do think that giving an agent access to a to-do list is a very useful thing. 
And I do use an agent-specific issue tracker that works locally that I had my agent build itself. And because I wanted the agent to also manage to-dos, in this particular case I decided to give it a tool rather than a CLI. It felt appropriate for the scope of the problem and it is currently the only additional tool that I’m loading into my context. But for the most part all of what I’m adding to my agent are either skills or TUI extensions to make working with the agent more enjoyable for me. Beyond slash commands, Pi extensions can render custom TUI components directly in the terminal: spinners, progress bars, interactive file pickers, data tables, preview panes. The TUI is flexible enough that Mario proved you can run Doom in it . Not practical, but if you can run Doom, you can certainly build a useful dashboard or debugging interface. I want to highlight some of my extensions to give you an idea of what’s possible. While you can use them unmodified, the whole idea really is that you point your agent to one and remix it to your heart’s content. I don’t use plan mode . I encourage the agent to ask questions and there’s a productive back and forth. But I don’t like structured question dialogs that happen if you give the agent a question tool. I prefer the agent’s natural prose with explanations and diagrams interspersed. The problem: answering questions inline gets messy. So reads the agent’s last response, extracts all the questions, and reformats them into a nice input box. Even though I criticize Beads for its implementation, giving an agent a to-do list is genuinely useful. The command brings up all items stored in as markdown files. Both the agent and I can manipulate them, and sessions can claim tasks to mark them as in progress. As more code is written by agents, it makes little sense to throw unfinished work at humans before an agent has reviewed it first. 
Because Pi sessions are trees, I can branch into a fresh review context, get findings, then bring fixes back to the main session. The UI is modeled after Codex which provides easy to review commits, diffs, uncommitted changes, or remote PRs. The prompt pays attention to things I care about so I get the call-outs I want (eg: I ask it to call out newly added dependencies.) An extension I experiment with but don’t actively use. It lets one Pi agent send prompts to another. It is a simple multi-agent system without complex orchestration which is useful for experimentation. Lists all files changed or referenced in the session. You can reveal them in Finder, diff in VS Code, quick-look them, or reference them in your prompt. quick-looks the most recently mentioned file which is handy when the agent produces a PDF. Others have built extensions too: Nico’s subagent extension and interactive-shell which lets Pi autonomously run interactive CLIs in an observable TUI overlay. These are all just ideas of what you can do with your agent. The point of it mostly is that none of this was written by me, it was created by the agent to my specifications. I told Pi to make an extension and it did. There is no MCP, there are no community skills, nothing. Don’t get me wrong, I use tons of skills. But they are hand-crafted by my clanker and not downloaded from anywhere. For instance I fully replaced all my CLIs or MCPs for browser automation with a skill that just uses CDP . Not because the alternatives don’t work, or are bad, but because this is just easy and natural. The agent maintains its own functionality. My agent has quite a few skills and crucially I throw skills away if I don’t need them. I for instance gave it a skill to read Pi sessions that other engineers shared, which helps with code review. Or I have a skill to help the agent craft the commit messages and commit behavior I want, and how to update changelogs. 
These were originally slash commands, but I’m currently migrating them to skills to see if this works equally well. I also have a skill that hopefully helps Pi use rather than , but I also added a custom extension to intercept calls to and to redirect them to instead.

Part of the fascination that working with a minimal agent like Pi gave me is that it makes you live that idea of using software that builds more software. Taken to the extreme, that is when you remove the UI and output and connect it to your chat. That’s what OpenClaw does and given its tremendous growth, I really feel more and more that this is going to become our future in one way or another.

To spell out the two reasons Pi is interesting to me: First of all, it has a tiny core. It has the shortest system prompt of any agent that I’m aware of and it only has four tools: Read, Write, Edit, Bash. The second thing is that it makes up for its tiny core by providing an extension system that also allows extensions to persist state into sessions, which is incredibly powerful.

[1] https://x.com/steipete/status/2017313990548865292

Armin Ronacher 1 month ago

Colin and Earendil

Regular readers of this blog will know that I started a new company. We have put out just a tiny bit of information today, and some keen folks have discovered and reached out by email with many thoughtful responses. It has been delightful. Colin and I met here, in Vienna. We started sharing coffees, ideas, and lunches, and soon found shared values despite coming from different backgrounds and different parts of the world. We are excited about the future, but we’re equally vigilant of it. After traveling together a bit, we decided to plunge into the cold water and start a company together. We want to be successful, but we want to do it the right way and we want to be able to demonstrate that to our kids. Vienna is a city of great history, two million inhabitants and a fascinating vibe that is nothing like San Francisco. In fact, Vienna is in many ways the polar opposite of Silicon Valley in mindset, opportunity, and approach to life. Colin comes from San Francisco, and though I’m Austrian, my career has been shaped by years working with California companies and people from there who used my Open Source software. Vienna is now our shared home. Despite Austria being so far away from California, it is a place of tinkerers and troublemakers. It’s always good to remind oneself that society consists of more than just your little bubble. It also creates the necessary counter balance to think in these times. The world that is emerging in front of our eyes is one of change. We incorporated as a PBC with a founding charter to craft software and open protocols, strengthen human agency, bridge division and ignorance and to cultivate lasting joy and understanding. Things we believe in deeply. I have dedicated 20 years of my life, in one way or another, to creating Open Source software. In the same way as artificial intelligence calls into question the very nature of my profession and the way we build software, the present day circumstances are testing society.
We’re not immune to these changes and we’re navigating them like everyone else, with a mixture of excitement and worry. But we share a belief that right now is the time to stand true to one’s values and principles. We want to take an earnest shot at leaving the world a better place than we found it. Rather than reject the changes that are happening, we look to nudge them towards the right direction. If you want to follow along you can subscribe to our newsletter, written by humans not machines.

Armin Ronacher 1 month ago

Agent Psychosis: Are We Going Insane?

You can use Polecats without the Refinery and even without the Witness or Deacon. Just tell the Mayor to shut down the rig and sling work to the polecats with the message that they are to merge to main directly. Or the polecats can submit MRs and then the Mayor can merge them manually. It’s really up to you. The Refineries are useful if you have done a LOT of up-front specification work, and you have huge piles of Beads to churn through with long convoys.

— Gas Town Emergency User Manual, Steve Yegge

Many of us got hit by the agent coding addiction. It feels good, we barely sleep, we build amazing things. Every once in a while that interaction involves other humans, and all of a sudden we get a reality check that maybe we overdid it.

The most obvious example of this is the massive degradation of quality of issue reports and pull requests. As a maintainer many PRs now look like an insult to one’s time, but when one pushes back, the other person does not see what they did wrong. They thought they helped and contributed and get agitated when you close it down.

But it’s way worse than that. I see people develop parasocial relationships with their AIs, get heavily addicted to it, and create communities where people reinforce highly unhealthy behavior. How did we get here and what does it do to us?

I will preface this post by saying that I don’t want to call anyone out in particular, and I think I sometimes feel tendencies that I see as negative, in myself as well. I too, have thrown some vibeslop up to other people’s repositories.

In His Dark Materials, every human has a dæmon, a companion that is an externally visible manifestation of their soul. It lives alongside as an animal, but it talks, thinks and acts independently. I’m starting to relate our relationship with agents that have memory to those little creatures. We become dependent on them, and separation from them is painful and takes away from our new-found identity.
We’re relying on these little companions to validate us and to collaborate with. But it’s not a genuine collaboration like between humans, it’s one that is completely driven by us, and the AI is just there for the ride. We can trick it to reinforce our ideas and impulses. And we act through this AI. Some people who have not programmed before, now wield tremendous powers, but all those powers are gone when their subscription hits a rate limit and their little dæmon goes to sleep. Then, when we throw up a PR or issue to someone else, that contribution is the result of this pseudo-collaboration with the machine. When I see an AI pull request come in, or on another repository, I cannot tell how someone created it, but I can usually after a while tell when it was prompted in a way that is fundamentally different from how I do it. Yet it takes me minutes to figure this out. I have seen some coding sessions from others and it’s often done with clarity, but using slang that someone has come up with and most of all: by completely forcing the AI down a path without any real critical thinking. Particularly when you’re not familiar with how the systems are supposed to work, giving in to what the machine says and then thinking one understands what is going on creates some really bizarre outcomes at times. But people create these weird relationships with their AI agent and once you see how some prompt their machines, you realize that it dramatically alters what comes out of it. To get good results you need to provide context, you need to make the tradeoffs, you need to use your knowledge. It’s not just a question of using the context badly, it’s also the way in which people interact with the machine. Sometimes it’s unclear instructions, sometimes it’s weird role-playing and slang, sometimes it’s just swearing and forcing the machine, sometimes it’s a weird ritualistic behavior. 
Some people just really ram the agent straight towards the most narrow of all paths towards a badly defined goal with little concern about the health of the codebase. These dæmon relationships change not just how we work, but what we produce. You can completely give in and let the little dæmon run circles around you. You can reinforce it to run towards ill defined (or even self defined) goals without any supervision. It’s one thing when newcomers fall into this dopamine loop and produce something. When Peter first got me hooked on Claude, I did not sleep. I spent two months excessively prompting the thing and wasting tokens. I ended up building and building and creating a ton of tools I did not end up using much. “You can just do things” was what was on my mind all the time but it took quite a bit longer to realize that just because you can, you might not want to. It became so easy to build something and in comparison it became much harder to actually use it or polish it. Quite a few of the tools I built I felt really great about, just to realize that I did not actually use them or they did not end up working as I thought they would. The thing is that the dopamine hit from working with these agents is so very real. I’ve been there! You feel productive, you feel like everything is amazing, and if you hang out just with people that are into that stuff too, without any checks, you go deeper and deeper into the belief that this all makes perfect sense. You can build entire projects without any real reality check. But it’s decoupled from any external validation. For as long as nobody looks under the hood, you’re good. But when an outsider first pokes at it, it looks pretty crazy. And damn some things look amazing. I too was blown away (and fully expected at the same time) when Cursor’s AI written Web Browser landed. It’s super impressive that agents were able to bootstrap a browser in a week! But holy crap! 
I hope nobody ever uses that thing or would try to build an actual browser out of it, at least with this generation of agents, it’s still pure slop with little oversight. It’s an impressive research and tech demo, not an approach to building software people should use. At least not yet. There is also another side to this slop loop addiction: token consumption. Consider how many tokens these loops actually consume. A well-prepared session with good tooling and context can be remarkably token-efficient. For instance, the entire port of MiniJinja to Go took only 2.2 million tokens. But the hands-off approaches—spinning up agents and letting them run wild—burn through tokens at staggering rates. Patterns like Ralph are particularly wasteful: you restart the loop from scratch each time, which means you lose the ability to use cached tokens or reuse context. We should also remember that current token pricing is almost certainly subsidized. These patterns may not be economically viable for long. And those discounted coding plans we’re all on? They might not last either. And then there are things like Beads and Gas Town , Steve Yegge’s agentic coding tools, which are the complete celebration of slop loops. Beads, which is basically some sort of issue tracker for agents, is 240,000 lines of code that … manages markdown files in GitHub repositories. And the code quality is abysmal. There appears to be some competition in place to run as many of these agents in parallel with almost no quality control in some circles. And to then use agents to try to create documentation artifacts to regain some confidence of what is actually going on. Except those documents themselves read like slop . Looking at Gas Town (and Beads) from the outside, it looks like a Mad Max cult. What are polecats, refineries, mayors, beads, convoys doing in an agentic coding system? 
If the maintainer is in the loop, and the whole community is in on this mad ride, then everyone and their dæmons just throw more slop up. As an external observer the whole project looks like an insane psychosis or a complete mad art project. Except, it’s real? Or is it not? Apparently a reason for slowdown in Gas Town is contention on figuring out the version of Beads, which takes 7 subprocess spawns . Or using the doctor command times out completely . Beads keeps growing and growing in complexity and people who are using it, are realizing that it’s almost impossible to uninstall . And they might not even work well together even though one apparently depends on the other. I don’t want to pick on Gas Town or these projects, but they are just the most visible examples of this in-group behavior right now. But you can see similar things in some of the AI builder circles on Discord and X where people hype each other up with their creations, without much critical thinking and sanity checking of what happens under the hood. It takes you a minute of prompting and waiting a few minutes for code to come out of it. But actually honestly reviewing a pull request takes many times longer than that. The asymmetry is completely brutal. Shooting up bad code is rude because you completely disregard the time of the maintainer. But everybody else is also creating AI-generated code, but maybe they passed the bar of it being good. So how can you possibly tell as a maintainer when it all looks the same? And as the person writing the issue or the PR, you felt good about it. Yet what you get back is frustration and rejection. I’m not sure how we will go ahead here, but it’s pretty clear that in projects that don’t submit themselves to the slop loop, it’s going to be a nightmare to deal with all the AI-generated noise. 
Even for projects that are fully AI-generated but are setting some standard for contributions, some folks now prefer actually just getting the prompts over getting the actual code. Because then it’s clearer what the person actually intended. There is more trust in running the agent oneself than having other people do it. Which really makes me wonder: am I missing something here? Is this where we are going? Am I just not ready for this new world? Are we all collectively getting insane? Particularly if you want to opt out of this craziness right now, it’s getting quite hard. Some projects no longer accept human contributions until they have vetted the people completely. Others are starting to require that you submit prompts alongside your code, or just the prompts alone. I am a maintainer who uses AI myself, and I know others who do. We’re not luddites and we’re definitely not anti-AI. But we’re also frustrated when we encounter AI slop on issue and pull request trackers. Every day brings more PRs that took someone a minute to generate and take an hour to review. There is a dire need to say no now. But when one does, the contributor is genuinely confused: “Why are you being so negative? I was trying to help.” They were trying to help. Their dæmon told them it was good. Maybe the answer is that we need better tools — better ways to signal quality, better ways to share context, better ways to make the AI’s involvement visible and reviewable. Maybe the culture will self-correct as people hit walls. Maybe this is just the awkward transition phase before we figure out new norms. Or maybe some of us are genuinely losing the plot, and we won’t know which camp we’re in until we look back. All I know is that when I watch someone at 3am, running their tenth parallel agent session, telling me they’ve never been more productive — in that moment I don’t see productivity. I see someone who might need to step away from the machine for a bit. 
And I wonder how often that someone is me. Two things are both true to me right now: AI agents are amazing and a huge productivity boost. They are also massive slop machines if you turn off your brain and let go completely.

Armin Ronacher 1 month ago

Porting MiniJinja to Go With an Agent

Turns out you can just port things now. I already attempted this experiment in the summer, but it turned out to be a bit too much for what I had time for. However, things have advanced since. Yesterday I ported MiniJinja (a Rust Jinja2 template engine) to native Go, and I used an agent to do pretty much all of the work. In fact, I barely did anything beyond giving some high-level guidance on how I thought it could be accomplished.

In total I probably spent around 45 minutes actively with it. It worked for around 3 hours while I was watching, then another 7 hours alone. This post is a recollection of what happened and what I learned from it.

All prompting was done by voice using pi, starting with Opus 4.5 and switching to GPT-5.2 Codex for the long tail of test fixing.

- Pi session transcript
- Narrated video of the porting session

MiniJinja is a re-implementation of Jinja2 for Rust. I originally wrote it because I wanted to do an infrastructure automation project in Rust and Jinja was popular for that. The original project didn’t go anywhere, but MiniJinja itself continued being useful for both me and other users.

The way MiniJinja is tested is with snapshot tests: inputs and expected outputs, using insta to verify they match. These snapshot tests were what I wanted to use to validate the Go port.

My initial prompt asked the agent to figure out how to validate the port. Through that conversation, the agent and I aligned on a path: reuse the existing Rust snapshot tests and port incrementally (lexer -> parser -> runtime). This meant the agent built Go-side tooling to:

- Parse Rust’s test input files (which embed settings as JSON headers).
- Parse the reference insta snapshots and compare output.
- Maintain a skip-list to temporarily opt out of failing tests.

This resulted in a pretty good harness with a tight feedback loop. The agent had a clear goal (make everything pass) and a progression (lexer -> parser -> runtime). The tight feedback loop mattered particularly at the end where it was about getting details right. Every missing behavior had one or more failing snapshots.

I used Pi’s branching feature to structure the session into phases. I rewound back to earlier parts of the session and used the branch switch feature to inform the agent automatically what it had already done. This is similar to compaction, but Pi shows me what it puts into the context. When Pi switches branches it does two things:

- It stays in the same session so I can navigate around, but it makes a new branch off an earlier message.
- When switching, it adds a summary of what it did as a priming message into where it branched off. I found this quite helpful to avoid the agent doing vision quests from scratch to figure out how far it had already gotten.

Without switching branches, I would probably just make new sessions and have more plan files lying around or use something like Amp’s handoff feature which also allows the agent to consult earlier conversations if it needs more information.

What was interesting is that the agent went from literal porting to behavioral porting quite quickly. I didn’t steer it away from this as long as the behavior aligned. I let it do this for a few reasons. First, the code base isn’t that large, so I felt I could make adjustments at the end if needed. Letting the agent continue with what was already working felt like the right strategy. Second, it was aligning to idiomatic Go much better this way.

For instance, on the runtime it implemented a tree-walking interpreter (not a bytecode interpreter like Rust) and it decided to use Go’s reflection for the value type. I didn’t tell it to do either of these things, but they made more sense than replicating my Rust interpreter design, which was partly motivated by not having a garbage collector or runtime type information.

On the other hand, the agent made some changes while making tests pass that I disagreed with. It completely gave up on all the “must fail” tests because the error messages were impossible to replicate perfectly given the runtime differences. So I had to steer it towards fuzzy matching instead. It also wanted to regress behavior I wanted to retain (e.g., exact HTML escaping semantics, or that must return an iterator). I think if I hadn’t steered it there, it might not have made it to completion without going down problematic paths, or I would have lost confidence in the result.

Once the major semantic mismatches were fixed, the remaining work was filling in all missing pieces: missing filters and test functions, loop extras, macros, call blocks, etc. Since I wanted to go to bed, I switched to Codex 5.2 and queued up a few “continue making all tests pass if they are not passing yet” prompts, then let it work through compaction. I felt confident enough that the agent could make the rest of the tests pass without guidance once it had the basics covered. This phase ran without supervision overnight.

After functional convergence, I asked the agent to document internal functions and reorganize (like moving filters to a separate file). I also asked it to document all functions and filters like in the Rust code base. This was also when I set up CI, release processes, and talked through what was created to come up with some finalizing touches before merging.

There are a few things I find interesting here.

First: these types of ports are possible now. I know porting was already possible for many months, but it required much more attention. This changes some dynamics. I feel less like technology choices are constrained by ecosystem lock-in. Sure, porting NumPy to Go would be a more involved undertaking, and getting it competitive even more so (years of optimizations in there). But still, it feels like many more libraries can be used now.

Second: for me, the value is shifting from the code to the tests and documentation. A good test suite might actually be worth more than the code. That said, this isn’t an argument for keeping tests secret — generating tests with good coverage is also getting easier. However, for keeping code bases in different languages in sync, you need to agree on shared tests, otherwise divergence is inevitable.

Lastly, there’s the social dynamic. Once, having people port your code to other languages was something to take pride in. It was a sign of accomplishment — a project was “cool enough” that someone put time into making it available elsewhere. With agents, it doesn’t invoke the same feelings. Will McGugan also called out this change.

Lastly, some boring stats for the main session:

- Agent run duration: 10 hours (3 hours supervised)
- Active human time: ~45 minutes
- Total messages: 2,698
- My prompts: 34
- Tool calls: 1,386
- Raw API token cost: $60
- Total tokens: 2.2 million
- Models: and for the unattended overnight run

This did not count the adding of doc strings and smaller fixups.
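The harness described in this post (parse inputs with a JSON header, compare against reference snapshots, keep a skip-list) can be sketched roughly as follows. The real tooling was written in Go by the agent and is not shown in the post; this Python version is only an illustration. The file layouts (a `---` line separating header and template, and a `---` delimited metadata block at the top of each snapshot) are assumptions, and `parse_input`, `parse_snapshot`, `run_case` and `toy_render` are hypothetical names.

```python
import json


def parse_input(text):
    """Split a test input into (settings, template).

    Assumes the JSON header ends at the first line consisting of '---'.
    """
    header, _, template = text.partition("\n---\n")
    return json.loads(header), template


def parse_snapshot(text):
    """Return the expected output, dropping the leading metadata block.

    Assumes the snapshot starts with a '---' delimited metadata section.
    """
    _, _, rest = text.partition("---\n")
    _, _, expected = rest.partition("\n---\n")
    return expected


def run_case(name, input_text, snapshot_text, render, skip=frozenset()):
    """Return 'skip', 'pass' or 'fail' for one snapshot test."""
    if name in skip:
        return "skip"  # temporarily opted out while the port is incomplete
    settings, template = parse_input(input_text)
    actual = render(template, settings)
    expected = parse_snapshot(snapshot_text)
    # Ignore trailing newlines so file endings do not cause false failures.
    return "pass" if actual.rstrip("\n") == expected.rstrip("\n") else "fail"


def toy_render(template, settings):
    """Stand-in for the engine under test: substitutes '{{ key }}' only."""
    out = template
    for key, value in settings.items():
        out = out.replace("{{ " + key + " }}", str(value))
    return out


INPUT = '{"name": "World"}\n---\nHello {{ name }}!\n'
SNAPSHOT = "---\nsource: tests/example.rs\n---\nHello World!\n"
result = run_case("hello.txt", INPUT, SNAPSHOT, toy_render)
```

The skip-list is what keeps the loop tight: failing areas can be parked by name while the agent works through lexer, parser and runtime, and every entry removed from the list is measurable progress.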

Armin Ronacher 2 months ago

A Year Of Vibes

2025 draws to a close and it’s been quite a year. Around this time last year, I wrote a post that reflected on my life . Had I written about programming, it might have aged badly, as 2025 has been a year like no other for my profession. 2025 was the year of changes. Not only did I leave Sentry and start my new company, it was also the year I stopped programming the way I did before. In June I finally felt confident enough to share that my way of working was different: Where I used to spend most of my time in Cursor, I now mostly use Claude Code, almost entirely hands-off. […] If you would have told me even just six months ago that I’d prefer being an engineering lead to a virtual programmer intern over hitting the keys myself, I would not have believed it. While I set out last year wanting to write more, that desire had nothing to do with agentic coding. Yet I published 36 posts — almost 18% of all posts on this blog since 2007. I also had around a hundred conversations with programmers, founders, and others about AI because I was fired up with curiosity after falling into the agent rabbit hole. 2025 was also a not so great year for the world. To make my peace with it, I started a separate blog to separate out my thoughts from here. It started with a growing obsession with Claude Code in April or May, resulting in months of building my own agents and using others’. Social media exploded with opinions on AI: some good, some bad. Now I feel I have found a new stable status quo for how I reason about where we are and where we are going. I’m doubling down on code generation, file systems, programmatic tool invocation via an interpreter glue, and skill-based learning. Basically: what Claude Code innovated is still state of the art for me. That has worked very well over the last few months, and seeing foundation model providers double down on skills reinforces my belief in this approach. I’m still perplexed by how TUIs made such a strong comeback. 
At the moment I’m using Amp , Claude Code , and Pi , all from the command line. Amp feels like the Apple or Porsche of agentic coding tools, Claude Code is the affordable Volkswagen, and Pi is the Hacker’s Open Source choice for me. They all feel like projects built by people who, like me, use them to an unhealthy degree to build their own products, but with different trade-offs. I continue to be blown away by what LLMs paired with tool execution can do. At the beginning of the year I mostly used them for code generation, but now a big number of my agentic uses are day-to-day things. I’m sure we will see some exciting pushes towards consumer products in 2026. LLMs are now helping me with organizing my life, and I expect that to grow further. Because LLMs now not only help me program, I’m starting to rethink my relationship to those machines. I increasingly find it harder not to create parasocial bonds with some of the tools I use. I find this odd and discomforting. Most agents we use today do not have much of a memory and have little personality but it’s easy to build yourself one that does. An LLM with memory is an experience that is hard to shake off. It’s both fascinating and questionable. I have tried to train myself for two years, to think of these models as mere token tumblers, but that reductive view does not work for me any longer. These systems we now create have human tendencies, but elevating them to a human level would be a mistake. I increasingly take issue with calling these machines “agents,” yet I have no better word for it. I take issue with “agent” as a term because agency and responsibility should remain with humans. Whatever they are becoming, they can trigger emotional responses in us that can be detrimental if we are not careful. Our inability to properly name and place these creations in relation to us is a challenge I believe we need to solve. 
Because of all this unintentional anthropomorphization, I’m really struggling at times to find the right words for how I’m working with these machines. I know that this is not just me; it’s others too. It creates even more discomfort when working with people who currently reject these systems outright. One of the most common comments I read in response to agentic coding tool articles is this rejection of giving the machine personality. An unexpected aspect of using AI so much is that we talk far more about vibes than anything else. This way of working is less than a year old, yet it challenges half a century of software engineering experience. So there are many opinions, and it’s hard to say which will stand the test of time. I found a lot of conventional wisdom I don’t agree with, but I have nothing to back up my opinions. How would I? I quite vocally shared my lack of success with MCP throughout the year, but I had little to back it up beyond “does not work for me.” Others swore by it. Similar with model selection. Peter , who got me hooked on Claude early in the year, moved to Codex and is happy with it. I don’t enjoy that experience nearly as much, though I started using it more. I have nothing beyond vibes to back up my preference for Claude. It’s also important to know that some of the vibes come with intentional signalling. Plenty of people whose views you can find online have a financial interest in one product over another, for instance because they are investors in it or they are paid influencers. They might have become investors because they liked the product, but it’s also possible that their views are affected and shaped by that relationship. Pick up a library from any AI company today and you’ll notice they’re built with Stainless or Fern. The docs use Mintlify, the site’s authentication system might be Clerk. Companies now sell services you would have built yourself previously. 
This increase in outsourcing of core services to companies specializing in it meant that the bar for some aspects of the user experience has risen. But with our newfound power from agentic coding tools, you can build much of this yourself. I had Claude build me an SDK generator for Python and TypeScript — partly out of curiosity, partly because it felt easy enough. As you might know, I’m a proponent of simple code and building it yourself . This makes me somewhat optimistic that AI has the potential to encourage building on fewer dependencies. At the same time, it’s not clear to me that we’re moving that way given the current trends of outsourcing everything. This brings me not to predictions but to wishes for where we could put our energy next. I don’t really know what I’m looking for here, but I want to point at my pain points and give some context and food for thought. My biggest unexpected finding: we’re hitting limits of traditional tools for sharing code. The pull request model on GitHub doesn’t carry enough information to review AI generated code properly — I wish I could see the prompts that led to changes. It’s not just GitHub, it’s also git that is lacking. With agentic coding, part of what makes the models work today is knowing the mistakes. If you steer it back to an earlier state, you want the tool to remember what went wrong. There is, for lack of a better word, value in failures. As humans we might also benefit from knowing the paths that did not lead us anywhere, but for machines this is critical information. You notice this when you are trying to compress the conversation history. Discarding the paths that led you astray means that the model will try the same mistakes again. Some agentic coding tools have begun spinning up worktrees or creating checkpoints in git for restore, in-conversation branch and undo features. There’s room for UX innovation that could make these tools easier to work with. 
This is probably why we’re seeing discussions about stacked diffs and alternative version control systems like Jujutsu . Will this change GitHub or will it create space for some new competition? I hope so. I increasingly want to better understand genuine human input and tell it apart from machine output. I want to see the prompts and the attempts that failed along the way. And then somehow I want to squash and compress it all on merge, but with a way to retrieve the full history if needed. This is related to the version control piece: current code review tools assign strict role definitions that just don’t work with AI. Take the GitHub code review UI: I regularly want to use comments on the PR view to leave notes for my own agents, but there is no guided way to do that. The review interface refuses to let me review my own code, I can only comment, but that does not have quite the same intention. There is also the problem that an increased amount of code review now happens between me and my agents locally. For instance, the Codex code review feature on GitHub stopped working for me because it can only be bound to one organization at a time. So I now use Codex on the command line to do reviews, but that means a whole part of my iteration cycles is invisible to other engineers on the team. That doesn’t work for me. Code review to me feels like it needs to become part of the VCS. I also believe that observability is up for grabs again. We now have both the need and opportunity to take advantage of it on a whole new level. Most people were not in a position where they could build their own eBPF programs, but LLMs can. Likewise, many observability tools shied away from SQL because of its complexity, but LLMs are better at it than any proprietary query language. They can write queries, they can grep, they can map-reduce, they remote-control LLDB. Anything that has some structure and text is suddenly fertile ground for agentic coding tools to succeed. 
I don’t know what the observability of the future looks like, but my strong hunch is that we will see plenty of innovation here. The better the feedback loop to the machine, the better the results. I’m not even sure what I’m asking for here, but I think that one of the challenges in the past was that many cool ideas for better observability — specifically dynamic reconfiguration of services for more targeted filtering — were user-unfriendly because they were complex and hard to use. But now those might be the right solutions in light of LLMs because of their increased capabilities for doing this grunt work. For instance Python 3.14 landed an external debugger interface which is an amazing capability for an agentic coding tool. This may be a little more controversial, but what I haven’t managed this year is to give in to the machine. I still treat it like regular software engineering and review a lot. I also recognize that an increasing number of people are not working with this model of engineering but instead completely given in to the machine. As crazy as that sounds, I have seen some people be quite successful with this. I don’t yet know how to reason about this, but it is clear to me that even though code is being generated in the end, the way of working in that new world is very different from the world that I’m comfortable with. And my suspicion is that because that world is here to stay, we might need some new social contracts to separate these out. The most obvious version of this is the increased amount of these types of contributions to Open Source projects, which are quite frankly an insult to anyone who is not working in that model. I find reading such pull requests quite rage-inducing. Personally, I’ve tried to attack this problem with contribution guidelines and pull request templates. But this seems a little like a fight against windmills. This might be something where the solution will not come from changing what we’re doing. 
Instead, it might come from vocal people who are also pro-AI engineering speaking out on what good behavior in an agentic codebase looks like. And it is not just to throw up unreviewed code and then have another person figure the shit out.

Armin Ronacher 2 months ago

Skills vs Dynamic MCP Loadouts

I’ve been moving all my MCPs to skills, including the remaining one I still used: the Sentry MCP 1 . Previously I had already moved entirely away from Playwright to a Playwright skill. In the last month or so there have been discussions about using dynamic tool loadouts to defer loading of tool definitions until later. Anthropic has also been toying around with the idea of wiring together MCP calls via code, something I have experimented with . I want to share my updated findings with all of this and why the deferred tool loading that Anthropic came up with does not fix my lack of love for MCP. Maybe they are useful for someone else. When the agent encounters a tool definition through reinforcement learning or otherwise, it is encouraged to emit tool calls through special tokens when it encounters a situation where that tool call would be appropriate. For all intents and purposes, tool definitions can only appear between special tool definition tokens in a system prompt. Historically this means that you cannot emit tool definitions later in the conversation state. So your only real option is for a tool to be loaded when the conversation starts. In agentic uses, you can of course compress your conversation state or change the tool definitions in the system message at any point. But the consequence is that you will lose the reasoning traces and also the cache. In the case of Anthropic, for instance, this will make your conversation significantly more expensive. You would basically start from scratch and pay full token rates plus cache write cost, compared to cache read. One recent innovation from Anthropic is deferred tool loading. You still declare tools ahead of time in the system message, but they are not injected into the conversation when the initial system message is emitted. Instead they appear at a later point. The tool definitions however still have to be static for the entire conversation, as far as I know. 
So the tools that could exist are defined when the conversation starts. The way Anthropic discovers the tools is purely by regex search. This is all quite relevant because even though MCP with deferred loading feels like it should perform better, it actually requires quite a bit of engineering on the LLM API side. The skill system gets away without any of that and, at least from my experience, still outperforms it. Skills are really just short summaries of which skills exist and in which file the agent can learn more about them. These are proactively loaded into the context. So the agent understands in the system context (or maybe somewhere later in the context) what capabilities it has and gets a link to the manual for how to use them. Crucially, skills do not actually load a tool definition into the context. The tools remain the same: bash and the other tools the agent already has. All it learns from the skill are tips and tricks for how to use these tools more effectively. Because the main thing it learns is how to use other command line tools and similar utilities, the fundamentals of how to chain and coordinate them together do not actually change. The reinforcement learning that made the Claude family of models very good tool callers just helps with these newly discovered tools. So that obviously raises the question: if skills work so well, can I move the MCP outside of the context entirely and invoke it through the CLI in a similar way as Anthropic proposes? The answer is yes, you can, but it doesn’t work well. One option here is Peter Steinberger’s mcporter . In short, it reads the files and exposes the MCPs behind it as callable tools: And yes, it looks very much like a command line tool that the LLM can invoke. The problem however is that the LLM does not have any idea about what tools are available, and now you need to teach it that. So you might think: why not make some skills that teach the LLM about the MCPs? 
Here the issue for me comes from the fact that MCP servers have no desire to maintain API stability. They are increasingly starting to trim down tool definitions to the bare minimum to preserve tokens. This makes sense, but for the skill pattern it’s not what you want. For instance, the Sentry MCP server at one point switched the query syntax entirely to natural language. A great improvement for the agent, but my suggestions for how to use it became a hindrance and I did not discover the issue straight away. This is in fact quite similar to Anthropic’s deferred tool loading: there is no information about the tool in the context at all. You need to create a summary. The eager loading of MCP tools we have done in the past now has ended up with an awkward compromise: the description is both too long to eagerly load it, and too short to really tell the agent how to use it. So at least from my experience, you end up maintaining these manual skill summaries for MCP tools exposed via mcporter or similar. This leads me to my current conclusion: I tend to go with what is easiest, which is to ask the agent to write its own tools as a skill. Not only does it not take all that long, but the biggest benefit is that the tool is largely under my control. Whenever it breaks or needs some other functionality, I ask the agent to adjust it. The Sentry MCP is a great example. I think it’s probably one of the better designed MCPs out there, but I don’t use it anymore. In part because when I load it into the context right away I lose around 8k tokens out of the box, and I could not get it to work via mcporter. On the other hand, I have Claude maintain a skill for me. And yes, that skill is probably quite buggy and needs to be updated, but because the agent maintains it, it works out better. It’s quite likely that all of this will change, but at the moment manually maintained skills and agents writing their own tools have become my preferred way. 
I suspect that dynamic tool loading with MCP will become a thing, but it will probably require quite some protocol changes to bring in skill-like summaries and built-in manuals for the tools. I also suspect that MCP would greatly benefit from protocol stability. The fact that MCP servers keep changing their tool descriptions at will does not work well with materialized calls and external tool descriptions in READMEs and skill files.
1. Keen readers will remember that last time, the last MCP I used was Playwright. In the meantime I added and removed two more MCPs: Linear and Sentry, mostly because of authentication issues and neither having a great command line interface.
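To make the skill pattern described above a bit more tangible, here is a minimal sketch of what "short summaries plus a pointer to the manual" could look like when rendered into the context. This is not how any particular agent implements skills: the `Skill` class, the `render_skill_index` function, the skill names, and the file paths are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    # All fields are illustrative assumptions, not a real skill file format.
    name: str
    summary: str
    manual_path: str


def render_skill_index(skills):
    """Render a short index that could be placed into the agent's context.

    Only the one-line summary and the manual location are included; the
    manual itself is read by the agent later, and no tool definition is added.
    """
    lines = ["Available skills (read the manual before using one):"]
    for skill in sorted(skills, key=lambda s: s.name):
        lines.append(f"- {skill.name}: {skill.summary} (manual: {skill.manual_path})")
    return "\n".join(lines)


# Hypothetical skills and paths.
skills = [
    Skill("sentry", "Query issues via a local script", "skills/sentry/SKILL.md"),
    Skill("browser", "Drive a browser from the shell", "skills/browser/SKILL.md"),
]
index = render_skill_index(skills)
print(index)
```

The point of the sketch is the asymmetry: the index costs a line per skill, while the detailed instructions stay on disk until the agent decides it needs them.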

Armin Ronacher 2 months ago

Let’s Destroy The European Union!

Elon Musk is not happy with the EU fining his X platform and is currently on a tweet rampage complaining about it. Among other things, he wants the whole EU to be abolished. He sadly is hardly the first wealthy American to share their opinions on European politics lately. I’m not a fan of this outside attention but I believe it’s noteworthy and something to pay attention to. In particular because the idea of destroying and ripping apart the EU is not just popular in the US; it’s popular over here too. Something that greatly concerns me. There is definitely a bunch of stuff we might want to fix over here. I have complained about our culture before. Unfortunately, I happen to think that our challenges are not coming from politicians or civil servants, but from us, the people. Europeans don’t like to take risks and are quite pessimistic about the future compared to their US counterparts. Additionally, we Europeans have been trained to feel a lot of guilt over the years, which makes us hesitant to stand up for ourselves. This has led to all kinds of interesting counter-cultural movements in Europe, like years of significant support for unregulated immigration and an unhealthy obsession with the idea of degrowth. Today, though, neither seems quite as popular as it once was. Morally these things may be defensible, but in practice they have led to Europe losing its competitive edge and eroding social cohesion. The combination of a strong social state and high taxes in particular does not mix well with the kind of immigration we have seen in the last decade: mostly people escaping wars ending up in low-skilled jobs. That means it’s not unlikely that certain classes of immigrants are going to be net-negative for a very long time, if not forever, and increasingly society is starting to think about what the implications of that might be. Yet even all of that is not where our problems lie, and it’s certainly not our presumed lack of free speech. 
Any conversation on that topic is foolish because it’s too nuanced. Society clearly wants to place some limits to free speech here, but the same is true in the US. In the US we can currently see a significant push-back against “woke ideologies,” and a lot of that push-back involves restricting freedom of expression through different avenues. The US might try to lecture Europe right now on free speech, but what it should be lecturing us on is our economic model. Europe has too much fragmentation, incredibly strict regulation that harms innovation, ineffective capital markets, and a massive dependency on both the United States and China. If the US were to cut us off from their cloud providers, we would not be able to operate anything over here. If China were to stop shipping us chips, we would be in deep trouble too ( we have seen this ). This is painful because the US is historically a great example when it comes to freedom of information, direct democracy at the state level, and rather low corruption. These are all areas where we’re not faring well, at least not consistently, and we should be lectured. Fundamentally, the US approach to capitalism is about as good as it’s going to get. If there was any doubt that alternative approaches might have worked out better, at this point there’s very little evidence in favor of that. Yet because of increased loss of civil liberties in the US, many Europeans now see everything that the US is doing as bad. A grave mistake. Both China and the US are quite happy with the dependency we have on them and with us falling short of our potential. Europe’s attempt at dealing with the dependency so far has been to regulate and tax US corporations more heavily. That’s not a good strategy. The solution must be to become competitive again so that we can redirect that tax revenue to local companies instead. 
The Digital Markets Act is a good example: we’re punishing Apple and forcing them to open up their platform, but we have no company that can take advantage of that opening. If you read my blog here, you might remember my musings about the lack of clarity about what a foreigner is in Europe. The reality is that Europe has been deeply integrated for a long time now as a result of how the EU works, but still not at the same level as the US. I think this is still the biggest problem. People point to languages as the challenge, but under the hood, the countries are still fighting each other. Austria wants to protect its local stores from larger competition in Germany and its carpenters from the cheaper ones coming from Slovenia. You can replace Austria with any other EU country and you will find the same thing. The EU might not be perfect, but it’s hard to imagine that abolishing it would solve any problem given how nation states have been shown to behave. The moment the EU fell away, we would be warming up all the border struggles again. We have already seen similar issues pop up in Northern Ireland after the UK left. And we just have so much bureaucracy, so many non-functioning social systems, and such a tremendous amount of incoming governmental debt to support our flailing pension schemes. We need growth more than any other bloc, and we have such a low probability of actually accomplishing that. Given how the EU is structured, it’s also acting as the punching bag for the failure of the nation states to come to agreements. It’s not that EU bureaucrats are telling Europeans to take in immigrants or to enact chat control, cookie banners, or attached plastic caps. Those are all initiatives that come from one or more member states. But the EU in the end will always take the blame because even local politicians who voted in support of some of these things can easily point towards “Brussels” as having created a problem.
A Europe in pieces does not sound appealing to me at all, and that’s because I can look at what China and the US have. What China and the US have that Europe lacks is a strong national identity. Both countries have recognized that strength comes from unity. China in particular is fighting any kind of regionalism tooth and nail. The US has accomplished this through the pledge of allegiance, a civil war, the Department of Education pushing a common narrative in schools, and historically putting post offices and infrastructure everywhere. Europe has none of that. More importantly, Europeans don’t even want it. There is a mistaken belief that we can just become these tiny states again and be fine. If Europe wants to be competitive, it seems unlikely that this can be accomplished without becoming a unified superpower. Yet there is no belief in Europe that this can or should happen, and the other superpowers have little interest in seeing it happen either. If I had to propose something constructive, it would be this: Europe needs to stop pretending it can be 27 different countries with 27 different economic policies while also being a single market. The half-measures are killing us. We have a common currency in the Eurozone but no common fiscal policy. We have freedom of movement but wildly different social systems. We have common regulations but fragmented enforcement. 27 labor laws, 27 different legal systems, tax codes, complex VAT rules and so on. The Draghi report from last year laid out many of these issues quite clearly: Europe needs massive investment in technology and infrastructure. It needs a genuine single market for services, not just goods. It needs capital markets that can actually fund startups at scale. None of this is news to anyone paying attention. But here’s the uncomfortable truth: none of this will happen without Europeans accepting that more integration is the answer, not less. And right now, the political momentum is in the opposite direction. 
Every country wants the benefits of the EU without the obligations. Every country wants to protect its own industries while accessing everyone else’s markets. One of the arguments against deeper integration hinges on some quite unrelated issues. For instance, the EU is seen as non-democratic, but some of the criticism just does not sit right with me. Sure, I too would welcome more democracy in the EU, but at the same time, the system really is not undemocratic today. Take things like chat control: the reason this thing does not die is that some member states and their elected representatives are pushing for it. What stands in the way is that the member countries and their people don’t actually want to strengthen the EU further. The “lack of democracy” is very much intentional and the exact outcome you get if you want to keep the power with the nation states. So back to where we started: should the EU be abolished as Musk suggests? I think this is a profoundly unserious proposal from someone who has little understanding of European history and even less interest in learning. The EU exists because two world wars taught Europeans that nationalism without checks leads to catastrophe. It exists because small countries recognized they have more leverage negotiating as a bloc than individually. I also take a lot of issue with the idea that European politics should be driven by foreign interests. Neither Russians nor Americans have any good reason to take so much interest in European politics. They are not living here; we are. Would Europe be more “free” without the EU? Perhaps in some narrow regulatory sense. But it would also be weaker, more divided, and more susceptible to manipulation by larger powers, including the United States. I also find it somewhat rich that American tech billionaires are calling for the dissolution of the EU while they are greatly benefiting from the open market it provides.
Their companies extract enormous value from the European market, more than even local companies are able to. The real question isn’t whether Europe should have less regulation or more freedom. It’s whether we Europeans can find the political will to actually complete the project we started. A genuine federation with real fiscal transfers, a common defense policy, and a unified foreign policy would be a superpower. What we have now is a compromise that satisfies nobody and leaves us vulnerable to exactly the kind of pressure Musk and other oligarchs represent. Europe doesn’t need fixing in the way the loud present-day critics suggest. It doesn’t need to become more like America or abandon its social model entirely. What it needs is to decide what it actually wants to be. The current state of perpetual ambiguity is unsustainable. It also should not lose its values. Europeans might no longer be quite as hot on the human rights that the EU provides, and they might no longer want to have the same level of immigration. Yet simultaneously, Europeans are presented with a reality that needs all of these things. We’re all highly dependent on movement of labour, and that includes people from abroad. Unfortunately, the wars of the last decade have dominated any migration discourse, and that has created ground for populists to thrive. Any skilled tech migrant is running into the same walls as everyone else, which has made it less and less appealing to come. Or perhaps we’ll continue muddling through, which historically has been Europe’s preferred approach. It’s not inspiring, but it’s also not going to be the catastrophe the internet would have you believe either. Is there reason to be optimistic? On a long enough timeline the graph goes up and to the right. We might be going through some rough patches, but structurally the whole thing here is still pretty solid. 
And it’s not as if the rest of the world is cruising along smoothly: the US, China, and Russia are each dealing with their own crises. That shouldn’t serve as an excuse, but it does offer context. As bleak as things can feel, we’re not alone in having challenges, but ours are uniquely ours and we will face them. One way or another.

Armin Ronacher 3 months ago

LLM APIs are a Synchronization Problem

The more I work with large language models through provider-exposed APIs, the more I feel like we have built ourselves into quite an unfortunate API surface area. It might not actually be the right abstraction for what’s happening under the hood. The way I like to think about this problem now is that it’s actually a distributed state synchronization problem. At its core, a large language model takes text, tokenizes it into numbers, and feeds those tokens through a stack of matrix multiplications and attention layers on the GPU. Using a large set of fixed weights, it produces activations and predicts the next token. If it weren’t for temperature (randomization), you could think of it having the potential of being a much more deterministic system, at least in principle. As far as the core model is concerned, there’s no magical distinction between “user text” and “assistant text”—everything is just tokens. The only difference comes from special tokens and formatting that encode roles (system, user, assistant, tool), injected into the stream via the prompt template. You can look at the system prompt templates on Ollama for the different models to get an idea. Let’s ignore for a second which APIs already exist and just think about what usually happens in an agentic system. If I were to have my LLM run locally on the same machine, there is still state to be maintained, but that state is very local to me. You’d maintain the conversation history as tokens in RAM, and the model would keep a derived “working state” on the GPU—mainly the attention key/value cache built from those tokens. The weights themselves stay fixed; what changes per step are the activations and the KV cache. 
From a mental-model perspective, caching means “remember the computation you already did for a given prefix so you don’t have to redo it.” Internally, that usually means storing the attention KV cache for those prefix tokens on the server and letting you reuse it, not literally handing you raw GPU state. There are probably some subtleties to this that I’m missing, but I think this is a pretty good model to think about it. The moment you’re working with completion-style APIs such as OpenAI’s or Anthropic’s, abstractions are put in place that make things a little different from this very simple system. The first difference is that you’re not actually sending raw tokens around. The way the GPU looks at the conversation history and the way you look at it are on fundamentally different levels of abstraction. While you could count and manipulate tokens on one side of the equation, extra tokens are being injected into the stream that you can’t see. Some of those tokens come from converting the JSON message representation into the underlying input tokens fed into the machine. But you also have things like tool definitions, which are injected into the conversation in proprietary ways. Then there’s out-of-band information such as cache points. And beyond that, there are tokens you will never see. For instance, with reasoning models you often don’t see any real reasoning tokens, because some LLM providers try to hide as much as possible so that you can’t retrain your own models with their reasoning state. On the other hand, they might give you some other informational text so that you have something to show to the user. Model providers also love to hide search results and how those results were injected into the token stream. Instead, you only get an encrypted blob back that you need to send back to continue the conversation. All of a sudden, you need to take some information on your side and funnel it back to the server so that state can be reconciled on either end. 
In completion-style APIs, each new turn requires resending the entire prompt history. The size of each individual request grows linearly with the number of turns, but the cumulative amount of data sent over a long conversation grows quadratically because each linear-sized history is retransmitted at every step. This is one of the reasons long chat sessions feel increasingly expensive. On the server, the model’s attention cost over that sequence also grows quadratically in sequence length, which is why caching starts to matter. One of the ways OpenAI tried to address this problem was to introduce the Responses API, which maintains the conversational history on the server (at least in the version with the saved state flag). But now you’re in a bizarre situation where you’re fully dealing with state synchronization: there’s hidden state on the server and state on your side, but the API gives you very limited synchronization capabilities. To this point, it remains unclear to me how long you can actually continue that conversation. It’s also unclear what happens if there is state divergence or corruption. I’ve seen the Responses API get stuck in ways where I couldn’t recover it. It’s also unclear what happens if there’s a network partition, or if one side got the state update but the other didn’t. The Responses API with saved state is quite a bit harder to use, at least as it’s currently exposed. Obviously, for OpenAI it’s great because it allows them to hide more behind-the-scenes state that would otherwise have to be funneled through with every conversation message. Regardless of whether you’re using a completion-style API or the Responses API, the provider always has to inject additional context behind the scenes—prompt templates, role markers, system/tool definitions, sometimes even provider-side tool outputs—that never appears in your visible message list. 
Different providers handle this hidden context in different ways, and there’s no common standard for how it’s represented or synchronized. The underlying reality is much simpler than the message-based abstractions make it look: if you run an open-weights model yourself, you can drive it directly with token sequences and design APIs that are far cleaner than the JSON-message interfaces we’ve standardized around. The complexity gets even worse when you go through intermediaries like OpenRouter or SDKs like the Vercel AI SDK, which try to mask provider-specific differences but can’t fully unify the hidden state each provider maintains. In practice, the hardest part of unifying LLM APIs isn’t the user-visible messages—it’s that each provider manages its own partially hidden state in incompatible ways. It really comes down to how you pass this hidden state around in one form or another. I understand that from a model provider’s perspective, it’s nice to be able to hide things from the user. But synchronizing hidden state is tricky, and none of these APIs have been built with that mindset, as far as I can tell. Maybe it’s time to start thinking about what a state synchronization API would look like, rather than a message-based API. The more I work with these agents, the more I feel like I don’t actually need a unified message API. The core idea of it being message-based in its current form is itself an abstraction that might not survive the passage of time. There’s a whole ecosystem that has dealt with this kind of mess before: the local-first movement. Those folks spent a decade figuring out how to synchronize distributed state across clients and servers that don’t trust each other, drop offline, fork, merge, and heal. Peer-to-peer sync, and conflict-free replicated storage engines all exist because “shared state but with gaps and divergence” is a hard problem that nobody could solve with naive message passing. 
Their architectures explicitly separate canonical state, derived state, and transport mechanics — exactly the kind of separation missing from most LLM APIs today. Some of those ideas map surprisingly well to models: KV caches resemble derived state that could be checkpointed and resumed; prompt history is effectively an append-only log that could be synced incrementally instead of resent wholesale; provider-side invisible context behaves like a replicated document with hidden fields. At the same time though, if the remote state gets wiped because the remote site doesn’t want to hold it for that long, we would want to be in a situation where we can replay it entirely from scratch—which for instance the Responses API today does not allow. There’s been plenty of talk about unifying message-based APIs, especially in the wake of MCP (Model Context Protocol). But if we ever standardize anything, it should start from how these models actually behave, not from the surface conventions we’ve inherited. A good standard would acknowledge hidden state, synchronization boundaries, replay semantics, and failure modes — because those are real issues. There is always the risk that we rush to formalize the current abstractions and lock in their weaknesses and faults. I don’t know what the right abstraction looks like, but I’m increasingly doubtful that the status-quo solutions are the right fit.
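The quadratic growth mentioned earlier, and the difference an incrementally synced append-only log would make, can be shown with a little arithmetic. This is a rough sketch under simplifying assumptions: every turn adds a fixed, hypothetical number of tokens, and hidden tokens, caching, and provider-specific overhead are ignored. The function names are made up for illustration.

```python
def tokens_sent_full_resend(turn_sizes):
    """Total tokens transmitted when every request resends the whole history."""
    total = 0
    history = 0
    for size in turn_sizes:
        history += size   # the history grows by this turn's tokens
        total += history  # the request carries the full history so far
    return total


def tokens_sent_incremental(turn_sizes):
    """Total tokens transmitted when only the new suffix of the log is sent."""
    return sum(turn_sizes)


# Hypothetical conversation: 50 turns of 100 tokens each.
turns = [100] * 50
print(tokens_sent_full_resend(turns))  # 127500
print(tokens_sent_incremental(turns))  # 5000
```

With n equal-sized turns the full-resend total is proportional to n(n+1)/2, while the incremental total is proportional to n. That gap is what a synchronization-style API could in principle close, provided both sides agree on what the log contains.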

Armin Ronacher 3 months ago

Absurd Workflows: Durable Execution With Just Postgres

It’s probably no surprise to you that we’re building agents somewhere. Everybody does it. Building a good agent, however, brings back some of the historic challenges involving durable execution. Entirely unsurprisingly, a lot of people are now building durable execution systems. Many of these, however, are incredibly complex and require you to sign up for another third-party service. I generally try to avoid bringing in extra complexity if I can avoid it, so I wanted to see how far I can go with just Postgres. To this end, I wrote Absurd 1 , a tiny SQL-only library with a very thin SDK to enable durable workflows on top of just Postgres — no extension needed. Durable execution (or durable workflows) is a way to run long-lived, reliable functions that can survive crashes, restarts, and network failures without losing state or duplicating work. Durable execution can be thought of as the combination of a queue system and a state store that remembers the most recently seen execution state. Because Postgres is excellent at queues thanks to , you can use it for the queue (e.g., with pgmq ). And because it’s a database, you can also use it to store the state. The state is important. With durable execution, instead of running your logic in memory, the goal is to decompose a task into smaller pieces (step functions) and record every step and decision. When the process stops (whether it fails, intentionally suspends, or a machine dies) the engine can replay those events to restore the exact state and continue where it left off, as if nothing happened. Absurd at the core is a single file ( ) which needs to be applied to a database of your choice. That SQL file’s goal is to move the complexity of SDKs into the database. SDKs then make the system convenient by abstracting the low-level operations in a way that leverages the ergonomics of the language you are working with. The system is very simple: A task dispatches onto a given queue from where a worker picks it up to work on. 
Tasks are subdivided into steps , which are executed in sequence by the worker. Tasks can be suspended or fail, and when that happens, they execute again (a run ). The result of a step is stored in the database (a checkpoint ). To avoid repeating work, checkpoints are automatically loaded from the state storage in Postgres again. Additionally, tasks can sleep or suspend for events and wait until they are emitted. Events are cached, which means they are race-free. What is the relationship of agents with workflows? Normally, workflows are DAGs defined by a human ahead of time. AI agents, on the other hand, define their own adventure as they go. That means they are basically a workflow with mostly a single step that iterates over changing state until it determines that it has completed. Absurd enables this by automatically counting up steps if they are repeated: This defines a single task named , and it has just a single step. The return value is the changed state, but the current state is passed in as an argument. Every time the step function is executed, the data is looked up first from the checkpoint store. The first checkpoint will be , the second , , etc. Each state only stores the new messages it generated, not the entire message history. If a step fails, the task fails and will be retried. And because of checkpoint storage, if you crash in step 5, the first 4 steps will be loaded automatically from the store. Steps are never retried, only tasks. How do you kick it off? Simply enqueue it: And if you are curious, this is an example implementation of the function used above: And like Temporal and other solutions, you can yield if you want. If you want to come back to a problem in 7 days, you can do so: Or if you want to wait for an event: Which someone else can emit: Really, that’s it. There is really not much to it. It’s just a queue and a state store — that’s all you need. There is no compiler plugin and no separate service or whole runtime integration . 
Just Postgres. That’s not to throw shade on these other solutions; they are great. But not every problem necessarily needs to scale to that level of complexity, and you can get quite far with much less. Particularly if you want to build software that other people should be able to self-host, that might be quite appealing. It’s named Absurd because durable workflows are absurdly simple, but have been overcomplicated in recent years. ↩
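The checkpoint-and-replay behavior described in this post can be illustrated with a small Python sketch. This is not the Absurd SDK or its SQL: a plain dict stands in for the checkpoint table in Postgres, and `TaskContext`, `run_task`, `agent_task`, and the `name#n` key scheme for repeated steps are all hypothetical names chosen for illustration. It only shows the core idea: a failed task is re-run as a whole, and steps that already completed are loaded from storage instead of being executed again.

```python
from typing import Any, Callable, Dict


class TaskContext:
    """Toy task context. A plain dict stands in for the checkpoint table in Postgres."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        self.checkpoints = checkpoints
        self._seen: Dict[str, int] = {}  # how often each step name was used in this run

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Repeated step names get a counter suffix (hypothetical naming scheme).
        count = self._seen.get(name, 0) + 1
        self._seen[name] = count
        key = name if count == 1 else f"{name}#{count}"
        if key in self.checkpoints:
            return self.checkpoints[key]  # replay: reuse the stored result
        result = fn()
        self.checkpoints[key] = result  # checkpoint only after the step succeeded
        return result


def run_task(
    task: Callable[[TaskContext], Any],
    checkpoints: Dict[str, Any],
    max_attempts: int = 3,
) -> Any:
    """Retry the whole task (one run per attempt); finished steps load from checkpoints."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(TaskContext(checkpoints))
        except Exception:
            if attempt == max_attempts:
                raise


executed = []  # inputs of step bodies that actually ran, to show what replay skips
crash_once = {"pending": True}


def agent_task(ctx: TaskContext) -> int:
    state = 0
    for _ in range(3):
        def iteration(current: int = state) -> int:
            executed.append(current)
            if current == 1 and crash_once["pending"]:
                crash_once["pending"] = False
                raise RuntimeError("simulated crash")
            return current + 1

        state = ctx.step("iteration", iteration)
    return state


store: Dict[str, Any] = {}
result = run_task(agent_task, store)
print(result, executed, sorted(store))
```

In this run the second iteration crashes once. The retry starts the task from the top, but the first iteration is served from the store, so only the failed iteration and the one after it execute. A real implementation would also persist the queue and the run state, which this sketch leaves out.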

Armin Ronacher 4 months ago

Regulation Isn’t the European Trap — Resignation Is

Plenty has been written about how hard it is to build in Europe versus the US. The list is always the same with little progress: brittle politics, dense bureaucracy, mandatory notaries, endless and rigid KYC and AML processes. Fine. I know, you know. I’m not here to add another complaint to the pile (but if we meet over a beer or coffee, I’m happy to unload a lot of hilarious anecdotes on you). The unfortunate reality is that most of these constraints won’t change in my lifetime and maybe ever. Europe is not culturally aligned with entrepreneurship, it’s opposed to the idea of employee equity, and our laws reflect that. What bothers me isn’t the rules; it’s the posture that develops from it within people who should know better. Across the system, everyone points at someone else. If a process takes 10 steps, you’ll find 10 people who feel absolved of responsibility because they can cite 9 other blockers. Friction becomes a moral license to do a mediocre job (while lamenting it). The vibe is: “Because the system is slow, I can be slow. Because there are rules, I don’t need judgment. Because there’s risk, I don’t need initiative.” And then we all nod along and nothing moves. There are excellent people here; I’ve worked with them. But they are fighting upstream against a default of low agency. When the process is bad, too many people collapse into it. Communication narrows to the shortest possible message. Friday after 2pm, the notary won’t reply, and the notary surely will blame labor costs or regulation for why service ends there. The bank will cite compliance for why they don’t need to do anything. The registrar will point at some law that allows them to demand a translation of a document by a court-appointed translator. Everyone has a reason. No one owns the outcome. Meanwhile, in the US, our counsel replies when it matters, even after hours. Bankers answer the same day. The instinct is to enable progress, not enumerate reasons you can’t have it.
The goal is the outcome and the rules are constraints to navigate, not a shield to hide behind. So what’s the point? I can’t fix politics. What I can do: act with agency, and surround myself with people who do the same and speak in support of it. Work with those who start from “how do we make this work?” not “why this can’t work.” Name the absurdities without using them as cover. Be transparent, move anyway and tell people. Nothing stops a notary from designing an onboarding flow that gets an Austrian company set up in five days: standardized KYC packets, templated resolutions, scheduled signing slots, clear checklists, async updates, a bias for same-day feedback. That could exist right now. It rarely does or falls short. Yes, much in Europe is objectively worse for builders. We have to accept it. Then squeeze everything you can from what is in your control: Own the handoff. When you’re step 3 of 10, behave like step 10 depends on you and behave like you control all 10 steps. Anticipate blockers further down the line. Move same day. Eliminate ambiguity. Close loops. Default to clarity. Send checklists. Preempt the next two questions. Reduce the number of touches. Model urgency without theatrics. Be calm, fast, and precise. Don’t make your customer chase you. Use judgment. Rules exist and we can’t break them all. But we can work with them and be guided by them. Select for agency. Choose partners who answer promptly when it’s material and who don’t confuse process with progress. The trap is not only regulation. It’s the learned helplessness it breeds. If we let friction set our standards, we become the friction. We won’t legislate our way to a US-style environment anytime soon. But we don’t need permission to be better operators inside a bad one. That’s the contrast and it’s the part we control.
Postscript: Comparing Europe to the US triggers people and I’m conscious of that. Maturity is holding two truths at once: they do some things right and some things wrong and so do we. You don’t win by talking others down or praying for their failure. I’d rather see both Europe and the US succeed than celebrate Europe failing slightly less. And no, saying I feel gratitude and happiness when I get a midnight reply doesn’t make me anti-work-life balance (I am not). It means when something is truly time-critical, fast, clear action lifts everyone. The times someone sent a document in minutes, late at night, both sides felt good about it when it mattered. Responsiveness, used with judgment, is not exploitation; it’s respect for outcomes and the relationships we form.

Armin Ronacher 4 months ago

Building an Agent That Leverages Throwaway Code

In August I wrote about my experiments with replacing MCP ( Model Context Protocol ) with code. In the time since I utilized that idea for exploring non-coding agents at Earendil . And I’m not alone! In the meantime, multiple people have explored this space and I felt it was worth sharing some updated findings. The general idea is pretty simple. Agents are very good at writing code, so why don’t we let them write throw-away code to solve problems that are not related to code at all? I want to show you how and what I’m doing to give you some ideas of what works and why this is much simpler than you might think. The first thing you have to realize is that Pyodide is secretly becoming a pretty big deal for a lot of agentic interactions. What is Pyodide? Pyodide is an open source project that makes a standard Python interpreter available via a WebAssembly runtime. What is neat about it is that it has an installer called micropip that allows it to install dependencies from PyPI. It also targets the emscripten runtime environment, which means there is a pretty good standard Unix setup around the interpreter that you can interact with. Getting Pyodide to run is shockingly simple if you have a Node environment. You can directly install it from npm. What makes this so cool is that you can also interact with the virtual file system, which allows you to create a persistent runtime environment that interacts with the outside world. You can also get hosted Pyodide at this point from a whole bunch of startups, but you can actually get this running on your own machine and infrastructure very easily if you want to. The way I found this to work best is if you banish Pyodide into a web worker. This allows you to interrupt it in case it runs into time limits. A big reason why Pyodide is such a powerful runtime, is because Python has an amazing ecosystem of well established libraries that the models know about. 
From manipulating PDFs or Word documents, to creating images, it’s all there. Another vital ingredient to a code interpreter is having a file system. Not just any file system though. I like to set up a virtual file system that I intercept so that I can provide it with access to remote resources from specific file system locations. For instance, you can have a folder on the file system that exposes files which are just resources that come from your own backend API. If the agent then chooses to read from those files, you can from outside the sandbox make a safe HTTP request to bring that resource into play. The sandbox itself does not have network access, so it’s only the file system that gates access to resources. The reason the file system is so good is that agents just know so much about how file systems work, and you can provide safe access to resources through some external system outside of the sandbox. You can provide read-only access to some resources and write access to others, then access the created artifacts from the outside again. Now actually doing that is a tad tricky because the emscripten file system is sync, and most of the interesting things you can do are async. The option that I ended up going with is to move the fetch-like async logic into another web worker and use to block. If your entire Pyodide runtime is in a web worker, that’s not as bad as it looks. That said, I wish the emscripten file system API was changed to support stack switching instead of this. While it’s now possible to hide async promises behind sync abstractions within Pyodide with call_sync, the same approach does not work for the emscripten JavaScript FS API. I have a full example of this at the end, but the simplified pseudocode that I ended up with looks like this: Lastly, now that you have agents running, you really need durable execution. I would describe durable execution as the idea of being able to retry a complex workflow safely without losing progress.
The reason for this is that agents can take a very long time, and if they interrupt, you want to bring them back to the state they were in. This has become a pretty hot topic. There are a lot of startups in that space and you can buy yourself a tool off the shelf if you want to. What is a little bit disappointing is that there is no truly simple durable execution system. By that I mean something that just runs on top of Postgres and/or Redis in the same way as, for instance, there is pgmq. The easiest way to shoehorn this yourself is to use queues to restart your tasks and to cache away the temporary steps from your execution. Basically, you compose your task from multiple steps and each of the steps just has a very simple cache key. It’s really just that simple: You can improve on this greatly, but this is the general idea. The state is basically the conversation log and whatever else you need to keep around for the tool execution (e.g., whatever was thrown on the file system). What tools does an agent need that are not code? Well, the code needs to be able to do something interesting so you need to give it access to something. The most interesting access you can provide is via the file system, as mentioned. But there are also other tools you might want to expose. What Cloudflare proposed is connecting to MCP servers and exposing their tools to the code interpreter. I think this is a quite interesting approach and to some degree it’s probably where you want to go. Some tools that I find interesting: : a tool that just lets the agent run more inference, mostly with files that the code interpreter generated. For instance if you have a zip file it’s quite fun to see the code interpreter use Python to unpack it. But if then that unpacked file is a jpg, you will need to go back to inference to understand it. : a tool that just … brings up help. Again, can be with inference for basic RAG, or similar. I found it quite interesting to let the AI ask it for help. 
For example, you want the manual tool to allow a query like “Which Python code should I write to create a chart for the given XLSX file?” On the other hand, you can also just stash away some instructions in .md files on the virtual file system and have the code interpreter read them. It’s all an option. If you want to see what this roughly looks like, I vibe-coded a simple version of this together. It uses a made-up example but it does show how a sandbox with very little tool availability can create surprising results: mitsuhiko/mini-agent. When you run it, it looks up the current IP from a special network drive that triggers an async fetch, and then it (usually) uses pillow or matplotlib to make an image of that IP address. Pretty pointless, but a lot of fun! The same approach has also been leveraged by Anthropic and Cloudflare. There is some further reading that might give you more ideas: Claude Skills is fully leveraging code generation for working with documents or other interesting things.
It comes with a (non Open Source) repository of example skills that the LLM and code executor can use: anthropics/skills. Cloudflare’s Code Mode is the idea of creating TypeScript bindings for MCP tools and having the agent write code to use them in a sandbox.
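The idea of a file system that gates all access to outside resources can be modeled in a few lines of Python. This is a conceptual sketch only, not the emscripten FS API and not the code from mitsuhiko/mini-agent: `GatedFileSystem`, its methods, the mount paths, and `fake_backend` are hypothetical, and the "backend" is a local function so the example runs without network access. It shows one read-only mount whose reads are resolved by a callback outside the sandbox, and one writable mount whose artifacts can be collected afterwards.

```python
from typing import Callable, Dict


class GatedFileSystem:
    """Toy model of a virtual file system that is the only door to outside resources."""

    def __init__(self) -> None:
        self._remote: Dict[str, Callable[[str], bytes]] = {}  # read-only mounts
        self._output: Dict[str, Dict[str, bytes]] = {}  # writable mounts

    @staticmethod
    def _norm(prefix: str) -> str:
        return prefix.rstrip("/") + "/"

    def mount_remote(self, prefix: str, fetch: Callable[[str], bytes]) -> None:
        # Reads below this prefix are resolved by a callback that runs outside the sandbox.
        self._remote[self._norm(prefix)] = fetch

    def mount_output(self, prefix: str) -> None:
        # Writes below this prefix are kept so the host can pick up artifacts later.
        self._output[self._norm(prefix)] = {}

    def read(self, path: str) -> bytes:
        for prefix, fetch in self._remote.items():
            if path.startswith(prefix):
                return fetch(path[len(prefix):])
        for prefix, files in self._output.items():
            if path.startswith(prefix) and path[len(prefix):] in files:
                return files[path[len(prefix):]]
        raise FileNotFoundError(path)

    def write(self, path: str, data: bytes) -> None:
        for prefix, files in self._output.items():
            if path.startswith(prefix):
                files[path[len(prefix):]] = data
                return
        raise PermissionError(f"{path} is not writable")

    def artifacts(self, prefix: str) -> Dict[str, bytes]:
        return dict(self._output[self._norm(prefix)])


def fake_backend(name: str) -> bytes:
    # Stand-in for a safe HTTP request made by the host on behalf of the sandbox.
    return f"resource:{name}".encode()


fs = GatedFileSystem()
fs.mount_remote("/network", fake_backend)
fs.mount_output("/output")

ip_text = fs.read("/network/current-ip")  # triggers the host-side fetch
fs.write("/output/report.txt", ip_text.upper())  # artifact the host can collect
print(fs.artifacts("/output"))
```

Because sandboxed code only ever sees paths, the host decides per mount what is readable and what is writable. Writing below `/network` raises `PermissionError` in this sketch.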

Armin Ronacher 5 months ago

90%

“I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code” (Dario Amodei). Three months ago I said that AI changes everything. I came to that after plenty of skepticism. There are still good reasons to doubt that AI will write all code, but my current reality is close. For the infrastructure component I started at my new company, I’m probably north of 90% AI-written code. I don’t want to convince you, just share what I learned. In part, because I approached this project differently from my first experiments with AI-assisted coding. The service is written in Go with few dependencies and an OpenAPI-compatible REST API. At its core, it sends and receives emails. I also generated SDKs for Python and TypeScript with a custom SDK generator. In total: about 40,000 lines, including Go, YAML, Pulumi, and some custom SDK glue. I set a high bar, especially that I can operate it reliably. I’ve run similar systems before and knew what I wanted. Some startups are already near 100% AI-generated. I know, because many build in the open and you can see their code. Whether that works long-term remains to be seen. I still treat every line as my responsibility, judged as if I wrote it myself. AI doesn’t change that. There are no weird files that shouldn’t belong there, no duplicate implementations, and no emojis all over the place. The comments still follow the style I want and, crucially, often aren’t there. I pay close attention to the fundamentals of system architecture, code layout, and database interaction. I’m incredibly opinionated. As a result, there are certain things I don’t let the AI do. I know it won’t reach the point where I could sign off on a commit. That’s why it’s not 100%. As contrast: another quick prototype we built is a mess of unclear database tables, markdown file clutter in the repo, and boatloads of unwanted emojis.
It served its purpose (validating an idea) but wasn’t built to last, and we had no expectation to that end. I began in the traditional way: system design, schema, architecture. At this stage I don’t let the AI write, but I loop it in as a kind of rubber duck. The back-and-forth helps me see mistakes, even if I don’t need or trust the answers. I did get the foundation wrong once. I initially argued myself into a more complex setup than I wanted. That’s a part where I later used the LLM to redo a larger part early and clean it up. For AI-generated or AI-supported code, I now end up with a stack that looks like something I often wanted but that was too hard to do by hand: Raw SQL: This is probably the biggest change to how I used to write code. I really like using an ORM, but I don’t like some of its effects. In particular, once you approach the ORM’s limits, you’re forced to switch to handwritten SQL. That mapping is often tedious because you lose some of the powers the ORM gives you. Another consequence is that it’s very hard to find the underlying queries, which makes debugging harder. Seeing the actual SQL in your code and in the database log is powerful. You always lose that with an ORM. The fact that I no longer have to write SQL because the AI does it for me is a game changer. I also use raw SQL for migrations now. OpenAPI first: I tried various approaches here. There are many frameworks you can use. I ended up first generating the OpenAPI specification and then using code generation from there to the interface layer. This approach works better with AI-generated code. The OpenAPI specification is now the canonical one that both the clients and the server shim are based on. Today I use Claude Code and Codex. Each has strengths, but the constant is Codex for code review after PRs. It’s very good at that. Claude is still indispensable when debugging and needing a lot of tool access (e.g., why do I have a deadlock, why is there corrupted data in the database, etc.).
The two working together is where it’s most magical. Claude might find the data, Codex might understand it better. I cannot stress enough how bad the code from these agents can be if you’re not careful. While they understand system architecture and how to build something, they can’t keep the whole picture in scope. They will recreate things that already exist. They create abstractions that are completely inappropriate for the scale of the problem. You constantly need to learn how to bring the right information to the context. For me, this means pointing the AI to existing implementations and giving it very specific instructions on how to follow along. I generally create PR-sized chunks that I can review. There are two paths to this: Agent loop with finishing touches: Prompt until the result is close, then clean up. Lockstep loop: Earlier I went edit by edit. Now I lean on the first method most of the time, keeping a todo list for cleanups before merge. It requires intuition to know when each approach is more likely to lead to the right results. Familiarity with the agent also helps in understanding when a task will not go anywhere, avoiding wasted cycles. The most important piece of working with an agent is the same as regular software engineering. You need to understand your state machines, how the system behaves at any point in time, your database. It is easy to create systems that appear to behave correctly but have unclear runtime behavior when relying on agents. For instance, the AI doesn’t fully comprehend threading or goroutines. If you don’t keep the bad decisions at bay early, you won’t be able to operate it in a stable manner later. Here’s an example: I asked it to build a rate limiter. It “worked” but lacked jitter and used poor storage decisions. Easy to fix if you know rate limiters, dangerous if you don’t. Agents also operate on conventional wisdom from the internet and in turn do things I would never do myself.
It loves to use dependencies (particularly outdated ones). It loves to swallow errors and take away all tracebacks. I’d rather uphold strong invariants and let code crash loudly when they fail, than hide problems. If you don’t fight this, you end up with opaque, unobservable systems. For me, this has reached the point where I can’t imagine working any other way. Yes, I could probably have done it without AI. But I would have built a different system in parts because I would have made different trade-offs. This way of working unlocks paths I’d normally skip or defer. Here are some of the things I enjoyed a lot on this project: Research + code, instead of research and code later: Some things that would have taken me a day or two to figure out now take 10 to 15 minutes. It allows me to directly play with one or two implementations of a problem. It moves me from abstract contemplation to hands on evaluation. Trying out things: I tried three different OpenAPI implementations and approaches in a day. Constant refactoring: The code looks more organized than it would otherwise have been because the cost of refactoring is quite low. You need to know what you do, but if set up well, refactoring becomes easy. Infrastructure: Claude got me through AWS and Pulumi. Work I generally dislike became a few days instead of weeks. It also debugged the setup issues as it was going through them. I barely had to read the docs. Adopting new patterns: While they suck at writing tests, they turned out great at setting up test infrastructure I didn’t know I needed. I got a recommendation on Twitter to use testcontainers for testing against Postgres. The approach runs migrations once and then creates database clones per test. That turns out to be super useful. It would have been quite an involved project to migrate to. Claude did it in an hour for all tests. SQL quality: It writes solid SQL I could never remember. I just need to review which I can. 
But to this day I suck at remembering and when writing it. Is 90% of code going to be written by AI? I don’t know. What I do know is that, for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way. At the same time, for me, AI doesn’t own the code. I still review every line, shape the architecture, and carry the responsibility for how it runs in production. But the sheer volume of what I now let an agent generate would have been unthinkable even six months ago. That’s why I’m convinced this isn’t some far-off prediction. It’s already here, just unevenly distributed, and the number of developers working like this is only going to grow. That said, none of this removes the need to actually be a good engineer. If you let the AI take over without judgment, you’ll end up with brittle systems and painful surprises (data loss, security holes, unscalable software). The tools are powerful, but they don’t absolve you of responsibility.
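The rate limiter mentioned in this post is not shown, so the following Python sketch is only one reading of what "lacked jitter" could mean: a token bucket that, when it rejects a request, returns a suggested wait time with a random component so that rejected callers do not all retry at the same instant. That interpretation, the in-memory storage, and every name here (`TokenBucket`, `try_acquire`, the `jitter` parameter) are assumptions for illustration. The clock and random source are injectable so the behavior is deterministic in the example.

```python
import random
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket that adds jitter to the suggested retry delay."""

    def __init__(
        self,
        rate: float,  # tokens added per second
        capacity: float,  # maximum burst size
        jitter: float = 0.2,  # up to 20% extra wait, spread randomly
        clock: Callable[[], float] = time.monotonic,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.jitter = jitter
        self.clock = clock
        self.rng = rng
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self) -> float:
        """Return 0.0 if the request is allowed, else a jittered wait in seconds."""
        now = self.clock()
        # Refill based on elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        base_wait = (1.0 - self.tokens) / self.rate
        return base_wait * (1.0 + self.jitter * self.rng())


now = [0.0]  # fake clock so the example does not sleep
bucket = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: now[0], rng=lambda: 0.5)

first = bucket.try_acquire()  # allowed, one token left
second = bucket.try_acquire()  # allowed, bucket empty
wait = bucket.try_acquire()  # rejected, returns a jittered delay
now[0] += 1.0  # one second passes, one token is refilled
after_wait = bucket.try_acquire()  # allowed again
print(first, second, wait, after_wait)
```

A production limiter shared across processes would keep its state in a database or cache rather than in memory. That is where the storage decisions mentioned above come in, and this sketch does not address them.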

Armin Ronacher 5 months ago

What’s a Foreigner?

Across many countries, resistance to immigration is rising; even places with little immigration, like Japan, now see rallies against it. I’m not going to take a side here. I want to examine a simpler question: who do we mean when we say “foreigner”? I would argue there isn’t a universal answer. Laws differ, but so do social definitions. In Vienna, where I live, immigration is visible: roughly half of primary school children don’t speak German at home. Austria makes citizenship hard to obtain. Many people born here aren’t citizens; at the same time, EU citizens living here have broad rights and labor-market access similar to native Austrians. Over my lifetime, the fear of foreigners has shifted: once aimed at nearby Eastern Europeans, it now falls more on people from outside the EU, often framed through religion or culture. Practically, “foreigner” increasingly ends up meaning “non-EU.” Keep in mind that over the last 30 years the EU went from 12 countries to 27. That’s a significant increase in social mobility. I believe this is quite different from what is happening in the United States. The present-day US debate is more tightly tied to citizenship and allegiance, which is partly why current fights there include attempts to narrow who gets citizenship at birth. The worry is less about which foreigners come and more about the terms of becoming American and whether newcomers will embrace what some define as American values. Inside the EU, the concept of EU citizenship changes social reality. Free movement, aligned standards, interoperable social systems, and easier labor mobility make EU citizens feel less “foreign” to each other, despite real frictions. The UK before Brexit was a notable exception: less integrated in visible ways and more hostile to Central and Eastern European workers. Perhaps another sign that the level of integration matters. In practical terms, allegiances are also much less clearly defined in the EU.
There are people who live their entire lives in other EU countries and whose allegiance is no longer clearly aligned to any one country. Legal immigration itself is widely misunderstood. Most systems are both far more restrictive in some areas and far more permissive in others than people assume. On the one hand, what’s called “illegal” is often entirely lawful. Many who are considered “illegal” are legally awaiting pending asylum decisions or are accepted refugees. These are processes many think shouldn’t exist, but they are, in fact, legal. On the other hand, the requirements for non-asylum immigration are very high, and most citizens of a country themselves would not qualify for skilled immigration visas. Meanwhile, the notion that a country could simply “remove all foreigners” runs into practical and ethical dead ends. Mobility pressures aren’t going away; they’re reinforced by universities, corporations, individual employers, demographics, and geopolitics. Citizenship is just a small wrinkle. In Austria, you generally need to pass a modest German exam and renounce your prior citizenship. That creates odd outcomes: native-born non-citizens who speak perfect German but lack a passport, and naturalized citizens who never fully learned the language. Legally clear, socially messy, and not unique to Austria. The high hurdle to obtaining a passport also leads many educated people to intentionally opt out of becoming citizens. The cost that comes with renouncing a passport is not to be underestimated. Where does this leave us? The realities of international mobility leave our current categories of immigration straining and misaligned with what the population at large thinks immigration should look like. Economic anxiety, war, and political polarization are making some groups of foreigners targets, while the deeper drivers behind immigration will only keep intensifying. Perhaps we need to admit that we’re all struggling with these questions.
The person worried about their community or country changing too quickly and the immigrant seeking a better life are both responding to forces larger than themselves. In a world where capital moves freely but most people cannot, where climate change might soon displace millions, and where birth rates are collapsing in wealthy nations, our immigration systems will be tested and stressed, and our current laws and regulations are likely inadequate.

Armin Ronacher 5 months ago

996

“Amazing salary, hackerhouse in SF, crazy equity. 996 . Our mission is OSS.” — Gregor Zunic “The current vibe is no drinking, no drugs, 9-9-6, […].” — Daksh Gupta “The truth is, China’s really doing ‘007’ now—midnight to midnight, seven days a week […] if you want to build a $10 billion company, you have to work seven days a week.” — Harry Stebbings I love work. I love working late nights, hacking on things. This week I didn’t go to sleep before midnight once. And yet… I also love my wife and kids. I love long walks, contemplating life over good coffee, and deep, meaningful conversations. None of this would be possible if my life was defined by 12 hour days, six days a week. More importantly, a successful company is not a sprint, it’s a marathon. And this is when this is your own company! When you devote 72 hours a week to someone else’s startup, you need to really think about that arrangement a few times. I find it highly irresponsible for a founder to promote that model. As a founder, you are not an employee, and your risks and leverage are fundamentally different. I will always advocate for putting the time in because it is what brought me happiness. Intensity, and giving a shit about what I’m doing, will always matter to me. But you don’t measure that by the energy you put in, or the hours you’re sitting in the office, but the output you produce. Burning out on twelve-hour days, six days a week, has no prize at the end. It’s unsustainable, it shouldn’t be the standard and it sure as hell should not be seen as a positive sign of a company. I’ve pulled many all-nighters, and I’ve enjoyed them. I still do. But they’re enjoyable in the right context, for the right reasons, and when that is a completely personal choice, not the basis of company culture. And that all-nighter? It comes with a fucked up and unproductive morning the day after. When someone promotes a 996 work culture, we should push back.

Armin Ronacher 6 months ago

Passkeys and Modern Authentication

There is an ongoing trend in the industry to move people away from username and password towards passkeys . The intentions here are good, and I would assume that this has a significant net benefit for the average consumer. At the same time, the underlying standard has some peculiarities. These enable behaviors by large corporations, employers, and governments that are worth thinking about. One potential source of problems here is the attestation system. It allows the authenticator to provide more information about what it is to the website that you’re authenticating with. In particular it is what tells a website if you have a Yubikey plugged in versus something like 1password. This is the mechanism by which the Austrian government, for instance, prevents you from using an Open Source or any other software-based authenticator to sign in to do your taxes, access medical records or do anything else that is protected by eID . Instead you have to buy a whitelisted hardware token . Attestations themselves are not used by software authenticators today, or anything that syncs. Both Apple and Google do not expose attestation data in their own software authenticators (Keychain and Google Authenticator) for consumer passkeys. However, they will pass through attestation data from hardware tokens just fine. Both of them also, to the best of my knowledge, expose attestation data for enterprises through Mobile Device Management. One could make the argument that it is unlikely that attestation data will be used at scale to create vendor lock-in. However, I’m not sufficiently convinced that this won’t create sub-ecosystems where we see exactly that happening. If for no other reason, this API exists and it has already been used to restrict keys for governmental sign-in systems. One slightly more concerning issue today is that there is effectively no way to export private keys between authentication password managers. 
You need to enroll all of your ecosystems individually into a password manager. An attempt by an open source password manager to reveal private keys to the user was ruled insecure and should not be supported . This taking away agency from the user is not an accident. You can also see this with the passkey export specification which comes with a protocol that, while enabling exports in principle, encourages a system to system transfer that does not hand over the user’s credentials to the user. 1 This might be for good intentions, but it also creates problems. As someone recently trying to leave the Apple ecosystem step by step, I have noticed how many services are now bound to an iCloud-based passkey. Particularly when it comes to Apple, this fear is not entirely unwarranted. Sign-in with Apple using non-shared email addresses makes it very hard to migrate to Android unless you retain an iCloud subscription. Obviously, one could pay for an authenticator like 1Password, which at least is ecosystem independent. However, not everybody is in a situation where they can afford to pay for basic services like password managers. One reason why passkeys are adopted so well today is because it happens automatically for many. I discovered that non-technical family members now all have passkeys for some services, and they did not even notice doing that. A notable example is Amazon. After every sign-in, it attempts to enroll you into a passkey automatically without clear notification. It just brings up the fingerprint prompt, and users will instinctively touch it. If you use different types of devices to authenticate — for instance, a Windows and an iOS device — you may eventually have both authenticators associated. This now covers the devices you already use. However, it can make moving to a completely different ecosystem later much harder. For many years already, people lose access to their Google account every day and can never regain it. 
Google is well known for terminating accounts without stating any reasons. With that comes the loss of access to your data. In this case, you also lose your credentials for third-party websites. There is no legal recourse for this and no mechanism for appeal. You just have to hope that you’re a good citizen and not doing anything that would upset Google’s account flagging systems. As a sufficiently technical person, you might weigh the risks, but others will not. Many years ago, I tried to help another family gain access to their child’s Facebook account after they passed away. Even then, it was a bureaucratic nightmare where there was little support by Facebook to make it happen. There is a real risk that access becomes much harder for families. This is particularly true in situations where someone is incapacitated or dead. The more we move away from basic authentication systems, the worse this becomes. It’s also really inconvenient when you are not on your own devices. Signing into my accounts on my children’s devices has turned from a straightforward process to an incredibly frustrating experience. I find myself juggling all kinds of different apps and flows. Every once in a while, I find myself in a situation where I have very little foundation to build on. This is mostly just because of a hobby. I like to see how things work and build them from scratch. Increasingly, that has become harder. Many username and password authentication schemes have been replaced with OAuth sign-ins over the years. Nowadays, some services are moving towards passkeys, though most places do not enforce these yet. If you want to build an operating system from scratch, or even just build a client yourself, you often find yourself needing to do a lot of yak-shaving. All this work is necessary just to get basic things working. I think this is at least something to be wary of. It doesn’t mean that bad things will necessarily happen, but there is potential for loss of individual agency. 
An accelerated version of this has already played out with email. Accessing your own personal IMAP account at Google has been significantly restricted on security grounds. Getting OAuth credentials that can access someone’s IMAP account with their approval has become increasingly hard, and it is also very costly. Username and password authentication has largely been removed. Even the app-specific passwords on Google are now entirely undocumented, and they are no longer exposed in the settings unless you know the link. 2 I don’t know. I am both a user of passkeys and generally wary of making myself overly dependent on tech giants and complex solutions. I’m noticing an increased reliance on them and a potential loss of access to my own data. This does abstractly concern me. Not to the degree that it changes anything I’m doing, but still. As annoying as managing usernames and passwords was, I don’t think I have ever spent so much time authenticating on a daily basis. The systems that we now need to interface with for authentication are vast and complex. This might just be the path we’re on. However, it is also one where we maybe want to reflect a little bit on whether this is really what we want.
Edit: I reworded the statement about passkey exports to not misrepresent the original comment on GitHub.
1. The details can be debated, but the protocol explicitly does not permit a user to just hold on to a symmetrically encrypted export (or even a plain text one). The best option is the HPKE scheme.
2. This OAuth dependency also puts Open Source projects in an interesting situation. For instance, the Thunderbird client ships with OAuth credentials for Google when you download it from Mozilla. However, if you self-compile it, you don’t have that access.

Armin Ronacher 6 months ago

Your MCP Doesn’t Need 30 Tools: It Needs Code

I wrote a while back about why code performs better than MCP ( Model Context Protocol ) for some tasks. In particular, I pointed out that if you have command line tools available, agentic coding tools seem very happy to use those. In the meantime, I learned a few more things that put some nuance to this. There are a handful of challenges with CLI-based tools that are rather hard to resolve and require further examination. In this blog post, I want to present the (not so novel) idea that an interesting approach is using MCP servers exposing a single tool, that accepts programming code as tool inputs. The first and most obvious challenge with CLI tools is that they are sometimes platform-dependent, version-dependent, and at times undocumented. This has meant that I routinely encounter failures when using tools on first use. A good example of this is when the tool usage requires non-ASCII string inputs. For instance, Sonnet and Opus are both sometimes unsure how to feed newlines or control characters via shell arguments. This is unfortunate but ironically not entirely unique to shell tools either. For instance, when you program with C and compile it, trailing newlines are needed. At times, agentic coding tools really struggle with appending an empty line to the end of a file, and you can find some quite impressive tool loops to work around this issue. This becomes particularly frustrating when your tool is absolutely not in the training set and uses unknown syntax. In that case, getting agents to use it can become quite a frustrating experience. Another issue is that in some agents (Claude Code in particular), there is an extra pass taking place for shell invocations: the security preflight. Before executing a tool, Claude also runs it through the fast Haiku model to determine if the tool will do something dangerous and avoid the invocation. This further slows down tool use when multiple turns are needed. 
In general, doing multiple turns is very hard with CLI tools because you need to teach the agent how to manage sessions. A good example of this is when you ask it to use tmux for remote-controlling an LLDB session . It’s absolutely capable of doing it, but it can lose track of the state of its tmux session. During some tests, I ended up with it renaming the session halfway through, forgetting that it had a session (and thus not killing it). This is particularly frustrating because the failure case can be that it starts from scratch or moves on to other tools just because it got a small detail wrong. Unfortunately, when moving to MCP, you immediately lose the ability to compose without inference (at least today). One of the reasons lldb can be remote-controlled with tmux at all is that the agent manages to compose quite well. How does it do that? It uses basic tmux commands such as to send inputs or to get the output, which don’t require a lot of extra tooling. It then chains commands like and to ensure it doesn’t read output too early. Likewise, when it starts to fail with encoding more complex characters, it sometimes changes its approach and might even use . The command line really isn’t just one tool — it’s a series of tools that can be composed through a programming language: bash. The most interesting uses are when you ask it to write tools that it can reuse later. It will start composing large scripts out of these one-liners. All of that is hard with MCP today. It’s very clear that there are limits to what these shell tools can do. At some point, you start to fight those tools. They are in many ways only as good as their user interface, and some of these user interfaces are just inherently tricky. For instance, when evaluated, tmux performs better than GNU screen , largely because the command-line interface of tmux is better and less error-prone. 
But either way, it requires the agent to maintain a stateful session, and it’s not particularly good at this today. What is stateful out of the box, however, is MCP. One surprisingly useful way of running an MCP server is to make it an MCP server with a single tool (the ubertool) which is just a Python interpreter that runs with retained state . It maintains state in the background and exposes tools that the agent already knows how to use. I did this experiment in a few ways now, the one that is public is . It’s an MCP that exposes a single tool called . It is, however, in many ways a misnomer. It’s not really a tool — it’s a Python interpreter running out of a virtualenv that has installed. What is ? It is the Python port of the ancient command-line tool which allows one to interact with command-line programs through scripts. The documentation describes as a “program that ‘talks’ to other interactive programs according to a script.” What is special about is that it’s old, has a stable API, and has been used all over the place. You could wrap or with lots of different MCP tools like , , , and more. That’s because the class exposes 36 different API functions! That’s a lot. But many of these cannot be used in isolation well anyway. Take this motivating example from the docs: Even the most basic use here involves three chained tool calls. And that doesn’t include error handling, which one might also want to encode. So instead, a much more interesting way to have this entire thing run is to just have the command language to the MCP be Python. The MCP server turns into a stateful Python interpreter, and the tool just lets it send Python code that is evaluated with the same state as before. There is some extra support in the MCP server to make the experience more reliable (like timeout support), but for the most part, the interface is to just send Python code. In fact, the exact script from above is what an MCP client is expected to send. 
The tool description just says this: This works because the interface to the MCP is now not just individual tools it has never seen — it’s a programming language that it understands very well, with additional access to an SDK ( ) that it has also seen and learned all the patterns from. We’re relegating the MCP to do the thing that it does really well: session management and guiding the tool through a built-in prompt. More importantly, the code that it writes is very similar to what it might put into a reusable script. There is so little plumbing in the actual MCP that you can tell the agent after the session to write a reusable pexpect script from what it learned in the session. That works because all the commands it ran are just Python — they’re still in the context, and the lift from that to a reusable Python script is low. Now I don’t want to bore you too much with lots of Claude output, but I took a crashing demo app that Mario wrote and asked it to debug with LLDB through . Here is what that looked like: Afterwards I asked it to dump it into a reusable Python script to be run later: And from a fresh session we can ask it to execute it once more: That again works because the code it writes into the MCP is very close to the code that it would write into a Python script. And the difference is meaningful. The initial debug takes about 45 seconds on my machine and uses about 7 tool calls. The re-run with the dumped playbook takes one tool call and finishes in less than 5 seconds. Most importantly: that script is standalone. I can run it as a human, even without the MCP! Now the above example works beautifully because these models just know so much about . That’s hardly surprising in a way. So how well does this work when the code that it should write is entirely unknown to it? Well, not quite as well. 
However, and this is the key part, because the meta input language is Python, it means that the total surface area that can be exposed from an ubertool is pretty impressive. A general challenge with MCP today is that the more tools you have, the more you’re contributing to context rot. You’re also limited to rather low amounts of input. On the other hand, if you have an MCP that exposes a programming language, it also indirectly exposes a lot of functionality that it knows from its training. For instance, one of the really neat parts about this is that it knows , , , and other stuff. Heck, it even knows about . This means that you can give it very rudimentary instructions about how its sandbox operates and what it might want to do to learn more about what is available to it as needed. You can also tell it in the prompt that there is a function it can run to learn more about what’s available when it needs help! So when you build something that is completely novel, at least the programming language is known. You can, for instance, write a tiny MCP that dumps out the internal state of your application, provides basic query helpers for your database that support your sharding setup, or provides data reading APIs. It will discover all of this anyway from reading the code, but now it can also use a stateful Python or JavaScript session to run these tools and explore more. This is also a fun feature when you want to ask the agent to debug the MCP itself. Because Python and JavaScript are so powerful, you can, for instance, also ask it to debug the MCP’s state itself when something went wrong. The elephant in the room for all things agentic coding is security. Claude mostly doesn’t delete your machine and maybe part of that is the Haiku preflight security check. But isn’t all of this a sham anyway? I generally love to watch how Claude and other agents maneuver their way around protections in pretty creative ways. Clearly it’s potent and prompt-injectable. 
By building an MCP that just runs arbitrary code, we might be getting rid of some of the remaining safety here. But does it matter? We are seemingly okay with it writing code and running tests, which is the same kind of bad as running arbitrary code. I’m sure the day of reckoning will come for all of us, but right now we’re living in this world where protections don’t matter and we can explore what these things can do. I’m honestly not sure how to best protect these things. They are pretty special in that they are just inherently unsafe and impossible to secure. Maybe the way to really protect them would be to intercept every system call and have some sort of policy framework/sandbox around the whole thing. But even in that case, what prevents an ever more clever LLM from circumventing all these things? It has internet access, it can be prompt-injected, and all interfaces we have for them are just too low-level to support protection well. So to some degree, I think the tail risks of code execution are here to stay. But I would argue that they are not dramatically worse when the MCP executes Python code. In this particular case, consider that pexpect itself runs programs. There is little point in securing the MCP if what the MCP can run is any bash command. As interesting as the pexpect case is, that was not my original motivation. What I started to look into is replacing Playwright’s MCP with an MCP that just exposes the Playwright API via JavaScript. This is an experiment I have been running for a while, and the results are somewhat promising but also not promising enough yet. If you want to play with it, the MCP is called “playwrightess” and it is pretty simple. It just lets the agent execute JavaScript code against a sync Playwright client. Same idea. Here, the tool usage is particularly nice because it gets down from ~30 tool definitions to 1. The other thing that is just much nicer about this approach is how many more ways it has to funnel data out.
For instance from both the browser as well as the playwright script are forwarded back to the agent automatically. There is no need for the agent to ask for that information, it comes automatically. It also has a variable that it can use to accumulate extra information between calls which it liberally uses if you for instance ask it to collect data from multiple pages in a pagination. It can do that without any further inference, because the loop happens within JavaScript. Same with — you can easily get it to dump out a script for later that circumvents a lot of MCP calls with something it already saw. Particularly when you are debugging a gnarly issue and you need to restart the debugging more than once, that shows some promise. Does it perform better than Playwright MCP? Not in the current form, but I want to see if this idea can be taken further. It is quite verbose in the scripts that it writes, and it is not really well tuned between screenshots and text extraction.
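To make the "single tool with retained state" idea from this post more concrete, here is a minimal sketch of just the interpreter core, without any MCP transport around it. The class and method names (`StatefulInterpreter`, `run`) are hypothetical and not taken from the author's servers. The sketch only illustrates that successive snippets share one namespace and that output and errors come back as text. It offers no sandboxing, and a real server would add things like the timeout support the post mentions.

```python
import contextlib
import io
import traceback


class StatefulInterpreter:
    """Runs code snippets against one namespace that survives between calls."""

    def __init__(self):
        # Hypothetical design: a single dict acts as the retained session state.
        self.namespace = {"__name__": "__session__"}

    def run(self, code: str) -> str:
        """Execute `code` and return captured stdout plus any traceback text."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(compile(code, "<tool-input>", "exec"), self.namespace)
            except Exception:
                # Report the failure as text so the caller can react to it.
                buffer.write(traceback.format_exc())
        return buffer.getvalue()


if __name__ == "__main__":
    session = StatefulInterpreter()
    session.run("counter = 1")
    # The second call still sees `counter` from the first one.
    print(session.run("counter += 1\nprint(counter)"))
```

Because every call is plain Python evaluated in the same namespace, the sequence of inputs sent during a session is already close to a standalone script, which is the property the post relies on when it asks the agent to dump a reusable playbook.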

Armin Ronacher 7 months ago

In Support Of Shitty Types

You probably know that I love Rust and TypeScript, and I’m a big proponent of good typing systems. One of the reasons I find them useful is that they enable autocomplete, which is generally a good feature. Having a well-integrated type system that makes sense and gives you optimization potential for memory layouts is generally a good idea. From that, you’d naturally think this would also be great for agentic coding tools. There’s clearly some benefit to it. If you have an agent write TypeScript and the agent adds types, it performs well. I don’t know if it outperforms raw JavaScript, but at the very least it doesn’t seem to do any harm. But most agentic tools don’t have access to an LSP (language server protocol). My experiments with agentic coding tools that do have LSP access (with type information available) haven’t meaningfully benefited from it. The LSP protocol slows things down and pollutes the context significantly. Also, the models haven’t been trained sufficiently to understand how to work with this information. Just getting a type check failure from the compiler in text form yields better results. What you end up with is an agent coding loop that, without type checks enabled, results in the agent making forward progress by writing code and putting types somewhere. As long as this compiles to some version of JavaScript (if you use Bun, much of it ends up type-erased), it creates working code. And from there it continues. But that’s bad progress—it’s the type of progress where it needs to come back after and clean up the types. It’s curious because types are obviously being written but they’re largely being ignored. If you do put the type check into the loop, my tests actually showed worse performance. That’s because the agent manages to get the code running, and only after it’s done does it run the type check. Only then, maybe at a much later point, does it realize it made type errors. 
Then it starts fixing them, maybe goes in a loop, and wastes a ton of context. If you make it do the type checks after every single edit, you end up eating even more into the context. This gets really bad when the types themselves are incredibly complicated and non-obvious. TypeScript has arcane expression functionality, and some libraries go overboard with complex constructs (e.g., conditional types ). LLMs have little clue how to read any of this. For instance, if you give it access to the .d.ts files from TanStack Router and the forward declaration stuff it uses for the router system to work properly, it doesn’t understand any of it. It guesses, and sometimes guesses badly. It’s utterly confused. When it runs into type errors, it performs all kinds of manipulations, none of which are helpful. Python typing has an even worse problem, because there we have to work with a very complicated ecosystem where different type checkers cannot even agree on how type checking should work. That means that the LLM, at least from my testing, is not even fully capable of understanding how to resolve type check errors from tools which are not from mypy. It’s not universally bad, but if you actually end up with a complex type checking error that you cannot resolve yourself, it is shocking how the LLM is also often not able to fully figure out what’s going on, or at least needs multiple attempts. As a shining example of types adding a lot of value we have Go. Go’s types are much less expressive and very structural. Things conform to interfaces purely by having certain methods. The LLM does not need to understand much to comprehend that. Also, the types that Go has are rather strictly enforced. If they are wrong, it won’t compile. Because Go has a much simpler type system that doesn’t support complicated constructs, it works much better—both for LLMs to understand the code they produce and for the LLM to understand real-world libraries you might give to an LLM. 
I don’t really know what to do with this, but these behaviors suggest there’s a lot more value in best-effort type systems or type hints like JSDoc. Because at least as far as the LLM is concerned, it doesn’t need to fully understand the types, it just needs to have a rough understanding of what type some object probably is. For the LLM it’s more important that the type name in the error message aligns with the type name in source. I think it’s an interesting question whether this behavior of LLMs today will influence future language design. I don’t know if it will, but I think it gives a lot of credence to some of the decisions that led to languages like Go and Java. As critical as I have been in the past about their rather simple approaches to problems and having a design that maybe doesn’t hold developers in a particularly high regard, I now think that they actually are measurably in a very good spot. There is more elegance to their design than I gave it credit for.
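The structural conformance described above for Go, where things conform to interfaces purely by having certain methods, can be sketched in Python with `typing.Protocol`. This is only an illustration in a different language than the one the text praises, and the names `Greeter`, `English`, `Silent`, and `welcome` are made up for the example.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Greeter(Protocol):
    """Anything with a matching greet() method satisfies this interface."""

    def greet(self, name: str) -> str: ...


class English:
    # No inheritance from Greeter: conformance comes from the method alone.
    def greet(self, name: str) -> str:
        return f"Hello, {name}"


class Silent:
    # Lacks greet(), so it does not satisfy Greeter.
    pass


def welcome(greeter: Greeter, name: str) -> str:
    """Accepts any object that structurally matches Greeter."""
    return greeter.greet(name)


if __name__ == "__main__":
    print(welcome(English(), "Ada"))
    print(isinstance(English(), Greeter), isinstance(Silent(), Greeter))
```

The point of the sketch is how little a reader has to understand: the interface is a list of method names, and the type name in an error message is the same name that appears in the source.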

Armin Ronacher 7 months ago

Agentic Coding Things That Didn’t Work

Using Claude Code and other agentic coding tools has become all the rage. Not only is it getting millions of downloads , but these tools are also gaining features that help streamline workflows. As you know, I got very excited about agentic coding in May, and I’ve tried many of the new features that have been added. I’ve spent considerable time exploring everything on my plate. But oddly enough, very little of what I attempted I ended up sticking with. Most of my attempts didn’t last, and I thought it might be interesting to share what didn’t work. This doesn’t mean these approaches won’t work or are bad ideas; it just means I didn’t manage to make them work. Maybe there’s something to learn from these failures for others. The best way to think about the approach that I use is: Non-working automations turn out to be quite common. Either I can’t get myself to use them, I forget about them, or I end up fine-tuning them endlessly. For me, deleting a failed workflow helper is crucial. You don’t want unused Claude commands cluttering your workspace and confusing others. So I end up doing the simplest thing possible most of the time: just talk to the machine more, give it more context, keep the audio input going, and dump my train of thought into the prompt. And that is 95% of my workflow. The rest might be good use of copy/paste. Slash commands allow you to preload prompts to have them readily available in a session. I expected these to be more useful than they ended up being. I do use them, but many of the ones that I added I ended up never using. There are some limitations with slash commands that make them less useful than they could be. One limitation is that there’s only one way to pass arguments, and it’s unstructured. This proves suboptimal in practice for my uses. Another issue I keep running into with Claude Code is that if you do use a slash command, the argument to the slash command for some reason does not support file-based autocomplete . 
To make them work better, I often ask Claude to use the current Git state to determine which files to operate on. For instance, I have a command in this blog that fixes grammar mistakes. It operates almost entirely from the current git status context because providing filenames explicitly is tedious without autocomplete. Here is one of the few slash commands I actually do use: My workflow now assumes that Claude can determine which files I mean from the Git status virtually every time, making explicit arguments largely unnecessary. Here are some of the many slash commands that I built at one point but ended up not using: So if I’m using fewer slash commands, what am I doing instead? Copy/paste is really, really useful because of how fuzzy LLMs are. For instance, I maintain link collections that I paste in when needed. Sometimes I fetch files proactively, drop them into a git-ignored folder, and mention them. It’s simple, easy, and effective. You still need to be somewhat selective to avoid polluting your context too much, but compared to having it spelunk in the wrong places, more text doesn’t harm as much. I tried hard to make hooks work, but I haven’t seen any efficiency gains from them yet. I think part of the problem is that I use yolo mode. I wish hooks could actually manipulate what gets executed. The only way to guide Claude today is through denies, which don’t work in yolo mode. For instance, I tried using hooks to make it use uv instead of regular Python, but I was unable to do so. Instead, I ended up preloading executables on the PATH that override the default ones, steering Claude toward the right tools. For instance, this is really my hack for making it use instead of more reliably: I really just have a bunch of these in and preload that folder onto before launching Claude: I also found it hard to hook into the right moment. I wish I could run formatters at the end of a long edit session. 
Currently, you must run formatters after each Edit tool operation, which often forces Claude to re-read files, wasting context. Even with the Edit tool hook, I’m not sure if I’m going to keep using it. I’m actually really curious whether people manage to get good use out of hooks. I’ve seen some discussions on Twitter that suggest there are some really good ways of making them work, but I just went with much simpler solutions instead. I was initially very bullish on Claude’s print mode. I tried hard to have Claude generate scripts that used print mode internally. For instance, I had it create a mock data loading script — mostly deterministic code with a small inference component to generate test data using Claude Code. The challenge is achieving reliability, which hasn’t worked well for me yet. Print mode is slow and difficult to debug. So I use it far less than I’d like, despite loving the concept of mostly deterministic scripts with small inference components. Whether using the Claude SDK or the command-line print flag, I haven’t achieved the results I hoped for. I’m drawn to Print Mode because inference is too much like a slot machine. Many programming tasks are actually quite rigid and deterministic. We love linters and formatters because they’re unambiguous. Anything we can fully automate, we should. Using an LLM for tasks that don’t require inference is the wrong approach in my book. That’s what makes print mode appealing. If only it worked better. Use an LLM for the commit message, but regular scripts for the commit and gh pr commands. Make mock data loading 90% deterministic with only 10% inference. I still use it, but I see more potential than I am currently leveraging. I use the task tool frequently for basic parallelization and context isolation. Anthropic recently launched an agents feature meant to streamline this process, but I haven’t found it easier to use. Sub-tasks and sub-agents enable parallelism, but you must be careful. 
Tasks that don’t parallelize well — especially those mixing reads and writes — create chaos. Outside of investigative tasks, I don’t get good results. While sub-agents should preserve context better, I often get better results by starting new sessions, writing thoughts to Markdown files, or even switching to o3 in the chat interface. What’s interesting about workflow automation is that without rigorous rules that you consistently follow as a developer, simply taking time to talk to the machine and give clear instructions outperforms elaborate pre-written prompts. For instance, I don’t use emojis or commit prefixes. I don’t enforce templates for pull requests either. As a result, there’s less structure for me to teach the machine. I also lack the time and motivation to thoroughly evaluate all my created workflows. This prevents me from gaining confidence in their value. Context engineering and management remain major challenges. Despite my efforts to help agents pull the right data from various files and commands, they don’t yet succeed reliably. They pull in too much or too little. Long sessions lead to forgotten context from the beginning. Whether done manually or with slash commands, the results feel too random. It’s hard enough with ad-hoc approaches, but static prompts and commands make it even harder. The rule I have now is that if I do want to automate something, I must have done it a few times already, and then I evaluate whether the agent gets any better results through my automation. There’s no exact science to it, but I mostly measure that right now by letting it do the same task three times and looking at the variance manually as measured by: would I accept the result. Forcing myself to evaluate the automation has another benefit: I’m less likely to just blindly assume it helps me. Because there is a big hidden risk with automation through LLMs: it encourages mental disengagement. 
When you stop thinking like an engineer, quality drops, time gets wasted and you don’t understand and learn. LLMs are already bad enough as they are, but whenever I lean in on automation I notice that it becomes even easier to disengage. I tend to overestimate the agent’s capabilities with time. There are real dragons there! You can still review things as they land, but it becomes increasingly harder to do so later. While LLMs are reducing the cost of refactoring, the cost doesn’t drop to zero, and regressions are common. I only automate things that I do regularly. If I create an automation for something that I do regularly, but then I stop using the automation, I consider it a failed automation and I delete it. : I had a command that instructed Claude to fix bugs by pulling issues from GitHub and adding extra context. But I saw no meaningful improvement over simply mentioning the GitHub issue URL and voicing my thoughts about how to fix it. : I tried getting Claude to write good commit messages, but they never matched my style. I stopped using this command, though I haven’t given up on the idea entirely. : I really hoped this would work. My idea was to have Claude skip tests during development, then use an elaborate reusable prompt to generate them properly at the end. But this approach wasn’t consistently better than automatic test generation, which I’m still not satisfied with overall. : I had a command to fix linting issues and run formatters. I stopped using it because it never became muscle memory, and Claude already knows how to do this. I can just tell it “fix lint” in the CLAUDE.md file without needing a slash command. : I track small items in a to-do.md file and had a command to pull the next item and work on it. Even here, workflow automation didn’t help much. I use this command far less than expected. Speech-to-text. Cannot stress this enough but talking to the machine means you’re more likely to share more about what you want it to do. 
I maintain some basic prompts and context for copy-pasting at the end or the beginning of what I entered.
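The PATH override trick mentioned earlier in this post, where executables placed ahead of the defaults steer the agent toward the preferred tool, could look roughly like the sketch below. It is an illustration under assumptions, not the author's actual script: the shim directory, the message text, and the helper names `install_shim` and `env_with_shims` are hypothetical, and it assumes a POSIX system with `/bin/sh`.

```python
import os
import stat
import subprocess
import tempfile
from pathlib import Path

# Hypothetical shim body: refuse to run and point at the preferred command.
SHIM_TEMPLATE = """#!/bin/sh
echo "{message}" >&2
exit 1
"""


def install_shim(shim_dir: Path, name: str, message: str) -> Path:
    """Write an executable stand-in for `name` into shim_dir."""
    shim_dir.mkdir(parents=True, exist_ok=True)
    shim = shim_dir / name
    shim.write_text(SHIM_TEMPLATE.format(message=message))
    shim.chmod(shim.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return shim


def env_with_shims(shim_dir: Path) -> dict:
    """Return a copy of the environment with shim_dir first on PATH."""
    env = dict(os.environ)
    env["PATH"] = f"{shim_dir}{os.pathsep}{env.get('PATH', '')}"
    return env


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shims = Path(tmp)
        install_shim(shims, "python", "python is disabled here; use 'uv run python' instead")
        # Any process started with this environment resolves `python` to the shim.
        result = subprocess.run(
            ["python", "--version"],
            env=env_with_shims(shims),
            capture_output=True,
            text=True,
        )
        print(result.returncode, result.stderr.strip())
```

The error message doubles as an instruction: when the agent hits the shim, the text on stderr tells it which command to use instead, so no hook or deny rule is needed.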
