Latest Posts (20 found)
Armin Ronacher 1 week ago

Mario and Earendil

Today I’m very happy to share that Mario Zechner is joining Earendil . First things first: I think you should read Mario’s post . This is his news more than it is ours, and he tells his side of it better than I could. What I want to do here is add a more personal note about why this matters so much to me, how the last months led us here, and why I am so excited to have him on board. Last year changed the way many of us thought about software. It certainly changed the way I did. I spent much of 2025 building, probing, and questioning how to build software, and in many more ways what I want to do. If you are a regular reader of this blog you were along for the ride. I wrote a lot, experimented a lot, and tried to get a better sense for what these systems can actually do and what kinds of companies make sense to build around them. There was, and continues to be, a lot of excitement in the air, but also a lot of noise. It has become clear to me that it’s not a question of whether AI systems can be useful but what kind of software and human-machine interactions we want to bring into the world with them. That is one of the reasons I have been so drawn to Mario’s work and approaches. Pi is, in my opinion, one of the most thoughtful coding agents and agent infrastructure libraries in this space. Not because it is trying to be the loudest or the fastest, but because it is clearly built by someone who cares deeply about software quality, taste, extensibility, and design. In a moment where much of the industry is racing to ship ever more quickly, often at the cost of coherence and craft, Mario kept insisting on making something solid. That matters to me a great deal. I have known Mario for a long time, and one of the things I admire most about him is that he does not confuse velocity with progress. He has a strong sense for what good tools should feel like. He cares about details. He cares about whether something is well made. And he cares about building in a way that can last. Mario has been running Pi in a rather unusual way. He exerts back-pressure on the issue tracker and the pull requests through OSS vacations and other means. The last year has also made something else clearer to me: these systems are not only exciting, they are also capable of producing a great deal of damage. Sometimes that damage is obvious; sometimes it looks like low-grade degradation everywhere at once. More slop, more noise, more disingenuous emails in my inbox. There is a version of this future that makes people more distracted, more alienated, and less careful with one another. That is not a future I want to help build. At Earendil, Colin and I have been trying to think very carefully about what a different path might look like. That is a big part of what led us to Lefos . Lefos is our attempt to build a machine entity that is more thoughtful and more deliberate by design. Not an agent whose main purpose is to make everything a little more efficient so that we can produce even more forgettable output, but one that can help people communicate with more care, more clarity, and joy. Good software should not aim to optimize every minute of your life, but should create room for better and more joyful experiences, better relationships, and better ways of relating to one another. Especially in communication and software engineering, I think we should be aiming for more thought rather than more throughput. We should want tools that help people be more considerate, more present, and more human. 
If all we do is use these systems to accelerate the production of slop, we will have missed the opportunity entirely. This is also why Mario joining Earendil feels so meaningful to me. Pi and Lefos come from different starting points. There was a year of collaboration at a distance, but they are animated by a similar instinct: that quality matters, that design matters, and that trust is earned through care rather than captured through hype. I am very happy that Pi is coming along for the ride. Colin and I care a lot about it, and we want to be good stewards of it. It has already played an important role in our own work over the last months, and I continue to believe it is one of the best foundations for building capable agents. We will have more to say soon about how we think about Pi’s future and its relationship to Lefos, but the short version is simple: we want Pi to continue to exist as a high-quality, open, extensible piece of software, and we want to invest in making that future real. As for our thoughts on Pi’s license, read more here and our company post here.

Armin Ronacher 1 week ago

Absurd In Production

About five months ago I wrote about Absurd, a durable execution system we built for our own use at Earendil, sitting entirely on top of Postgres and Postgres alone. The pitch was simple: you don’t need a separate service, a compiler plugin, or an entire runtime to get durable workflows. You need a SQL file and a thin SDK. Since then we’ve been running it in production, and I figured it’s worth sharing what the experience has been like. The short version: the design held up, the system has been a pleasure to work with, and other people seem to agree.

Absurd is a durable execution system that lives entirely inside Postgres. The core is a single SQL file (absurd.sql) that defines stored procedures for task management, checkpoint storage, event handling, and claim-based scheduling. On top of that sit thin SDKs (currently TypeScript, Python and an experimental Go one) that make the system ergonomic in your language of choice. The model is straightforward: you register tasks, decompose them into steps, and each step acts as a checkpoint. If anything fails, the task retries from the last completed step. Tasks can sleep, wait for external events, and suspend for days or weeks. All state lives in Postgres. If you want the full introduction, the original blog post covers the fundamentals. What follows here is what we’ve learned since.

The project got multiple releases over the last five months. Most of the changes are things you’d expect from a system that people actually started depending on: hardened claim handling, watchdogs that terminate broken workers, deadlock prevention, proper lease management, event race conditions, and all the edge cases that only show up when you’re running real workloads. A few things worth calling out specifically.

Decomposed steps. The original design only had , where you pass in a function and get back its checkpointed result. That works well for many cases but not all. Sometimes you need to know whether a step already ran before deciding what to do next. So we added / , which give you a handle you can inspect before committing the result. This turned out to be very useful for modeling intentional failures and conditional logic. This in particular is necessary when working with “before call” and “after call” type hook APIs.

Task results. You can now spawn a task, go do other things, and later come back to fetch or await its result. This sounds obvious in hindsight, but the original system was purely fire-and-forget. Having proper result inspection made it possible to use Absurd for things like spawning child tasks from within a parent workflow and waiting for them to finish. This is particularly useful for debugging with agents too.

absurdctl. We built this out as a proper CLI tool. You can initialize schemas, run migrations, create queues, spawn tasks, emit events, retry failures from the command line. It’s installable via or as a standalone binary. This has been invaluable for debugging production issues. When something is stuck, being able to just and see exactly where it stopped is a very different experience from digging through logs.

Habitat. A small Go application that serves up a web dashboard for monitoring tasks, runs, checkpoints, and events. It connects directly to Postgres and gives you a live view of what’s happening. It’s simple, but it’s the kind of thing that makes the system more enjoyable for humans.
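To tie these pieces together, here is a rough sketch of what a task with checkpointed steps and an awaited child task can look like from the TypeScript SDK’s point of view. The package and method names (registerTask, ctx.step, ctx.sleep, ctx.spawn, awaitResult) are my assumptions for illustration, not a verified copy of the actual Absurd API; check the repository for the real surface.

// Illustrative sketch only: names below are assumptions, not the verified SDK API.
import { Absurd } from "absurd"; // package name is an assumption

const app = new Absurd(process.env.DATABASE_URL!);

// A durable task: every ctx.step() result is checkpointed in Postgres, so a
// crash on a later step replays earlier results from the store instead of
// redoing the work.
app.registerTask("send-welcome", async (params: { userId: string }, ctx: any) => {
  const user = await ctx.step("load-user", () => loadUser(params.userId));

  // Suspend cheaply; the worker can die and another one picks the task up later.
  await ctx.sleep("wait-a-day", 24 * 60 * 60);

  // Spawn a child task and await its result (the "task results" feature above).
  const child = await ctx.spawn("render-email", { template: "welcome", user });
  const rendered = await child.awaitResult();

  await ctx.step("deliver", () => deliverEmail(user.email, rendered));
});

// Stand-ins so the sketch is self-contained.
async function loadUser(id: string) {
  return { id, email: id + "@example.com" };
}
async function deliverEmail(to: string, body: unknown) {
  console.log("would send to", to, body);
}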
Agent integration. Since Absurd was originally built for agent workloads, we added a bundled skill that coding agents can discover and use to debug workflow state via . There’s also a documented pattern for making Pi agent turns durable by logging each message as a checkpoint.

The thing I’m most pleased about is that the core design didn’t need to change all that much. The fundamental model of tasks, steps, checkpoints, events, and suspending is still exactly what it was initially. We added features around it, but nothing forced us to rethink the basic abstractions. Putting the complexity in SQL and keeping the SDKs thin turned out to be a genuinely good call. The TypeScript SDK is about 1,400 lines. The Python SDK is about 1,900, but most of this comes from the complexity of supporting colored functions. Compare that to Temporal’s Python SDK at around 170,000 lines. It means the SDKs are easy to understand, easy to debug, and easy to port. When something goes wrong, you can read the entire SDK in an afternoon and understand what it does.

The checkpoint-based replay model also aged well. Unlike systems that require deterministic replay of your entire workflow function, Absurd just loads the cached step results and skips over completed work. That means your code doesn’t need to be deterministic outside of steps. You can call or in between steps and things still work, because only the step boundaries matter. In practice, this makes it much easier to reason about what’s safe and what isn’t.

Pull-based scheduling was the right choice too. Workers pull tasks from Postgres as they have capacity. There’s no coordinator, no push mechanism, no HTTP callbacks. That makes it trivially self-hostable and means you don’t have to think about load management at the infrastructure level.

I had some discussions with folks about whether the right abstraction should have been a durable promise. It’s a very appealing idea, but it turns out to be much more complex to implement in practice. In theory, however, it is also more powerful. I made some attempts to see what Absurd would look like if it were based on durable promises, but so far I have not gotten anywhere with it. It’s an experiment that I still think would be fun to try!

The primary use case is still agent workflows. An agent is essentially a loop that calls an LLM, processes tool results, and repeats until it decides it’s done. Each iteration becomes a step, and each step’s result is checkpointed. If the process dies on iteration 7, it restarts and replays iterations 1 through 6 from the store, then continues from 7. But we’ve found it useful for a lot of other things too. All our crons just dispatch distributed workflows with a pre-generated deduplication key from the invocation. We can have two cron processes running and they will only trigger one Absurd task invocation. We also use it for background processing that needs to survive deploys. Basically anything where you’d otherwise build your own retry-and-resume logic on top of a queue.

Absurd is deliberately minimal, but there are things I’d like to see. There’s no built-in scheduler. If you want cron-like behavior, you run your own scheduler loop and use idempotency keys to deduplicate. That works, and we have a documented pattern for it, but it would be nice to have something more integrated. There’s no push model. Everything is pull. If you need an HTTP endpoint to receive webhooks and wake up tasks, you build that yourself.
I think that’s the right default as push systems are harder to operate and easier to overwhelm but there are cases where it would be convenient. In particular there are quite a few agentic systems where it would be super nice to have webhooks natively integrated (wake on incoming POST request). I definitely don’t want to have this in the core, but that sounds like the kind of problem that could be a nice adjacent library that builds on top of absurd. The biggest omission is that it does not support partitioning yet. That’s unfortunate because it makes cleaning up data more expensive than it has to be. In theory supporting partitions would be pretty simple. You could have weekly partitions and then detach and delete them when they expire. The only thing that really stands in the way of that is that Postgres does not have a convenient way of actually doing that. The hard part is not partitioning itself, it’s partition lifecycle management under real workloads. If a worker inserts a row whose lands in a month without a partition, the insert fails and the workflow crashes. So you need a separate maintenance loop that always creates future partitions far enough ahead for sleeps/retries, and does that for every queue. On the delete side, the safe approach is , but getting that to run from doesn’t work because it cannot be run within a transaction, but runs everything in one. I don’t think it’s an unsolvable problem, but it’s one I have not found a good solution for and I would love to get input on . This brings me a bit to a meta point on the whole thing which is what the point of Open Source libraries in the age of agentic engineering is. Durable Execution is now something that plenty of startups sell you. On the other hand it’s also something that an agent would build you and people might not even look for solutions any more. It’s kind of … weird? I don’t think a durable execution library can support a company, I really don’t. On the other hand I think it’s just complex enough of a problem that it could be a good Open Source project void of commercial interests. You do need a bit of an ecosystem around it, particularly for UI and good DX for debugging, and that’s hard to get from a throwaway implementation. I don’t think we have squared this yet, but it’s already much better to use than a few months ago. If you’re using Absurd, thinking about it, or building adjacent ideas, I’d love your feedback. Bug reports, rough edges, design critiques, and contributions are all very welcome—this project has gotten better every time someone poked at it from a different angle.

Armin Ronacher 3 weeks ago

Some Things Just Take Time

Trees take quite a while to grow. If someone 50 years ago planted a row of oaks or a chestnut tree on your plot of land, you have something that no amount of money or effort can replicate. The only way is to wait. Tree-lined roads, old gardens, houses sheltered by decades of canopy: if you want to start fresh on an empty plot, you will not be able to get that. Because some things just take time. We know this intuitively. We pay premiums for Swiss watches, Hermès bags and old properties precisely because of the time embedded in them. Either because of the time it took to build them or because of their age. We require age minimums for driving, voting, and drinking because we believe maturity only comes through lived experience. Yet right now we also live in a time of instant gratification, and it’s entering how we build software and companies. As much as we can speed up code generation, the real defining element of a successful company or an Open Source project will continue to be tenacity. The ability of leadership or the maintainers to stick to a problem for years, to build relationships, to work through challenges fundamentally defined by human lifetimes. The current generation of startup founders and programmers is obsessed with speed. Fast iteration, rapid deployment, doing everything as quickly as possible. For many things, that’s fine. You can go fast, leave some quality on the table, and learn something along the way. But there are things where speed is actively harmful, where the friction exists for a reason. Compliance is one of those cases. There’s a strong desire to eliminate everything that processes like SOC2 require, and an entire industry of turnkey solutions has sprung up to help — Delve just being one example, there are more. There’s a feeling that all the things that create friction in your life should be automated away. That human involvement should be replaced by AI-based decision-making. Because it is the friction of the process that is the problem. When in fact many times the friction, or that things just take time, is precisely the point. There’s a reason we have cooling-off periods for some important decisions in one’s life. We recognize that people need time to think about what they’re doing, and that doing something right once doesn’t mean much because you need to be able to do it over a longer period of time. AI writes code fast which isn’t news anymore. What’s interesting is that we’re pushing this force downstream: we seemingly have this desire to ship faster than ever, to run more experiments and that creates a new desire, one to remove all the remaining friction of reviews, designing and configuring infrastructure, anything that slows the pipeline. If the machines are so great, why do we even need checklists or permission systems? Express desire, enjoy result. Because we now believe it is important for us to just do everything faster. But increasingly, I also feel like this means that the shelf life of much of the software being created today — software that people and businesses should depend on — can be measured only in months rather than decades, and the relationships alongside. In one of last year’s earlier YC batches, there was already a handful that just disappeared without even saying what they learned or saying goodbye to their customers. They just shut down their public presence and moved on to other things. And to me, that is not a sign of healthy iteration. That is a sign of breaking the basic trust you need to build a relationship with customers. 
A proper shutdown takes time and effort, and our current environment treats that as time not wisely spent. Better to just move on to the next thing. This is extending to Open Source projects as well. All of a sudden, everything is an Open Source project, but many of them only have commits for a week or so, and then they go away because the motivation of the creator already waned. And in the name of experimentation, that is all good and well, but what makes a good Open Source project is that you think and truly believe that the person that created it is either going to stick with it for a very long period of time, or they are able to set up a strategy for succession, or they have created enough of a community that these projects will stand the test of time in one form or another. Relatedly, I’m also increasingly skeptical of anyone who sells me something that supposedly saves my time. When all that I see is that everybody who is like me, fully onboarded into AI and agentic tools, seemingly has less and less time available because we fall into a trap where we’re immediately filling it with more things. We all sell each other the idea that we’re going to save time, but that is not what’s happening. Any time saved gets immediately captured by competition. Someone who actually takes a breath is outmaneuvered by someone who fills every freed-up hour with new output. There is no easy way to bank the time and it just disappears. I feel this acutely. I’m very close to the red-hot center of where economic activity around AI is taking place, and more than anything, I have less and less time, even when I try to purposefully scale back and create the space. For me this is a problem. It’s a problem because even with the best intentions, I actually find it very hard to create quality when we are quickly commoditizing software, and the machines make it so appealing. I keep coming back to the trees. I’ve been maintaining Open Source projects for close to two decades now. The last startup I worked on, I spent 10 years at. That’s not because I’m particularly disciplined or virtuous. It’s because I or someone else, planted something, and then I kept showing up, and eventually the thing had roots that went deeper than my enthusiasm on any given day. That’s what time does! It turns some idea or plan into a commitment and a commitment into something that can shelter and grow other people. Nobody is going to mass-produce a 50-year-old oak. And nobody is going to conjure trust, or quality, or community out of a weekend sprint. The things I value most — the projects, the relationships, the communities — are all things that took years to become what they are. No tool, no matter how fast, was going to get them there sooner. We recently planted a new tree with Colin. I want it to grow into a large one. I know that’s going to take time, and I’m not in a rush.

Armin Ronacher 1 month ago

AI And The Ship of Theseus

Because code gets cheaper and cheaper to write, this includes re-implementations. I mentioned recently that I had an AI port one of my libraries to another language and it ended up choosing a different design for that implementation. In many ways, the functionality was the same, but the path it took to get there was different. The way that port worked was by going via the test suite. Something related, but different, happened with chardet . The current maintainer reimplemented it from scratch by only pointing it to the API and the test suite. The motivation: enabling relicensing from LGPL to MIT. I personally have a horse in the race here because I too wanted chardet to be under a non-GPL license for many years. So consider me a very biased person in that regard. Unsurprisingly, that new implementation caused a stir. In particular, Mark Pilgrim, the original author of the library, objects to the new implementation and considers it a derived work. The new maintainer, who has maintained it for the last 12 years, considers it a new work and instructs his coding agent to do precisely that. According to author, validating with JPlag, the new implementation is distinct. If you actually consider how it works, that’s not too surprising. It’s significantly faster than the original implementation, supports multiple cores and uses a fundamentally different design. What I think is more interesting about this question is the consequences of where we are. Copyleft code like the GPL heavily depends on copyrights and friction to enforce it. But because it’s fundamentally in the open, with or without tests, you can trivially rewrite it these days. I myself have been intending to do this for a little while now with some other GPL libraries. In particular I started a re-implementation of readline a while ago for similar reasons, because of its GPL license. There is an obvious moral question here, but that isn’t necessarily what I’m interested in. For all the GPL software that might re-emerge as MIT software, so might be proprietary abandonware. For me personally, what is more interesting is that we might not even be able to copyright these creations at all. A court still might rule that all AI-generated code is in the public domain, because there was not enough human input in it. That’s quite possible, though probably not very likely. But this all causes some interesting new developments we are not necessarily ready for. Vercel, for instance, happily re-implemented bash with Clankers but got visibly upset when someone re-implemented Next.js in the same way. There are huge consequences to this. When the cost of generating code goes down that much, and we can re-implement it from test suites alone, what does that mean for the future of software? Will we see a lot of software re-emerging under more permissive licenses? Will we see a lot of proprietary software re-emerging as open source? Will we see a lot of software re-emerging as proprietary? It’s a new world and we have very little idea of how to navigate it. In the interim we will have some fights about copyrights but I have the feeling very few of those will go to court, because everyone involved will actually be somewhat scared of setting a precedent. In the GPL case, though, I think it warms up some old fights about copyleft vs permissive licenses that we have not seen in a long time. It probably does not feel great to have one’s work rewritten with a Clanker and one’s authorship eradicated. 
Unlike the Ship of Theseus , though, this seems more clear-cut: if you throw away all code and start from scratch, even if the end result behaves the same, it’s a new ship. It only continues to carry the name. Which may be another argument for why authors should hold on to trademarks rather than rely on licenses and contract law. I personally think all of this is exciting. I’m a strong supporter of putting things in the open with as little license enforcement as possible. I think society is better off when we share, and I consider the GPL to run against that spirit by restricting what can be done with it. This development plays into my worldview. I understand, though, that not everyone shares that view, and I expect more fights over the emergence of slopforks as a result. After all, it combines two very heated topics, licensing and AI, in the worst possible way.

Armin Ronacher 2 months ago

The Final Bottleneck

Historically, writing code was slower than reviewing code. It might not have felt that way, because code reviews sat in queues until someone got around to picking it up. But if you compare the actual acts themselves, creation was usually the more expensive part. In teams where people both wrote and reviewed code, it never felt like “we should probably program slower.” So when more and more people tell me they no longer know what code is in their own codebase, I feel like something is very wrong here and it’s time to reflect. Software engineers often believe that if we make the bathtub bigger , overflow disappears. It doesn’t. OpenClaw right now has north of 2,500 pull requests open. That’s a big bathtub. Anyone who has worked with queues knows this: if input grows faster than throughput, you have an accumulating failure. At that point, backpressure and load shedding are the only things that retain a system that can still operate. If you have ever been in a Starbucks overwhelmed by mobile orders, you know the feeling. The in-store experience breaks down. You no longer know how many orders are ahead of you. There is no clear line, no reliable wait estimate, and often no real cancellation path unless you escalate and make noise. That is what many AI-adjacent open source projects feel like right now. And increasingly, that is what a lot of internal company projects feel like in “AI-first” engineering teams, and that’s not sustainable. You can’t triage, you can’t review, and many of the PRs cannot be merged after a certain point because they are too far out of date. And the creator might have lost the motivation to actually get it merged. There is huge excitement about newfound delivery speed, but in private conversations, I keep hearing the same second sentence: people are also confused about how to keep up with the pace they themselves created. Humanity has been here before. Many times over. We already talk about the Luddites a lot in the context of AI, but it’s interesting to see what led up to it. Mark Cartwright wrote a great article about the textile industry in Britain during the industrial revolution. At its core was a simple idea: whenever a bottleneck was removed, innovation happened downstream from that. Weaving sped up? Yarn became the constraint. Faster spinning? Fibre needed to be improved to support the new speeds until finally the demand for cotton went up and that had to be automated too. We saw the same thing in shipping that led to modern automated ports and containerization. As software engineers we have been here too. Assembly did not scale to larger engineering teams, and we had to invent higher level languages. A lot of what programming languages and software development frameworks did was allow us to write code faster and to scale to larger code bases. What it did not do up to this point was take away the core skill of engineering. While it’s definitely easier to write C than assembly, many of the core problems are the same. Memory latency still matters, physics are still our ultimate bottleneck, algorithmic complexity still makes or breaks software at scale. When one part of the pipeline becomes dramatically faster, you need to throttle input. Pi is a great example of this. PRs are auto closed unless people are trusted. It takes OSS vacations . That’s one option: you just throttle the inflow. You push against your newfound powers until you can handle them. But what if the speed continues to increase? What downstream of writing code do we have to speed up? 
Sure, the pull request review clearly turns into the bottleneck. But it cannot really be automated. If the machine writes the code, the machine better review the code at the same time. So what ultimately comes up for human review would already have passed the most critical possible review of the most capable machine. What else is in the way? If we continue with the fundamental belief that machines cannot be accountable, then humans need to be able to understand the output of the machine. And the machine will ship relentlessly. Support tickets of customers will go straight to machines to implement improvements and fixes, for other machines to review, for humans to rubber stamp in the morning. A lot of this sounds both unappealing and reminiscent of the textile industry. The individual weaver no longer carried responsibility for a bad piece of cloth. If it was bad, it became the responsibility of the factory as a whole and it was just replaced outright. As we’re entering the phase of single-use plastic software, we might be moving the whole layer of responsibility elsewhere. But to me it still feels different. Maybe that’s because my lowly brain can’t comprehend the change we are going through, and future generations will just laugh about our challenges. It feels different to me, because what I see taking place in some Open Source projects, in some companies and teams feels deeply wrong and unsustainable. Even Steve Yegge himself now casts doubts about the sustainability of the ever-increasing pace of code creation. So what if we need to give in? What if we need to pave the way for this new type of engineering to become the standard? What affordances will we have to create to make it work? I for one do not know. I’m looking at this with fascination and bewilderment and trying to make sense of it. Because it is not the final bottleneck. We will find ways to take responsibility for what we ship, because society will demand it. Non-sentient machines will never be able to carry responsibility, and it looks like we will need to deal with this problem before machines achieve this status. Regardless of how bizarre they appear to act already. I too am the bottleneck now . But you know what? Two years ago, I too was the bottleneck. I was the bottleneck all along. The machine did not really change that. And for as long as I carry responsibilities and am accountable, this will remain true. If we manage to push accountability upwards, it might change, but so far, how that would happen is not clear.

Armin Ronacher 2 months ago

A Language For Agents

Last year I first started thinking about what the future of programming languages might look like now that agentic engineering is a growing thing. Initially I felt that the enormous corpus of pre-existing code would cement existing languages in place but now I’m starting to think the opposite is true. Here I want to outline my thinking on why we are going to see more new programming languages and why there is quite a bit of space for interesting innovation. And just in case someone wants to start building one, here are some of my thoughts on what we should aim for!

Does an agent perform dramatically better on a language that it has in its weights? Obviously yes. But there are less obvious factors that affect how good an agent is at programming in a language: how good the tooling around it is and how much churn there is. Zig seems underrepresented in the weights (at least in the models I’ve used) and also changing quickly. That combination is not optimal, but it’s still passable: you can program even in the upcoming Zig version if you point the agent at the right documentation. But it’s not great. On the other hand, some languages are well represented in the weights but agents still don’t succeed as much because of tooling choices. Swift is a good example: in my experience the tooling around building a Mac or iOS application can be so painful that agents struggle to navigate it. Also not great. So, just because it exists doesn’t mean the agent succeeds and just because it’s new also doesn’t mean that the agent is going to struggle. I’m convinced that you can build yourself up to a new language if you don’t want to depart everywhere all at once.

The biggest reason new languages might work is that the cost of coding is going down dramatically. The result is that the breadth of an ecosystem matters less. I’m now routinely reaching for JavaScript in places where I would have used Python. Not because I love it or the ecosystem is better, but because the agent does much better with TypeScript. The way to think about this: if important functionality is missing in my language of choice, I just point the agent at a library from a different language and have it build a port. As a concrete example, I recently built an Ethernet driver in JavaScript to implement the host controller for our sandbox. Implementations exist in Rust, C, and Go, but I wanted something pluggable and customizable in JavaScript. It was easier to have the agent reimplement it than to make the build system and distribution work against a native binding. New languages will work if their value proposition is strong enough and they evolve with knowledge of how LLMs train. People will adopt them despite being underrepresented in the weights. And if they are designed to work well with agents, then they might be designed around familiar syntax that is already known to work well.

So why would we want a new language at all? The reason this is interesting to think about is that many of today’s languages were designed with the assumption that punching keys is laborious, so we traded certain things for brevity. As an example, many languages — particularly modern ones — lean heavily on type inference so that you don’t have to write out types. The downside is that you now need an LSP or the resulting compiler error messages to figure out what the type of an expression is. Agents struggle with this too, and it’s also frustrating in pull request review where complex operations can make it very hard to figure out what the types actually are.
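As a small, generic TypeScript illustration of that difference (my own example, not code from any particular project): with heavy inference the reviewer has to reconstruct types in their head, while explicit annotations put the contract into the text itself.

// Generic illustration: the inferred version is fine for the compiler, but a
// reviewer or an agent without a running LSP cannot see from the text alone
// what actually flows where.
interface OrderRow { customer: string; total: number; }

async function loadOrders(region: string): Promise<OrderRow[]> {
  return [{ customer: region, total: 42 }, { customer: region, total: 7 }];
}

function groupBy<T, K>(items: T[], key: (item: T) => K): Map<K, T[]> {
  const out = new Map<K, T[]>();
  for (const item of items) {
    const k = key(item);
    out.set(k, [...(out.get(k) ?? []), item]);
  }
  return out;
}

// Inference-heavy: what are rows and grouped? You need tooling to know.
async function totalsInferred(region: string) {
  const rows = await loadOrders(region);
  const grouped = groupBy(rows, (r) => r.customer);
  return grouped;
}

// Explicit: the contract is visible in the diff itself, no LSP required.
async function totalsExplicit(region: string): Promise<Map<string, OrderRow[]>> {
  const rows: OrderRow[] = await loadOrders(region);
  const grouped: Map<string, OrderRow[]> = groupBy(rows, (r) => r.customer);
  return grouped;
}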
Fully dynamic languages are even worse in that regard. The cost of writing code is going down, but because we are also producing more of it, understanding what the code does is becoming more important. We might actually want more code to be written if it means there is less ambiguity when we perform a review. I also want to point out that we are heading towards a world where some code is never seen by a human and is only consumed by machines. Even in that case, we still want to give an indication to a user, who is potentially a non-programmer, about what is going on. We want to be able to explain to a user what the code will do without going into the details of how. So the case for a new language comes down to: given the fundamental changes in who is programming and what the cost of code is, we should at least consider one.

It’s tricky to say what an agent wants because agents will lie to you and they are influenced by all the code they’ve seen. But one way to estimate how they are doing is to look at how many changes they have to perform on files and how many iterations they need for common tasks. There are some things I’ve found that I think will be true for a while.

The language server protocol lets an IDE infer information about what’s under the cursor or what should be autocompleted based on semantic knowledge of the codebase. It’s a great system, but it comes at one specific cost that is tricky for agents: the LSP has to be running. There are situations when an agent just won’t run the LSP — not because of technical limitations, but because it’s also lazy and will skip that step if it doesn’t have to. If you give it an example from documentation, there is no easy way to run the LSP because it’s a snippet that might not even be complete. If you point it at a GitHub repository and it pulls down individual files, it will just look at the code. It won’t set up an LSP for type information. A language that doesn’t split into two separate experiences (with-LSP and without-LSP) will be beneficial to agents because it gives them one unified way of working across many more situations.

It pains me as a Python developer to say this, but whitespace-based indentation is a problem. The underlying token efficiency of getting whitespace right is tricky, and a language with significant whitespace is harder for an LLM to work with. This is particularly noticeable if you try to make an LLM do surgical changes without an assisted tool. Quite often they will intentionally disregard whitespace, add markers to enable or disable code and then rely on a code formatter to clean up indentation later. On the other hand, braces that are not separated by whitespace can cause issues too. Depending on the tokenizer, runs of closing parentheses can end up split into tokens in surprising ways (a bit like the “strawberry” counting problem), and it’s easy for an LLM to get Lisp or Scheme wrong because it loses track of how many closing parentheses it has already emitted or is looking at. Fixable with future LLMs? Sure, but also something that was hard for humans to get right too without tooling.

Readers of this blog might know that I’m a huge believer in async locals and flow execution context — basically the ability to carry data through every invocation that might only be needed many layers down the call chain. Working at an observability company has really driven home the importance of this for me. The challenge is that anything that flows implicitly might not be configured. Take for instance the current time. You might want to implicitly pass a timer to all functions. But what if a timer is not configured and all of a sudden a new dependency appears? Passing all of it explicitly is tedious for both humans and agents and bad shortcuts will be made. One thing I’ve experimented with is having effect markers on functions that are added through a code formatting step. A function can declare that it needs the current time or the database, but if it doesn’t mark this explicitly, it’s essentially a linting warning that auto-formatting fixes. The LLM can start using something like the current time in a function and any existing caller gets the warning; formatting propagates the annotation. This is nice because when the LLM builds a test, it can precisely mock out these side effects — it understands from the error messages what it has to supply.
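For instance, here is a rough sketch of the idea in TypeScript. No current language or formatter does this as described; the capability-style context parameter below stands in for the hypothetical effect marker, and the propagate-on-format behavior exists only in the comments.

// Illustrative only: no existing language or formatter does exactly this.
// The idea: a function declares which ambient effects it uses, a linter flags
// callers that have not declared them, and a formatter propagates the marker.
// Here the marker is approximated by a capability-style context parameter.
interface Clock { now(): Date; }
interface Db { query(sql: string): Promise<unknown[]>; }

// "uses clock" in the hypothetical language; in TypeScript it is just a typed
// parameter, which already gives the same precision for tests.
function invoiceIsOverdue(ctx: { clock: Clock }, dueDate: Date): boolean {
  return ctx.clock.now().getTime() > dueDate.getTime();
}

// If this caller forgot to declare clock/db, the hypothetical formatter would
// add the marker for it; here the type checker produces the equivalent error.
async function remindOverdue(ctx: { clock: Clock; db: Db }): Promise<number> {
  const rows = (await ctx.db.query("select due_date from invoices")) as { due_date: string }[];
  return rows.filter((r) => invoiceIsOverdue(ctx, new Date(r.due_date))).length;
}

// In a test, the declared effects can be mocked precisely.
const fakeCtx = {
  clock: { now: () => new Date("2030-01-01") },
  db: { query: async () => [{ due_date: "2029-12-31" }] },
};
remindOverdue(fakeCtx).then((count) => console.log(count)); // 1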
Agents struggle with exceptions; they are afraid of them. I’m not sure to what degree this is solvable with RL (Reinforcement Learning), but right now agents will try to catch everything they can, log it, and do a pretty poor recovery. Given how little information is actually available about error paths, that makes sense. Checked exceptions are one approach, but they propagate all the way up the call chain and don’t dramatically improve things. Even if they end up as hints where a linter tracks which errors can fly by, there are still many call sites that need adjusting. And like the auto-propagation proposed for context data, it might not be the right solution. Maybe the right approach is to go more in on typed results (a small sketch follows below), but that’s still tricky for composability without a type and object system that supports it.

The general approach agents use today to read files into memory is line-based, which means they often pick chunks that span multi-line strings. One easy way to see this fall apart: have an agent work on a 2000-line file that also contains long embedded code strings — basically a code generator. The agent will sometimes edit within a multi-line string assuming it’s the real code when it’s actually just embedded code in a multi-line string. For multi-line strings, the only language I’m aware of with a good solution is Zig, but its prefix-based syntax is pretty foreign to most people.

Reformatting also often causes constructs to move to different lines. In many languages, trailing commas in lists are either not supported (JSON) or not customary. If you want diff stability, you’d aim for a syntax that requires less reformatting and mostly avoids multi-line constructs.

What’s really nice about Go is that you mostly cannot import symbols from another package into scope without every use being prefixed with the package name. Eg: instead of . There are escape hatches (import aliases and dot-imports), but they’re relatively rare and usually frowned upon. That dramatically helps an agent understand what it’s looking at. In general, making code findable through the most basic tools is great — it works with external files that aren’t indexed, and it means fewer false positives for large-scale automation driven by code generated on the fly (eg: , invocations).

Much of what I’ve said boils down to: agents really like local reasoning. They want it to work in parts because they often work with just a few loaded files in context and don’t have much spatial awareness of the codebase. They rely on external tooling like grep to find things, and anything that’s hard to grep or that hides information elsewhere is tricky.
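Circling back to the typed-results point above, here is a minimal TypeScript sketch. The Result type and the error shapes are ad hoc illustrations of mine, and the last function shows the composability friction: chaining two results already needs manual unwrapping or a helper.

// Ad hoc illustration of typed results: errors become ordinary values that are
// visible in the signature, instead of invisible exception paths.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

type ParseFailure = { kind: "parse"; message: string };
type BoundsFailure = { kind: "bounds"; max: number };

function parsePort(raw: string): Result<number, ParseFailure> {
  const n = Number(raw);
  return Number.isInteger(n)
    ? { ok: true, value: n }
    : { ok: false, error: { kind: "parse", message: "not an integer: " + raw } };
}

function checkPort(n: number): Result<number, BoundsFailure> {
  return n > 0 && n <= 65535
    ? { ok: true, value: n }
    : { ok: false, error: { kind: "bounds", max: 65535 } };
}

// The composability problem: combining two Results already requires manual
// unwrapping (or a helper like andThen), which is what a language-level type
// and object system would have to make pleasant.
function portFromEnv(raw: string): Result<number, ParseFailure | BoundsFailure> {
  const parsed = parsePort(raw);
  if (!parsed.ok) return parsed;
  return checkPort(parsed.value);
}

console.log(portFromEnv("8080")); // { ok: true, value: 8080 }
console.log(portFromEnv("nope")); // { ok: false, error: { kind: "parse", ... } }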
What makes agents fail or succeed in many languages is just how good the build tools are. Many languages make it very hard to determine what actually needs to rebuild or be retested because there are too many cross-references. Go is really good here: it forbids circular dependencies between packages (import cycles), packages have a clear layout, and test results are cached. Agents often struggle with macros. It was already pretty clear that humans struggle with macros too, but the argument for them was mostly that code generation was a good way to have less code to write. Since that is less of a concern now, we should aim for languages with less dependence on macros. There’s a separate question about generics and comptime . I think they fare somewhat better because they mostly generate the same structure with different placeholders and it’s much easier for an agent to understand that. Related to greppability: agents often struggle to understand barrel files and they don’t like them. Not being able to quickly figure out where a class or function comes from leads to imports from the wrong place, or missing things entirely and wasting context by reading too many files. A one-to-one mapping from where something is declared to where it’s imported from is great. And it does not have to be overly strict either. Go kind of goes this way, but not too extreme. Any file within a directory can define a function, which isn’t optimal, but it’s quick enough to find and you don’t need to search too far. It works because packages are forced to be small enough to find everything with grep. The worst case is free re-exports all over the place that completely decouple the implementation from any trivially reconstructable location on disk. Or worse: aliasing. Agents often hate it when aliases are involved. In fact, you can get them to even complain about it in thinking blocks if you let them refactor something that uses lots of aliases. Ideally a language encourages good naming and discourages aliasing at import time as a result. Nobody likes flaky tests, but agents even less so. Ironic given how particularly good agents are at creating flaky tests in the first place. That’s because agents currently love to mock and most languages do not support mocking well. So many tests end up accidentally not being concurrency safe or depend on development environment state that then diverges in CI or production. Most programming languages and frameworks make it much easier to write flaky tests than non-flaky ones. That’s because they encourage indeterminism everywhere. In an ideal world the agent has one command, that lints and compiles and it tells the agent if all worked out fine. Maybe another command to run all tests that need running. In practice most environments don’t work like this. For instance in TypeScript you can often run the code even though it fails type checks . That can gaslight the agent. Likewise different bundler setups can cause one thing to succeed just for a slightly different setup in CI to fail later. The more uniform the tooling the better. Ideally it either runs or doesn’t and there is mechanical fixing for as many linting failures as possible so that the agent does not have to do it by hand. I think we will. We are writing more software now than we ever have — more websites, more open source projects, more of everything. Even if the ratio of new languages stays the same, the absolute number will go up. 
But I also truly believe that many more people will be willing to rethink the foundations of software engineering and the languages we work with. That’s because while for some years it has felt you need to build a lot of infrastructure for a language to take off, now you can target a rather narrow use case: make sure the agent is happy and extend from there to the human. I just hope we see two things. First, some outsider art: people who haven’t built languages before trying their hand at it and showing us new things. Second, a much more deliberate effort to document what works and what doesn’t from first principles. We have actually learned a lot about what makes good languages and how to scale software engineering to large teams. Yet, finding it written down, as a consumable overview of good and bad language design, is very hard to come by. Too much of it has been shaped by opinion on rather pointless things instead of hard facts. Now though, we are slowly getting to the point where facts matter more, because you can actually measure what works by seeing how well agents perform with it. No human wants to be subject to surveys, but agents don’t care . We can see how successful they are and where they are struggling.

Armin Ronacher 2 months ago

Pi: The Minimal Agent Within OpenClaw

If you haven’t been living under a rock, you will have noticed this week that a project of my friend Peter went viral on the internet. It went by many names. The most recent one is OpenClaw but in the news you might have encountered it as ClawdBot or MoltBot depending on when you read about it. It is an agent connected to a communication channel of your choice that just runs code. What you might be less familiar with is that what’s under the hood of OpenClaw is a little coding agent called Pi. And Pi happens to be, at this point, the coding agent that I use almost exclusively. Over the last few weeks I became more and more of a shill for the little agent. After I gave a talk on this recently, I realized that I did not actually write about Pi on this blog yet, so I feel like I might want to give some context on why I’m obsessed with it, and how it relates to OpenClaw.

Pi is written by Mario Zechner and unlike Peter, who aims for “sci-fi with a touch of madness,” 1 Mario is very grounded. Despite the differences in approach, both OpenClaw and Pi follow the same idea: LLMs are really good at writing and running code, so embrace this. In some ways I think that’s not an accident because Peter got me and Mario hooked on this idea, and agents, last year.

So Pi is a coding agent. And there are many coding agents. Really, I think you can pick effectively any one of them off the shelf at this point and you will be able to experience what it’s like to do agentic programming. In reviews on this blog I’ve positively talked about AMP and one of the reasons I resonated so much with AMP is that it really felt like a product built by people who not only got addicted to agentic programming but also tried a few different things to see which ones work, rather than just building a fancy UI around it.

Pi is interesting to me for two main reasons. First, it has a tiny core: it has the shortest system prompt of any agent that I’m aware of and it only has four tools: Read, Write, Edit, Bash. Second, it makes up for its tiny core by providing an extension system that also allows extensions to persist state into sessions, which is incredibly powerful. And a little bonus: Pi itself is written like excellent software. It doesn’t flicker, it doesn’t consume a lot of memory, it doesn’t randomly break, it is very reliable and it is written by someone who takes great care of what goes into the software.

Pi also is a collection of little components that you can build your own agent on top of. That’s how OpenClaw is built, and that’s also how I built my own little Telegram bot and how Mario built his mom. If you want to build your own agent, connected to something, Pi, when pointed to itself and mom, will conjure one up for you.

And in order to understand what’s in Pi, it’s even more important to understand what’s not in Pi, why it’s not in Pi and more importantly: why it won’t be in Pi. The most obvious omission is support for MCP. There is no MCP support in it. While you could build an extension for it, you can also do what OpenClaw does to support MCP which is to use mcporter. mcporter exposes MCP calls via a CLI interface or TypeScript bindings and maybe your agent can do something with it. Or not, I don’t know :) And this is not a lazy omission. It comes from the philosophy of how Pi works. Pi’s entire idea is that if you want the agent to do something that it doesn’t do yet, you don’t go and download an extension or a skill or something like this. You ask the agent to extend itself. It celebrates the idea of writing and running code. That’s not to say that you cannot download extensions. It is very much supported.
But instead of necessarily encouraging you to download someone else’s extension, you can also point your agent to an already existing extension, say like, build it like the thing you see over there, but make these changes to it that you like. When you look at what Pi and by extension OpenClaw are doing, there is an example of software that is malleable like clay. And this sets certain requirements for the underlying architecture of it that are actually in many ways setting certain constraints on the system that really need to go into the core design. So for instance, Pi’s underlying AI SDK is written so that a session can really contain many different messages from many different model providers. It recognizes that the portability of sessions is somewhat limited between model providers and so it doesn’t lean in too much into any model-provider-specific feature set that cannot be transferred to another. The second is that in addition to the model messages it maintains custom messages in the session files which can be used by extensions to store state or by the system itself to maintain information that either not at all is sent to the AI or only parts of it. Because this system exists and extension state can also be persisted to disk, it has built-in hot reloading so that the agent can write code, reload, test it and go in a loop until your extension actually is functional. It also ships with documentation and examples that the agent itself can use to extend itself. Even better: sessions in Pi are trees. You can branch and navigate within a session which opens up all kinds of interesting opportunities such as enabling workflows for making a side-quest to fix a broken agent tool without wasting context in the main session. After the tool is fixed, I can rewind the session back to earlier and Pi summarizes what has happened on the other branch. This all matters because for instance if you consider how MCP works, on most model providers, tools for MCP, like any tool for the LLM, need to be loaded into the system context or the tool section thereof on session start. That makes it very hard to impossible to fully reload what tools can do without trashing the complete cache or confusing the AI about how prior invocations work differently. An extension in Pi can register a tool to be available to the LLM to call and every once in a while I find this useful. For instance, despite my criticism of how Beads is implemented, I do think that giving an agent access to a to-do list is a very useful thing. And I do use an agent-specific issue tracker that works locally that I had my agent build itself. And because I wanted the agent to also manage to-dos, in this particular case I decided to give it a tool rather than a CLI. It felt appropriate for the scope of the problem and it is currently the only additional tool that I’m loading into my context. But for the most part all of what I’m adding to my agent are either skills or TUI extensions to make working with the agent more enjoyable for me. Beyond slash commands, Pi extensions can render custom TUI components directly in the terminal: spinners, progress bars, interactive file pickers, data tables, preview panes. The TUI is flexible enough that Mario proved you can run Doom in it . Not practical, but if you can run Doom, you can certainly build a useful dashboard or debugging interface. I want to highlight some of my extensions to give you an idea of what’s possible. 
While you can use them unmodified, the whole idea really is that you point your agent to one and remix it to your heart’s content. I don’t use plan mode . I encourage the agent to ask questions and there’s a productive back and forth. But I don’t like structured question dialogs that happen if you give the agent a question tool. I prefer the agent’s natural prose with explanations and diagrams interspersed. The problem: answering questions inline gets messy. So reads the agent’s last response, extracts all the questions, and reformats them into a nice input box. Even though I criticize Beads for its implementation, giving an agent a to-do list is genuinely useful. The command brings up all items stored in as markdown files. Both the agent and I can manipulate them, and sessions can claim tasks to mark them as in progress. As more code is written by agents, it makes little sense to throw unfinished work at humans before an agent has reviewed it first. Because Pi sessions are trees, I can branch into a fresh review context, get findings, then bring fixes back to the main session. The UI is modeled after Codex which provides easy to review commits, diffs, uncommitted changes, or remote PRs. The prompt pays attention to things I care about so I get the call-outs I want (eg: I ask it to call out newly added dependencies.) An extension I experiment with but don’t actively use. It lets one Pi agent send prompts to another. It is a simple multi-agent system without complex orchestration which is useful for experimentation. Lists all files changed or referenced in the session. You can reveal them in Finder, diff in VS Code, quick-look them, or reference them in your prompt. quick-looks the most recently mentioned file which is handy when the agent produces a PDF. Others have built extensions too: Nico’s subagent extension and interactive-shell which lets Pi autonomously run interactive CLIs in an observable TUI overlay. These are all just ideas of what you can do with your agent. The point of it mostly is that none of this was written by me, it was created by the agent to my specifications. I told Pi to make an extension and it did. There is no MCP, there are no community skills, nothing. Don’t get me wrong, I use tons of skills. But they are hand-crafted by my clanker and not downloaded from anywhere. For instance I fully replaced all my CLIs or MCPs for browser automation with a skill that just uses CDP . Not because the alternatives don’t work, or are bad, but because this is just easy and natural. The agent maintains its own functionality. My agent has quite a few skills and crucially I throw skills away if I don’t need them. I for instance gave it a skill to read Pi sessions that other engineers shared, which helps with code review. Or I have a skill to help the agent craft the commit messages and commit behavior I want, and how to update changelogs. These were originally slash commands, but I’m currently migrating them to skills to see if this works equally well. I also have a skill that hopefully helps Pi use rather than , but I also added a custom extension to intercept calls to and to redirect them to instead. Part of the fascination that working with a minimal agent like Pi gave me is that it makes you live that idea of using software that builds more software. That taken to the extreme is when you remove the UI and output and connect it to your chat. 
That’s what OpenClaw does and given its tremendous growth, I really feel more and more that this is going to become our future in one way or another. First of all, it has a tiny core. It has the shortest system prompt of any agent that I’m aware of and it only has four tools: Read, Write, Edit, Bash. The second thing is that it makes up for its tiny core by providing an extension system that also allows extensions to persist state into sessions, which is incredibly powerful. https://x.com/steipete/status/2017313990548865292 ↩
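To give a flavor of the browser-automation skill mentioned earlier, this is roughly the kind of script such a skill ends up driving. It speaks the Chrome DevTools Protocol directly over a WebSocket; the debugging port and the use of Node’s built-in WebSocket (Node 22+) are assumptions for illustration, not what my actual skill contains.

```typescript
// Minimal CDP round-trip: list targets, attach, evaluate JS on the page.
// Assumes Chrome was started with --remote-debugging-port=9222 and Node >= 22
// (for the global WebSocket); purely illustrative, not the skill's real code.

async function evaluateOnFirstPage(expression: string): Promise<unknown> {
  const targets = await (await fetch("http://localhost:9222/json/list")).json();
  const page = targets.find((t: any) => t.type === "page");
  if (!page) throw new Error("no page target found");

  const ws = new WebSocket(page.webSocketDebuggerUrl);
  await new Promise((resolve) => (ws.onopen = resolve));

  const result = new Promise<unknown>((resolve) => {
    ws.onmessage = (event) => {
      const msg = JSON.parse(String(event.data));
      if (msg.id === 1) resolve(msg.result?.result?.value);
    };
  });
  ws.send(JSON.stringify({
    id: 1,
    method: "Runtime.evaluate",
    params: { expression, returnByValue: true },
  }));
  const value = await result;
  ws.close();
  return value;
}

evaluateOnFirstPage("document.title").then(console.log);
```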

Armin Ronacher 2 months ago

Colin and Earendil

Regular readers of this blog will know that I started a new company. We have put out just a tiny bit of information today, and some keen folks have discovered it and reached out by email with many thoughtful responses. It has been delightful. Colin and I met here, in Vienna. We started sharing coffees, ideas, and lunches, and soon found shared values despite coming from different backgrounds and different parts of the world. We are excited about the future, but we are equally vigilant about it. After traveling together a bit, we decided to plunge into the cold water and start a company together. We want to be successful, but we want to do it the right way, and we want to be able to demonstrate that to our kids. Vienna is a city of great history, two million inhabitants, and a fascinating vibe that is nothing like San Francisco. In fact, Vienna is in many ways the polar opposite of Silicon Valley, in mindset, in opportunity, and in approach to life. Colin comes from San Francisco, and though I’m Austrian, my career has been shaped by years of working with California companies and with people from there who used my Open Source software. Vienna is now our shared home. Despite Austria being so far away from California, it is a place of tinkerers and troublemakers. It’s always good to remind oneself that society consists of more than just your little bubble. It also provides the necessary counterbalance for thinking clearly in these times. The world that is emerging in front of our eyes is one of change. We incorporated as a PBC with a founding charter to craft software and open protocols, strengthen human agency, bridge division and ignorance, and cultivate lasting joy and understanding. These are things we believe in deeply. I have dedicated 20 years of my life, in one way or another, to creating Open Source software. In the same way that artificial intelligence calls into question the very nature of my profession and the way we build software, the present-day circumstances are testing society. We’re not immune to these changes, and we’re navigating them like everyone else, with a mixture of excitement and worry. But we share a belief that right now is the time to stand true to one’s values and principles. We want to take an earnest shot at leaving the world a better place than we found it. Rather than reject the changes that are happening, we look to nudge them in the right direction. If you want to follow along you can subscribe to our newsletter, written by humans, not machines.

Armin Ronacher 2 months ago

Agent Psychosis: Are We Going Insane?

You can use Polecats without the Refinery and even without the Witness or Deacon. Just tell the Mayor to shut down the rig and sling work to the polecats with the message that they are to merge to main directly. Or the polecats can submit MRs and then the Mayor can merge them manually. It’s really up to you. The Refineries are useful if you have done a LOT of up-front specification work, and you have huge piles of Beads to churn through with long convoys. — Gas Town Emergency User Manual , Steve Yegge Many of us got hit by the agent coding addiction. It feels good, we barely sleep, we build amazing things. Every once in a while that interaction involves other humans, and all of a sudden we get a reality check that maybe we overdid it. The most obvious example of this is the massive degradation of quality of issue reports and pull requests. As a maintainer many PRs now look like an insult to one’s time, but when one pushes back, the other person does not see what they did wrong. They thought they helped and contributed and get agitated when you close it down. But it’s way worse than that. I see people develop parasocial relationships with their AIs, get heavily addicted to it, and create communities where people reinforce highly unhealthy behavior. How did we get here and what does it do to us? I will preface this post by saying that I don’t want to call anyone out in particular, and I think I sometimes feel tendencies that I see as negative, in myself as well. I too, have thrown some vibeslop up to other people’s repositories. In His Dark Materials, every human has a dæmon, a companion that is an externally visible manifestation of their soul. It lives alongside as an animal, but it talks, thinks and acts independently. I’m starting to relate our relationship with agents that have memory to those little creatures. We become dependent on them, and separation from them is painful and takes away from our new-found identity. We’re relying on these little companions to validate us and to collaborate with. But it’s not a genuine collaboration like between humans, it’s one that is completely driven by us, and the AI is just there for the ride. We can trick it to reinforce our ideas and impulses. And we act through this AI. Some people who have not programmed before, now wield tremendous powers, but all those powers are gone when their subscription hits a rate limit and their little dæmon goes to sleep. Then, when we throw up a PR or issue to someone else, that contribution is the result of this pseudo-collaboration with the machine. When I see an AI pull request come in, or on another repository, I cannot tell how someone created it, but I can usually after a while tell when it was prompted in a way that is fundamentally different from how I do it. Yet it takes me minutes to figure this out. I have seen some coding sessions from others and it’s often done with clarity, but using slang that someone has come up with and most of all: by completely forcing the AI down a path without any real critical thinking. Particularly when you’re not familiar with how the systems are supposed to work, giving in to what the machine says and then thinking one understands what is going on creates some really bizarre outcomes at times. But people create these weird relationships with their AI agent and once you see how some prompt their machines, you realize that it dramatically alters what comes out of it. 
To get good results you need to provide context, you need to make the tradeoffs, you need to use your knowledge. It’s not just a question of using the context badly, it’s also the way in which people interact with the machine. Sometimes it’s unclear instructions, sometimes it’s weird role-playing and slang, sometimes it’s just swearing and forcing the machine, sometimes it’s a weird ritualistic behavior. Some people just really ram the agent straight towards the most narrow of all paths towards a badly defined goal with little concern about the health of the codebase. These dæmon relationships change not just how we work, but what we produce. You can completely give in and let the little dæmon run circles around you. You can reinforce it to run towards ill defined (or even self defined) goals without any supervision. It’s one thing when newcomers fall into this dopamine loop and produce something. When Peter first got me hooked on Claude, I did not sleep. I spent two months excessively prompting the thing and wasting tokens. I ended up building and building and creating a ton of tools I did not end up using much. “You can just do things” was what was on my mind all the time but it took quite a bit longer to realize that just because you can, you might not want to. It became so easy to build something and in comparison it became much harder to actually use it or polish it. Quite a few of the tools I built I felt really great about, just to realize that I did not actually use them or they did not end up working as I thought they would. The thing is that the dopamine hit from working with these agents is so very real. I’ve been there! You feel productive, you feel like everything is amazing, and if you hang out just with people that are into that stuff too, without any checks, you go deeper and deeper into the belief that this all makes perfect sense. You can build entire projects without any real reality check. But it’s decoupled from any external validation. For as long as nobody looks under the hood, you’re good. But when an outsider first pokes at it, it looks pretty crazy. And damn some things look amazing. I too was blown away (and fully expected at the same time) when Cursor’s AI written Web Browser landed. It’s super impressive that agents were able to bootstrap a browser in a week! But holy crap! I hope nobody ever uses that thing or would try to build an actual browser out of it, at least with this generation of agents, it’s still pure slop with little oversight. It’s an impressive research and tech demo, not an approach to building software people should use. At least not yet. There is also another side to this slop loop addiction: token consumption. Consider how many tokens these loops actually consume. A well-prepared session with good tooling and context can be remarkably token-efficient. For instance, the entire port of MiniJinja to Go took only 2.2 million tokens. But the hands-off approaches—spinning up agents and letting them run wild—burn through tokens at staggering rates. Patterns like Ralph are particularly wasteful: you restart the loop from scratch each time, which means you lose the ability to use cached tokens or reuse context. We should also remember that current token pricing is almost certainly subsidized. These patterns may not be economically viable for long. And those discounted coding plans we’re all on? They might not last either. 
And then there are things like Beads and Gas Town , Steve Yegge’s agentic coding tools, which are the complete celebration of slop loops. Beads, which is basically some sort of issue tracker for agents, is 240,000 lines of code that … manages markdown files in GitHub repositories. And the code quality is abysmal. There appears to be some competition in place to run as many of these agents in parallel with almost no quality control in some circles. And to then use agents to try to create documentation artifacts to regain some confidence of what is actually going on. Except those documents themselves read like slop . Looking at Gas Town (and Beads) from the outside, it looks like a Mad Max cult. What are polecats, refineries, mayors, beads, convoys doing in an agentic coding system? If the maintainer is in the loop, and the whole community is in on this mad ride, then everyone and their dæmons just throw more slop up. As an external observer the whole project looks like an insane psychosis or a complete mad art project. Except, it’s real? Or is it not? Apparently a reason for slowdown in Gas Town is contention on figuring out the version of Beads, which takes 7 subprocess spawns . Or using the doctor command times out completely . Beads keeps growing and growing in complexity and people who are using it, are realizing that it’s almost impossible to uninstall . And they might not even work well together even though one apparently depends on the other. I don’t want to pick on Gas Town or these projects, but they are just the most visible examples of this in-group behavior right now. But you can see similar things in some of the AI builder circles on Discord and X where people hype each other up with their creations, without much critical thinking and sanity checking of what happens under the hood. It takes you a minute of prompting and waiting a few minutes for code to come out of it. But actually honestly reviewing a pull request takes many times longer than that. The asymmetry is completely brutal. Shooting up bad code is rude because you completely disregard the time of the maintainer. But everybody else is also creating AI-generated code, but maybe they passed the bar of it being good. So how can you possibly tell as a maintainer when it all looks the same? And as the person writing the issue or the PR, you felt good about it. Yet what you get back is frustration and rejection. I’m not sure how we will go ahead here, but it’s pretty clear that in projects that don’t submit themselves to the slop loop, it’s going to be a nightmare to deal with all the AI-generated noise. Even for projects that are fully AI-generated but are setting some standard for contributions, some folks now prefer actually just getting the prompts over getting the actual code. Because then it’s clearer what the person actually intended. There is more trust in running the agent oneself than having other people do it. Which really makes me wonder: am I missing something here? Is this where we are going? Am I just not ready for this new world? Are we all collectively getting insane? Particularly if you want to opt out of this craziness right now, it’s getting quite hard. Some projects no longer accept human contributions until they have vetted the people completely. Others are starting to require that you submit prompts alongside your code, or just the prompts alone. I am a maintainer who uses AI myself, and I know others who do. We’re not luddites and we’re definitely not anti-AI. 
But we’re also frustrated when we encounter AI slop on issue and pull request trackers. Every day brings more PRs that took someone a minute to generate and take an hour to review. There is a dire need to say no now. But when one does, the contributor is genuinely confused: “Why are you being so negative? I was trying to help.” They were trying to help. Their dæmon told them it was good. Maybe the answer is that we need better tools — better ways to signal quality, better ways to share context, better ways to make the AI’s involvement visible and reviewable. Maybe the culture will self-correct as people hit walls. Maybe this is just the awkward transition phase before we figure out new norms. Or maybe some of us are genuinely losing the plot, and we won’t know which camp we’re in until we look back. All I know is that when I watch someone at 3am, running their tenth parallel agent session, telling me they’ve never been more productive — in that moment I don’t see productivity. I see someone who might need to step away from the machine for a bit. And I wonder how often that someone is me. Two things are both true to me right now: AI agents are amazing and a huge productivity boost. They are also massive slop machines if you turn off your brain and let go completely.

Armin Ronacher 3 months ago

Porting MiniJinja to Go With an Agent

Turns out you can just port things now. I already attempted this experiment in the summer, but it turned out to be a bit too much for what I had time for. However, things have advanced since. Yesterday I ported MiniJinja (a Rust Jinja2 template engine) to native Go, and I used an agent to do pretty much all of the work. In fact, I barely did anything beyond giving some high-level guidance on how I thought it could be accomplished. In total I probably spent around 45 minutes actively with it. It worked for around 3 hours while I was watching, then another 7 hours alone. This post is a recollection of what happened and what I learned from it. All prompting was done by voice using pi, starting with Opus 4.5 and switching to GPT-5.2 Codex for the long tail of test fixing. MiniJinja is a re-implementation of Jinja2 for Rust. I originally wrote it because I wanted to do an infrastructure automation project in Rust and Jinja was popular for that. The original project didn’t go anywhere, but MiniJinja itself continued being useful for both me and other users. The way MiniJinja is tested is with snapshot tests: inputs and expected outputs, using insta to verify they match. These snapshot tests were what I wanted to use to validate the Go port. My initial prompt asked the agent to figure out how to validate the port. Through that conversation, the agent and I aligned on a path: reuse the existing Rust snapshot tests and port incrementally (lexer -> parser -> runtime). This meant the agent built Go-side tooling to:

- Parse Rust’s test input files (which embed settings as JSON headers).
- Parse the reference insta snapshots and compare output.
- Maintain a skip-list to temporarily opt out of failing tests.

This resulted in a pretty good harness with a tight feedback loop. The agent had a clear goal (make everything pass) and a progression (lexer -> parser -> runtime). The tight feedback loop mattered particularly at the end where it was about getting details right. Every missing behavior had one or more failing snapshots. I used Pi’s branching feature to structure the session into phases. I rewound back to earlier parts of the session and used the branch switch feature to inform the agent automatically what it had already done. This is similar to compaction, but Pi shows me what it puts into the context. When Pi switches branches it does two things:

- It stays in the same session so I can navigate around, but it makes a new branch off an earlier message.
- When switching, it adds a summary of what it did as a priming message into where it branched off.

I found this quite helpful to avoid the agent doing vision quests from scratch to figure out how far it had already gotten. Without switching branches, I would probably just make new sessions and have more plan files lying around, or use something like Amp’s handoff feature which also allows the agent to consult earlier conversations if it needs more information. What was interesting is that the agent went from literal porting to behavioral porting quite quickly. I didn’t steer it away from this as long as the behavior aligned. I let it do this for a few reasons. First, the code base isn’t that large, so I felt I could make adjustments at the end if needed. Letting the agent continue with what was already working felt like the right strategy. Second, it was aligning to idiomatic Go much better this way. For instance, on the runtime it implemented a tree-walking interpreter (not a bytecode interpreter like Rust) and it decided to use Go’s reflection for the value type. I didn’t tell it to do either of these things, but they made more sense than replicating my Rust interpreter design, which was partly motivated by not having a garbage collector or runtime type information. On the other hand, the agent made some changes while making tests pass that I disagreed with. It completely gave up on all the “must fail” tests because the error messages were impossible to replicate perfectly given the runtime differences. So I had to steer it towards fuzzy matching instead.
It also wanted to regress behavior I wanted to retain (e.g., exact HTML escaping semantics, or that must return an iterator). I think if I hadn’t steered it there, it might not have made it to completion without going down problematic paths, or I would have lost confidence in the result. Once the major semantic mismatches were fixed, the remaining work was filling in all the missing pieces: missing filters and test functions, loop extras, macros, call blocks, etc. Since I wanted to go to bed, I switched to Codex 5.2 and queued up a few “continue making all tests pass if they are not passing yet” prompts, then let it work through compaction. I felt confident enough that the agent could make the rest of the tests pass without guidance once it had the basics covered. This phase ran without supervision overnight. After functional convergence, I asked the agent to document internal functions and reorganize (like moving filters to a separate file). I also asked it to document all functions and filters like in the Rust code base. This was also when I set up CI, release processes, and talked through what was created to come up with some finalizing touches before merging. There are a few things I find interesting here. First: these types of ports are possible now. I know porting was already possible for many months, but it required much more attention. This changes some dynamics. I feel less like technology choices are constrained by ecosystem lock-in. Sure, porting NumPy to Go would be a more involved undertaking, and getting it competitive even more so (years of optimizations in there). But still, it feels like many more libraries can be used now. Second: for me, the value is shifting from the code to the tests and documentation. A good test suite might actually be worth more than the code. That said, this isn’t an argument for keeping tests secret — generating tests with good coverage is also getting easier. However, for keeping code bases in different languages in sync, you need to agree on shared tests, otherwise divergence is inevitable. Lastly, there’s the social dynamic. Once, having people port your code to other languages was something to take pride in. It was a sign of accomplishment — a project was “cool enough” that someone put time into making it available elsewhere. With agents, it doesn’t invoke the same feelings. Will McGugan also called out this change. Lastly, some boring stats for the main session:

- Agent run duration: 10 hours (3 hours supervised)
- Active human time: ~45 minutes
- Total messages: 2,698
- My prompts: 34
- Tool calls: 1,386
- Raw API token cost: $60
- Total tokens: 2.2 million
- Models: and for the unattended overnight run

This did not count the adding of doc strings and smaller fixups.

- Pi session transcript
- Narrated video of the porting session
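To make the harness idea more tangible, here is a conceptual sketch of the comparison loop. The real tooling is Go and reads insta’s actual snapshot format; the file layout, header convention, and the render function below are all invented for illustration.

```typescript
// Conceptual sketch of a snapshot harness: run each input through the port and
// compare against the reference snapshot, honoring a skip-list. All paths and
// formats here are made up; they do not match the real Go tooling or insta.
import { readFileSync, readdirSync } from "node:fs";

type RenderFn = (template: string, settings: unknown) => string;

export function checkSnapshots(render: RenderFn): number {
  const skip = new Set(
    readFileSync("skiplist.txt", "utf8").split("\n").filter(Boolean),
  );
  let failures = 0;
  for (const name of readdirSync("inputs")) {
    if (skip.has(name)) continue; // temporarily opted-out tests
    const raw = readFileSync(`inputs/${name}`, "utf8");
    const [header, ...rest] = raw.split("\n---\n"); // assumed: JSON header, then template
    const expected = readFileSync(`snapshots/${name}.snap`, "utf8");
    const actual = render(rest.join("\n---\n"), JSON.parse(header));
    if (actual !== expected) {
      failures++;
      console.log(`FAIL ${name}`);
    }
  }
  return failures;
}
```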

Armin Ronacher 3 months ago

A Year Of Vibes

2025 draws to a close and it’s been quite a year. Around this time last year, I wrote a post that reflected on my life . Had I written about programming, it might have aged badly, as 2025 has been a year like no other for my profession. 2025 was the year of changes. Not only did I leave Sentry and start my new company, it was also the year I stopped programming the way I did before. In June I finally felt confident enough to share that my way of working was different: Where I used to spend most of my time in Cursor, I now mostly use Claude Code, almost entirely hands-off. […] If you would have told me even just six months ago that I’d prefer being an engineering lead to a virtual programmer intern over hitting the keys myself, I would not have believed it. While I set out last year wanting to write more, that desire had nothing to do with agentic coding. Yet I published 36 posts — almost 18% of all posts on this blog since 2007. I also had around a hundred conversations with programmers, founders, and others about AI because I was fired up with curiosity after falling into the agent rabbit hole. 2025 was also a not so great year for the world. To make my peace with it, I started a separate blog to separate out my thoughts from here. It started with a growing obsession with Claude Code in April or May, resulting in months of building my own agents and using others’. Social media exploded with opinions on AI: some good, some bad. Now I feel I have found a new stable status quo for how I reason about where we are and where we are going. I’m doubling down on code generation, file systems, programmatic tool invocation via an interpreter glue, and skill-based learning. Basically: what Claude Code innovated is still state of the art for me. That has worked very well over the last few months, and seeing foundation model providers double down on skills reinforces my belief in this approach. I’m still perplexed by how TUIs made such a strong comeback. At the moment I’m using Amp , Claude Code , and Pi , all from the command line. Amp feels like the Apple or Porsche of agentic coding tools, Claude Code is the affordable Volkswagen, and Pi is the Hacker’s Open Source choice for me. They all feel like projects built by people who, like me, use them to an unhealthy degree to build their own products, but with different trade-offs. I continue to be blown away by what LLMs paired with tool execution can do. At the beginning of the year I mostly used them for code generation, but now a big number of my agentic uses are day-to-day things. I’m sure we will see some exciting pushes towards consumer products in 2026. LLMs are now helping me with organizing my life, and I expect that to grow further. Because LLMs now not only help me program, I’m starting to rethink my relationship to those machines. I increasingly find it harder not to create parasocial bonds with some of the tools I use. I find this odd and discomforting. Most agents we use today do not have much of a memory and have little personality but it’s easy to build yourself one that does. An LLM with memory is an experience that is hard to shake off. It’s both fascinating and questionable. I have tried to train myself for two years, to think of these models as mere token tumblers, but that reductive view does not work for me any longer. These systems we now create have human tendencies, but elevating them to a human level would be a mistake. I increasingly take issue with calling these machines “agents,” yet I have no better word for it. 
I take issue with “agent” as a term because agency and responsibility should remain with humans. Whatever they are becoming, they can trigger emotional responses in us that can be detrimental if we are not careful. Our inability to properly name and place these creations in relation to us is a challenge I believe we need to solve. Because of all this unintentional anthropomorphization, I’m really struggling at times to find the right words for how I’m working with these machines. I know that this is not just me; it’s others too. It creates even more discomfort when working with people who currently reject these systems outright. One of the most common comments I read in response to agentic coding tool articles is this rejection of giving the machine personality. An unexpected aspect of using AI so much is that we talk far more about vibes than anything else. This way of working is less than a year old, yet it challenges half a century of software engineering experience. So there are many opinions, and it’s hard to say which will stand the test of time. I found a lot of conventional wisdom I don’t agree with, but I have nothing to back up my opinions. How would I? I quite vocally shared my lack of success with MCP throughout the year, but I had little to back it up beyond “does not work for me.” Others swore by it. Similar with model selection. Peter , who got me hooked on Claude early in the year, moved to Codex and is happy with it. I don’t enjoy that experience nearly as much, though I started using it more. I have nothing beyond vibes to back up my preference for Claude. It’s also important to know that some of the vibes come with intentional signalling. Plenty of people whose views you can find online have a financial interest in one product over another, for instance because they are investors in it or they are paid influencers. They might have become investors because they liked the product, but it’s also possible that their views are affected and shaped by that relationship. Pick up a library from any AI company today and you’ll notice they’re built with Stainless or Fern. The docs use Mintlify, the site’s authentication system might be Clerk. Companies now sell services you would have built yourself previously. This increase in outsourcing of core services to companies specializing in it meant that the bar for some aspects of the user experience has risen. But with our newfound power from agentic coding tools, you can build much of this yourself. I had Claude build me an SDK generator for Python and TypeScript — partly out of curiosity, partly because it felt easy enough. As you might know, I’m a proponent of simple code and building it yourself . This makes me somewhat optimistic that AI has the potential to encourage building on fewer dependencies. At the same time, it’s not clear to me that we’re moving that way given the current trends of outsourcing everything. This brings me not to predictions but to wishes for where we could put our energy next. I don’t really know what I’m looking for here, but I want to point at my pain points and give some context and food for thought. My biggest unexpected finding: we’re hitting limits of traditional tools for sharing code. The pull request model on GitHub doesn’t carry enough information to review AI generated code properly — I wish I could see the prompts that led to changes. It’s not just GitHub, it’s also git that is lacking. With agentic coding, part of what makes the models work today is knowing the mistakes. 
If you steer it back to an earlier state, you want the tool to remember what went wrong. There is, for lack of a better word, value in failures. As humans we might also benefit from knowing the paths that did not lead us anywhere, but for machines this is critical information. You notice this when you are trying to compress the conversation history. Discarding the paths that led you astray means that the model will try the same mistakes again. Some agentic coding tools have begun spinning up worktrees or creating checkpoints in git for restore, in-conversation branch and undo features. There’s room for UX innovation that could make these tools easier to work with. This is probably why we’re seeing discussions about stacked diffs and alternative version control systems like Jujutsu . Will this change GitHub or will it create space for some new competition? I hope so. I increasingly want to better understand genuine human input and tell it apart from machine output. I want to see the prompts and the attempts that failed along the way. And then somehow I want to squash and compress it all on merge, but with a way to retrieve the full history if needed. This is related to the version control piece: current code review tools assign strict role definitions that just don’t work with AI. Take the GitHub code review UI: I regularly want to use comments on the PR view to leave notes for my own agents, but there is no guided way to do that. The review interface refuses to let me review my own code, I can only comment, but that does not have quite the same intention. There is also the problem that an increased amount of code review now happens between me and my agents locally. For instance, the Codex code review feature on GitHub stopped working for me because it can only be bound to one organization at a time. So I now use Codex on the command line to do reviews, but that means a whole part of my iteration cycles is invisible to other engineers on the team. That doesn’t work for me. Code review to me feels like it needs to become part of the VCS. I also believe that observability is up for grabs again. We now have both the need and opportunity to take advantage of it on a whole new level. Most people were not in a position where they could build their own eBPF programs, but LLMs can. Likewise, many observability tools shied away from SQL because of its complexity, but LLMs are better at it than any proprietary query language. They can write queries, they can grep, they can map-reduce, they remote-control LLDB. Anything that has some structure and text is suddenly fertile ground for agentic coding tools to succeed. I don’t know what the observability of the future looks like, but my strong hunch is that we will see plenty of innovation here. The better the feedback loop to the machine, the better the results. I’m not even sure what I’m asking for here, but I think that one of the challenges in the past was that many cool ideas for better observability — specifically dynamic reconfiguration of services for more targeted filtering — were user-unfriendly because they were complex and hard to use. But now those might be the right solutions in light of LLMs because of their increased capabilities for doing this grunt work. For instance Python 3.14 landed an external debugger interface which is an amazing capability for an agentic coding tool. This may be a little more controversial, but what I haven’t managed this year is to give in to the machine. 
I still treat it like regular software engineering and review a lot. I also recognize that an increasing number of people are not working with this model of engineering but instead completely given in to the machine. As crazy as that sounds, I have seen some people be quite successful with this. I don’t yet know how to reason about this, but it is clear to me that even though code is being generated in the end, the way of working in that new world is very different from the world that I’m comfortable with. And my suspicion is that because that world is here to stay, we might need some new social contracts to separate these out. The most obvious version of this is the increased amount of these types of contributions to Open Source projects, which are quite frankly an insult to anyone who is not working in that model. I find reading such pull requests quite rage-inducing. Personally, I’ve tried to attack this problem with contribution guidelines and pull request templates. But this seems a little like a fight against windmills. This might be something where the solution will not come from changing what we’re doing. Instead, it might come from vocal people who are also pro-AI engineering speaking out on what good behavior in an agentic codebase looks like. And it is not just to throw up unreviewed code and then have another person figure the shit out.

Armin Ronacher 4 months ago

Skills vs Dynamic MCP Loadouts

I’ve been moving all my MCPs to skills, including the remaining one I still used: the Sentry MCP 1 . Previously I had already moved entirely away from Playwright to a Playwright skill. In the last month or so there have been discussions about using dynamic tool loadouts to defer loading of tool definitions until later. Anthropic has also been toying around with the idea of wiring together MCP calls via code, something I have experimented with . I want to share my updated findings with all of this and why the deferred tool loading that Anthropic came up with does not fix my lack of love for MCP. Maybe they are useful for someone else. When the agent encounters a tool definition through reinforcement learning or otherwise, it is encouraged to emit tool calls through special tokens when it encounters a situation where that tool call would be appropriate. For all intents and purposes, tool definitions can only appear between special tool definition tokens in a system prompt. Historically this means that you cannot emit tool definitions later in the conversation state. So your only real option is for a tool to be loaded when the conversation starts. In agentic uses, you can of course compress your conversation state or change the tool definitions in the system message at any point. But the consequence is that you will lose the reasoning traces and also the cache. In the case of Anthropic, for instance, this will make your conversation significantly more expensive. You would basically start from scratch and pay full token rates plus cache write cost, compared to cache read. One recent innovation from Anthropic is deferred tool loading. You still declare tools ahead of time in the system message, but they are not injected into the conversation when the initial system message is emitted. Instead they appear at a later point. The tool definitions however still have to be static for the entire conversation, as far as I know. So the tools that could exist are defined when the conversation starts. The way Anthropic discovers the tools is purely by regex search. This is all quite relevant because even though MCP with deferred loading feels like it should perform better, it actually requires quite a bit of engineering on the LLM API side. The skill system gets away without any of that and, at least from my experience, still outperforms it. Skills are really just short summaries of which skills exist and in which file the agent can learn more about them. These are proactively loaded into the context. So the agent understands in the system context (or maybe somewhere later in the context) what capabilities it has and gets a link to the manual for how to use them. Crucially, skills do not actually load a tool definition into the context. The tools remain the same: bash and the other tools the agent already has. All it learns from the skill are tips and tricks for how to use these tools more effectively. Because the main thing it learns is how to use other command line tools and similar utilities, the fundamentals of how to chain and coordinate them together do not actually change. The reinforcement learning that made the Claude family of models very good tool callers just helps with these newly discovered tools. So that obviously raises the question: if skills work so well, can I move the MCP outside of the context entirely and invoke it through the CLI in a similar way as Anthropic proposes? The answer is yes, you can, but it doesn’t work well. One option here is Peter Steinberger’s mcporter . 
In short, it reads the files and exposes the MCPs behind it as callable tools. And yes, it looks very much like a command line tool that the LLM can invoke. The problem, however, is that the LLM does not have any idea what tools are available, and now you need to teach it that. So you might think: why not make some skills that teach the LLM about the MCPs? Here the issue for me comes from the fact that MCP servers have no desire to maintain API stability. They are increasingly starting to trim down tool definitions to the bare minimum to preserve tokens. This makes sense, but for the skill pattern it’s not what you want. For instance, the Sentry MCP server at one point switched the query syntax entirely to natural language. A great improvement for the agent, but my suggestions for how to use it became a hindrance and I did not discover the issue straight away. This is in fact quite similar to Anthropic’s deferred tool loading: there is no information about the tool in the context at all. You need to create a summary. The eager loading of MCP tools we have done in the past has ended up as an awkward compromise: the description is both too long to eagerly load and too short to really tell the agent how to use it. So at least in my experience, you end up maintaining these manual skill summaries for MCP tools exposed via mcporter or similar. This leads me to my current conclusion: I tend to go with what is easiest, which is to ask the agent to write its own tools as a skill. Not only does it not take all that long, but the biggest benefit is that the tool is largely under my control. Whenever it breaks or needs some other functionality, I ask the agent to adjust it. The Sentry MCP is a great example. I think it’s probably one of the better designed MCPs out there, but I don’t use it anymore. In part because when I load it into the context right away I lose around 8k tokens out of the box, and I could not get it to work via mcporter. Instead, I have Claude maintain a skill for me. And yes, that skill is probably quite buggy and needs to be updated, but because the agent maintains it, it works out better. It’s quite likely that all of this will change, but at the moment manually maintained skills and agents writing their own tools have become my preferred way. I suspect that dynamic tool loading with MCP will become a thing, but it will probably require quite some protocol changes to bring in skill-like summaries and built-in manuals for the tools. I also suspect that MCP would greatly benefit from protocol stability. The fact that MCP servers keep changing their tool descriptions at will does not work well with materialized calls and external tool descriptions in READMEs and skill files. Keen readers will remember that last time, the last MCP I used was Playwright. In the meantime I added and removed two more MCPs: Linear and Sentry, mostly because of authentication issues and neither having a great command line interface. ↩
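For comparison, the skill pattern on disk is not much more than a short summary plus a manual. This is roughly what such a file looks like in the skill format as I understand it; the skill name, paths, and instructions below are invented for illustration, not my real Sentry skill.

```markdown
---
name: sentry-triage
description: Look up recent Sentry issues and fetch event details using scripts/sentry.ts. Use when the user asks about production errors.
---

# Sentry triage

- List recent issues: `npx tsx scripts/sentry.ts issues --project <slug>`
- Fetch event details: `npx tsx scripts/sentry.ts events <issue-id>`
- The script reads its auth token from the SENTRY_TOKEN environment variable.
- Queries are natural language; do not guess the old search filter syntax.
```

Only the short description is loaded into the context up front; the body is the manual the agent pulls in when it decides to use the skill, and the script it points at is something the agent wrote and can fix itself.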

Armin Ronacher 4 months ago

Let’s Destroy The European Union!

Elon Musk is not happy with the EU fining his X platform and is currently on a tweet rampage complaining about it. Among other things, he wants the whole EU to be abolished. He sadly is hardly the first wealthy American to share their opinions on European politics lately. I’m not a fan of this outside attention but I believe it’s noteworthy and something to pay attention to. In particular because the idea of destroying and ripping apart the EU is not just popular in the US; it’s popular over here too. Something that greatly concerns me. There is definitely a bunch of stuff we might want to fix over here. I have complained about our culture before. Unfortunately, I happen to think that our challenges are not coming from politicians or civil servants, but from us, the people. Europeans don’t like to take risks and are quite pessimistic about the future compared to their US counterparts. Additionally, we Europeans have been trained to feel a lot of guilt over the years, which makes us hesitant to stand up for ourselves. This has led to all kinds of interesting counter-cultural movements in Europe, like years of significant support for unregulated immigration and an unhealthy obsession with the idea of degrowth. Today, though, neither seems quite as popular as it once was. Morally these things may be defensible, but in practice they have led to Europe losing its competitive edge and eroding social cohesion. The combination of a strong social state and high taxes in particular does not mix well with the kind of immigration we have seen in the last decade: mostly people escaping wars ending up in low-skilled jobs. That means it’s not unlikely that certain classes of immigrants are going to be net-negative for a very long time, if not forever, and increasingly society is starting to think about what the implications of that might be. Yet even all of that is not where our problems lie, and it’s certainly not our presumed lack of free speech. Any conversation on that topic is foolish because it’s too nuanced. Society clearly wants to place some limits to free speech here, but the same is true in the US. In the US we can currently see a significant push-back against “woke ideologies,” and a lot of that push-back involves restricting freedom of expression through different avenues. The US might try to lecture Europe right now on free speech, but what it should be lecturing us on is our economic model. Europe has too much fragmentation, incredibly strict regulation that harms innovation, ineffective capital markets, and a massive dependency on both the United States and China. If the US were to cut us off from their cloud providers, we would not be able to operate anything over here. If China were to stop shipping us chips, we would be in deep trouble too ( we have seen this ). This is painful because the US is historically a great example when it comes to freedom of information, direct democracy at the state level, and rather low corruption. These are all areas where we’re not faring well, at least not consistently, and we should be lectured. Fundamentally, the US approach to capitalism is about as good as it’s going to get. If there was any doubt that alternative approaches might have worked out better, at this point there’s very little evidence in favor of that. Yet because of increased loss of civil liberties in the US, many Europeans now see everything that the US is doing as bad. A grave mistake. 
Both China and the US are quite happy with the dependency we have on them and with us falling short of our potential. Europe’s attempt at dealing with the dependency so far has been to regulate and tax US corporations more heavily. That’s not a good strategy. The solution must be to become competitive again so that we can redirect that tax revenue to local companies instead. The Digital Services Act is a good example: we’re punishing Apple and forcing them to open up their platform, but we have no company that can take advantage of that opening. If you read my blog here, you might remember my musings about the lack of clarity of what a foreigner is in Europe. The reality is that Europe has been deeply integrated for a long time now as a result of how the EU works — but still not at the same level as the US. I think this is still the biggest problem. People point to languages as the challenge, but underneath the hood, the countries are still fighting each other. Austria wants to protect its local stores from larger competition in Germany and its carpenters from the cheaper ones coming from Slovenia. You can replace Austria with any other EU country and you will find the same thing. The EU might not be perfect, but it’s hard to imagine that abolishing it would solve any problem given how national states have shown to behave. The moment the EU fell away, we would be warming up all border struggles again. We have already seen similar issues pop up in Northern Ireland after the UK left. And we just have so much bureaucracy, so many non-functioning social systems, and such a tremendous amount of incoming governmental debt to support our flailing pension schemes. We need growth more than any other bloc, and we have such a low probability of actually accomplishing that. Given how the EU is structured, it’s also acting as the punching bag for the failure of the nation states to come to agreements. It’s not that EU bureaucrats are telling Europeans to take in immigrants, to enact chat control or to enact cookie banners or attached plastic caps. Those are all initiatives that come from one or more member states. But the EU in the end will always take the blame because even local politicians that voted in support of some of these things can easily point towards “Brussels” as having created a problem. A Europe in pieces does not sound appealing to me at all, and that’s because I can look at what China and the US have. What China and the US have that Europe lacks is a strong national identity. Both countries have recognized that strength comes from unity. China in particular is fighting any kind of regionalism tooth and nail. The US has accomplished this through the pledge of allegiance, a civil war, the Department of Education pushing a common narrative in schools, and historically putting post offices and infrastructure everywhere. Europe has none of that. More importantly, Europeans don’t even want it. There is a mistaken belief that we can just become these tiny states again and be fine. If Europe wants to be competitive, it seems unlikely that this can be accomplished without becoming a unified superpower. Yet there is no belief in Europe that this can or should happen, and the other superpowers have little interest in seeing it happen either. If I had to propose something constructive, it would be this: Europe needs to stop pretending it can be 27 different countries with 27 different economic policies while also being a single market. The half-measures are killing us. 
We have a common currency in the Eurozone but no common fiscal policy. We have freedom of movement but wildly different social systems. We have common regulations but fragmented enforcement. 27 labor laws, 27 different legal systems, tax codes, complex VAT rules and so on. The Draghi report from last year laid out many of these issues quite clearly: Europe needs massive investment in technology and infrastructure. It needs a genuine single market for services, not just goods. It needs capital markets that can actually fund startups at scale. None of this is news to anyone paying attention. But here’s the uncomfortable truth: none of this will happen without Europeans accepting that more integration is the answer, not less. And right now, the political momentum is in the opposite direction. Every country wants the benefits of the EU without the obligations. Every country wants to protect its own industries while accessing everyone else’s markets. One of the arguments against deeper integration is that Europe hinges on some quite unrelated issues. For instance, the EU is seen as non-democratic, but some of the criticism just does not sit right with me. Sure, I too would welcome more democracy in the EU, but at the same time, the system really is not undemocratic today. Take things like chat control: the reason this thing does not die, is because some member states and their elected representatives are pushing for it. What stands in the way is that the member countries and their people don’t actually want to strengthen the EU further. The “lack of democracy” is very much intentional and the exact outcome you get if you want to keep the power with the national states. So back to where we started: should the EU be abolished as Musk suggests? I think this is a profoundly unserious proposal from someone who has little understanding of European history and even less interest in learning. The EU exists because two world wars taught Europeans that nationalism without checks leads to catastrophe. It exists because small countries recognized they have more leverage negotiating as a bloc than individually. I also take a lot of issue with the idea that European politics should be driven by foreign interests. Neither Russians nor Americans have any good reason for why they should be having so much interest in European politics. They are not living here; we are. Would Europe be more “free” without the EU? Perhaps in some narrow regulatory sense. But it would also be weaker, more divided, and more susceptible to manipulation by larger powers — including the United States. I also find it somewhat rich that American tech billionaires are calling for the dissolution of the EU while they are greatly benefiting from the open market it provides. Their companies extract enormous value from the European market, more than even local companies are able to. The real question isn’t whether Europe should have less regulation or more freedom. It’s whether we Europeans can find the political will to actually complete the project we started. A genuine federation with real fiscal transfers, a common defense policy, and a unified foreign policy would be a superpower. What we have now is a compromise that satisfies nobody and leaves us vulnerable to exactly the kind of pressure Musk and other oligarchs represent. Europe doesn’t need fixing in the way the loud present-day critics suggest. It doesn’t need to become more like America or abandon its social model entirely. What it needs is to decide what it actually wants to be. 
The current state of perpetual ambiguity is unsustainable. It also should not lose its values. Europeans might no longer be quite as hot on the human rights that the EU provides, and they might no longer want to have the same level of immigration. Yet simultaneously, Europeans are presented with a reality that needs all of these things. We’re all highly dependent on movement of labour, and that includes people from abroad. Unfortunately, the wars of the last decade have dominated any migration discourse, and that has created ground for populists to thrive. Any skilled tech migrant is running into the same walls as everyone else, which has made it less and less appealing to come. Or perhaps we’ll continue muddling through, which historically has been Europe’s preferred approach. It’s not inspiring, but it’s also not going to be the catastrophe the internet would have you believe either. Is there reason to be optimistic? On a long enough timeline the graph goes up and to the right. We might be going through some rough patches, but structurally the whole thing here is still pretty solid. And it’s not as if the rest of the world is cruising along smoothly: the US, China, and Russia are each dealing with their own crises. That shouldn’t serve as an excuse, but it does offer context. As bleak as things can feel, we’re not alone in having challenges, but ours are uniquely ours and we will face them. One way or another.

Armin Ronacher 4 months ago

LLM APIs are a Synchronization Problem

The more I work with large language models through provider-exposed APIs, the more I feel like we have built ourselves into quite an unfortunate API surface area. It might not actually be the right abstraction for what’s happening under the hood. The way I like to think about this problem now is that it’s actually a distributed state synchronization problem. At its core, a large language model takes text, tokenizes it into numbers, and feeds those tokens through a stack of matrix multiplications and attention layers on the GPU. Using a large set of fixed weights, it produces activations and predicts the next token. If it weren’t for temperature (randomization), you could think of it having the potential of being a much more deterministic system, at least in principle. As far as the core model is concerned, there’s no magical distinction between “user text” and “assistant text”—everything is just tokens. The only difference comes from special tokens and formatting that encode roles (system, user, assistant, tool), injected into the stream via the prompt template. You can look at the system prompt templates on Ollama for the different models to get an idea. Let’s ignore for a second which APIs already exist and just think about what usually happens in an agentic system. If I were to have my LLM run locally on the same machine, there is still state to be maintained, but that state is very local to me. You’d maintain the conversation history as tokens in RAM, and the model would keep a derived “working state” on the GPU—mainly the attention key/value cache built from those tokens. The weights themselves stay fixed; what changes per step are the activations and the KV cache. From a mental-model perspective, caching means “remember the computation you already did for a given prefix so you don’t have to redo it.” Internally, that usually means storing the attention KV cache for those prefix tokens on the server and letting you reuse it, not literally handing you raw GPU state. There are probably some subtleties to this that I’m missing, but I think this is a pretty good model to think about it. The moment you’re working with completion-style APIs such as OpenAI’s or Anthropic’s, abstractions are put in place that make things a little different from this very simple system. The first difference is that you’re not actually sending raw tokens around. The way the GPU looks at the conversation history and the way you look at it are on fundamentally different levels of abstraction. While you could count and manipulate tokens on one side of the equation, extra tokens are being injected into the stream that you can’t see. Some of those tokens come from converting the JSON message representation into the underlying input tokens fed into the machine. But you also have things like tool definitions, which are injected into the conversation in proprietary ways. Then there’s out-of-band information such as cache points. And beyond that, there are tokens you will never see. For instance, with reasoning models you often don’t see any real reasoning tokens, because some LLM providers try to hide as much as possible so that you can’t retrain your own models with their reasoning state. On the other hand, they might give you some other informational text so that you have something to show to the user. Model providers also love to hide search results and how those results were injected into the token stream. Instead, you only get an encrypted blob back that you need to send back to continue the conversation. 
All of a sudden, you need to take some information on your side and funnel it back to the server so that state can be reconciled on either end. In completion-style APIs, each new turn requires resending the entire prompt history. The size of each individual request grows linearly with the number of turns, but the cumulative amount of data sent over a long conversation grows quadratically because each linear-sized history is retransmitted at every step. This is one of the reasons long chat sessions feel increasingly expensive. On the server, the model’s attention cost over that sequence also grows quadratically in sequence length, which is why caching starts to matter. One of the ways OpenAI tried to address this problem was to introduce the Responses API, which maintains the conversational history on the server (at least in the version with the saved state flag). But now you’re in a bizarre situation where you’re fully dealing with state synchronization: there’s hidden state on the server and state on your side, but the API gives you very limited synchronization capabilities. To this point, it remains unclear to me how long you can actually continue that conversation. It’s also unclear what happens if there is state divergence or corruption. I’ve seen the Responses API get stuck in ways where I couldn’t recover it. It’s also unclear what happens if there’s a network partition, or if one side got the state update but the other didn’t. The Responses API with saved state is quite a bit harder to use, at least as it’s currently exposed. Obviously, for OpenAI it’s great because it allows them to hide more behind-the-scenes state that would otherwise have to be funneled through with every conversation message. Regardless of whether you’re using a completion-style API or the Responses API, the provider always has to inject additional context behind the scenes—prompt templates, role markers, system/tool definitions, sometimes even provider-side tool outputs—that never appears in your visible message list. Different providers handle this hidden context in different ways, and there’s no common standard for how it’s represented or synchronized. The underlying reality is much simpler than the message-based abstractions make it look: if you run an open-weights model yourself, you can drive it directly with token sequences and design APIs that are far cleaner than the JSON-message interfaces we’ve standardized around. The complexity gets even worse when you go through intermediaries like OpenRouter or SDKs like the Vercel AI SDK, which try to mask provider-specific differences but can’t fully unify the hidden state each provider maintains. In practice, the hardest part of unifying LLM APIs isn’t the user-visible messages—it’s that each provider manages its own partially hidden state in incompatible ways. It really comes down to how you pass this hidden state around in one form or another. I understand that from a model provider’s perspective, it’s nice to be able to hide things from the user. But synchronizing hidden state is tricky, and none of these APIs have been built with that mindset, as far as I can tell. Maybe it’s time to start thinking about what a state synchronization API would look like, rather than a message-based API. The more I work with these agents, the more I feel like I don’t actually need a unified message API. The core idea of it being message-based in its current form is itself an abstraction that might not survive the passage of time. 
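The cost growth is easy to see with a little arithmetic; here is a quick sketch, assuming for simplicity that every turn adds a constant number of tokens:

```typescript
// Cumulative tokens retransmitted over n turns when every request resends the
// full history: per-request size grows linearly, the total grows quadratically.
function cumulativeTokensSent(turns: number, tokensPerTurn: number): number {
  let total = 0;
  let history = 0;
  for (let i = 0; i < turns; i++) {
    history += tokensPerTurn; // the prompt now contains every previous turn
    total += history;         // and the whole thing goes over the wire again
  }
  return total; // = tokensPerTurn * turns * (turns + 1) / 2
}

console.log(cumulativeTokensSent(10, 1_000));  // 55,000 tokens sent in total
console.log(cumulativeTokensSent(100, 1_000)); // 5,050,000 -- 10x the turns, ~92x the data
```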
There’s a whole ecosystem that has dealt with this kind of mess before: the local-first movement. Those folks spent a decade figuring out how to synchronize distributed state across clients and servers that don’t trust each other, drop offline, fork, merge, and heal. Peer-to-peer sync and conflict-free replicated storage engines exist because “shared state but with gaps and divergence” is a hard problem that nobody could solve with naive message passing. Their architectures explicitly separate canonical state, derived state, and transport mechanics — exactly the kind of separation missing from most LLM APIs today.

Some of those ideas map surprisingly well to models: KV caches resemble derived state that could be checkpointed and resumed; prompt history is effectively an append-only log that could be synced incrementally instead of resent wholesale; provider-side invisible context behaves like a replicated document with hidden fields. At the same time, if the remote state gets wiped because the remote side doesn’t want to hold it for that long, we would want to be able to replay it entirely from scratch—which, for instance, the Responses API today does not allow.

There’s been plenty of talk about unifying message-based APIs, especially in the wake of MCP (Model Context Protocol). But if we ever standardize anything, it should start from how these models actually behave, not from the surface conventions we’ve inherited. A good standard would acknowledge hidden state, synchronization boundaries, replay semantics, and failure modes — because those are real issues. There is always the risk that we rush to formalize the current abstractions and lock in their weaknesses and faults. I don’t know what the right abstraction looks like, but I’m increasingly doubtful that the status-quo solutions are the right fit.
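Just to make the direction of thinking concrete, here is a purely hypothetical sketch of what a sync-oriented surface could look like. None of this exists; the names are invented and it glosses over plenty, but it treats the canonical log, derived state, and replay as explicit concepts rather than accidents.

```ts
// A thought experiment, not a real or proposed provider API.
type LogEntry =
  | { kind: "tokens"; role: "user" | "tool"; text: string }
  | { kind: "hidden"; blob: string }; // provider-injected context, opaque but addressable

interface SyncCheckpoint {
  logOffset: number;     // how far into the append-only prompt log this covers
  opaqueState?: string;  // provider-side hidden state (reasoning, tool context, ...)
  expiresAt?: string;    // how long the server will keep derived state (KV cache etc.)
}

interface ConversationSync {
  // Append new entries to the canonical log instead of resending everything.
  append(entries: LogEntry[]): Promise<SyncCheckpoint>;
  // Resume from a checkpoint; if the server lost its derived state it replays the log.
  resume(checkpoint: SyncCheckpoint): Promise<void>;
  // Explicit replay semantics for divergence or corruption instead of silent failure.
  replayFrom(logOffset: number): Promise<SyncCheckpoint>;
}
```

Whether something like this is practical is an open question, but it at least names the failure modes that today’s APIs leave implicit.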

Armin Ronacher 5 months ago

Absurd Workflows: Durable Execution With Just Postgres

It’s probably no surprise to you that we’re building agents somewhere. Everybody does it. Building a good agent, however, brings back some of the historic challenges involving durable execution. Entirely unsurprisingly, a lot of people are now building durable execution systems. Many of these, however, are incredibly complex and require you to sign up for another third-party service. I generally try to avoid bringing in extra complexity if I can avoid it, so I wanted to see how far I can go with just Postgres. To this end, I wrote Absurd 1 , a tiny SQL-only library with a very thin SDK to enable durable workflows on top of just Postgres — no extension needed.

Durable execution (or durable workflows) is a way to run long-lived, reliable functions that can survive crashes, restarts, and network failures without losing state or duplicating work. Durable execution can be thought of as the combination of a queue system and a state store that remembers the most recently seen execution state. Because Postgres is excellent at queues, you can use it for the queue (e.g., with pgmq). And because it’s a database, you can also use it to store the state.

The state is important. With durable execution, instead of running your logic in memory, the goal is to decompose a task into smaller pieces (step functions) and record every step and decision. When the process stops (whether it fails, intentionally suspends, or a machine dies) the engine can replay those events to restore the exact state and continue where it left off, as if nothing happened.

Absurd at its core is a single SQL file which needs to be applied to a database of your choice. That SQL file’s goal is to move the complexity of SDKs into the database. SDKs then make the system convenient by abstracting the low-level operations in a way that leverages the ergonomics of the language you are working with.

The system is very simple: A task dispatches onto a given queue from where a worker picks it up to work on. Tasks are subdivided into steps , which are executed in sequence by the worker. Tasks can be suspended or fail, and when that happens, they execute again (a run ). The result of a step is stored in the database (a checkpoint ). To avoid repeating work, checkpoints are automatically loaded from the state storage in Postgres again. Additionally, tasks can sleep or suspend for events and wait until they are emitted. Events are cached, which means they are race-free.

What is the relationship of agents with workflows? Normally, workflows are DAGs defined by a human ahead of time. AI agents, on the other hand, define their own adventure as they go. That means they are basically a workflow with mostly a single step that iterates over changing state until it determines that it has completed. Absurd enables this by automatically counting up steps if they are repeated: the result is a single task with just a single step. The return value of the step is the changed state, and the current state is passed in as an argument. Every time the step function is executed, the data is looked up first from the checkpoint store. Checkpoints for repeated steps are numbered sequentially, so the first iteration gets one checkpoint, the second the next, and so on. Each state only stores the new messages it generated, not the entire message history. If a step fails, the task fails and will be retried. And because of checkpoint storage, if you crash in step 5, the first 4 steps will be loaded automatically from the store. Steps are never retried, only tasks.
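To make the shape of this concrete, here is a hypothetical sketch. The import and the method names below (task, step, enqueue, sleep, waitForEvent, emitEvent) are illustrative stand-ins rather than Absurd’s actual SDK surface; the repository has the real API.

```ts
// Hypothetical sketch: names are illustrative, not Absurd's real SDK surface.
// The shape is the point: one task, one step that repeats, every iteration
// checkpointed in Postgres so a re-run resumes instead of redoing work.
import { Absurd } from "absurd"; // hypothetical import
declare function runOneAgentIteration(state: any): Promise<any>; // your agent logic

const absurd = new Absurd(process.env.DATABASE_URL!);

const agentLoop = absurd.task("agent-loop", async (params: any, ctx: any) => {
  let state: any = { messages: [{ role: "user", content: params.prompt }], done: false };
  while (!state.done) {
    // Repeated steps are counted up automatically; a crashed run reloads
    // finished iterations from the checkpoint store instead of re-executing.
    state = await ctx.step("iterate", () => runOneAgentIteration(state));
  }
  return state;
});

// Kicking it off is just enqueueing it onto a queue that a worker polls.
await agentLoop.enqueue({ prompt: "Triage the new support tickets" });

// Suspension looks similar: sleep for a while, or wait for an event that
// someone else emits later (both survive worker restarts).
// await ctx.sleep({ days: 7 });
// const approval = await ctx.waitForEvent("approved");
// await absurd.emitEvent(taskId, "approved", { by: "armin" });
```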
How do you kick it off? You simply enqueue the task, as in the sketch above. And like Temporal and other solutions, you can yield if you want: if you want to come back to a problem in 7 days, the task can sleep for that long, or it can wait for an event that someone else emits later. Really, that’s it. There is really not much to it. It’s just a queue and a state store — that’s all you need. There is no compiler plugin and no separate service or whole runtime integration. Just Postgres.

That’s not to throw shade on these other solutions; they are great. But not every problem necessarily needs to scale to that level of complexity, and you can get quite far with much less. Particularly if you want to build software that other people should be able to self-host, that might be quite appealing.

It’s named Absurd because durable workflows are absurdly simple, but have been overcomplicated in recent years. ↩

Armin Ronacher 5 months ago

Regulation Isn’t the European Trap — Resignation Is

Plenty has been written about how hard it is to build in Europe versus the US. The list is always the same, with little progress: brittle politics, dense bureaucracy, mandatory notaries, endless and rigid KYC and AML processes. Fine. I know, you know. I’m not here to add another complaint to the pile (but if we meet over a beer or coffee, I’m happy to unload a lot of hilarious anecdotes on you). The unfortunate reality is that most of these constraints won’t change in my lifetime and maybe ever. Europe is not culturally aligned with entrepreneurship, it’s opposed to the idea of employee equity, and our laws reflect that.

What bothers me isn’t the rules — it’s the posture that develops from it in people who should know better. Across the system, everyone points at someone else. If a process takes 10 steps, you’ll find 10 people who feel absolved of responsibility because they can cite 9 other blockers. Friction becomes a moral license to do a mediocre job (while lamenting about it). The vibe is: “Because the system is slow, I can be slow. Because there are rules, I don’t need judgment. Because there’s risk, I don’t need initiative.” And then we all nod along and nothing moves.

There are excellent people here; I’ve worked with them. But they are fighting upstream against a default of low agency. When the process is bad, too many people collapse into it. Communication narrows to the shortest possible message. Friday after 2pm, the notary won’t reply — and the notary surely will blame labor costs or regulation for why service ends there. The bank will cite compliance for why they don’t need to do anything. The registrar will point at some law that allows them to demand a translation of a document by a court-appointed translator. Everyone has a reason. No one owns the outcome.

Meanwhile, in the US, our counsel replies when it matters, even after hours. Bankers answer the same day. The instinct is to enable progress, not enumerate reasons you can’t have it. The goal is the outcome and the rules are constraints to navigate, not a shield to hide behind.

So what’s the point? I can’t fix politics. What I can do: act with agency, and surround myself with people who do the same and speak in support of it. Work with those who start from “how do we make this work?” not “why this can’t work.” Name the absurdities without using them as cover. Be transparent, move anyway and tell people. Nothing stops a notary from designing an onboarding flow that gets an Austrian company set up in five days — standardized KYC packets, templated resolutions, scheduled signing slots, clear checklists, async updates, a bias for same-day feedback. That could exist right now. It rarely does or falls short.

Yes, much in Europe is objectively worse for builders. We have to accept it. Then squeeze everything you can from what is in your control: Select for agency. Choose partners who answer promptly when it’s material and who don’t confuse process with progress.

The trap is not only regulation. It’s the learned helplessness it breeds. If we let friction set our standards, we become the friction. We won’t legislate our way to a US-style environment anytime soon. But we don’t need permission to be better operators inside a bad one. That’s the contrast and it’s the part we control.

Postscript: Comparing Europe to the US triggers people and I’m conscious of that. Maturity is holding two truths at once: they do some things right and some things wrong and so do we. You don’t win by talking others down or praying for their failure.
I’d rather see both Europe and the US succeed than celebrate Europe failing slightly less. And no, saying I feel gratitude and happiness when I get a midnight reply doesn’t make me anti-work-life balance ( I am not ). It means when something is truly time-critical, fast, clear action lifts everyone. The times someone sent a document in minutes, late at night, both sides felt good about it when it mattered. Responsiveness, used with judgment, is not exploitation; it’s respect for outcomes and the relationships we form.

Own the handoff. When you’re step 3 of 10, behave like step 10 depends on you and behave like you control all 10 steps. Anticipate blockers further down the line. Move same day. Eliminate ambiguity. Close loops. Default to clarity. Send checklists. Preempt the next two questions. Reduce the number of touches. Model urgency without theatrics. Be calm, fast, and precise. Don’t make your customer chase you.

Use judgment. Rules exist and we can’t break them all. But we can work with them and be guided by them.

Armin Ronacher 6 months ago

Building an Agent That Leverages Throwaway Code

In August I wrote about my experiments with replacing MCP ( Model Context Protocol ) with code. In the time since, I utilized that idea for exploring non-coding agents at Earendil . And I’m not alone! In the meantime, multiple people have explored this space and I felt it was worth sharing some updated findings. The general idea is pretty simple. Agents are very good at writing code, so why don’t we let them write throw-away code to solve problems that are not related to code at all? I want to show you how and what I’m doing to give you some ideas of what works and why this is much simpler than you might think.

The first thing you have to realize is that Pyodide is secretly becoming a pretty big deal for a lot of agentic interactions. What is Pyodide? Pyodide is an open source project that makes a standard Python interpreter available via a WebAssembly runtime. What is neat about it is that it has an installer called micropip that allows it to install dependencies from PyPI. It also targets the emscripten runtime environment, which means there is a pretty good standard Unix setup around the interpreter that you can interact with.

Getting Pyodide to run is shockingly simple if you have a Node environment. You can directly install it from npm. What makes this so cool is that you can also interact with the virtual file system, which allows you to create a persistent runtime environment that interacts with the outside world. You can also get hosted Pyodide at this point from a whole bunch of startups, but you can actually get this running on your own machine and infrastructure very easily if you want to. The way I found this to work best is if you banish Pyodide into a web worker. This allows you to interrupt it in case it runs into time limits. A big reason why Pyodide is such a powerful runtime is because Python has an amazing ecosystem of well established libraries that the models know about. From manipulating PDFs or word documents, to creating images, it’s all there.

Another vital ingredient to a code interpreter is having a file system. Not just any file system though. I like to set up a virtual file system that I intercept so that I can provide it with access to remote resources from specific file system locations. For instance, you can have a folder on the file system that exposes files which are just resources that come from your own backend API. If the agent then chooses to read from those files, you can from outside the sandbox make a safe HTTP request to bring that resource into play. The sandbox itself does not have network access, so it’s only the file system that gates access to resources. The reason the file system is so good is that agents just know so much about how they work, and you can provide safe access to resources through some external system outside of the sandbox. You can provide read-only access to some resources and write access to others, then access the created artifacts from the outside again.

Now actually doing that is a tad tricky because the emscripten file system is sync, and most of the interesting things you can do are async. The option that I ended up going with is to move the fetch-like async logic into another web worker and block on it from the file system hook. If your entire Pyodide runtime is in a web worker, that’s not as bad as it looks. That said, I wish the emscripten file system API was changed to support stack switching instead of this.
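Here is a minimal sketch of the basic setup, assuming the pyodide npm package in a Node environment and ignoring the web-worker and async plumbing for a moment. The paths and the resource file are made up; in the real setup, reads on such paths would be intercepted and answered from outside the sandbox.

```ts
// Minimal sketch assuming the "pyodide" npm package in Node. In a real agent
// this would run inside a web worker so long-running code can be interrupted.
import { loadPyodide } from "pyodide";

async function runThrowawayCode(pythonCode: string): Promise<string> {
  const pyodide = await loadPyodide();

  // Expose a "resource" to the sandbox via the virtual file system. Here it is
  // just a pre-populated file; in the full setup reads on this path would be
  // intercepted and answered from the backend outside the sandbox.
  pyodide.FS.mkdir("/resources");
  pyodide.FS.writeFile(
    "/resources/tickets.json",
    JSON.stringify([{ id: 1, subject: "Hello" }])
  );

  // micropip lets the interpreter pull pure-Python dependencies from PyPI.
  await pyodide.loadPackage("micropip");

  // The sandbox has no network access; only the file system gates resources.
  return await pyodide.runPythonAsync(pythonCode);
}

const result = await runThrowawayCode(`
import json
data = json.load(open("/resources/tickets.json"))
f"loaded {len(data)} ticket(s)"
`);
console.log(result);
```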
While it’s now possible to hide async promises behind sync abstractions within Pyodide with call_sync , the same approach does not work for the emscripten JavaScript FS API. I have a full example of this at the end, but the simplified shape I ended up with is a synchronous file-read hook that hands the request to a second worker, which runs the async fetch, and blocks until that worker signals that the data is ready.

Lastly, now that you have agents running, you really need durable execution. I would describe durable execution as the idea of being able to retry a complex workflow safely without losing progress. The reason for this is that agents can take a very long time, and if they interrupt, you want to bring them back to the state they were in. This has become a pretty hot topic. There are a lot of startups in that space and you can buy yourself a tool off the shelf if you want to. What is a little bit disappointing is that there is no truly simple durable execution system. By that I mean something that just runs on top of Postgres and/or Redis in the same way as, for instance, there is pgmq. The easiest way to shoehorn this yourself is to use queues to restart your tasks and to cache away the temporary steps from your execution. Basically, you compose your task from multiple steps, each step has a very simple cache key, and before running a step you check whether a cached result for that key already exists (a tiny sketch of this appears at the end of this post). It’s really just that simple. You can improve on this greatly, but this is the general idea. The state is basically the conversation log and whatever else you need to keep around for the tool execution (e.g., whatever was thrown on the file system).

What tools does an agent need that are not code? Well, the code needs to be able to do something interesting so you need to give it access to something. The most interesting access you can provide is via the file system, as mentioned. But there are also other tools you might want to expose. What Cloudflare proposed is connecting to MCP servers and exposing their tools to the code interpreter. I think this is a quite interesting approach and to some degree it’s probably where you want to go. Some tools that I find interesting: One is a tool that just lets the agent run more inference, mostly with files that the code interpreter generated. For instance if you have a zip file it’s quite fun to see the code interpreter use Python to unpack it. But if then that unpacked file is a jpg, you will need to go back to inference to understand it. Another is a manual tool that just … brings up help. Again, can be with inference for basic RAG, or similar. I found it quite interesting to let the AI ask it for help. For example, you want the manual tool to allow a query like “Which Python code should I write to create a chart for the given XLSX file?” On the other hand, you can also just stash away some instructions in .md files on the virtual file system and have the code interpreter read it. It’s all an option.

If you want to see what this roughly looks like, I vibe-coded a simple version of this together. It uses a made-up example but it does show how a sandbox with very little tool availability can create surprising results: mitsuhiko/mini-agent . When you run it, it looks up the current IP from a special network drive that triggers an async fetch, and then it (usually) uses pillow or matplotlib to make an image of that IP address. Pretty pointless, but a lot of fun! The same approach has also been leveraged by Anthropic and Cloudflare. There is some further reading that might give you more ideas:
Claude Skills is fully leveraging code generation for working with documents or other interesting things. It comes with a (non Open Source) repository of example skills that the LLM and code executor can use: anthropics/skills . Cloudflare’s Code Mode is the idea of creating TypeScript bindings for MCP tools and having the agent write code to use them in a sandbox.
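Since the step-caching idea above is easy to miss in prose, here is a minimal sketch of it, assuming a generic key/value store. The store interface and names are made up for illustration.

```ts
// Minimal sketch of caching away temporary steps so a retried task skips
// finished work. The store interface and names are made up for illustration.
type Store = {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
};

// Wrap each step so that re-running the task after a crash reuses results.
async function step<T>(
  store: Store,
  taskId: string,
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const key = `${taskId}:${name}`;
  const cached = await store.get(key);
  if (cached !== null) return JSON.parse(cached) as T; // done in a previous run
  const result = await fn();
  await store.set(key, JSON.stringify(result)); // checkpoint before moving on
  return result;
}

// Usage: every step of the agent task gets a stable name (or counter), e.g.
// const plan = await step(store, taskId, "plan#1", () => callModel(history));
```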

Armin Ronacher 6 months ago

90%

“I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code” — Dario Amodei

Three months ago I said that AI changes everything. I came to that after plenty of skepticism. There are still good reasons to doubt that AI will write all code, but my current reality is close. For the infrastructure component I started at my new company, I’m probably north of 90% AI-written code. I don’t want to convince you — just share what I learned. In part because I approached this project differently from my first experiments with AI-assisted coding.

The service is written in Go with few dependencies and an OpenAPI-compatible REST API. At its core, it sends and receives emails. I also generated SDKs for Python and TypeScript with a custom SDK generator. In total: about 40,000 lines, including Go, YAML, Pulumi, and some custom SDK glue. I set a high bar, especially around being able to operate it reliably. I’ve run similar systems before and knew what I wanted.

Some startups are already near 100% AI-generated. I know, because many build in the open and you can see their code. Whether that works long-term remains to be seen. I still treat every line as my responsibility, judged as if I wrote it myself. AI doesn’t change that. There are no weird files that shouldn’t belong there, no duplicate implementations, and no emojis all over the place. The comments still follow the style I want and, crucially, often aren’t there. I pay close attention to the fundamentals of system architecture, code layout, and database interaction. I’m incredibly opinionated. As a result, there are certain things I don’t let the AI do. I know it won’t reach the point where I could sign off on a commit. That’s why it’s not 100%.

As contrast: another quick prototype we built is a mess of unclear database tables, markdown file clutter in the repo, and boatloads of unwanted emojis. It served its purpose — validate an idea — but wasn’t built to last, and we had no expectation to that end.

I began in the traditional way: system design, schema, architecture. At this stage I don’t let the AI write code, but I loop it in as a kind of rubber duck. The back-and-forth helps me see mistakes, even if I don’t need or trust the answers. I did get the foundation wrong once. I initially argued myself into a more complex setup than I wanted. That’s a part where I later used the LLM to redo a larger part early and clean it up.

For AI-generated or AI-supported code, I now end up with a stack that looks like something I often wanted, but was too hard to do by hand:

Raw SQL: This is probably the biggest change to how I used to write code. I really like using an ORM, but I don’t like some of its effects. In particular, once you approach the ORM’s limits, you’re forced to switch to handwritten SQL. That mapping is often tedious because you lose some of the powers the ORM gives you. Another consequence is that it’s very hard to find the underlying queries, which makes debugging harder. Seeing the actual SQL in your code and in the database log is powerful. You always lose that with an ORM. The fact that I no longer have to write SQL because the AI does it for me is a game changer. I also use raw SQL for migrations now.

OpenAPI first: I tried various approaches here. There are many frameworks you can use. I ended up first generating the OpenAPI specification and then using code generation from there to the interface layer.
This approach works better with AI-generated code. The OpenAPI specification is now the canonical one that both clients and the server shim are based on.

Today I use Claude Code and Codex. Each has strengths, but the constant is Codex for code review after PRs. It’s very good at that. Claude is still indispensable when debugging and needing a lot of tool access (e.g., why do I have a deadlock, why is there corrupted data in the database, etc.). The two working together is where it’s most magical. Claude might find the data, Codex might understand it better.

I cannot stress enough how bad the code from these agents can be if you’re not careful. While they understand system architecture and how to build something, they can’t keep the whole picture in scope. They will recreate things that already exist. They create abstractions that are completely inappropriate for the scale of the problem. You constantly need to learn how to bring the right information to the context. For me, this means pointing the AI to existing implementations and giving it very specific instructions on how to follow along.

I generally create PR-sized chunks that I can review. There are two paths to this: Agent loop with finishing touches: Prompt until the result is close, then clean up. Lockstep loop: Earlier I went edit by edit. Now I lean on the first method most of the time, keeping a todo list for cleanups before merge. It requires intuition to know when each approach is more likely to lead to the right results. Familiarity with the agent also helps understanding when a task will not go anywhere, avoiding wasted cycles.

The most important piece of working with an agent is the same as regular software engineering. You need to understand your state machines, how the system behaves at any point in time, your database. It is easy to create systems that appear to behave correctly but have unclear runtime behavior when relying on agents. For instance, the AI doesn’t fully comprehend threading or goroutines. If you don’t keep the bad decisions at bay early on, you won’t be able to operate it in a stable manner later. Here’s an example: I asked it to build a rate limiter. It “worked” but lacked jitter and used poor storage decisions. Easy to fix if you know rate limiters, dangerous if you don’t.

Agents also operate on conventional wisdom from the internet and in turn do things I would never do myself. They love to use dependencies (particularly outdated ones). They love to swallow errors and take away all tracebacks. I’d rather uphold strong invariants and let code crash loudly when they fail, than hide problems. If you don’t fight this, you end up with opaque, unobservable systems.

For me, this has reached the point where I can’t imagine working any other way. Yes, I could probably have done it without AI. But I would have built a different system in parts because I would have made different trade-offs. This way of working unlocks paths I’d normally skip or defer. Here are some of the things I enjoyed a lot on this project:

Research + code, instead of research and code later: Some things that would have taken me a day or two to figure out now take 10 to 15 minutes. It allows me to directly play with one or two implementations of a problem. It moves me from abstract contemplation to hands-on evaluation.

Trying out things: I tried three different OpenAPI implementations and approaches in a day.

Constant refactoring: The code looks more organized than it would otherwise have been because the cost of refactoring is quite low.
You need to know what you do, but if set up well, refactoring becomes easy.

Infrastructure: Claude got me through AWS and Pulumi. Work I generally dislike became a few days instead of weeks. It also debugged the setup issues as it was going through them. I barely had to read the docs.

Adopting new patterns: While they suck at writing tests, they turned out great at setting up test infrastructure I didn’t know I needed. I got a recommendation on Twitter to use testcontainers for testing against Postgres. The approach runs migrations once and then creates database clones per test. That turns out to be super useful. It would have been quite an involved project to migrate to. Claude did it in an hour for all tests.

SQL quality: It writes solid SQL I could never remember. I just need to review it, which I can do. But to this day I suck at remembering it when writing it myself.

Is 90% of code going to be written by AI? I don’t know. What I do know is that, for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way. At the same time, for me, AI doesn’t own the code. I still review every line, shape the architecture, and carry the responsibility for how it runs in production. But the sheer volume of what I now let an agent generate would have been unthinkable even six months ago. That’s why I’m convinced this isn’t some far-off prediction. It’s already here — just unevenly distributed — and the number of developers working like this is only going to grow.

That said, none of this removes the need to actually be a good engineer. If you let the AI take over without judgment, you’ll end up with brittle systems and painful surprises (data loss, security holes, unscalable software). The tools are powerful, but they don’t absolve you of responsibility.

Armin Ronacher 7 months ago

What’s a Foreigner?

Across many countries, resistance to immigration is rising — even places with little immigration, like Japan, now see rallies against it . I’m not going to take a side here. I want to examine a simpler question: who do we mean when we say “foreigner”? I would argue there isn’t a universal answer. Laws differ, but so do social definitions.

In Vienna, where I live, immigration is visible: roughly half of primary school children don’t speak German at home . Austria makes citizenship hard to obtain. Many people born here aren’t citizens; at the same time, EU citizens living here have broad rights and labor-market access similar to native Austrians. Over my lifetime, the fear of foreigners has shifted: once aimed at nearby Eastern Europeans, it now falls more on people from outside the EU, often framed through religion or culture. Practically, “foreigner” increasingly ends up meaning “non-EU.” Keep in mind that over the last 30 years the EU went from 12 countries to 27. That’s a significant increase in social mobility.

I believe this is quite different from what is happening in the United States. The present-day US debate is more tightly tied to citizenship and allegiance, which is partly why current fights there include attempts to narrow who gets citizenship at birth. The worry is less about which foreigners come and more about the terms of becoming American and whether newcomers will embrace what some define as American values.

Inside the EU, the concept of EU citizenship changes social reality. Free movement, aligned standards, interoperable social systems, and easier labor mobility make EU citizens feel less “foreign” to each other — despite real frictions. The UK before Brexit was a notable exception: less integrated in visible ways and more hostile to Central and Eastern European workers. Perhaps another sign that the level of integration matters. In practical terms, allegiances are also much less clearly defined in the EU. There are people who live their entire lives in other EU countries and whose allegiance is no longer clearly aligned to any one country.

Legal immigration itself is widely misunderstood. Most systems are both far more restrictive in some areas and far more permissive than people assume. On the one hand, what’s called “illegal” is often entirely lawful. Many who are considered “illegal” are legally awaiting pending asylum decisions or are accepted refugees. These are processes many think shouldn’t exist, but they are, in fact, legal. On the other hand, the requirements for non-asylum immigration are very high, and most citizens of a country themselves would not qualify for skilled immigration visas. Meanwhile, the notion that a country could simply “remove all foreigners” runs into practical and ethical dead ends. Mobility pressures aren’t going away; they’re reinforced by universities, corporations, individual employers, demographics, and geopolitics.

Citizenship is just a small wrinkle. In Austria, you generally need to pass a modest German exam and renounce your prior citizenship. That creates odd outcomes: native-born non-citizens who speak perfect German but lack a passport, and naturalized citizens who never fully learned the language. Legally clear, socially messy — and not unique to Austria. The high hurdle to obtaining a passport also leads many educated people to intentionally opt out of becoming citizens. The cost that comes with renouncing a passport is not to be underestimated. Where does this leave us?
The realities of international mobility leave our current categories of immigration straining and misaligned with what the population at large thinks immigration should look like. Economic anxiety, war, and political polarization are making some groups of foreigners targets, while the deeper drivers behind immigration will only keep intensifying. Perhaps we need to admit that we’re all struggling with these questions. The person worried about their community or country changing too quickly and the immigrant seeking a better life are both responding to forces larger than themselves. In a world where capital moves freely but most people cannot, where climate change might soon displace millions, and where birth rates are collapsing in wealthy nations, our immigration systems will be tested and stressed, and our current laws and regulations are likely inadequate.

Armin Ronacher 7 months ago

996

“Amazing salary, hackerhouse in SF, crazy equity. 996 . Our mission is OSS.” — Gregor Zunic

“The current vibe is no drinking, no drugs, 9-9-6, […].” — Daksh Gupta

“The truth is, China’s really doing ‘007’ now—midnight to midnight, seven days a week […] if you want to build a $10 billion company, you have to work seven days a week.” — Harry Stebbings

I love work. I love working late nights, hacking on things. This week I didn’t go to sleep before midnight once. And yet… I also love my wife and kids. I love long walks, contemplating life over good coffee, and deep, meaningful conversations. None of this would be possible if my life was defined by 12 hour days, six days a week.

More importantly, a successful company is not a sprint, it’s a marathon. And that’s when it’s your own company! When you devote 72 hours a week to someone else’s startup, you need to really think about that arrangement a few times. I find it highly irresponsible for a founder to promote that model. As a founder, you are not an employee, and your risks and leverage are fundamentally different.

I will always advocate for putting the time in because it is what brought me happiness. Intensity, and giving a shit about what I’m doing, will always matter to me. But you don’t measure that by the energy you put in, or the hours you’re sitting in the office, but by the output you produce. Burning out on twelve-hour days, six days a week, has no prize at the end. It’s unsustainable, it shouldn’t be the standard and it sure as hell should not be seen as a positive sign of a company.

I’ve pulled many all-nighters, and I’ve enjoyed them. I still do. But they’re enjoyable in the right context, for the right reasons, and when it is a completely personal choice, not the basis of company culture. And that all-nighter? It comes with a fucked up and unproductive morning the day after. When someone promotes a 996 work culture, we should push back.
