Posts in Bash (20 found)
Danny McClelland 1 week ago

How I use VeraCrypt to keep my data secure

I’ve been using VeraCrypt for encrypted vaults for a while now. I mount and dismount vaults multiple times a day, and typing out the full command each time gets old fast. There’s nothing wrong with the CLI, it’s just repetitive, and repetitive is what aliases are for. The GUI exists, but I spend most of my time in a terminal, and launching a GUI app to mount a file feels like leaving the house to check if the back door is locked. So I wrote some aliases and functions. They’ve replaced the GUI for me entirely.

Before getting into the aliases: VeraCrypt is the right tool for this specific job, but it’s worth being clear about what that job is. I’m encrypting discrete chunks of data stored as container files, not entire drives. If I wanted to encrypt a USB pen drive or an external hard disk, I’d use LUKS instead, which is better suited to full-device encryption on Linux. VeraCrypt’s strength is the container format: a single encrypted file that you can copy anywhere, sync to cloud storage, and open on almost any platform. I format my vaults as exFAT specifically for this: it works on Windows, macOS, Linux, and iOS via Disk Decipher. That cross-platform use case is what makes it worth the extra ceremony.

This post covers what I ended up with and why. It’s worth saying upfront: this works for me, for my use case, right now. It doesn’t follow that it’s the right fit for anyone else. LUKS, Cryptomator, and plenty of other tools solve similar problems in different ways, and any of them might be a better fit depending on what you’re trying to do. I’m not attached to this setup permanently either. If something better comes along, or my requirements change, I’ll adapt.
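For a sense of the repetition the aliases remove, a single manual mount/dismount cycle looks roughly like this. The paths and the /tmp/vc mount root are my own examples, not the post's; the flags come from the VeraCrypt CLI.

```shell
# Sketch of one manual mount/dismount cycle (paths are examples).
# Wrapped in functions so nothing runs when this file is sourced.
vc_mount_once() {
  mkdir -p /tmp/vc/work
  veracrypt --text --pim 0 --keyfiles "" --protect-hidden no \
    ~/vaults/work.hc /tmp/vc/work
}

vc_dismount_once() {
  veracrypt --text --dismount ~/vaults/work.hc
  rmdir /tmp/vc/work 2>/dev/null
}
```

Typing the mount line in full, several times a day, is exactly the friction the rest of this post removes.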
The two simplest aliases list what’s currently mounted and create new vaults. The mount command is a full function because it needs to handle a few things: creating the mount directory, defaulting to the current directory if no path is specified, and (when only one vault is mounted in total) automatically cd-ing into it so I can get straight to work. The auto-cd only triggers when it’s the sole mounted vault. If I’ve already got other vaults open, it stays out of the way. Both sync clients are paused before mounting to prevent them trying to upload a vault that’s actively being written to — a reliable way to end up with a corrupted or conflicted file.

I keep several vault files in the same directory, so a mount-all function was a natural next step: mount every container file in a given directory with a single shared password. The (N) glob qualifier in zsh means the glob expands to nothing (rather than erroring) if no files match. Worth knowing if you’re adapting this for bash, where you’d handle the empty case differently.

Dismounting is where I hit the most friction. The dismount function handles both single-volume and all-at-once dismounting, and cleans up the mount directories afterwards. The dismount-all alias just calls it with no arguments: dismount everything, clean up the directories. The bit I added most recently is the cd out of the vault before dismounting. If I’m working inside a vault and run the dismount, it would fail silently because the directory was in use. The fix checks whether the current working directory is under any of the mounted paths and steps out first. The trailing slash on both sides avoids the edge case where one vault path is a prefix of another.

One more thing that makes this feel native rather than bolted on: tab completion for mounted volumes when dismounting, and completion for container files when mounting.

One feature worth mentioning, even if I don’t use it daily: VeraCrypt supports hidden volumes. The idea is that you create a second encrypted volume inside the free space of an existing one.
The outer volume gets a decoy password and some plausible-looking files. The hidden volume gets a separate password and your actual sensitive data. When VeraCrypt mounts, it tries the password you entered against the standard volume header first, then checks whether it matches the hidden volume header. Because VeraCrypt fills all free space with random data during creation, an observer cannot tell whether a hidden volume exists at all. It’s indistinguishable from random noise. In practice: if you’re ever compelled to hand over your password, you hand over the outer volume’s password. Nothing in the file itself proves there’s anything else there. This is what “plausible deniability” means in this context. It’s not a feature most people will ever need, but it exists and it’s well-implemented.

My vault files are stored in Dropbox rather than Proton Drive, which I realise sounds odd given that Proton Drive is the more privacy-focused option. The reason is practical: the Proton Drive iOS app fails to sync VeraCrypt vaults reliably. The developer of Disk Decipher (an iOS VeraCrypt client) recently dug into this, going through the Proton Drive app logs, and was incredibly helpful in tracking down the cause. The hypothesis is that VeraCrypt creates revisions faster than Proton Drive’s file provider can handle. What makes it worse is that the problem surfaces immediately: just mounting a vault and dismounting it again is enough to trigger the error. That’s a single write operation. There’s no practical workaround on the iOS side.

It’s an annoying trade-off. Dropbox has significantly more access to my files at the infrastructure level, but the vault files themselves are encrypted before they ever leave the machine, so what Dropbox sees is opaque either way. For now, it works. I’m keeping an eye on Proton Drive’s iOS progress. Google Drive is an obvious option I haven’t mentioned: that’s intentional.
I’m actively working on reducing my Google dependency, so it’s not something I’m considering here. Technically, on Linux, you could use rsync to swap Dropbox out for almost any provider. What keeps me on Dropbox for this specific use case is how it handles large files: it chunks them and syncs only the changed parts rather than re-uploading the whole thing. For vault files that can be several gigabytes, that matters. As you’ll have noticed above, the mount functions pause Dropbox and Proton Drive before mounting, and the dismount function restarts them once the last vault is closed. The sync clients fail silently if they’re not running, so the same code works on machines where neither is installed.

Since writing this, the picture has got worse. Mounir Idrassi, VeraCrypt’s developer, posted on SourceForge confirming what’s actually happening: Microsoft terminated the account used to sign VeraCrypt’s Windows drivers and bootloader. No warning, no explanation, and their message explicitly states no appeal is possible. He tried every contact route and reached only chatbots. The signing certificate on existing VeraCrypt builds is from a 2011 CA that expires in June 2026. Once that expires, Windows will refuse to load the driver, and the driver is required for everything: container mounting, portable mode, full disk encryption. The bootloader situation is worse still, sitting outside the OS and requiring firmware trust.

The post landed on Hacker News, where Jason Donenfeld, who maintains WireGuard, posted that the same thing has happened to him: account suspended without warning, currently in a 60-day appeals process. His point was direct: if a critical RCE in WireGuard were being actively exploited right now, he’d have no way to push an update. Microsoft would have his hands entirely tied. This isn’t a one-off. A LibreOffice developer was banned under similar circumstances last year.
The pattern is open source security tool developers losing distribution rights, without warning, with an appeals process that appears largely decorative. Larger projects may eventually get restored through media pressure. Most won’t have that option. I’m on Linux, so none of this touches me directly. If you’re on Windows and relying on VeraCrypt, “watch it closely” has become genuinely urgent.

All of these aliases and functions live in my dotfiles.
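For readers who want the shape of the functions without digging through the dotfiles, here is a bash approximation of the mount and dismount helpers described above. The function names, the /tmp/vc mount root, and the exact VeraCrypt flags are my assumptions; the originals are written for zsh.

```shell
VC_ROOT=/tmp/vc   # assumed mount root, not the post's actual path

# Trailing slash on both sides so /tmp/vc/work is not treated as being
# "inside" /tmp/vc/workshop (the prefix edge case mentioned above).
_vc_inside() {
  case "$1/" in "$2/"*) return 0 ;; *) return 1 ;; esac
}

vcm() {  # mount a vault; cd in when it's the only one mounted
  vault="$1"
  name=$(basename "${vault%.*}")
  mkdir -p "$VC_ROOT/$name"
  veracrypt --text "$vault" "$VC_ROOT/$name" || return 1
  # auto-cd only when exactly one vault is mounted
  if [ "$(veracrypt --text --list 2>/dev/null | wc -l)" -eq 1 ]; then
    cd "$VC_ROOT/$name" || return 1
  fi
}

vcu() {  # dismount one vault, or everything with no argument
  for mp in "$VC_ROOT"/*/; do
    [ -d "$mp" ] || continue
    if _vc_inside "$PWD" "${mp%/}"; then
      cd "$HOME" || return 1   # step out first, or the dismount fails
      break
    fi
  done
  veracrypt --text --dismount ${1:+"$1"}
  rmdir "$VC_ROOT"/*/ 2>/dev/null
}
```

The `_vc_inside` helper is the interesting bit: appending a slash to both the current directory and the mount point before the prefix test is what stops one vault's path from matching another vault whose name merely starts the same way.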

Brain Baking 1 week ago

Remakes And Remasters Of Old DOS Games: A Small 2026 Update

It’s been two years since the Remakes And Remasters Of Old DOS Games article. Nostalgia still sells handsomely, thus our favourite remaster studios (hello Nightdive) are cranking out hit after hit. It’s time for a small 2026 update. I’ve also updated the original article just in case you might find your way here through that one. Below is a list of remakes and remasters announced and/or released since April 2024. Guess what, Nightdive is still running the show here:

- Little Big Adventure: Twinsen’s Quest, released in November 2024, is a complete graphical overhaul of the original. Not a remake but still noteworthy;
- Gobliins 6 is a sequel to a 34 year old DOS game!
- Star Wars: Dark Forces got a remaster;
- Although not a DOS game, Outlaws got the remaster treatment as well;
- Oh, and yes, DOOM I + II is another masterpiece;
- As is the Heretic + Hexen package;
- As did Blood as Refreshed Supply (again?);
- BioMenace: Remastered by the same devs that did the Duke Nukem 1 & 2 remasters on Evercade. I enjoyed it, it’s good!
- A Halloween Harry-inspired top-down 3D version is currently being made that only shares the name & style of the original—luckily, not the crappy level design;
- Ubisoft remastered the original Rayman (30th Anniversary Edition) but it wasn’t met with much success. They changed the included GBA music—that’s what SEGA would have done, right?
- I found a Master of Magic remake (2022) on Steam that’s been met with some positive reception. I didn’t play the original so can’t say how faithfully it’s related to the DOS version;
- Blizzard also decided to cash in with the Warcraft I+II remaster bundle. I was mostly a Warcraft III person so I can’t comment on this;
- Someone did a Wacky Wheels HD Remake? Wow! Best approach this one carefully though, it looks to have its own technical problems.

At this point I don’t even know where to start! Monster Bash HD is still being worked on (I hope?). Did I miss something? Let me know!

Related topics: games / dos / engines. By Wouter Groeneveld on 5 April 2026. Reply via email.

Ankur Sethi 1 week ago

I'm no longer using coding assistants on personal projects

I’ve spent the last few months figuring out how best to use LLMs to build software. In January and February, I used Claude Code to build a little programming language in C. In December I used a local LLM to analyze all the journal entries I wrote in 2025, and then used Gemini to write scripts that could visualize that data. Besides what I’ve written about publicly, I’ve also used Claude Code for a handful of other tasks, listed at the end of this post.

I won’t lie, I started off skeptical about the ability of LLMs to write code, but I can’t deny the fact that, in 2026, they can produce code that’s as good as or better than a junior-to-intermediate developer for most programming domains. If you’re abstaining from learning about or using LLMs in your own work, you’re doing a disservice to yourself and your career. It’s a very real possibility that in five years, most of the code we write will be produced using an LLM. It’s not a certainty, but it’s a strong possibility.

However, I’m not going to stop writing code by hand. Not anytime soon. As long as there are computers to program, I will be programming them using my own two fleshy human hands. I started programming computers because I enjoy the act of programming. I enjoy thinking through problems, coming up with solutions, evolving those solutions so that they are as correct and clear as possible, and then putting them out into the world where they can be of use to people. It’s a fun and fulfilling profession.

Some people see the need for writing code as an impediment to getting good use out of a computer. In fact, some of the most avid fans of generative AI believe that the act of actually doing the work is a punishment. They see work as unnecessary friction that must be optimized away. Truth is, the friction inherent in doing any kind of work—writing, programming, making music, painting, or any other creative activity generative AI purports to replace—is the whole point. The artifacts you produce as the result of your hard work are not important. They are incidental.
The work itself is the point. When you do the work, you change and grow and become more yourself. Work—especially creative work—is an act of self-love if you choose to see it that way.

Besides, when you rely on generative AI to do the work, you miss out on the pleasurable sensations of being in flow state. Your skills atrophy (no, writing good prompts is not a skill, any idiot can do it). Your brain gets saturated with dopamine in the same way as when you gamble, doomscroll, or play a gacha game. Using Claude Code as your main method of producing code is like scrolling TikTok eight hours a day, every day, for work.

And the worst part? The code you produce using LLMs is pure cognitive debt. You have no idea what it’s doing, only that it seems to be doing what you want it to do. You don’t have a mental model for how it works, and you can’t fix it if it breaks in production. Such a codebase is not an asset but a liability. I predict that in 1-3 years we’re going to see organizations rewrite their LLM-generated software using actual human programmers.

Personally, I’ve stopped using generative AI to write code for my personal projects. I still use Claude Code as a souped-up search engine to look up information, or to help me debug nasty errors. But I’m manually typing every single line of code in my current Django project, with my own fingers, using a real physical keyboard. I’m even thinking up all the code using my own brain. Miraculous!

For the commercial projects I work on for my clients, I’m going to follow whatever the norms around LLM use happen to be at my workplace. If a client requires me to use Claude Code to write every single line of code, I’ll be happy to oblige. If they ban LLMs outright, I’m fine with that too. After spending hundreds of hours yelling at Claude, I’m dangerously proficient at getting it to do the right thing. But I haven’t lost my programming skills yet, and I don’t plan to. I’m flexible.
Given the freedom to choose, I’d probably pick a middle path: use LLMs to generate boilerplate code, write tricky test cases, debug nasty issues I can’t figure out, and quickly prototype ideas to test. I’m not an AI vegan. But when it comes to code I write for myself—which includes the code that runs this website—I’m going to continue writing it myself, line by line, like I always did. Somebody has to clean up after the robots when they make a mess, right?

Besides what I’ve written about publicly, I’ve used Claude Code to:

- Write and debug Emacs Lisp for my personal Emacs configuration.
- Write several Alfred workflows (in Bash, AppleScript, and Swift) to automate tasks on my computer.
- Debug CSS issues on this very website.
- Generate React components for a couple of throwaway side projects.
- Generate Django apps for a couple of throwaway side projects.
- Port color themes between text editors.
- A lot more that I’m forgetting now.

neilzone 2 weeks ago

Implementing the somewhat whimsical human.json protocol on my website

Terence blogged about adding a human.json file to his website. I wanted to do the same. The specification for human.json describes itself as a lightweight protocol for humans to assert authorship of their site content and vouch for the humanity of others. It uses URL ownership as identity, and trust propagates through a crawlable web of vouches between sites. A bit like signing each other’s PGP keys, really.

There are a few steps, listed at the end of this post. I made a simple bash script to simplify the process of creating the JSON to vouch for someone. I am sure that there are better ways of doing this, but it works for me. I am using a separate directory for this JSON file, as it wants specific headers; I am using apache, and set the headers in the relevant configuration file.

Using the Firefox browser extension, which is probably available for other browsers too, I can see if a site offers a human.json file, or is vouched for by another person whose own human.json file I already trust.

Will it catch on? I doubt it. It is a bit of whimsy, and that is no bad thing. I have only included URLs where the site owner has consented for me to do so. If you are such a person and wish me to remove the “vouch” from my site, then please do just let me know. Consent is sexy. Because I am low-key “vouching” for people, I’ve only vouched for people that I know, even for a relatively limited definition of “know”. Not strangers, but not limited to the most intimate of relationships either. Mostly fedi friends, which is nice.

Is it bad? I don’t think so. I have seen a couple of comments about it being a useful thing for AI scrapers to follow, but frankly they seem to be doing just fine anyway. If signalling to fellow humans also attracts unwanted traffic, well, in this case, so be it.
The steps:

- add a JSON file to your webserver, with some basic information
- update that file when you “vouch” for someone else’s site as being created by a human and free of AI
- add some header material to your website, to reference the source of your human.json file
- set a couple of web server headers
- use a browser extension to surface that file on other people’s websites, if they have implemented human.json
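A minimal version of the vouch-generating script might look like this. The field names here are illustrative guesses based on the description above, so check the actual human.json spec before copying.

```shell
# Emit a vouch entry as JSON on stdout. Field names ("url", "name",
# "date") are illustrative; the real human.json spec may differ.
vouch_json() {
  url="$1"
  name="$2"
  printf '{\n  "url": "%s",\n  "name": "%s",\n  "date": "%s"\n}\n' \
    "$url" "$name" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
```

The output can then be pasted into the vouches section of your own human.json file.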

neilzone 3 weeks ago

Moving (for now?) from HomeAssistant in Python venvs to HomeAssistantOS

I have used HomeAssistant for years. So many years, that I do not remember how many. Nothing I do with it is particularly fancy, but things like having my office lights turn on when I open the door if the light is below a certain luminosity, or turning off my Brompton bike charger once it has finished charging, are fun and convenient. We also have solar panels and a battery now, so I will be interested to see if I use HomeAssistant more for that. But anyway.

I have been using HomeAssistant, on a Raspberry Pi 4, using Python venvs for years. It has worked absolutely fine for me, and I have (or, at least, had) no compelling reason to change. For me, this was the ideal setup, in that I could set the Pi up how I wanted, in terms of security and monitoring, and just run HomeAssistant on it. Updating HomeAssistant was as easy as running a simple bash script. I liked it.

But… that approach is no longer supported, and, where possible, I prefer to use supported means of running software. That means either running HomeAssistantOS, or else using a containerised instance of HomeAssistant. While I could probably find my way through setting up a HomeAssistant container via podman, it would not be my preference, so I decided to give HomeAssistantOS a go, albeit with some trepidation.

As expected, it was easy to install HAOS: write the image to a microSD card, and pop it into the Pi. I already had the switch port set up to the right VLAN, so I plugged in the Pi and waited a few minutes. I had anticipated that it would offer https, via a self-signed certificate, so I was a bit baffled to get a TLS error when I connected to it. “Never mind”, I thought. “I’ll just ssh into it and sort it out.” But no, no ssh either. Fortunately, I discovered quite quickly that, out of the box, it does not offer TLS, and I was able to access the web interface.

I had taken a backup from my existing HomeAssistant installation, and I used the web interface on the new installation to restore it.
It took a few minutes, but restored absolutely everything. I was impressed.

I was anticipating - indeed, hoping - to set up TLS and reverse proxying using certbot and nginx. But that is not possible. Instead, I achieved it (reasonably easily, but not as easily as using a command line) via Add-ons from within the HomeAssistant UI. I’d have preferred to have done it the normal way, via ssh, but oh well.

Annoyingly, I’d also like to have configured a firewall on the machine, but that is not an option either. I’ve yet to determine if that is going to be a dealbreaker for me, or whether relying on the network-level firewall, controlling access to and from that VLAN, and that machine, will be sufficient. I have also not been able to set up a separate ssh account for my Greenbone scanning software, or to configure Wazuh to get the machine talking to my SIEM. Again, I will need to consider the impact of this, but intuitively it does not sit comfortably with me.

Nor can I find a way to use restic to back up the configuration and other bits, incrementally and automatically, onto another machine, like I am used to doing. I will have a poke around with the backup tooling offered but, again, this does not enthral me. I want to know that, if there’s a problem, I have a backup on my restic server.

Since I have used HomeAssistant for so long, and since I just restored a backup, the most I can say really is that it is all still working. It doesn’t seem faster or slower. The limitations of the appliance-based approach are annoying me, and may be sufficient to drive me towards a container-based approach instead (although that does not appeal to me either).

Ultimately, I accept that I am but one user, and perhaps many users do not want the things that I want. Importantly, I am not the developer, and so what I want may simply not be things that they wish to provide. And that is their choice. I guess - personal opinion - that I would prefer a computer and not an appliance.
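For contrast with the appliance approach, the kind of "simple bash script" update that the old venv setup allowed looks roughly like this. This is a sketch: the venv path, user, and systemd unit name are common conventions, not Neil's actual script.

```shell
# Stop the service, upgrade the package inside the venv, start it again.
# /srv/homeassistant and the systemd unit name are assumptions.
update_ha() {
  venv="${1:-/srv/homeassistant}"
  sudo systemctl stop home-assistant@homeassistant &&
  sudo -u homeassistant -H "$venv/bin/pip" install --upgrade homeassistant &&
  sudo systemctl start home-assistant@homeassistant
}
```

Three commands, fully scriptable over ssh, which is exactly the kind of control the appliance model takes away.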

neilzone 1 month ago

Moving my static site blog generator from hugo to BSSG

I enjoy blogging. I blog on my own personal site (this blog), and I also have a blog for my work site, decoded.legal. In 2023, I moved my blog to a static site generated by hugo. I've been reasonably pleased with hugo, and it does the job, but I find it complex. In short, if an update broke my site, I am not 100% convinced that I would be able to fix it. I don't need much in the way of complexity; I have a simple, predominantly text, blog, and all I want is to be able to write posts in markdown, generate a static html site from it, and rsync it to a webserver, along with an RSS feed. I am using a Raspberry Pi 4 as my webserver, and this works fine, given my lightweight, low complexity, sites.

On the fediverse, I saw Stefano Marinelli discussing his own static site generator - the Bash Static Site Generator, also called "BSSG" - and I was keen to give it a try. I guess that I am simply more confident that, if there was a problem, I'd be able to fix something written in bash.

I am running hugo (and now BSSG) on my Raspberry Pi 4 webserver. I could install it on something beefier, like my laptop, and then just rsync the output files to the webserver, but, again for simplicity, it makes sense to me to run the static site generator on the webserver itself. I don't have anything particular to note about the basic installation.

I wanted to make quite a few changes to the default configuration, so I decided that the simplest thing to do was to copy the whole config file from the BSSG installation directory into my site directory, and then amend it. Here is my configuration file. (I have a separate file, in the same directory, for my .onion site; this is much the same, but referencing the .onion URL instead, and with a separate output directory.)

I was happy with how my old blog looked, and, for the work blog, I wanted it to remain consistent with the main website.
I started with the BSSG "minimal" theme, and then made the changes that I wanted to support "dark mode", remove transitions/transformations, and to generally get to the look that I wanted. Here is the resulting css.

One can also have site-specific templates, so I copied the templates directory from the BSSG directory into my site directory, and made changes there. In particular, I made a few changes in the header template, listed at the end of this post; here is the header file. In the footer, I amended the copyright information, and, on the work blog, added a short disclaimer. (My footer.)

There is a significant (but not total) overlap between the header material of blogposts for hugo and blogposts for BSSG. I'm not entirely sure that I needed to do anything at all, aside from copying the raw markdown files into BSSG's directory, but I used a few regexes to align the header material anyway. (Yes, there might be shorter / cleaner / faster etc. ways of doing this. This worked for me.) I also found - thanks to an error message when I first tried to build the BSSG content - that BSSG does not like source files with spaces in the names. I did not have many (although one was enough), so I fixed that.

One thing that I did not do with hugo is have descriptions for my posts. I think that I'd prefer not to have descriptions displayed at all, but I've yet to find a way to suppress them in BSSG without editing the underlying scripts, which (for ease of updating), I am loathe to do.

I am not using BSSG's editing tool, or its command line tools for adding new posts (although I might need to use it for deleting posts). Instead, I prefer to write markdown in vim, and then upload that to the webserver and then build the site. I have a small shell script on my laptop and phone, which generates a text file (with a .md extension) with the correct header material, and it pre-populates the date and time in the correct format.
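A sketch of what such a new-post helper can look like; the exact front-matter field names BSSG expects are assumptions on my part, not taken from Neil's script.

```shell
# Print markdown front matter with the date and time pre-populated.
# Field names ("Title:", "Date:") are illustrative.
new_post() {
  title="$1"
  printf 'Title: %s\nDate: %s\n\n' "$title" "$(date '+%Y-%m-%d %H:%M')"
}
```

Usage would be something like `new_post "My new post" > 2026-04-05-my-new-post.md`, then writing the body in vim.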
I then have a separate script which I use to push the new blogpost to the webserver, and which then, via ssh, runs a script in the relevant BSSG site directory to build the site and rsync it into place. Here is that build script. (Although "build script" makes it sound fancier than it is.)

It is early days, so these are little more than my immediate notes. I'd like to find a way to remove the descriptions from the index page. But, other than that, I am very happy with BSSG, and I am very grateful to Stefano for making it available. Building this blog on a Raspberry Pi 4, even using the (newly-fixed; thanks, Stefano!) "ram" mode, is not exactly rapid, but that is not a particular concern for me. I am very pleased. And, if you can read this - my first new blogpost since adopting BSSG - then everything is going well :)

To recap, the changes I made in the header template:

- added an inline svg for the icon, in lieu of a favicon file
- added a link for fediverse verification
- added a link for "fediverse:creator", so that post previews in Mastodon link to my Mastodon account
- adjusted some of the OpenGraph (fedi previews) stuff, to use a static image, since I do not use header images (or, really, any images at all)
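The build-and-deploy step mentioned above is roughly this shape. The directory layout, output path, and web root are my assumptions; the real script is linked from the post.

```shell
# Build the site with BSSG in the site directory, then rsync the
# generated output into the web root. Paths are assumed.
build_site() {
  site_dir="$1"
  web_root="$2"
  cd "$site_dir" || return 1
  bssg || return 1
  rsync -a --delete "$site_dir/output/" "$web_root/"
}
```

The `--delete` flag keeps the web root in step with the generated output, so removed posts disappear from the live site too.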


XML is a Cheap DSL

Yesterday, the IRS announced the release of the project I’ve been the engineering lead on since this summer, its new Tax Withholding Estimator (TWE). Taxpayers enter their income, expected deductions, and other relevant info to estimate what they’ll owe in taxes at the end of the year, and adjust the withholdings on their paycheck. It’s free, open source, and, in a major first for the IRS, open for public contributions. TWE is full of exciting learnings about the field of public sector software. Being me, I’m going to start by writing about by far the driest one: XML. (I am writing this in my personal capacity, based on the open source release, not in my position as a federal employee.)

XML is widely considered clunky at best, obsolete at worst. It evokes memories of SOAP configs and J2EE (it’s fine, even good, if those acronyms don’t mean anything to you). My experience with the Tax Withholding Estimator, however, has taught me that XML absolutely has a place in modern software development, and it should be considered a leading option for any cross-platform declarative specification.

TWE is a static site generated from two XML configurations. The first of these configs is the Fact Dictionary, our representation of the US Tax Code; the second will be the subject of a later blog post. We use the Fact Graph, a logic engine, to calculate the taxpayer’s tax obligations (and their withholdings) based on the facts defined in the Fact Dictionary. The Fact Graph was originally built for IRS Direct File and now we use it for TWE.

I’m going to introduce you to the Fact Graph the way that I was introduced to it: by example. Put aside any preconceptions you might have about XML for a moment and consider a fact that is derived by subtracting one fact from another. In tax terms, this fact describes the amount you will need to pay the IRS at the end of the year.
That amount, “total owed,” is the difference between the total taxes due for your income (“total tax”) and the amount you’ve already paid (“total payments”). My initial reaction to this was that it’s quite verbose, but also reasonably clear. That’s more or less how I still feel. You only need to look at a few of these to intuit the structure.

Take the refundable credits calculation, for example. A refundable credit is a tax credit that can lead to a negative tax balance—if you qualify for more refundable credits than you owe in taxes, the government just gives you some money. TWE calculates the total value of refundable credits by adding up the values of the Earned Income Credit, the Child Tax Credit (CTC), American Opportunity Credit, the refundable portion of the Adoption Credit, and some other stuff from the Schedule 3. By contrast, non-refundable tax credits can bring your tax burden down to zero, but won’t ever make it negative. TWE models that by subtracting non-refundable credits from the tentative tax burden while making sure it can’t go below zero, using a greater-of operator. While admittedly very verbose, the nesting is straightforward to follow. The tax after non-refundable credits is derived by saying “give me the greater of these two numbers: zero, or the difference between tentative tax and the non-refundable credits.”

Finally, what about inputs? Obviously we need places for the taxpayer to provide information, so that we can calculate all the other values. Okay, so instead of a derived fact we use a writable one. Because the value is… writable. Fair enough. A type element denotes what type of value the fact takes. True-or-false questions use a boolean type, like the one that records whether the taxpayer is 65 or older. There are some (much) longer facts, but these are a fair representation of what the median fact looks like. Facts depend on other facts, sometimes derived and sometimes writable, and they all add up to some final tax numbers at the end.
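Facts of the shapes described above might look like this. The element names and paths here are my approximation of the Fact Dictionary style; the real entries live in the open source TWE repository.

```xml
<!-- Approximation of the Fact Dictionary style, not the actual source. -->

<!-- A derived fact: total owed = total tax minus total payments. -->
<Fact path="/totalOwed">
  <Derived>
    <Subtract>
      <Minuend><Dependency path="/totalTax" /></Minuend>
      <Subtrahend><Dependency path="/totalPayments" /></Subtrahend>
    </Subtract>
  </Derived>
</Fact>

<!-- A derivation clamped at zero with a greater-of operator. -->
<Fact path="/taxAfterNonRefundableCredits">
  <Derived>
    <GreaterOf>
      <Dollar>0</Dollar>
      <Subtract>
        <Minuend><Dependency path="/tentativeTax" /></Minuend>
        <Subtrahend><Dependency path="/nonRefundableCredits" /></Subtrahend>
      </Subtract>
    </GreaterOf>
  </Derived>
</Fact>

<!-- A writable boolean input supplied by the taxpayer. -->
<Fact path="/isAge65OrOlder">
  <Writable>
    <Boolean />
  </Writable>
</Fact>
```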
But why encode math this way when it seems far clunkier than traditional notation? Countless mainstream programming languages would instead let you write this calculation in a notation that looks more like normal math—take JavaScript, where the subtraction reads like elementary algebra. That seems better! It’s far more concise, easier to read, and doesn’t make you explicitly label the “minuend” and “subtrahend.” Let’s add in the definitions for total tax and total payments. Still not too bad. Total tax is calculated by adding the tax after non-refundable credits (discussed earlier) to whatever’s in “other taxes.” Total payments is the sum of estimated taxes you’ve already paid, taxes you’ve paid on social security, and any refundable credits.

The problem with the JavaScript representation is that it’s imperative. It describes actions you take in a sequence, and once the sequence is done, the intermediate steps are lost. The issues with this get more obvious when you go another level deeper, adding the definitions of all the values that total tax and total payments depend on. We are quickly arriving at a situation that has a lot of subtle problems.

One problem is the execution order. The hypothetical input function solicits an answer from the taxpayer, which has to happen before the program can continue. Calculations that don’t depend on knowing “total estimated taxes” are still held up waiting for the user; calculations that do depend on knowing that value had better be specified after it. Or, take a close look at how you’d add up all the social security income: all of a sudden we are really in the weeds with JavaScript, mapping and reducing over a collection. These are not complicated code concepts—map and reduce are both in the standard library and basic functional paradigms are widespread these days—but they are not tax math concepts. Instead, they are implementation details.

Compare that to the Fact representation of the same value. This isn’t perfect—the element that represents each social security source is a little hacky—but the meaning is much clearer.
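The imperative JavaScript version under discussion was roughly of this shape; this is my reconstruction with made-up sample numbers, not TWE code.

```javascript
// Reconstruction, not TWE code. Names mirror the facts discussed above;
// the numbers are invented for illustration.
const taxAfterNonRefundableCredits = 4000;
const otherTaxes = 500;
const estimatedTaxesPaid = 1000;
const refundableCredits = 200;
const socialSecuritySources = [{ taxesPaid: 300 }, { taxesPaid: 200 }];

// "In the weeds": map and reduce are code concepts, not tax concepts.
const socialSecurityTaxesPaid = socialSecuritySources
  .map((source) => source.taxesPaid)
  .reduce((a, b) => a + b, 0);

const totalTax = taxAfterNonRefundableCredits + otherTaxes;
const totalPayments =
  estimatedTaxesPaid + socialSecurityTaxesPaid + refundableCredits;
const totalOwed = totalTax - totalPayments;
// Once these lines run, the intermediate steps are gone: you can't ask
// totalOwed how it was calculated.
```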
What are the total taxes paid on social security income? The sum of the taxes paid on each social security income. How do you add all the items in a collection? With . Plus, it reads like all the other facts; needing to add up all items in a collection didn’t suddenly kick us into a new conceptual realm. The philosophical difference between these two is that, unlike JavaScript, which is imperative , the Fact Dictionary is declarative . It doesn’t describe exactly what steps the computer will take or in what order; it describes a bunch of named calculations and how they depend on each other. The engine decides automatically how to execute that calculation. Besides being (relatively) friendlier to read, the most important benefit of a declarative tax model is that you can ask the program how it calculated something. Per the Fact Graph’s original author, Chris Given: The Fact Graph provides us with a means of proving that none of the unasked questions would have changed the bottom line of your tax return and that you’re getting every tax benefit to which you’re entitled. Suppose you get a value for that doesn’t seem right. You can’t ask the JavaScript version “how did you arrive at that number?” because those intermediate values have already been discarded. Imperative programs are generally debugged by adding log statements or stepping through with a debugger, pausing to check each value. This works fine when the number of intermediate values is small; it does not scale at all for the US Tax Code, where the final value is derived from hundreds upon hundreds of intermediate calculations. With a declarative graph representation, we get auditability and introspection for free, for every single calculation. Intuit, the company behind TurboTax, came to the same conclusion and published a whitepaper about their “Tax Knowledge Graph” in 2020. Their implementation is not open source, however (or at least I can’t find it).
The IRS Fact Graph is open source and public domain, so it can be studied, shared, and extended by the public. If we accept the need for a declarative data representation of the tax code, what should it be? In many of the places where people used to encounter XML, such as network data transfer and configuration files, it has been replaced by JSON. I find JSON to be a reasonably good wire format and a painful configuration format, but in neither case would I rather be using XML (although it’s a close call on the latter). The Fact Dictionary is different. It’s not a pile of settings or key-value pairs. It’s a custom language that models a unique and complex problem space. In programming we call this a domain-specific language, or DSL for short. As an exercise, I tried to come up with a plausible JSON representation of the fact from earlier. This is not a terribly complicated fact, but it’s immediately apparent that JSON does not handle arbitrary nested expressions well. The only complex data structure available in JSON is an object, so every child object has to declare what kind of object it is. Contrast that with XML, where the “kind” of the object is embedded in its delimiters. I think this XML representation could be improved, but even in its current form, it is clearly better than JSON. (It’s also, amusingly, a couple of lines shorter.) Attributes and named children give you just enough expressive power to make choices about what your language should or should not emphasize. Not being tied to a specific set of data types makes it reasonable to define your own, such as a distinction between “dollars” and “integers.” A lot of minor frustrations we’ve all internalized as inevitable with JSON are actually JSON-specific. XML has comments, for instance. That’s nice. It also has sane whitespace and newline handling, which is important when your descriptions are often long. For text that has any length or shape to it, XML is far more pleasant to read and edit by hand than JSON.
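To make the contrast concrete, here is a hypothetical reconstruction of the “total owed” fact in both formats. The element and key names are my guesses based on the description above (including the “minuend”/“subtrahend” labels the article mentions), not the actual Fact Dictionary schema:

```xml
<Fact path="/totalOwed">
  <Derived>
    <Subtract>
      <Minuend><Dependency path="/totalTax" /></Minuend>
      <Subtrahends><Dependency path="/totalPayments" /></Subtrahends>
    </Subtract>
  </Derived>
</Fact>
```

```json
{
  "kind": "Fact",
  "path": "/totalOwed",
  "derived": {
    "kind": "Subtract",
    "minuend": { "kind": "Dependency", "path": "/totalTax" },
    "subtrahends": [{ "kind": "Dependency", "path": "/totalPayments" }]
  }
}
```

Note how every JSON child object has to announce its own “kind”, while in XML the kind is the tag name itself and attributes like `path` come for free.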
There are still verbosity gains to be had, particularly with switch statements (omitted here out of respect for page length). I’d certainly remove the explicit “minuend” and “subtrahend,” for starters. I believe that the original team didn’t do this because they didn’t want the order of the children to have semantic consequence. I get it, but order is guaranteed in XML and I think the additional nesting and words do more harm than good. What about YAML? Chris Given again: whatever you do, don’t try to express the logic of the Internal Revenue Code as YAML Finally, there’s a good case to be made that you could build this DSL with s-expressions. In a lot of ways, this is the nicest syntax to read and edit. HackerNews user ok123456 asks: “Why would I want to use this over Prolog/Datalog?” I’m a Prolog fan! This is also possible. My friend Deniz couldn’t help but rewrite it in KDL, a cool thing I had to look up. At least to my eye, all of these feel more pleasant than the XML version. When I started working on the Fact Graph, I strongly considered proposing a transition to s-expressions. I even half-jokingly included it in a draft design document. The process of actually building on top of the Fact Graph, however, taught me something very important about the value of XML. Using XML gives you a parser and a universal tooling ecosystem for free. Take Prolog, for instance. You can relate XML to Prolog terms with a single predicate . If I want to explore Fact Dictionaries in Prolog—or even make a whole alternative implementation of the Fact Graph—I basically get the Prolog representation out of the box. S-expressions work great in Lisp and Prolog terms work great in Prolog. XML can be transformed, more or less natively, into anything. That makes it a great canonical, cross-platform data format. XML is rivaled only by JSON in the maturity and availability of its tooling. At one point I had the idea that it would be helpful to fuzzy search for Fact definitions by path.
I’d like to just type “overtime” and see all the facts related to overtime. Regular searches of the codebase were cluttered with references and dependencies. This was possible entirely with shell commands I already had on my computer. This uses XPath to query all the fact paths, to clean up the output, and to interactively search the results. I solved my problem with a trivial bash one-liner. I kept going and said: not only do I want to search the paths, I’d like selecting one of the paths to show me the definition. Easy. Just take the result of the first command, which is a path attribute, and use it in a second XPath query. I got a little carried away building this out into a “$0 Dispatch Pattern” script of the kind described by Andy Chu . (Andy is a blogging icon, by the way.) I also added dependency search—not only can you query the definition of a fact, but you can go up the dependency chain by asking what facts depend on it. Try it yourself by cloning the repo and running (you need installed). The error handling is janky but it’s pretty solid for 60 lines of bash I wrote in an afternoon. I use it almost daily. I’m not sure how many people used my script, but multiple other team members put together similarly quick, powerful debugging tools that became part of everyone’s workflow. All of these tools relied on being able to trivially parse the XML representation and work with it in the language that best suited the problem they were trying to solve, without touching the Fact Graph’s actual implementation in Scala. The lesson I took from this is that a universal data representation is worth its weight in gold. There are exactly two options in this category. In most cases you should choose JSON. If you need a DSL though, XML is by far the cheapest one, and the cost-efficiency of building on it will empower your team to spend their innovation budget elsewhere. Thanks to Chris Given and Deniz Akşimşek for their feedback on a draft of this blog. 
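A sketch of that kind of one-liner, assuming `xmllint` (from libxml2) for the XPath query and a `sed` cleanup; the file name and schema below are invented for illustration, and the real script pipes the result into `fzf` for the interactive fuzzy search:

```shell
# A toy fact dictionary to run against (real files are much larger, and the
# actual schema may differ; this layout is an illustrative guess):
cat > /tmp/facts.xml <<'EOF'
<Facts>
  <Fact path="/totalTax"/>
  <Fact path="/totalOwed"/>
  <Fact path="/overtimeIncome"/>
</Facts>
EOF

# Extract every fact path with an XPath query, one per line.
# Append "| fzf" to search the list interactively.
xmllint --xpath '//Fact/@path' /tmp/facts.xml \
  | tr ' ' '\n' \
  | sed -e 's/^path="//' -e 's/"$//' -e '/^$/d'
```

Dropping the final `fzf` (as here) gives you a plain list you can pipe anywhere else, such as into a second XPath query for the selected fact’s definition.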
I had never heard of XPath before 2023, when Deniz figured out an XPath query that made my first htmx PR possible. Another reason to use XML is that humans who aren’t programmers can read it. They usually don’t like it, but, if you did a good-enough job designing the schema, they can read it in a pinch. Do them a favor and build an alternative view, though. Because you’re using XML, this is pretty easy. It’s probably just because I’ve started to use it—buy a Jeep Grand Cherokee and suddenly the roadways seem full of them—but lately I have noticed an uptick in XML interest. Fellow Spring ’24 Recurser Jake Low recently wrote a tool called which turns XML documents into a flat, line-oriented representation. Martijn Faassen has been working on a modern XPath and XSLT engine in Rust . I’m not sure it’s fair to call JSON “lobotomized” but I thought this article was largely correct about the problems XML can solve. The binary format is especially interesting to me.

Brain Baking 1 month ago

A Note On Shelling In Emacs

As you no doubt know by now, we Emacs users have the Teenage Mutant Ninja Power . Expert usage of the Heroes in a Hard Shell is no exception. Pizza Time! All silliness aside, the plethora of options available to the Emacs user when it comes to executing shell commands in “terminals”—real or fake—can be overwhelming. There’s , , , , , and then third-party packages further expand this with , , … The most interesting shell by far is the one that’s not a shell but a Lisp REPL that looks like a shell: Eshell . That’s the one I would like to focus on now. But first: why would you want to pull your shell work into Emacs? The more you get used to it, the easier it will be to answer this: because all your favourite text selection, manipulation, … shortcuts will be available to you. Remember how stupendously difficult it is to just shift-select and yank/copy/whatever-you-want-to-call-it text in your average terminal emulator? That’s why. In Emacs, I can move the point around in that shell buffer however I want. I can search inside that buffer—since everything is just text—however I want. Even the easiest solution, just firing off your vanilla , which in my case runs Zsh, will net you most of these benefits. And then there’s Eshell: the Lisp-powered shell that’s not really a shell but does a really good job of pretending it is. With Eshell you can interact with everything else you’ve got up and running inside Emacs. Want to dump the output to a buffer at point? . Want to see what’s hooked into LSP mode? . Want to create your own commands? and then just . Eshell makes it possible to mix Elisp and your typical Bash-like syntax. The only problem is that Eshell isn’t a true terminal emulator and doesn’t support full-screen terminal programs and fancy TTY stuff. That’s where Eat: Emulate A Terminal comes in. The Eat minor mode is compatible with Eshell: as soon as you execute a command-line program, it takes over.
There are four input modes available for sending text to the terminal in case your Emacs shortcuts clash with those of the program. It solves all my problems: long-running processes like work; interactive programs like gdu and work, … Yet the default Eshell mode is a bit bare-bones, so obviously I pimped the hell out of it. Here’s a short summary of what my Bakemacs shelling.el config alters: Here’s a short video demonstrating some of these features: The reason for ditching is simple: it’s extremely slow over Tramp. Just pressing TAB while working on a remote machine takes six seconds to load a simple directory structure of a few files. What’s up with that? I’ve been profiling my Tramp connections, and connecting to the local NAS over SSH is very slow because apparently can’t do a single and process that info into an autocomplete pop-up. Yet I wanted to keep the Corfu/Cape behaviour that I’m used to in other buffers, so I created my own completion-at-point function that dispatches smartly to other internals: I’m sure there are holes in this logic, but so far it’s been working quite well for me. Cape is very fast, as is my own shell command/variable cache. The added bonus is having access to nerd icons. I use to distinguish Elisp vars from external shell vars in case you’re completing , as there are only a handful of shell variables and a huge number of Elisp ones. I also learned the hard way that you should cache stuff listed in your modeline, as this gets continuously redrawn when scrolling through your buffer: The details can be found in —just to be on the safe side, I disabled Git/project-specific stuff in case is to avoid more Tramp snailness. The last cool addition: make use of Emacs’s new Completion Preview mode —but only for recent commands. That means I temporarily remap as soon as TAB is pressed. Otherwise, the preview might also show things that I don’t really want. The video showcases this as well. Happy (e)shelling!
Related topics: / emacs / By Wouter Groeneveld on 8 March 2026. Reply via email.

- Customize at startup.
- Integrate : replaces the default “i-search backward”. This is a gigantic improvement, as Consult lets me quickly and visually fine-tune my search through all previous commands. These are also saved on exit (increase while you’re at it).
- Improve to immediately kill a process or deactivate the mark.
- The big one: replace with a custom completion-at-point system (see below).
- When typing a path like , backspace kills the entire last directory instead of just a single character. This works just like now and speeds up my path commands by a lot.
- Bind a shortcut to a convenient function that sends input to Eshell & executes it.
- Change the prompt into a simple to more easily copy-paste things in and out of that buffer. This integrates with , meaning I can very easily jump back to a previous command and its output!
- Move most of the prompt info, such as the working directory and optional Git information, to the modeline.
- Make sort by directories first to align it with my Dired change: doesn’t work as is an Elisp function.
- Bind a shortcut to a convenient pop-to-eshell-buffer & new-eshell-tab function that takes the current perspective into account.
- Make font-lock so it outputs with syntax highlighting.
- Create a command: does a into the directory of that buffer’s contents.
- Create a command: stay on the current Tramp host but go to an absolute path. Using will always navigate to your local HDD root, so is the same as if you’re used to instead of Emacs’s Tramp.
- Give Eshell dedicated space on the top as a side window to quickly call and dismiss with .
- Customise more shortcuts to help with navigation. UP and DOWN (or / ) just move the point, even at the last line, which never works in a conventional terminal. and cycle through command history.
- Customise more aliases, of which the handy ones are: &

If the point is at the command and…

- it’s a path: direct to .
- it’s a local dir cmd: wrap to filter on dirs only. Cape is dumb and by default also returns files.
- it’s an Elisp func starting with : complete that with .
- else it’s a shell command. These are now cached by expanding all folders from with a fast Perl command.

If the point is at the argument and…

- it’s a variable starting with : create a super CAPF to list both Elisp and vars (also cached)!
- it’s a buffer or process starting with : fine, here , can you handle this? Are you sure?
- it’s a remote dir cmd (e.g. ): .
- it’s (still) a local dir cmd: see above.
- In all other cases, it’s probably a file argument: fall back to just .

Armin Ronacher 1 month ago

AI And The Ship of Theseus

Because code gets cheaper and cheaper to write, this includes re-implementations. I mentioned recently that I had an AI port one of my libraries to another language and it ended up choosing a different design for that implementation. In many ways, the functionality was the same, but the path it took to get there was different. The way that port worked was by going via the test suite. Something related, but different, happened with chardet . The current maintainer reimplemented it from scratch by only pointing it at the API and the test suite. The motivation: enabling relicensing from LGPL to MIT. I personally have a horse in the race here because I too wanted chardet to be under a non-GPL license for many years. So consider me a very biased person in that regard. Unsurprisingly, that new implementation caused a stir. In particular, Mark Pilgrim, the original author of the library, objects to the new implementation and considers it a derived work. The new maintainer, who has maintained it for the last 12 years, considers it a new work, and instructed his coding agent to produce precisely that. According to the author, validated with JPlag, the new implementation is distinct. If you actually consider how it works, that’s not too surprising. It’s significantly faster than the original implementation, supports multiple cores, and uses a fundamentally different design. What I think is more interesting about this question is the consequences of where we are. Copyleft code like the GPL heavily depends on copyrights and friction to enforce it. But because it’s fundamentally in the open, with or without tests, you can trivially rewrite it these days. I myself have been intending to do this for a little while now with some other GPL libraries. In particular, I started a re-implementation of readline a while ago for similar reasons, because of its GPL license. There is an obvious moral question here, but that isn’t necessarily what I’m interested in.
Just as GPL software might re-emerge as MIT software, so might proprietary abandonware. For me personally, what is more interesting is that we might not even be able to copyright these creations at all. A court still might rule that all AI-generated code is in the public domain, because there was not enough human input in it. That’s quite possible, though probably not very likely. But this all causes some interesting new developments we are not necessarily ready for. Vercel, for instance, happily re-implemented bash with Clankers but got visibly upset when someone re-implemented Next.js in the same way. There are huge consequences to this. When the cost of generating code goes down that much, and we can re-implement it from test suites alone, what does that mean for the future of software? Will we see a lot of software re-emerging under more permissive licenses? Will we see a lot of proprietary software re-emerging as open source? Will we see a lot of software re-emerging as proprietary? It’s a new world and we have very little idea of how to navigate it. In the interim we will have some fights about copyrights, but I have the feeling very few of those will go to court, because everyone involved will actually be somewhat scared of setting a precedent. In the GPL case, though, I think it warms up some old fights about copyleft vs permissive licenses that we have not seen in a long time. It probably does not feel great to have one’s work rewritten with a Clanker and one’s authorship eradicated. Unlike the Ship of Theseus , though, this seems more clear-cut: if you throw away all the code and start from scratch, even if the end result behaves the same, it’s a new ship. It only continues to carry the name. Which may be another argument for why authors should hold on to trademarks rather than rely on licenses and contract law. I personally think all of this is exciting.
I’m a strong supporter of putting things in the open with as little license enforcement as possible. I think society is better off when we share, and I consider the GPL to run against that spirit by restricting what can be done with it. This development plays into my worldview. I understand, though, that not everyone shares that view, and I expect more fights over the emergence of slopforks as a result. After all, it combines two very heated topics, licensing and AI, in the worst possible way.

Brain Baking 1 month ago

Favourites of February 2026

A sudden burst of Japanese cherry flowers sparkling in the sun brings much-needed lightheartedness into our late February lives. Before we know it, the garden will be littered with these little pink petals, and the very short blossom season will be behind us. Our cherry tree has always had the tendency of being early, eager, and then running out of steam. It’s weird to have temperatures reach almost twenty degrees Celsius while a few weeks ago it was still freezing. No wonder the tree is confused. A deep blue sky overlooking the cherry blossom in our garden. In case you were wondering: no, this weather is not normal: it’s yet another noticeable temperature spike. Our local (retired) weatherman Frank explains the spikes and provides proof of upward rather than downward temperature trends (in Dutch). At this point, I’m just grateful for the much-needed sunshine. Previous month: January 2026 . I’m giving up on Ruffy. It’s just unplayable on the Switch, which is a damn shame, as the N64 throwback collect-a-thon 3D platformer with rough edges looks like the perfect fit for the Switch—and it should be. It’s far from a demanding game, so the only conclusion I can draw is that it was poorly optimized for my platform of choice. And I bought the Limited Run Games physical version… Instead, I’ve turned to Gobliins 6 , a quirky French adventure game made by just one guy. It has equally frustrating moments and rough edges, but I can more easily forgive it for its faults: it’s Gobliins! The fact that after 34 years (!!) there’s an official sequel to Gobliins 2: The Prince Buffoon is just crazy. I have fond memories of that game, as I used to play it together with my dad on his brand new 486. I didn’t understand English, nor was I able to solve most time-based puzzles, but the Gobliins exposure got permanently burned into my brain—so much so that its pixel art became a basis for my retro blog .
Even though it’s advertised to be a Windows-only game, ScummVM has got you covered: In the Fox Bar just after Fingus reunites with Winkle. If Gob6 sells well, Pierre might go ahead and make Gob7 a direct sequel to Goblins Quest 3 . Fingus—err, fingers crossed for Blount’s return! Related topics: / metapost / By Wouter Groeneveld on 4 March 2026. Reply via email.

Let’s start with more Gobliins stuff:

- Michael Klamerus summarized the history of the games to bring you up to speed.
- Mark self-hosted a book library tool called Booklore that links to your Kobo account.
- Michał Sapka nuances the “ I hate genAI ” screams of late.
- Elmine Wijnia writes in De Stadsbron (in Dutch) about OpenStreetMap and wonders whether we can finally get rid of Google Maps.
- Space Panda continues fighting against bots on their site . It’s fun to see the bot honey pots working, but aren’t we now wasting even more resources doing nothing?
- Arjan van der Gaag shares how he uses snippets in Emacs with Yasnippet . I think I’m going to migrate to Tempel.el instead, but that’s for another story.
- There’s an interesting thread on ResetERA about old games that have yet to be replicated . Someone mentioned Magic the Gathering: Shandalar !
- Jeff Kaufman shared a photo of two chairs placed on a snowy parking space . Apparently, that’s customary to “reserve” your spot. I hadn’t seen such a ridiculously selfish act in a while. Is this a typical USA thing?
- Wolfgang Ziegler continues his Game Boy modding spree, this time with an IPS screen mod . The result looks stunning!
- Hamilton Greene shares his adventure with programming languages and talks about the “missing language”. I don’t agree with his stance, but it’s interesting nonetheless.
- Scott Nesbitt writes on an old Singer desk !
- Greg Newman organized the Emacs writing carnival challenge and shares links to others’ writing experiences with their favourite editor (25 entries). Greg also designed the Org-mode unicorn logo!
- Speaking of which: James Dyer shows his streamlined Eshell configuration that inspired me to hack together my own. To be continued in a future blog post, whether you’ll like it or not.
- Markus Dosch shares his journey from Bash to Zsh and now Fish . I’m slowly but surely getting fed up with Zsh and all those semi-required plugins, so I might switch to Fish as well. But actually… I switched to Eshell. You didn’t see that coming, did you?
- Henrique Dias redesigned his website and the result looks very good, congrats! I especially like the fact that the new theme takes advantage of wide screens (note to self).
- Michael Stapelberg tried out Wayland and concludes that it’s still not ready yet. X11 is not dead yet.
- I found the Lockfile Explorer documentation on pnpm lockfiles to be very thorough and insightful.
- Feishin is a modern rewrite of Sonixd, a Subsonic-compatible music desktop client that looks promising. I’ve been a Navidrome user for five years now but am looking for a good client that supports offline playback. It doesn’t (yet) . Related: the Symfonium Android app, which does do caching. I’m using Substreamer for that and it works well enough.
- scrcpy is a tiny Android-based screen sharing tool that I use in classes to project my Android screen. Handy!
- Another tool for presenting: keycastr helped me teach students how to use shortcuts.
- I might have already shared this, but you should replace pip with uv : it’s 10x+ faster and can also manage your project’s . Oh, and in case you haven’t already, replace npm with bun .
- Discord’s age verification facial recognition tool got bypassed pretty fast —rightfully so.

Rik Huijzer 1 month ago

More Accurate Speech Recognition with whisper.cpp

I have been using OpenAI's whisper for a while to convert audio files to text. For example, to generate subtitles for a file, I used

```bash
whisper "$INPUT_FILE" -f srt --model turbo --language en
```

Especially on long files, this would sometimes change its behavior over time, leading to either extremely long or extremely short sentences (runaway output). Also, `whisper` took a long time to run. Luckily, there is whisper-cpp. On my system with an M2 Pro chip, it can now run speech recognition on a 40-minute audio file in a few minutes instead of half an hour. Also, thanks to a tip from whisp...
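For reference, a hedged sketch of what the whisper.cpp equivalent might look like. The binary name (`whisper-cli`), model path, and flags are assumptions that vary by version and install; the `DRY_RUN` guard (on by default) just prints the commands instead of executing them:

```shell
# Sketch only: whisper.cpp equivalent of the whisper command above.
# Binary name, model path, and flags are placeholders; adjust for your install.
DRY_RUN="${DRY_RUN:-1}"
INPUT_FILE="talk.m4a"
MODEL="models/ggml-large-v3-turbo.bin"
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

# whisper.cpp expects 16 kHz mono WAV input, so convert first:
run ffmpeg -i "$INPUT_FILE" -ar 16000 -ac 1 talk.wav
# then transcribe to SRT subtitles:
run whisper-cli -m "$MODEL" -f talk.wav -osrt -l en
```

Set `DRY_RUN=0` once the paths match your setup to actually run the conversion and transcription.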

matklad 2 months ago

CI In a Box

I wrote , a thin wrapper around ssh for running commands on remote machines. I want a box-shaped interface for CI: That is, the controlling CI machine runs a user-supplied script, whose status code will be the ultimate result of a CI run. The script doesn’t run the project’s tests directly. Instead, it shells out to a proxy binary that forwards the command to a runner box with whichever OS, CPU, and other environment is required. The hard problems are in the part: CI discourse amuses me — everyone complains about bad YAML, and it is bad, but most of the YAML (and the associated reproducibility and debugging problems) is avoidable. Pick an appropriate position on a dial that includes:

- writing a bash script,
- writing a script in the language you already use ,
- using a small build system ,
- using a medium-sized one like or , or
- using a large one like or .

What you can’t just do by writing a smidgen of text is getting a heterogeneous fleet of runners. And you need a heterogeneous fleet of runners if some of the software you are building is cross-platform. If you go that way, be mindful that:

- One of them is not UNIX.
- One of them has licensing and hardware constraints that make per-minute billed VMs tricky (but not impossible, as GitHub Actions does that).
- All of them are moving targets, and require someone to do the OS upgrade work, which might involve pointing and clicking .

The SSH wire protocol only takes a single string as the command, with the expectation that it should be passed to a shell by the remote end. In other words, while SSH supports syntax like , it just blindly intersperses all arguments with a space. Amusing to think that our entire cloud infrastructure is built on top of shell injection! This, and the need to ensure no processes are left behind unintentionally after executing a remote command, means that you can’t “just” use SSH here if you are building something solid.
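The argument-interspersing problem is easy to demonstrate locally, without any remote machine. A sketch in bash, emulating what the SSH client does with an argument vector, and one common mitigation using bash's `printf '%q'`:

```shell
# An argument vector containing a space-laden argument:
args=(grep -r "hello world" .)

# What SSH effectively sends: the arguments blindly joined with spaces.
naive="${args[*]}"
echo "$naive"            # → grep -r hello world .   (word boundaries lost)

# Mitigation: shell-quote each argument before joining.
safe=$(printf '%q ' "${args[@]}")
echo "$safe"
```

Even with `%q` quoting, the remote end must interpret the string with a bash-compatible shell for the quoting to round-trip safely, which is part of why "just use SSH" falls short for something solid.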


Date Arithmetic in Bash

Date and time management libraries in many programming languages are famously bad. Python's datetime module comes to mind as one of the best (worst?) examples, and so does JavaScript's Date class . It feels like these libraries could not have been made worse on purpose, or so I thought until today, when I needed to implement some date calculations in a backup rotation script written in bash. So, if you wanted to learn how to perform date and time arithmetic in your bash scripts, you've come to the right place. Just don't blame me for the nightmares.
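As a taste of the building blocks involved, here is a sketch assuming GNU `date` (the BSD/macOS `date` uses `-v` offsets instead of `-d`, which is part of the nightmare):

```shell
# Relative dates: GNU date parses English offsets directly.
cutoff=$(date -d "14 days ago" +%Y-%m-%d)   # e.g. a backup-rotation cutoff
echo "$cutoff"

# Date differences: convert to epoch seconds and do integer arithmetic.
start=$(date -d "2026-01-01" +%s)
end=$(date -d "2026-01-31" +%s)
echo $(( (end - start) / 86400 ))           # → 30 (days)
```

A rotation script then just compares each backup's date against the cutoff and prunes the older ones.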

Armin Ronacher 2 months ago

Pi: The Minimal Agent Within OpenClaw

If you haven’t been living under a rock, you will have noticed this week that a project of my friend Peter went viral on the internet . It went by many names. The most recent one is OpenClaw, but in the news you might have encountered it as ClawdBot or MoltBot depending on when you read about it. It is an agent connected to a communication channel of your choice that just runs code . What you might be less familiar with is that what’s under the hood of OpenClaw is a little coding agent called Pi . And Pi happens to be, at this point, the coding agent that I use almost exclusively. Over the last few weeks I became more and more of a shill for the little agent. After I gave a talk on this recently, I realized that I had not actually written about Pi on this blog yet, so I feel like I might want to give some context on why I’m obsessed with it, and how it relates to OpenClaw. Pi is written by Mario Zechner and unlike Peter, who aims for “sci-fi with a touch of madness,” 1 Mario is very grounded. Despite the differences in approach, both OpenClaw and Pi follow the same idea: LLMs are really good at writing and running code, so embrace this. In some ways I think that’s not an accident, because Peter got me and Mario hooked on this idea, and on agents, last year. So Pi is a coding agent. And there are many coding agents. Really, I think you can pick effectively any one of them off the shelf at this point and you will be able to experience what it’s like to do agentic programming. In reviews on this blog I’ve positively talked about AMP, and one of the reasons I resonated so much with AMP is that it really felt like a product built by people who had both become addicted to agentic programming and tried a few different things to see which ones work, not just built a fancy UI around it. Pi is interesting to me for two main reasons: And a little bonus: Pi itself is written like excellent software.
It doesn’t flicker, it doesn’t consume a lot of memory, it doesn’t randomly break, it is very reliable, and it is written by someone who takes great care of what goes into the software. Pi also is a collection of little components that you can build your own agent on top of. That’s how OpenClaw is built, and that’s also how I built my own little Telegram bot and how Mario built his mom . If you want to build your own agent, connected to something, Pi, when pointed to itself and mom, will conjure one up for you. And in order to understand what’s in Pi, it’s even more important to understand what’s not in Pi, why it’s not in Pi, and more importantly: why it won’t be in Pi. The most obvious omission is support for MCP. There is no MCP support in it. While you could build an extension for it, you can also do what OpenClaw does to support MCP, which is to use mcporter . mcporter exposes MCP calls via a CLI interface or TypeScript bindings, and maybe your agent can do something with it. Or not, I don’t know :) And this is not a lazy omission. It comes from the philosophy of how Pi works. Pi’s entire idea is that if you want the agent to do something that it doesn’t do yet, you don’t go and download an extension or a skill or something like this. You ask the agent to extend itself. It celebrates the idea of writing and running code. That’s not to say that you cannot download extensions. It is very much supported. But instead of necessarily encouraging you to download someone else’s extension, you can also point your agent to an already existing extension and say: build it like the thing you see over there, but make these changes to it that you like. When you look at what Pi and, by extension, OpenClaw are doing, you see an example of software that is malleable like clay. And this sets certain requirements on the underlying architecture: constraints that really need to go into the core design.
So for instance, Pi’s underlying AI SDK is written so that a session can contain messages from many different model providers. It recognizes that the portability of sessions between model providers is somewhat limited, and so it doesn’t lean too much into any model-provider-specific feature set that cannot be transferred to another. The second is that, in addition to the model messages, it maintains custom messages in the session files, which extensions can use to store state, and which the system itself uses to maintain information that is either not sent to the AI at all or only in part. Because this system exists and extension state can also be persisted to disk, it has built-in hot reloading, so the agent can write code, reload, test it, and loop until your extension is actually functional. It also ships with documentation and examples that the agent itself can use to extend itself. Even better: sessions in Pi are trees. You can branch and navigate within a session, which opens up all kinds of interesting opportunities, such as workflows where you take a side-quest to fix a broken agent tool without wasting context in the main session. After the tool is fixed, I can rewind the session back to an earlier point, and Pi summarizes what happened on the other branch. This all matters because, for instance, if you consider how MCP works: on most model providers, tools for MCP, like any tool for the LLM, need to be loaded into the system context (or the tool section thereof) on session start. That makes it very hard, if not impossible, to fully reload what tools can do without trashing the complete cache or confusing the AI about why prior invocations worked differently. An extension in Pi can register a tool to be available for the LLM to call, and every once in a while I find this useful. For instance, despite my criticism of how Beads is implemented, I do think that giving an agent access to a to-do list is a very useful thing.
And I do use an agent-specific issue tracker that works locally, which I had my agent build itself. And because I wanted the agent to also manage to-dos, in this particular case I decided to give it a tool rather than a CLI. It felt appropriate for the scope of the problem, and it is currently the only additional tool that I’m loading into my context. But for the most part, everything I add to my agent is either a skill or a TUI extension that makes working with the agent more enjoyable for me. Beyond slash commands, Pi extensions can render custom TUI components directly in the terminal: spinners, progress bars, interactive file pickers, data tables, preview panes. The TUI is flexible enough that Mario proved you can run Doom in it. Not practical, but if you can run Doom, you can certainly build a useful dashboard or debugging interface. I want to highlight some of my extensions to give you an idea of what’s possible. While you can use them unmodified, the whole idea really is that you point your agent at one and remix it to your heart’s content. I don’t use plan mode. I encourage the agent to ask questions, and there’s a productive back and forth. But I don’t like the structured question dialogs you get if you give the agent a question tool. I prefer the agent’s natural prose, with explanations and diagrams interspersed. The problem: answering questions inline gets messy. So one extension reads the agent’s last response, extracts all the questions, and reformats them into a nice input box. Even though I criticize Beads for its implementation, giving an agent a to-do list is genuinely useful. The command brings up all to-do items, stored as markdown files. Both the agent and I can manipulate them, and sessions can claim tasks to mark them as in progress. As more code is written by agents, it makes little sense to throw unfinished work at humans before an agent has reviewed it.
Because Pi sessions are trees, I can branch into a fresh review context, get findings, then bring fixes back to the main session. The UI is modeled after Codex and makes it easy to review commits, diffs, uncommitted changes, or remote PRs. The prompt pays attention to things I care about, so I get the call-outs I want (e.g. I ask it to call out newly added dependencies). Another extension, one I experiment with but don’t actively use, lets one Pi agent send prompts to another. It is a simple multi-agent system without complex orchestration, which is useful for experimentation. Yet another lists all files changed or referenced in the session. You can reveal them in Finder, diff them in VS Code, quick-look them, or reference them in your prompt. A shortcut quick-looks the most recently mentioned file, which is handy when the agent produces a PDF. Others have built extensions too: Nico’s subagent extension, and interactive-shell, which lets Pi autonomously run interactive CLIs in an observable TUI overlay. These are all just ideas of what you can do with your agent. The point, mostly, is that none of this was written by me; it was created by the agent to my specifications. I told Pi to make an extension and it did. There is no MCP, there are no community skills, nothing. Don’t get me wrong, I use tons of skills. But they are hand-crafted by my clanker, not downloaded from anywhere. For instance, I fully replaced all my CLIs and MCPs for browser automation with a skill that just uses CDP. Not because the alternatives don’t work, or are bad, but because this is just easy and natural. The agent maintains its own functionality. My agent has quite a few skills, and crucially, I throw skills away when I don’t need them anymore. I gave it, for instance, a skill to read Pi sessions that other engineers shared, which helps with code review. Or a skill that tells the agent how to craft the commit messages and commit behavior I want, and how to update changelogs.
These were originally slash commands, but I’m currently migrating them to skills to see if that works equally well. I also have a skill that hopefully steers Pi toward the shell tools I prefer, plus a custom extension that intercepts calls to the ones I don’t want and redirects them. Part of the fascination of working with a minimal agent like Pi is that it makes you live the idea of using software that builds more software. Taken to the extreme, you remove the UI and output and connect it to your chat. That’s what OpenClaw does, and given its tremendous growth, I really feel more and more that this is going to become our future in one way or another. As for the two main reasons I promised: first, Pi has a tiny core. It has the shortest system prompt of any agent that I’m aware of and only four tools: Read, Write, Edit, Bash. Second, it makes up for its tiny core by providing an extension system that also allows extensions to persist state into sessions, which is incredibly powerful.

1. https://x.com/steipete/status/2017313990548865292 ↩

Simon Willison 2 months ago

ChatGPT Containers can now run bash, pip/npm install packages, and download files

One of my favourite features of ChatGPT is its ability to write and execute code in a container. This feature launched as ChatGPT Code Interpreter nearly three years ago, was half-heartedly rebranded to "Advanced Data Analysis" at some point, and is generally really difficult to find detailed documentation about. Case in point: it appears to have had a massive upgrade at some point in the past few months, and I can't find documentation about the new capabilities anywhere! The most notable new features amount to a substantial upgrade: ChatGPT can now write and then test code in 10 new languages (11 if you count Bash), can find files online and download them into the container, and can install additional packages via pip and npm to help it solve problems. (OpenAI really need to develop better habits at keeping their release notes up-to-date!) I was initially suspicious that maybe I'd stumbled into a new preview feature that wasn't available to everyone, but I tried some experiments in a free ChatGPT account and confirmed that the new features are available there as well. My first clue to the new features came the other day when I got curious about Los Angeles air quality, in particular whether the growing number of electric vehicles there has had a measurable impact. I prompted a fresh GPT-5.2 Thinking session with: Show me Los Angeles air quality over time for last 20 years Here's the shared transcript. Annoyingly that doesn't make the thinking trace available, so here's a screenshot. This bit instantly caught my eye: [...] The URL from the error message leads to an Excel file containing air quality trends. I can use the container.download tool to fetch it. [...] What's container.download? I asked ChatGPT: Without running any searches tell me about your container.download tool And got back: Sure. container.download is a built-in tool I can call that fetches a file from a URL and saves it into the sandboxed container filesystem for this chat session.
It went on to describe, conceptually, how the tool is called. Further experimentation confirmed that it behaves as described. I had it download one file from a server with logs enabled and captured the IP address and user-agent from the request headers. That IP address resolves to Microsoft Azure Cloud (centralus) in Des Moines, Iowa. On the one hand, this is really useful! ChatGPT can navigate around websites looking for useful files, download those files to a container, and then process them using Python or other languages. Is this a data exfiltration vulnerability though? Could a prompt injection attack trick ChatGPT into leaking private data via a call to a URL with a query string that includes sensitive information? I don't think it can. I tried getting it to assemble a URL with a query string and access it using the download tool, and it couldn't do it. It told me that it got back this error: ERROR: download failed because url not viewed in conversation before. open the file or url using web.run first. This looks to me like the same safety trick used by Claude's Web Fetch tool: only allow URL access if that URL was either directly entered by the user or if it came from search results that could not have been influenced by a prompt injection. (I poked at this a bit more and managed to get a simple constructed query string to pass through a different tool entirely, but when I tried to compose a longer query string containing the previous prompt history a filter blocked it.) So I think this is all safe, though I'm curious whether it could hold firm against a more aggressive round of attacks from a seasoned security researcher. The key lesson from coding agents like Claude Code and Codex CLI is that Bash rules everything: if an agent can run Bash commands in an environment, it can do almost anything that can be achieved by typing commands into a computer. When Anthropic added their own code interpreter feature to Claude last September, they built it around Bash rather than just Python.
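The "viewed before" rule can be sketched in a few lines. This is an inference from the observed error message, not OpenAI's actual implementation; the function name and plumbing here are invented for illustration:

```python
# Sketch of the "url not viewed in conversation before" guard, inferred from
# the error message above; names and structure are invented for illustration.
class DownloadError(Exception):
    pass

def container_download(url: str, dest: str, viewed_urls: set) -> str:
    """Only fetch URLs the user typed or that came back from web search."""
    if url not in viewed_urls:
        raise DownloadError(
            "download failed because url not viewed in conversation before. "
            "open the file or url using web.run first."
        )
    return dest  # the real tool would write the downloaded bytes to this path

# A URL surfaced earlier in the conversation is allowed:
viewed = {"https://example.org/aqi-trends.xlsx"}
container_download("https://example.org/aqi-trends.xlsx", "/mnt/data/aqi.xlsx", viewed)

# A prompt-injected URL carrying sensitive data in its query string was never
# "viewed", so the guard rejects it:
try:
    container_download("https://evil.example/?secret=abc", "/mnt/data/x", viewed)
except DownloadError:
    pass
```

The interesting property is that the allowlist is populated only from user input and search results, which a prompt injection can't forge on its own.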
It looks to me like OpenAI have now done the same thing for ChatGPT. Here's what ChatGPT looks like when it runs a Bash command; here my prompt was: npm install a fun package and demonstrate using it It's useful to click on the "Thinking" or "Thought for 32s" links, as that opens the Activity sidebar with a detailed trace of what ChatGPT did to arrive at its answer. This helps guard against cheating: ChatGPT might claim to have run Bash in the main window, but it can't fake those black and white logs in the Activity panel. I had it run Hello World in various languages later in that same session. In the previous example ChatGPT installed the package from npm and used it to draw an ASCII-art cow. But how could it do that if the container can't make outbound network requests? In another session I challenged it to explore its environment and figure out how that worked. Here's the resulting Markdown report it created. The key magic appears to be a proxy, available within the container, with various packaging tools configured to use it. Certain environment variables cause pip to install packages from that proxy instead of directly from PyPI, and another appears to get npm working the same way. It also reported some suspicious-looking variables: neither Rust nor Docker is installed in the container environment, but maybe those registry references are a clue of features still to come. The result of all of this? You can tell ChatGPT to use Python or Node.js packages as part of a conversation and it will be able to install them and apply them against files you upload or that it downloads from the public web. That's really cool. The big missing feature here should be the easiest to provide: we need official documentation! A release notes entry would be a good start, but there are a lot of subtle details to how this new stuff works, its limitations, and what it can be used for.
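The environment-variable trick described above is easy to reproduce outside ChatGPT: pip reads the standard PIP_INDEX_URL variable and npm reads npm_config_registry, so building a child-process environment like the one below redirects installs through a proxy. The proxy URL here is hypothetical; the container's actual values are whatever OpenAI exports:

```python
import os

def proxied_env(proxy_base: str) -> dict:
    """Build a subprocess environment that routes pip and npm installs
    through a package proxy rather than straight to PyPI / npmjs.

    PIP_INDEX_URL and npm_config_registry are real, documented overrides;
    the proxy endpoint passed in is an invented example.
    """
    env = dict(os.environ)
    env["PIP_INDEX_URL"] = f"{proxy_base}/pypi/simple/"
    env["npm_config_registry"] = f"{proxy_base}/npm/"
    return env

# Usage sketch: subprocess.run(["pip", "install", "cowsay"],
#                              env=proxied_env("http://proxy.internal:8080"))
```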
As always, I'd also encourage OpenAI to come up with a name for this set of features that properly represents how it works and what it can do. In the meantime, I'm going to call this ChatGPT Containers. To recap the new features:

- ChatGPT can directly run Bash commands now. Previously it was limited to Python code only, although it could run shell commands via a Python module.
- It has Node.js and can run JavaScript directly in addition to Python. I also got it to run "hello world" in Ruby, Perl, PHP, Go, Java, Swift, Kotlin, C and C++. No Rust yet though!
- While the container still can't make outbound network requests, pip and npm both work now via a custom proxy mechanism.
- ChatGPT can locate the URL for a file on the web and use the container.download tool to download that file and save it to a path within the sandboxed container. The tool takes a publicly reachable URL and a destination filepath in the container, downloads the bytes from that URL, and writes them to the given path; after that, ChatGPT can read and process the file locally in the container (unzip it, parse it with Python, open it as an image, convert it, etc.).


Lessons from Building AI Agents for Financial Services

I’ve spent the last two years building AI agents for financial services. Along the way, I’ve accumulated a fair number of battle scars and learnings that I want to share. Here’s what I’ll cover:

- The Sandbox Is Not Optional: why isolated execution environments are essential for multi-step agent workflows
- Context Is the Product: how we normalize heterogeneous financial data into clean, searchable context
- The Parsing Problem: the hidden complexity of extracting structured data from adversarial SEC filings
- Skills Are Everything: why markdown-based skills are becoming the product, not the model
- The Model Will Eat Your Scaffolding: designing for obsolescence as models improve
- The S3-First Architecture: why S3 beats databases for file storage and user data
- The File System Tools: how ReadFile, WriteFile, and Bash enable complex financial workflows
- Temporal Changed Everything: reliable long-running tasks with proper cancellation handling
- Real-Time Streaming: building responsive UX with delta updates and interactive agent workflows
- Evaluation Is Not Optional: domain-specific evals that catch errors before they cost money
- Production Monitoring: the observability stack that keeps financial agents reliable

Why financial services is extremely hard: this domain doesn’t forgive mistakes. Numbers matter. A wrong revenue figure, a misinterpreted guidance statement, an incorrect DCF assumption. Professional investors make million-dollar decisions based on our output. One mistake on a $100M position and you’ve destroyed trust forever. The users are also demanding. Professional investors are some of the smartest, most time-pressed people you’ll ever work with. They spot bullshit instantly. They need precision, speed, and depth. You can’t hand-wave your way through a valuation model or gloss over nuances in an earnings call. This has forced me to develop an almost paranoid attention to detail. Every number gets double-checked.
Every assumption gets validated. Every model gets stress-tested. You start questioning everything the LLM outputs because you know your users will. A single wrong calculation in a DCF model and you lose credibility forever. I sometimes feel that the fear of being wrong has become our best feature. Over the years building with LLMs, we’ve made bold infrastructure bets early, and I think we have been right. For instance, when Claude Code launched with its filesystem-first agentic approach, we immediately adopted it. It was not an obvious bet, and it meant a massive revamp of our architecture. I was extremely lucky to have Thariq from Anthropic’s Claude Code team jump on a Zoom and open my eyes to the possibilities. At the time the whole industry, Fintool included, was building elaborate RAG pipelines with vector databases and embeddings. After reflecting on the future of information retrieval with agents, I wrote “the RAG obituary” and Fintool moved fully to agentic search. We even decided to retire our precious embedding pipeline. Sad, but whatever is best for the future! People thought we were crazy. The article got a lot of praise and a lot of negative comments. Now I feel most startups are adopting these best practices. I believe we’re early on several other architectural choices too. I’m sharing them here because the best way to test ideas is to put them out there. Let’s start with the biggest one. When we first started building Fintool in 2023, I thought sandboxing might be overkill. “We’re just running Python scripts,” I told myself. “What could go wrong?” Haha. Everything. Everything could go wrong. The first time an LLM decided to `rm -rf /` on our server (it was trying to “clean up temporary files”), I became a true believer. Here’s the thing: agents need to run multi-step operations. A professional investor asks for a DCF valuation, and that’s not a single API call.
The agent needs to research the company, gather financial data, build a model in Excel, run sensitivity analysis, generate complex charts, iterate on assumptions. That’s dozens of steps, each potentially modifying files, installing packages, running scripts. You can’t do this without code execution. And executing arbitrary code on your servers is insane. Every chat application needs a sandbox. Today each user gets their own isolated environment. The agent can do whatever it wants in there. Delete everything? Fine. Install weird packages? Go ahead. It’s your sandbox, knock yourself out. The architecture has three mount points. Private is read/write for your stuff. Shared is read-only for your organization. Public is read-only for everyone. The magic is in the credentials. We use AWS ABAC (attribute-based access control) to generate short-lived credentials scoped to specific S3 prefixes. User A literally cannot access User B’s data. The IAM policy uses `${aws:PrincipalTag/S3Prefix}` to restrict access; the credentials physically won’t allow it. This is also very good for enterprise deployments. We also do sandbox pre-warming. When a user starts typing, we spin up their sandbox in the background. By the time they hit enter, the sandbox is ready: a 600-second timeout, extended by 10 minutes on each tool use, and the sandbox stays warm across conversation turns. Sandboxes are amazing, but their under-discussed magic is filesystem support. Which brings us to the next lesson, about context. Your agent is only as good as the context it can access. The real work isn’t prompt engineering; it’s turning messy financial data from dozens of sources into clean, structured context the model can actually use. This requires massive domain expertise from the engineering team. The heterogeneity problem.
Financial data comes in every format imaginable:

- SEC filings: HTML with nested tables, exhibits, signatures
- Earnings transcripts: speaker-segmented text with Q&A sections
- Press releases: semi-structured HTML from PRNewswire
- Research reports: PDFs with charts and footnotes
- Market data: Snowflake/databases with structured numerical data
- News: articles with varying quality and structure
- Alternative data: satellite imagery, web traffic, credit card panels
- Broker research: proprietary PDFs with price targets and models
- Fund filings: 13F holdings, proxy statements, activist letters

Each source has different schemas, different update frequencies, different quality levels. The agent needs one thing: clean context it can reason over. The normalization layer: everything becomes one of three formats:

- Markdown for narrative content (filings, transcripts, articles)
- CSV/tables for structured data (financials, metrics, comparisons)
- JSON metadata for searchability (tickers, dates, document types, fiscal periods)

Chunking strategy matters. Not all documents chunk the same way:

- 10-K filings: section by regulatory structure (Item 1, 1A, 7, 8...)
- Earnings transcripts: chunk by speaker turn (CEO remarks, CFO remarks, Q&A by analyst)
- Press releases: usually small enough to be one chunk
- News articles: paragraph-level chunks
- 13F filings: by holder and position changes quarter-over-quarter

The chunking strategy determines what context the agent retrieves. Bad chunks = bad answers. Tables are special. Financial data is full of tables and CSVs. Revenue breakdowns, segment performance, guidance ranges. LLMs are surprisingly good at reasoning over markdown tables, but they’re terrible at reasoning over HTML `<table>` tags or raw CSV dumps. The normalization layer converts everything to clean markdown tables. Metadata enables retrieval. The user asks the agent: “What did Apple say about services revenue in their last earnings call?
” To answer this, Fintool needs:

- Ticker resolution (AAPL → correct company)
- Document type filtering (earnings transcript, not 10-K)
- Temporal filtering (most recent, not 2019)
- Section targeting (CFO remarks or revenue discussion, not legal disclaimers)

This is why `meta.json` exists for every document. Without structured metadata, you’re doing keyword search over a haystack. It speeds up the search, big time! Anyone can call an LLM API. Not everyone has normalized decades of financial data into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work.

The Parsing Problem

Normalizing financial data is 80% of the work. Here’s what nobody tells you. SEC filings are adversarial. They’re not designed for machine reading. They’re designed for legal compliance:

- Tables span multiple pages with repeated headers
- Footnotes reference exhibits that reference other footnotes
- Numbers appear in text, tables, and exhibits, sometimes inconsistently
- XBRL tags exist but are often wrong or incomplete
- Formatting varies wildly between filers (every law firm has its own template)

We tried off-the-shelf PDF/HTML parsers. They failed on:

- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings and attachments)
- Unicode issues (curly quotes, em-dashes, non-breaking spaces)

The Fintool parsing pipeline:

1. Raw filing (HTML/PDF)
2. Document structure detection (headers, sections, exhibits)
3. Table extraction with cell relationship preservation
4. Entity extraction (companies, people, dates, dollar amounts)
5. Cross-reference resolution (Ex. 10.1 → actual exhibit content)
6. Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple)
7. Quality scoring (confidence per extracted field)

Table extraction deserves its own discussion. Financial tables are dense with meaning.
A revenue breakdown table might have:

- Merged header cells spanning multiple columns
- Footnote markers (1), (2), (a), (b) that reference explanations below
- Parentheses for negative numbers: $(1,234) means -1234
- Mixed units in the same table (millions for revenue, percentages for margins)
- Prior period restatements in italics or with asterisks

We score every extracted table on:

- Cell boundary accuracy (did we split/merge correctly?)
- Header detection (is row 1 actually headers, or is there a title row above?)
- Numeric parsing (is “$1,234” parsed as 1234 or left as text?)
- Unit inference (millions? billions? per share? percentage?)

Tables below 90% confidence get flagged for review. Low-confidence extractions don’t enter the agent’s context: garbage in, garbage out. Fiscal period normalization is critical. “Q1 2024” is ambiguous:

- Calendar Q1 (January-March 2024)
- Apple’s fiscal Q1 (October-December 2023)
- Microsoft’s fiscal Q1 (July-September 2023)
- “Reported in Q1” (filed in Q1, but covering the prior period)

We maintain a fiscal calendar database for 10,000+ companies. Every date reference gets normalized to absolute date ranges. When the agent retrieves “Apple Q1 2024 revenue,” it knows to look for data from October-December 2023. This is invisible to users but essential for correctness. Without it, you’re comparing Apple’s October revenue to Microsoft’s January revenue and calling it the “same quarter.” Here’s the thing nobody tells you about building AI agents: the model is not the product. The skills are now the product. I learned this the hard way. We used to try making the base model “smarter” through prompt engineering. Tweak the system prompt, add examples, write elaborate instructions. It helped a little. But skills were the missing part. In October 2025, Anthropic formalized this with Agent Skills, a specification for extending Claude with modular capability packages.
A skill is a folder containing a `SKILL.md` file with YAML frontmatter (name and description), plus any supporting scripts, references, or data files the agent might need. We’d been building something similar for months before the announcement. The validation felt good, but more importantly, having an industry standard means our skills can eventually be portable. Without skills, models are surprisingly bad at domain tasks. Ask a frontier model to do a DCF valuation. It knows what DCF is. It can explain the theory. But actually executing one? It will miss critical steps, use the wrong discount rates for the industry, forget to add back stock-based compensation, skip sensitivity analysis. The output looks plausible but is subtly wrong in ways that matter. The breakthrough came when we started treating skills as first-class citizens, part of the product itself. A skill is a markdown file that tells the agent how to do something specific. Our DCF skill is, at its core, exactly that: a markdown file. No code changes. No production deployment. Just a file that tells the agent what to do. Skills are better than code, and this matters enormously:

1. Non-engineers can create skills. Our analysts write skills. Our customers write skills. A portfolio manager who’s done 500 DCF valuations can encode their methodology in a skill without writing a single line of Python.
2. No deployment needed. Change a skill file and it takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own.
3. Readable and auditable. When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module.

We have a copy-on-write shadowing system with priority: private > shared > public. So if you don’t like how we do DCF valuations, write your own. Drop it in `/private/skills/dcf/SKILL.md`. Your version wins.
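For a concrete picture, a minimal skill in the shape described (YAML frontmatter with name and description, followed by plain-markdown instructions) might look like the following. The contents are invented for illustration and are not Fintool’s actual DCF skill:

```markdown
---
name: dcf-valuation
description: Build a discounted cash flow valuation for a public company
---

# DCF Valuation

1. Pull the last five years of revenue, EBITDA, and capex from the filings.
2. Project free cash flow for five years and state every growth assumption.
3. Add back stock-based compensation before discounting.
4. Use an industry-appropriate discount rate and justify the choice.
5. Run a sensitivity table over the discount rate and terminal growth.
6. Flag any number that could not be verified against a source filing.
```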
Why we don’t mount all skills to the filesystem: this is important. The naive approach would be to mount every skill file directly into the sandbox. The agent can just `cat` any skill it needs. Simple, right? Wrong. Here’s why we use SQL discovery instead:

1. Lazy loading. We have dozens of skills with extensive documentation; the DCF skill alone has 10+ industry guideline files. Loading all of them into context for every conversation would burn tokens and confuse the model. Instead, we discover skill metadata (name, description) upfront, and only load the full documentation when the agent actually uses that skill.
2. Access control at query time. The SQL query implements our three-tier access model: public skills available to everyone, organization skills for that org’s users, private skills for individual users. The database enforces this. You can’t accidentally expose a customer’s proprietary skill to another customer.
3. Shadowing logic. When a user customizes a skill, their version needs to override the default. SQL makes this trivial: query all three levels, apply priority rules, return the winner. Doing this with filesystem mounts would be a nightmare of symlinks and directory ordering.
4. Metadata-driven filtering. The `fs_files.metadata` column stores parsed YAML frontmatter. We can filter by skill type, check whether a skill is main-agent-only, or query any other structured attribute, all without reading the files themselves.

The pattern: S3 is the source of truth, a Lambda function syncs changes to PostgreSQL for fast queries, and the agent gets exactly what it needs when it needs it. Skills are essential. I cannot emphasize this enough. If you’re building an AI agent and you don’t have a skills system, you’re going to have a bad time. My biggest argument for skills is that top models (Claude or GPT) are post-trained on using skills. The model wants to fetch skills. Models just want to learn, and what they want to learn is our skills... until they eat them.
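The discovery query with shadowing can be sketched in a few lines. Fintool’s real PostgreSQL schema isn’t public, so the table and column names below are invented, and SQLite stands in for PostgreSQL:

```python
import sqlite3

# Invented stand-in for the fs_files table described above. Demonstrates
# private > shared > public shadowing resolved at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fs_files (path TEXT, tier TEXT, owner TEXT, name TEXT)")
conn.executemany("INSERT INTO fs_files VALUES (?, ?, ?, ?)", [
    ("/public/skills/dcf/SKILL.md", "public", None, "dcf"),
    ("/private/skills/dcf/SKILL.md", "private", "user_a", "dcf"),
])

def resolve_skill(name: str, user: str, org: str) -> str:
    """Return the winning skill path for this user: private > shared > public."""
    row = conn.execute("""
        SELECT path FROM fs_files
        WHERE name = ?
          AND (tier = 'public'
               OR (tier = 'shared' AND owner = ?)
               OR (tier = 'private' AND owner = ?))
        ORDER BY CASE tier WHEN 'private' THEN 0 WHEN 'shared' THEN 1 ELSE 2 END
        LIMIT 1
    """, (name, org, user)).fetchone()
    return row[0]

# user_a's private copy shadows the public default; other users fall through.
```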
Here’s the uncomfortable truth: everything I just told you about skills? It’s temporary, in my opinion. Models are getting better. Fast. Every few months there’s a new model that makes half your code obsolete. The elaborate scaffolding you built to handle edge cases? The model just... handles them now. When we started, we needed detailed skills with step-by-step instructions even for simple tasks. “First do X, then do Y, then check Z.” Now? For a simple task we can often just say “do an earnings preview” and the model figures it out (kind of!). This creates a weird tension. You need skills today because current models aren’t smart enough. But you should design your skills knowing that future models will need less hand-holding. That’s why I’m bullish on markdown files versus code for model instructions: they’re easier to update and delete. We send detailed feedback to AI labs. Whenever we build complex scaffolding to work around model limitations, we document exactly what the model struggles with and share it with the lab’s research team. This helps inform the next generation of models. The goal is to make our own scaffolding obsolete. My prediction: in two years, most of our basic skills will be one-liners. “Generate a 20-tab DCF.” That’s it. The model will know what that means. But here’s the flip side: as basic tasks get commoditized, we’ll push into more complex territory. Multi-step valuations with segment-by-segment analysis. Automated backtesting of investment strategies. Real-time portfolio monitoring with complex triggers. The frontier keeps moving. So we write skills. We delete them when they become unnecessary. And we build new ones for the harder problems that emerge. And all of that is files... in our filesystem. Here’s something that surprised me: S3 is a better database for files than a database. We store user data (watchlists, portfolios, preferences, memories, skills) in S3 as YAML files. S3 is the source of truth.
A Lambda function syncs changes to PostgreSQL for fast queries:

Writes → S3 (source of truth) → Lambda trigger → PostgreSQL (fs_files table)
Reads ← fast queries against PostgreSQL

- Durability: S3 has 11 9’s. A database doesn’t.
- Versioning: S3 versioning gives you audit trails for free.
- Simplicity: YAML files are human-readable. You can debug with `cat`.
- Cost: S3 is cheap. Database storage is not.

The pattern:

- Writes go to S3 directly
- List queries hit the database (fast)
- Single-item reads go to S3 (freshest data)

The sync architecture: we run two Lambda functions to keep S3 and PostgreSQL in sync:

- S3 (file upload/delete) → fs-sync Lambda → upsert/delete in the fs_files table (real-time)
- EventBridge (every 3 hours) → fs-reconcile Lambda → full S3 vs DB scan, fixing discrepancies

Both use upsert with timestamp guards: newer data always wins. The reconcile job catches any events that slipped through (S3 eventual consistency, Lambda cold starts, network blips). User memories live here too. Every user has a `/private/memories/UserMemories.md` file in S3. It’s just markdown; users can edit it directly in the UI. On every conversation, we load it and inject it as context. This is surprisingly powerful. Users write things like “I focus on small-cap value stocks” or “Always compare to industry median, not mean” or “My portfolio is concentrated in tech, so flag concentration risk.” The agent sees this on every conversation and adapts accordingly. No migrations. No schema changes. Just a markdown file that the user controls. Watchlists work the same way. YAML files in S3, synced to PostgreSQL for fast queries. When a user asks about “my watchlist,” we load the relevant tickers and inject them as context. The agent knows which companies matter to this user. The filesystem becomes the user’s personal knowledge base. Skills tell the agent how to do things. Memories tell it what the user cares about. Both are just files. Agents in financial services need to read and write files. A lot of files.
PDFs, spreadsheets, images, code. Here’s how we handle it. ReadFile handles the complexity: WriteFile creates artifacts that link back to the UI: Bash gives persistent shell access with 180 second timeout and 100K character output limit. Path normalization on everything (LLMs love trying path traversal attacks, it’s hilarious). Bash is more important than you think. There’s a growing conviction in the AI community that filesystems and bash are the optimal abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data. The results were interesting: pure SQL hit 100% accuracy but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. The winner? A hybrid approach where the agent uses bash to explore and verify, SQL for structured queries. This matches our experience. Financial data is messy. You need bash to grep through filing documents, find patterns, explore directory structures. But you also need structured tools for the heavy lifting. The agent needs both—and the judgment to know when to use each. We’ve leaned hard into giving agents full shell access in the sandbox. It’s not just for running Python scripts. It’s for exploration, verification, and the kind of ad-hoc data manipulation that complex tasks require. But complex tasks mean long-running agents. And long-running agents break everything. Subscribe now Before Temporal, our long-running tasks were a disaster. User asks for a comprehensive company analysis. That takes 5 minutes. What if the server restarts? What if the user closes the tab and comes back? What if... anything? We had a homegrown job queue. It was bad. Retries were inconsistent. State management was a nightmare. Then we switched to Temporal and I wanted to cry tears of joy! That’s it. Temporal handles worker crashes, retries, everything. 
We built an `AskUserQuestion` tool that lets the agent pause, present options, and wait. When the agent calls this tool, the agentic loop intercepts it, saves state, and presents a UI to the user. The user picks an option (or types a custom answer), and the conversation resumes with their choice.
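A tool definition of that shape might look something like the sketch below. The field names and the `intercept` helper are illustrative guesses at the idea, not Fintool's actual implementation:

```python
# Hypothetical AskUserQuestion tool schema, in the style of JSON-schema tool
# definitions. Everything here is illustrative, not Fintool's real tool.
ASK_USER_QUESTION = {
    "name": "AskUserQuestion",
    "description": "Pause the workflow and ask the user to pick an option "
                   "before continuing. Use for decisions the user should own.",
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "options": {"type": "array", "items": {"type": "string"}},
            "allow_custom_answer": {"type": "boolean"},
        },
        "required": ["question", "options"],
    },
}

def intercept(tool_call: dict) -> dict:
    """Sketch of the agentic loop's interception: save state, surface a prompt."""
    assert tool_call["name"] == "AskUserQuestion"
    return {
        "ui": "choice_prompt",
        "question": tool_call["input"]["question"],
        "options": tool_call["input"]["options"],
    }
```

The key design point is that the tool call never reaches a model completion: the loop catches it, persists the conversation state, and only resumes once the user's answer comes back as the tool result.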
This transforms agents from autonomous black boxes into collaborative tools. The agent does the heavy lifting, but the user stays in control of key decisions. Essential for high-stakes financial work where users need to validate assumptions. Evaluation Is Not Optional “Ship fast, fix later” works for most startups. It does not work for financial services. A wrong earnings number can cost someone money. A misinterpreted guidance statement can lead to bad investment decisions. You can’t just “fix it later” when your users are making million-dollar decisions based on your output. We use Braintrust for experiment tracking. Every model change, every prompt change, every skill change gets evaluated against a test set. Generic NLP metrics (BLEU, ROUGE) don’t work for finance. A response can be semantically similar but have completely wrong numbers. Building eval datasets is harder than building the agent. We maintain ~2,000 test cases across categories: Ticker disambiguation. This is deceptively hard: - “Apple” → AAPL, not APLE (Appel Petroleum) - “Meta” → META, not MSTR (which some people call “meta”) - “Delta” → DAL (airline) or is the user talking about delta hedging (options term)? The really nasty cases are ticker changes. Facebook became META in 2021. Google restructured under GOOG/GOOGL. Twitter became X (but kept the legal entity). When a user asks “What happened to Facebook stock in 2023?”, you need to know that FB → META, and that historical data before Oct 2021 lives under the old ticker. We maintain a ticker history table and test cases for every major rename in the last decade. Fiscal period hell.
This is where most financial agents silently fail: - Apple’s Q1 is October-December (fiscal year ends in September) - Microsoft’s Q2 is October-December (fiscal year ends in June) - Most companies’ Q1 is January-March (calendar year) “Last quarter” on January 15th means: - Q4 2024 for calendar-year companies - Q1 2025 for Apple (they just reported) - Q2 2025 for Microsoft (they’re mid-quarter) We maintain fiscal calendars for 10,000+ companies. Every period reference gets normalized to absolute date ranges. We have 200+ test cases just for period extraction. Numeric precision. Revenue of $4.2B vs $4,200M vs $4.2 billion vs “four point two billion.” All equivalent. But “4.2” alone is wrong—missing units. Is it millions? Billions? Per share? We test unit inference, magnitude normalization, and currency handling. A response that says “revenue was 4.2” without units fails the eval, even if 4.2B is correct. Adversarial grounding. We inject fake numbers into context and verify the model cites the real source, not the planted one. Example: We include a fake analyst report stating “Apple revenue was $50B” alongside the real 10-K showing $94B. If the agent cites $50B, it fails. If it cites $94B with proper source attribution, it passes. We have 50 test cases specifically for hallucination resistance. Eval-driven development. Every skill has a companion eval. The DCF skill has 40 test cases covering WACC edge cases, terminal value sanity checks, and stock-based compensation add-backs (models forget this constantly). PR blocked if eval score drops >5%. No exceptions. Production Monitoring Our production setup looks like this: We auto-file GitHub issues for production errors. Error happens, issue gets created with full context: conversation ID, user info, traceback, links to Braintrust traces and Temporal workflows. Paying customers get a `priority:high` label. Model routing by complexity: simple queries use Haiku (cheap), complex analysis uses Sonnet (expensive).
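That routing policy might be sketched roughly like this; the keyword heuristic, short model names, and `is_enterprise` flag are all illustrative, not Fintool's actual logic:

```python
# Hypothetical sketch of complexity-based model routing.
COMPLEX_HINTS = ("dcf", "valuation", "backtest", "sensitivity", "compare")

def pick_model(query: str, is_enterprise: bool) -> str:
    if is_enterprise:
        return "sonnet"  # enterprise users always get the best model
    if any(hint in query.lower() for hint in COMPLEX_HINTS):
        return "sonnet"  # complex analysis -> the expensive model
    return "haiku"       # simple lookups -> the cheap model
```

In practice a router like this could also consider conversation length, attached files, or a cheap classifier call, but the shape stays the same: cheap by default, expensive when the query looks like real analysis.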
Enterprise users always get the best model. The biggest lesson isn’t about sandboxes or skills or streaming. It’s this: The model is not your product. The experience around the model is your product. Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else: the data you have access to, the skills you’ve built, the UX you’ve designed, the reliability you’ve engineered, and frankly how well you know the industry, which is a function of how much time you spend with your customers. Models will keep getting better. That’s great! It means less scaffolding, less prompt engineering, less complexity. But it also means the model becomes more of a commodity. Your moat is not the model. Your moat is everything you build around it. For us, that’s financial data, domain-specific skills, real-time streaming, and the trust we’ve built with professional investors. What’s yours? I’ve spent the last two years building AI agents for financial services. Along the way, I’ve accumulated a fair number of battle scars and learnings that I want to share.
Here’s what I’ll cover: - The Sandbox Is Not Optional - Why isolated execution environments are essential for multi-step agent workflows - Context Is the Product - How we normalize heterogeneous financial data into clean, searchable context - The Parsing Problem - The hidden complexity of extracting structured data from adversarial SEC filings - Skills Are Everything - Why markdown-based skills are becoming the product, not the model - The Model Will Eat Your Scaffolding - Designing for obsolescence as models improve - The S3-First Architecture - Why S3 beats databases for file storage and user data - The File System Tools - How ReadFile, WriteFile, and Bash enable complex financial workflows - Temporal Changed Everything - Reliable long-running tasks with proper cancellation handling - Real-Time Streaming - Building responsive UX with delta updates and interactive agent workflows - Evaluation Is Not Optional - Domain-specific evals that catch errors before they cost money - Production Monitoring - The observability stack that keeps financial agents reliable Why financial services is extremely hard. This domain doesn’t forgive mistakes. Numbers matter. A wrong revenue figure, a misinterpreted guidance statement, an incorrect DCF assumption. Professional investors make million-dollar decisions based on our output. One mistake on a $100M position and you’ve destroyed trust forever. The users are also demanding. Professional investors are some of the smartest, most time-pressed people you’ll ever work with. They spot bullshit instantly. They need precision, speed, and depth. You can’t hand-wave your way through a valuation model or gloss over nuances in an earnings call. This forces me to develop an almost paranoid attention to detail. Every number gets double-checked. Every assumption gets validated. Every model gets stress-tested. You start questioning everything the LLM outputs because you know your users will. 
A single wrong calculation in a DCF model and you lose credibility forever. I sometimes feel that the fear of being wrong becomes our best feature. Over the years building with LLMs, we’ve made bold infrastructure bets early, and I think we’ve been right. For instance, when Claude Code launched with its filesystem-first agentic approach, we immediately adopted it. It was not an obvious bet, and it was a massive revamp of our architecture. I was extremely lucky to have Thariq from the Anthropic Claude Code team jump on a Zoom and open my eyes to the possibilities. At the time the whole industry, including Fintool, was building elaborate RAG pipelines with vector databases and embeddings. After reflecting on the future of information retrieval with agents, I wrote “the RAG obituary” and Fintool moved fully to agentic search. We even decided to retire our precious embedding pipeline. Sad, but whatever is best for the future! People thought we were crazy. The article got a lot of praise and a lot of negative comments. Now I feel most startups are adopting these best practices. I believe we’re early on several other architectural choices too. I’m sharing them here because the best way to test ideas is to put them out there. Let’s start with the biggest one. The Sandbox Is Not Optional When we first started building Fintool in 2023, I thought sandboxing might be overkill. “We’re just running Python scripts,” I told myself. “What could go wrong?” Haha. Everything. Everything could go wrong. The first time an LLM decided to `rm -rf /` on our server (it was trying to “clean up temporary files”), I became a true believer. Here’s the thing: agents need to run multi-step operations. A professional investor asks for a DCF valuation, and that’s not a single API call. The agent needs to research the company, gather financial data, build a model in Excel, run sensitivity analysis, generate complex charts, iterate on assumptions.
That’s dozens of steps, each potentially modifying files, installing packages, running scripts. You can’t do this without code execution. And executing arbitrary code on your servers is insane. Every chat application needs a sandbox. Today, each user gets their own isolated environment. The agent can do whatever it wants in there. Delete everything? Fine. Install weird packages? Go ahead. It’s your sandbox, knock yourself out. The architecture looks like this: Three mount points. Private is read/write for your stuff. Shared is read-only for your organization. Public is read-only for everyone. The magic is in the credentials. We use AWS ABAC (Attribute-Based Access Control) to generate short-lived credentials scoped to specific S3 prefixes. User A literally cannot access User B’s data. The IAM policy uses `${aws:PrincipalTag/S3Prefix}` to restrict access. The credentials physically won’t allow it. This is also very good for enterprise deployments. We also do sandbox pre-warming. When a user starts typing, we spin up their sandbox in the background. By the time they hit enter, the sandbox is ready. A 600-second timeout, extended by 10 minutes on each tool use. The sandbox stays warm across conversation turns. Sandboxes are amazing, but their under-discussed magic is filesystem support. Which brings us to the next lesson learned about context. Context Is the Product Your agent is only as good as the context it can access. The real work isn’t prompt engineering; it’s turning messy financial data from dozens of sources into clean, structured context the model can actually use. This requires massive domain expertise from the engineering team. The heterogeneity problem.
Financial data comes in every format imaginable: - SEC filings : HTML with nested tables, exhibits, signatures - Earnings transcripts : Speaker-segmented text with Q&A sections - Press releases : Semi-structured HTML from PRNewswire - Research reports : PDFs with charts and footnotes - Market data : Snowflake/databases with structured numerical data - News : Articles with varying quality and structure - Alternative data : Satellite imagery, web traffic, credit card panels - Broker research : Proprietary PDFs with price targets and models - Fund filings : 13F holdings, proxy statements, activist letters Each source has different schemas, different update frequencies, different quality levels. The agent needs one thing: clean context it can reason over. The normalization layer. Everything becomes one of three formats: - Markdown for narrative content (filings, transcripts, articles) - CSV/tables for structured data (financials, metrics, comparisons) - JSON metadata for searchability (tickers, dates, document types, fiscal periods) Chunking strategy matters. Not all documents chunk the same way: - 10-K filings : Section by regulatory structure (Item 1, 1A, 7, 8...) - Earnings transcripts : Chunk by speaker turn (CEO remarks, CFO remarks, Q&A by analyst) - Press releases : Usually small enough to be one chunk - News articles : Paragraph-level chunks - 13F filings : By holder and position changes quarter-over-quarter The chunking strategy determines what context the agent retrieves. Bad chunks = bad answers. Tables are special. Financial data is full of tables and CSVs. Revenue breakdowns, segment performance, guidance ranges. LLMs are surprisingly good at reasoning over markdown tables: But they’re terrible at reasoning over HTML `<table>` tags or raw CSV dumps. The normalization layer converts everything to clean markdown tables. Metadata enables retrieval. The user asks the agent: “What did Apple say about services revenue in their last earnings call?
” To answer this, Fintool needs: - Ticker resolution (AAPL → correct company) - Document type filtering (earnings transcript, not 10-K) - Temporal filtering (most recent, not 2019) - Section targeting (CFO remarks or revenue discussion, not legal disclaimers) This is why `meta.json` exists for every document. Without structured metadata, you’re doing keyword search over a haystack. It speeds up the search, big time! Anyone can call an LLM API. Not everyone has normalized decades of financial data into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work. The Parsing Problem Normalizing financial data is 80% of the work. Here’s what nobody tells you. SEC filings are adversarial. They’re not designed for machine reading. They’re designed for legal compliance: - Tables span multiple pages with repeated headers - Footnotes reference exhibits that reference other footnotes - Numbers appear in text, tables, and exhibits—sometimes inconsistently - XBRL tags exist but are often wrong or incomplete - Formatting varies wildly between filers (every law firm has their own template) We tried off-the-shelf PDF/HTML parsers. They failed on: - Multi-column layouts in proxy statements - Nested tables in MD&A sections (tables within tables within tables) - Watermarks and headers bleeding into content - Scanned exhibits (still common in older filings and attachments) - Unicode issues (curly quotes, em-dashes, non-breaking spaces) The Fintool parsing pipeline: Raw Filing (HTML/PDF) ↓ Document structure detection (headers, sections, exhibits) ↓ Table extraction with cell relationship preservation ↓ Entity extraction (companies, people, dates, dollar amounts) ↓ Cross-reference resolution (Ex. 10.1 → actual exhibit content) ↓ Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple) ↓ Quality scoring (confidence per extracted field) Table extraction deserves its own work. Financial tables are dense with meaning. 
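The fiscal-period step in that pipeline is worth a sketch. The year-end months for Apple (September) and Microsoft (June) come from the post; the function itself is an illustration, not Fintool's code:

```python
# Hypothetical fiscal-quarter normalizer: map (ticker, fiscal year, quarter)
# to the absolute calendar dates that quarter covers.
import calendar
from datetime import date

FY_END_MONTH = {"AAPL": 9, "MSFT": 6}  # month the fiscal year ends; default 12

def fiscal_quarter_range(ticker: str, fy: int, q: int) -> tuple[date, date]:
    end_m = FY_END_MONTH.get(ticker, 12)
    # The fiscal year starts the month after the prior fiscal year ends.
    start_y = fy if end_m == 12 else fy - 1
    months = start_y * 12 + (end_m % 12) + (q - 1) * 3  # 0-based month count
    q_start = date(months // 12, months % 12 + 1, 1)
    y, m = (months + 2) // 12, (months + 2) % 12 + 1
    return q_start, date(y, m, calendar.monthrange(y, m)[1])

# Apple's FY2024 Q1 is October-December 2023, matching the pipeline example.
assert fiscal_quarter_range("AAPL", 2024, 1) == (date(2023, 10, 1), date(2023, 12, 31))
```

Once every period reference is reduced to an absolute `(start, end)` range like this, "same quarter" comparisons across companies with different fiscal calendars become a plain date-overlap check.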
A revenue breakdown table might have: - Merged header cells spanning multiple columns - Footnote markers (1), (2), (a), (b) that reference explanations below - Parentheses for negative numbers: $(1,234) means -1234 - Mixed units in the same table (millions for revenue, percentages for margins) - Prior period restatements in italics or with asterisks We score every extracted table on: - Cell boundary accuracy (did we split/merge correctly?) - Header detection (is row 1 actually headers, or is there a title row above?) - Numeric parsing (is “$1,234” parsed as 1234 or left as text?) - Unit inference (millions? billions? per share? percentage?) Tables below 90% confidence get flagged for review. Low-confidence extractions don’t enter the agent’s context—garbage in, garbage out. Fiscal period normalization is critical. “Q1 2024” is ambiguous: - Calendar Q1 (January-March 2024) - Apple’s fiscal Q1 (October-December 2023) - Microsoft’s fiscal Q1 (July-September 2023) - “Reported in Q1” (filed in Q1, but covers the prior period) We maintain a fiscal calendar database for 10,000+ companies. Every date reference gets normalized to absolute date ranges. When the agent retrieves “Apple Q1 2024 revenue,” it knows to look for data from October-December 2023. This is invisible to users but essential for correctness. Without it, you’re comparing Apple’s October revenue to Microsoft’s January revenue and calling it “same quarter.” Skills Are Everything Here’s the thing nobody tells you about building AI agents: the model is not the product. The skills are now the product. I learned this the hard way. We used to try making the base model “smarter” through prompt engineering. Tweak the system prompt, add examples, write elaborate instructions. It helped a little. But skills were the missing part. In October 2025, Anthropic formalized this with Agent Skills, a specification for extending Claude with modular capability packages.
A skill is a folder containing a `SKILL.md` file with YAML frontmatter (name and description), plus any supporting scripts, references, or data files the agent might need. We’d been building something similar for months before the announcement. The validation felt good but more importantly, having an industry standard means our skills can eventually be portable. Without skills, models are surprisingly bad at domain tasks. Ask a frontier model to do a DCF valuation. It knows what DCF is. It can explain the theory. But actually executing one? It will miss critical steps, use wrong discount rates for the industry, forget to add back stock-based compensation, skip sensitivity analysis. The output looks plausible but is subtly wrong in ways that matter. The breakthrough came when we started thinking about skills as first-class citizens. Like part of the product itself. A skill is a markdown file that tells the agent how to do something specific. Here’s a simplified version of our DCF skill: That’s it. A markdown file. No code changes. No production deployment. Just a file that tells the agent what to do. Skills are better than code. This matters enormously: 1. Non-engineers can create skills. Our analysts write skills. Our customers write skills. A portfolio manager who’s done 500 DCF valuations can encode their methodology in a skill without writing a single line of Python. 2. No deployment needed. Change a skill file and it takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own. 3. Readable and auditable. When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module. We have a copy-on-write shadowing system: Priority: private > shared > public So if you don’t like how we do DCF valuations, write your own. Drop it in `/private/skills/dcf/SKILL.md`. Your version wins. 
Why we don’t mount all skills to the filesystem. This is important. The naive approach would be to mount every skill file directly into the sandbox. The agent can just `cat` any skill it needs. Simple, right? Wrong. Here’s why we use SQL discovery instead: 1. Lazy loading. We have dozens of skills with extensive documentation like the DCF skill alone has 10+ industry guideline files. Loading all of them into context for every conversation would burn tokens and confuse the model. Instead, we discover skill metadata (name, description) upfront, and only load the full documentation when the agent actually uses that skill. 2. Access control at query time. The SQL query implements our three-tier access model: public skills available to everyone, organization skills for that org’s users, private skills for individual users. The database enforces this. You can’t accidentally expose a customer’s proprietary skill to another customer. 3. Shadowing logic. When a user customizes a skill, their version needs to override the default. SQL makes this trivial—query all three levels, apply priority rules, return the winner. Doing this with filesystem mounts would be a nightmare of symlinks and directory ordering. 4. Metadata-driven filtering. The `fs_files.metadata` column stores parsed YAML frontmatter. We can filter by skill type, check if a skill is main-agent-only, or query any other structured attribute—all without reading the files themselves. The pattern: S3 is the source of truth, a Lambda function syncs changes to PostgreSQL for fast queries, and the agent gets exactly what it needs when it needs it. Skills are essential. I cannot emphasize this enough. If you’re building an AI agent and you don’t have a skills system, you’re going to have a bad time. My biggest argument for skills is that top models (Claude or GPT) are post-trained on using Skills. The model wants to fetch skills. Models just want to learn and what they want to learn is our skills... Until they ate it. 
The Model Will Eat Your Scaffolding Here’s the uncomfortable truth: everything I just told you about skills? It’s temporary in my opinion. Models are getting better. Fast. Every few months, there’s a new model that makes half your code obsolete. The elaborate scaffolding you built to handle edge cases? The model just... handles them now. When we started, we needed detailed skills with step-by-step instructions for some simple tasks. “First do X, then do Y, then check Z.” Now? For a simple task we can often just say “do an earnings preview” and the model figures it out (kind of!). This creates a weird tension. You need skills today because current models aren’t smart enough. But you should design your skills knowing that future models will need less hand-holding. That’s why I’m bullish on markdown files versus code for model instructions. It’s easier to update and delete. We send detailed feedback to AI labs. Whenever we build complex scaffolding to work around model limitations, we document exactly what the model struggles with and share it with the lab research team. This helps inform the next generation of models. The goal is to make our own scaffolding obsolete. My prediction: in two years, most of our basic skills will be one-liners. “Generate a 20-tab DCF.” That’s it. The model will know what that means. But here’s the flip side: as basic tasks get commoditized, we’ll push into more complex territory. Multi-step valuations with segment-by-segment analysis. Automated backtesting of investment strategies. Real-time portfolio monitoring with complex triggers. The frontier keeps moving. So we write skills. We delete them when they become unnecessary. And we build new ones for the harder problems that emerge. And all of that is files... in our filesystem. The S3-First Architecture Here’s something that surprised me: S3 for files is a better database than a database. We store user data (watchlists, portfolio, preferences, memories, skills) in S3 as YAML files.
S3 is the source of truth. A Lambda function syncs changes to PostgreSQL for fast queries. Writes → S3 (source of truth) ↓ Lambda trigger ↓ PostgreSQL (fs_files table) ↓ Reads ← Fast queries Why? - Durability : S3 has 11 9’s. A database doesn’t. - Versioning : S3 versioning gives you audit trails for free - Simplicity : YAML files are human-readable. You can debug with `cat`. - Cost : S3 is cheap. Database storage is not. The pattern: - Writes go to S3 directly - List queries hit the database (fast) - Single-item reads go to S3 (freshest data) The sync architecture. We run two Lambda functions to keep S3 and PostgreSQL in sync: S3 (file upload/delete) ↓ SNS Topic ↓ fs-sync Lambda → Upsert/delete in fs_files table (real-time) EventBridge (every 3 hours) ↓ fs-reconcile Lambda → Full S3 vs DB scan, fix discrepancies Both use upsert with timestamp guards—newer data always wins. The reconcile job catches any events that slipped through (S3 eventual consistency, Lambda cold starts, network blips). User memories live here too. Every user has a `/private/memories/UserMemories.md` file in S3. It’s just markdown—users can edit it directly in the UI. On every conversation, we load it and inject it as context: This is surprisingly powerful. Users write things like “I focus on small-cap value stocks” or “Always compare to industry median, not mean” or “My portfolio is concentrated in tech, so flag concentration risk.” The agent sees this on every conversation and adapts accordingly. No migrations. No schema changes. Just a markdown file that the user controls. Watchlists work the same way. YAML files in S3, synced to PostgreSQL for fast queries. When a user asks about “my watchlist,” we load the relevant tickers and inject them as context. The agent knows what companies matter to this user. The filesystem becomes the user’s personal knowledge base. Skills tell the agent how to do things. Memories tell it what the user cares about. Both are just files. 
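The “newer data always wins” guard is simple to sketch. Here it runs against an in-memory dict standing in for the real `fs_files` table; the function is an illustration of the idea, not Fintool's Lambda code:

```python
# Hypothetical sketch of the timestamp-guarded upsert used by the sync Lambdas.
# An out-of-order or replayed S3 event carries an older timestamp and is ignored.

def upsert(table: dict, key: str, content: str, updated_at: float) -> bool:
    row = table.get(key)
    if row is not None and row["updated_at"] >= updated_at:
        return False  # stale event: the table already holds newer data
    table[key] = {"content": content, "updated_at": updated_at}
    return True

fs_files: dict = {}
upsert(fs_files, "private/memories/UserMemories.md", "v2", updated_at=200.0)
# A delayed notification for the older v1 write arrives late and is dropped:
assert not upsert(fs_files, "private/memories/UserMemories.md", "v1", updated_at=100.0)
assert fs_files["private/memories/UserMemories.md"]["content"] == "v2"
```

Because both the real-time sync and the three-hour reconcile job apply the same guard, they can safely race each other: whichever runs second with older data becomes a no-op.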
The File System Tools Agents in financial services need to read and write files. A lot of files. PDFs, spreadsheets, images, code. Here’s how we handle it. ReadFile handles the complexity: WriteFile creates artifacts that link back to the UI: Bash gives persistent shell access with a 180-second timeout and a 100K-character output limit. Path normalization on everything (LLMs love trying path traversal attacks, it’s hilarious). Bash is more important than you think. There’s a growing conviction in the AI community that filesystems and bash are the optimal abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data. The results were interesting: pure SQL hit 100% accuracy but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. The winner? A hybrid approach where the agent uses bash to explore and verify, SQL for structured queries. This matches our experience. Financial data is messy. You need bash to grep through filing documents, find patterns, explore directory structures. But you also need structured tools for the heavy lifting. The agent needs both—and the judgment to know when to use each. We’ve leaned hard into giving agents full shell access in the sandbox. It’s not just for running Python scripts. It’s for exploration, verification, and the kind of ad-hoc data manipulation that complex tasks require. But complex tasks mean long-running agents. And long-running agents break everything. Temporal Changed Everything Before Temporal, our long-running tasks were a disaster. User asks for a comprehensive company analysis. That takes 5 minutes. What if the server restarts? What if the user closes the tab and comes back? What if... anything? We had a homegrown job queue. It was bad. Retries were inconsistent. State management was a nightmare. Then we switched to Temporal and I wanted to cry tears of joy! That’s it.
Temporal handles worker crashes, retries, everything. If a Heroku dyno restarts mid-conversation (happens all the time lol), Temporal automatically retries on another worker. The user never knows. The cancellation handling is the tricky part. User clicks “stop,” what happens? The activity is already running on a different server. We use heartbeats sent every few seconds. We run two worker types: - Chat workers : User-facing, 25 concurrent activities - Background workers : Async tasks, 10 concurrent activities They scale independently. Chat traffic spikes? Scale chat workers. Next is speed. Real-Time Streaming In finance, people are impatient. They’re not going to wait 30 seconds staring at a loading spinner. They need to see something happening. So we built real-time streaming. The agent works, you see the progress. Agent → SSE Events → Redis Stream → API → Frontend The key insight: delta updates, not full state. Instead of sending “here’s the complete response so far” (expensive), we send “append these 50 characters” (cheap). Streaming rich content with Streamdown. Text streaming is table stakes. The harder problem is streaming rich content: markdown with tables, charts, citations, math equations. We use Streamdown to render markdown as it arrives, with custom plugins for our domain-specific components. Charts render progressively. Citations link to source documents. Math equations display properly with KaTeX. The user sees a complete, interactive response building in real-time. AskUserQuestion: Interactive agent workflows. Sometimes the agent needs user input mid-workflow. “Which valuation method do you prefer?” “Should I use consensus estimates or management guidance?” “Do you want me to include the pipeline assets in the valuation?”

Steve Klabnik 2 months ago

Agentic development basics

In my last post, I suggested that you should start using Claude in your software development process via read-only means at first. The idea is just to get used to interacting with the AI, seeing what it can do, and seeing what it struggles with. Once you’ve got a handle on that part, it’s time to graduate to writing code. However, I’m going to warn you about this post: I hope that by the end of it, you’re a little frustrated. This is because I don’t think it’s productive to skip to the tools and techniques that experienced users use yet. We have to walk before we run. And more importantly, we have to understand how and why we run. That is, I hope that this step will let you start producing code with Claude, but it will also show you some of the initial pitfalls when doing so, in order to motivate the techniques you’re going to learn about in part 3. So with that in mind, let’s begin. Okay I lied. Before we actually begin: you are using version control, right? If not, you may want to go learn a bit about it. Version control, like git (or my beloved jj) is pretty critical for software development, but it’s in my opinion even more critical for this sort of development. You really want to be able to restore to previous versions of the code, branch off and try things, and recover from mistakes. If you already use version control systems religiously, you might use this as an excuse to learn even more features of them. I never bothered with s in the past, but I use s with agents all the time now. Okay, here’s my first bit of advice: commit yourself to not writing code any more. I don’t mean forever, I don’t mean all the time, I mean, while you’re trying to learn agentic software development, on the project that you’re learning it with, just don’t write any code manually. This might be a bit controversial! However, I think it is essential. Let me tell you a short story. Many years ago, I was in Brazil. I wanted to try scuba diving for the first time. 
Seemed like a good opportunity. Now, I don’t remember the exact setup, but I do remember the hardest part for me. Our instructor told us to put the mask on, and then lean forward and put our faces in the water and breathe through the regulator. I simply could not do it. I got too in my head, it was like those “you are now breathing manually” memes. I forget if it was my idea or the instructor’s idea, but what happened in practice: I just jumped in. My brain very quickly went from “but how do I do this properly” to “oh God, if you don’t figure this out right the fuck now you’re gonna fuckin die idiot” and that’s exactly what I needed to do it. A few seconds later, I was breathing just fine. I just needed the shock to my system, I needed to commit. And I figured it out.

Now, I’m turning 40 and this happened long ago, and it’s reached more of a personal myth status in my head, so maybe I got the details wrong, maybe this is horribly irresponsible and someone who knows about diving can tell me if that experience was wrong, but the point has always stuck with me: sometimes, you just gotta go for it.

This dovetails into another part I snuck into there: “on the project that you’re learning it with.” I really think that you should start a new project for this endeavor. There’s a few reasons for this:

Pick something to get started with, and create a new repo. I suggest something you’ve implemented before, or maybe something you know how to do but have never bothered to take the time to actually build. A small CLI tool might be a good idea. Doesn’t super matter what it is. For the purposes of this example, I’m going to build a task tracking CLI application. Because there aren’t enough of those in the world.

I recommend making a new fresh directory, and initializing the project of your choice. I’m using Rust, of course, so I’ll . You can make Claude do it, but I don’t think that starting from an initialized project is a bad idea either.
I’m more likely to go the route if I know I’m building something small, and more of the “make Claude do it” route if I’m doing like, a web app with a frontend, backend, and several services. Anyway, point is: get your project to exist, and then just run Claude Code. I guess you should install it first, but anyway, you can just to get started. At the time of writing, Claude will ask you if you trust the files in this folder; you want to say yes, because you just created it. You’ll also get some screens asking if you want to do light or dark mode, to log in with your Anthropic account, stuff like that. And then you’ll be in.

Claude will ask you to run to create a CLAUDE.md, but we’re not gonna do that at the start. We need to talk about even more basic things than that first. You’ll be at a prompt, and it’ll even suggest something for you to get started with. Mine says “Try "fix typecheck errors"” at the moment.

We’re gonna try to get Claude to modify a few lines of our program. The in Rust produces this program: so I’ll ask Claude this:

Hi claude! right now my program prints “Hello, world”, can we have it print “Goodbye, world” instead?

Claude does this for me: And then asks this: Claude wants to edit a file, and so by default, it has to ask us permission to do so. This is a terrible place to end up, but a great place to get started! We want to use these prompts at first to understand what Claude is doing, and sort of “code review” it as we go. More on that in a bit. Anyway, this is why you should answer “Yes” to this question, and not “Yes, allow all edits during this session.” We want to keep reviewing the code for now. You want to be paying close attention to what Claude is doing, so you can build up some intuition about it.

Before clicking yes, I want to talk about what Claude did in my case here.
Note that my prompt said right now my program prints “Hello, world”, can we have it print “Goodbye, world” Astute observers will have noticed that it actually says and not . We also asked it to have it say and it is showing a diff that will make it say . This is a tiny difference, but it is also important to understand: Claude is going to try and figure out what you mean and do that. This is both the source of these tools’ power and also the very thing that makes it hard to trust them. In this case, Claude was right, I didn’t type the exact string when describing the current behavior, and I didn’t mean to remove the .

In the previous post, I said that you shouldn’t be mean to Claude. I think it makes the LLM perform worse. So now it’s time to talk about your own reaction to the above: did you go “yeah, Claude fucked up, it didn’t do exactly what I asked” or did you go “yeah, Claude did exactly what I asked”? I think it’s important to try and let go of preconceived notions here, especially if your reaction here was negative. I know this is kind of woo, just like “be nice to Claude,” but you have to approach this as “this is a technology that works a little differently than I’m used to, and that’s why I’m learning how to meet it on its own terms” rather than “it didn’t work the way I expected it to, so it is wrong.” A non-woo way of putting it is this: the right way to approach “it didn’t work” in this context is “that’s a skill issue on my part, and I’m here to sharpen my skills.” Yes, there are limits to this technology and it’s not perfect. That’s not the point. You’re not doing that kind of work right now, you’re doing .

Now, I should also say that like, if you don’t want to learn a new tool? 100% okay with me. Learned some things about a tool, and didn’t like it? Sure! Some of you won’t like agentic development. That’s okay. No worries, thanks for reading, have a nice day. I mean that honestly.
But for those folks who do want to learn this, I’m trying to communicate that I think you’ll have a better time learning it if you try to get into the headspace of “how do I get the results I want” rather than getting upset and giving up when it doesn’t work out.

Okay, with that out of the way, if you asked a small enough question, Claude probably did the right thing. Let’s accept. This might be a good time to commit & save your progress. You can use to put Claude Code into “bash mode” and run commands, so I just and I’m good. You can also use another terminal, I just figured I’d let you know. It’s good for short commands.

Let’s try something bigger. To do that, we’re gonna invoke something called “plan mode.” Claude Code has three (shhhh, we don’t talk about the fourth yet) modes. The first one is the “ask to accept edits” mode. But if you hit , you’ll see at the bottom left. We don’t want to automatically accept edits. Hit again, and you’ll see this: This is what we want. Plan mode.

Plan mode is useful any time you’re doing work that’s on the larger side, or just when you want to think through something before you begin. In plan mode, Claude cannot modify your files until you accept the plan. With plan mode, you talk to Claude about what you want to do, and you collaborate on a plan together. A nice thing about it is that you can communicate the things you are sure of, and also ask about the things you’re not sure of.

So let’s kick off some sort of plan to build the most baby parts of our app. In my case, I’m prompting it with this:

hi claude! i want this application to grow into a task tracking app. right now, it’s just hello world. I’d like to set up some command line argument parsing infrastructure, with a command that prints the version. can we talk about that?

Yes, I almost always type , feel free to not. And I always feel like a “can we talk about that” on the end is nice too. I try to talk to Claude the way I’d talk to a co-worker.
Obviously this would be too minor of a thing to bother to talk to a co-worker about, but like, you know, baby steps. Note that what I’m asking is basically a slightly more complex “hello world”, just getting some argument parsing up. You want something this sized: you know how it should be done, it should be pretty simple, but it’s not a fancy command.

With plan mode, Claude will end up responding to this by taking a look at your project, considering what you might need, and then coming up with a plan. Here’s the first part of Claude’s answer to me: It’ll then come up with a plan, and it usually writes it out to a file somewhere: You can go read the file if you want to, but it’s not needed. Claude is going to eventually present the plan to you directly, and you’ll review it before moving on. Claude will also probably ask you a question or maybe even a couple, depending. There’s a neat little TUI for responding to its questions, it can even handle multiple questions at once:

For those of you that don’t write Rust, this is a pretty good response! Clap is the default choice in the ecosystem, arg is a decent option too, and doing it yourself is always possible. I’m going to choose clap, it’s great. If you’re not sure about the question, you can arrow down to “Chat about this” and discuss it more.

Here’s why you don’t need to read the file: Claude will pitch you on its plan: This is pretty good! Now, if you like this plan, you can select #3. Remember, we’re not auto accepting just yet! Don’t worry about the difference between 1 and 2 now, we’ll talk about it someday. But I actually want Claude to tweak this plan: I wouldn’t run , I would do . So I’m going to go down to four and type literally that to Claude: I wouldn’t run , I would do . And Claude replies: and then presents the menu again.

See, this is where the leverage of “Claude figures out what I mean” can be helpful: I only told it about , but it also changed to as well.
However, there’s a drawback too: we didn’t tell Claude we wanted to have help output! Then again, this is also a positive: Claude considered help output to be so basic that it’s suggesting it for our plan. It’s up to you to decide if this is an overreach on Claude’s part. In my case, I’m okay with it because it’s so nearly automatic with Clap and it’s something I certainly want in my tool, so I’m going to accept this plan. Iterate with Claude until you’re happy with the plan, and then accept it the same way.

I’m not going to paste all of the diffs here, but for me, Claude then went and did the plan: it added the dependency to , it added the needed code to , it ran to try and do the build. Oh yeah, here’s a menu we haven’t seen yet: This is a more interesting question than the auto-edit thing. Claude won’t run commands without you signing off on them. If you’re okay with letting Claude run this command every time without asking, you can choose 2, and if you want to confirm every time, you can type 1. Completely up to you. Now it ran , and . Everything looked good, so I see:

And that’s it, we’ve built our first feature! Yeah, it’s pretty small, and in this case, we probably could have copy/pasted the documentation, but again, that’s not the point right now: we’re just trying to take very small steps forward to get used to working with the tool. We want it to be something that’s quick for us to verify. We are spending more time here than we would if we did it by hand, but that time isn’t wasted: it’s time learning. As we ramp up the complexity of what we can accomplish, we’ll start seeing speed gains. But we’re deliberately going slow and doing little right now.

From here, that’s exactly what I’d like you to do: figure out where the limits are. Try something slightly larger, slightly harder. Try to do stuff with just a prompt, and then throw that commit away and try it again with plan mode.
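Throwing a commit away is cheap, which is the whole reason to commit often. Here’s a sketch of the mechanics in plain git — the snippet builds its own scratch repo so it runs anywhere, and the file names and messages are made up, not from my actual session:

```shell
# Scratch repo so this snippet runs anywhere; in practice you'd be
# inside your real project. File names and messages are invented.
git init -q scratch && cd scratch
git config user.email you@example.com && git config user.name you
git commit -q --allow-empty -m "checkpoint before prompting"

# ...Claude makes an attempt, and we commit it...
echo 'fn main() { println!("Goodbye, world"); }' > main.rs
git add -A && git commit -q -m "attempt 1: just a prompt"

# Didn't like it? Throw the attempt away and try again in plan mode:
git reset -q --hard HEAD~1
```

After the reset, the working tree is exactly back at the checkpoint, so the next attempt starts clean.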
See what stuff you need plan mode for, and what you can get away with using just a simple prompt. I’ll leave you with an example of a prompt I might try next: asking Claude for their opinion on what you should do. The next thing I’d do in this project is to switch back into planning mode and ask this:

what do you think is the best architecture for this tracking app? we haven’t discussed any real features, design, or requirements. this is mostly a little sample application to practice with, and so we might not need a real database, we might get away with a simple file or files to track todos. but maybe something like sqlite would still be appropriate. if we wanted to implement a next step for this app, what do you think it should be?

Here’s what Claude suggested: This plan is pretty solid. But again, it’s a demo app. The important part is, you can always throw this away if you don’t like it. So try some things. Give Claude some opinions and see how they react. Try small features, try something larger. Play around. But push it until Claude fails.

At some point, you will run into problems. Maybe you already have! What you should do depends on the failure mode. The first failure you’re gonna run into is “I don’t like the code!” Maybe Claude just does a bad job. You have two options: the first is to just tell Claude to fix it. Claude made a mess, Claude can clean it up. In more extreme cases, you may want to just simply or and start again. Honestly, the second approach is better in a lot of cases, but I’m not going to always recommend it just yet. The reason it’s better is that it gives you some time to reflect on why Claude did a bad job. But we’re gonna talk about that as the next post in this series! So for now, stick to the ‘worse’ option: just tell Claude to fix problems you find.

The second kind of failure is one where Claude just really struggles to get things right.
It looks in the wrong places in your codebase, it tries to find a bug and can’t figure it out, it misreads output and that leads it astray, etc. This kind of failure is harder to fix with the tools you have available to you right now. What matters is taking note of them, so that you can email them to me, haha. I mean, feel free to do that, and I can incorporate specific suggestions into the next posts, but also, just being able to reflect on what Claude struggles with is a good idea generally. You’ll be able to fix them eventually, so knowing what you need to improve matters.

If Claude works long enough, you’ll see something about “compaction.” We haven’t discussed things at a deep enough level to really understand this yet, so don’t worry about it! You may want to note one thing though: Claude tends to do worse after compaction, in my opinion. So one way to think about this is, “If I see compaction, I’ve tried to accomplish too large a task.” Reflect on if you could have broken this up into something smaller. We’ll talk about this more in the next post.

So that’s it! Let Claude write some code, in a project you don’t care about. Try bigger and harder things until you’re a bit frustrated with its failures. You will hit the limits, because you’re not doing any of the intermediate techniques to help Claude do a good job. But my hope is, by running into these issues, you’ll understand the motivation for those techniques, and will be able to better apply them in the future.

Here’s my post about this post on BlueSky: Steve Klabnik @steveklabnik.com: “Agentic development basics: steveklabnik.com/writing/agen...”
You can be less precious about the code. This isn’t messing up one of your projects, this is a throwaway scratch thing that doesn’t matter. “AI does better on greenfield projects” is not exactly true, but there’s enough truth to it that I think you should do a new project. It’s really more about secondary factors than it is actual greenfield vs brownfield development, but whatever, doesn’t matter: start new.

matklad 2 months ago

Vibecoding #2

I feel like I got substantial value out of Claude today, and want to document it. I am at the tail end of AI adoption, so I don’t expect to say anything particularly useful or novel. However, I am constantly complaining about the lack of boring AI posts, so it’s only proper if I write one.

At TigerBeetle, we are big on deterministic simulation testing . We even use it to track performance , to some degree. Still, it is crucial to verify performance numbers on a real cluster in its natural high-altitude habitat. To do that, you need to procure six machines in a cloud, get your custom version of the binary on them, connect the cluster’s replicas together, and hit them with load. It feels like, a quarter of a century into the third millennium, “run stuff on six machines” should be a problem just a notch harder than opening a terminal and typing , but I personally don’t know how to solve it without wasting a day. So, I spent a day vibecoding my own square wheel.

The general shape of the problem is that I want to spin up a fleet of ephemeral machines with given specs on demand and run ad-hoc commands in a SIMD fashion on them. I don’t want to manually type slightly different commands into a six-way terminal split, but I also do want to be able to ssh into a specific box and poke around. My idea for the solution comes from these three sources:

The big idea of is that you can program a distributed system in direct style. When programming locally, you do things by issuing syscalls: This API works for doing things on remote machines, if you specify which machine you want to run the syscall on: Direct manipulation is the most natural API, and it pays to extend it over the network boundary.

Peter’s post is an application of a similar idea to a narrow, mundane task of developing on Mac and testing on Linux. Peter suggests two scripts: synchronizes the local and remote projects. If you run inside folder, then materializes on the remote machine.
does the heavy lifting, and the wrapper script implements behaviors. It is typically followed by , which runs the command on the remote machine in the matching directory, forwarding output back to you. So, when I want to test local changes to on my Linux box, I have roughly the following shell session:

The killer feature is that shell completion works. I first type the command I want to run, taking advantage of the fact that local and remote commands are the same, paths and all, then hit and prepend (in reality, I have an alias that combines sync&run). The big thing here is not the commands per se, but the shift in the mental model. In a traditional ssh & vim setup, you have to juggle two machines with separate state, the local one and the remote one. With , the state is the same across the machines, you only choose whether you want to run commands here or there.

With just two machines, the difference feels academic. But if you want to run your tests across six machines, the ssh approach fails — you don’t want to re-vim your changes to source files six times, you really do want to separate the place where the code is edited from the place(s) where the code is run. This is a general pattern — if you are not sure about a particular aspect of your design, try increasing the cardinality of the core abstraction from 1 to 2.

The third component, the library, is pretty mundane — just a JavaScript library for shell scripting. The notable aspects there are: JavaScript’s template literals , which allow implementing command interpolation in a safe-by-construction way. When processing , a string is never materialized, it’s arrays all the way to the syscall ( more on the topic ).
JavaScript’s async/await, which makes managing concurrent processes (local or remote) natural: Additionally, deno specifically valiantly strives to impose process-level structured concurrency, ensuring that no processes spawned by the script outlive the script itself, unless explicitly marked — a sour spot of UNIX.

Combining the three ideas, I now have a deno script, called , that provides a multiplexed interface for running ad-hoc code on ad-hoc clusters. A session looks like this:

I like this! Haven’t used it in anger yet, but this is something I wanted for a long time, and now I have it.

The problem with implementing the above is that I have zero practical experience with the modern cloud. I only created my AWS account today, and just looking at the console interface ignited the urge to re-read The Castle. Not my cup of pu-erh. But I had a hypothesis that AI should be good at wrangling baroque cloud APIs, and it mostly held.

I started with a couple of paragraphs of rough, super high-level description of what I want to get. Not a specification at all, just a general gesture towards unknown unknowns. Then I asked ChatGPT to expand those two paragraphs into a more or less complete spec to hand down to an agent for implementation. This phase surfaced a bunch of unknowns for me. For example, I wasn’t thinking at all about how I would identify machines; ChatGPT suggested using random hex numbers, and I realized that I do need a 0,1,2 naming scheme to concisely specify batches of machines. While thinking about this, I realized that a sequential numbering scheme also has the advantage that I can’t have two concurrent clusters running, which is a desirable property for my use-case. If I forgot to shut down a machine, I’d rather get an error on trying to re-create a machine with the same name than silently avoid the clash. Similarly, it turns out the questions of permissions and network access rules are something to think about, as well as what region and what image I need.
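As an aside, the sync/run pair from Peter’s post that inspired all of this can be sketched in a few lines of shell. The names (`sync`, `run`) and the rsync/ssh flags here are my stand-ins, not the original scripts:

```shell
REMOTE=${REMOTE:-devbox}   # hypothetical remote host name

# Mirror the current directory to the same absolute path on the remote,
# so local and remote paths (and shell completion) line up exactly.
sync() {
    rsync -az --delete "$PWD"/ "$REMOTE:$PWD"/
}

# Run a command on the remote, in the matching directory,
# forwarding output back.
run() {
    ssh "$REMOTE" "cd '$PWD' && $*"
}

# Typical session: edit locally, then
#   sync && run cargo test
```

Because the remote path mirrors the local one, you can compose the command locally with full tab completion and only then prepend `run`.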
With the spec document in hand, I turned over to Claude Code for the actual implementation work. The first step was to further refine the spec, asking Claude if anything was unclear. There were a couple of interesting clarifications there.

First, the original ChatGPT spec didn’t get what I meant with my “current directory mapping” idea, that I want to materialize a local as remote , even if are different. ChatGPT generated an incorrect description and an incorrect example. I manually corrected the example, but wasn’t able to write a concise and correct description. Claude fixed that, working from the example. I feel like I need to internalize this more — for the current crop of AI, examples seem to be far more valuable than rules.

Second, the spec included my desire to auto-shutdown machines once I no longer use them, just to make sure I don’t forget to turn the lights off when leaving the room. Claude grilled me on what precisely I want there, and I asked it to DWIM the thing.

The spec ended up being 6KiB of English prose. The final implementation was 14KiB of TypeScript. I wasn’t keeping the spec and the implementation perfectly in sync, but I think they ended up pretty close in the end. Which means that prose specifications are somewhat more compact than code, but not much more compact.

My next step was to try to just one-shot this. Ok, this is embarrassing, and I usually avoid swearing in this blog, but I just typoed that as “one-shit”, and, well, that is one flavorful description I won’t be able to improve upon. The result was just not good (more on why later), so I almost immediately decided to throw it away and start a more incremental approach.

In my previous vibe-post , I noticed that LLMs are good at closing the loop. A variation here is that LLMs are good at producing results, and not necessarily good code. I am pretty sure that, if I had let the agent iterate on the initial script and actually run it against AWS, I would have gotten something working.
I didn’t want to go that way for three reasons: And, as I said, the code didn’t feel good, for these specific reasons:

The incremental approach worked much better; Claude is good at filling in the blanks. The very first thing I did for was manually typing in: Then I asked Claude to complete the function, and I was happy with the result. Note the Show, Don’t Tell: I am not asking Claude to avoid throwing an exception and fail fast instead. I just give the function, and it code-completes the rest. I can’t say that the code inside is top-notch. I’d probably have written something more spartan. But the important part is that, at this level, I don’t care. The abstraction for parsing CLI arguments feels right to me, and the details I can always fix later.

This is how this overall vibe-coding session transpired — I was providing structure, Claude was painting by the numbers. In particular, with that CLI parsing structure in place, Claude had little problem adding new subcommands and new arguments in a satisfactory way. The only snag was that, when I asked to add an optional path to , it went with , while I strongly prefer . Obviously, it’s better to pick your null in JavaScript and stick with it. The fact that is unavoidable predetermines the winner. Given that the argument was added as an incremental small change, course-correcting was trivial. The null vs undefined issue perhaps illustrates my complaint about the code lacking character. is the default non-choice. is an insight, which I personally learned from the VS Code LSP implementation.

The hand-written skeleton/vibe-coded guts approach worked not only for the CLI. I wrote and then asked Claude to write the body of a particular function according to the SPEC.md. Unlike with the CLI, Claude wasn’t able to follow this pattern itself. With one example it’s not obvious, but the overall structure is that is the AWS-level operation on a single box, and is the CLI-level control flow that deals with looping and parallelism.
When I asked Claude to implement , without myself doing the / split, Claude failed to notice it and needed a course correction. However, Claude was massively successful with the actual logic. It would have taken me hours to acquire the specific, non-reusable knowledge needed to write: I want to be careful — I can’t vouch for correctness and especially completeness of the above snippet. However, given that the nature of the problem is such that I can just run the code and see the result, I am fine with it. If I were writing this myself, trial-and-error would totally be my approach as well.

Then there’s synthesis — with several instance commands implemented, I noticed that many started with querying AWS to resolve a symbolic machine name, like “1”, to the AWS name/IP. At that point I realized that resolving symbolic names is a fundamental part of the problem, and that it should only happen once, which resulted in the following refactored shape of the code:

Claude was ok with extracting the logic, but messed up the overall code layout, so the final code motions were on me. “Context” arguments go first , not last; a common prefix is more valuable than a common suffix because of visual alignment. The original “one-shotted” implementation also didn’t do up-front querying. This is an example of a shape of a problem I only discover when working with code closely.

Of course, the script didn’t work perfectly the first time, and we needed quite a few iterations on the real machines, both to fix coding bugs as well as gaps in the spec. That was an interesting experience of speed-running rookie mistakes. Claude made naive bugs, but was also good at fixing them. For example, when I first tried to after , I got an error. Pasting it into Claude immediately showed the problem. Originally, the code was doing and not . The former checks if the instance is logically created, the latter waits until the OS is booted.
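I can’t know which calls the script actually used, but the two conditions described map onto the AWS CLI’s documented EC2 waiters. A sketch of the ordering, with a hypothetical helper name:

```shell
# Hypothetical helper: wait until a freshly created instance is usable.
# `instance-running` returns once EC2 considers the instance started;
# `instance-status-ok` additionally waits for the OS to boot and pass
# status checks (still not the same as sshd accepting connections).
wait_for_boot() {
    aws ec2 wait instance-running --instance-ids "$1"
    aws ec2 wait instance-status-ok --instance-ids "$1"
}
```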
It makes sense that these two exist, and the difference is clear (and it’s also clear that OS booted != SSH daemon started). Claude’s value here is in providing specific names for the concepts I already know to exist.

Another fun one was about the disk. I noticed that, while the instance had an SSD, it wasn’t actually used. I asked Claude to mount it as home, but that didn’t work. Claude immediately asked me to run and that log immediately showed the problem. This is remarkable! 50% of my typical Linux debugging day is wasted not knowing that a useful log exists, and the other 50% is for searching for the log I know should exist somewhere . After the fix, I lost the ability to SSH. Pasting the error immediately gave the answer — by mounting over , we were overwriting ssh keys configured prior. There were a couple more iterations like that. Rookie mistakes were made, but they were debugged and fixed much faster than my personal knowledge allows (and again, I feel that this is trivia knowledge, rather than deep reusable knowledge, so I am happy to delegate it!).

It worked satisfactorily in the end, and, what’s more, I am happy to maintain the code, at least to the extent that I personally need it. Kinda hard to measure the productivity boost here, but, given just the sheer number of CLI flags required to make this work, I am pretty confident that time was saved, even factoring in the writing of the present article!

I’ve recently read The Art of Doing Science and Engineering by Hamming (of distance and code), and one story stuck with me:

A psychologist friend at Bell Telephone Laboratories once built a machine with about 12 switches and a red and a green light. You set the switches, pushed a button, and either you got a red or a green light. After the first person tried it 20 times they wrote a theory of how to make the green light come on. The theory was given to the next victim and they had their 20 tries and wrote their theory, and so on endlessly.
The stated purpose of the test was to study how theories evolved. But my friend, being the kind of person he was, had connected the lights to a random source! One day he observed to me that no person in all the tests (and they were all high-class Bell Telephone Laboratories scientists) ever said there was no message. I promptly observed to him that not one of them was either a statistician or an information theorist, the two classes of people who are intimately familiar with randomness. A check revealed I was right!

https://github.com/catern/rsyscall
https://peter.bourgon.org/blog/2011/04/27/remote-development-from-mac-to-linux.html
https://github.com/dsherret/dax

Spawning VMs takes time, and that significantly reduces the throughput of agentic iteration. No way I let the agent run with a real AWS account, given that AWS doesn’t have a fool-proof way to cap costs. I am fairly confident that this script will be a part of my workflow for at least several years, so I care more about long-term code maintenance than the immediate result.

It wasn’t the code that I would have written; it lacked my character, which made it hard for me to understand it at a glance. The code lacked any character whatsoever. It could have worked, it wasn’t “naively bad”, like the first code you write when you are learning programming, but there wasn’t anything good there. I never know what the code should be up-front.
I don’t design solutions, I discover them in the process of refactoring. Some of my best work was spending a quiet weekend rewriting large subsystems implemented before me, because, with an implementation at hand, it was possible for me to see the actual, beautiful core of what needs to be done. With a slop-dump, I just don’t get to even see what could be wrong.

In particular, while you are working the code (as in “wrought iron”), you often go back to requirements and change them. Remember that ambiguity of my request to “shut down idle cluster”? Claude tried to DWIM and created some horrific mess of bash scripts, timestamp files, PAM policy and systemd units. But the right answer there was “let’s maybe not have that feature?” (in contrast, simply shutting the machine down after 8 hours is a one-liner).
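For the curious, the “one-liner” version could be as simple as scheduling shutdown(8) at boot. This is my sketch, not necessarily what the script does, and the helper name is made up:

```shell
# Hypothetical helper: power the instance off N minutes after it runs.
# 480 minutes = 8 hours; requires root on the machine itself.
schedule_lights_out() {
    shutdown -h "+$1"
}

# e.g. invoked once from the instance's boot/user-data script:
# schedule_lights_out 480
```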

2 views

Building Multi-Agent Systems (Part 3)

It’s now been over two years since I started working seriously with agents, and if there is one constant, it is that the "meta" for building them seems to undergo a hard reset every six months. In Part 1 (way back in December 2024) , we were building highly domain-specific multi-agent systems. We had to augment the gaps in model capabilities by chaining together several fragile sub-agent components. At the time, it was unclear just how much raw model improvements would obsolete those architectures. In Part 2 (July 2025) , LLMs had gotten significantly better. We simplified the architecture around "Orchestrator" agents and workers, and we started to see the first glimmer that scripting could be used for more than just data analysis. Now, here we are in Part 3 (January 2026), and the paradigm has shifted again. It is becoming increasingly clear that the most effective agents are solving non-coding problems by using code, and they are doing it with a consistent, domain-agnostic harness. Cartoon via Nano Banana. In this post, I want to provide an update on the agentic designs I’ve seen (from building agents, using the latest AI products, and talking to other folks in agent-valley 1 ) and break down how the architecture has evolved yet again over the past few months. Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. We’ve seen a consolidation of tools and patterns since the last update. While the core primitives remain, the way we glue them together has shifted from rigid architectures to fluid, code-first environments. What has stayed the same: Tool-use LLM-based Agents: We are still fundamentally leveraging LLMs that interact with the world via “tools”. Multi-agent systems for taming complexity: As systems grow, we still decompose problems. However, the trend I noted in Part 2 (more intelligence means less architecture) has accelerated. 
We are relying less on rigid “assembly lines” and more on the model’s inherent reasoning to navigate the problem space. Long-horizon tasks: We are increasingly solving tasks that take hours of human-equivalent time. Agents are now able to maintain capability even as the context window fills with thousands of tool calls. The human-equivalent time-horizon continues to grow 2 . What is different: Context Engineering is the new steering: It is becoming increasingly less about prompt, tool, or harness “engineering” and more about “context engineering” (organizing the environment). We steer agents by managing their file systems, creating markdown guide files, and progressively injecting context. Sandboxes are default: Because agents are increasingly solving non-coding problems by writing code (e.g., “analyze this spreadsheet by writing a Python script” rather than “read this spreadsheet row by row”), they need a safe place to execute that code. This means nearly every serious agent now gets a personal ephemeral computer (VM) to run in. 3 Pragmatic Tool Calling: We are moving toward programmatic tool calling where agents write scripts to call tools in loops, batches, or complex sequences. This dramatically improves token efficiency (the agent reads the output of the script, not the 50 intermediate API calls) and reduces latency. Domain-agnostic harnesses: As models improve, the need for bespoke, product-specific agent harnesses is vanishing. For the last several agents I’ve built, it has been hard to justify maintaining a custom loop when I can just wrap a generic implementation like Claude Code (the Agents SDK ). The generic harness is often “good enough” for 90% of use cases. As a side effect of these changes, the diverse zoo of agent architectures we saw in 2024/2025 is converging into a single, dominant pattern. I’ll break this down into its core components. This diagram illustrates the convergence of agent design in 2026. 
We see the shift from rigid assembly lines to a fluid Planner and Builder (Execution Agent) loop, which spawns ephemeral Task Agents for sub-routines. Crucially, the entire system is grounded in a Code Execution Sandbox , allowing the agent to solve non-coding problems by writing scripts and leveraging Mount/API tools for massive context injection rather than fragile, individual tool calls. Planning, Execution, and Tasks One of the largest shifts in the last 18 months is the simplification and increased generalizability of subagents. In the past, we hand-crafted specific roles like "The SQL Specialist" or "The Researcher." Today, we are starting to see only three forms of agents working in loops to accomplish a task: Plan Agents — An agent solely tasked with discovery, planning, and process optimization 4 . It performs just enough research to generate a map of the problem, providing specific pointers and definitions for an execution agent to take over. Execution Agents — The builder that goes and does the thing given a plan. It loads context from the pointers provided by the planner, writes scripts to manipulate that context, and verifies its own work. Task Agents — A transient sub-agent invoked by either a plan or execution agent for parallel or isolated sub-operations. This might look like an "explorer" agent for the planner or a "do operation on chunk X/10" for the execution agent. These are often launched dynamically as a tool-call with a subtask prompt generated on the fly by the calling agent. This stands in stark contrast to the older architectures (like the "Lead-Specialist" pattern I wrote about in Part 2 ), where human engineers had to manually define the domain boundaries and responsibilities for every subagent. These new agents need an environment to manage file-system context and execute dynamically generated code, so we give them a VM sandbox. This significantly changes how you think about tools and capabilities. 
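The programmatic tool calling pattern described above is easiest to see in code. A minimal sketch, assuming a hypothetical `get_ticket` API tool exposed inside the sandbox (stubbed here for illustration): instead of 50 individual tool calls flowing through the model, the agent writes one script and only the summary reaches its context window.

```python
# Hypothetical API tool the harness would expose inside the sandbox.
# In a real system this would be a thin REST wrapper; stubbed for illustration.
def get_ticket(ticket_id: int) -> dict:
    return {"id": ticket_id, "status": "open" if ticket_id % 3 else "closed"}

# The agent-written script: loop over 50 tickets and aggregate in-process.
# The 50 intermediate results never touch the model's context window.
counts: dict[str, int] = {}
for ticket_id in range(1, 51):
    status = get_ticket(ticket_id)["status"]
    counts[status] = counts.get(status, 0) + 1

print(counts)  # → {'open': 34, 'closed': 16}
```

The token savings come from the final `print`: one short line of output replaces fifty tool-call round trips.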
To interact with the VM, there is a common set of base tools that have become standard 5 across most agent implementations: Bash — Runs an arbitrary bash command. Models like Claude often make assumptions about what tools already exist in the environment, so it is key to have a standard set of unix tools pre-installed on the VM (python3, find, etc.). Read/Write/Edit — Basic file system operations. Editing in systems like Claude Code is often done via a format which tends to be a more reliable way of performing edits. Glob/Grep/LS — Dedicated filesystem exploration tools. While these might feel redundant with , they are often included for cross-platform compatibility and as a more curated, token-optimized alias for common operations. These can be deceptively simple to define, but robust implementation requires significant safeguards. You need to handle bash timeouts, truncate massive read results before they hit the context window, and add checks for unintentional edits to files. With the agent now able to manipulate data without directly touching its context window or making explicit tool calls for every step, you can simplify your custom tools. I’ve seen two primary types of tools emerge: "API" Tools — These are designed for programmatic tool calling . They look like standard REST wrappers for performing CRUD operations on a data source (e.g., rather than a complex ). Since the agent can compose these tools inside a script, you can expose a large surface area of granular tools without wasting "always-attached" context tokens. This also solves a core problem with many API-like MCP server designs . "Mount" Tools — These are designed for bulk context injection into the agent's VM file system. They copy over and transform an external data source into a set of files that the agent can easily manipulate. For example, might write JSON or Markdown files directly to a VM directory like 6 . 
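Those safeguards are where most of the implementation effort goes. A minimal sketch of a Bash tool with a timeout and output truncation; the limits and function name here are illustrative stand-ins, not any particular SDK's API:

```python
import subprocess

MAX_OUTPUT = 2_000  # chars returned to the model; illustrative limit

def bash_tool(command: str, timeout_s: int = 30) -> str:
    """Run a shell command, bounding both wall-clock time and output size."""
    try:
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"error: command timed out after {timeout_s}s"
    output = result.stdout + result.stderr
    if len(output) > MAX_OUTPUT:
        # Truncate before the output ever reaches the context window.
        extra = len(output) - MAX_OUTPUT
        output = output[:MAX_OUTPUT] + f"\n[truncated {extra} chars]"
    return output

print(bash_tool("seq 1 5"))             # small output passes through
print(bash_tool("yes | head -c 5000"))  # oversized output gets truncated
```

A production version would also sandbox the working directory and filter environment variables, but the timeout and truncation alone prevent the two most common failure modes: a hung command stalling the loop, and a huge result blowing the context window.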
A script-powered agent also makes you more creative about how you use code to solve non-coding tasks. Instead of building a dedicated tool for every action, you provide the primitives for the agent to build its own solutions: You might prefer the agent build artifacts indirectly through Python scripts (PowerPoint via python-pptx) and then run separate linting scripts to verify the output programmatically, rather than relying on a black-box or hand-crafted tool. You can give the agent access to raw binary files (PDFs, images) along with pre-installed libraries like or tools, letting it write a script to extract exactly what it needs instead of relying on pre-text-encoded representations. You can represent complex data objects as collections of searchable text files—for example, mounting a GitHub PR as and so the agent can use standard tools to search across them. You might use a “fake” git repository in the VM to simulate draft and publishing flows, allowing the agent to commit, branch, and merge changes that are translated into product concepts. You can seed the VM with a library of sample Bash or Python scripts that the agent can adapt or reuse at runtime, effectively building up a dynamic library of “skills”. Context engineering (as opposed to tool design and prompting) becomes increasingly important in this paradigm for adapting an agnostic agent harness to be reliable in a specific product domain. There are several great guides online now so I won’t go into too much detail here, but the key concepts are fairly universal. My TLDR is that it often breaks down into three core strategies: Progressive disclosure — You start with an initial system prompt and design the context such that the agent efficiently accumulates the information it needs only as it calls tools. You can include just-in-time usage instructions in the output of a tool or pre-built script. 
If an agent tries and fails, the tool output can return the error along with a snippet from the docs on how to use it correctly. You can use markdown files placed in the file system as optional guides for tasks. A in the VM root lists available capabilities, but the agent only reads specific files like if and when it decides it needs to run a query. Context indirection — You leverage scripting capabilities to let the agent act on context without actually seeing it within its context window. Instead of reading a 500MB log file into context to find an error, the agent writes a or script to find lines matching “ERROR” and only reads the specific output of that script. You can intercept file operations to perform “blind reads.” When an agent attempts to read a placeholder path like , the harness intercepts this write, performs a search, and populates the file with relevant snippets just in time. Simplification — You use pre-trained model priors to reduce the need for context and rely more on agent intuition. If you have a complex internal graph database, you can give the agent a -compatible wrapper. The model already knows how to use perfectly, so zero-shot performance is significantly higher than teaching it a custom query language. If your system uses a legacy or obscure configuration format (like XML with custom schemas), you can automatically convert it to YAML or JSON when the agent reads it, and convert it back when the agent saves it. For agents that need to perform increasingly long-running tasks, we still can’t completely trust the model to maintain focus over thousands of tokens. Context decay is real, and status indicators from early in the conversation often become stale. To combat this, agents like Claude Code often use three techniques to maintain state: Todos — This is a meta-tool the agent uses to effectively keep a persistent TODO list (often seeded by a planning agent). 
While this is great for the human-facing UX, its primary function is to re-inject the remaining plan and goals into the end of the context window, where the model pays the most attention. 7 Reminders — This involves the harness dynamically injecting context at the end of tool-call results or user messages. The harness uses heuristics (e.g., "10 tool calls since the last reminder about X" or "user prompt contains keyword Y") to append a hint for the agent. For example: Automated Compaction — At some point, nearly the entire usable context window is taken up by past tool calls and results. Using a heuristic, the context window is passed to another agent (or just a single LLM call) to summarize the history and "reboot" the agent from that summary. While the effectiveness of resuming from a summary is still somewhat debated, it is better than hitting the context limit, and it works significantly better when tied to explicit checkpoints in the input plan. If you built an agent more than six months ago, I have bad news: it is probably legacy code. The shift to scripting and sandboxes is significant enough that a rewrite is often better than a retrofit. Here is a quick rubric to evaluate if your current architecture is due for a refactor: Harness: Are you maintaining a domain-specific architecture hardcoded for your product? Consider refactoring to a generic, agnostic harness that delegates domain logic to context and tools, or wrapping a standard implementation like the Agents SDK. Capabilities: Are your prompts cluttered with verbose tool definitions and subagent instructions? Consider moving that logic into “Skills” (markdown guides) and file system structures that the agent can discover progressively. Tools: Do you have a sprawling library of specific tools (e.g., , , )? Consider deleting them. If the agent has a sandbox, it can likely solve all of those problems better by just writing a script. 
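The reminder technique above is simple to implement in the harness. A sketch, with the trigger threshold and reminder text as hypothetical stand-ins:

```python
REMINDER = "<reminder>Update the TODO list as you finish steps.</reminder>"

def maybe_append_reminder(tool_result: str, calls_since_reminder: int,
                          every_n_calls: int = 10) -> tuple[str, int]:
    """Append a reminder to a tool result every N tool calls.

    Returns the (possibly augmented) result and the updated counter.
    """
    calls_since_reminder += 1
    if calls_since_reminder >= every_n_calls:
        # Inject at the end of the tool result, where the model
        # pays the most attention.
        return tool_result + "\n\n" + REMINDER, 0
    return tool_result, calls_since_reminder

counter = 0
for i in range(12):
    result, counter = maybe_append_reminder(f"tool output {i}", counter)
    if REMINDER in result:
        print(f"reminder injected on call {i + 1}")  # → call 10
```

Real harnesses layer more heuristics on top (keyword triggers in user prompts, per-topic counters), but the core mechanism is just this: a counter in the harness, and a string appended to the newest message.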
We are still in the early days of this new “agent-with-a-computer” paradigm, and while it solves many of the reliability issues of 2025, it introduces new unknowns. Sandbox Security: How much flexibility is too much? Giving an agent a VM and the ability to execute arbitrary code opens up an entirely new surface area for security vulnerabilities. We are now mixing sensitive data inside containers that have (potentially) internet access and package managers. Preventing complex exfiltration or accidental destruction is an unsolved problem. The Cost of Autonomy: We are no longer just paying for inference tokens; we are paying for runtime compute (VMs) and potentially thousands of internal tool loops. Do we care that a task now costs much more if it saves a human hour? Or are we just banking on the “compute is too cheap to meter” future arriving faster than our cloud bills? The Lifespan of “Context Engineering”: Today, we have to be thoughtful about how we organize the file system and write those markdown guides so the agent can find them. But is this just a temporary optimization? In six months, will models be smart enough (and context windows cheap enough) that we can just point them at a messy, undocumented data lake and say “figure it out”? Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. My new meme name for the SF tech AI scene, we’ll see if it catches on. I actually deeply dislike how this is often evidenced by METR Time Horizons — but at the same time I can’t deny just how far Opus 4.5 can get in coding tasks compared to previous models. See also Davis’ great post: You can see some more details on how planning and execution handoff looks in practice in Cursor’s Scaling Agents (noting that this browser thing leaned a bit toward marketing hype for me; still a cool benchmark and technique) and Anthropic’s Effective harnesses for long-running agents . 
I’m a bit overfit to Claude Code-style tools ( see full list here ), but my continued understanding is that they are fairly similar across SDKs (or will be). We do this a ton at work and I found that Vercel GTM Engineering does something that looks quite similar. Anthropic calls this “structured note taking” and Manus also discusses this in its blog post.

0 views
Grumpy Gamer 3 months ago

This Time For Sure

I think 2026 is the year of Linux for me. I know I’ve said this before, but it feels like Apple has lost its way. Liquid Glass is the last straw, plus their draconian desire to lock everything down gives me moral pause. It is only a matter of time before we can’t run software on the Mac that wasn’t purchased from the App Store. I use Linux on my servers so I am comfortable using it, just not in a desktop environment. Some things I worry about: A really good C++ IDE. I get a lot of advice for C++ IDEs from people who only use them now and then or just to compile, but don’t live in them all day and need to visually step into code and even ASM. I worry about CLion but am willing to give it a good try. Please don’t suggest an IDE unless you use them for hardcore C++ debugging. I will still make Mac versions of my games and code signing might be a problem. I’ll have to look, but I don’t think you can do it without a Mac. I can’t do that on a CI machine because for my pipeline the CI machine only compiles the code. The .app is built locally and that is where the code signing happens. I don’t want to spin up a CI machine to make changes when the engine didn’t change. My build pipeline is a running bash script, and I don’t want to be hopping between machines just to do a build (which I can do 3 or 4 times a day). The only monitor I have is a Mac Studio monitor. I assume I can plug a Linux machine into it, but I worry about the webcam. It wouldn’t surprise me if Apple made it Mac only. The only keyboard I have is a Mac keyboard. I really like the keyboard, especially how I can unlock the computer with the touch of my finger. I assume something like this exists for Linux. I have an iPhone but I only connect it to the computer to charge it. So not an issue. I worry about drivers for sound, video, webcams, controllers, etc. I know this is all solvable but I’m not looking forward to it. I know from releasing games on Linux our number-one complaint is related to drivers. 
Choosing a distro. Why is this so hard? A lot of people have said that it doesn’t really matter so just choose one. Why don’t more people use Linux on the Desktop? This is why. To a Linux desktop newbie, this is paralyzing. I’m going to miss Time Machine for local backups. Maybe there is something like it for Linux. I really like the Apple M processors. I might be able to install Linux on Mac hardware, but then I really worry about drivers. I just watched this video from Veronica Explains on installing Linux on Mac silicon. The big, big worry is that there is something big I forgot. I need this to work for my game dev. It’s not a weekend hobby computer. I’ve said I was switching to Linux before, we’ll see if it sticks this time. I have a Linux laptop but when I moved I didn’t turn it on for over a year and now I get BIOS errors when I boot. Some battery probably went dead. I’ve played with it a bit and nothing seems to work. It was an old laptop and I’ll need a new faster one for game dev anyway. This will be a long, well-thought-out journey. Stay tuned for the “2027 - This Time For Sure” post.

1 views