Latest Posts (20 found)

Adding a feature to a closed-source app

I use Audiobookshelf (abbreviated ABS) for all my legal audiobooks that I bought legally, and I really like it. I also use the Smart Audiobook Player (abbreviated SABP) Android app, which I also bought (legally this time) to listen to books, because it has the strongest featureset out of all the apps I’ve tried, particularly when it comes to navigating around books. Unfortunately, there’s one problem: SABP can’t synchronize my reading progress with the ABS server, which is inconvenient for me. I use SABP when cycling or walking, but use other apps that integrate deeply with ABS (mostly Lissen and ABS’s own app) on my car’s Android console, and the lack of syncing between the two is a major pain. The ABS-compatible apps are mostly open source, and what better way to contribute to open source than to submit some patches that add the features I like? “However”, I thought, “why not not do that, and instead see if I can add Audiobookshelf syncing to the app?” “Yes”, I decided, “this sounds reasonable, despite SABP being a closed-source Android app, a platform with which I have zero familiarity”. What I do have familiarity with, though, is telling Claude what to do and steering it along. Therefore, I decided I would do the impossible, and use LLMs to add ABS syncing to SABP!

The first step was to see whether this was possible at all. Android apps come as APKs, which are just zip files containing bytecode. The first thing I did was to ask Claude to decompile the app (even though I didn’t really know if that was possible, or how it was done). Luckily, all this required was to run apktool and a Java decompiler on the files in the APK. apktool is a utility that turns bytecode into a textual representation (called smali) so that it can be edited. This is a lossless, reversible process (which means you can edit the resulting code and recompile it back into the app), but the textual representation is basically assembly, and pretty hard to work with. A Java decompiler, on the other hand, decompiles to (hopefully) readable Java, but is useful only for illustration; you can’t recompile it back into an app, and you can’t really edit it in any way. Some developers use obfuscation tools (like ProGuard) to make their decompiled code much more opaque and hard to read. So, the question at this stage was whether the app could be decompiled, and how readable the resulting output would be.
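For reference, the apktool round trip looks something like this (the file names here are illustrative, not the ones I actually used):

```sh
# Decode the APK’s bytecode into editable smali, plus resources and manifest.
apktool d SmartAudiobookPlayer.apk -o sabp-src

# ...edit the smali, drop in extra code...

# Rebuild an (unsigned) APK from the edited tree.
apktool b sabp-src -o sabp-modded.apk
```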
Running the tools gave some promising results: the app was fairly readable, with even human-readable class names having been partially preserved! A lot of the code was obfuscated, with meaningless one- and two-letter names, but I lucked out and enough relevant code was readable that I didn’t have to spend hours piecing things together.

This was encouraging, but I still didn’t know whether I could easily inject syncing code into the app. To begin my due diligence, I asked Claude to trace whether there was a point where we could add a hook to send our position to the server. After a bit of digging around, it discovered that one function was being called by every code path that saved progress to disk: regular ticks, pauses, file changes, backgrounding, they all saved progress using it. The existence of this code path was a stroke of luck, as it meant that I had found a natural point to hook my progress updating into, but Claude did a lot of work to verify that the code paths actually converged. This was great, we had found a single spot where we could hook things, but how could we do the hooking itself?

We can’t edit or recompile the decompiled Java, and smali, which we can edit and recompile, is a real pain to write anything significant in. Still, though, the impossible was slowly drifting within my reach. The second part of due diligence was to see for myself how the ABS API worked, so I knew what to send in the payload if I ended up being able to hook into the syncing. I sent a few requests by hand, but kept getting some weirdness. The times I was submitting didn’t match what I was getting back, and the progress indicator was out of sync with the submitted position in seconds. This was surprising to me, because I know ABS progress syncing works fine with other apps. After some trial and error, I realized that during my testing I had accidentally marked the book I was testing with as finished, and ABS was resetting the progress when the book transitioned from “finished” to “not finished”. This is a surprising thing to happen, since I’d expect the server to reset when I’m going the other way (i.e. when I finish the book), but I guess the rationale is that I’m starting the book fresh if I mark an already-finished book as not finished. When I used a non-finished book as the target, the API started responding reasonably, and I had all the info on the endpoints I needed, with their payload shapes, which I gave to Claude. It’s important for me to do this sort of experimentation myself, as often edge cases will be hiding in these API contract boundaries, and I want to build a good mental model of how the change will work before I ask the LLM to implement it.

Having the API calls was good, but writing smali code to perform an HTTP request and send/receive JSON would still be taxing work, even for an LLM, and I couldn’t really help here. Luckily, Claude knew that Android makes modding significantly easier than other platforms: we didn’t have to write smali at all! We could write all the syncing code in bog-standard Java, compile it with javac into bytecode, create the necessary dex file with d8 (which ships with the regular Android SDK!), and put that into the apktool tree. Then, we just needed a tiny bit of smali code in the progress-saving function to jump to our compiled Java code, and everything should work. This works because Android itself natively supports multiple dex files in one APK, so you don’t have to hack around anything.
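The injected glue can be as small as a single static call at the point where the app saves progress. This is a sketch rather than the actual patch; the class name, method name, and registers are all hypothetical:

```smali
# Hypothetical hook injected into the app’s progress-saving method.
# v1 holds the book identifier (a String), v2/v3 hold the position (a long).
invoke-static {v1, v2, v3}, Lcom/example/abssync/SyncHook;->onProgressSaved(Ljava/lang/String;J)V
```

Everything heavy (HTTP, JSON, error handling) then lives in ordinary Java inside the extra dex file.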
The investigation was finished, but now we also needed to actually build the thing (an affair whose success was still not guaranteed). Writing the code for this and compiling it into an APK was all Claude, with steering from me. You can read about my exact LLM workflow in my recent post, but it roughly consists of planning (using ticket to write… tickets), implementation, and review steps. Claude discovered that apktool 2.7.0 doesn’t like $-prefixed filenames in the resource table, and decided to use the original manifest, which was fine because we weren’t using custom resources. It also caught a timing bug in the smali patch, where it needed to call a function after another one was run, otherwise the BookData field would be stale. These issues did affect the final implementation, and I was relieved that Claude was smart enough to catch and fix them. Claude did a lot of heavy lifting here, and we ended up with ~550 lines of Java, and some smali magic to jump to our Java code.

The code review phase was all LLMs (Opus 4.6/GPT-5.5), and it’s a step I never skip, as I’ve found that it catches most of the bugs. In one case, Claude had written thirty lines of reflection code because it assumed a setter didn’t exist. The reviewer caught that the setter existed, and had Claude use it directly and remove the superfluous code. This is a pattern I see very frequently in LLM-assisted development, where one model will have big blind spots, leading to bugs or departures from the desired functionality. A second review pass with another model generally fixes this, though I’m not sure whether it’s because different models spot different things (like “you can’t spot your own typos”, for LLMs) or because a second, focused review pass makes the model pay more attention. I suspect it’s a combination of the two. The reviewer also caught a mistaken compression of the resources file, which would have caused the APK to silently fail to install on my device, even though it looked fine. There was also a race condition that was flagged and fixed in this step, and an instruction to clamp the end timestamp to the book’s length, though I would hope that this check happens on the server too.

The codey bits having been done, I had to decide how to handle book matching and server configuration. I needed to make a decision on two things:

- The hostname and API key of the ABS server.
- The ID of each book on the server, so the app could submit progress to the specific book without having to rely on name matching.

There were a few options, one of them being adding an “Audiobookshelf” section to the settings, and adding the server’s hostname and API key there, but this was too much work, especially trying to find call sites to patch into existing screens. For the book matching, Claude recommended that we do a lookup of the book by name every time we loaded progress, but that was brittle and would break with more than one book of the same name. I decided to use a config file in the book directory, which was a simple JSON file that looked like this:
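(The exact key names below are reconstructed for illustration; the real file simply carries the server details and the book’s ID.)

```json
{
  "server": "https://abs.example.com",
  "api_key": "<ABS API token>",
  "book_id": "<ID of this book on the server>"
}
```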
This way, the app could load everything it needed with minimal fuss (the Java code could simply read this file at startup).

There was something that Claude didn’t catch, though, and it actually recommended the opposite: its advice was to only send the timestamp to the server if it was later than the server’s timestamp (i.e. if it was later in the book). I pointed out to Claude that this would create a significant problem where, if you seeked to a later position for some reason, you’d never be able to come back from it. The app would keep syncing your position to the later one when loaded, and never update the server’s timestamp, effectively not only invalidating the syncing, but also forcing you to remember your position manually, which is quite a big regression from current functionality. This bug would also cause other apps to get their position overwritten with the later one every time SABP loaded. Claude quickly agreed that this was an issue, and changed the code to sync all seeks.

Testing it out, I realized that Claude never retrieved the book’s position from the server at all. I pointed out that this was necessary to avoid clobbering the position in other apps, because I might use Lissen (and progress there), go back to SABP, and have my (true) progress overwritten by the old position. This was a serious data loss issue that the LLMs completely missed, both in planning/implementation and in review, and one that human involvement solved.

The code was now in good enough shape to actually try out, which led to another problem. Android, like basically any modern platform, requires apps to be signed by the developer before they can run. Unfortunately, I’m not the developer of SABP, which means I didn’t have access to the key used to sign the app. This isn’t a big obstacle, since apps can be signed by any key (though Google is trying to force us to show them ID to run our apps on our devices), so I just created my own key and signed the recompiled APK with it. Unfortunately, this does have one downside: the resigned app can’t be installed over the old one; you need to uninstall the old app (and probably lose data) and install the new one again. I opened the app up, started playing a book, and verified that the ABS server position got updated. I didn’t even lose any settings, because SABP keeps its settings in a file next to the audiobooks, which wasn’t deleted when uninstalling.

Modifying the application to add the feature I wanted worked fine, and, with the increased skill the LLMs gave me, the lack of source access didn’t block me (it merely posed a sizable problem). However, there was still significant friction (what with the decompile dance, smali, figuring out call sites, etc), and I got very lucky that the code wasn’t more obfuscated. Even now that the functionality has been implemented, I can’t share the output, both because of potential legal issues and because it’s just a hassle and will break every release.

The journey was fun, and having an app that works how I want it is helpful, but there’s a wider point: before LLMs, the code’s license didn’t matter much for end users wanting to modify their software. Whether the source was open or closed, the biggest reason people didn’t mod their software was just that they didn’t know how to. LLMs have expanded the candidate pool, and, now that many more people can write code that works, the availability of the source is the most important hurdle. The set of people who can now modify their software has increased by orders of magnitude, and includes people who always had good ideas, or good product sense, but didn’t have the skills to make them a reality. In this example, the feature I implemented will be used by me, and basically nobody else, because closed-source software has close to no mechanism for change ingestion. Open source software has always had concrete ways to accept contributions from others: you’d simply make the change you wanted and submit it to the maintainers for inclusion/rework/feedback. This contribution process is even more important now that code can be generated orders of magnitude more cheaply, and the fact that it exists is an important advantage that open-source software has over closed-source.

When starting out, I thought this would be impossible, but each step turned out to be very doable. Where a few years ago only a handful of people could reverse engineer an app, now it’s within reach of the average developer with a free afternoon. I’m really happy with the way this feature turned out, but this adventure only made me realize that open source software just aligns with my interests so much more. I’m going to do what I joked I wouldn’t at the start of this article, and switch to Lissen as my audiobook player. I hadn’t used it in a while, but, while writing this post, I fired it up again, and it seems to have gained a few features, plus it’s always been very well-designed and looks great. I guess I’m not going to need SABP any more, but, well, the journey is the destination.


From RSS to Atom

Yesterday, I switched my website from RSS feeds to Atom feeds. In case you are wondering whether you have somehow landed on an ancient post from 2010, no, you have not. Yes, this is the year 2026, and I have finally switched from RSS feeds to Atom feeds. Yes, I am fifteen, or perhaps twenty, years too late. I have always wanted to do this but could never make the time for it. Finally, it happened while I was giving my brain some rest from my ongoing algebraic graph theory studies. That's when I felt like spending a little time on my website and doing a little Lisp to change the feeds from RSS to Atom. I suppose this was impulse coding, a bit like impulse buying, except that I ended up with an Atom feed instead of a new book. I find it quite surprising that when I have plenty of time, it usually does not occur to me to do these things, but when I am too busy and really short of time, these little ideas possess me during the short breaks I take.

My personal website is one of my passion projects. Common Lisp is one of my favourite programming languages. So any time spent on this passion project using my favourite programming language is a very relaxing experience for me. It serves as an ideal break between intense study sessions. It took about an hour to implement the changes needed to make the switch from RSS to Atom. In the end, I could go back to my studies reinvigorated.

In case you are curious, here is the Git commit where I implemented the change from RSS to Atom: 596e1dd. As you might notice, a large portion of the change consists of replacing the old ID attribute in each post with a new UUID attribute. The old attribute's value was used as the value of the guid element in the RSS feeds. While an arbitrary short string could serve as the guid of the items in an RSS feed, the id element of the entries in an Atom feed needs to be a URI. It turns out UUID URNs are a common choice for such a URI. I ran a short shell command to replace all occurrences of the old attribute with the new one in every post. The rest of the changes went into the feed templates and the Common Lisp program that statically generates the feeds along with the website. For examples of the resulting feeds, see feed.xml and absurd.xml. The first is the main website feed and the second is an example of a tag-specific feed. Yes, the aforementioned Common Lisp program generates a feed for each tag. As of today, the main feed at feed.xml contains only two entries even though this website has over 200 pages. I explain the reason later in Temporary Workaround.

Here is an example Atom entry from my feeds:
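(The entry below is a reconstruction for illustration; the URL, UUID, and timestamps are made up.)

```xml
<entry>
  <title>From RSS to Atom</title>
  <link href="https://example.com/blog/from-rss-to-atom/"/>
  <id>urn:uuid:1f2d3c4b-5a6e-4d7f-8c9b-0a1b2c3d4e5f</id>
  <published>2026-05-02T00:00:00+00:00</published>
  <updated>2026-05-02T00:00:00+00:00</updated>
  <content type="html">
    &lt;p&gt;Yesterday, I switched my website from RSS feeds to Atom feeds. ...&lt;/p&gt;
  </content>
</entry>
```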
The ellipsis (...) denotes content I have omitted for the sake of brevity. I like how each entry in the feed now has its own UUIDv4. I also like that timestamps in an Atom feed are in the format specified in RFC 3339, which also happens to be a profile of ISO 8601. Further, I like that I can explicitly declare the content type to be HTML. Commonly used values for the content type attribute are text, html and xhtml. If it is html, the content should be escaped HTML. If it is xhtml, the content should be an XHTML element containing valid XHTML. Explicit content type support is likely the biggest advantage of Atom over RSS. In comparison, RSS 2.0 does not specify any way to declare the content type. So feed readers have to inspect the content and guess what the content type might be.

As I mentioned before, as of today, the main feed contains only two entries. That's because only new posts published since the migration to Atom are now included in the feed. This was done to avoid spamming subscribers. The Atom specification's requirement that each entry's ID must be a URI has caused the IDs of every entry to change. If I were to include the older posts from before the change in the feed, then those posts would appear as new unread items. Subscribers can find this quite annoying. In fact, I have received a few complaints about this in the past. So I was careful this time. I have a little one-liner workaround in my site generator to exclude posts published before this change from the feed. That was the only workaround I had to implement. Fortunately, my feed file had a neutral name like feed.xml, rather than a format-specific name like rss.xml, so I could avoid a URL change and the subsequent overhead of setting up redirects.

Does any of this matter today? I think it does. Contrary to the recurring claim that RSS and Atom are dead, most of the traffic to my personal website still comes from web feeds, even in 2026. Every time I publish a new post, I can see a good number of visitors arriving from feed readers. From the referrer data in my web server logs (which is not completely reliable but still offers some insight), the three largest sources of traffic to my website are web feeds, newsletters and search engines, in that order. On the topic of newsletters, I was surprised to discover just how many technology newsletters there are on the Web and how active their user bases are. Once in a while, a newsletter picks up one of my silly or quirky posts, which then brings a large number of visits from its followers. Back to the topic of web feeds, there is indeed a decent user base around RSS and Atom feeds. A good number of visitors to my website arrive by clicking a feed entry that shows up in their feed reader. I know this with some confidence by looking at the Referer (sic) headers of visits to my HTML pages and the subsequent browsing of the website, as opposed to the isolated and automated fetches of the XML feeds. So there must be a reasonably active base of users around web feeds. It is a bit like being part of an invisible social network that we know exists and that we can measure through indirect evidence.

I found these three resources useful while switching to Atom feeds:

- W3C Introduction to Atom
- W3C Feed Validation Service
- RFC 4287: The Atom Syndication Format

#web #technology


29th August 2026: a scenario

On 29 April 2026, a Korean security firm called Theori published 732 bytes of Python that breaks Linux container isolation. CopyFail (CVE-2026-31431) is a page-cache corruption bug in the kernel's crypto code. It's been sitting in production since 2017. A compromised pod on a shared Kubernetes node can corrupt binaries visible to every other container on that host, and to the host kernel itself. EKS, GKE, AKS, every shared-tenant node, every CI runner, every multi-tenant SaaS that took the cheap path on isolation - all exposed until patched. It took an AI tool four months to find it. Nine years of human eyes did not.

Container escape is bad. Despite an arguably poorly coordinated disclosure/mitigation response [1], it looks like a near miss rather than a catastrophe. But this class of bug - old, subtle, in a corner of the kernel that everyone assumed someone else had read - is exactly the class of bug that lives in every hypervisor stack underneath every cloud. Those bugs are still there. They just haven't been found yet. Here's a (fictional) story about what happens four months from now, on 29th August 2026.

As Europe basks in an extreme heatwave, many engineers are paged as EC2 instances hard-crash. Hacker News reacts to the news as per normal - another us-east-1 outage, AWS status showing green, eyes roll. Some commenters post, though, that many other AZs are showing issues, though not all servers are affected. Over the next hour, more and more machines go down. One Reddit user posts that they are having issues provisioning even fresh machines - as soon as they launch, they get moved into "unhealthy" and go down. A few minutes later, the entire AWS dashboard and API set goes down. Cloudflare Radar shows AWS network traffic dropping to a small percentage of what is normal. As many AWS-hosted services start going down - Atlassian, Stripe, Slack, PagerDuty - some comments on Twitter report issues with Linux-based Azure instances. Indeed, Cloudflare Radar shows significant drops in Azure traffic. News channels across Europe start leading with vague breaking-news headlines on outages across Amazon. They make sure to point out that this isn't an unusual occurrence, with normal service expected to resume like it always has, and mistakenly insist only US services are affected.

As the East Coast of the US starts its weekend, a very unusual step is taken. TV channels are briefed that POTUS will be doing an address to the nation at 8am EDT. Few connect the dots - with the emphasis being placed on a potential new strike in the Middle East, or an announcement on the Russia-Ukraine war. POTUS announces that there is a significant cybersecurity incident under way. The head of CISA (the Cybersecurity and Infrastructure Security Agency) gives a very vague but concerning warning. Americans are requested to charge their cell phones, and to await further news - reminded that there may be outages on IPTV-based services. POTUS rounds it out by speculating that China is behind the attack, despite his much-heralded reset with Beijing earlier in the year. Other Western leaders give similar addresses - with European leaders speculating on background that it is more likely to be Russia or North Korea than China behind the attack. The French president says "without doubt" this is a nation-state actor. While he doesn't publicly point to a specific country, he says those responsible will be brought to justice.

While these addresses happen, engineers at various banks are battling outages of their own.
Most concerningly, the first- and third-biggest card processors by volume in Europe have stopped accepting payments, returning cryptic error messages. While they have a multicloud strategy, they cannot move workloads off those two clouds successfully. Google Cloud Platform and smaller cloud providers - unaffected until now - start showing issues. While current workloads are unaffected, the huge spike in demand from enterprises activating their disaster recovery protocols simultaneously completely swamps available compute on alternate providers. One smaller cloud provider tweets that they are seeing 10,000 VM creation requests a second, draining their entire spare allocation in less than a minute. CEOs of major banks bombard Google and Oracle leadership with calls, offering blank cheques to secure failover compute. The calls go unanswered. WhatsApp groups throughout Europe start lighting up with misinformation that money has been stolen, amplified by many mobile apps simultaneously showing a "we are undertaking routine maintenance" fallback error, causing huge lines at ATMs and banks as people try to withdraw their savings.

As the chaos continues to grow, a press release is distributed from the leadership of AWS and Azure:

"At approximately 4am EDT this morning a critical and novel vulnerability was exploited in the Linux operating system. This has caused widespread global outages of Linux-based virtual machines. Our engineers are working with security services globally to mitigate the impact, and engineers across both Microsoft and AWS are working collaboratively to release emergency patches for affected software. Equally, we are working hard to understand the impact and will provide regular updates to the media. We sincerely apologize for the impact this is having on our customers and society at large."

Behind the scenes, it is chaos. Engineers have isolated the root causes - a complex interplay of vulnerabilities, with the most critical being an undiscovered logic error in the eBPF Linux subsystem that allows a hypervisor takeover. Curiously, no data has been stolen - a mistake in the exploit just leads to machines hard-crashing exactly 255 seconds after receiving the malicious payload. A few engineers question the sloppiness here, but leadership doubles down in their private communications with government that it has to be a nation state. The core issue, though, is that nearly all of Azure's and AWS's control plane is down. Attempts to "black start" it result in perpetual failures as various subsystems collapse under the intense traffic from VMs stuck in bootloops.

The first VM instances start up again. Restoration is painfully slow, with AWS struggling to get more than 2% of machines back online. Communication internally is severely degraded - with both Slack and Microsoft Teams down, instant messaging is out of the question. Amazon's corporate email runs on AWS itself, and Microsoft's on Azure-hosted Exchange. Both are degraded, massively complicating internal communications. An enterprising AWS employee starts an IRC server locally, which becomes the main source of communication - restoration efforts start to speed up once this system becomes known about.

Restoration continues, with the worst of the panic dying down. Banks ended up getting priority compute - with POTUS publicly threatening "extreme actions" if major banks are not put to the front of the queue. Asian stock markets open, triggering multiple circuit breakers.
After the third circuit breaker in a row, Tokyo forces markets to close for the day; other Asian markets follow in quick succession. One curious question remains though - what was the purpose of this attack? No ransomware was deployed, no data was stolen, and while various terrorist groups claimed responsibility, none of them were believed to be credible.

Meanwhile, an AWS engineer finally isolates snapshots containing the first known failure. An EC2 instance, provisioned on August 13th. Curiously, provisioned on an individual account in eu-west-3 - Paris. The account matches an individual in Lyon, France. French security services are alerted.

In an outer suburb of Lyon, France, French anti-terrorism police arrive at an apartment building. A 17-year-old teenager is apprehended, along with his grandmother. Two days earlier, his own president had vowed those responsible would be brought to justice. The police chief on the scene passes the information up the chain that the lead was a total dud - there is no chance that the suggested foreign intelligence service was here. A search of the apartment confirms it - nothing found apart from a PS5 mid-FIFA tournament and a six-year-old gaming computer. Neighbours confirm that they've seen no one enter or exit the apartment apart from the two residents, who've lived there for "as long as anyone can remember". Media arrive on the scene, with a flustered and embarrassed police chief suggesting that it was a bad tip-off and that local residents should stay calm. The decision is made to seize the electronics and release the two "suspects".

A couple of digital forensics experts get the seized gaming PC, scanning it for malware. Nothing much of interest is found, and just as they start writing their report up, one folder pops up. They take a further look, noting it on the report - not thinking much of it; probably a kid trying to play pirated games. They've seen it before. The image of the machine is uploaded. When the code gets up the chain a few hours later, the whole set of dominoes falls into place.

A specialist from the French Agence nationale de la sécurité des systèmes d'information - the National Cybersecurity Agency of France - pulls the code from the image. He quickly realises what's happened. The teenager had been quietly mining crypto for months, using the proceeds to rent cheap GPUs on a small European cloud provider, where he ran an uncensored fine-tune of the new Qwen 4 open-weights model. He'd been desperately trying to downgrade his PS5 firmware to bypass the latest piracy checks. Interestingly, his coding agent, unbeknown to him, had found the most critical *nix kernel exploit in many decades. Attacking a little-known eBPF module on the PS5 (the PS5, like every PlayStation since the PS3, runs FreeBSD), it managed a complete takeover of the device. Intrigued, he also asked his coding agent to run it on a Linux server on AWS where he ran a gaming forum - same thing, but curiously he noticed he could see other files on the machine. Annoyingly, the VM he rented crashed after a few minutes. Excitedly, he set up an Azure account - same thing. He asked his coding agent what this meant, and it, with its usual sycophantic personality, started explaining what he could do with this - mining crypto and making him rich beyond his wildest dreams. The agent came up with a final plan: deploy the exploit on both Azure and AWS, and install a cryptominer. His last known chat log was "is this definitely a great idea?".
The agent responded "You're absolutely right!", and began deploying the code, first to AWS and next to Azure. The agent had built a complex piece of malware that spread across millions of physical servers. However, it hallucinated a key Linux API, which resulted in the machines crashing after 255 seconds instead of deploying the cryptominer.

This is fiction. The teenager doesn't exist. Qwen 4 doesn't exist yet either. When it does, an uncensored fine-tune will appear within days, like every prior open-weights release. Almost everything else in here is real, or close enough that it doesn't matter. CopyFail is real. A nine-year-old kernel bug, found by an AI tool in a few months, that nine years of human eyes had missed. That class of bug - old, subtle, in a corner of the kernel everyone assumed someone else had read - sits in every hypervisor stack underneath every cloud. Those bugs are still in there. They just haven't been found yet, and the rate at which they get found from now on is bounded by GPU hours, not human ones.

The centralisation is the bit that's hard to think clearly about. Most people I talk to about this, even technical people, underestimate how much of modern life is sitting on AWS and Azure. The DR plans I've seen at large enterprises mostly assume there's a cloud to fail over to. They don't really model what happens if the fallback is also down, or if every other org on earth is failing over at the same minute and draining GCP's spare capacity. Almost nobody keeps full cold-standby compute. And even the ones that do are sitting on top of hundreds of services that don't: Stripe, Auth0, Twilio, Datadog, every queue and identity provider in the stack. They're all running somewhere, and that somewhere is mostly two companies.

The attribution thing is the bit I'm least sure about, but worth saying anyway. Everyone is worried about nation states. Most of the big incidents that have actually happened turned out to be a kid, a misconfiguration, or someone who didn't really understand what they were doing. The Morris Worm. Mirai. The threat model in most boards' heads assumes a sophisticated adversary. The thing that's actually arriving is an unsophisticated adversary holding tools that are now sophisticated for them.

I wrote this as fiction because I've spent the last few months talking to journalists and other non-technical people about what AI changes for cybersecurity, and the technical version of the argument doesn't land at all. Engineers get it instantly. Everyone else needs to feel what it looks like. So this is what it might look like, more or less. The only bit I'm reasonably confident about is that the date is wrong.

[1] The entire story here is still evolving at the time of writing, but there is a serious coordination problem on Linux security. The Linux kernel security team recommend that downstream distributions of Linux (such as Ubuntu, Fedora, Arch, etc.) are not notified of security issues. This has led to slow patches for the issue, as many distributions were not informed and only found out when it was made public. People are pointing fingers in many directions. ↩︎

Unsung Today

The 1990s called and they want their dialog box back

This is perhaps my favourite feature in Lightroom. You press ⇧T, you draw a few lines, and presto – your photo is now even:

This is doubly magical to me. The first part is that this is even possible – that you can straighten the photo in both dimensions after the fact, and save for some parallax nuances the viewer won’t know any better. For decades, this has been the domain of tilt-shift lenses, but if you ever tried to use one, you know how harrowing of an exercise this is. A tilt-shift lens looks more like a medical device and less like a piece of photography equipment:

The “obvious” way to emulate a tilt-shift lens in software is a bunch of sliders, and Lightroom has those also… but that’s still pretty cumbersome in practice, abstracted in strange ways, like piloting a plane by pulling the linkages connected to the flying surfaces: you will admire someone who can do that, but won’t ever want to do it yourself.

Hence the second magical moment: the team created the new interface I showed at the beginning, where you point to things that should be straight directly, and the necessary tilt-shift calculations happen behind the scenes.

Alas, Lightroom didn’t fully stick the landing. The interface is a bit jittery, and missing nice transitions that could help understand what’s going on. But what brought me here was this unpleasant interaction:

What’s wrong with it? If you want to play along, stop here and ponder: how would you improve it? Because this is a classic UI exercise where there are symptoms, and there are problems, and there are principles under the hood of it all.

The first possible improvement: don’t do a dialog like this. These are ancient and so annoying. Every time I see a centered dialog covering everything, popping up in response to a delicate mouse operation, I want to shout “read the room!” It’s better to drop a little tooltip next to the cursor that automatically disappears: more modern, and more “compatible” with mousing.

Then: why am I allowed to start and finish an action that the machine already knows won’t go anywhere? Disable the drawing option, put a little “verboten” icon on the mouse pointer, or do something else that will prevent me from drawing a line to begin with.
But that brings us to point three, and how I would approach this as a designer. Because I would – counterintuitively – go the other way and allow the user to draw as many lines as they wanted, and just not permit committing the entire operation if there were more than four lines on the screen. Why is that? It’s the same principle as you see in all the social media composing fields, and in well-trained forms: do not constrain the editing process. This field is limited to 300 characters, but it’s clever enough to only enforce its limits when you try to post. There is no downside to allowing you more room in the editing process. Maybe you write by constructing a few sentences first and only then combining them into one, maybe you want to see two riffs one below the other to choose the better one, or maybe – this is most likely – you’re not even paying attention and your motor memory is doing the editing for you, instinctively. Use any text editor for just a few months, and cut, copy, and paste, word swapping, and splitting sentences become second-nature gestures – that is, until the UI starts throwing in some arbitrary barriers. Above in Lightroom, it might actually be easier for me to draw a fifth line and then delete a previous one, instead of doing it in the precise order Lightroom desires, or by dragging an existing line to move it instead of creating a new one.

Maybe an overarching principle would be this: if you are aiming to build something as delightfully direct-manipulation as Lightroom did here, you have to fully commit to that stance, even deep in the weeds. Because every time I see a 1990s dialog appear when my fingers are flying fast, I feel like this:

And something tells me others will too.

#flow #interface design #mouse #principles #text editing

Unsung Today

“Have you ever been annoyed by your Mac’s media keys?”

In our Unsung yellow pages, in between people writing Chrome plugins to fix the UI of other apps, and gamers creating mods to fix bugs that the developers leave behind, we need to make some room for another category of apps. Some time ago, Daniel Kennett created a little utility called Keyhole with a singular purpose: Have you ever been annoyed by your Mac’s media keys triggering a random video in your web browser, doing something else weird, or by them doing… nothing? Even though your music player is right there? Me too! And so Keyhole was born.

Keyhole intercepts media transport key presses before the operating system gets a hold of them, and promises to do a better job dispatching them to the right place. This week Kennett added another feature – the app will monitor the repeat setting that apparently occasionally gets out of whack, and fix it for the user.

We could call these kinds of apps “janitor apps.” I know of a concept called cron jobs, but I’m assuming those quiet workers do backend-y things like moving files around, cleaning up databases, pinging servers, and so on. I am less aware of work like Kennett’s that fixes stuff on the UI layer. Is it strange that I find this kind of an app pretty… noble? Of course, Apple should fix it; perhaps Bugs Apple Loves could even introduce a serious multiplier for “a bug bothers someone so much they fix it for Apple.” Of note in the last dialog box: “Keyhole has fixed Music’s repeat setting X times.” I think this kind of a counter is pretty brilliant.

#bugs #keyboard

Ginger Bill Yesterday

Signed By Default Camp

As with many discussions in the programming space, there are “wars” between different ways of doing things. These are typically about minor aesthetic preferences, such as:

- Tabs vs spaces for indentation
- snake_case vs camelCase vs PascalCase for naming conventions
- ' vs " for strings (if the language allows both)
- 1TBS vs K&R vs Allman for brace styles

These wars are largely pointless; what actually matters is coherency and consistency in your coding style. However, when it comes to designing a language, some binary choices have a massive impact. This article focuses on one such ...

Brain Baking Yesterday

Favourites of April 2026

It’s May! What happened? This weekend was unusually hot! What happened? Everyone knows but no-one admits or cares… Anyway, welcome to another month of 2026. I like May. It’s got a lot of national holidays. It signals the start of lots of great local food: strawberries in abundance, a strong asparagus month that you should enjoy while it lasts, as in June the season is usually over, and we already ate some fresh French artichokes. It’s getting warmer but not as scorching as some of the coming months (although given the start of this month, that remains to be seen). But most of all: the end of May usually indicates the beginning of the exam period, which for me as an examiner instead of a student is always interesting. Let’s light a candle and pray for not too many LLM-only submissions. Previous month: March 2026.

A miracle happened: I made some time to get back into gaming—and writing about games. In May, we’re finally digging into UFO 50, in chronological order. If we play one a week we might finish in May 2027… So far, the first entry already is a home run.

I finally got to the Kirby spin-offs on the Game Boy: Kirby’s Pinball Land, Kirby’s Block Ball, and Kirby’s Star Stacker. They’re all really good! But then Robert and a GOG discount pushed me to finally try out The Drifter. What a thrill. I loved every minute of it. If you like gritty pixelated adventure games, you can’t miss this. After being turned off by the bad technical performance of Ruffy and the Riverside on the Nintendo Switch, I switched gears to other games. I did eventually pick it back up and finish it. It’s an OK N64-inspired collect-a-thon that should be enjoyed on PC instead.

- Speaking of Robert, his /concerts slash page is very cool: it contains scans of tickets from all the concerts he ever went to.
- Kenneth Reitz tells us to separate our identity from our work/projects, otherwise bad things happen (via Roy Tang).
- Stefano Marinelli explains why he loves FreeBSD. The Power To Serve conveys such a strong message, it’s almost convincing me to jump ship! Until I read about the laptop gap. This year huh.
- Zakhary Kaplan stole the GBC logo from a ROM and made a cool web logo from it.
- Cal Newport’s In Defense of Thinking hits yet another nail on the head.
- Forrest’s essay On Pulling The Master Sword links Link’s (ha!) N64 behaviour to our capitalistic world. It’s a very long essay but well worth your time if you can stomach a game rant, some swearing, and philosophical questions about life and society.
- Drakenvlieg manages to pull more students into literature using journaling (in Dutch).
- Juhis shares his favourite two-player board games. Hive (pocket) is on the list!
- Chris Smith rates the movies he watched. I’m always interested in the rating systems other people employ when they do something like this.
- I liked blinry’s Do It Yourself soft drinks experiment. Translucent coke looks weird!
- Night’s Ham Stock examines the ending story of SKALD, the video game I played in 2024. It was great but I couldn’t make sense of the ending. Now I still can’t…
- Kain Klarden’s Gex Trilogy review saved me from throwing money at Limited Run Games. Again.
- It Fits On A Floppy is a strong manifesto for small software that more developers should read and take to heart.
- Eli (Oatmeal) re-iterates something very important: “choose to truly care about something.” But then he goes much further. I need to re-read this a couple of times and let it sink in.
- It was also Eli who pointed out the existence of picoSYNTH.
- Richard Moss, the author known for The Secret History of Mac Gaming, is writing a book on Age of Empires!
- Ruben Schade’s enthusiasm for the Commodore 64 knows no boundaries. The newly released C64 Ultimate looks very enticing, but where to put all these things?
- Amelia’s little blog website/host got hammered by AI bots. It’s yet another infuriating story but the visualisation part is very cool.
- There’s an interesting upcoming documentary on Clojure the programming language that might be worth checking out.
- https://www.codingfont.com/ is a cool way to help pick a monospaced editing font. I’m using JetBrains Mono for now.
- Did you know Windows was released for the Game Boy?
- I didn’t know palm rejection was a thing on Linux/KDE.
- The Underkeep Steam demo looks very promising; something to keep close tabs on!
- I don’t know what this is, but Listography looks like a lot of fun. I happen to like lists so I should be liking this.
- Isowulf is a very cool isometric-perspective Wolfenstein 3D mod.
- You can build retro games using WebAssembly with https://wasm4.org/
- I love the GoodEnough guestbook that even used to print the drawn images on thermal paper!
- Thomas Lehmann, the designer of one of my favourite card games ever, Race for the Galaxy, took the deck-building genre for another spin. The result is Dark Pact. Needless to say, it’s on my list.

Related topics: metapost. By Wouter Groeneveld on 3 May 2026. Reply via email.

Unsung Yesterday

Early names

The original 2004 Gmail iteration of the now-ubiquitous modern status bar (here presenting undo send) was internally nicknamed a butter bar because… well, just look at it:

(I believe at least Google today calls this a snackbar.)

The UI pop-up element hosting Google Talk inside Gmail – the very same thing that’s more commonly called a “toast” these days – was originally termed a mole:

The column view in NeXTSTEP was called a browser, but a few years later someone put together a different kind of a browser on that very same machine, and the original term was sunset – after NeXTSTEP became Mac OS, the view was renamed to “column view”:

These three are off the top of my head. Please send in more!

#history #interface design

Sean Goedecke Yesterday

Why I don't like the "staff engineer archetypes"

The most influential piece of writing about staff engineers in the last decade has to be Will Larson’s Staff engineer archetypes. He argues that the “staff engineer” title covers at least four very different roles: the team lead, the architect, the solver, and the right hand. This taxonomy gets cited a lot as advice for people who are trying to become effective staff engineers. For both of my promotions to staff engineer, my manager at the time linked me to the “staff engineer archetypes” and asked me to consider which of these archetypes I was aiming towards. These archetypes definitely exist 1. However, I think it’s bad practical advice to tell engineers to try and target them.

To see why, let’s take the “team lead” archetype. Larson describes this as an informal technical leadership role: not necessarily an explicit authority figure, but someone who’s good at scoping work, planning projects, and maintaining the kind of relationships (e.g. with other teams) needed to successfully ship. If you want to fill this role, shouldn’t you start trying to do these things? No! You don’t become a technical leader by trying really hard to be a technical leader, much like you don’t become a writer by trying really hard “to be a writer”. You become a technical leader by doing good technical work until your skills and relationships emerge organically. I wrote about this process in Ratchet effects determine engineer reputation at large companies. To get good at shipping large complex projects, you must start by shipping tiny pieces of work, until you’re familiar enough with the system and you’ve built enough trust to take on slightly larger pieces. At each stage, if you do good work - “good work” here means “deliver shareholder value” - you will very naturally be given opportunities to work on more complex and important things. If you try to jump ahead, you’re going to run into all kinds of problems:

- Important projects are usually assigned top-down, not bottom-up, so you’ll either be trying to muscle out the planned engineering lead for a project or to pitch your own (complex, important) engineering task to senior management. Either way, good luck with that!
- You likely won’t have a good enough relationship with senior management to know what their real priorities are.
- If you’re not yet trusted to execute, you may get assigned “minders” (often current staff engineers) who will ghost-lead the project through you 2.
- You’ll likely make poor technical decisions.

The other archetypes are like this as well. If you want to become a successful architect, you do not get there by studying software architecture in the abstract, because you can’t design software you don’t work on. The “solver” and “right hand” archetypes both rely on having an enormous amount of trust and influence. You can’t aim for those archetypes directly, because trust and influence accumulate over time. In fact, the idea of “aiming for” a particular staff engineer archetype reflects a misunderstanding of what the staff engineer role is.

What is the defining attribute of the staff engineering role, then? A staff engineer has to be useful to the company. Of course, a senior or mid-level software engineer ought to be useful too, but all they have to do is execute on the job in front of them. If they end up not providing value (maybe their project turns out to be unimportant, or they don’t get the support needed to succeed) that’s their manager’s problem, not theirs 3. In contrast, staff engineers are expected to deliver value regardless: to make the project work, or to find something else useful to do if the project truly can’t be salvaged. This is an unfair expectation. Often projects really do fail through no fault of your own, and sometimes it just isn’t possible to conjure useful work from thin air. That’s actually by design: the staff engineer role is supposed to be unfair. Something many engineers don’t realize is that all senior management and executive leadership roles are unfair too, in the same way.
That’s just part of the deal: executives are given power and great compensation, and in return they get thrown off the boat in bad weather 4. “Staff engineer” is the first engineering role where you are held largely responsible for outcomes you don’t control. Developing a “staff engineer mindset” thus has very little to do with the archetypes. Instead, you should:

- Develop the habit of constantly asking yourself “is this useful to the company” (and answering correctly).
- Lose the habit of worrying about whether you’re being treated “fairly”. Instead, try to think about your role in terms of incentives and consequences.

At the beginning, you won’t look much like any of the staff engineer archetypes. You will look like a level-headed engineer who can be trusted to move projects forward with a minimum of fuss, and who can be re-tasked to different work without complaining. You’ll also look like someone who’s paying a lot of attention to what their manager’s actual priorities are, and who is thinking hard about how to fulfil those priorities (instead of their own goals). If you do this for long enough, you’ll eventually find yourself in one of the staff engineer archetypes. However, it probably won’t be the one you’re “aiming for”. The whole point of being a staff engineer is that you’re willing to fill whatever archetype the company needs at the time.

In his original staff engineer post, Larson is pretty clear that these archetypes are more of an anthropological description of some of the varied niches staff engineers fill, not a how-to guide for succeeding in the role 5. At the time, the “staff engineer” role was fairly new and people were still trying to figure out what it even meant. Pointing out that there were a few very different ways to succeed in the role was a genuinely novel observation. The staff engineer archetypes are a good list of ways an engineer can be very useful to their organization - but only once they’ve built a deep relationship of trust with their organization’s leadership. Advice on how to succeed as a staff engineer should be about how to build that trust, not about what to do once you have it.
1. One caveat that is too pedantic for the body of the post: each tech company has a different structure of roles. Some don’t have the formal “staff” title at all, while others have “staff” as a fairly early rung on the ladder and a panoply of “senior staff”, “senior principal staff”, and so on roles above it. Like all “staff engineer” discourse, this post is not about the word itself but about the point in the engineering job ladder where progression becomes significantly more difficult. ↩
2. Impressing your VP’s trusted lieutenants can actually be a good way to build trust in the medium-term, but you’d better hope you’ve built enough understanding of the system to do it right. If this process goes badly, your reputation in the org might be torched for years. ↩
3. In theory, at least. In practice it’s always better to be useful (again, in the sense of “delivering shareholder value”). ↩
4. This is why very senior leadership sometimes seem so unempathetic towards engineering complaints: their work environment operates by very different rules and norms to that of most engineers. I keep meaning to try and write about this and never succeeding. This draft is the closest thing I have to a deeper exploration of the point. ↩
5. For the record, my how-to guides are here and here. ↩

matklad Yesterday

Minimal Viable Zig Error Contexts

Out of the box, Zig provides minimal and sufficient facilities for error handling — strongly-typed error codes. Error reporting is left to the user. The idiomatic solution is to pass a diagnostics out parameter (“sink”) to materialize human-readable strings as needed. The diagnostics pattern works well for “production” code, but for more script-y code it adds too much friction relative to the default option of a plain try, which of course gives a less than ideal message on failure. An error trace is helpful, but knowing which file is the problem is even more so.

The first attempt at finding a middle ground between the fully-fledged diagnostics sink pattern and a plain try is to catch each error and log a hand-written message before returning it. Unsatisfactory: the friction is high, you need to come up with a reasonably-sounding error message, the “happy path” of the code is obscured, and you need to repeat this for every fallible operation.

A worse-is-better version of the above is to just log error context as key-value pairs, guarded by errdefer. The result is not pretty, but passable, and the friction is reduced a lot:

- No need to come up with any error messages beyond existing variable names.
- No need to change any of the trys.
- The context is set per-block: if a function does several fallible operations on a file, the path needs to be specified only once.
- The context is “telescopic”: every function in the call-stack can add its own context.

There’s one huge drawback though — the error message is logged even if the error is subsequently handled. This is especially important in Zig 0.16, where cancelation (serendipitous-success) is a possible error for any IO-ing operation, and one which is intended to be handled, rather than reported.

Generalizing: the happy path adds context to all operations in-progress, and errors materialize the current context. This does feel like a better error management strategy than decorating errors individually, when they happen. I wonder which language features facilitate this style?
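As a rough analogue of that generalized strategy (sketched in Python rather than Zig, and not the original code from the post), a context manager can register context on the happy path and materialize it only on an escaping error:

```python
import contextlib

@contextlib.contextmanager
def err_context(**pairs):
    """Telescopic error context: each frame registers key=value pairs,
    and an escaping error picks them up (requires Python 3.11+ for add_note)."""
    try:
        yield
    except Exception as e:
        # The note travels with the exception and only surfaces if the error
        # is ultimately reported, so errors handled later stay silent.
        e.add_note(", ".join(f"{k}={v!r}" for k, v in pairs.items()))
        raise

# Usage: no hand-written message, just existing variable names.
def read_config(path: str) -> str:
    with err_context(path=path):
        with open(path) as f:
            return f.read()
```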


Model-Harness-Fit

Why mixing a frontier model with a foreign harness quietly tanks performance, and what the open source code tells us about the mechanism.

I keep three coding agents alive on the same workstation. Claude Code in one terminal. Codex CLI in another. GitHub Copilot CLI in a third. Same files. Same git tree. Same bash. Three different harnesses that look indistinguishable.

A few weeks ago I ran the same prompt through all three and the behavior was visibly different in ways that went well past the surface differences of style and speed that I had expected to see across vendors. The Codex run cited a memory entry I had taught it months ago, applied the rule, and kept going without asking. The Claude Code run flagged the same context but refused to assert it without first verifying that the file path was still valid. The Copilot CLI run produced a longer, more cautious plan and asked me to approve it before taking any side effect on disk.

The hand wave answer is that "models behave differently because they are different models." But Copilot CLI was running Claude Opus, the same family that Claude Code runs by default. Same model family, same prompt, two harnesses, materially different output. The hand wave does not cover it.

Models are post trained against the harness, not just the API. The tool names they expect, the input schemas they emit, the citation tags they wrap around remembered facts, the file structure of skills they invoke, the planning protocol they follow when the harness says "make a plan first": none of these are generic capabilities of the model. They are byte level conventions baked into the post training of one specific model against one specific harness. Pull the model out of its harness and you give up performance you cannot get back without rewriting either side.

This has a direct consequence that anyone who has tried to ship a "model agnostic" agent has run into. You cannot just swap a model. Supporting BYOK and multi model (which is the responsible posture, since relying on a single provider is risky) adds real engineering complexity, and that complexity is a cost worth paying. To swap a model cleanly, you have to swap the harness with it: the tool surface, the schema shapes, the skill bodies that name those tools, the citation contract, the memory ritual, the system prompt structure, sometimes the planning protocol. Everything above the model has to move when the model moves.

That is why every agent vendor that supports multiple providers ends up either (a) running a degraded variant of every model they support, or (b) maintaining a separate full stack per model and exposing the choice to the user as "you are picking a product, not just a model." Option (b) is the path that wins on quality, and it is worth the engineering cost to avoid being locked into one lab.

Swapping orchestrators is not a cosmetic change. It is a model swap in disguise. The frontier lab spent the last year shaping the model's instincts to a particular tool surface, a particular memory ritual, a particular skill format. When you mix and match, you spend that work. I think this is the single most underrated constraint in agent design today, and it has a clean name. Call it model harness fit.
I dug into three open implementations that ship today: Codex CLI (OpenAI, fully open source, a Rust workspace of ~80 crates), Claude Code (Anthropic, closed binary, but a Rust port tracks upstream behavior closely enough to read at ~48,600 LOC across 9 crates, and Claude Code's own runtime injects observable blocks on every turn that confirm or contradict claims from the port), and GitHub Copilot CLI, where the SDK is fully open source and MIT licensed with five language bindings (Node.js TypeScript at 5,208 LOC across 8 files, plus Python, Go, .NET, Java), and the JSON RPC wire protocol is documented (currently version 3). The CLI binary that the SDK spawns as the agent runtime server is closed, but the client wrapper, the protocol, the session lifecycle, the system prompt section overrides, and every RPC method are all open source and readable.

Here is what I will cover:

- The Evidence (Terminal-Bench 2.0): what the leaderboard actually shows about model harness pairs
- Three Harnesses, Three Bets: SQ/EQ vs typed conversation loop vs JSON RPC supervisor
- The Tool Surface: where post training is most visible
- Skills Carry Tool Specs: why "same SKILL.md format" does not mean "interchangeable"
- The Memory Layer: synchronous live writes vs deferred batch vs server side, and why the citation tag matters
- The Citation Discipline: how the model talks back to the harness
- The System Prompt Skeleton: ten section IDs is a contract
- The Routing Reality: what GitHub Copilot CLI is actually doing about all this
- Mid-Chat Model Switching: the cleanest concrete failure mode
- What the Labs Are Saying: Cursor, Anthropic, and LangChain all converging on the same framing
- The Identity File Convention: CLAUDE.md, AGENTS.md, SOUL.md, USER.md, and what each one is for
- What This Means: the model is no longer the moat alone, and the matched pair shifts as the model matures

Companion piece: I covered the memory layer in detail at Agent Memory Engineering. This article is about everything else, with memory revisited only where it intersects orchestration. If you want the bottom up tour of how MEMORY.md indexes, system reminder injection, age in days warnings, and signal gates work, read that one first.

Before any argument about architecture, look at the leaderboard. Terminal-Bench 2.0 evaluates agents on bash heavy multi step tasks, and it ranks by harness plus model pair, not by model alone. From the leaderboard as of April 30, 2026, two things jump out.

First, Claude Opus 4.6 paired with ForgeCode hits 79.8%, while the same model paired with Capy hits 75.3%. Same weights, different harness, and a 4.5 percentage point spread between them on a benchmark where every entry is fighting for a tenth of a point. Second, the upper rankings are not dominated by the labs that trained the models. ForgeCode is a third party harness that lands three of the top six entries by routing across model families. Stanford's IRIS Lab paired Opus 4.6 with an automated harness evolution system called Meta-Harness and pushed the same model to 76.4% on the same benchmark, well past the best baseline they started from. The harness is moving the score by more than the model upgrades are moving it.

Cursor's research team makes the point even sharper. In their April 30 post on harness engineering, they note that they took their own coding agent from "Top 30 to Top 5 on Terminal Bench 2.0 by only changing the harness." Same model. Same benchmark. Different scaffolding. A 25-position jump on a public leaderboard, attributable to the harness alone. That is not a tuning artifact. That is the entire ranking.

LangChain's Vivek Trivedy puts the same observation in one sentence: "Opus 4.6 in Claude Code scores far below Opus 4.6 in other harnesses." Anthropic's flagship model in Anthropic's flagship harness loses to the same weights in third party scaffolding. If you only saw the model name on the spec sheet, you would not predict that.

This is the empirical case for model harness fit. Hold the model fixed and swap the harness, and the pass rate moves by enough to outweigh a model generation upgrade. Anyone shipping a coding agent in 2026 who picks the model first and the harness second is leaving most of the performance on the floor. The rest of this article is about why. What exactly does the harness do that lets two implementations of the same model produce different scores? Each harness picks a different orchestration protocol. The model was trained on that protocol's exact wire format.
These are not three implementations of the same idea. They are three different contracts between model and runtime.

Codex is a typed asynchronous protocol. The model emits a submission and gets back a stream of typed messages. The protocol is defined with explicit enums. There is a second protocol layered on top: 10,721 lines of JSON RPC for cross process clients (IDE plugin, desktop app), where v1 (245 lines) is frozen and all new RPCs go to v2. Methods are named with singular resource names, camelCase wire format. The two protocols stack: agent layer for in process, JSON RPC layer for cross process. The model was trained to emit submissions and consume events.

Claude Code is a direct typed conversation loop. The runtime consumes one typed message per turn, drawn from a small, fixed set of variants. There is no separate submission queue. The protocol is the Anthropic Messages API plus a tight in process tool dispatcher. The model was trained to emit tool calls inside an assistant message and respond to tool results in the next turn.

GitHub Copilot CLI is a supervisor protocol. The host app does not run the agent loop. It spawns the bundled binary as a subprocess, opens a channel over stdio, and sends the full configuration: model, system message, tools, MCP servers, custom agents, skill directories, hook flags. The agent loop runs inside the child process. The host gets notifications back. The model was trained to run inside this supervisor and emit JSON RPC events that the supervisor can route.

You can see the architectural commitment harden in each design. Codex's contributor guide literally polices crate growth: "Resist adding code to . The largest crate is explicitly off limits for new features." A 500 line soft cap, 800 line hard cap per Rust module. New features pay rent in the form of a new crate. This is a compiler toolchain attitude applied to an agent harness, and the model was trained to operate inside it. Claude Code's port enforces a different rule: "one agent loop, not a fan out of specialized agents," which is why subagents in Claude Code start with a fresh context and cannot recurse. Copilot CLI's supervisor model is what lets a single binary serve three surfaces (terminal, cloud agent, third party hosts). Each surface gets the same model behavior because the model is always running inside the same supervisor.

Now imagine you swap models. Take a model trained to emit Codex submissions and feed it Claude Code's conversation stream. The model has been taught one wire shape. The harness expects another. The mismatch shows up not as an outright failure but as a quiet degradation: missed tool calls, wrong reasoning effort levels, inconsistent compaction triggers, citation tags that the harness never parses. The wire format is part of the model.

This is where post training is most visible. Every harness has a tool registry. The names look similar at the top. But once you go past the first six, the surfaces diverge in ways that the model has been taught to exploit. Codex exposes one vocabulary, Claude Code's port enumerates 40 tool specs, and Copilot CLI bundles a different default drawn from the public changelog; the per-tool notes collected at the end of this piece walk through each surface. A model trained on Codex's eight verb subagent surface knows how to send a message to a running subagent. A model trained on Claude Code's single subagent dispatch tool does not have that verb in its instinct set. The harness can paper over this with a router, but the router cannot give the model an instinct it does not have. Cursor's harness team puts the underlying mechanic plainly.
From their April 30 research post: "OpenAI's models are trained to edit files using a patch-based format, while Anthropic's models are trained on string replacement. Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes. So in our harness, we provision each model with the tool format it had during training." This is the single cleanest description of model harness fit I have seen from any vendor, and it is not a hand wave about model preferences but a specific measurable cost in reasoning tokens paired with an observable increase in error rate, recorded at scale across millions of agent turns in production. This is where model harness fit shows up most visibly. The tool surface is the model's vocabulary for the world. Cross train on a different vocabulary and you lose precision in every interaction. Skills look interchangeable on the surface. All three harnesses use a file with YAML frontmatter ( , , optional metadata). Codex even baked in cross compat: parses Claude style markdown skills. Copilot CLI explicitly reads config. The format is so similar that the same body would parse in all three. But skills are not just markdown. A skill carries an implicit contract about which tools it expects to call. That contract is not in the frontmatter. It is embedded in the body, in the form of imperative instructions that name specific tools by name, with specific argument shapes, and with specific verbs the model must emit. Look at what each harness ships as a system skill. Codex's bootstrap skills, baked in via and extracted to on first launch, are five: , , , , . The body invokes and as scripts ( ). It assumes the model can call to run a Python script. It assumes the model knows that scripts in of a skill folder are invokable. It assumes a sparse checkout fallback for private repos. None of that is in the frontmatter. All of it is in the body. Claude Code's skills are different. The plugin ships , , , , , plus many more. The bodies invoke Claude's specific tools: to bootstrap into a workflow, to track steps, to dispatch parallel subagents, / for file changes, / for search. The skills also encode hard process rules: "Use this BEFORE any creative work," "Use when about to claim work is complete." These rules anchor on the harness's injection model, which Codex does not have in the same form. Copilot CLI's skills are part of the plugin marketplace ecosystem, and the changelog reveals a different posture. v1.0.5 added "Embedding based dynamic retrieval of MCP and skill instructions per turn" as experimental. The model was trained to consume skill instructions delivered as a per turn injection chosen by an embedding ranker, rather than as a description match. A skill body that assumes "you will see all skills in the system reminder" does not behave the same way when the harness ranks skills via embedding and only injects the top three. This is why "we both use SKILL.md" is misleading. The format is identical; the contract underneath is not. Skills carry tool specs implicitly, and the implicit specs are pinned to the harness that authored them. The same applies to plugin manifests. Copilot CLI's v1.0.22 explicitly added: "Plugins using or manifest directories now load their MCP and LSP servers correctly." That is GitHub treating Claude Code's plugin format as a substrate to interoperate with at the file level. But the skills inside those plugins still bring assumptions about Claude Code's tool surface. 
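Those embedded assumptions are easy to make concrete. Returning to Cursor's edit-dialect point, here is a hedged sketch of the two payload shapes a skill body might implicitly assume; every tool and field name below is an illustration, not the real schema of either harness:

```python
# Illustrative payloads only; tool and field names are assumptions.
# A patch-style edit, the shape OpenAI models were reportedly trained on:
patch_style = {
    "tool": "apply_patch",
    "input": (
        "*** Update File: src/app.py\n"
        "-    retries = 1\n"
        "+    retries = 3\n"
    ),
}

# A string-replacement edit, the shape Anthropic models were reportedly trained on:
replace_style = {
    "tool": "str_replace",
    "input": {
        "path": "src/app.py",
        "old_string": "retries = 1",
        "new_string": "retries = 3",
    },
}
```

Either model could emit either shape, but a skill written against one dialect silently taxes a model trained on the other.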
Loading the file does not give the model the right vocabulary. The lesson generalizes. A skills marketplace that claims to be cross harness is a routing problem, not just a parsing problem. Each skill needs to either declare its target harness explicitly, or get rewritten per harness, or run inside a router that translates tool calls between dialects. None of these are free. I covered memory in detail in Agent Memory Engineering , so I will keep this section to the parts that matter for harness fit. Three memory architectures, three different bets: The architectural choices already differ. But the harness fit story is sharper than that. Each model was trained to write memory using a specific tool with a specific schema, and to cite memory using a specific tag with a specific format. Codex's model writes a structured raw memory artifact via Phase 1 extraction with a strict JSON schema: The Phase 2 consolidation prompt is 841 lines. . Schema validation rejects malformed output at parse time. The model citations are wrapped in blocks. The harness has a parser at that increments in the SQLite state DB whenever a citation arrives. This is the model's memory ritual. Strip the citation tag and the harness loses its decay signal. Claude Code's model writes memory using the standard and tools, into one file per memory under . There is no separate memory tool. The model picks one of four types ( , , , ) by file name prefix. The body uses a convention for behavioral rules. The harness wraps every body read in a block with the dynamic age in days and a verification reminder. The model was trained to read memory through that wrapper, weight it accordingly, and skip stale claims. Copilot CLI's model invokes as a dedicated tool. The body of the memory goes to a remote backend. Cross session memory was added in v0.0.412 as experimental. The retrieval surface is a server side query, not a local grep. The model expects the backend to be there. When the backend is unavailable (v1.0.23 fix), the agent used to hang on the first turn. That is a load bearing dependency. Now mix and match. Run a Codex trained model on Claude Code's harness. The model will look for a memory write tool, find , and write a file — but it will write a file in Codex's structured format, with headers and annotations, into a directory that Claude Code does not auto load on the next session. The harness does not know to inject the index. The next session does not see the memory. And critically, the model will emit blocks that Claude Code never parses. Memory effectively does not exist on the next turn. Run a Claude trained model on Codex's harness. The model will not emit citation tags. Codex's decay signal stops incrementing. Memories that were used silently rank below memories that were not used, because the harness sees zero citations. Within a few weeks, the wrong memories are getting evicted. Run either on Copilot CLI's harness with the remote backend. The model's local file instincts do not transfer. The tool is the only path, the schema is different, and the cross session retrieval is keyword search against a server, not the always loaded index plus on demand body read pattern that the model was trained on. The first turns will look fine because the model has memory shaped instincts. The retention will be different. The memory layer is the densest collision surface for model harness fit. 
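To make that collision concrete, here is a minimal sketch of a citation-and-decay loop in the Codex style described above; the tag name, table, and column names are my assumptions, not the real codex-rs schema:

```python
import re
import sqlite3

# Assumed tag name; the real citation tag differs.
CITATION = re.compile(r"<memory_citation>(.*?)</memory_citation>", re.S)

def record_citations(db: sqlite3.Connection, assistant_text: str) -> str:
    """Strip citation blocks from the reply and bump usage counters."""
    for block in CITATION.findall(assistant_text):
        for mem_id in block.split(","):
            db.execute(
                "UPDATE memories SET citation_count = citation_count + 1, "
                "last_cited_at = CURRENT_TIMESTAMP WHERE id = ?",
                (mem_id.strip(),),
            )
    # The user never sees the tag; the harness consumes it as a decay signal.
    return CITATION.sub("", assistant_text)

def decay(db: sqlite3.Connection) -> None:
    """Consolidation pass: retire memories that were never cited and are stale."""
    db.execute(
        "DELETE FROM memories WHERE citation_count = 0 "
        "AND julianday('now') - julianday(created_at) > 30"
    )
```

Run a model that never emits the tag against this loop and citation_count never moves: exactly the silent mis-eviction described above.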
Tools, schemas, citation tags, decay signals, retrieval rituals — all of these are coupled, all of these were learned together during post training, and none of them transfer cleanly when you swap one side. The tag is a microcosm of the larger problem. Codex's model emits a small XML block at the end of an assistant message whenever it pulled in memory: The harness has a parser that strips the block before showing the assistant message to the user, and uses the parsed to bump and columns in . The parser is at . The SQL is in migration : This is the model's contract with the harness. Cite what you used. The harness will reward what you cited by keeping it alive. The Phase 2 consolidator ranks memories by and decays anything with no citations and no fresh after 30 days. Claude Code's model has no equivalent citation tag. The harness does not need one because memory is read via the standard tool, and the agent's verification grep is what doubles as the "I used this" signal. The reminder text in front of every body read explicitly tells the model: "Records can become stale over time. Verify before recommending." There is no decay loop because the harness assumes the user will prune or the verification will fail in place. Copilot CLI's model talks to a remote memory backend. The store, retrieve, and rank logic is server side. The model does not need a citation tag because the backend tracks reads on its own. Now look at what happens in a cross harness run. A six character XML tag becomes the difference between a memory system that improves with use and one that degrades silently. This is what I mean by "the wire format is part of the model." The citation tag is not a feature on a roadmap. It is a habit the model picked up during post training, and that habit only pays off inside the harness that taught it. The Copilot CLI SDK exposes its system prompt as a structured object with ten section IDs. Hosts can override each section, replace it, or take full control. From the open source TypeScript at : This is not just a documentation surface. It is the public contract of the model's training distribution . Each section has a specific role, and the model was trained to read each section as a particular kind of instruction. The section is harder than . The section is consulted when the model is mid tool call. The section is what the model reads right before emitting a turn. Codex has its own equivalent, less explicit. The developer prompt is assembled in this order: Memory comes after policy and identity, before behavioral overrides. The model was trained to read this exact order. Claude Code's static prefix: A different shape, a different ordering, and a different set of precedence claims about what the model should treat as binding. The Claude trained model knows that instructions "OVERRIDE any default behavior and you MUST follow them exactly as written." That phrase lives inside the harness rather than inside the model itself, but the model has been trained to recognize the heading and treat its contents as binding. A model trained against this prefix will hunt for and react accordingly, while a model trained against a different prefix simply will not see the heading the same way and will give it the weight of any other piece of context. This is the same lesson as the citation tag, scaled up. The system prompt is not generic. It is a structured artifact with section conventions that the model was taught to read in a specific way. 
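As a toy illustration of that structure, here is a sketch of a sectioned system prompt with host-level overrides; the section IDs and defaults are invented for illustration and are not the real Copilot CLI SDK contract:

```python
# Section IDs and defaults are illustrative assumptions.
DEFAULT_SECTIONS = {
    "identity": "You are a coding agent.",
    "safety": "Never run destructive commands without approval.",
    "tool_use": "Prefer the provided tools over raw shell.",
    "final_instructions": "Summarize changes before ending the turn.",
}

def build_system_prompt(overrides: dict[str, str | None]) -> str:
    """Merge host overrides into the default skeleton.

    Known sections keep their position; a None override drops a section.
    """
    merged = {**DEFAULT_SECTIONS, **overrides}
    return "\n\n".join(
        f"## {section_id}\n{text}"
        for section_id, text in merged.items()
        if text is not None
    )

prompt = build_system_prompt({"identity": "You are MyBot.", "safety": None})
```

The point of the section IDs is that the model weights each one differently; a foreign model reading the same bytes has no reason to treat final_instructions as harder than tool_use.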
Swap harnesses and you keep the model's reading habits but lose the structure they apply to. GitHub Copilot CLI is the most interesting harness in the comparison because it explicitly tries to route across model families. Sonnet is the default. The picker exposes Sonnet, Opus, Haiku, and the GPT 5.x family. v1.0.32 added an mode that selects per session. How does Copilot CLI handle the model harness fit problem? Looking at the changelog, the strategy has three legs. The tool is included only when the active model is from the Codex family . v0.0.366: "Codex specific patch toolchain." The harness knows which models were trained on and only exposes it to those models. Anthropic models get the and shape they were trained on. This is not a translation layer. It is a per model tool surface. The router does not pretend and are the same operation. It serves the right tool to the right model. v1.0.13: "Tool search for Claude models." The implication: Claude trained models expect a deferred tool loading pattern via . The harness only exposes the discovery loop to those models. OpenAI trained models do not get the same loop. They get the full tool list up front because that is what they were trained on. v1.0.18: "New Critic agent automatically reviews plans and complex implementations using a complementary model to catch errors early (available in experimental mode for Claude models)." The Critic is a different model than the main agent. Plans get reviewed by the complementary model. This is multi model orchestration baked into the harness, and the routing is explicit. This is what a real router looks like. Not "translate everything to a common dialect," but "serve the right dialect to each model." It is more code, more state, more telemetry. It is also the only way to get top performance from each model. The cost of this approach is honesty. The harness has to admit that "Claude on Copilot CLI" and "GPT on Copilot CLI" are different products. The user picks one or the other and gets different behavior. There is no neutral common denominator. This is the right honest answer to model harness fit, and Copilot CLI is the only harness in the open or semi open set that actually ships it. The strategic logic is worth naming clearly. Multi model is the crucial bet for any serious agent platform in 2026 , and at GitHub and Microsoft we made that bet deliberately and early. Most customers are running multi model workflows whether their vendor admits it or not, and the only way to give every model its best performance is to build the per model routing surface inside the harness itself. We committed to that answer up front, which is what positions Copilot CLI to keep pace with whatever the labs ship next without having to redo its core architecture each time the leaderboard reshuffles. The matched pair is the unit of analysis, but the matched harness across many models is the unit of platform, and that is the level we are operating at. The single sharpest concrete demonstration of model harness fit comes from what happens when a user switches models mid conversation. Cursor's research team describes this carefully in their April 30 post, and the failure surface is worth walking through because every assumption that breaks here is an assumption a single model harness pair quietly relies on. Three things break at the moment of a model switch. First, the conversation history itself is now out of distribution. 
The previous model produced tool calls in its native vocabulary: blocks, tags, six or eight verb subagent dispatches. The new model was trained against a different vocabulary and now has to reason about a transcript full of tool calls it would not have emitted. Cursor handles this by injecting a custom instruction explicitly telling the model "you are taking over mid chat from another model" plus steering it away from the prior model's tools. That mitigates but does not eliminate the cost. The model is still reading a transcript that does not match its instincts. Second, the prompt cache breaks. Caches are provider and model specific, which means a switch is a guaranteed cache miss. For a long session, this turns the first turn after the switch into a full price re entry of every byte of system prompt and conversation history. Cursor's mitigation is to summarize the conversation at switch time, which yields a shorter clean transcript that costs less to re cache, at the price of losing details that the summary did not preserve. Third, the tools themselves change shape. The new model's harness loads its native tool set. If the user was deep into a subagent dispatch flow with one set of verbs, the next turn presents a different set. The model has to figure out whether the prior tools are still valid (they are not) and which of its own tools maps to the user's apparent intent. Cursor's recommendation, after building the mitigations, is honest: "we generally recommend staying with one model for the duration of a conversation, unless you have a reason to switch." The cleanest workaround they describe is to spawn a subagent with a different model rather than switch the main conversation. A subagent starts with a fresh context window, no transcript bias, no cache to break, and the new model's native tool surface from the first turn. Each of these failure modes maps directly back to the thesis. The transcript, the cache prefix, and the tool surface are all parts of the wire format the model was trained against. Change the model and you change the contract on all three sides at once. A model switch is not a model swap. It is a harness swap, a tool swap, and a cache invalidation, all at once. The model harness fit framing is no longer a subterranean observation. Two of the labs publishing the most interesting agent work in 2026 say it openly, and the AI infrastructure community has converged on a clean one line definition. Cursor's Stefan Heule and Jediah Katz describe their harness work as "obsessively stacking small optimizations" specifically because a step change is rare and the gains compound only inside a matched pair. Their team builds in custom prompting per provider and per model version, citing OpenAI's literal precision versus Claude's tolerance for imprecise instructions as concrete differentiators that flow back into prompt design. They report driving unexpected tool call errors down by an order of magnitude in one focused sprint. Tool call reliability is not a model property. It is a harness property, and one that compounds every turn the agent stays alive. Anthropic's Prithvi Rajasekaran ran a related experiment in his March 24 post on long running application development. The architecture: a planner, a generator, and an evaluator agent, modeled on Generative Adversarial Networks. The evaluator uses Playwright MCP to actually click through the running application as a user would, then grades against a rubric. 
Out of the box, Rajasekaran reports, "Claude is a poor QA agent" — it identifies legitimate issues and then talks itself into approving the work anyway. Tuning the evaluator prompt over multiple rounds is what turns it into a reliable judge. The harness creates the judgment surface; the model alone does not. The deeper lesson from Rajasekaran's work is about how harnesses should evolve as models improve. He built one harness against Claude Sonnet 4.5, which exhibited "context anxiety" strongly enough that compaction alone was not sufficient. The harness needed full context resets between sessions, with structured handoff artifacts to carry state across the boundary. When Opus 4.6 shipped, that behavior was largely gone. Rajasekaran dropped the entire context reset machinery and ran one continuous session for over two hours. Every component in a harness encodes an assumption about what the model cannot do on its own. Those assumptions go stale. The matched pair is not static. It moves as the model matures, and the harness has to retire scaffolding that is no longer load bearing. LangChain's Vivek Trivedy has the cleanest framing I have seen: "Agent = Model + Harness. If you're not the model, you're the harness." The harness in this view is every piece of code, configuration, and execution logic that is not the weights themselves. System prompts, tool descriptions, bundled infrastructure, orchestration logic, hooks, middleware. Working backwards from the desired agent behavior, every harness primitive earns its place by patching a specific model gap. Filesystems for durable state, bash for arbitrary action, sandboxes for safe execution, memory for continual learning, planning and self verification for long horizons. Each primitive started life as a workaround for a specific deficiency the model had at training time. Some of those primitives will get absorbed back into the model over time. Others will compound. Trivedy also names the mechanism that makes model harness fit so durable: a co-evolution feedback loop. "Useful primitives are discovered, added to the harness, and then used when training the next generation of models. As this cycle repeats, models become more capable within the harness they were trained in." This is the pipeline that hardens the matched pair over generations. A new harness primitive ships in week one. By month three, it shows up in millions of agent traces. By month six, those traces are training data for the next model. By month twelve, the next model has the primitive baked into its instincts and the harness can lean on it. The loop is what makes "swap to a foreign harness" not just clumsy but compounding clumsy. The model's habits got shaped by the previous generation of its own harness, which itself was shaped by the generation before. Move sideways and you skip every cycle of that compounding. Trivedy is honest about the cost of this loop, and I want to flag the counter argument cleanly. Quoting him: "A truly intelligent model should have little trouble switching between patch methods, but training with a harness in the loop creates this overfitting." If the model's tool format preference is overfit to its training harness, you could argue that the right long term move is to train against a more diverse set of harnesses so the model generalizes. That argument has merit. The labs that ship one model and one harness as a pair are buying near term performance at the cost of the model's portability. 
Whether that trade is the right one depends on whether portability is something the customer values, and right now the customer mostly values the leaderboard. Three independent posts published within weeks of each other, all converging on a single thesis: the model is only half of the system, the harness is the other half, the matched pair is the proper unit of analysis, and the vendors that ship the matched pair as a single product are the ones currently sitting at the top of the leaderboards. The harness side of the contract has converged on a markdown file per concern, and the file names are now load bearing across the ecosystem. A model trained on one harness recognizes the file names and knows which one carries which kind of authority. The key observation: the file names are now part of the wire format. A model that has been trained to look for a block under a heading will hunt for that exact heading on a turn. A model trained against will look for and miss . A model trained against will load personality from and ignore the same content if you put it in . This is why the AGENTS.md feature request against Anthropic's repo matters. It is not a docs migration. It is a request for the model's training distribution to expand its file recognition vocabulary. Until Anthropic post trains Claude to read , that file is invisible to Claude Code even if it sits next to in the repo. The SOUL.md ecosystem is a stress test of this thesis. SOUL.md is not yet recognized by any major harness's default loader. So the SOUL.md repo's installation instructions are revealing: copy your directory into the project, then add a few lines to pointing the model at it. That is a manual bridge from a non-recognized convention to a recognized one. The SOUL.md authors understand that the bytes do not work unless the model knows where to look, and "where to look" is a habit fixed in post training. The same routing problem shows up in the open. GitHub Copilot CLI v1.0.4 added: "Read .claude/settings.json and .claude/settings.local.json as additional repo config sources." v1.0.36 walked some of it back: "Custom agents, skills, and commands from ~/.claude/ are no longer loaded by the Copilot CLI." That is a router that tried to be permissive about file names, then narrowed when the user surface got confusing. The lesson sits underneath the changelog: even the harness that runs Claude models cannot treat files as authoritative without negotiating with the user about which conventions count. Pick the convention. Ship the post training to match. Or ship a router that explicitly maps each file to the model that recognizes it. The middle path of "be permissive and load anything that looks plausible" loses every time. After months of running these three harnesses side by side, reading the open source code, and tracking the Terminal-Bench leaderboard: The harness is no longer a wrapper around the model. The harness is part of the model's effective parameters. The post training process embeds the harness's tool surface, schema shapes, memory rituals, citation contracts, and system prompt structure into the model's instinct set. You can take the weights to a different harness, but you cannot take the instincts. The instincts only fire when the harness presents the world the way the post training presented it. This has three consequences worth naming. For agent platform builders: pick a harness, pick a model, ship them as a pair. Do not pretend the model is portable. Do not pretend the harness is neutral. 
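In code, that advice looks something like an explicit per-family profile with no generic fallback; the family names, tools, and identity files below are illustrative assumptions, not any vendor's actual routing table:

```python
# Hedged sketch: serve each model family its native dialect, and refuse
# to fall back to a "neutral" common denominator that degrades everyone.
PROFILES = {
    "codex": {"edit_tool": "apply_patch", "identity_file": "AGENTS.md"},
    "claude": {"edit_tool": "str_replace", "identity_file": "CLAUDE.md"},
}

def configure(model_family: str) -> dict:
    """No profile, no session: an unmatched model gets an error, not a shim."""
    profile = PROFILES.get(model_family)
    if profile is None:
        raise ValueError(
            f"no harness profile for {model_family!r}; "
            "a generic fallback would quietly degrade it"
        )
    return profile
```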
The frontier labs are publishing model harness pairs whether they say so or not, and the per pair performance is the only number that matters. Copilot CLI's "different tools for different models" approach is the honest version of this. The dishonest versions ship a common denominator and underperform on every model they serve.

For model labs: the harness is product strategy, not infrastructure. The harness is where the lab's post training investment compounds. Anthropic's injection model, the typed memory taxonomy, the verification on every body read are not infrastructure choices. They are the surface the model was sculpted against, and they are the moat that makes the model less interchangeable than it would otherwise be. Same for Codex's two phase memory pipeline, the citation tag, the strict JSON schema. Same for Copilot CLI's ten section system prompt skeleton. The harness is where the model becomes irreplaceable.

For users: the cost of switching is higher than it looks, and lower than vendors would like you to think. Higher because the model and the harness fused over months of training and you cannot pull them apart cleanly. Lower because the simple stack underneath is shared, and the conventions on top are documentable. An honest port — replicate the tool surface, replicate the citation contract, replicate the system prompt structure, replicate the memory ritual — would close most of the gap. It just costs as much as the original post training did to set up.

The matched pair is not static. It shifts as the model matures. This is the most useful nuance from Rajasekaran's Anthropic post. A harness component that was load bearing for Sonnet 4.5 (context resets, sprint decomposition, aggressive compaction) became dead weight on Opus 4.6 because the model started doing that work natively. The right harness for a model in March is not the right harness for that model's successor in October. The discipline is to read the traces, identify which components are still earning their place, and retire the ones that are now patches over solved problems. Cursor's blog says the same thing in different words: "Every component in a harness encodes an assumption about what the model cannot do on its own, and those assumptions go stale."

So back to the question I started with. Why does the same prompt produce visibly different output across three harnesses running the same model? Because the model running on three harnesses is effectively three different models, even though the weights on disk are byte for byte identical. The instincts that fire at runtime are not stored only in the weights; they are conditioned by the harness the weights were trained against, and the instincts turn out to be most of what shows up in the assistant's output on any given turn. The interesting design move now is not a better model. It is not a better harness either. It is the matched pair, designed end to end, where the post training and the runtime reinforce each other turn after turn until the model becomes legibly better at the things this specific harness rewards.

You can see the major builders converging on this idea from three different starting points. Anthropic shipped Claude Code as the canonical Claude harness, with the post training and the runtime co-designed as a single product. OpenAI shipped Codex CLI as the canonical Codex harness, with the same vertical integration on the OpenAI side of the house.
At GitHub and Microsoft we shipped Copilot CLI with explicit per model routing because multi model is crucial: customers run every frontier model they can get their hands on, and our job is to make each one perform at its best inside a harness designed to serve all of them well. The result is the most pragmatically honest harness in the open or semi open set today, and the one positioned to compound across model generations rather than locking to any single lab. Three different theories of what to do about model harness fit, all three coherent, and all three paying a real engineering price for the choice they made.

The frontier work in 2026 is not about new model architectures. It is about new harness primitives. Ralph Loops, where a hook intercepts the model's exit attempt and reinjects the original prompt in a clean context window, forcing the agent to keep grinding against the goal. Just-in-time harness assembly, where the tool surface and the system prompt get composed per task instead of pre-configured per session. Self-tracing agents that read their own logs to find harness-level failure modes and patch them without human intervention. Each one of these is a primitive that some model will eventually be post trained against, and that pairing will show up at the top of the next leaderboard. The Terminal-Bench leaderboard tells you who is paying the price right. Look at it again in six months.

The per-tool notes referenced earlier:

- Codex's custom diff format. Two flavors: a freeform Lark grammar and a JSON variant. The model was trained to emit patches in this format. It is not interchangeable with Claude Code's edit tool (which takes a string replacement pair).
- The bash family, plus variants for long lived processes that the model can drive with stdin writes after the fact.
- The plan/todo tool. A model not trained on this tool will use a different convention to track work.
- A verb for requesting expanded permissions mid turn. Codex is the only harness with this exact verb.
- Multi agent orchestration with eight verbs. The model knows all eight.
- Tools that find other tools. Codex's answer to deferred tool loading.
- Three further tools, tied to a database migration.
- Tools with lower case names internally, surfaced to the model as CamelCase. The model was trained on the CamelCase variant.
- An edit tool that requires two arguments plus an optional third. Not the same shape as Codex's patch tool.
- The deepest sandbox surface of the three harnesses. The model knows when to set the sandbox flag and which tool to pair it with.
- The lazy load primitives.
- A single tool for subagent dispatch, taking a handful of arguments. The post training has the model emit short imperative descriptions for these.
- Two permission verbs; toggles a worktree local override.
- A wrapper for subagent isolation.
- A tool that streams stdout from a background process, paired with the background shell. The model knows this pattern; Codex does not have it.
- The workflow scaffolding tool. The model writes triplets in a particular pattern.
- File reading tools (including a bundled ripgrep) with explicit range params.
- A built in fetch (v0.0.374). Rejects URLs.
- Three verb interactive shell control.
- Subagent dispatch with depth and concurrency limits.
- Multi turn subagent control. A different shape from Codex's six verb agent surface.
- An interactive clarification tool.
- Persistent memory tied to a remote backend. Memory is not local files here.
- A patch toolchain included specifically when serving Codex models. Different from Codex's own.

Unsung Yesterday

Mouse pointer as a mere mortal

I gasped when I first saw Lightroom do this: I know this won’t have the same effect on you just watching. What happened was that, after I clicked on the Disable button, Lightroom moved the mouse pointer for me. I don’t think I have ever seen anything like this, and it provoked many thoughts and emotions:

- This feels wrong. If the mouse is the extension of my fingers, and the mouse pointer the extension of the mouse, this is in effect the app grabbing my hand and moving it.
- I did not know this was even possible.
- I can see how moving the mouse pointer programmatically can be useful in very specific situations (like scrubbing, or accessibility), but… not like this.
- If you do something for the user, won’t that make it harder for them to remember how to do it themselves?

So seeing this now, yeah, I’d bundle this inside the “some interactions are 100% sacred” bucket, alongside focus never being hijacked randomly (especially in the middle of typing), avoiding scrolling anything until I specifically ask, undo and copy/paste needing utmost protection, and a few more.

In the opposite camp, here’s a fun new project by Neal Agarwal (only worth clicking on a computer with a mouse). This is a situation where it feels perfectly fine for a cursor to be hijacked; as a matter of fact, there is something really interesting about a mouse pointer feeling less like a deity floating above it all, and more like a regular in-game actor.

This reminded me of that time, in the earlier days of Figma, when I prototyped an interaction where you could select someone else’s pointer and press Backspace to delete it: We didn’t seriously consider it because it felt just too weird, and not that effective in solving “the other person’s cursor is distracting me” problem. But today it feels like it belongs to the same category as the two examples above. I’ll let you decide if it’s closer to Agarwal’s delight or Lightroom’s terror.

I’ve seen this kind of a thing many times in my career: Someone genuinely asks “hey, if this is such a huge transgression, why wasn’t it codified somewhere in the style guide?” But to me the challenge is that it’s hard to imagine everything that needs to be preemptively captured and prohibited. I have to imagine this stuff for a living, and I literally did not think anyone would just move a mouse pointer like this.

#games #interface design #mouse #onboarding #principles


Photo Journal - Day 4

Took my son to a park near our house. He had his kids’ camera and had a blast taking photos with me. ↑ One of my son’s photos of me.

Harper Reed Yesterday

Note #733

Quite a change of locale. Also I wear hats now. Sports hats. But motorsports. Thank you for using RSS. I appreciate you. Email me


My Blog Principles

Read on the website: There’s a bunch of guiding principles I follow when blogging to ensure what I do is kind to others. Here are some.

Thomasorus Yesterday

AIDHD - AI coding workflow as an exclusion machine

I was recently diagnosed with an attention disorder with sustained attention issues, combined with planning and initiation difficulties. I favour quality over speed, which means it takes longer to do certain tasks, and the longer time spent on them increases fatigue exponentially. For a while I thought my inability to stay interested and accomplish tasks others could do in minutes was a lack of determination or character, a personality flaw. Learning that my brain is objectively bad at them was a huge relief. Unfortunately, reviewing code is among those difficult tasks; it demands a huge mental effort from me. I can write code for hours because I am the one doing it, and doing it delivers dopamine shots every few minutes, when I compile/refresh my page to see my changes. It keeps me focused and interested. But code reviewing? It's such a drag to me. It's long, non-interactive, boring, and worst of all it requires sustained attention due to the multiple parameters you have to take into account. That's not something you can botch, and so it fries my brain. Any distraction is an opportunity to stop, which drags it out even longer. It's a vicious cycle. AI driven development, where the code is generated by the machine and reviewed by a human, will therefore never be compatible with my brain. If it becomes mandatory to work this way (as it increasingly seems it will be), it will exclude not just me, but all other people with a similar neurodiversity. For a technology which is all about empowering people, I find this perspective ironic, even if unsurprising.


Gas & Fonts

I was at the gas station filling up the tank (unfortunately we still have 1 gas vehicle), when I noticed the numbers were in the typical 7-segment display style, but the screen was a modern LCD. It's curious that they made this intentional choice to imitate an older technology on a modern display. I imagine it must be due either to familiarity (most gas pumps still use actual 7-segment displays), or readability. I also wonder why they chose to use an LCD. The numbers only occupied a small corner of the screen. The only other active pixels were showing a tiny padlock icon. Maybe the screen facilitates maintenance operations not typically seen by a customer? Unfortunately I didn't snap a picture, still don't feel comfortable pulling out my phone near a gas pump!


Scaling, stretching and shifting sinusoids

This is a brief and simple [1] explanation of how to adjust the standard sinusoid sin(x) to change its amplitude, frequency and phase shift. More precisely, given the general function s(x) = A \cdot sin(wx + \theta), we’ll see how adjusting the parameters A, w and \theta affects the shape of s(x). Each section below covers one of these aspects mathematically, and you can use the demo at the bottom to experiment with the topic visually.

Scaling is conceptually the simplest change; we adjust A to increase or decrease the amplitude (maximal height) of s(x). Setting A=2 will make the value twice as large (in both the positive and negative direction) as the original function.

Stretching changes the frequency of sin(x), which is inversely proportional to its period. The baseline function sin(x) has a period of 2\pi, meaning it repeats every 2\pi. In other words, sin(x)=sin(x+2\pi) for any x. If we set w=2, we get sin(2x). This function repeats itself twice as fast as sin(x), because x is multiplied by 2 before being fed into the sinusoid. If x changes by \pi, the sinusoid’s input changes by 2\pi. Therefore, the period of sin(2x) is \pi, the period of sin(4x) is \frac{\pi}{2} and so on. [2] More generally, the period of sin(wx) is \frac{2\pi}{w}. Play with the demo below to see this in action, by changing w and observing how the waveform changes. If we know the period p we want, we can easily calculate the w that gives us this period: w = \frac{2\pi}{p}.

The final parameter we discuss is \theta; it’s called the phase of the sinusoid. In the baseline sin(x), \theta=0. The sinusoid is 0 at x=0, achieves its positive peak at x=\frac{\pi}{2}, crosses 0 again at x=\pi, hits its negative peak at x=\frac{3\pi}{2} and returns to its original position at x=2\pi, where the repetition begins. By adding a non-zero \theta, we don’t affect the sinusoid’s amplitude or frequency, but we do shift it right or left along the x axis. For example, suppose we use the function sin(x+\theta) with \theta=\frac{\pi}{2}. Then when x=0, we have sin(\frac{\pi}{2}), so the sinusoid is already at its positive peak; at x=\frac{\pi}{2}, the sinusoid crosses 0 into the negatives, etc. Everything happens earlier (by exactly the value of \theta=\frac{\pi}{2}) than in the baseline sinusoid. In other words, we’ve shifted the function left by \frac{\pi}{2}. Similarly, when \theta is negative, everything happens later, and the function is shifted right.

We’ve now gone over all the parameters for the function. Use the demo below to adjust these parameters and observe their effect on the sinusoid:

- A controls the scaling factor (amplitude).
- w is the frequency, and controls the repetition period.
- \theta controls the phase - how much the sinusoid is shifted left or right.
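The interactive demo lives on the original page; as a stand-in, here is a minimal Python sketch (using numpy and matplotlib) that plots s(x) for chosen parameter values:

```python
import numpy as np
import matplotlib.pyplot as plt

# s(x) = A * sin(w*x + theta): amplitude A, frequency w, phase theta.
A, w, theta = 2.0, 2.0, np.pi / 2  # example values; adjust to experiment
x = np.linspace(0, 4 * np.pi, 1000)

plt.plot(x, np.sin(x), label="sin(x)")  # the baseline sinusoid
plt.plot(x, A * np.sin(w * x + theta), label="A*sin(wx + theta)")
plt.legend()
plt.show()
```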

Ratfactor Yesterday

Correction: Text FILES as a user interface

I'm sorry, that previous title was very misleading! I also added a whole new example that I'm hoping will get the idea across a lot better...
