
Patch Tuesday, May 2026 Edition

Artificial intelligence platforms may be just as susceptible to social engineering as human beings, but they are proving remarkably good at finding security vulnerabilities in human-made computer code. That reality is on full display this month, with some of the most widely-used software makers — including Apple, Google, Microsoft, Mozilla and Oracle — fixing near-record volumes of security bugs and/or quickening the tempo of their patch releases.

As it does on the second Tuesday of every month, Microsoft today released software updates to address at least 118 security vulnerabilities in its various Windows operating systems and other products. Remarkably, this is the first Patch Tuesday in nearly two years in which Microsoft is not shipping fixes for emergency zero-day flaws that are already being exploited. Nor have any of the flaws fixed today been previously disclosed (which would potentially give attackers a heads up on how to exploit the weaknesses).

Sixteen of the vulnerabilities earned Microsoft’s most-dire “critical” label, meaning malware or miscreants could abuse these bugs to seize remote control over a vulnerable Windows device with little or no help from the user. Rapid7 has done much of the heavy lifting in identifying some of the more concerning critical weaknesses this month, including:

- CVE-2026-41089: A critical stack-based buffer overflow in Windows Netlogon that offers an attacker SYSTEM privileges on the domain controller. No privileges or user interaction are required, and attack complexity is low. Patches are available for all versions of Windows Server from 2012 onwards.
- CVE-2026-41096: A critical remote code execution (RCE) flaw in the Windows DNS client implementation that is worthy of attention despite Microsoft assessing exploitation as less likely.
- CVE-2026-41103: A critical elevation of privilege vulnerability that allows an unauthorized attacker to impersonate an existing user by presenting forged credentials, thus bypassing Entra ID. Microsoft expects that exploitation is more likely.

May’s Patch Tuesday is a welcome respite from April, which saw Microsoft fix a near-record 167 security flaws. Microsoft was among a few dozen tech giants given access to “Project Glasswing,” a much-hyped AI capability developed by Anthropic that appears quite effective at unearthing security vulnerabilities in code.

Apple, another early participant in Project Glasswing, typically fixes an average of 20 vulnerabilities each time it ships a security update for iOS devices, said Chris Goettl, vice president of product management at Ivanti. On May 11, Apple shipped an iOS update that addressed at least 52 vulnerabilities, backporting the fixes all the way to the iPhone 6s on iOS 15.

Last month, Mozilla released Firefox 150, which resolved a whopping 271 vulnerabilities that were reportedly discovered during the Glasswing evaluation. “Since Firefox 150.0.0 released, they have been on a more aggressive weekly cadence for security updates including the release of Firefox 150.0.3 on May Patch Tuesday resolving between three to five CVEs in each release,” Goettl said.

The software giant Oracle likewise recently increased its patch pace in response to its work with Glasswing. In its most recent quarterly patch update, Oracle addressed at least 450 flaws, including more than 300 fixes for remotely exploitable, unauthenticated flaws. But at the end of April, Oracle announced it was switching to a monthly update cycle for critical security issues.

On May 8, Google started rolling out updates to its Chrome browser that fixed an astonishing 127 security flaws (up from just 30 the previous month). Chrome automagically downloads available security updates, but installing them requires fully restarting the browser.

If you encounter any weirdness applying the updates from Microsoft or any other vendor mentioned here, feel free to sound off in the comments below. Meantime, if you haven’t backed up your data and/or drive lately, doing that before updating is generally sound advice. For a more granular look at the Microsoft updates released today, check out this inventory by the SANS Internet Storm Center.


the flaws of digital consent management

Following up on my agentic consent piece, a reader (Shugo Nozaki) shared some interesting perspectives on human consent (both by email and in a post) that I felt were worth exploring and discussing! He pointed out that while our current model of consent still relies on direct human perception and some understanding of what is being agreed to, it is already quite fragile in most areas. He rightly points out that the idea of human consent resting on real understanding is a polite fiction, as most consent flows are not really designed to be read. They are designed to get users past the gate. So he asks: How much of the current standard are humans actually meeting today?

The reality is definitely that companies have squandered our trust and curiosity with the way consent mechanisms, Privacy Policies, Terms of Service etc. have been designed. Cookie banners keep popping up and sometimes don't seem to work correctly, other consent forms make it as annoying as possible to opt out, and any lengthier text is full of dry legalese. It has caused quite the consent fatigue, and for what?

Unfortunately, the wrong things seem to be incentivized: agreement to data processing is strategically beneficial to companies, so optimizing an easy workflow for not consenting is not in their best interest. And the following is just a hypothesis of mine, but I firmly believe that companies have used the little leeway they were given in implementing privacy law requirements to make things an absolute hassle, in the hope that the requirements would be seen as a failed experiment and users would complain until they were abolished again. When we actually read the laws, recitals, recommendations by organizations etc., we quickly see that we do not have to live with these unpleasant implementations; yet companies get to point at laws for a job done badly and go "They made us do it!". Multi-layered approaches¹ have been acknowledged and recommended for a while now, but implementations of them are still rare. For many companies, these texts seem to be a one-time investment instead of the living documents they should be. "If it works and we fulfill the requirement, why change it?"

At the same time, laypeople unfamiliar with law have to work with heuristics and take surface characteristics as signs of quality when it comes to legal texts like privacy policies, terms of service and more: if it's very long, it must be complete and enough effort must have been invested, and if it has a lot of complicated jargon, it must be professional and correct. So that is what companies want to show, and weirdly, what some very invested users are reassured by. A short, casual version might seem incomplete, as if the company doesn't take privacy seriously. The money side is the same: what do law firms and their clients feel more comfortable charging and paying a lot of money for - the short, casual-toned text that people will understand, or the huge, dry, difficult-to-read one that comes across as more serious? How might companies feel if the person they hired to write these produces something that sounds sloppier than what their competitor has? Will it just come off as unprofessional to customers? Small businesses needing to save money usually face the question: why hire anyone to make it more understandable and engaging to read while fulfilling the law, if you can just copy a trusted template online and fill in the blanks?
No one wants to risk legal problems with a text that does not cover enough, so understandably, they resort to the most intense-sounding texts, and they let consent and cookies be implemented and handled by big consent management companies because otherwise, it can be really difficult; but those companies sell the promise of increasing agreement numbers, so the service is designed around that metric. These wrong incentives and constraints have cost trust that is hard, and in some cases impossible, to get back. Most people weren't born yesterday, and they have lived through years of shitty implementations. What would convince them that the next text is worth reading or that the next implementation will be better?

Consent management in general offloads a lot of data management onto individuals who are seldom correctly informed. On one hand, choice is what we want; on the other hand, it is also willfully ignorant of the collective issues. For now, the position is: giving the user the option to read and agree or disagree is enough. We cannot force anyone to do anything, and if they choose to forego information or agree just to make the process go faster because they are tired, then we have to accept that. Having a choice is also about the option to make a bad decision, or one you regret later, or one you wouldn't have made if you were your best self.

Fittingly, Shugo Nozaki also poses the following idea in email: "If the user policy is explicit enough, an agent may apply it with a kind of rule-following integrity that tired or distracted humans often fail to maintain. [...] How [can] we represent a user's intent, boundaries, and escalation rules clearly enough for an agent to act on them?"

He brings up the option of a machine-readable user policy: a set of constraints that defines what an agent may accept, must reject, or should bring back to the user. We'll likely have to move in that direction, but it still brings legal challenges, as broad consent isn't valid; consent needs to be granular and specific. A user could set an agent to always agree to cookies via personalization/custom instructions set as userpolicy.md, for example, but as they did not get to consent to each specific situation (the website, its partners, their terms), its worth is questionable, and also difficult for companies to prove in court. Ideally, an agent would have to ask the user on the first "visit" of a website how to proceed for current and further uses of it. So the more the agent gets around, the less this needs to happen.
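To make the idea concrete: purely as an illustration (nothing here specifies a real format, so every name, category, and threshold below is made up), such a policy could be as small as this, with escalation back to the human as the default:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the "userpolicy.md" idea rendered as data.
# The purpose names and thresholds are illustrative, not any standard.

@dataclass
class ConsentPolicy:
    accept_purposes: set[str] = field(default_factory=lambda: {"functional"})
    reject_purposes: set[str] = field(default_factory=lambda: {"ad_personalization"})
    max_partners: int = 10  # escalate when a banner lists more vendors than this

def decide(policy: ConsentPolicy, purpose: str, partner_count: int) -> str:
    """Return 'accept', 'reject', or 'ask_user' (bring it back to the human)."""
    if purpose in policy.reject_purposes:
        return "reject"
    if purpose in policy.accept_purposes and partner_count <= policy.max_partners:
        return "accept"
    return "ask_user"  # anything unanticipated escalates to the user

# e.g. decide(ConsentPolicy(), "ad_personalization", 1500) -> "reject"
```

Even a toy like this shows the legal tension: the rules are explicit and consistently applied, but none of them amount to informed consent to a specific website, its specific partners, and their specific terms.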
From a design perspective, even just asking the user to set up a policy themselves can be time-consuming, and I assume not quite feasible for people who are not embedded in the legal context or very passionate about privacy. There is too little context and information about why a decision matters, what could happen, what is tracked, and what different kinds of situations could come up. I consider it beyond the realistic use case that the average user would introduce different categories of consent based on, for example, whether it is a blog or a shopping website, whether the banner says 4 partners or 1500, and other distinctions that would enable more granular consent. All upfront. An agent could, on first setup, lead a user through it, but that could also be seen as annoying and skipped.

Issues around the modalities of being asked and informed still remain: Do we trust the agent to relay information accurately? Will there be hidden instructions to influence the bot in what it tells the user? Would this approach really be less annoying than the existing method, when basically everything needs to be brought back for the user to decide at first? How will we reliably handle agents informing the user of changes in policies and the like?

Yes, ideally, agents and other means could handle consent management better than a fatigued and annoyed human. But what counts primarily for the laws around data processing consent is that, without a middleman, there is no doubt that a user directly had a choice in taking notice and the chance to inform themselves, even if they chose not to; it often matters little whether they actually read and understood, as that cannot be proven or checked (and again, freedom to make bad choices). The waters get muddier when there is a translator in the middle that can sway you or skip things entirely, with no directly presented option. I'm sure that on the other side, companies will also be interested in not having their consent workflow and information misrepresented or skipped by a bot. It will actually also be interesting to see how worthwhile browsing data will still be if the metrics track the behavior of bots, not humans.

I'll hopefully have more to say about this soon, as I will be at a conference with some sessions about consent management in the age of AI that I will attend! :)

Reply via email

Published 12 May, 2026

¹ In which the first text version a user sees is in easy language, casual and short; if they want, they can see a lengthier, more detailed version, down to another layer, where the heavy legalese we are used to pops up. ↩

Unsung Today

Rug pulled

The best thing the crypto industry coined might have been the expression “rug pull,” but I’m not happy about that. To me, it perfectly describes how it feels when an app or a website randomly changes your scroll position with no rhyme or reason. You’ve seen it so many times before:

- you start reading a webpage, but it throws you back to the top when JavaScript finishes loading
- you start reading a webpage, and ads or other stuff appear and shove you around up and down
- you press a back button and that goes to the previous page… but to its top, rather than where you actually were
- you zoom in or out, the position isn’t recalculated properly, and suddenly you see a different part of the page and lose your orientation

To me, the scroll position is as sacred as the mouse pointer position, given the two are related whether Scroll Lock is around or not: one is you, the other is the world around you.

But there are moments when software scrolling with the user or even for the user is appropriate, and here’s one example: when you switch tabs, the content below should always scroll to the top, but it doesn’t here. Here’s an even worse example, also from Settings.

Why should the content scroll to the top here? Because in these situations, the fact that the content container gets reused is just a technical quirk of the implementation. From the user’s perspective, this is all new content, and new content should always start at the top. Otherwise, things will get confusing really fast; imagine it especially in the default configuration without scrollbars, where you might assume result number 6 is the first result, or completely miss the most important, topmost options.

(Before you ask: Yes, I also see this in Tahoe.)

#interface design #mouse


Building Software Requires Digestion

Here’s Scott Jenson in his insightful piece “The Ma of a New Machine”:

the chatbot interface [makes us] feel like deep cognitive work is happening. But the interface is fundamentally reactive. It spits complex text at you, you skim it quickly, and you immediately type a reaction to keep the momentum going. My hypothesis is that the very structure of the chatbot interface (type, read, type again) actively discourages reflection. When you are moving too fast, you get stuck in a groove. You literally need to take a break, step back, and basically step out of this groove so you can view the problem from a new angle.

We’ve all walked away from a tough problem only to have the solution arrive unbidden into our thoughts later in the day. In my decades-plus experience designing and developing software, I can’t count the number of times I’ve stepped away from a problem at the computer only to return and find the problem magically resolved in my brain. But the human-computer interaction of prompting doesn’t encourage the use of that skill in our subconscious. In fact, I think it actively discourages it (our tools shape us).

Scott talks about this Japanese concept called “Ma,” which is about deliberately creating pauses between things. He quotes Studio Ghibli director Hayao Miyazaki, who says “if you just have non-stop action with no breathing space at all, it’s just busyness.” Here’s Scott (emphasis mine):

Ma provides a framework for understanding that a pause is not a lack of work

As humans we need pauses. We need space to breathe. We need time to digest. Pausing, breathing, synthesizing, digesting — these are all necessary work.

“Digestion” is an interesting word here. Putting food in your body is merely the beginning of feeding yourself. Our bodies must digest that food, break it down, absorb it, and get rid of the waste. But that’s all happening mostly without our attentive oversight, so I guess it’s not “real” work — right?

Building good, healthy software requires digestion.

Reply via: Email · Mastodon · Bluesky


Upgrading My Home Internet to Full Fibre

As many regular readers know, we live in the North Wales countryside, which means it can take time to get the latest and greatest when it comes to technology. As a result, we were previously "limited" to FTTC (fibre to the cabinet), which had a max speed of 70Mbps, so we got okay internet speeds:

But then I saw the ISP vans in the village, and I asked one of the engineers what they were doing. "Oh, we're upgrading the village to full fibre," she said. I had to have it! As soon as FTTP (fibre to the premises) was available, I placed the order with my ISP (who offered me a great deal that's only £5 per month more), and this is the result:

In all honesty, I haven't noticed the difference. We didn't have any buffering issues when watching things like Netflix or Apple TV, so in hindsight I'm not really sure why I upgraded. I thought it would be this incredible difference where my internet would suddenly be rapid, but the truth is, it's completely imperceptible. I remember when I upgraded from a 56k modem to ~2Mbps broadband, and it blew my mind. I was thinking this would be the same, but no.

I do think the increased upload speed is going to come in handy for things like syncing my private git repos back to my Synology, but aside from that, there's not much in it. Had I paid full price (~£20 more per month) I don't think I'd have been too happy, but since I got a good deal, I'm not too bothered.

Thanks for reading this post via RSS. RSS is ace, and so are you. ❤️ You can reply to this post by email, or leave a comment.


Fixing a proxying problem with my HomeAssistantOS installation by replacing nginx proxy manager

tl;dr: I removed the “nginx proxy manager” add-on and replaced it with the Let’s Encrypt add-on and (second) the nginx add-on.

A couple of months ago, I moved my HomeAssistant installation to HAos. I think that it is fair to say that I was not overly pleased with this. Honestly, I preferred the “Core” python-venv approach, but I also wanted a “supported” installation, and so I switched to HAos.

I got it up and running okay, and I thought that I had got proxying working too, using an add-on called “nginx proxy manager”. This is not something that I had used before; I’d rather just configure nginx myself. Well, either I got something wrong, or it just does not work very well, as I kept having problems using HomeAssistant: getting stuck on a “loading data” screen, or it simply not responding.

This bugged me for quite a while. Annoyingly, the logs available to me within HAos were unhelpful; I couldn’t spot anything indicating a problem. Using the console in my web browser, I noted that some files were not loading correctly, but why that was the case, I wasn’t sure. I thought that I’d had a similar issue with my “Core” installation years ago, which I got down to a setting in the configuration file, but that looked correct here (which I was able to check using the SSH add-on). I tried various parameters in the nginx proxy manager add-on, but to no avail.

In the end, I tried removing the nginx proxy manager add-on and replacing it with the Let’s Encrypt add-on (which I installed, configured, and ran first), and then the nginx add-on. And it immediately started working correctly. So I don’t know exactly why my original set-up was not working, but at least it is working better now.

Unsung Today

Save For Web claws

Randomly found this 2014 Dribbble from Jamie Nicoll and it made me smile:

For context, Save For Web was a popular export function in Photoshop at the peak of its use for web design, but it was assigned a rather unpleasant ⌘⌥⇧S shortcut. Using it often turned your hand into a… claw of sorts. There was a Tumblr cataloging real and humorous photos of people pressing Save For Web. You can still find parts of it on Internet Archive, and here are some choice photos:

This is funny, but I actually found it enlightening – and lightly frightening – to ask coworkers how exactly they press common shortcuts like ⌘Z, ⌘C, ⌘V, and so on. There was a lot more variety than I expected. (My basic heuristics say: three-modifier-key shortcuts should not be assigned to anything used often.)

#humor #keyboard


The Rise of the Bullshittery

Disclaimer: This is an opinion piece. It is the result of years of watching the same pattern play out in different industries, and of sort of running out of patience. If you are one of the people doing honest, careful work in a field that no longer rewards it, this post is for you. However, if you are one of the people I am about to describe, then you probably already know who you are, and you might want to keep reading nevertheless. The tl;dr is at the bottom.

A few weeks ago, I found myself in one of the rare situations in which I was mindlessly doom-scrolling on LinkedIn, only to see one post after another that contained no actual information, and not a single sentence that would have lost any substance if you replaced every noun in it with a different noun. There were thought leaders leading no thoughts, founders founding nothing of actual value, strategists describing strategies that amounted to “be visible” and “ship fast”, and an alarming number of self-described AI experts whose expertise appeared to consist entirely of having a ChatGPT or Claude subscription and the willingness to write about it in seventeen-paragraph posts.

There is a word for this kind of communication, one the philosopher Harry Frankfurt famously employed back in 1986, when he wrote a short essay called On Bullshit. Frankfurt’s central observation, which has aged terrifyingly well, is that the bullshitter is not the same as the liar, because the liar at least respects the truth enough to try to hide it, while the bullshitter does not care whether what they are saying is true or false. The truth-value of the statement is simply not part of their concern. The bullshitter is optimising for a different objective, usually appearing competent, appearing confident, or appearing to be the right kind of person to be in the room. And precisely because the bullshitter is indifferent to truth, Frankfurt argued, they are a greater threat to honest discourse than any liar. Decades on, that essay reads like a pre-mortem on the modern internet and, in parts, modern society.

The unspoken contract behind most professional life used to be as simple as learning how to do something, doing it well, and gradually developing a reputation among people who could tell the difference. Over time, that reputation would then translate into work, money, and a degree of stability. It was a slow process, sometimes unfair, and never as meritocratic as its proponents claimed, but at least the basic shape of it made sense. Doing a good job was, on average, an advantage.

That contract, however, has been broken in ways that are hard to ignore these days. The dominant mechanism for distributing professional opportunity is no longer slow reputation; it is algorithmic visibility. The algorithm, however, does not particularly care whether you are good at your job. It only cares whether your message is engaging enough to spread fast and far. Researchers studying the so-called attention economy have been making this point for years, and one particularly interesting area concerns politicians: a 2024 analysis of more than 6,500 U.S. state legislators found that distributing low-credibility information correlated positively with attention on the major platforms. In other words, being less reliable was, on average, a winning strategy for getting noticed.
The same dynamic applies, in a less visible but more pervasive way, to anyone who has to build an audience to find work. The people who optimise for being correct are competing on an unfair playing field against people who optimise for being heard, and the result is a slow inversion of incentives. The careful professional, who takes a week to think through a problem, who refuses to claim expertise they do not have, and who writes one in-depth, researched post about a specific topic, gets out-competed and buried by the carnival barker who will claim any expertise that fits the trending topic, and who fires off five posts a day, each a slightly different rephrasing of the same content-free observation. I am not arguing that honest, competent work has disappeared, but I am arguing that the incentive structure no longer points toward it, and that this fact has consequences that compound over time.

If you want to see the cleanest expression of this, the place to look is LinkedIn. The platform has become, by any reasonable metric, the professional-class equivalent of late-night infomercial television, except the products on offer are other people’s careers. There is now a well-documented genre of so-called mentorship influencers on the platform who leverage job seekers’ desperation to sell hollow advice, false hope, and bogus referrals, often under the facade of having worked at a recognisable (mostly tech) company. The trick is the same one snake-oil salesmen have been running for centuries: Look at me, I am living proof that what I am selling works! These days, however, the trick comes with a slightly more modern twist, and the proof for the sales pitch tends to be a curated profile picture, a fabricated job title, and a few thousand bot-inflated followers.

What makes this maddening is not the existence of grifters, who are an old problem, but the way LinkedIn (and many other platforms) actively rewards them. The algorithm does not know the difference between a thoughtful five-paragraph essay by somebody who has spent a decade in the field and a five-paragraph essay generated in twenty seconds by an LLM, probably sprinkled with emojis. From the algorithm’s perspective, both are content, and the one that triggers more engagement (usually the cheaper, more emotional, more bombastic one) wins. Multiply that across millions of users and you end up with a feed in which the loudest claims rise to the top, and the people doing the actual work become invisible. The same shape repeats on Medium, on X (formerly Twitter), on Instagram, on YouTube, on TikTok, on Substack, and on all the other content-driven platforms, where there is now an entire AI grift economy of fake money-making gurus recycling the same handful of prompts and selling courses about how to do it. While the platforms might be different, the physics are the same: the currency is engagement, and the byproduct is bullshit. The casualty of all of this is, sadly, anyone whose work cannot be compressed into a fifteen-second hook.

While snake oil predates the internet by a few centuries, and plenty of people built lucrative careers out of nothing long before LinkedIn existed, what is new, and what I think changes the problem, is that the marginal cost of producing convincing bullshit has collapsed. Large Language Models have done for grift what the shipping container did for global trade. They did not invent it, but they turned a manual process into an industrial one.
Now, anyone with a browser can generate a thousand words of confident, on-topic, syntactically clean text on any subject in under a minute. They can ship a book to Amazon, an article to a content farm, a thread to LinkedIn, and even a video to YouTube, all without ever having to know what they are talking about. The output passes the basic test of sounds about right, and that is, increasingly, the only test the distribution channels (and, sadly, the readers and viewers) apply.

This behavior, however, might stem from a phenomenon that was observed over a decade ago: the spread of paid employment that even the employee secretly believes is pointless and, in a sense, hollow. In his 2013 essay On the Phenomenon of Bullshit Jobs, David Graeber argued that an enormous and growing fraction of professional work, in finance, consulting, middle management, communications, and adjacent fields, was producing nothing of obvious social value, and that the people doing it knew. However, it is important to mention that the empirical data for Graeber’s strongest claims is contested: a 2022 study found that less than 8% of European workers reported feeling their job was useless, well below the 20-60% that Graeber’s framing implied, and it appears that toxic culture and bad management were better explanations than pointlessness for the unhappiness he was describing. I nevertheless think that a part of his observation survives the critique, which is that an awful lot of modern professional life consists of producing artifacts whose primary audience is other people producing artifacts. Slide decks for slide decks, strategy documents about strategy documents, posts about posting. Obviously this work does not seem useless to the worker, who is being paid, or to the platform, which is selling ads against it, but it is still utterly useless to anyone outside the loop.

This is the bullshittery in its mature form, which consists not of individual lies or individual scams, but of a steady-state ecosystem in which a large share of professional output is produced to be seen by other people producing output, and in which the connection to anything resembling a real customer, a real problem, or a real outcome has gone slack.

The part that bothers me the most is what it does to the people who refuse to participate in this whole charade. If you are a software engineer who insists on shipping things that work, a writer who insists on knowing the subject before publishing, a designer who insists on testing the thing on actual humans, a craftsperson of any kind who treats the work as the whole point of it, you are competing in a market that has been quietly tilted against you. The person next to you, who is willing to fake the demo and declare victory on LinkedIn even before the launch, is going to look more successful than you. They will get the speaking slots, they will get the promotions or, worse, the funding rounds. Heck, they might even end up on Forbes’ 30 under 30. All that you will get is the satisfaction of doing the job properly, which, don’t get me wrong, is a beautiful thing, but sadly it does not pay rent. I think a lot of the cynicism, exhaustion, and quiet bitterness that has crept into professional life over the last years is downstream of this problem.
I don’t believe that people no longer want to do good work, but I think that doing good work has stopped paying the way it used to, while doing bad work loudly has started paying significantly better, so people notice and they adjust.

Of course, I might be completely off here, and it is possible that the situation is not actually worse, only more visible. Bullshit has always been with us, and neither LinkedIn nor any other platform invented the self-promoting middle manager. What has changed, though, is the observability of the bullshit, for which we now have a continuously updating feed. We see it all consolidated into a handful of prominent places, and maybe the volume looks higher because we are looking at all of it at once, not because the per-capita rate has actually climbed. This could be an explanation, but I frankly don’t think it accounts for all of what I am describing.

It could also be, however, that what I’m describing is just people trying to keep up. The slop-posting middle manager who cannot tell you what their team actually built last quarter is not necessarily a malicious fraud; they may be a person whose job no longer rewards them for knowing, in a system that has trained them to perform and act instead. While this, if true, does not make the output less hollow, it certainly does change who the actual villain is.

Frankly, I don’t know, and I do not have any advice to give straight away on this. I believe, however, that in order to dial the bullshittery down again, we need action on both sides: the reader/viewer as well as the performer/creator.

As viewers, we probably need to get back to rewarding substance when we see it. If somebody you follow does the careful, properly-sourced version of a piece of work, say so out loud. The system is starving them of the signal that it cheerfully overpays the bullshitters with, and you are one of the people who can correct that. If you, as a viewer, can afford it, pay for the human-made version when you can. If a writer, an engineer, a designer, a musician is doing the work, and there is a way to give them money that does not pass through three layers of platform extraction, do it! The economics of doing real work in public are bad enough already without the further insult of zero direct support.

As creators, we have to refuse to perform what we do not believe. This is harder than it sounds, because there is incentive, and maybe even pressure, to write that post, record that video, do that talk, publish that announcement, and saying no costs visibility you may not be able to afford to lose. But every honest professional who declines to bullshit is a small data point against this trend, and I think there need to be more of those data points. Frankfurt’s deepest argument is that the bullshitter is not embarrassable, because they have no relationship to the truth they could betray, while the honest person can be embarrassed, because they have made a claim they meant. As a creator, hold on to that, because being embarrassable is not a weakness. In a market that has stopped penalising shamelessness, it is one of the few remaining markers that the person you are talking to is operating in good faith. So be embarrassable!

When I started writing this post, the angry version of it was about the people: the grifters and the gurus, the LinkedIn content pushers and the vibe-coding founders shipping vaporware to investors who frankly should know better.
But after a few drafts I realised that I was aiming at the wrong target, because the people are mostly responding rationally to a system that pays for performance and ignores substance. If I blame them, I have to also blame myself for the times I stayed quiet and smiled at the demo, or signed off on the launch I did not believe in. I guess most of us have done some version of that. It’s the system that is to blame or, as the old saying goes, “don’t hate the player, hate the game”. A market that prices visibility above credibility, that rewards the loudest claim over the truest one, and that lets a thin facade outsell a real product because the facade ships faster, is not a force of nature. It is the cumulative effect of a lot of small decisions made by platforms, regulators, employers, and consumers, including me and you. None of those decisions are settled forever, and each one of them is, in principle, reversible.

I do not think honest work is going away, but I do think it is being pushed into a narrower, harder-to-find tier, the way handmade goods were pushed aside when the factories arrived. There will still be a livelihood in it, and for some of us a very rewarding one, but the path to that livelihood will increasingly require you to do the work and to make the case, in public, for why your version of it is worth more than the cheaper, louder, hollower alternative. And that is a significantly harder game than the one we used to play.

The simplest thing I can offer to anyone reading this who is tired of being out-shouted by the bullshittery is also the most boring: keep doing the work, keep a principled and honest stance, keep saying I don’t know when you don’t, keep being embarrassable. Even though the market is bad at rewarding it right now, it will not continue to be forever. Hopefully.


Mythos finds a curl vulnerability

Link: https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/

Daniel Stenberg, creator and lead developer of cURL:

My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing.

I signed the contract for getting access, but then nothing happened. Weeks went past and I was told there was a hiccup somewhere and access was delayed. Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report. To me, the distinction isn’t that important. It’s not that I would have a lot of time to explore lots of different prompts and doing deep dive adventures anyway. Getting the tool to generate a first proper scan and analysis would be great, whoever did it. I happily accepted this offer.

So Daniel didn't have access to Mythos. Someone else ran the analysis on his behalf. It's unclear what methodology this "someone else" used, how familiar they were with the cURL codebase, or how well they were acquainted with the sort of security issues the project has seen before. What if Daniel had run the scan himself? I'm willing to bet the results would've been radically different.

I'm not saying all the hype around Mythos is necessarily justified—Anthropic is an AI lab after all, and AI labs lie. However, it's becoming clear that LLMs are remarkably effective at finding bugs and security issues as long as they have the right guidance. For an example of what Claude can do with expert guidance and access to custom tools, see Using LLMs to find Python C-extension bugs.

Broadly speaking, I believe Daniel would agree with this sentiment. He writes:

But allow me to highlight and reiterate what I have said before: AI powered code analyzers are significantly better at finding security flaws and mistakes in source code than any traditional code analyzers did in the past. All modern AI models are good at this now. Anyone with time and some experimental spirits can find security problems now. The high quality chaos is real. Any project that has not scanned their source code with AI powered tooling will likely find huge number of flaws, bugs and possible vulnerabilities with this new generation of tools. Mythos will, and so will many of the others. Not using AI code analyzers in your project means that you leave adversaries and attackers time and opportunity to find and exploit the flaws you don’t find.

Lately I find myself drawn to how LLMs can help improve existing human-authored (or mostly human-authored) code. I'm no longer thrilled with the idea of using them to write most of my code for me—been there, dealt with the cognitive debt—but I'm intrigued by how I could use them as superhuman code reviewers to catch my mistakes. What would a coding harness designed primarily around improving code quality look like?
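I can only speculate, but my guess is it would look less like a code generator and more like a review loop: gather machine-generated evidence first, then ask the model to reason about it. A rough sketch, where `review_with_llm` is an injected, hypothetical model call and ruff is merely a stand-in for whatever analyzers a project already trusts:

```python
import subprocess

# Speculative sketch of a review-first harness, not any real product.

def gather_evidence(path: str) -> str:
    """Collect machine-generated signals before the model sees anything."""
    diff = subprocess.run(["git", "-C", path, "diff", "HEAD~1"],
                          capture_output=True, text=True).stdout
    lint = subprocess.run(["ruff", "check", path],  # stand-in analyzer
                          capture_output=True, text=True).stdout
    return f"DIFF:\n{diff}\n\nLINT:\n{lint}"

def review(path: str, review_with_llm) -> str:
    """review_with_llm: hypothetical str -> str model call, injected."""
    prompt = ("You are a security-focused code reviewer. Flag likely bugs "
              "and vulnerabilities with file, line, and reasoning. Say "
              "'no findings' rather than inventing issues.\n\n"
              + gather_evidence(path))
    return review_with_llm(prompt)
```

The point of the shape, if it holds, is exactly Daniel's: the guidance and the evidence you feed the model matter at least as much as the model itself.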

Stratechery Yesterday

SpaceX and Anthropic, xAI’s Two Companies, Elon Musk and SpaceXAI’s Future

The Anthropic xAI deal is shocking but not surprising: Musk should double down on serving other companies.

(think) Yesterday

Port: a minimalist prepl client for Emacs

For ages I’ve had “add prepl support to CIDER” sitting somewhere in the back of my head. CIDER is built firmly around nREPL, but prepl ships with Clojure itself, and the appeal of dropping the external REPL server requirement is obvious. Recently, as part of a broader internals cleanup “mini-project” in CIDER, I finally sat down and put a prototype together: cider#3899.

The good news is that the prototype sort of worked. The bad news is that the more I poked at it, the more I kept running into the same pattern. CIDER assumes ops, sessions, request ids, and a whole structured protocol that prepl simply doesn’t have. The amount of CIDER code that would need to grow “is this nREPL or prepl?” branches added up quickly, and I’d be papering over prepl’s limitations in dozens of subtle places. The exercise was fun, but it ended up reaffirming my long-standing belief that nREPL is a much better fit for editor tooling than prepl is.

The exercise did leave me thinking, though. What if, instead of bolting prepl onto CIDER, I built a small standalone client in the spirit of inf-clojure and monroe? Something tiny and focused that doesn’t have to pretend to be CIDER, and where prepl’s quirks would be the design rather than something to work around.

Conveniently, I was on vacation in Portugal at the time, where I spent a few days in Porto, and the name pretty much picked itself. Port was born. It kept us firmly in the land of fun, drink-inspired Clojure-on-editor names: CIDER, Calva (after Calvados, the apple brandy), and now Port (the famous fortified wine). The protocol Port talks to is prepl, over a TCP port, so the pun was hard to pass up. This time around I didn’t manage to land on a backronym I love (at least not yet). The contenders so far:

- prepl omnipotent repl toolkit (my favorite so far)
- prepl-operated repl toolkit
- peak optimized repl toolkit

Naming is hard. I remain open to better suggestions. :D

Port is a side project. I don’t plan to invest serious time in it past the point I consider it feature-complete, which won’t be far beyond what’s already there. The deliberate goal is to keep it simple and focused, and its feature set will stay close to inf-clojure and monroe. Port is not competing with CIDER. If you want the full feature set (debugger, inspector, test runner, profiler, structured stacktraces, refactor support), CIDER is, and will remain, the way. What Port gives you today is a small, dependable Clojure REPL that you can hook into Emacs without any external dependencies, just a stock Clojure JVM with a prepl listening on a port.

If you’re up for the long version, doc/design.md goes deep. Here’s the short version of what prepl gives you compared to nREPL:

- No bencode. prepl emits EDN-tagged maps, one per line. This might be a feature or a problem, depending on your perspective.
- No middleware. Whatever the server prints is what you get. No interception, no extension surface.
- No sessions. There’s one thread per TCP connection.
- No ops. You send a Clojure form, the server evaluates it, and prints back tagged messages: :ret, :out, :err, and :tap, plus an :exception flag on errors.
- No request id. This is the main issue. Tags identify the kind of message, not which request produced it.

That last point is the central design constraint. If you want to issue a request and reliably read back its result without accidentally consuming output from unrelated background work, you need to layer correlation on top of the protocol. Port does this with two tricks:

- Two sockets per session. A user socket drives the REPL buffer with raw streaming output, and a separate tool socket carries helper-command requests. Background prints on the user connection don’t bleed into the tool channel.
- A bootstrap form. On connect, the tool socket evaluates a one-shot form that defines a wrapper. Every subsequent helper call goes through it, which captures output and errors and returns a tagged map containing the request id. The client matches the id against a pending-callback registry.

This is what nREPL provides via sessions and ops, just reinvented at the TCP layer. It’s a fair amount of work for something nREPL gives you for free, which only strengthens my view that nREPL is the better protocol for editor tooling. Still, it was an interesting and educational exercise.
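To see how bare the wire format is, here's a toy exchange against a stock prepl. This is an illustration in Python rather than anything from Port, and the port number is arbitrary:

```python
import socket

# Assumes a local prepl started with something like:
#   clj -J'-Dclojure.server.prepl={:port 5555 :accept clojure.core.server/io-prepl}'

with socket.create_connection(("127.0.0.1", 5555)) as sock:
    sock.sendall(b"(+ 1 2)\n")                 # send one form
    print(sock.makefile().readline().strip())  # one EDN-tagged map per line
    # => something like {:tag :ret, :val "3", :ns "user", ...}
```

That single line of EDN is the whole protocol: no handshake, no ids, nothing to correlate with, which is exactly where the two tricks above come in.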
One thing I’m fairly proud of: Port has no hard dependencies. You’ll want a Clojure major mode installed for the source-buffer side of things, but Port itself only soft-depends on such modes via runtime checks. Hook it onto whichever one(s) you actually use. I intend to keep it that way. Dependency creep is a real problem in the Emacs (every?) ecosystem, and a small package should stay small.

I tagged v0.1.0 yesterday. It’s small but already perfectly usable:

- “jacks in” (bootstraps): auto-detects the project type, starts a prepl server and connects to it
- single-buffer REPL with persistent input history, completion, and eldoc at the prompt
- interactive evaluation from source buffers with pretty-printed results
- structured stacktrace buffer with cause chain and navigable frames
- find-definition that follows into jar sources
- doc/source/apropos/macroexpand helpers

MELPA submission is queued up next. After that, expect Port to be in burst-driven maintenance mode like most of my smaller projects. Feedback, ideas, and contributions are most welcome. The issue tracker is the right place.

Funny thing, I’d never actually written any code against prepl until I started this project. It was fun to spend some quality time with the “competition” of my beloved nREPL. Working with a different protocol always teaches you something about the one you’re used to, and I came away from this with a renewed appreciation for both: prepl is genuinely elegant for what it is, and nREPL is genuinely well-designed for what we use it for.

Big thanks to Clojurists Together and everyone else who supports my OSS Clojure work. You rock! Now if you’ll excuse me, I have new releases of CIDER, clj-refactor, and refactor-nrepl to get back to. Keep hacking!


Long Running Agent Engineering

What does it take for an agent to keep working after you leave? Not "answer a long question." Not "use a big context window." I mean actually keep working. Hours. Days. Maybe weeks. Wake up in a fresh session, understand what happened before, choose the next useful thing, make progress, verify it, leave the workspace cleaner than it found it, and do it again.

For the last few years we have mostly talked about agents as if the hard thing was autonomy inside one conversation. Give the model tools. Put it in a loop. Let it call bash, edit files, search the web, open a browser, run tests. That loop is real, and it is already enough to change how software gets built. But long running agents expose a different problem. The agent loop is not the product. The harness is.

The model does not naturally persist across turns, context windows, sandboxes, process crashes, or days of work. A fresh session is born with amnesia. It has no idea what the last session tried, which tests failed, which files were half edited, which plan is stale, which shortcut was tempting but wrong, or whether the thing it is about to mark done was already marked done three runs ago and later discovered broken. That is the real long running agent problem: handoff across amnesia.

The answer emerging across Anthropic, Cursor, OpenAI, Claude Code, Addy Osmani's survey of long running agents, and the Ralph Wiggum community is surprisingly consistent. It is not one magical always awake model. It is not stuffing the whole history into a bigger window. It is a harness that externalizes state into the workspace, restarts agents with fresh context, uses machine verifiable checks as backpressure, and assigns completion judgment to something other than the worker that wants to be done.

Here is the punchline up front: long running agents are not long conversations. They are recoverable workflows. The model is one worker inside that workflow. The durable artifacts are the real continuity layer.

It also helps to separate three ideas people collapse into one phrase: long horizon reasoning, long running execution, and persistent agency. A model can reason through a deep task without running for days. A process can run for days without remembering anything useful. An agent can remember the user without owning one large task. Production systems blur the three, but the engineering problems are different.

The naive version of a long running agent is a single agent in a single conversation with a very large context window. This works for small tasks. It fails exactly where long running agents are supposed to matter. The failure is not just that the context window fills. A 200K or 1M token window still becomes a junk drawer if you keep pushing tool outputs, diffs, plans, screenshots, stack traces, and half obsolete reasoning into it. The model does not get a clean working memory. It gets an archaeological site.

Anthropic's effective harnesses post frames this cleanly: complex tasks span multiple context windows, but each new agent session begins with no memory unless the environment itself tells the story. They describe two predictable failures. First, the agent tries to one shot too much, runs out of context, and leaves a half implemented mess. Second, a later session looks around, sees progress, and decides the whole project is done. That second failure is the one I keep seeing. The agent is not lazy. It is locally rational.
It sees a repo with code, some tests, maybe a UI that loads, maybe a checklist with many items checked. In the absence of a crisp external completion contract, "looks basically done" becomes an attractive stopping point. Long running work makes this worse because every session inherits ambiguity from the previous one.

Compaction helps, but compaction is not continuity. A summary can preserve some facts, but it cannot replace a workspace that is structured for recovery. This is the same lesson as agent memory engineering, just at task scale. Memory that lives only in the context window dies when the window dies. Work that lives only in the agent's chain of thought dies when the session dies. If you want continuity, put it somewhere the next worker can read.

The architecture that keeps recurring has a stable spine: an initializer that sets up the workspace, repeated fresh worker sessions that each make bounded progress, machine verifiable checks as the gate, and durable state on disk tying it all together. There are variations, but the spine is stable.

Anthropic uses an initializer agent plus repeated coding agents. The initializer creates the environment future agents need: a progress file, a feature list, and a first git commit. Subsequent agents read the state, pick one not yet passing feature, implement it, test it end to end, update the progress log, and commit.

The community Ralph Wiggum pattern is the minimal version: a dumb loop that keeps feeding the same prompt file to a fresh agent session until the work passes. The important thing is not the loop. The important thing is what the loop forces. Every iteration starts with fresh context. Every iteration rehydrates from disk. Every iteration must leave disk in a state the next iteration can understand.

Blake Crosley's Ralph Loop writeup describes the same pattern through stop hooks: intercept exit attempts, persist state to the filesystem, and restart with a fresh context window until machine verifiable completion criteria are met. Geoffrey Huntley's community guide reduces it to a beautiful primitive: a shell loop feeding a prompt file to the agent, with the implementation plan on disk acting as shared state between otherwise isolated runs.

That is the thing people keep underestimating. The loop can be dumb if the workspace is smart. No blackboard server. No bespoke orchestration database. No vector store. No "agent society" with vibes based coordination. Markdown files, git, tests, and a process supervisor. Annoyingly simple. Annoyingly effective.

The Ralph loop works because it replaces one degrading conversation with many clean attempts. The agent is not continuous. The workspace is. This flips the unit of autonomy. You stop asking, "Can this one conversation survive for ten hours?" You ask, "Can each session leave enough evidence that the next session can continue without asking me?"

That means the agent's job is not only to build. It has to maintain the run state. A good Ralph prompt is mostly contracts: read the state on disk first, make one bounded unit of progress, prove it with checks, and leave the log and the commit history ready for the next session. This is not glamorous. It is project management for an amnesiac coworker.

The loop also gives you a natural escape hatch. If the agent goes off track, you edit the plan. If the prompt is too loose, you add a guardrail. If the tests are weak, you strengthen the oracle. If the agent keeps duplicating work, you make completed work more visible. If it keeps touching unrelated files, you narrow the write scope. The prompts you start with are never the prompts you end with. Long running harnesses are tuned by watching failure patterns.

That is why Ralph is more than a meme. It is the first pattern that made the correct abstraction obvious: the human sits outside the loop and engineers the environment, not inside the loop approving every step.
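To make that concrete, here is a minimal supervisor in the Ralph spirit. Everything in it is an assumption for illustration: "agent" stands in for whatever coding-agent CLI you run, and PROMPT.md plus a pytest suite stand in for the prompt file and the machine verifiable checks:

```python
import subprocess
import sys

# A minimal Ralph-style supervisor: a dumb loop around a smart workspace.

MAX_RUNS = 50  # budget: a human looks at the workspace when it runs out

def checks_pass() -> bool:
    """Backpressure: completion is decided by checks, not by the worker."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

for run in range(MAX_RUNS):
    # Fresh session every iteration; nothing survives but the workspace
    # (plan, progress log, git history).
    subprocess.run(["agent", "--prompt-file", "PROMPT.md"], check=False)
    subprocess.run(["git", "add", "-A"], check=False)
    subprocess.run(["git", "commit", "-m", f"ralph run {run}"], check=False)
    if checks_pass():
        sys.exit(0)
sys.exit(1)
```

Note what is deliberately absent: no transcript carried between runs, no retry cleverness. The commit after every iteration is the handoff document.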
The roles keep converging on the same three: an initializer, a worker, and a judge. Sometimes these are separate prompts. Sometimes separate models. Sometimes separate processes. Sometimes the judge is a test suite. Sometimes it is a small evaluator model. But the roles are conceptually different, and mixing them is where harnesses get mushy.

The initializer is the first agent that touches the task. Its job is not to implement the product. Its job is to make implementation possible across many future sessions. Anthropic's initializer writes a comprehensive feature list. In their clone example, the feature list expanded the user's high level prompt into hundreds of end to end feature requirements, all initially marked failing. This prevents the later worker from inventing a tiny definition of done. A good initializer creates the map every future worker starts from: the environment setup, the progress log, the feature list with everything marked failing, and the first commit. The initializer is where you spend tokens to save tokens later. Every future worker starts faster because the workspace already has a map.

The worker should not be asked to "finish the project." That is how you get giant diffs, brittle code, and fake completion. The worker should be asked to make one bounded unit of progress. The stop matters. A worker that never stops slowly turns into the bad single session architecture. Fresh starts are not overhead. Fresh starts are the mechanism that keeps drift from compounding.

The worker should not be the final judge of completion. Workers want to be done. Not emotionally, obviously, but statistically. The completion token is attractive. The model has a strong prior toward wrapping up once the output looks coherent. On long horizon tasks this creates false positives.

Claude Code productizes this separation. You give Claude a completion condition. After each turn, a separate evaluator model checks whether the condition has been met. If the answer is no, the evaluator's reason becomes guidance for the next turn. The worker model is not the only judge of its own success. That one design detail is huge.

OpenAI's harness engineering post describes a similar review loop: Codex writes code, reviews its own changes, requests additional agent reviews locally and in the cloud, responds to feedback, and iterates until reviewers are satisfied. They explicitly call this a Ralph Wiggum loop.

The pattern generalizes: the judge does not have to be smarter than the worker. It just has to be fresh, narrower, and less invested in the worker's local narrative.
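A small sketch of that separation, with no particular vendor API implied; the model call is injected as a plain function precisely because the judge could be any fresh model, or something that is not a model at all:

```python
from typing import Callable

# Sketch of a separate completion judge; everything here is illustrative.

JUDGE_PROMPT = """You are a completion judge with fresh context.
Completion condition: {condition}
Evidence the worker surfaced (test output, logs, screenshots as text):
{evidence}
Answer PASS or FAIL on the first line, then one line of reasoning."""

def judge(condition: str, evidence: str,
          complete: Callable[[str], str]) -> tuple[bool, str]:
    """The worker never grades itself; the judge only reads surfaced evidence."""
    verdict = complete(JUDGE_PROMPT.format(condition=condition, evidence=evidence))
    first = (verdict.splitlines() or [""])[0].strip().upper()
    # A FAIL verdict, reasoning included, becomes guidance for the next turn.
    return first.startswith("PASS"), verdict
```

The asymmetry is the point: the judge sees only the condition and the evidence, never the worker's ten hours of self-justifying context.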
Long running agents need durable state, but not all state is the same. If this state lives only in the transcript, the next session has to reconstruct it. If it lives on disk, the next session can read it.

Anthropic's scientific computing post is the cleanest non web app example. Claude worked over multiple days on a differentiable cosmological Boltzmann solver and reached sub percent agreement with the reference CLASS implementation. The interesting part is not that the model wrote numerical code. The interesting part is the harness discipline around it: reference implementation, test oracles, persistent notes, git history, and quantifiable progress.

Scientific computing makes the verification problem unusually crisp. You can compare your solver to CLASS or CAMB. You can plot error over time. You can watch the agent get closer to a reference implementation. That gives the run a real gradient. Most coding tasks have weaker oracles, so you have to build them.

Long running agents magnify weak specs. A human can carry fuzzy intent across a week because humans have common sense, memory, and the ability to ask clarifying questions. An unattended agent will happily optimize the wrong proxy for hours. The more autonomy you grant, the more literal the state layer has to become.

A long running agent without verification is just a text generator with file permissions. Verification is what turns motion into progress. This is why end to end tests matter so much. Anthropic observed that Claude would often mark features complete after shallow checks. Once explicitly prompted to use browser automation and test as a human user would, performance improved. That matches my experience. Unit tests are useful, but they are often too close to the implementation. Browser tests force the agent to confront the product surface.

The right verification depends on the domain. The best verification is machine checkable and hard to game. The worst verification is asking the same model, in the same context, "are you sure?"

That does not mean model judges are useless. They are useful when they judge surfaced evidence against a narrow condition. Claude Code's docs are careful about this: the evaluator does not run commands or read files independently. It judges what Claude has surfaced in the conversation. So the completion condition has to include how the worker should prove it. The judge cannot save you from a vague goal. It can enforce a crisp one.

Single worker loops are enough for many tasks. But the moment you want to run hundreds of agents on one codebase for weeks, coordination becomes the whole game. Cursor's scaling agents post is useful because it talks about what failed. Their first approach let agents coordinate as peers through a shared file. Agents would check what others were doing, claim a task, update status, and use locks to prevent duplicate claims. This sounds reasonable. It is also exactly the kind of distributed system that gets weird fast.

The problem is not that agents cannot coordinate. The problem is that peer to peer coordination asks every worker to think about the global project while also doing local implementation. That is too much. Cursor moved toward a planner worker judge hierarchy. This is the same role separation again, just scaled out.

Workers should not coordinate with other workers if you can avoid it. They should receive a task with a bounded write scope, complete it, and report back. The planner should own the global dependency graph. The judge should decide whether the current state is good enough to continue, merge, or stop.

This has a strong human engineering analogue. You do not ask every engineer on a large project to constantly negotiate the whole roadmap with every other engineer. You create ownership boundaries. You run reviews. You integrate. You keep the shared state legible.
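Here is one way the planner side of that hierarchy might look, as a sketch under my own assumptions rather than Cursor's actual design. Tasks carry an explicit write scope, and workers never see the global graph.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of planner-to-worker delegation. The planner owns the
# dependency graph; each worker gets one task and a bounded write scope.

@dataclass
class Task:
    name: str
    depends_on: list[str]
    write_scope: list[str]          # paths this worker may touch
    done: bool = False

@dataclass
class Planner:
    tasks: dict[str, Task] = field(default_factory=dict)

    def ready(self) -> list[Task]:
        """Tasks whose dependencies are all complete."""
        return [t for t in self.tasks.values()
                if not t.done and all(self.tasks[d].done for d in t.depends_on)]

    def dispatch(self) -> Task | None:
        ready = self.ready()
        return ready[0] if ready else None  # hand exactly one task to a worker

# The worker never negotiates with other workers; it reports back, and the
# planner (plus a judge over the merged state) decides what happens next.
```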
The hard part is choosing the grain size. Cursor's product follow up, Expanding our long running agents research preview, says long running agents produced substantially larger PRs while keeping merge rates comparable to other agents. That is the product significance. The harness lets agents take on work that previously exceeded the practical size of a single agent session. But "larger PRs with comparable merge rates" is not magic model dust. It is the result of better state, better delegation, better judges, and better recovery.

Long running agents need a computer. That computer should be disposable. An agent that can run commands, install packages, edit files, open browsers, and call APIs is powerful enough to be useful and powerful enough to be dangerous. If you run it on your laptop with all your cookies, SSH keys, cloud credentials, and private files, the blast radius is ugly. The long running version makes this worse. A five minute agent can do damage. A five day agent can do creative damage.

So the production architecture increasingly separates durable harness state from disposable compute. OpenAI's Agents SDK update points in this direction: model native harnesses, sandbox execution, filesystem tools, memory, manifests, and state rehydration. The key idea is that the agent gets a controlled workspace with the files, tools, and dependencies it needs, while credentials and durable orchestration live outside the sandbox.

If the sandbox dies, the run should not die. The harness should rehydrate a fresh sandbox from the last checkpoint, mount the workspace, hand the worker the current state, and continue. This is the same principle again: state must outlive the worker.

Sandboxing also changes how you think about tools. In a local interactive agent, giving bash broad access is convenient. In a long running cloud agent, every tool is a capability grant. Network, filesystem, credentials, browser profile, package installation, deploy keys, issue tracker access, email access. Each one needs scope. The Ralph community guide makes this point bluntly: assume the agent environment will be popped at some point, then ask what the blast radius is. That is the right mental model.
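A sketch of that shape, assuming Docker as the disposable compute layer; the image name, paths, and agent entrypoint are hypothetical.

```python
import subprocess

# Sketch: durable state lives on the host (or in object storage); compute is
# a throwaway container. Killing the sandbox must not kill the run.

WORKSPACE = "/srv/runs/task-42/workspace"  # hypothetical path; survives the sandbox
IMAGE = "agent-sandbox:latest"             # hypothetical image with tools baked in

def run_fresh_sandbox() -> int:
    """Start a disposable sandbox mounted over the durable workspace."""
    return subprocess.run([
        "docker", "run", "--rm",
        "--network", "none",  # least privilege; a real run would allow a scoped egress proxy
        "-v", f"{WORKSPACE}:/workspace",
        IMAGE, "agent", "--resume-from", "/workspace/state",
    ]).returncode

# If a sandbox dies mid-run, the checkpoint on the host does not. Rehydrate
# a fresh one and continue until the run reports success.
for _ in range(100):
    if run_fresh_sandbox() == 0:
        break
```

Credentials never enter the container; anything the run needs to remember is written under the mounted workspace before the container exits.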
The best long running harnesses will feel boring operationally. Boring is good. Boring means the agent can be weird without the system becoming weird.

There are two product directions converging. The first is the practitioner loop: prompt files, plans, hooks, shell scripts, git commits. This is how power users run agents overnight today. It is messy, flexible, and close to the metal. The second is the productized loop: cloud agents, background tasks, research previews, SDK harnesses, managed sandboxes. This turns the same patterns into a UX that normal teams can use.

The underlying mechanics are more similar than they look. Claude Code's is basically a session scoped Ralph loop with a model judge. Cursor's long running agents are a cloud product built from planner worker judge orchestration. OpenAI's Agents SDK is standardizing the sandbox and filesystem substrate. Anthropic's harness posts are turning the workflow into repeatable environment design.

The abstraction is moving up the stack. In 2024, you wrote your own while loop. In 2025, you wrote prompt files and hooks. In 2026, the loop is becoming a product primitive. But the product primitive still has to answer the same questions. The UI can hide the loop. It cannot remove the harness.

Long running agents fail differently from short running agents. Short running agents fail by making a bad tool call, hallucinating an answer, editing the wrong file, or stopping too soon. Long running agents fail by accumulating drift. Each failure suggests a harness feature.

This is why long running agent engineering looks less like prompt hacking and more like operating a tiny software organization. You need task intake, planning, execution, QA, review, release, rollback, observability, and security. The agent is the worker. The harness is the company.

Here are the questions every long running agent system has to answer, and my current bias on each.

Fresh sessions beat giant sessions. A fresh context window that reads good state from disk is better than a stale context window carrying ten hours of tool output. Restarting is not giving up. Restarting is garbage collection.

The workspace is the memory bus. Plans, progress logs, feature lists, tests, screenshots, git commits, and benchmark outputs are not side effects. They are the continuity layer. If the next worker cannot understand the run from disk, the harness is broken.

Judges should be separate from workers. The worker can propose done. Something else should decide done. Ideally tests. Sometimes a model evaluator. Often both. The judge should inspect evidence, not vibes.

External verification matters more than longer reasoning. A mediocre plan with a strong oracle will often beat an elegant plan with no backpressure. The agent needs reality to push back.

Keep worker scope small. A long running system does not require each worker to do a long task. It requires the whole system to sustain progress across many bounded tasks.

Make state disposable and regenerable. Plans rot. Progress logs bloat. Specs change. A good harness can regenerate the plan from the current repo and goal. Treat planning artifacts as useful scaffolding, not sacred truth.

Sandbox by default. Long running agents should assume hostile inputs, accidental exfiltration, bad generated code, and runaway loops. Least privilege is not paranoia. It is table stakes.

The human's job moves up a level. You stop micromanaging tool calls and start designing the environment: better specs, better evals, better prompts, better ownership boundaries, better recovery points.

That last point is the real mindset shift. When code was scarce, the human wrote code. When code became cheap, the human reviewed code. When agents became persistent, the human designs the system in which code keeps getting written after they leave. OpenAI calls this harness engineering, and I think that phrase is going to stick. Harness engineering is the work around the model that makes the model useful over time.

This is different from traditional software engineering. You are not only writing deterministic code paths. You are designing an environment that a non deterministic worker can repeatedly enter, understand, act inside, and leave in a better state. That is why the best long running agent harnesses feel weirdly old fashioned. Git. Markdown. Shell scripts. JSON checklists. Test suites. Logs. Small commits. Clear ownership. These are not legacy habits. They are the primitives that survive context death.

The future of long running agents is not one immortal session thinking forever. It is many mortal sessions, each with a clean context window, waking up inside a workspace that remembers.

So back to the original question: what does it take for an agent to keep working after you leave? Not a bigger prompt. Not just a better model. A durable state layer. A crisp goal. A fresh worker loop. A judge that is not the worker. Tests that push back. Git history that tells the story. Sandboxes that can die without killing the run. Logs that let the human tune the system when it fails.

The model is the engine. The harness is the vehicle. And the companies that get this right will not merely have "agents that run longer." They will have agents that can be trusted with larger units of work because the work is recoverable, inspectable, and verifiable. That is the threshold that matters. Not autonomy as theater. Autonomy with a receipt.
Why Long Sessions Fail - Context windows rot, agents declare victory early, and half finished work becomes invisible
The Architecture That Won - Fresh worker sessions plus durable workspace artifacts
The Ralph Loop - Why a dumb restart loop beats a single heroic conversation
Initializer, Worker, Judge - The three roles that keep showing up
State Outside the Model - Feature lists, progress logs, plans, git history, tests, and notes
Verification As Backpressure - Why test oracles matter more than better pep talks
Multi Agent Coordination - Why peer to peer locks break and planner worker hierarchies survive
Sandboxing and Rehydration - Why long running execution needs disposable compute and durable state
What This Means For Agent Design - The checklist every long running harness has to answer: Where does state live? What does a new worker read first? How does it choose work? How does it prove progress? Who decides it is done? How do you recover from a bad turn? What happens when the sandbox dies? What is the budget? What is the blast radius?


How I lost a database and learned to actually use AI

I ran AI-generated SQL without reading it properly and lost a database. The experience changed how I work with AI tools, replacing freeform chat sessions with a structured process built around PRDs, small tasks, and frequent commits.

matklad Yesterday

Learning Software Architecture

In reply to an email asking about learning software design skills as a research physicist:

I was attached to a bioinformatics lab early in my career, so I think I understand what you are talking about, the phenomenon of “scientific code”! My thoughts:

First meta observation is that “software design” is something best learned by doing. While I had some formal “design” courses at the University, and I was even “an architect” for our course project, that stuff was mostly make-believe, kindergarteners playing fire-fighters. What really taught me how to do stuff was an accident of my career, where my second real project ( IntelliJ Rust ) propelled me to a position of software leadership, and made design my problem. I did make a few mistakes in IJ Rust, but nothing too horrible, and I learned a lot. So that’s good news — software engineering is simple enough that an inquisitive mind can figure it out from first principles (and reading random blog posts).

Second meta observation, the bad news: Conway’s law is important. Softwaregenesis repeats the social architecture of the organization producing software. Or, as put eloquently by neugierig:

If I were to summarize what I learned in a single sentence, it would be this: we talk about programming like it is about writing code, but the code ends up being less important than the architecture, and the architecture ends up being less important than social issues.

I suspect that the difference you perceive between industrial and scientific software is not so much about software-building knowledge, but rather about the field of incentives that compels people to produce the software. Something like “my PhD needs to publish a paper in three months” is perhaps a significant explainer?

Two things you can do here. One, at times you get a chance to design or nudge an incentive structure for a project. This happens once in a blue moon, but is very impactful. This is the secret sauce behind TIGER_STYLE, not the set of rules per se, but the social context that makes this set of rules a good idea. Two, you can speedrun the four stages of grief to acceptance. Incentive structure is almost never what you want it to be, but, if you can’t change it, you can adapt to it. This is also true about most industrial software projects — there’s never a time to do a thing properly, you must do the best you can, given constraints.

Let me use rust-analyzer as an example. The physical reality of the project is that it’s simultaneously very deep (it’s a compiler! Yay!) and very wide (opposite to an LLM, a classical IDE is a lot of purpose-built special features). The social reality is that “deep compiler” can attract a few brilliant dedicated contributors, and that the “breadth features” can be a good fit for an army of weekend warriors, people who learn Rust, who don’t have sustained capacity to participate in the project, but who can sink an hour or two to scratch their own itch.

My insistence that doesn’t require building , that it builds on stable, that it doesn’t have any C dependencies, and that the entire test suite takes seconds, was in the service of the goal of attracting high-impact contributors. I was wrangling the build system to make sure people can work on the borrow checker without thinking about anything else. To attract weekend warriors, the internals of rust-analyzer are split into multiple independent features, where each feature is guarded by at runtime.
The thinking was that I explicitly don’t want to care too much about quality there, that the bar for getting a feature PR in is “happy path works & tested”. It’s fine if the code crashes, it will only attract further contributors, provided that: the quality is isolated to a feature, and doesn’t spill over; and, at runtime, the crash is invisible to the user (it’s crucial that rust-analyzer features work with an immutable snapshot, and can’t poison the data). In contrast, when working on the core spine which provided support for features, I was relatively more pedantic about quality.

A word of caution about adapting to, rather than fixing, incentive structure — the future is uncertain, and tends to happen in the least convenient manner. The original motivation behind the rust-analyzer experiment was to avoid the need to write a parallel compiler (the one in IntelliJ Rust), and to prototype a better architecture for LSP, so that the learnings could be backported to . So, even in core (especially in core), the code was very experimental. Oh well. Stuck with one more compiler now, I guess? I might hazard a guess that something similar happened to the uutils project, which started as the primary destination for people learning Rust, and ended up as Ubuntu’s coreutils implementation.

Third, now to some concrete recommendations. Sadly, I don’t know of a single book I can recommend which contains the truths. I suspect one can only find such a book in an apocryphal short story by Borges: practice seems to be an essential element here. But here are some things worth paying attention to:

Boundaries talk by Gary Bernhardt is an all-time favorite. It contains solid object-level advice, and, for me, it triggered the meta inquiry.

How to Test is something I wish I had. I immediately understood the importance of testing, but it took me a long time to grow arrogant enough to admit that most widely-cited testing advice is shamanistic snake-oil, and to conceptualize what actually works.

ØMQ guide and, more generally, writings by Pieter Hintjens introduced me to Conway’s Law thinking. That “feature development” architecture of rust-analyzer? – optimistic merging, applied.

Reflections on a decade of coding by Jamii is excellent, goes very meta. It is intentionally the first of my links.

Ted Kaminski’s blog is the closest there is to a coherent theory of software development, appropriately framed as a set of notes to a non-existing book!

As for the actual books, Software Engineering at Google and Ousterhout’s The Philosophy of Software Design are often recommended. They are good. SWE, in particular, helped me with a couple of important names. But they weren’t ground breaking for me.

Sean Goedecke Yesterday

Thinking Machines and interaction models

Thinking Machines just released Interaction Models . This is their first real AI model release [1] after a year of work and two billion dollars of capital. What is an “interaction model”? First, it’s not a frontier model . Thinking Machines is not yet competing with OpenAI, Anthropic and Google. Instead, they’re working on the problem of better real-time interaction with models. Some parts of what they’re doing are not new at all, other parts are slightly-questionable benchmark gaming, and still other parts represent a genuine technological advancement. I’ll try to lay it all out.

If you’ve used ChatGPT in audio mode, you know that you can’t talk to it exactly how you’d talk to a human. There’s a big latency gap between when you finish talking and when the model jumps in. The model won’t interrupt you like a human, and doesn’t react to you interrupting it like a human would either. And of course you can’t give the model visual feedback like facial expressions. That’s because ChatGPT is either speaking or listening at any given time . When you’re talking, it’s in “listening” mode; when it’s talking, it’s in “speaking” mode, and isn’t absorbing any information from you. It relies on VAD (“voice activity detection”) to figure out if you’re talking.

The alternative (and what “interaction models” do) is a fully-duplex system, where the model is constantly both in listening and speaking mode at the same time. Of course, the model can’t literally do this. Like all language models, it’s either doing prefill (ingesting prompt tokens) or decode (producing completion tokens). But what fully-duplex models can do is switch from listening to speaking mode in tiny chunks, called “micro-turns”. Instead of listening for ten seconds (or however long it takes you to stop talking), then speaking for ten seconds (or however long it takes to pass the model output through TTS), the model can listen for 200ms, then output for 200ms, then listen for 200ms, and so on. While the user is speaking, the model will know to output silence - most of the time. But if it decides it’s good to interrupt you or speak at the same time as you, it’s capable of doing that.

So far, so unoriginal. There are plenty of examples of fully duplex audio systems that the Thinking Machines blog post already cites: Moshi , PersonaPlex , Nemotron-VoiceChat , and so on. But at least this outlines the space that “interaction models” are playing in: not “superintelligence from a frontier model”, but “better real-time conversational interaction” [2].

Given that, what is Thinking Machines doing that’s new? For existing fully-duplex models, you talk to the model itself. That’s a fairly big problem, since fully-duplex models have to be fast: fast enough that they can operate in tiny 200ms turns [3]. A model that fast cannot be particularly intelligent. Thinking Machines’ solution is to introduce an actual smart model - any regular language model will do here - in the background that the interaction model can delegate tasks to. In practice this is probably implemented as a tool call. The interaction model keeps chatting while the smart model works away, and then the smart model output is directly integrated into the interaction model’s context in the same way as audio and video input (a genuinely cool idea, I think).

This is kind of neat, though it remains to be seen how well it works in practice. Will the model do a lot of “oh wait, the last thing I said was dumb, never mind” self-correction as the smarter model output trickles in? Will the fast interaction model be smart enough to delegate the right tasks at the right time?
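To gesture at the mechanics, here is a toy sketch of a micro-turn loop with background delegation. The `fast_model` and `slow_model` stubs are hypothetical stand-ins of mine, not Thinking Machines’ API.

```python
import asyncio

# Toy sketch of a fully-duplex micro-turn loop with background delegation.
# `fast_model` and `slow_model` are hypothetical stubs, not a real API.

async def slow_model(task: str) -> str:
    await asyncio.sleep(2.0)                  # a real reasoning model is slow
    return f"[reasoned answer to: {task}]"

def fast_model(audio_chunk: bytes, context: list[str]) -> tuple[bytes, str | None]:
    """Return ~200ms of output audio (often silence) and an optional task
    to hand off to the smart model."""
    return b"\x00" * 3200, None               # placeholder: emit silence

async def interaction_loop(mic_chunks, speaker) -> None:
    context: list[str] = []
    pending: set[asyncio.Task] = set()
    async for chunk in mic_chunks:            # one micro-turn per audio chunk
        # Finished background work is spliced into context like any other input.
        done = {t for t in pending if t.done()}
        for t in done:
            context.append(t.result())
        pending -= done
        out, handoff = fast_model(chunk, context)
        if handoff:                           # delegate and keep talking
            pending.add(asyncio.create_task(slow_model(handoff)))
        await speaker(out)
```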
In general, the “start with a fast dumb model and have it hand off tasks” approach has been tricky for the AI labs to get right for a variety of reasons. If I’m being uncharitable, I might say that bolting on a strong reasoning model was an easy way for Thinking Machines to post impressive values for competitive benchmarks like FD-bench V3 (where they barely beat GPT-realtime-2.0) and BigBench Audio (where introducing the reasoning model bumps their score from 76% to 96%, only 0.1% below GPT-realtime-2.0). If I’m being charitable, I might say that a model fast enough for realtime conversation will have to have some way to punt hard tasks to a slower, smarter model. Both of those things are probably true.

It’s also worth noting that Thinking Machines have bolted video input onto their fully-duplex model. This is more exciting than it sounds, because face-to-face human conversation is very dependent on being able to read human expressions. In theory, this could unlock the ability to have genuine human-like conversations. The other reason why this is exciting is that it means Thinking Machines have been able to make a pretty big fully-duplex model (maybe twice the size of Moshi in terms of active parameters, and 40x the size in terms of total parameters). In fact, this is probably the biggest real technical achievement here. Other fully-duplex models are already doing micro-turns and interruptions, and could delegate reasoning fairly easily if they wanted to, but they aren’t doing video because they can’t . Being able to make a fully-duplex model the size of DeepSeek V4-Flash is pretty impressive. Much of the Thinking Machines blog post is dedicated to explaining how they’ve managed to do this: ingesting data in a more lightweight way, optimizing their inference libraries for tiny prefill/decode chunks, various decisions to make inference deterministic (a long-held hobbyhorse for Thinking Machines).

There’s a lot of pressure on Thinking Machines to produce a genuine AI advancement. It doesn’t seem like they’re willing or able to compete in the frontier-model space (which makes sense, I wouldn’t want to either). Given that, I can see why they’re highlighting the parts of interaction models that are impressive to laypeople - all the fully-duplex interaction stuff - even though those parts are not truly innovative.

So what are Interaction Models? A scaled-up, multimodal version of existing fully-duplex models like Moshi, with a real model bolted on for extra intelligence (and maybe better benchmarks). The scale and video parts are new and cool, and something like the overall approach has to be right. In general, I’m glad that we’ve got well-funded and high-profile AI labs tackling problems other than “build a smarter frontier model”. I think there’s a lot of low-hanging fruit waiting to be picked in other areas of AI research.

[1] People do seem to really like Tinker , which is their tooling for researchers who want to fine-tune models, but it’s not exactly the hot new frontier model that people were expecting.
[2] I think it’s at least a little shady that the Interaction Models video demo is making a big deal about some features (like real-time simultaneous translation) that are just features of fully-duplex audio models, not anything specific to their system.
[3] Even 200ms is a bit long. You can see from the demo that there’s an uncomfortable half-second lag sometimes as the model finishes its prefill slice and has to move to the decode slice.


Regatta Starting Stations – Chi-squared Continued

In the Henley Royal Regatta, two teams at a time propel their boats up a river and compete to be the first across a fixed distance. Teams get assigned to their starting stations – Berkshire or Buckinghamshire – at random. From there, it is a straight shot up the river, with the lane from each starting station being seemingly identical. I didn’t know any of this, but a reader reached out some time ago because they had noticed something odd about this, and they wanted to borrow me as a sounding board.

Here’s the odd thing: the team that starts from the Berkshire station has won 53.5% of the 7555 races in the historic data this reader looked at. This is highly unexpected. If teams are assigned at random, and the starting stations are practically equal, then the starting station of the winning team should be a coin flip. If we flip 7555 coins, we would essentially never see as many as 53.5% come up heads. (Continue reading the full article on the web.)
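The arithmetic backs that up. A quick sketch (mine, not the article’s code) of the usual normal-approximation check:

```python
from math import erfc, sqrt

# Null hypothesis: the starting station of the winner is a fair coin flip.
n = 7555
observed_share = 0.535
observed_wins = round(n * observed_share)     # ~4042 Berkshire wins
expected = n / 2

# Normal approximation to the binomial (fine at this n).
z = (observed_wins - expected) / sqrt(n * 0.25)
p_two_sided = erfc(abs(z) / sqrt(2))

# The equivalent one-degree-of-freedom chi-squared statistic is z**2.
print(f"z = {z:.2f}, chi2 = {z*z:.1f}, p = {p_two_sided:.1e}")
# -> z is about 6.1, p is about 1e-9: a 53.5% share over 7555 races is wildly
#    improbable under random assignment to identical lanes.
```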


Meet People Where They're At

There's a shopping center I sometimes walk to for lunch. It's been there long enough that it doesn't have a sidewalk (before city ordinances required sidewalks, I imagine). A few years ago, a mixed-use complex was built next to it, complete with a sidewalk that ended right at the boundary of the old plaza. This new sidewalk has resulted in a path of trampled grass as people (like myself) walk to the restaurants in the old plaza. Today on my way to get some "Italian food" (it's America, nothing is authentic here), I was greeted with a new gravel path at the end of the sidewalk. The path had been placed to line up with the curve of dead grass and perfectly connected both plazas.

(I didn't have a camera on me, so enjoy this detailed sketch done on my Palm Pilot.)

It seems like a small thing, but it surprised me. Just a week ago I remember wondering to myself how long it would be until a "stay off the grass" sign appeared. Instead, I was treated to a rare instance of people's needs being directly addressed. It reminded me of a similar story around Ohio State University (the university in my city). The sidewalks built across the campus green were made to follow the paths students trekked in the early days of the campus. A similar method, named Sneckdown, is used to determine where traffic calming measures are needed based on snow that has not been touched by traffic.

I wish this was more common: identifying pain points and improving the situation. Instead, we spend hours in meetings figuring out how to fight people's goals because what they want isn't "sticky enough" or doesn't "meet business goals".

Unsung Yesterday

“Nothing short of a magic trick.”

A fascinating 25-minute video from Mark Brown at Game Maker’s Toolkit about how the team building Grand Theft Auto 3 conquered the technical limitations of PlayStation 2. How do you squeeze a city that occupies over 50 megabytes into the 32MB memory of the console? You simply do what The Truman Show did , and construct the city around the player as they’re moving around . This has, as you can expect, a lot of technical and even game-design consequences, and the video goes into a lot of detail on these – including Brown rebuilding the Grand Theft Auto 3 source to visualize things better.

This technique is also used in interface design, for example if you have a really long list of things that would take too much memory or GPU power to render. What the video calls “streaming” is, in the context of UI, often called “virtualization”: instead of having a full long list (or an entire world), you abstract it away – or, virtualize – into something nimbler. Some of the challenges and techniques used by Grand Theft Auto 3 apply pretty directly here, as well: you can use UI skeletons as “low poly” models, and in some contexts, you can guess the user is more likely to move in one direction (for example, going through fonts in a font picker), and more eagerly preload where they’re going to look next, rather than symmetrically in both directions.

On the other hand, “speedy players” and “pop in” can’t ever be solved because any UI list is random access, and slowing users down is not typically appropriate; better to make loading as pleasant as possible than introduce any roadblocks, even if figurative ones.

#definitions #games #performance #youtube
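For the UI case, the core of virtualization is a few lines of windowing math. A minimal sketch, assuming fixed-height rows (the names and numbers here are illustrative):

```python
# Minimal sketch of list virtualization with fixed-height rows: materialize
# only the slice of items that can be visible, plus a little overscan
# "around the player".

def visible_range(scroll_top: float, viewport_h: float, row_h: float,
                  total: int, overscan: int = 5) -> range:
    first = max(0, int(scroll_top // row_h) - overscan)
    last = min(total, int((scroll_top + viewport_h) // row_h) + 1 + overscan)
    return range(first, last)

# 10,000 rows, but only ~25 are ever created at once.
items = [f"row {i}" for i in range(10_000)]
for i in visible_range(scroll_top=4_200, viewport_h=600, row_h=40, total=len(items)):
    pass  # create or recycle just these DOM nodes (or meshes, in GTA's case)
```

Asymmetric preloading is then just a matter of extending the overscan in the direction of travel.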


Installing JPilot on Arch

This post is a quick tip for anyone else running into issues installing the Palm Pilot desktop software, JPilot, on Arch Linux. If you just try installing via , the build will fail, as the dependency no longer builds on modern systems. The solution is to first install , then .
