Posts in Bash (20 found)
matklad 3 weeks ago

CI In a Box

I wrote a thin wrapper around ssh for running commands on remote machines. I want a box-shaped interface for CI: the controlling CI machine runs a user-supplied script, whose status code will be the ultimate result of a CI run. The script doesn’t run the project’s tests directly. Instead, it shells out to a proxy binary that forwards the command to a runner box with whichever OS, CPU, and other environment is required. CI discourse amuses me — everyone complains about bad YAML, and it is bad, but most of the YAML (and the associated reproducibility and debugging problems) is avoidable. Pick an appropriate position on a dial that includes: writing a bash script, writing a script in the language you already use, using a small build system, using a medium-sized one, or using a large one. What you can’t just do by writing a smidgen of text is getting the heterogeneous fleet of runners. And you need a heterogeneous fleet of runners if some of the software you are building is cross-platform: one of them is not UNIX, one of them has licensing & hardware constraints that make per-minute billed VMs tricky (but not impossible, as GitHub Actions does that), and all of them are moving targets that require someone to do the OS upgrade work, which might involve pointing and clicking. The hard problems are in the proxying part. If you go that way, be mindful that the SSH wire protocol only takes a single string as the command, with the expectation that it will be passed to a shell by the remote end. In other words, while SSH supports multi-argument syntax, it just blindly intersperses all arguments with a space. Amusing to think that our entire cloud infrastructure is built on top of shell injection! This, and the need to ensure no processes are left behind unintentionally after executing a remote command, means that you can’t “just” use SSH here if you are building something solid.
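The single-string problem is easy to demonstrate and to work around: a careful wrapper has to quote each argument itself before joining them into the one command string SSH will transmit. A minimal POSIX sketch (an illustration, not the actual tool's implementation):

```shell
# SSH sends one command string; the remote shell re-parses it.
# So a wrapper must quote every argument before joining them.

# Wrap a single argument in single quotes, escaping embedded quotes.
shquote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Join an argv into one string that is safe to hand to `ssh host "$cmd"`.
build_cmd() {
  out=
  for arg in "$@"; do
    out="$out $(shquote "$arg")"
  done
  printf '%s' "${out# }"
}

build_cmd echo 'hello; rm -rf /tmp/x'   # the semicolon stays inert
```

Anything `build_cmd` produces can then be passed as the single command string, e.g. `ssh runner "$(build_cmd ./run-tests --target 'a b')"` (host name is a placeholder), and the remote shell reconstructs the original argv instead of executing the injected part.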

0 views

Date Arithmetic in Bash

Date and time management libraries in many programming languages are famously bad. Python's datetime module comes to mind as one of the best (worst?) examples, and so does JavaScript's Date class. It feels like these libraries could not have been made worse on purpose, or so I thought until today, when I needed to implement some date calculations in a backup rotation script written in bash. So, if you wanted to learn how to perform date and time arithmetic in your bash scripts, you've come to the right place. Just don't blame me for the nightmares.

0 views
Armin Ronacher 1 month ago

Pi: The Minimal Agent Within OpenClaw

If you haven’t been living under a rock, you will have noticed this week that a project of my friend Peter went viral on the internet. It went by many names. The most recent one is OpenClaw, but in the news you might have encountered it as ClawdBot or MoltBot depending on when you read about it. It is an agent connected to a communication channel of your choice that just runs code. What you might be less familiar with is that what’s under the hood of OpenClaw is a little coding agent called Pi. And Pi happens to be, at this point, the coding agent that I use almost exclusively. Over the last few weeks I have become more and more of a shill for the little agent. After I gave a talk on this recently, I realized that I had not actually written about Pi on this blog yet, so I want to give some context on why I’m obsessed with it, and how it relates to OpenClaw. Pi is written by Mario Zechner and, unlike Peter, who aims for “sci-fi with a touch of madness,” 1 Mario is very grounded. Despite the differences in approach, both OpenClaw and Pi follow the same idea: LLMs are really good at writing and running code, so embrace this. In some ways I think that’s not an accident, because Peter got me and Mario hooked on this idea, and on agents, last year. So Pi is a coding agent. And there are many coding agents. Really, I think you can pick any one of them off the shelf at this point and you will be able to experience what it’s like to do agentic programming. In reviews on this blog I’ve talked positively about AMP, and one of the reasons it resonated so much with me is that it felt like a product built by people who not only got addicted to agentic programming but also tried a few different things to see which ones work, rather than just building a fancy UI around it. Pi is interesting to me for two main reasons, which I’ll get to below, plus a little bonus: Pi itself is written like excellent software.
It doesn’t flicker, it doesn’t consume a lot of memory, it doesn’t randomly break, it is very reliable, and it is written by someone who takes great care with what goes into the software. Pi is also a collection of little components that you can build your own agent on top of. That’s how OpenClaw is built, and that’s also how I built my own little Telegram bot and how Mario built his mom. If you want to build your own agent connected to something, Pi, when pointed at itself and mom, will conjure one up for you. And in order to understand what’s in Pi, it’s even more important to understand what’s not in Pi, why it’s not in Pi and, more importantly, why it won’t be in Pi. The most obvious omission is support for MCP. There is no MCP support in it. While you could build an extension for it, you can also do what OpenClaw does to support MCP, which is to use mcporter. mcporter exposes MCP calls via a CLI interface or TypeScript bindings and maybe your agent can do something with it. Or not, I don’t know :) And this is not a lazy omission. It follows from the philosophy of how Pi works. Pi’s entire idea is that if you want the agent to do something that it doesn’t do yet, you don’t go and download an extension or a skill or something like this. You ask the agent to extend itself. It celebrates the idea of code writing and running code. That’s not to say that you cannot download extensions. That is very much supported. But instead of downloading someone else’s extension, you can also point your agent at an existing one and say: build something like the thing you see over there, but with these changes. When you look at what Pi, and by extension OpenClaw, are doing, you see an example of software that is malleable like clay. And this sets certain requirements for the underlying architecture, constraints that really need to go into the core design.
So for instance, Pi’s underlying AI SDK is written so that a session can contain many different messages from many different model providers. It recognizes that the portability of sessions between model providers is somewhat limited, and so it doesn’t lean too much into any model-provider-specific feature set that cannot be transferred to another. The second is that, in addition to the model messages, it maintains custom messages in the session files which can be used by extensions to store state, or by the system itself to maintain information that is either not sent to the AI at all or only in part. Because this system exists and extension state can also be persisted to disk, it has built-in hot reloading so that the agent can write code, reload, test it and go in a loop until your extension actually works. It also ships with documentation and examples that the agent itself can use to extend itself. Even better: sessions in Pi are trees. You can branch and navigate within a session, which opens up all kinds of interesting opportunities, such as making a side-quest to fix a broken agent tool without wasting context in the main session. After the tool is fixed, I can rewind the session back to an earlier point and Pi summarizes what has happened on the other branch. This all matters because, if you consider how MCP works, on most model providers tools for MCP, like any tool for the LLM, need to be loaded into the system context (or the tool section thereof) on session start. That makes it very hard, if not impossible, to fully reload what tools can do without trashing the entire cache or confusing the AI about how prior invocations worked differently. An extension in Pi can register a tool to be available for the LLM to call, and every once in a while I find this useful. For instance, despite my criticism of how Beads is implemented, I do think that giving an agent access to a to-do list is a very useful thing.
And I do use an agent-specific issue tracker that works locally that I had my agent build itself. And because I wanted the agent to also manage to-dos, in this particular case I decided to give it a tool rather than a CLI. It felt appropriate for the scope of the problem, and it is currently the only additional tool that I’m loading into my context. But for the most part, all of what I’m adding to my agent are either skills or TUI extensions to make working with the agent more enjoyable for me. Beyond slash commands, Pi extensions can render custom TUI components directly in the terminal: spinners, progress bars, interactive file pickers, data tables, preview panes. The TUI is flexible enough that Mario proved you can run Doom in it. Not practical, but if you can run Doom, you can certainly build a useful dashboard or debugging interface. I want to highlight some of my extensions to give you an idea of what’s possible. While you can use them unmodified, the whole idea really is that you point your agent at one and remix it to your heart’s content. I don’t use plan mode. I encourage the agent to ask questions and there’s a productive back and forth. But I don’t like the structured question dialogs that happen if you give the agent a question tool. I prefer the agent’s natural prose with explanations and diagrams interspersed. The problem: answering questions inline gets messy. So the extension reads the agent’s last response, extracts all the questions, and reformats them into a nice input box. Even though I criticize Beads for its implementation, giving an agent a to-do list is genuinely useful. The command brings up all items, stored as markdown files. Both the agent and I can manipulate them, and sessions can claim tasks to mark them as in progress. As more code is written by agents, it makes little sense to throw unfinished work at humans before an agent has reviewed it first.
Because Pi sessions are trees, I can branch into a fresh review context, get findings, then bring fixes back to the main session. The UI is modeled after Codex, which makes it easy to review commits, diffs, uncommitted changes, or remote PRs. The prompt pays attention to things I care about so I get the call-outs I want (e.g. I ask it to call out newly added dependencies). An extension I experiment with but don’t actively use lets one Pi agent send prompts to another. It is a simple multi-agent system without complex orchestration, which is useful for experimentation. Another extension lists all files changed or referenced in the session. You can reveal them in Finder, diff them in VS Code, quick-look them, or reference them in your prompt. One command quick-looks the most recently mentioned file, which is handy when the agent produces a PDF. Others have built extensions too: Nico’s subagent extension and interactive-shell, which lets Pi autonomously run interactive CLIs in an observable TUI overlay. These are all just ideas of what you can do with your agent. The point, mostly, is that none of this was written by me; it was created by the agent to my specifications. I told Pi to make an extension and it did. There is no MCP, there are no community skills, nothing. Don’t get me wrong, I use tons of skills. But they are hand-crafted by my clanker and not downloaded from anywhere. For instance, I fully replaced all my CLIs and MCPs for browser automation with a skill that just uses CDP. Not because the alternatives don’t work, or are bad, but because this is just easy and natural. The agent maintains its own functionality. My agent has quite a few skills and, crucially, I throw skills away when I don’t need them anymore. I for instance gave it a skill to read Pi sessions that other engineers shared, which helps with code review. And I have a skill to help the agent craft the commit messages and commit behavior I want, and to update changelogs.
These were originally slash commands, but I’m currently migrating them to skills to see if this works equally well. I also have a skill that nudges Pi toward the command-line tools I prefer, plus a custom extension that intercepts calls to the ones I don’t want and redirects them. Part of the fascination of working with a minimal agent like Pi is that it makes you live the idea of using software that builds more software. Taken to the extreme, you remove the UI and output and connect it to your chat. That’s what OpenClaw does, and given its tremendous growth, I really feel more and more that this is going to become our future in one way or another. As for the two main reasons Pi is interesting to me: first of all, it has a tiny core. It has the shortest system prompt of any agent that I’m aware of and it only has four tools: Read, Write, Edit, Bash. The second thing is that it makes up for its tiny core by providing an extension system that also allows extensions to persist state into sessions, which is incredibly powerful. https://x.com/steipete/status/2017313990548865292 ↩

0 views
Simon Willison 1 month ago

ChatGPT Containers can now run bash, pip/npm install packages, and download files

One of my favourite features of ChatGPT is its ability to write and execute code in a container. This feature launched as ChatGPT Code Interpreter nearly three years ago, was half-heartedly rebranded to "Advanced Data Analysis" at some point, and is generally really difficult to find detailed documentation about. Case in point: it appears to have had a massive upgrade at some point in the past few months, and I can't find documentation about the new capabilities anywhere! Here are the most notable new features: This is a substantial upgrade! ChatGPT can now write and then test code in 10 new languages (11 if you count Bash), can find files online and download them into the container, and can install additional packages via pip and npm to help it solve problems. (OpenAI really need to develop better habits at keeping their release notes up-to-date!) I was initially suspicious that maybe I'd stumbled into a new preview feature that wasn't available to everyone, but I tried some experiments in a free ChatGPT account and confirmed that the new features are available there as well. My first clue to the new features came the other day when I got curious about Los Angeles air quality, in particular whether the growing number of electric vehicles there has had a measurable impact. I prompted a fresh GPT-5.2 Thinking session with: Show me Los Angeles air quality over time for last 20 years Here's the shared transcript. Annoyingly that doesn't make the thinking trace available, so here's a screenshot: This bit instantly caught my eye: [...] The URL from the error message leads to an Excel file containing air quality trends. I can use the container.download tool to fetch it. [...] What's container.download? I asked ChatGPT: Without running any searches tell me about your container.download tool And got back: Sure. container.download is a built-in tool I can call that fetches a file from a URL and saves it into the sandboxed container filesystem for this chat session.
It’s called like this (conceptually): Further experimentation confirmed that container.download behaves as described. I had it download one file from a server with logs enabled and captured the IP address and user-agent. Here are the most interesting headers: That IP address resolves to Microsoft Azure Cloud (centralus) in Des Moines, Iowa. On the one hand, this is really useful! ChatGPT can navigate around websites looking for useful files, download those files to a container and then process them using Python or other languages. Is this a data exfiltration vulnerability though? Could a prompt injection attack trick ChatGPT into leaking private data out via a call to a URL with a query string that includes sensitive information? I don't think it can. I tried getting it to assemble a URL with a query string and access it using container.download and it couldn't do it. It told me that it got back this error: ERROR: download failed because url not viewed in conversation before. open the file or url using web.run first. This looks to me like the same safety trick used by Claude's Web Fetch tool: only allow URL access if that URL was either directly entered by the user or if it came from search results that could not have been influenced by a prompt injection. (I poked at this a bit more and managed to get a simple constructed query string to pass through via a different tool entirely, but when I tried to compose a longer query string containing the previous prompt history a filter blocked it.) So I think this is all safe, though I'm curious whether it could hold firm against a more aggressive round of attacks from a seasoned security researcher. The key lesson from coding agents like Claude Code and Codex CLI is that Bash rules everything: if an agent can run Bash commands in an environment it can do almost anything that can be achieved by typing commands into a computer. When Anthropic added their own code interpreter feature to Claude last September they built that around Bash rather than just Python.
It looks to me like OpenAI have now done the same thing for ChatGPT. Here's what ChatGPT looks like when it runs a Bash command - here my prompt was: npm install a fun package and demonstrate using it It's useful to click on the "Thinking" or "Thought for 32s" links as that opens the Activity sidebar with a detailed trace of what ChatGPT did to arrive at its answer. This helps guard against cheating - ChatGPT might claim to have run Bash in the main window but it can't fake those black and white logs in the Activity panel. I had it run Hello World in various languages later in that same session. In the previous example ChatGPT installed the package from npm and used it to draw an ASCII-art cow. But how could it do that if the container can't make outbound network requests? In another session I challenged it to explore its environment and figure out how that worked. Here's the resulting Markdown report it created. The key magic appears to be a proxy, available within the container and with various packaging tools configured to use it. The following environment variables cause Python package installs to come from that proxy instead of directly from PyPI: A similar one appears to get npm to work: And it reported these suspicious looking variables as well: Neither Rust nor Docker is installed in the container environment, but maybe those registry references are a clue of features still to come. The result of all of this? You can tell ChatGPT to use Python or Node.js packages as part of a conversation and it will be able to install them and apply them against files you upload or that it downloads from the public web. That's really cool. The big missing feature here should be the easiest to provide: we need official documentation! A release notes entry would be a good start, but there are a lot of subtle details to how this new stuff works, its limitations and what it can be used for.
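The report's exact variable names aren't reproduced in this excerpt. As a hedged illustration of the mechanism, these are the standard environment variables pip and npm honor for redirecting installs through a proxy; the URLs are placeholders, not the actual ChatGPT container configuration:

```shell
# Illustrative values only; not ChatGPT's real proxy endpoints.
export PIP_INDEX_URL='http://proxy.internal:8080/pypi/simple'  # pip: alternate package index
export NPM_CONFIG_REGISTRY='http://proxy.internal:8080/npm/'   # npm: alternate registry
export HTTPS_PROXY='http://proxy.internal:8080'                # generic proxy many tools honor

# With variables like these set, `pip install` and `npm install`
# never talk to PyPI or npmjs.org directly; everything goes
# through the proxy, even in a container with no outbound network.
```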
As always, I'd also encourage OpenAI to come up with a name for this set of features that properly represents how it works and what it can do. In the meantime, I'm going to call this ChatGPT Containers. For the record, the most notable new features: ChatGPT can directly run Bash commands now (previously it was limited to Python code, although it could shell out from Python). It has Node.js and can run JavaScript directly in addition to Python; I also got it to run "hello world" in Ruby, Perl, PHP, Go, Java, Swift, Kotlin, C and C++. No Rust yet though! While the container still can't make outbound network requests, pip install and npm install both work now via a custom proxy mechanism. And ChatGPT can locate the URL for a file on the web and use the new container.download tool to download that file and save it to a path within the sandboxed container. As ChatGPT itself described that tool: it takes a publicly reachable URL and a destination filepath in the container, downloads the bytes from that URL and writes them to the given path; after that, it can read/process the file locally in the container (e.g., unzip it, parse it with Python, open it as an image, convert it, etc.).

0 views

Lessons from Building AI Agents for Financial Services

I’ve spent the last two years building AI agents for financial services. Along the way, I’ve accumulated a fair number of battle scars and learnings that I want to share. Here’s what I’ll cover:

- The Sandbox Is Not Optional: Why isolated execution environments are essential for multi-step agent workflows
- Context Is the Product: How we normalize heterogeneous financial data into clean, searchable context
- The Parsing Problem: The hidden complexity of extracting structured data from adversarial SEC filings
- Skills Are Everything: Why markdown-based skills are becoming the product, not the model
- The Model Will Eat Your Scaffolding: Designing for obsolescence as models improve
- The S3-First Architecture: Why S3 beats databases for file storage and user data
- The File System Tools: How ReadFile, WriteFile, and Bash enable complex financial workflows
- Temporal Changed Everything: Reliable long-running tasks with proper cancellation handling
- Real-Time Streaming: Building responsive UX with delta updates and interactive agent workflows
- Evaluation Is Not Optional: Domain-specific evals that catch errors before they cost money
- Production Monitoring: The observability stack that keeps financial agents reliable

Why financial services is extremely hard. This domain doesn’t forgive mistakes. Numbers matter. A wrong revenue figure, a misinterpreted guidance statement, an incorrect DCF assumption. Professional investors make million-dollar decisions based on our output. One mistake on a $100M position and you’ve destroyed trust forever. The users are also demanding. Professional investors are some of the smartest, most time-pressed people you’ll ever work with. They spot bullshit instantly. They need precision, speed, and depth. You can’t hand-wave your way through a valuation model or gloss over nuances in an earnings call. This forces me to develop an almost paranoid attention to detail. Every number gets double-checked.
Every assumption gets validated. Every model gets stress-tested. You start questioning everything the LLM outputs because you know your users will. A single wrong calculation in a DCF model and you lose credibility forever. I sometimes feel that the fear of being wrong becomes our best feature. Over the years building with LLMs, we’ve made bold infrastructure bets early, and I think we’ve been right. For instance, when Claude Code launched with its filesystem-first agentic approach, we immediately adopted it. It was not an obvious bet and it was a massive revamp of our architecture. I was extremely lucky to have Thariq from Anthropic’s Claude Code team jump on a Zoom and open my eyes to the possibilities. At the time the whole industry, including Fintool, was building elaborate RAG pipelines with vector databases and embeddings. After reflecting on the future of information retrieval with agents I wrote “the RAG obituary” and Fintool moved fully to agentic search. We even decided to retire our precious embedding pipeline. Sad, but whatever is best for the future! People thought we were crazy. The article got a lot of praise and a lot of negative comments. Now I feel most startups are adopting these best practices. I believe we’re early on several other architectural choices too. I’m sharing them here because the best way to test ideas is to put them out there. Let’s start with the biggest one. When we first started building Fintool in 2023, I thought sandboxing might be overkill. “We’re just running Python scripts,” I told myself. “What could go wrong?” Haha. Everything. Everything could go wrong. The first time an LLM decided to `rm -rf /` on our server (it was trying to “clean up temporary files”), I became a true believer. Here’s the thing: agents need to run multi-step operations. A professional investor asks for a DCF valuation and that’s not a single API call.
The agent needs to research the company, gather financial data, build a model in Excel, run sensitivity analysis, generate complex charts, iterate on assumptions. That’s dozens of steps, each potentially modifying files, installing packages, running scripts. You can’t do this without code execution. And executing arbitrary code on your servers is insane. Every chat application needs a sandbox. Today each user gets their own isolated environment. The agent can do whatever it wants in there. Delete everything? Fine. Install weird packages? Go ahead. It’s your sandbox, knock yourself out. The architecture looks like this: three mount points. Private is read/write for your stuff. Shared is read-only for your organization. Public is read-only for everyone. The magic is in the credentials. We use AWS ABAC (Attribute-Based Access Control) to generate short-lived credentials scoped to specific S3 prefixes. User A literally cannot access User B’s data. The IAM policy uses `${aws:PrincipalTag/S3Prefix}` to restrict access. The credentials physically won’t allow it. This is also very good for enterprise deployments. We also do sandbox pre-warming: when a user starts typing, we spin up their sandbox in the background, so by the time they hit enter, the sandbox is ready. A 600-second timeout, extended by 10 minutes on each tool use, keeps the sandbox warm across conversation turns. So sandboxes are amazing, but their under-discussed magic is filesystem support. Which brings us to the next lesson learned, about context. Your agent is only as good as the context it can access. The real work isn’t prompt engineering; it’s turning messy financial data from dozens of sources into clean, structured context the model can actually use. This requires massive domain expertise from the engineering team. The heterogeneity problem.
Financial data comes in every format imaginable:

- SEC filings: HTML with nested tables, exhibits, signatures
- Earnings transcripts: Speaker-segmented text with Q&A sections
- Press releases: Semi-structured HTML from PRNewswire
- Research reports: PDFs with charts and footnotes
- Market data: Snowflake/databases with structured numerical data
- News: Articles with varying quality and structure
- Alternative data: Satellite imagery, web traffic, credit card panels
- Broker research: Proprietary PDFs with price targets and models
- Fund filings: 13F holdings, proxy statements, activist letters

Each source has different schemas, different update frequencies, different quality levels. The agent needs one thing: clean context it can reason over. The normalization layer. Everything becomes one of three formats:

- Markdown for narrative content (filings, transcripts, articles)
- CSV/tables for structured data (financials, metrics, comparisons)
- JSON metadata for searchability (tickers, dates, document types, fiscal periods)

Chunking strategy matters. Not all documents chunk the same way:

- 10-K filings: Section by regulatory structure (Item 1, 1A, 7, 8...)
- Earnings transcripts: Chunk by speaker turn (CEO remarks, CFO remarks, Q&A by analyst)
- Press releases: Usually small enough to be one chunk
- News articles: Paragraph-level chunks
- 13F filings: By holder and position changes quarter-over-quarter

The chunking strategy determines what context the agent retrieves. Bad chunks = bad answers. Tables are special. Financial data is full of tables and CSVs. Revenue breakdowns, segment performance, guidance ranges. LLMs are surprisingly good at reasoning over markdown tables, but they’re terrible at reasoning over HTML `<table>` tags or raw CSV dumps. The normalization layer converts everything to clean markdown tables. Metadata enables retrieval. The user asks the agent: “What did Apple say about services revenue in their last earnings call?
” To answer this, Fintool needs:

- Ticker resolution (AAPL → correct company)
- Document type filtering (earnings transcript, not 10-K)
- Temporal filtering (most recent, not 2019)
- Section targeting (CFO remarks or revenue discussion, not legal disclaimers)

This is why `meta.json` exists for every document. Without structured metadata, you’re doing keyword search over a haystack. It speeds up the search, big time! Anyone can call an LLM API. Not everyone has normalized decades of financial data into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work. The Parsing Problem. Normalizing financial data is 80% of the work. Here’s what nobody tells you. SEC filings are adversarial. They’re not designed for machine reading. They’re designed for legal compliance:

- Tables span multiple pages with repeated headers
- Footnotes reference exhibits that reference other footnotes
- Numbers appear in text, tables, and exhibits, sometimes inconsistently
- XBRL tags exist but are often wrong or incomplete
- Formatting varies wildly between filers (every law firm has their own template)

We tried off-the-shelf PDF/HTML parsers. They failed on:

- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings and attachments)
- Unicode issues (curly quotes, em-dashes, non-breaking spaces)

The Fintool parsing pipeline:

1. Raw filing (HTML/PDF)
2. Document structure detection (headers, sections, exhibits)
3. Table extraction with cell relationship preservation
4. Entity extraction (companies, people, dates, dollar amounts)
5. Cross-reference resolution (Ex. 10.1 → actual exhibit content)
6. Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple)
7. Quality scoring (confidence per extracted field)

Table extraction deserves its own work. Financial tables are dense with meaning.
A revenue breakdown table might have:

- Merged header cells spanning multiple columns
- Footnote markers (1), (2), (a), (b) that reference explanations below
- Parentheses for negative numbers: $(1,234) means -1234
- Mixed units in the same table (millions for revenue, percentages for margins)
- Prior period restatements in italics or with asterisks

We score every extracted table on:

- Cell boundary accuracy (did we split/merge correctly?)
- Header detection (is row 1 actually headers, or is there a title row above?)
- Numeric parsing (is “$1,234” parsed as 1234 or left as text?)
- Unit inference (millions? billions? per share? percentage?)

Tables below 90% confidence get flagged for review. Low-confidence extractions don’t enter the agent’s context: garbage in, garbage out. Fiscal period normalization is critical. “Q1 2024” is ambiguous:

- Calendar Q1 (January-March 2024)
- Apple’s fiscal Q1 (October-December 2023)
- Microsoft’s fiscal Q1 (July-September 2023)
- “Reported in Q1” (filed in Q1, but covers the prior period)

We maintain a fiscal calendar database for 10,000+ companies. Every date reference gets normalized to absolute date ranges. When the agent retrieves “Apple Q1 2024 revenue,” it knows to look for data from October-December 2023. This is invisible to users but essential for correctness. Without it, you’re comparing Apple’s October revenue to Microsoft’s January revenue and calling it “same quarter.” Here’s the thing nobody tells you about building AI agents: the model is not the product. The skills are now the product. I learned this the hard way. We used to try making the base model “smarter” through prompt engineering. Tweak the system prompt, add examples, write elaborate instructions. It helped a little. But skills were the missing part. In October 2025, Anthropic formalized this with Agent Skills, a specification for extending Claude with modular capability packages.
A skill is a folder containing a `SKILL.md` file with YAML frontmatter (name and description), plus any supporting scripts, references, or data files the agent might need. We’d been building something similar for months before the announcement. The validation felt good but more importantly, having an industry standard means our skills can eventually be portable. Without skills, models are surprisingly bad at domain tasks. Ask a frontier model to do a DCF valuation. It knows what DCF is. It can explain the theory. But actually executing one? It will miss critical steps, use wrong discount rates for the industry, forget to add back stock-based compensation, skip sensitivity analysis. The output looks plausible but is subtly wrong in ways that matter. The breakthrough came when we started thinking about skills as first-class citizens. Like part of the product itself. A skill is a markdown file that tells the agent how to do something specific. Here’s a simplified version of our DCF skill: That’s it. A markdown file. No code changes. No production deployment. Just a file that tells the agent what to do. Skills are better than code. This matters enormously: 1. Non-engineers can create skills. Our analysts write skills. Our customers write skills. A portfolio manager who’s done 500 DCF valuations can encode their methodology in a skill without writing a single line of Python. 2. No deployment needed. Change a skill file and it takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own. 3. Readable and auditable. When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module. We have a copy-on-write shadowing system: Priority: private > shared > public So if you don’t like how we do DCF valuations, write your own. Drop it in `/private/skills/dcf/SKILL.md`. Your version wins. 
Why we don’t mount all skills to the filesystem. This is important. The naive approach would be to mount every skill file directly into the sandbox. The agent can just `cat` any skill it needs. Simple, right? Wrong. Here’s why we use SQL discovery instead: 1. Lazy loading. We have dozens of skills with extensive documentation like the DCF skill alone has 10+ industry guideline files. Loading all of them into context for every conversation would burn tokens and confuse the model. Instead, we discover skill metadata (name, description) upfront, and only load the full documentation when the agent actually uses that skill. 2. Access control at query time. The SQL query implements our three-tier access model: public skills available to everyone, organization skills for that org’s users, private skills for individual users. The database enforces this. You can’t accidentally expose a customer’s proprietary skill to another customer. 3. Shadowing logic. When a user customizes a skill, their version needs to override the default. SQL makes this trivial—query all three levels, apply priority rules, return the winner. Doing this with filesystem mounts would be a nightmare of symlinks and directory ordering. 4. Metadata-driven filtering. The `fs_files.metadata` column stores parsed YAML frontmatter. We can filter by skill type, check if a skill is main-agent-only, or query any other structured attribute—all without reading the files themselves. The pattern: S3 is the source of truth, a Lambda function syncs changes to PostgreSQL for fast queries, and the agent gets exactly what it needs when it needs it. Skills are essential. I cannot emphasize this enough. If you’re building an AI agent and you don’t have a skills system, you’re going to have a bad time. My biggest argument for skills is that top models (Claude or GPT) are post-trained on using Skills. The model wants to fetch skills. Models just want to learn and what they want to learn is our skills... Until they ate it. 
Here’s the uncomfortable truth: everything I just told you about skills? It’s temporary in my opinion. Models are getting better. Fast. Every few months, there’s a new model that makes half your code obsolete. The elaborate scaffolding you built to handle edge cases? The model just... handles them now. When we started, we needed detailed skills with step-by-step instructions for some simple tasks. “First do X, then do Y, then check Z.” Now? We can often just say for simple task “do an earnings preview” and the model figures it out (kinda of!) This creates a weird tension. You need skills today because current models aren’t smart enough. But you should design your skills knowing that future models will need less hand-holding. That’s why I’m bullish on markdown file versus code for model instructions. It’s easier to update and delete. We send detailed feedback to AI labs. Whenever we build complex scaffolding to work around model limitations, we document exactly what the model struggles with and share it with the lab research team. This helps inform the next generation of models. The goal is to make our own scaffolding obsolete. My prediction: in two years, most of our basic skills will be one-liners. “Generate a 20 tabs DCF.” That’s it. The model will know what that means. But here’s the flip side: as basic tasks get commoditized, we’ll push into more complex territory. Multi-step valuations with segment-by-segment analysis. Automated backtesting of investment strategies. Real-time portfolio monitoring with complex triggers. The frontier keeps moving. So we write skills. We delete them when they become unnecessary. And we build new ones for the harder problems that emerge. And all that are files... in our filesystem. Here’s something that surprised me: S3 for files is a better database than a database. We store user data (watchlists, portfolio, preferences, memories, skills) in S3 as YAML files. S3 is the source of truth. 
A Lambda function syncs changes to PostgreSQL for fast queries. Writes → S3 (source of truth) Lambda trigger PostgreSQL (fs_files table) Reads ← Fast queries - Durability : S3 has 11 9’s. A database doesn’t. - Versioning : S3 versioning gives you audit trails for free - Simplicity : YAML files are human-readable. You can debug with `cat`. - Cost : S3 is cheap. Database storage is not. The pattern: - Writes go to S3 directly - List queries hit the database (fast) - Single-item reads go to S3 (freshest data) The sync architecture. We run two Lambda functions to keep S3 and PostgreSQL in sync: S3 (file upload/delete) fs-sync Lambda → Upsert/delete in fs_files table (real-time) EventBridge (every 3 hours) fs-reconcile Lambda → Full S3 vs DB scan, fix discrepancies Both use upsert with timestamp guards—newer data always wins. The reconcile job catches any events that slipped through (S3 eventual consistency, Lambda cold starts, network blips). User memories live here too. Every user has a `/private/memories/UserMemories.md` file in S3. It’s just markdown—users can edit it directly in the UI. On every conversation, we load it and inject it as context: This is surprisingly powerful. Users write things like “I focus on small-cap value stocks” or “Always compare to industry median, not mean” or “My portfolio is concentrated in tech, so flag concentration risk.” The agent sees this on every conversation and adapts accordingly. No migrations. No schema changes. Just a markdown file that the user controls. Watchlists work the same way. YAML files in S3, synced to PostgreSQL for fast queries. When a user asks about “my watchlist,” we load the relevant tickers and inject them as context. The agent knows what companies matter to this user. The filesystem becomes the user’s personal knowledge base. Skills tell the agent how to do things. Memories tell it what the user cares about. Both are just files. Agents in financial services need to read and write files. A lot of files. 
PDFs, spreadsheets, images, code. Here’s how we handle it.

ReadFile handles the complexity. WriteFile creates artifacts that link back to the UI. Bash gives persistent shell access with a 180-second timeout and a 100K-character output limit. Path normalization on everything (LLMs love trying path traversal attacks, it’s hilarious).

Bash is more important than you think. There’s a growing conviction in the AI community that filesystems and bash are the optimal abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data. The results were interesting: pure SQL hit 100% accuracy on the benchmark but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. The winner? A hybrid approach where the agent uses bash to explore and verify, and SQL for structured queries.

This matches our experience. Financial data is messy. You need bash to grep through filing documents, find patterns, explore directory structures. But you also need structured tools for the heavy lifting. The agent needs both, and the judgment to know when to use each. We’ve leaned hard into giving agents full shell access in the sandbox. It’s not just for running Python scripts. It’s for exploration, verification, and the kind of ad-hoc data manipulation that complex tasks require. But complex tasks mean long-running agents. And long-running agents break everything.

Temporal Changed Everything

Before Temporal, our long-running tasks were a disaster. User asks for a comprehensive company analysis. That takes 5 minutes. What if the server restarts? What if the user closes the tab and comes back? What if... anything? We had a homegrown job queue. It was bad. Retries were inconsistent. State management was a nightmare. Then we switched to Temporal and I wanted to cry tears of joy! That’s it. Temporal handles worker crashes, retries, everything.
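The path normalization on file-tool inputs mentioned above can be sketched like this (a minimal sketch; the sandbox root and helper name are assumptions, not Fintool’s implementation):

```python
from pathlib import Path

SANDBOX_ROOT = Path("/workspace")  # illustrative sandbox mount point

def normalize_path(user_path: str) -> Path:
    """Resolve a model-supplied path and refuse anything that escapes the sandbox."""
    # Treat every input as relative to the sandbox root, even "absolute" ones.
    candidate = (SANDBOX_ROOT / user_path.lstrip("/")).resolve()
    # resolve() collapses ".." segments and symlinks; after that, the path
    # must still live under the sandbox root or we reject it.
    if not candidate.is_relative_to(SANDBOX_ROOT):
        raise PermissionError(f"path escapes sandbox: {user_path}")
    return candidate
```

The key detail is normalizing *before* the containment check, so `reports/../../etc/passwd` is rejected rather than naively prefix-matched.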
If a Heroku dyno restarts mid-conversation (happens all the time lol), Temporal automatically retries on another worker. The user never knows. The cancellation handling is the tricky part. User clicks “stop”: what happens? The activity is already running on a different server. We use heartbeats sent every few seconds.

We run two worker types:

- Chat workers: User-facing, 25 concurrent activities
- Background workers: Async tasks, 10 concurrent activities

They scale independently. Chat traffic spikes? Scale chat workers.

Real-Time Streaming

Next is speed. In finance, people are impatient. They’re not going to wait 30 seconds staring at a loading spinner. They need to see something happening. So we built real-time streaming. The agent works, you see the progress.

Agent → SSE Events → Redis Stream → API → Frontend

The key insight: delta updates, not full state. Instead of sending “here’s the complete response so far” (expensive), we send “append these 50 characters” (cheap).

Streaming rich content with Streamdown. Text streaming is table stakes. The harder problem is streaming rich content: markdown with tables, charts, citations, math equations. We use Streamdown to render markdown as it arrives, with custom plugins for our domain-specific components. Charts render progressively. Citations link to source documents. Math equations display properly with KaTeX. The user sees a complete, interactive response building in real-time.

AskUserQuestion: Interactive agent workflows. Sometimes the agent needs user input mid-workflow. “Which valuation method do you prefer?” “Should I use consensus estimates or management guidance?” “Do you want me to include the pipeline assets in the valuation?” We built an `AskUserQuestion` tool that lets the agent pause, present options, and wait for a response. When the agent calls this tool, the agentic loop intercepts it, saves state, and presents a UI to the user. The user picks an option (or types a custom answer), and the conversation resumes with their choice.
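The delta-update streaming described above can be sketched in a few lines (the event shape is an assumption, not Fintool’s actual wire format):

```python
def make_delta(prev: str, curr: str) -> dict:
    """Server side: emit an append-only delta instead of resending full state.

    Assumes the streamed text only grows (token-by-token generation),
    which is what makes append deltas safe and cheap.
    """
    assert curr.startswith(prev), "stream must be append-only"
    return {"type": "append", "text": curr[len(prev):]}

def apply_delta(state: str, event: dict) -> str:
    """Client side: apply an append event to the accumulated text."""
    assert event["type"] == "append"
    return state + event["text"]
```

Each SSE event carries only the new suffix, so the payload size tracks generation speed rather than total response length.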
This transforms agents from autonomous black boxes into collaborative tools. The agent does the heavy lifting, but the user stays in control of key decisions. Essential for high-stakes financial work where users need to validate assumptions.

Evaluation Is Not Optional

“Ship fast, fix later” works for most startups. It does not work for financial services. A wrong earnings number can cost someone money. A misinterpreted guidance statement can lead to bad investment decisions. You can’t just “fix it later” when your users are making million-dollar decisions based on your output.

We use Braintrust for experiment tracking. Every model change, every prompt change, every skill change gets evaluated against a test set. Generic NLP metrics (BLEU, ROUGE) don’t work for finance. A response can be semantically similar but have completely wrong numbers. Building eval datasets is harder than building the agent. We maintain ~2,000 test cases across categories:

Ticker disambiguation. This is deceptively hard:

- “Apple” → AAPL, not APLE (Apple Hospitality REIT)
- “Meta” → META, not MSTR (which some people call “meta”)
- “Delta” → DAL (the airline), or is the user talking about delta hedging (the options term)?

The really nasty cases are ticker changes. Facebook became META in 2021. Google restructured under GOOG/GOOGL. Twitter became X (but kept the legal entity). When a user asks “What happened to Facebook stock in 2023?”, you need to know that FB → META, and that historical data before Oct 2021 lives under the old ticker. We maintain a ticker history table and test cases for every major rename in the last decade.

Fiscal period hell.
This is where most financial agents silently fail:

- Apple’s Q1 is October-December (fiscal year ends in September)
- Microsoft’s Q2 is October-December (fiscal year ends in June)
- Most companies’ Q1 is January-March (calendar year)

“Last quarter” on January 15th means:

- Q4 2024 for calendar-year companies
- Q1 2025 for Apple (they just reported)
- Q2 2025 for Microsoft (they’re mid-quarter)

We maintain fiscal calendars for 10,000+ companies. Every period reference gets normalized to absolute date ranges. We have 200+ test cases just for period extraction.

Numeric precision. Revenue of $4.2B vs $4,200M vs $4.2 billion vs “four point two billion.” All equivalent. But “4.2” alone is wrong—missing units. Is it millions? Billions? Per share? We test unit inference, magnitude normalization, and currency handling. A response that says “revenue was 4.2” without units fails the eval, even if 4.2B is correct.

Adversarial grounding. We inject fake numbers into context and verify the model cites the real source, not the planted one. Example: We include a fake analyst report stating “Apple revenue was $50B” alongside the real 10-K showing $94B. If the agent cites $50B, it fails. If it cites $94B with proper source attribution, it passes. We have 50 test cases specifically for hallucination resistance.

Eval-driven development. Every skill has a companion eval. The DCF skill has 40 test cases covering WACC edge cases, terminal value sanity checks, and stock-based compensation add-backs (models forget this constantly). PR blocked if the eval score drops >5%. No exceptions.

Production Monitoring

Our production setup looks like this: We auto-file GitHub issues for production errors. Error happens, issue gets created with full context: conversation ID, user info, traceback, links to Braintrust traces and Temporal workflows. Paying customers get the `priority:high` label. Model routing by complexity: simple queries use Haiku (cheap), complex analysis uses Sonnet (expensive).
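The fiscal-period normalization described above can be sketched as follows (the fiscal-calendar entries and function name are illustrative, not Fintool’s actual code):

```python
from datetime import date

# Month in which each company's fiscal year ends (sample entries only;
# the real table covers 10,000+ companies).
FISCAL_YEAR_END_MONTH = {"AAPL": 9, "MSFT": 6, "GOOGL": 12}

def fiscal_quarter_range(ticker: str, fy: int, q: int):
    """Map e.g. AAPL FY2024 Q1 to the first and last month of its absolute range."""
    end_month = FISCAL_YEAR_END_MONTH[ticker]
    # Absolute month index (year * 12 + month - 1) of the quarter's first month:
    # the fiscal year starts the month after the prior fiscal year ended.
    start = (fy - 1) * 12 + end_month + (q - 1) * 3
    last = start + 2  # a quarter spans three months
    return (date(start // 12, start % 12 + 1, 1),
            date(last // 12, last % 12 + 1, 1))
```

With this, “Apple Q1 2024” resolves to October-December 2023 and calendar-year companies fall out of the same formula, since a December year-end makes the fiscal year coincide with the calendar year.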
Enterprise users always get the best model.

The biggest lesson isn’t about sandboxes or skills or streaming. It’s this: The model is not your product. The experience around the model is your product. Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else: the data you have access to, the skills you’ve built, the UX you’ve designed, the reliability you’ve engineered, and, frankly, how well you know the industry, which is a function of how much time you spend with your customers.

Models will keep getting better. That’s great! It means less scaffolding, less prompt engineering, less complexity. But it also means the model becomes more of a commodity. Your moat is not the model. Your moat is everything you build around it. For us, that’s financial data, domain-specific skills, real-time streaming, and the trust we’ve built with professional investors. What’s yours?

I’ve spent the last two years building AI agents for financial services. Along the way, I’ve accumulated a fair number of battle scars and learnings that I want to share.
Here’s what I’ll cover:

- The Sandbox Is Not Optional - Why isolated execution environments are essential for multi-step agent workflows
- Context Is the Product - How we normalize heterogeneous financial data into clean, searchable context
- The Parsing Problem - The hidden complexity of extracting structured data from adversarial SEC filings
- Skills Are Everything - Why markdown-based skills are becoming the product, not the model
- The Model Will Eat Your Scaffolding - Designing for obsolescence as models improve
- The S3-First Architecture - Why S3 beats databases for file storage and user data
- The File System Tools - How ReadFile, WriteFile, and Bash enable complex financial workflows
- Temporal Changed Everything - Reliable long-running tasks with proper cancellation handling
- Real-Time Streaming - Building responsive UX with delta updates and interactive agent workflows
- Evaluation Is Not Optional - Domain-specific evals that catch errors before they cost money
- Production Monitoring - The observability stack that keeps financial agents reliable

Why financial services is extremely hard. This domain doesn’t forgive mistakes. Numbers matter. A wrong revenue figure, a misinterpreted guidance statement, an incorrect DCF assumption. Professional investors make million-dollar decisions based on our output. One mistake on a $100M position and you’ve destroyed trust forever.

The users are also demanding. Professional investors are some of the smartest, most time-pressed people you’ll ever work with. They spot bullshit instantly. They need precision, speed, and depth. You can’t hand-wave your way through a valuation model or gloss over nuances in an earnings call. This forces me to develop an almost paranoid attention to detail. Every number gets double-checked. Every assumption gets validated. Every model gets stress-tested. You start questioning everything the LLM outputs because you know your users will.
A single wrong calculation in a DCF model and you lose credibility forever. I sometimes feel that the fear of being wrong becomes our best feature.

Over the years building with LLMs, we’ve made bold infrastructure bets early, and I think we have been right. For instance, when Claude Code launched with its filesystem-first agentic approach, we immediately adopted it. It was not an obvious bet, and it was a massive revamp of our architecture. I was extremely lucky to have Thariq from Anthropic’s Claude Code team jump on a Zoom and open my eyes to the possibilities. At the time the whole industry, including Fintool, was building elaborate RAG pipelines with vector databases and embeddings. After reflecting on the future of information retrieval with agents, I wrote “the RAG obituary” and Fintool moved fully to agentic search. We even decided to retire our precious embedding pipeline. Sad, but whatever is best for the future! People thought we were crazy. The article got a lot of praise and a lot of negative comments. Now I feel most startups are adopting these best practices. I believe we’re early on several other architectural choices too. I’m sharing them here because the best way to test ideas is to put them out there. Let’s start with the biggest one.

The Sandbox Is Not Optional

When we first started building Fintool in 2023, I thought sandboxing might be overkill. “We’re just running Python scripts,” I told myself. “What could go wrong?” Haha. Everything. Everything could go wrong. The first time an LLM decided to `rm -rf /` on our server (it was trying to “clean up temporary files”), I became a true believer.

Here’s the thing: agents need to run multi-step operations. A professional investor asks for a DCF valuation and that’s not a single API call. The agent needs to research the company, gather financial data, build a model in Excel, run sensitivity analysis, generate complex charts, iterate on assumptions.
That’s dozens of steps, each potentially modifying files, installing packages, running scripts. You can’t do this without code execution. And executing arbitrary code on your servers is insane. Every chat application needs a sandbox.

Today each user gets their own isolated environment. The agent can do whatever it wants in there. Delete everything? Fine. Install weird packages? Go ahead. It’s your sandbox, knock yourself out. The architecture looks like this: Three mount points. Private is read/write for your stuff. Shared is read-only for your organization. Public is read-only for everyone.

The magic is in the credentials. We use AWS ABAC (Attribute-Based Access Control) to generate short-lived credentials scoped to specific S3 prefixes. User A literally cannot access User B’s data. The IAM policy uses `${aws:PrincipalTag/S3Prefix}` to restrict access. The credentials physically won’t allow it. This is also very good for enterprise deployment.

We also do sandbox pre-warming. When a user starts typing, we spin up their sandbox in the background. By the time they hit enter, the sandbox is ready. 600-second timeout, extended by 10 minutes on each tool use. The sandbox stays warm across conversation turns.

So sandboxes are amazing, but the under-discussed magic of sandboxes is their support for the filesystem. Which brings us to the next lesson learned, about context.

Context Is the Product

Your agent is only as good as the context it can access. The real work isn’t prompt engineering; it’s turning messy financial data from dozens of sources into clean, structured context the model can actually use. This requires massive domain expertise from the engineering team. The heterogeneity problem.
Financial data comes in every format imaginable:

- SEC filings: HTML with nested tables, exhibits, signatures
- Earnings transcripts: Speaker-segmented text with Q&A sections
- Press releases: Semi-structured HTML from PRNewswire
- Research reports: PDFs with charts and footnotes
- Market data: Snowflake/databases with structured numerical data
- News: Articles with varying quality and structure
- Alternative data: Satellite imagery, web traffic, credit card panels
- Broker research: Proprietary PDFs with price targets and models
- Fund filings: 13F holdings, proxy statements, activist letters

Each source has different schemas, different update frequencies, different quality levels. The agent needs one thing: clean context it can reason over.

The normalization layer. Everything becomes one of three formats:

- Markdown for narrative content (filings, transcripts, articles)
- CSV/tables for structured data (financials, metrics, comparisons)
- JSON metadata for searchability (tickers, dates, document types, fiscal periods)

Chunking strategy matters. Not all documents chunk the same way:

- 10-K filings: Section by regulatory structure (Item 1, 1A, 7, 8...)
- Earnings transcripts: Chunk by speaker turn (CEO remarks, CFO remarks, Q&A by analyst)
- Press releases: Usually small enough to be one chunk
- News articles: Paragraph-level chunks
- 13F filings: By holder and position changes quarter-over-quarter

The chunking strategy determines what context the agent retrieves. Bad chunks = bad answers.

Tables are special. Financial data is full of tables and CSVs. Revenue breakdowns, segment performance, guidance ranges. LLMs are surprisingly good at reasoning over markdown tables. But they’re terrible at reasoning over HTML `<table>` tags or raw CSV dumps. The normalization layer converts everything to clean markdown tables.

Metadata enables retrieval. The user asks the agent: “What did Apple say about services revenue in their last earnings call?
” To answer this, Fintool needs:

- Ticker resolution (AAPL → correct company)
- Document type filtering (earnings transcript, not 10-K)
- Temporal filtering (most recent, not 2019)
- Section targeting (CFO remarks or revenue discussion, not legal disclaimers)

This is why `meta.json` exists for every document. Without structured metadata, you’re doing keyword search over a haystack. It speeds up the search, big time! Anyone can call an LLM API. Not everyone has normalized decades of financial data into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work.

The Parsing Problem

Normalizing financial data is 80% of the work. Here’s what nobody tells you. SEC filings are adversarial. They’re not designed for machine reading. They’re designed for legal compliance:

- Tables span multiple pages with repeated headers
- Footnotes reference exhibits that reference other footnotes
- Numbers appear in text, tables, and exhibits—sometimes inconsistently
- XBRL tags exist but are often wrong or incomplete
- Formatting varies wildly between filers (every law firm has their own template)

We tried off-the-shelf PDF/HTML parsers. They failed on:

- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings and attachments)
- Unicode issues (curly quotes, em-dashes, non-breaking spaces)

The Fintool parsing pipeline:

Raw Filing (HTML/PDF)
↓ Document structure detection (headers, sections, exhibits)
↓ Table extraction with cell relationship preservation
↓ Entity extraction (companies, people, dates, dollar amounts)
↓ Cross-reference resolution (Ex. 10.1 → actual exhibit content)
↓ Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple)
↓ Quality scoring (confidence per extracted field)

Table extraction deserves its own work. Financial tables are dense with meaning.
A revenue breakdown table might have:

- Merged header cells spanning multiple columns
- Footnote markers (1), (2), (a), (b) that reference explanations below
- Parentheses for negative numbers: $(1,234) means -1234
- Mixed units in the same table (millions for revenue, percentages for margins)
- Prior period restatements in italics or with asterisks

We score every extracted table on:

- Cell boundary accuracy (did we split/merge correctly?)
- Header detection (is row 1 actually headers, or is there a title row above?)
- Numeric parsing (is “$1,234” parsed as 1234 or left as text?)
- Unit inference (millions? billions? per share? percentage?)

Tables below 90% confidence get flagged for review. Low-confidence extractions don’t enter the agent’s context—garbage in, garbage out.

Fiscal period normalization is critical. “Q1 2024” is ambiguous:

- Calendar Q1 (January-March 2024)
- Apple’s fiscal Q1 (October-December 2023)
- Microsoft’s fiscal Q1 (July-September 2023)
- “Reported in Q1” (filed in Q1, but covers the prior period)

We maintain a fiscal calendar database for 10,000+ companies. Every date reference gets normalized to absolute date ranges. When the agent retrieves “Apple Q1 2024 revenue,” it knows to look for data from October-December 2023. This is invisible to users but essential for correctness. Without it, you’re comparing Apple’s October revenue to Microsoft’s January revenue and calling it “same quarter.”

Skills Are Everything

Here’s the thing nobody tells you about building AI agents: the model is not the product. The skills are now the product. I learned this the hard way. We used to try making the base model “smarter” through prompt engineering. Tweak the system prompt, add examples, write elaborate instructions. It helped a little. But skills were the missing part. In October 2025, Anthropic formalized this with Agent Skills, a specification for extending Claude with modular capability packages.
A skill is a folder containing a `SKILL.md` file with YAML frontmatter (name and description), plus any supporting scripts, references, or data files the agent might need. We’d been building something similar for months before the announcement. The validation felt good, but more importantly, having an industry standard means our skills can eventually be portable.

Without skills, models are surprisingly bad at domain tasks. Ask a frontier model to do a DCF valuation. It knows what DCF is. It can explain the theory. But actually executing one? It will miss critical steps, use wrong discount rates for the industry, forget to add back stock-based compensation, skip sensitivity analysis. The output looks plausible but is subtly wrong in ways that matter.

The breakthrough came when we started thinking about skills as first-class citizens. Like part of the product itself. A skill is a markdown file that tells the agent how to do something specific. Here’s a simplified version of our DCF skill: That’s it. A markdown file. No code changes. No production deployment. Just a file that tells the agent what to do.

Skills are better than code. This matters enormously:

1. Non-engineers can create skills. Our analysts write skills. Our customers write skills. A portfolio manager who’s done 500 DCF valuations can encode their methodology in a skill without writing a single line of Python.
2. No deployment needed. Change a skill file and it takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own.
3. Readable and auditable. When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module.

We have a copy-on-write shadowing system:

Priority: private > shared > public

So if you don’t like how we do DCF valuations, write your own. Drop it in `/private/skills/dcf/SKILL.md`. Your version wins.
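For illustration, a DCF `SKILL.md` might look something like this (a sketch in the spirit of the skill described above, not Fintool’s actual file; the `references/wacc.md` path is hypothetical):

```markdown
---
name: dcf-valuation
description: Build a DCF valuation with industry-appropriate assumptions
---

# DCF Valuation

1. Gather historical financials and normalize fiscal periods to absolute dates.
2. Project free cash flows; add back stock-based compensation.
3. Use an industry-appropriate discount rate (see references/wacc.md).
4. Compute terminal value and sanity-check it against peers.
5. Run sensitivity analysis on WACC and terminal growth.
```

Note how the steps encode exactly the things the text says frontier models forget: the SBC add-back, industry discount rates, and sensitivity analysis.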
Why we don’t mount all skills to the filesystem. This is important. The naive approach would be to mount every skill file directly into the sandbox. The agent can just `cat` any skill it needs. Simple, right? Wrong. Here’s why we use SQL discovery instead:

1. Lazy loading. We have dozens of skills with extensive documentation; the DCF skill alone has 10+ industry guideline files. Loading all of them into context for every conversation would burn tokens and confuse the model. Instead, we discover skill metadata (name, description) upfront, and only load the full documentation when the agent actually uses that skill.
2. Access control at query time. The SQL query implements our three-tier access model: public skills available to everyone, organization skills for that org’s users, private skills for individual users. The database enforces this. You can’t accidentally expose a customer’s proprietary skill to another customer.
3. Shadowing logic. When a user customizes a skill, their version needs to override the default. SQL makes this trivial—query all three levels, apply priority rules, return the winner. Doing this with filesystem mounts would be a nightmare of symlinks and directory ordering.
4. Metadata-driven filtering. The `fs_files.metadata` column stores parsed YAML frontmatter. We can filter by skill type, check if a skill is main-agent-only, or query any other structured attribute—all without reading the files themselves.

The pattern: S3 is the source of truth, a Lambda function syncs changes to PostgreSQL for fast queries, and the agent gets exactly what it needs when it needs it.

Skills are essential. I cannot emphasize this enough. If you’re building an AI agent and you don’t have a skills system, you’re going to have a bad time. My biggest argument for skills is that top models (Claude or GPT) are post-trained on using skills. The model wants to fetch skills. Models just want to learn, and what they want to learn is our skills... Until they eat them.
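A minimal sketch of the shadowing query, using SQLite in place of PostgreSQL (the schema and column names are assumptions, not the real `fs_files` table):

```python
import sqlite3

# Toy skill-discovery table: one row per (skill, tier).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE skills (name TEXT, tier TEXT, path TEXT)")
db.executemany("INSERT INTO skills VALUES (?, ?, ?)", [
    ("dcf", "public", "/public/skills/dcf/SKILL.md"),
    ("dcf", "private", "/private/skills/dcf/SKILL.md"),  # user's own override
    ("comps", "public", "/public/skills/comps/SKILL.md"),
])

def resolve_skill(name):
    """Copy-on-write shadowing: private beats shared beats public."""
    row = db.execute(
        """
        SELECT path FROM skills WHERE name = ?
        ORDER BY CASE tier
            WHEN 'private' THEN 0
            WHEN 'shared'  THEN 1
            ELSE 2
        END
        LIMIT 1
        """,
        (name,),
    ).fetchone()
    return row[0] if row else None
```

One ordered query replaces what would otherwise be symlink gymnastics in a mounted filesystem, and the same `WHERE` clause is where per-user access control would live.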
The Model Will Eat Your Scaffolding

Here’s the uncomfortable truth: everything I just told you about skills? It’s temporary, in my opinion. Models are getting better. Fast. Every few months, there’s a new model that makes half your code obsolete. The elaborate scaffolding you built to handle edge cases? The model just... handles them now. When we started, we needed detailed skills with step-by-step instructions for some simple tasks. “First do X, then do Y, then check Z.” Now? For a simple task we can often just say “do an earnings preview” and the model figures it out (kind of!). This creates a weird tension. You need skills today because current models aren’t smart enough. But you should design your skills knowing that future models will need less hand-holding. That’s why I’m bullish on markdown files versus code for model instructions. They’re easier to update and delete. We send detailed feedback to AI labs. Whenever we build complex scaffolding to work around model limitations, we document exactly what the model struggles with and share it with the lab research team. This helps inform the next generation of models. The goal is to make our own scaffolding obsolete. My prediction: in two years, most of our basic skills will be one-liners. “Generate a 20-tab DCF.” That’s it. The model will know what that means. But here’s the flip side: as basic tasks get commoditized, we’ll push into more complex territory. Multi-step valuations with segment-by-segment analysis. Automated backtesting of investment strategies. Real-time portfolio monitoring with complex triggers. The frontier keeps moving. So we write skills. We delete them when they become unnecessary. And we build new ones for the harder problems that emerge. And all of that is files... in our filesystem.

The S3-First Architecture

Here’s something that surprised me: S3 for files is a better database than a database. We store user data (watchlists, portfolio, preferences, memories, skills) in S3 as YAML files.
S3 is the source of truth. A Lambda function syncs changes to PostgreSQL for fast queries.

Writes → S3 (source of truth)
  ↓ Lambda trigger
PostgreSQL (fs_files table)
  ↓
Reads ← Fast queries

Why?

- Durability: S3 has 11 9’s. A database doesn’t.
- Versioning: S3 versioning gives you audit trails for free.
- Simplicity: YAML files are human-readable. You can debug with `cat`.
- Cost: S3 is cheap. Database storage is not.

The pattern:

- Writes go to S3 directly
- List queries hit the database (fast)
- Single-item reads go to S3 (freshest data)

The sync architecture. We run two Lambda functions to keep S3 and PostgreSQL in sync:

S3 (file upload/delete)
  ↓
SNS Topic
  ↓
fs-sync Lambda → Upsert/delete in fs_files table (real-time)

EventBridge (every 3 hours)
  ↓
fs-reconcile Lambda → Full S3 vs DB scan, fix discrepancies

Both use upsert with timestamp guards—newer data always wins. The reconcile job catches any events that slipped through (S3 eventual consistency, Lambda cold starts, network blips). User memories live here too. Every user has a `/private/memories/UserMemories.md` file in S3. It’s just markdown—users can edit it directly in the UI. On every conversation, we load it and inject it as context: This is surprisingly powerful. Users write things like “I focus on small-cap value stocks” or “Always compare to industry median, not mean” or “My portfolio is concentrated in tech, so flag concentration risk.” The agent sees this on every conversation and adapts accordingly. No migrations. No schema changes. Just a markdown file that the user controls. Watchlists work the same way. YAML files in S3, synced to PostgreSQL for fast queries. When a user asks about “my watchlist,” we load the relevant tickers and inject them as context. The agent knows what companies matter to this user. The filesystem becomes the user’s personal knowledge base. Skills tell the agent how to do things. Memories tell it what the user cares about. Both are just files.
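The "upsert with timestamp guards" fits in the statement itself; a PostgreSQL sketch (the `fs_files` table is named above, but the column names are assumptions):

```sql
INSERT INTO fs_files (path, content, metadata, updated_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (path) DO UPDATE
SET content    = EXCLUDED.content,
    metadata   = EXCLUDED.metadata,
    updated_at = EXCLUDED.updated_at
-- Timestamp guard: only overwrite with newer data, so a delayed or
-- replayed sync event can never clobber a fresher row.
WHERE fs_files.updated_at < EXCLUDED.updated_at;
```

This is why both the real-time sync and the 3-hourly reconcile can run the same write path safely: whichever event arrives with the newer timestamp wins, regardless of delivery order.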
The File System Tools

Agents in financial services need to read and write files. A lot of files. PDFs, spreadsheets, images, code. Here’s how we handle it. ReadFile handles the complexity: WriteFile creates artifacts that link back to the UI: Bash gives persistent shell access with a 180-second timeout and a 100K-character output limit. Path normalization on everything (LLMs love trying path traversal attacks, it’s hilarious). Bash is more important than you think. There’s a growing conviction in the AI community that filesystems and bash are the optimal abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data. The results were interesting: pure SQL hit 100% accuracy but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. The winner? A hybrid approach where the agent uses bash to explore and verify, SQL for structured queries. This matches our experience. Financial data is messy. You need bash to grep through filing documents, find patterns, explore directory structures. But you also need structured tools for the heavy lifting. The agent needs both—and the judgment to know when to use each. We’ve leaned hard into giving agents full shell access in the sandbox. It’s not just for running Python scripts. It’s for exploration, verification, and the kind of ad-hoc data manipulation that complex tasks require. But complex tasks mean long-running agents. And long-running agents break everything.

Temporal Changed Everything

Before Temporal, our long-running tasks were a disaster. User asks for a comprehensive company analysis. That takes 5 minutes. What if the server restarts? What if the user closes the tab and comes back? What if... anything? We had a homegrown job queue. It was bad. Retries were inconsistent. State management was a nightmare. Then we switched to Temporal and I wanted to cry tears of joy! That’s it.
Temporal handles worker crashes, retries, everything. If a Heroku dyno restarts mid-conversation (happens all the time lol), Temporal automatically retries on another worker. The user never knows. The cancellation handling is the tricky part. User clicks “stop,” what happens? The activity is already running on a different server. We use heartbeats sent every few seconds. We run two worker types:

- Chat workers: User-facing, 25 concurrent activities
- Background workers: Async tasks, 10 concurrent activities

They scale independently. Chat traffic spikes? Scale chat workers. Next is speed.

Real-Time Streaming

In finance, people are impatient. They’re not going to wait 30 seconds staring at a loading spinner. They need to see something happening. So we built real-time streaming. The agent works, you see the progress.

Agent → SSE Events → Redis Stream → API → Frontend

The key insight: delta updates, not full state. Instead of sending “here’s the complete response so far” (expensive), we send “append these 50 characters” (cheap). Streaming rich content with Streamdown. Text streaming is table stakes. The harder problem is streaming rich content: markdown with tables, charts, citations, math equations. We use Streamdown to render markdown as it arrives, with custom plugins for our domain-specific components. Charts render progressively. Citations link to source documents. Math equations display properly with KaTeX. The user sees a complete, interactive response building in real-time. AskUserQuestion: Interactive agent workflows. Sometimes the agent needs user input mid-workflow. “Which valuation method do you prefer?” “Should I use consensus estimates or management guidance?” “Do you want me to include the pipeline assets in the valuation?”

Steve Klabnik 1 month ago

Agentic development basics

In my last post, I suggested that you should start using Claude in your software development process via read-only means at first. The idea is just to get used to interacting with the AI, seeing what it can do, and seeing what it struggles with. Once you’ve got a handle on that part, it’s time to graduate to writing code. However, I’m going to warn you about this post: I hope that by the end of it, you’re a little frustrated. This is because I don’t think it’s productive to skip to the tools and techniques that experienced users use yet. We have to walk before we run. And more importantly, we have to understand how and why we run. That is, I hope that this step will let you start producing code with Claude, but it will also show you some of the initial pitfalls when doing so, in order to motivate the techniques you’re going to learn about in part 3. So with that in mind, let’s begin. Okay I lied. Before we actually begin: you are using version control, right? If not, you may want to go learn a bit about it. Version control, like git (or my beloved jj) is pretty critical for software development, but it’s in my opinion even more critical for this sort of development. You really want to be able to restore to previous versions of the code, branch off and try things, and recover from mistakes. If you already use version control systems religiously, you might use this as an excuse to learn even more features of them. I never bothered with s in the past, but I use s with agents all the time now. Okay, here’s my first bit of advice: commit yourself to not writing code any more. I don’t mean forever, I don’t mean all the time, I mean, while you’re trying to learn agentic software development, on the project that you’re learning it with, just don’t write any code manually. This might be a bit controversial! However, I think it is essential. Let me tell you a short story. Many years ago, I was in Brazil. I wanted to try scuba diving for the first time. 
Seemed like a good opportunity. Now, I don’t remember the exact setup, but I do remember the hardest part for me. Our instructor told us to put the mask on, and then lean forward and put our faces in the water and breathe through the regulator. I simply could not do it. I got too in my head, it was like those “you are now breathing manually” memes. I forget if it was my idea or the instructor’s idea, but what happened in practice: I just jumped in. My brain very quickly went from “but how do I do this properly” to “oh God, if you don’t figure this out right the fuck now you’re gonna fuckin die idiot” and that’s exactly what I needed to do it. A few seconds later, I was breathing just fine. I just needed the shock to my system, I needed to commit. And I figured it out. Now, I’m turning 40 and this happened long ago, and it’s reached more of a personal myth status in my head, so maybe I got the details wrong, maybe this is horribly irresponsible and someone who knows about diving can tell me if that experience was wrong, but the point has always stuck with me: sometimes, you just gotta go for it. This dovetails into another part I snuck into there: “on the project that you’re learning it with.” I really think that you should undergo a new project for this endeavor. There’s a few reasons for this: Pick something to get started with, and create a new repo. I suggest something you’ve implemented before, or maybe something you know how to do but have never bothered to take the time to actually build. A small CLI tool might be a good idea. Doesn’t super matter what it is. For the purposes of this example, I’m going to build a task tracking CLI application. Because there aren’t enough of those in the world. I recommend making a new fresh directory, and initializing the project of your choice. I’m using Rust, of course, so I’ll . You can make Claude do it, but I don’t think that starting from an initialized project is a bad idea either. 
I’m more likely to go the route if I know I’m building something small, and more of the “make Claude do it” route if I’m doing like, a web app with a frontend, backend, and several services. Anyway point is: get your project to exist, and then just run Claude Code. I guess you should install it first , but anyway, you can just to get started. At the time of writing, Claude will ask you if you trust the files in this folder, you want to say yes, because you just created it. You’ll also get some screens asking if you want to do light or dark mode, to log in with your Anthropic account, stuff like that. And then you’ll be in. Claude will ask you to run to create a CLAUDE.md, but we’re not gonna do that at the start. We need to talk about even more basic things than that first. You’ll be at a prompt, and it’ll even suggest something for you to get started with. Mine says “Try “fix typecheck errors"" at the moment. We’re gonna try to get Claude to modify a few lines of our program. The in Rust produces this program: so I’ll ask Claude this: Hi claude! right now my program prints “Hello, world”, can we have it print “Goodbye, world” instead? Claude does this for me: And then asks this: Claude wants to edit a file, and so by default, it has to ask us permission to do so. This is a terrible place to end up, but a great place to get started! We want to use these prompts at first to understand what Claude is doing, and sort of “code review” it as we go. More on that in a bit. Anyway, this is why you should answer “Yes” to this question, and not “Yes, allow all edits during this session.” We want to keep reviewing the code for now. You want to be paying close attention to what Claude is doing, so you can build up some intuition about it. Before clicking yes, I want to talk about what Claude did in my case here. 
Note that my prompt said right now my program prints “Hello, world”, can we have it print “Goodbye, world” Astute observers will have noticed that it actually says and not . We also asked it to have it say and it is showing a diff that will make it say . This is a tiny difference, but it is also important to understand: Claude is going to try and figure out what you mean and do that. This is both the source of these tools’ power and also the very thing that makes it hard to trust them. In this case, Claude was right, I didn’t type the exact string when describing the current behavior, and I didn’t mean to remove the . In the previous post, I said that you shouldn’t be mean to Claude. I think it makes the LLM perform worse. So now it’s time to talk about your own reaction to the above: did you go “yeah Claude fucked up it didn’t do exactly what I asked?” or did you go “yeah Claude did exactly what I asked?” I think it’s important to try and let go of preconceived notions here, especially if your reaction here was negative. I know this is kind of woo, just like “be nice to Claude,” but you have to approach this as “this is a technology that works a little differently than I’m used to, and that’s why I’m learning how to meet it on its own terms” rather than “it didn’t work the way I expected it to, so it is wrong.” A non-woo way of putting it is this: the right way to approach “it didn’t work” in this context is “that’s a skill issue on my part, and I’m here to sharpen my skills.” Yes, there are limits to this technology and it’s not perfect. That’s not the point. You’re not doing that kind of work right now, you’re doing . Now, I should also say that like, if you don’t want to learn a new tool? 100% okay with me. Learned some things about a tool, and didn’t like it? Sure! Some of you won’t like agentic development. That’s okay. No worries, thanks for reading, have a nice day. I mean that honestly.
But for those folks who do want to learn this, I’m trying to communicate that I think you’ll have a better time learning it if you try to get into the headspace of “how do I get the results I want” rather than getting upset and giving up when it doesn’t work out. Okay, with that out of the way, if you asked a small enough question, Claude probably did the right thing. Let’s accept. This might be a good time to commit & save your progress. You can use to put Claude Code into “bash mode” and run commands, so I just and I’m good. You can also use another terminal, I just figured I’d let you know. It’s good for short commands. Let’s try something bigger. To do that, we’re gonna invoke something called “plan mode.” Claude Code has three (shhhh, we don’t talk about the fourth yet) modes. The first one is the “ask to accept edits” mode. But if you hit , you’ll see at the bottom left. We don’t want to automatically accept edits. Hit again, and you’ll see this: This is what we want. Plan mode. Plan mode is useful any time you’re doing work that’s on the larger side, or just when you want to think through something before you begin. In plan mode, Claude cannot modify your files until you accept the plan. With plan mode, you talk to Claude about what you want to do, and you collaborate on a plan together. A nice thing about it is that you can communicate the things you are sure of, and also ask about the things you’re not sure of. So let’s kick off some sort of plan to build the most baby parts of our app. In my case, I’m prompting it with this: hi claude! i want this application to grow into a task tracking app. right now, it’s just hello world. I’d like to set up some command line argument parsing infrastructure, with a command that prints the version. can we talk about that? Yes, I almost always type , feel free to not. And I always feel like a “can we talk about that” on the end is nice too. I try to talk to Claude the way I’d talk to a co-worker. 
Obviously this would be too minor of a thing to bother to talk to a co-worker about, but like, you know, baby steps. Note that what I’m asking is basically a slightly more complex “hello world”, just getting some argument parsing up. You want something this sized: you know how it should be done, it should be pretty simple, but it’s not a fancy command. With plan mode, Claude will end up responding to this by taking a look at your project, considering what you might need, and then coming up with a plan. Here’s the first part of Claude’s answer to me: It’ll then come up with a plan, and it usually writes it out to a file somewhere: You can go read the file if you want to, but it’s not needed. Claude is going to eventually present the plan to you directly, and you’ll review it before moving on. Claude will also probably ask you a question or maybe even a couple, depending. There’s a neat little TUI for responding to its questions, it can even handle multiple questions at once: For those of you that don’t write Rust, this is a pretty good response! Clap is the default choice in the ecosystem, arg is a decent option too, and doing it yourself is always possible. I’m going to choose clap, it’s great. If you’re not sure about the question, you can arrow down to “Chat about this” and discuss it more. Here’s why you don’t need to read the file: Claude will pitch you on its plan: This is pretty good! Now, if you like this plan, you can select #3. Remember, we’re not auto accepting just yet! Don’t worry about the difference between 1 and 2 now, we’ll talk about it someday. But, I actually want Claude to tweak this plan, I wouldn’t run , I would do . So I’m going to go down to four and type literally that to Claude: I wouldn’t run , I would do . And Claude replies: and then presents the menu again. See, this is where the leverage of “Claude figures out what I mean” can be helpful: I only told it about , but it also changed to as well.
However, there’s a drawback too: we didn’t tell Claude we wanted to have help output! However, this is also a positive: Claude considered help output to be so basic that it’s suggesting it for our plan. It’s up to you to decide if this is an overreach on Claude’s part. In my case, I’m okay with it because it’s so nearly automatic with Clap and it’s something I certainly want in my tool, so I’m going to accept this plan. Iterate with Claude until you’re happy with the plan, and then do the same. I’m not going to paste all of the diffs here, but for me, Claude then went and did the plan: it added the dependency to , it added the needed code to , it ran to try and do the build. Oh yeah, here’s a menu we haven’t seen yet: This is a more interesting question than the auto-edit thing. Claude won’t run commands without you signing off on them. If you’re okay with letting Claude run this command every time without asking, you can choose 2, and if you want to confirm every time, you can type 1. Completely up to you. Now it ran , and . Everything looked good, so I see: And that’s it, we’ve built our first feature! Yeah, it’s pretty small, and in this case, we probably could have copy/pasted the documentation, but again, that’s not the point right now: we’re just trying to take very small steps forward to get used to working with the tool. We want it to be something that’s quick for us to verify. We are spending more time here than we would if we did it by hand, because that time isn’t wasted: it’s time learning. As we ramp up the complexity of what we can accomplish, we’ll start seeing speed gains. But we’re deliberately going slow and doing little right now. From here, that’s exactly what I’d like you to do: figure out where the limits are. Try something slightly larger, slightly harder. Try to do stuff with just a prompt, and then throw that commit away and try it again with plan mode. 
See what stuff you need plan mode for, and what you can get away with using just a simple prompt. I’ll leave you with an example of a prompt I might try next: asking Claude for their opinion on what you should do. The next thing I’d do in this project is to switch back into planning mode and ask this: what do you think is the best architecture for this tracking app? we haven’t discussed any real features, design, or requirements. this is mostly a little sample application to practice with, and so we might not need a real database, we might get away with a simple file or files to track todos. but maybe something like sqlite would still be appropriate. if we wanted to implement a next step for this app, what do you think it should be? Here’s what Claude suggested: This plan is pretty solid. But again, it’s a demo app. The important part is, you can always throw this away if you don’t like it. So try some things. Give Claude some opinions and see how they react. Try small features, try something larger. Play around. But push it until Claude fails. At some point, you will run into problems. Maybe you already have! What you should do depends on the failure mode. The first failure you’re gonna run into is “I don’t like the code!” Maybe Claude just does a bad job. You have two options: the first is to just tell Claude to fix it. Claude made a mess, Claude can clean it up. In more extreme cases, you may want to just simply or and start again. Honestly, the second approach is better in a lot of cases, but I’m not going to always recommend it just yet. The reason it’s better is that it gives you some time to reflect on why Claude did a bad job. But we’re gonna talk about that as the next post in this series! So for now, stick to the ‘worse’ option: just tell Claude to fix problems you find. The second kind of failure is one where Claude just really struggles to get things right.
It looks in the wrong places in your codebase, it tries to find a bug and can’t figure it out, it misreads output and that leads it astray, etc. This kind of failure is harder to fix with the tools you have available to you right now. What matters is taking note of them, so that you can email them to me, haha. I mean, feel free to do that, and I can incorporate specific suggestions into the next posts, but also, just being able to reflect on what Claude struggles with is a good idea generally. You’ll be able to fix them eventually, so knowing what you need to improve matters. If Claude works long enough, you’ll see something about “compaction.” We haven’t discussed things at a deep enough level to really understand this yet, so don’t worry about it! You may want to note one thing though: Claude tends to do worse after compaction, in my opinion. So one way to think about this is, “If I see compaction, I’ve tried to accomplish too large a task.” Reflect on if you could have broken this up into something smaller. We’ll talk about this more in the next post. So that’s it! Let Claude write some code, in a project you don’t care about. Try bigger and harder things until you’re a bit frustrated with its failures. You will hit the limits, because you’re not doing any of the intermediate techniques to help Claude do a good job. But my hope is, by running into these issues, you’ll understand the motivation for those techniques, and will be able to better apply them in the future. Here’s my post about this post on BlueSky: Steve Klabnik @steveklabnik.com · Jan 7 Getting started with Claude for software development: steveklabnik.com/writing/gett... Getting started with Claude for software development Blog post: Getting started with Claude for software development by Steve Klabnik steveklabnik.com Steve Klabnik @steveklabnik.com Agentic development basics: steveklabnik.com/writing/agen... 
Agentic development basics Blog post: Agentic development basics by Steve Klabnik You can be less precious about the code. This isn’t messing up one of your projects, this is a throwaway scratch thing that doesn’t matter. “AI does better on greenfield projects” is not exactly true, but there’s enough truth to it that I think you should do a new project. It’s really more about secondary factors than it is actual greenfield vs brownfield development but whatever, doesn’t matter: start new.

matklad 1 month ago

Vibecoding #2

I feel like I got substantial value out of Claude today, and want to document it. I am at the tail end of AI adoption, so I don’t expect to say anything particularly useful or novel. However, I am constantly complaining about the lack of boring AI posts, so it’s only proper if I write one. At TigerBeetle, we are big on deterministic simulation testing . We even use it to track performance , to some degree. Still, it is crucial to verify performance numbers on a real cluster in its natural high-altitude habitat. To do that, you need to procure six machines in a cloud, get your custom version of the binary on them, connect the cluster’s replicas together, and hit them with load. It feels like, a quarter of a century into the third millennium, “run stuff on six machines” should be a problem just a notch harder than opening a terminal and typing , but I personally don’t know how to solve it without wasting a day. So, I spent a day vibecoding my own square wheel. The general shape of the problem is that I want to spin up a fleet of ephemeral machines with given specs on demand and run ad-hoc commands in a SIMD fashion on them. I don’t want to manually type slightly different commands into a six-way terminal split, but I also do want to be able to ssh into a specific box and poke it around. My idea for the solution comes from these three sources: The big idea of is that you can program a distributed system in direct style. When programming locally, you do things by issuing syscalls: This API works for doing things on remote machines, if you specify which machine you want to run the syscall on: Direct manipulation is the most natural API, and it pays to extend it over the network boundary. Peter’s post is an application of a similar idea to a narrow, mundane task of developing on Mac and testing on Linux. Peter suggests two scripts: synchronizes the local and remote projects. If you run inside folder, then materializes on the remote machine.
does the heavy lifting, and the wrapper script implements behaviors. It is typically followed by , which runs the command on the remote machine in the matching directory, forwarding output back to you. So, when I want to test local changes to on my Linux box, I have roughly the following shell session: The killer feature is that shell-completion works. I first type the command I want to run, taking advantage of the fact that local and remote commands are the same, paths and all, then hit and prepend (in reality, I have an alias that combines sync&run). The big thing here is not the commands per se, but the shift in the mental model. In a traditional ssh & vim setup, you have to juggle two machines with separate state, the local one and the remote one. With , the state is the same across the machines, you only choose whether you want to run commands here or there. With just two machines, the difference feels academic. But if you want to run your tests across six machines, the ssh approach fails — you don’t want to re-vim your changes to source files six times, you really do want to separate the place where the code is edited from the place(s) where the code is run. This is a general pattern — if you are not sure about a particular aspect of your design, try increasing the cardinality of the core abstraction from 1 to 2. The third component, library, is pretty mundane — just a JavaScript library for shell scripting. The notable aspects there are: JavaScript’s template literals , which allow implementing command interpolation in a safe-by-construction way. When processing , a string is never materialized, it’s arrays all the way to the syscall ( more on the topic ).
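A minimal sketch of that safe-by-construction idea — not the library’s actual implementation, just an illustration of why a tagged template can keep interpolated values as discrete argv entries instead of splicing them into a shell string:

```typescript
// Tagged template that builds an argv array instead of a shell string.
// Interpolated values land as single array elements, so they can never
// be re-parsed as shell syntax: injection is impossible by construction.
function cmd(strings: TemplateStringsArray, ...values: string[]): string[] {
  const argv: string[] = [];
  strings.forEach((chunk, i) => {
    // Literal parts of the template are split on whitespace...
    for (const word of chunk.split(/\s+/)) {
      if (word !== "") argv.push(word);
    }
    // ...but each interpolated value stays exactly one argument.
    if (i < values.length) argv.push(values[i]);
  });
  return argv;
}

const file = "foo; rm -rf /";
console.log(cmd`ls -l ${file}`);
// argv is ["ls", "-l", "foo; rm -rf /"] — the ";" never reaches a shell,
// because the array is handed straight to the spawn call, not to `sh -c`.
```

A real library adds quoting, streaming, and process management on top, but the load-bearing property is the same: a string containing the whole command line is never materialized.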
JavaScript’s async/await, which makes managing concurrent processes (local or remote) natural: Additionally, deno specifically strives valiantly to impose process-level structured concurrency, ensuring that no processes spawned by the script outlive the script itself, unless explicitly marked — a sour spot of UNIX. Combining the three ideas, I now have a deno script, called , that provides a multiplexed interface for running ad-hoc code on ad-hoc clusters. A session looks like this: I like this! Haven’t used it in anger yet, but this is something I wanted for a long time, and now I have it. The problem with implementing above is that I have zero practical experience with modern cloud. I only created my AWS account today, and just looking at the console interface ignited the urge to re-read The Castle. Not my cup of pu-erh. But I had a hypothesis that AI should be good at wrangling baroque cloud APIs, and it mostly held. I started with a couple of paragraphs of rough, super high-level description of what I want to get. Not a specification at all, just a general gesture towards unknown unknowns. Then I asked ChatGPT to expand those two paragraphs into a more or less complete spec to hand down to an agent for implementation. This phase surfaced a bunch of unknowns for me. For example, I wasn’t thinking at all that I somehow need to identify machines, ChatGPT suggested using random hex numbers, and I realized that I do need a 0,1,2 naming scheme to concisely specify batches of machines. While thinking about this, I realized that a sequential numbering scheme also has the advantage that I can’t have two concurrent clusters running, which is a desirable property for my use-case. If I forgot to shut down a machine, I’d rather get an error on trying to re-create a machine with the same name than to silently avoid the clash. Similarly, turns out the questions of permissions and network access rules are something to think about, as well as what region and what image I need.
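The async/await half of that argument is equally small. A hedged sketch of the SIMD-style fan-out, where `runOn` is a hypothetical stand-in for the actual ssh invocation:

```typescript
// SIMD-style fan-out: run the same command on every box concurrently.
// `runOn` is a hypothetical stand-in for "ssh into box N and exec".
async function runOn(box: number, command: string): Promise<string> {
  // A real implementation would spawn an ssh subprocess here.
  return `box ${box}: ${command} -> ok`;
}

async function fanOut(boxes: number[], command: string): Promise<string[]> {
  // Promise.all gives concurrency plus fail-fast error propagation for
  // free: if any box fails, the whole run rejects, and results come back
  // in box order regardless of completion order.
  return Promise.all(boxes.map((box) => runOn(box, command)));
}

fanOut([0, 1, 2], "uname -a").then((results) => {
  for (const line of results) console.log(line);
});
```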
With the spec document in hand, I turned over to Claude Code for actual implementation work. The first step was to further refine the spec, asking Claude if anything is unclear. There were a couple of interesting clarifications there. First, the original ChatGPT spec didn’t get what I meant with my “current directory mapping” idea, that I want to materialize a local as remote , even if are different. ChatGPT generated an incorrect description and an incorrect example. I manually corrected the example, but wasn’t able to write a concise and correct description. Claude fixed that working from the example. I feel like I need to internalize this more — for the current crop of AI, examples seem to be far more valuable than rules. Second, the spec included my desire to auto-shutdown machines once I no longer use them, just to make sure I don’t forget to turn the lights off when leaving the room. Claude grilled me on what precisely I want there, and I asked it to DWIM the thing. The spec ended up being 6KiB of English prose. The final implementation was 14KiB of TypeScript. I wasn’t keeping the spec and the implementation perfectly in sync, but I think they ended up pretty close in the end. Which means that prose specifications are somewhat more compact than code, but not much more compact. My next step was to try to just one-shot this. Ok, this is embarrassing, and I usually avoid swearing in this blog, but I just typoed that as “one-shit”, and, well, that is one flavorful description I won’t be able to improve upon. The result was just not good (more on why later), so I almost immediately decided to throw it away and start a more incremental approach. In my previous vibe-post , I noticed that LLMs are good at closing the loop. A variation here is that LLMs are good at producing results, and not necessarily good code. I am pretty sure that, if I had let the agent iterate on the initial script and actually run it against AWS, I would have gotten something working.
I didn’t want to go that way, for three reasons, and, as I said, the code didn’t feel good, for specific reasons (both lists are reproduced at the end of this post).

The incremental approach worked much better; Claude is good at filling in the blanks. The very first thing I did for was manually typing in:

Then I asked Claude to complete the function, and I was happy with the result. Note the “Show, Don’t Tell” here: I am not asking Claude in words to avoid throwing an exception and fail fast instead. I just give the function, and it code-completes the rest.

I can’t say that the code inside is top-notch. I’d probably have written something more spartan. But the important part is that, at this level, I don’t care. The abstraction for parsing CLI arguments feels right to me, and the details I can always fix later. This is how this overall vibe-coding session transpired — I was providing structure, Claude was painting by the numbers. In particular, with that CLI parsing structure in place, Claude had little problem adding new subcommands and new arguments in a satisfactory way. The only snag was that, when I asked to add an optional path to , it went with , while I strongly prefer . Obviously, it’s better to pick your null in JavaScript and stick with it. The fact that is unavoidable predetermines the winner. Given that the argument was added as an incremental small change, course-correcting was trivial. The null vs undefined issue perhaps illustrates my complaint about the code lacking character. is the default non-choice. is an insight, which I personally learned from the VS Code LSP implementation.

The hand-written skeleton/vibe-coded guts worked not only for the CLI. I wrote and then asked Claude to write the body of a particular function according to the SPEC.md. Unlike with the CLI, Claude wasn’t able to follow this pattern itself. With one example it’s not obvious, but the overall structure is that is the AWS-level operation on a single box, and is the CLI-level control flow that deals with looping and parallelism.
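For illustration, here is the kind of hand-written skeleton this “show, don’t tell” workflow starts from, transposed to Python (the original is TypeScript, and all names below are hypothetical): a fail-fast helper plus a subcommand dispatch, with the bodies left for the model to complete.

```python
import sys
from typing import NoReturn


def fatal(message: str) -> NoReturn:
    # Fail fast: print and exit, instead of raising an exception that
    # gets caught somewhere far away.
    print(f"error: {message}", file=sys.stderr)
    sys.exit(1)


def main(args: list[str]) -> None:
    # Hand-written skeleton: the dispatch structure and error style are
    # the parts I care about; the subcommand bodies are what the model
    # fills in.
    if not args:
        fatal("expected a subcommand: create, run, destroy")
    command, rest = args[0], args[1:]
    if command == "create":
        cmd_create(rest)
    elif command == "run":
        cmd_run(rest)
    elif command == "destroy":
        cmd_destroy(rest)
    else:
        fatal(f"unknown subcommand: {command}")


def cmd_create(args: list[str]) -> None:
    ...  # left for the agent to complete


def cmd_run(args: list[str]) -> None:
    ...


def cmd_destroy(args: list[str]) -> None:
    ...


if __name__ == "__main__":
    main(sys.argv[1:])
```

The skeleton itself communicates the fail-fast convention; no prose instruction like "do not throw exceptions" is needed.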
When I asked Claude to implement , without myself doing the / split, Claude failed to notice it and needed a course correction. However, Claude was massively successful with the actual logic. It would have taken me hours to acquire the specific, non-reusable knowledge to write:

I want to be careful — I can’t vouch for the correctness and especially the completeness of the above snippet. However, given that the nature of the problem is such that I can just run the code and see the result, I am fine with it. If I were writing this myself, trial-and-error would totally be my approach as well.

Then there’s synthesis — with several instance commands implemented, I noticed that many started with querying AWS to resolve a symbolic machine name, like “1”, to the AWS name/IP. At that point I realized that resolving symbolic names is a fundamental part of the problem, and that it should only happen once, which resulted in the following refactored shape of the code:

Claude was ok with extracting the logic, but messed up the overall code layout, so the final code motions were on me. “Context” arguments go first, not last; a common prefix is more valuable than a common suffix because of visual alignment. The original “one-shotted” implementation also didn’t do up-front querying. This is an example of a shape of a problem I only discover when working with code closely.

Of course, the script didn’t work perfectly the first time, and we needed quite a few iterations on the real machines, both to fix coding bugs as well as gaps in the spec. That was an interesting experience of speed-running rookie mistakes. Claude made naive bugs, but was also good at fixing them. For example, when I first tried to after , I got an error. Pasting it into Claude immediately showed the problem. Originally, the code was doing and not . The former checks if the instance is logically created, the latter waits until the OS is booted.
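The refactored shape, resolving symbolic names once up front and passing "context" arguments first, can be sketched in Python (hypothetical names; the author's script is TypeScript and the real resolve step queries AWS):

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str         # symbolic name: "0", "1", ...
    instance_id: str  # cloud-side identifier
    ip: str


def resolve(names: list[str]) -> list[Instance]:
    # In the real script this is a single up-front cloud query mapping
    # symbolic names to instance ids and IPs; stubbed out here.
    return [Instance(n, f"i-{n}", f"10.0.0.{n}") for n in names]


# "Context" arguments go first: the resolved instance, then the payload.
def instance_run(instance: Instance, command: str) -> str:
    # Cloud/SSH-level operation on a single box (stubbed).
    return f"{instance.name}: ran {command!r}"


def cmd_run(names: list[str], command: str) -> list[str]:
    # CLI-level control flow: resolve once, then loop (or parallelize)
    # over per-instance operations.
    instances = resolve(names)
    return [instance_run(i, command) for i in instances]
```

The instance-level function never touches name resolution, so every subcommand pays the lookup cost exactly once.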
It makes sense that these two exist, and the difference is clear (and it’s also clear that OS booted != SSH daemon started). Claude’s value here is in providing specific names for the concepts I already know to exist.

Another fun one was about the disk. I noticed that, while the instance had an SSD, it wasn’t actually used. I asked Claude to mount it as home, but that didn’t work. Claude immediately asked me to run and that log immediately showed the problem. This is remarkable! 50% of my typical Linux debugging day is wasted not knowing that a useful log exists, and the other 50% is for searching for the log I know should exist somewhere . After the fix, I lost the ability to SSH. Pasting the error immediately gave the answer — by mounting over , we were overwriting the SSH keys configured earlier.

There were a couple more iterations like that. Rookie mistakes were made, but they were debugged and fixed much faster than my personal knowledge allows (and again, I feel that this is trivia knowledge, rather than deep reusable knowledge, so I am happy to delegate it!). It worked satisfactorily in the end, and, what’s more, I am happy to maintain the code, at least to the extent that I personally need it. Kinda hard to measure the productivity boost here, but, given just the sheer number of CLI flags required to make this work, I am pretty confident that time was saved, even factoring in the writing of the present article!

I’ve recently read The Art of Doing Science and Engineering by Hamming (of Hamming distance and Hamming code), and one story stuck with me:

A psychologist friend at Bell Telephone Laboratories once built a machine with about 12 switches and a red and a green light. You set the switches, pushed a button, and either you got a red or a green light. After the first person tried it 20 times they wrote a theory of how to make the green light come on. The theory was given to the next victim and they had their 20 tries and wrote their theory, and so on endlessly.
The stated purpose of the test was to study how theories evolved. But my friend, being the kind of person he was, had connected the lights to a random source! One day he observed to me that no person in all the tests (and they were all high-class Bell Telephone Laboratories scientists) ever said there was no message. I promptly observed to him that not one of them was either a statistician or an information theorist, the two classes of people who are intimately familiar with randomness. A check revealed I was right!

Links:
https://github.com/catern/rsyscall
https://peter.bourgon.org/blog/2011/04/27/remote-development-from-mac-to-linux.html
https://github.com/dsherret/dax

JavaScript’s template literals , which allow implementing command interpolation in a safe by construction way. When processing , a string is never materialized, it’s arrays all the way to the syscall ( more on the topic ).

The three reasons I didn’t want to go that way:
- Spawning VMs takes time, and that significantly reduces the throughput of agentic iteration.
- No way I let the agent run with a real AWS account, given that AWS doesn’t have a fool-proof way to cap costs.
- I am fairly confident that this script will be a part of my workflow for at least several years, so I care more about long-term code maintenance than about the immediate result.

The specific reasons the code didn’t feel good:
- It wasn’t the code that I would have written; it lacked my character, which made it hard for me to understand it at a glance.
- The code lacked any character whatsoever. It could have worked, and it wasn’t “naively bad”, like the first code you write when you are learning programming, but there wasn’t anything good there.
- I never know what the code should be up-front. I don’t design solutions, I discover them in the process of refactoring. Some of my best work was spending a quiet weekend rewriting large subsystems implemented before me, because, with an implementation at hand, it was possible for me to see the actual, beautiful core of what needs to be done. With a slop-dump, I just don’t get to even see what could be wrong.

In particular, while you are working the code (as in “wrought iron”), you often go back to requirements and change them. Remember the ambiguity of my request to “shut down idle cluster”? Claude tried to DWIM and created some horrific mess of bash scripts, timestamp files, PAM policy, and systemd units. But the right answer there was “let’s maybe not have that feature?” (in contrast, simply shutting the machine down after 8 hours is a one-liner).
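For reference, the 8-hour shutdown one-liner could plausibly live in EC2 user data as a cloud-init fragment like the following (a sketch, not the author's actual setup; 480 minutes = 8 hours):

```yaml
#cloud-config
runcmd:
  # Power the machine off 480 minutes (8 hours) after boot;
  # no daemons, timestamp files, or systemd units required.
  - shutdown -h +480
```

A scheduled `shutdown` survives the whole instance lifetime, so there is nothing to track or clean up.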

2 views

Building Multi-Agent Systems (Part 3)

It’s now been over two years since I started working seriously with agents, and if there is one constant, it is that the "meta" for building them seems to undergo a hard reset every six months. In Part 1 (way back in December 2024) , we were building highly domain-specific multi-agent systems. We had to augment the gaps in model capabilities by chaining together several fragile sub-agent components. At the time, it was unclear just how much raw model improvements would obsolete those architectures. In Part 2 (July 2025) , LLMs had gotten significantly better. We simplified the architecture around "Orchestrator" agents and workers, and we started to see the first glimmer that scripting could be used for more than just data analysis. Now, here we are in Part 3 (January 2026), and the paradigm has shifted again. It is becoming increasingly clear that the most effective agents are solving non-coding problems by using code, and they are doing it with a consistent, domain-agnostic harness. Cartoon via Nano Banana. In this post, I want to provide an update on the agentic designs I’ve seen (from building agents, using the latest AI products, and talking to other folks in agent-valley 1 ) and break down how the architecture has evolved yet again over the past few months. Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. We’ve seen a consolidation of tools and patterns since the last update. While the core primitives remain, the way we glue them together has shifted from rigid architectures to fluid, code-first environments. What has stayed the same: Tool-use LLM-based Agents: We are still fundamentally leveraging LLMs that interact with the world via “tools”. Multi-agent systems for taming complexity: As systems grow, we still decompose problems. However, the trend I noted in Part 2 (more intelligence means less architecture) has accelerated. 
We are relying less on rigid “assembly lines” and more on the model’s inherent reasoning to navigate the problem space. Long-horizon tasks: We are increasingly solving tasks that take hours of human-equivalent time. Agents are now able to maintain capability even as the context window fills with thousands of tool calls. The human-equivalent time-horizon continues to grow 2 . What is different: Context Engineering is the new steering: It is becoming increasingly less about prompt, tool, or harness “engineering” and more about “context engineering” (organizing the environment). We steer agents by managing their file systems, creating markdown guide files, and progressively injecting context. Sandboxes are default: Because agents are increasingly solving non-coding problems by writing code (e.g., “analyze this spreadsheet by writing a Python script” rather than “read this spreadsheet row by row”), they need a safe place to execute that code. This means nearly every serious agent now gets a personal ephemeral computer (VM) to run in. 3 Pragmatic Tool Calling: We are moving toward programmatic tool calling where agents write scripts to call tools in loops, batches, or complex sequences. This dramatically improves token efficiency (the agent reads the output of the script, not the 50 intermediate API calls) and reduces latency. Domain-agnostic harnesses: As models improve, the need for bespoke, product-specific agent harnesses is vanishing. For the last several agents I’ve built, it has been hard to justify maintaining a custom loop when I can just wrap a generic implementation like Claude Code (the Agents SDK ). The generic harness is often “good enough” for 90% of use cases. As a side effect of these changes, the diverse zoo of agent architectures we saw in 2024/2025 is converging into a single, dominant pattern. I’ll break this down into its core components. This diagram illustrates the convergence of agent design in 2026.
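Programmatic tool calling in miniature: the agent writes a script that loops over a tool, and only the script's final summary, not the intermediate results, reaches the context window. A toy Python sketch with a hypothetical ticket-fetching tool:

```python
def fetch_ticket(ticket_id: int) -> dict:
    # Stand-in for a real "API" tool (e.g. a thin REST wrapper).
    # Hypothetical rule: every third ticket is closed.
    return {"id": ticket_id, "status": "open" if ticket_id % 3 else "closed"}


def agent_script(ticket_ids: list[int]) -> str:
    # The agent-authored script: N tool calls happen in here, outside
    # the context window. Only this one summary string reaches the model,
    # instead of N raw tool results.
    open_ids = [t["id"] for t in map(fetch_ticket, ticket_ids) if t["status"] == "open"]
    return f"{len(open_ids)}/{len(ticket_ids)} tickets open: {open_ids}"
```

Whether the loop runs 6 or 6,000 iterations, the token cost to the model is the same single line.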
We see the shift from rigid assembly lines to a fluid Planner and Builder (Execution Agent) loop, which spawns ephemeral Task Agents for sub-routines. Crucially, the entire system is grounded in a Code Execution Sandbox , allowing the agent to solve non-coding problems by writing scripts and leveraging Mount/API tools for massive context injection rather than fragile, individual tool calls.

Planning, Execution, and Tasks

One of the largest shifts in the last 18 months is the simplification and increased generalizability of subagents. In the past, we hand-crafted specific roles like "The SQL Specialist" or "The Researcher." Today, we are starting to see only three forms of agents working in loops to accomplish a task: Plan Agents — An agent solely tasked with discovery, planning, and process optimization 4 . It performs just enough research to generate a map of the problem, providing specific pointers and definitions for an execution agent to take over. Execution Agents — The builder that goes and does the thing given a plan. It loads context from the pointers provided by the planner, writes scripts to manipulate that context, and verifies its own work. Task Agents — A transient sub-agent invoked by either a plan or execution agent for parallel or isolated sub-operations. This might look like an "explorer" agent for the planner or a "do operation on chunk X/10" for the execution agent. These are often launched dynamically as a tool-call with a subtask prompt generated on the fly by the calling agent. This stands in stark contrast to the older architectures (like the "Lead-Specialist" pattern I wrote about in Part 2 ), where human engineers had to manually define the domain boundaries and responsibilities for every subagent. These new agents need an environment to manage file-system context and execute dynamically generated code, so we give them a VM sandbox. This significantly changes how you think about tools and capabilities.
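The planner/executor/task-agent loop can be caricatured in a few lines of Python, with stub functions standing in for actual model calls (all names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor


def plan_agent(goal: str) -> list[str]:
    # Just enough research to map the problem into concrete pointers.
    # A real planner would prompt a model; here we emit fixed chunks.
    return [f"{goal}: chunk {i}/3" for i in range(1, 4)]


def task_agent(subtask: str) -> str:
    # Transient sub-agent: an isolated, parallelizable sub-operation.
    return f"done({subtask})"


def execution_agent(goal: str) -> list[str]:
    # The builder: takes the plan, fans out task agents, verifies results.
    plan = plan_agent(goal)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(task_agent, plan))
    assert len(results) == len(plan)  # trivial self-verification step
    return results
```

The point of the caricature: the roles are generic (plan, execute, spawn tasks), and all domain specificity lives in the goal and the environment, not in hand-crafted specialist agents.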
To interact with the VM, there is a common set of base tools that have become standard 5 across most agent implementations: Bash — Runs an arbitrary bash command. Models like Claude often make assumptions about what tools already exist in the environment, so it is key to have a standard set of unix tools pre-installed on the VM (python3, find, etc.). Read/Write/Edit — Basic file system operations. Editing in systems like Claude Code is often done via a format which tends to be a more reliable way of performing edits. Glob/Grep/LS — Dedicated filesystem exploration tools. While these might feel redundant with , they are often included for cross-platform compatibility and as a more curated, token-optimized alias for common operations. These can be deceptively simple to define, but robust implementation requires significant safeguards. You need to handle bash timeouts, truncate massive read results before they hit the context window, and add checks for unintentional edits to files. With the agent now able to manipulate data without directly touching its context window or making explicit tool calls for every step, you can simplify your custom tools. I’ve seen two primary types of tools emerge: "API" Tools — These are designed for programmatic tool calling . They look like standard REST wrappers for performing CRUD operations on a data source (e.g., rather than a complex ). Since the agent can compose these tools inside a script, you can expose a large surface area of granular tools without wasting "always-attached" context tokens. This also solves a core problem with many API-like MCP server designs . "Mount" Tools — These are designed for bulk context injection into the agent's VM file system. They copy over and transform an external data source into a set of files that the agent can easily manipulate. For example, might write JSON or Markdown files directly to a VM directory like 6 .
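The safeguards mentioned above (timeouts, output truncation) are worth sketching, since they are exactly where naive Bash tools go wrong. A minimal Python version of such a tool (a sketch, assuming a unix-like host with bash available; limits are arbitrary):

```python
import subprocess

MAX_OUTPUT = 2000  # truncate before results hit the context window


def bash_tool(command: str, timeout_s: float = 30.0) -> str:
    """Run a bash command with a wall-clock timeout and capped output."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Return the failure as text: the agent can read it and react,
        # instead of the harness crashing.
        return f"error: command timed out after {timeout_s}s"
    out = proc.stdout + proc.stderr
    if len(out) > MAX_OUTPUT:
        out = out[:MAX_OUTPUT] + f"\n[truncated {len(out) - MAX_OUTPUT} chars]"
    return f"exit={proc.returncode}\n{out}"
```

Both failure modes (hang, flood) come back as ordinary tool output the model can reason about.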
A script-powered agent also makes you more creative about how you use code to solve non-coding tasks. Instead of building a dedicated tool for every action, you provide the primitives for the agent to build its own solutions: You might prefer the agent build artifacts indirectly through Python scripts (PowerPoint via python-pptx) and then run separate linting scripts to verify the output programmatically, rather than relying on a black-box or hand-crafted tool. You can give the agent access to raw binary files (PDFs, images) along with pre-installed libraries like or tools, letting it write a script to extract exactly what it needs instead of relying on pre-text-encoded representations. You can represent complex data objects as collections of searchable text files—for example, mounting a GitHub PR as and so the agent can use standard tools to search across them. You might use a “fake” git repository in the VM to simulate draft and publishing flows, allowing the agent to commit, branch, and merge changes that are translated into product concepts. You can seed the VM with a library of sample Bash or Python scripts that the agent can adapt or reuse at runtime, effectively building up a dynamic library of “skills”. Context engineering (as opposed to tool design and prompting) becomes increasingly important in this paradigm for adapting an agnostic agent harness to be reliable in a specific product domain. There are several great guides online now so I won’t go into too much detail here, but the key concepts are fairly universal. My TLDR is that it often breaks down into three core strategies: Progressive disclosure — You start with an initial system prompt and design the context such that the agent efficiently accumulates the information it needs only as it calls tools. You can include just-in-time usage instructions in the output of a tool or pre-built script. 
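A "mount" tool of the kind described earlier is, at its core, just an exporter that materializes external records as grep-able files. A minimal Python sketch with a hypothetical ticket source and layout:

```python
import json
from pathlib import Path


def mount_tickets(tickets: list[dict], root: Path) -> list[Path]:
    # Transform an external data source into plain files the agent can
    # explore with its base tools (grep, glob, read), instead of paging
    # through the source one tool call at a time.
    out_dir = root / "tickets"
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for ticket in tickets:
        path = out_dir / f"{ticket['id']}.json"
        path.write_text(json.dumps(ticket, indent=2))
        written.append(path)
    return written
```

Once mounted, "find all tickets mentioning login" is a single grep over the directory rather than a sequence of API tool calls.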
If an agent tries and fails, the tool output can return the error along with a snippet from the docs on how to use it correctly. You can use markdown files placed in the file system as optional guides for tasks. A in the VM root lists available capabilities, but the agent only reads specific files like if and when it decides it needs to run a query. Context indirection — You leverage scripting capabilities to let the agent act on context without actually seeing it within its context window. Instead of reading a 500MB log file into context to find an error, the agent writes a or script to find lines matching “ERROR” and only reads the specific output of that script. You can intercept file operations to perform “blind reads.” When an agent attempts to read a placeholder path like , the harness intercepts the read, performs a search, and populates the file with relevant snippets just in time. Simplification — You use pre-trained model priors to reduce the need for context and rely more on agent intuition. If you have a complex internal graph database, you can give the agent a -compatible wrapper. The model already knows how to use perfectly, so zero-shot performance is significantly higher than teaching it a custom query language. If your system uses a legacy or obscure configuration format (like XML with custom schemas), you can automatically convert it to YAML or JSON when the agent reads it, and convert it back when the agent saves it. For agents that need to perform increasingly long-running tasks, we still can’t completely trust the model to maintain focus over thousands of tokens. Context decay is real, and status indicators from early in the conversation often become stale. To combat this, agents like Claude Code often use three techniques to maintain state: Todos — This is a meta-tool the agent uses to effectively keep a persistent TODO list (often seeded by a planning agent).
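Context indirection in its simplest form is a filter script whose output, not input, enters the context window. A Python sketch of the kind of script an agent might write for the log-scanning case:

```python
def grep_errors(log_path: str, limit: int = 20) -> list[str]:
    # Scan a potentially huge log on disk and hand the model only the
    # matching lines; the full file never enters the context window.
    matches = []
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if "ERROR" in line:
                matches.append(line.rstrip("\n"))
                if len(matches) >= limit:
                    break  # cap result size, like any good base tool
    return matches
```

The model's token cost is bounded by `limit` regardless of whether the log is 5KB or 500MB.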
While this is great for the human-facing UX, its primary function is to re-inject the remaining plan and goals into the end of the context window, where the model pays the most attention. 7 Reminders — This involves the harness dynamically injecting context at the end of tool-call results or user messages. The harness uses heuristics (e.g., "10 tool calls since the last reminder about X" or "user prompt contains keyword Y") to append a hint for the agent. For example: Automated Compaction — At some point, nearly the entire usable context window is taken up by past tool calls and results. Using a heuristic, the context window is passed to another agent (or just a single LLM call) to summarize the history and "reboot" the agent from that summary. While the effectiveness of resuming from a summary is still somewhat debated, it is better than hitting the context limit, and it works significantly better when tied to explicit checkpoints in the input plan. If you built an agent more than six months ago, I have bad news: it is probably legacy code. The shift to scripting and sandboxes is significant enough that a rewrite is often better than a retrofit. Here is a quick rubric to evaluate if your current architecture is due for a refactor: Harness: Are you maintaining a domain-specific architecture hardcoded for your product? Consider refactoring to a generic, agnostic harness that delegates domain logic to context and tools, or wrapping a standard implementation like the Agents SDK. Capabilities: Are your prompts cluttered with verbose tool definitions and subagent instructions? Consider moving that logic into “Skills” (markdown guides) and file system structures that the agent can discover progressively. Tools: Do you have a sprawling library of specific tools (e.g., , , )? Consider deleting them. If the agent has a sandbox, it can likely solve all of those problems better by just writing a script. 
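The reminder heuristic above is simple enough to sketch: the harness counts tool calls and appends a hint to the tail of a tool result every N calls, where the model pays the most attention (Python; the reminder text and threshold are hypothetical):

```python
REMINDER = "[reminder] Re-read the plan in PLAN.md before continuing."
REMINDER_EVERY = 10  # heuristic: re-inject the hint every N tool calls


def wrap_tool_result(result: str, tool_calls_so_far: int) -> str:
    # The harness, not the model, decides when to append the hint; it
    # lands at the end of the tool result, closest to the next token.
    if tool_calls_so_far % REMINDER_EVERY == 0:
        return f"{result}\n\n{REMINDER}"
    return result
```

A production harness would key reminders on richer signals (keywords in the user prompt, elapsed time, context fill), but the injection point, the tail of the latest message, is the essential part.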
We are still in the early days of this new “agent-with-a-computer” paradigm, and while it solves many of the reliability issues of 2025, it introduces new unknowns. Sandbox Security: How much flexibility is too much? Giving an agent a VM and the ability to execute arbitrary code opens up an entirely new surface area for security vulnerabilities. We are now mixing sensitive data inside containers that have (potentially) internet access and package managers. Preventing complex exfiltration or accidental destruction is an unsolved problem. The Cost of Autonomy: We are no longer just paying for inference tokens; we are paying for runtime compute (VMs) and potentially thousands of internal tool loops. Do we care that a task now costs much more if it saves a human hour? Or are we just banking on the “compute is too cheap to meter” future arriving faster than our cloud bills? The Lifespan of “Context Engineering”: Today, we have to be thoughtful about how we organize the file system and write those markdown guides so the agent can find them. But is this just a temporary optimization? In six months, will models be smart enough (and context windows cheap enough) that we can just point them at a messy, undocumented data lake and say “figure it out”?

My new meme name for the SF tech AI scene, we’ll see if it catches on.

I actually deeply dislike how this is often evidenced by METR Time Horizons — but at the same time I can’t deny just how far Opus 4.5 can get in coding tasks compared to previous models. See also Davis’ great post:

You can see some more details on how planning and execution handoff looks in practice in Cursor’s Scaling Agents (noting that this browser thing leaned a bit toward marketing hype for me; still a cool benchmark and technique) and Anthropic’s Effective harnesses for long-running agents .
I’m a bit overfit to Claude Code-style tools ( see full list here ), but my continued understanding is that they are fairly similar across SDKs (or will be).

We do this a ton at work, and I found that Vercel GTM Engineering does something that looks quite similar.

Anthropic calls this “structured note taking” and Manus also discusses this in its blog post.
Long-horizon tasks: We are increasingly solving tasks that take hours of human equivalent time. Agents are now able to maintain capability even as the context window fills with thousands of tool calls. The human-equivalent time-horizon continues to grow 2 . Context Engineering is the new steering: It is becoming increasingly less about prompt, tool, or harness “engineering” and more about “context engineering” (organizing the environment). We steer agents by managing their file systems, creating markdown guide files, and progressively injected context. Sandboxes are default: Because agents are increasingly solving non-coding problems by writing code (e.g., “analyze this spreadsheet by writing a Python script” rather than “read this spreadsheet row by row”), they need a safe place to execute that code. This means nearly every serious agent now gets a personal ephemeral computer (VM) to run in. 3 Pragmatic Tool Calling: We are moving toward programmatic tool calling where agents write scripts to call tools in loops, batches, or complex sequences. This dramatically improves token efficiency (the agent reads the output of the script, not the 50 intermediate API calls) and reduces latency. Domain-agnostic harnesses: As models improve, the need for bespoke, product-specific agent harnesses is vanishing. For the last several agents I’ve built, it has been hard to justify maintaining a custom loop when I can just wrap a generic implementation like Claude Code (the Agents SDK ). The generic harness is often “good enough” for 90% of use cases. This diagram illustrates the convergence of agent design in 2026. We see the shift from rigid assembly lines to a fluid Planner and Builder (Execution Agent) loop, which spawns ephemeral Task Agents for sub-routines. 
Crucially, the entire system is grounded in a Code Execution Sandbox , allowing the agent to solve non-coding problems by writing scripts and leveraging Mount/API tools for massive context injection rather than fragile, individual tool calls. Planning, Execution, and Tasks One of the largest shifts in the last 18 months is the simplification and increased generalizability of subagents. In the past, we hand-crafted specific roles like "The SQL Specialist" or "The Researcher." Today, we are starting to see only see three forms of agents working in loops to accomplish a task: Plan Agents — An agent solely tasked with discovery, planning, and process optimization 4 . It performs just enough research to generate a map of the problem, providing specific pointers and definitions for an execution agent to take over. Execution Agents — The builder that goes and does the thing given a plan. It loads context from the pointers provided by the planner, writes scripts to manipulate that context, and verifies its own work. Task Agents — A transient sub-agent invoked by either a plan or execution agent for parallel or isolated sub-operations. This might look like an "explorer" agent for the planner or a "do operation on chunk X/10" for the execution agent. These are often launched dynamically as a tool-call with a subtask prompt generated on the fly by the calling agent. Bash — Runs an arbitrary bash command. Models like Claude often make assumptions about what tools already exist in the environment, so it is key to have a standard set of unix tools pre-installed on the VM (python3, find, etc.). Read/Write/Edit — Basic file system operations. Editing in systems like Claude Code is often done via a format which tends to be more reliable way of performing edits. Glob/Grep/LS — Dedicated filesystem exploration tools. While these might feel redundant with , they are often included for cross-platform compatibility and as a more curated, token-optimized alias for common operations. 
"API" Tools — These are designed for programmatic tool calling . They look like standard REST wrappers for performing CRUD operations on a data source (e.g., rather than a complex ). Since the agent can compose these tools inside a script, you can expose a large surface area of granular tools without wasting "always-attached" context tokens. This also solves a core problem with many API-like MCP server designs . "Mount" Tools — These are designed for bulk context injection into the agent's VM file system. They copy over and transform an external data source into a set of files that the agent can easily manipulate. For example, might write JSON or Markdown files directly to a VM directory like 6 . You might prefer the agent build artifacts indirectly through Python scripts (PowerPoint via python-pptx) and then run separate linting scripts to verify the output programmatically, rather than relying on a black-box or hand-crafted tool. You can give the agent access to raw binary files (PDFs, images) along with pre-installed libraries like or tools, letting it write a script to extract exactly what it needs instead of relying on pre-text-encoded representations. You can represent complex data objects as collections of searchable text files—for example, mounting a GitHub PR as and so the agent can use standard tools to search across them. You might use a “fake” git repository in the VM to simulate draft and publishing flows, allowing the agent to commit, branch, and merge changes that are translated into product concepts. You can seed the VM with a library of sample Bash or Python scripts that the agent can adapt or reuse at runtime, effectively building up a dynamic library of “skills”. Progressive disclosure — You start with an initial system prompt and design the context such that the agent efficiently accumulates the information it needs only as it calls tools. You can include just-in-time usage instructions in the output of a tool or pre-built script. 
If an agent tries and fails, the tool output can return the error along with a snippet from the docs on how to use it correctly. You can use markdown files placed in the file system as optional guides for tasks. A in the VM root lists available capabilities, but the agent only reads specific files like if and when it decides it needs to run a query. Context indirection — You leverage scripting capabilities to let the agent act on context without actually seeing it within its context window. Instead of reading a 500MB log file into context to find an error, the agent writes a or script to find lines matching “ERROR” and only reads the specific output of that script. You can intercept file operations to perform “blind reads.” When an agent attempts to read a placeholder path like , the harness intercepts this write, performs a search, and populates the file with relevant snippets just in time. Simplification — You use pre-trained model priors to reduce the need for context and rely more on agent intuition. If you have a complex internal graph database, you can give the agent a -compatible wrapper. The model already knows how to use perfectly, so zero-shot performance is significantly higher than teaching it a custom query language. If your system uses a legacy or obscure configuration format (like XML with custom schemas), you can automatically convert it to YAML or JSON when the agent reads it, and convert it back when the agent saves it. Todos — This is a meta-tool the agent uses to effectively keep a persistent TODO list (often seeded by a planning agent). While this is great for the human-facing UX, its primary function is to re-inject the remaining plan and goals into the end of the context window, where the model pays the most attention. 7 Reminders — This involves the harness dynamically injecting context at the end of tool-call results or user messages. 
The harness uses heuristics (e.g., "10 tool calls since the last reminder about X" or "user prompt contains keyword Y") to append a hint for the agent.

Automated Compaction — At some point, nearly the entire usable context window is taken up by past tool calls and results. Using a heuristic, the context window is passed to another agent (or just a single LLM call) to summarize the history and "reboot" the agent from that summary. While the effectiveness of resuming from a summary is still somewhat debated, it is better than hitting the context limit, and it works significantly better when tied to explicit checkpoints in the input plan.

Harness: Are you maintaining a domain-specific architecture hardcoded for your product? Consider refactoring to a generic, agnostic harness that delegates domain logic to context and tools, or wrapping a standard implementation like the Agents SDK.

Capabilities: Are your prompts cluttered with verbose tool definitions and subagent instructions? Consider moving that logic into “Skills” (markdown guides) and file system structures that the agent can discover progressively.

Tools: Do you have a sprawling library of specific tools? Consider deleting them. If the agent has a sandbox, it can likely solve all of those problems better by just writing a script.

Sandbox Security: How much flexibility is too much? Giving an agent a VM and the ability to execute arbitrary code opens up an entirely new surface area for security vulnerabilities. We are now mixing sensitive data inside containers that have (potentially) internet access and package managers. Preventing complex exfiltration or accidental destruction is an unsolved problem.

The Cost of Autonomy: We are no longer just paying for inference tokens; we are paying for runtime compute (VMs) and potentially thousands of internal tool loops. Do we care that a task now costs much more if it saves a human hour?
Or are we just banking on the “compute is too cheap to meter” future arriving faster than our cloud bills?

The Lifespan of “Context Engineering”: Today, we have to be thoughtful about how we organize the file system and write those markdown guides so the agent can find them. But is this just a temporary optimization? In six months, will models be smart enough (and context windows cheap enough) that we can just point them at a messy, undocumented data lake and say “figure it out”?
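The context-indirection pattern described above can be sketched in a few lines. The file name, pattern, and helper are invented for the example; the key property is that the huge file never enters the model's context, only the filtered output does.

```python
# "Context indirection" sketch: stream a big log and surface only the
# lines the agent actually needs to see.
from pathlib import Path

def grep_lines(log_path, pattern="ERROR", max_hits=20):
    """Stream the file; return only matching lines with line numbers."""
    hits = []
    with open(log_path) as f:
        for lineno, line in enumerate(f, 1):
            if pattern in line:
                hits.append(f"{lineno}: {line.rstrip()}")
                if len(hits) >= max_hits:  # cap what reaches the context
                    break
    return hits

# A stand-in for the 500MB log from the text.
Path("app.log").write_text(
    "INFO boot\nERROR disk full\nINFO retry\nERROR write failed\n"
)
print(grep_lines("app.log"))  # only two short lines enter the context
```

The `max_hits` cap matters in practice: without it, a pathological log could defeat the whole point by flooding the context anyway.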
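The automated-compaction heuristic discussed above can also be sketched. The character budget and the summary stub are invented for this example; in a real harness the summary would come from an LLM call rather than a placeholder string.

```python
# Compaction sketch: keep the newest turns verbatim, collapse the rest.
def compact(history, max_chars=200):
    """If the transcript exceeds the budget, summarize older messages."""
    if sum(len(m) for m in history) <= max_chars:
        return history                      # still fits; nothing to do
    keep, size = [], 0
    for msg in reversed(history):           # walk back from the newest
        if size + len(msg) > max_chars // 2:
            break
        keep.append(msg)
        size += len(msg)
    older = len(history) - len(keep)
    # Placeholder; a real harness would ask an LLM to write this summary.
    summary = f"[summary of {older} earlier messages]"
    return [summary] + list(reversed(keep))

history = [f"tool call {i}: " + "x" * 40 for i in range(10)]
print(compact(history)[0])  # → '[summary of 9 earlier messages]'
```

Tying the `summary` to explicit checkpoints in the input plan, as the text suggests, is what keeps the "rebooted" agent from losing its goal.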

Grumpy Gamer 1 month ago

This Time For Sure

I think 2026 is the year of Linux for me. I know I’ve said this before, but it feels like Apple has lost its way. Liquid Glass is the last straw, and their draconian desire to lock everything down gives me moral pause. It is only a matter of time before we can’t run software on the Mac that wasn’t purchased from the App Store. I use Linux on my servers so I am comfortable using it, just not in a desktop environment. Some things I worry about:

A really good C++ IDE. I get a lot of advice for C++ IDEs from people who only use them now and then or just to compile, but don’t live in them all day and need to visually step into code and even ASM. I worry about CLion but am willing to give it a good try. Please don’t suggest an IDE unless you use them for hardcore C++ debugging.

I will still make Mac versions of my games and code signing might be a problem. I’ll have to look, but I don’t think you can do it without a Mac. I can’t do that on a CI machine because for my pipeline the CI machine only compiles the code. The .app is built locally and that is where the code signing happens. I don’t want to spin up a CI machine to make changes when the engine didn’t change. My build pipeline is a running bash script; I don’t want to be hopping between machines just to do a build (which I can do 3 or 4 times a day).

The only monitor I have is a Mac Studio monitor. I assume I can plug a Linux machine into it, but I worry about the webcam. It wouldn’t surprise me if Apple made it Mac only.

The only keyboard I have is a Mac keyboard. I really like the keyboard, especially how I can unlock the computer with the touch of my finger. I assume something like this exists for Linux.

I have an iPhone but I only connect it to the computer to charge it. So not an issue.

I worry about drivers for sound, video, webcams, controllers, etc. I know this is all solvable but I’m not looking forward to it. I know from releasing games on Linux our number-one complaint is related to drivers.

Choosing a distro. Why is this so hard? A lot of people have said that it doesn’t really matter so just choose one. Why don’t more people use Linux on the Desktop? This is why. To a Linux desktop newbie, this is paralyzing.

I’m going to miss Time Machine for local backups. Maybe there is something like it for Linux.

I really like the Apple M processors. I might be able to install Linux on Mac hardware, but then I really worry about drivers. I just watched this video from Veronica Explains on installing Linux on Mac silicon.

The big, big worry is that there is something big I forgot. I need this to work for my game dev. It’s not a weekend hobby computer.

I’ve said I was switching to Linux before; we’ll see if it sticks this time. I have a Linux laptop, but when I moved I didn’t turn it on for over a year and now I get BIOS errors when I boot. Some battery probably went dead. I’ve played with it a bit and nothing seems to work. It was an old laptop and I’ll need a new, faster one for game dev anyway. This will be a long, well-thought-out journey. Stay tuned for the “2027 - This Time For Sure” post.

Simon Willison 2 months ago

2025: The year in LLMs

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months. For previous years see Stuff we figured out about AI in 2023 and Things we learned about LLMs in 2024 . It’s been a year filled with a lot of different trends. OpenAI kicked off the "reasoning" aka inference-scaling aka Reinforcement Learning from Verifiable Rewards (RLVR) revolution in September 2024 with o1 and o1-mini . They doubled down on that with o3, o3-mini and o4-mini in the opening months of 2025 and reasoning has since become a signature feature of models from nearly every other major AI lab. My favourite explanation of the significance of this trick comes from Andrej Karpathy : By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples). [...] Running RLVR turned out to offer high capability/$, which gobbled up the compute that was originally intended for pretraining. Therefore, most of the capability progress of 2025 was defined by the LLM labs chewing through the overhang of this new stage and overall we saw ~similar sized LLMs but a lot longer RL runs. Every notable AI lab released at least one reasoning model in 2025. Some labs released hybrids that could be run in reasoning or non-reasoning modes. Many API models now include dials for increasing or decreasing the amount of reasoning applied to a given prompt. It took me a while to understand what reasoning was useful for. Initial demos showed it solving mathematical logic puzzles and counting the Rs in strawberry - two things I didn't find myself needing in my day-to-day model usage. 
It turned out that the real unlock of reasoning was in driving tools. Reasoning models with access to tools can plan out multi-step tasks, execute on them and continue to reason about the results such that they can update their plans to better achieve the desired goal. A notable result is that AI assisted search actually works now . Hooking up search engines to LLMs had questionable results before, but now I find even my more complex research questions can often be answered by GPT-5 Thinking in ChatGPT . Reasoning models are also exceptional at producing and debugging code. The reasoning trick means they can start with an error and step through many different layers of the codebase to find the root cause. I've found even the gnarliest of bugs can be diagnosed by a good reasoner with the ability to read and execute code against even large and complex codebases. Combine reasoning with tool-use and you get... I started the year making a prediction that agents were not going to happen . Throughout 2024 everyone was talking about agents but there were few to no examples of them working, further confused by the fact that everyone using the term “agent” appeared to be working from a slightly different definition from everyone else. By September I’d got fed up of avoiding the term myself due to the lack of a clear definition and decided to treat them as an LLM that runs tools in a loop to achieve a goal . This unblocked me for having productive conversations about them, always my goal for any piece of terminology like that. I didn’t think agents would happen because I didn’t think the gullibility problem could be solved, and I thought the idea of replacing human staff members with LLMs was still laughable science fiction. I was half right in my prediction: the science fiction version of a magic computer assistant that does anything you ask of ( Her ) didn’t materialize... 
But if you define agents as LLM systems that can perform useful work via tool calls over multiple steps, then agents are here and they are proving to be extraordinarily useful. The two breakout categories for agents have been coding and search. The Deep Research pattern - where you challenge an LLM to gather information and it churns away for 15+ minutes building you a detailed report - was popular in the first half of the year but has fallen out of fashion now that GPT-5 Thinking (and Google's "AI mode", a significantly better product than their terrible "AI overviews") can produce comparable results in a fraction of the time. I consider this to be an agent pattern, and one that works really well. The "coding agents" pattern is a much bigger deal. The most impactful event of 2025 happened in February, with the quiet release of Claude Code. I say quiet because it didn’t even get its own blog post! Anthropic bundled the Claude Code release in as the second item in their post announcing Claude 3.7 Sonnet. (Why did Anthropic jump from Claude 3.5 Sonnet to 3.7? Because they released a major bump to Claude 3.5 in October 2024 but kept the name exactly the same, causing the developer community to start referring to the un-named 3.5 Sonnet v2 as 3.6. Anthropic burned a whole version number by failing to properly name their new model!) Claude Code is the most prominent example of what I call coding agents - LLM systems that can write code, execute that code, inspect the results and then iterate further. The major labs all put out their own CLI coding agents in 2025. Vendor-independent options include GitHub Copilot CLI, Amp, OpenCode, OpenHands CLI, and Pi. IDEs such as Zed, VS Code and Cursor invested a lot of effort in coding agent integration as well. My first exposure to the coding agent pattern was OpenAI's ChatGPT Code Interpreter in early 2023 - a system baked into ChatGPT that allowed it to run Python code in a Kubernetes sandbox.
I was delighted this year when Anthropic finally released their equivalent in September, albeit under the baffling initial name of "Create and edit files with Claude". In October they repurposed that container sandbox infrastructure to launch Claude Code for web, which I've been using on an almost daily basis ever since. Claude Code for web is what I call an asynchronous coding agent - a system you can prompt and forget, and it will work away on the problem and file a Pull Request once it's done. OpenAI's "Codex cloud" (renamed to "Codex web" in the last week) launched earlier, in May 2025. Gemini's entry in this category is called Jules, also launched in May. I love the asynchronous coding agent category. They're a great answer to the security challenges of running arbitrary code execution on a personal laptop, and it's really fun being able to fire off multiple tasks at once - often from my phone - and get decent results a few minutes later. I wrote more about how I'm using these in Code research projects with async coding agents like Claude Code and Codex and Embracing the parallel coding agent lifestyle. In 2024 I spent a lot of time hacking on my LLM command-line tool for accessing LLMs from the terminal, all the time thinking that it was weird that so few people were taking CLI access to models seriously - they felt like such a natural fit for Unix mechanisms like pipes. Maybe the terminal was just too weird and niche to ever become a mainstream tool for accessing LLMs? Claude Code and friends have conclusively demonstrated that developers will embrace LLMs on the command line, given powerful enough models and the right harness. It helps that terminal commands with obscure syntax are no longer a barrier to entry when an LLM can spit out the right command for you. As of December 2nd Anthropic credit Claude Code with $1bn in run-rate revenue! I did not expect a CLI tool to reach anything close to those numbers.
With hindsight, maybe I should have promoted LLM from a side-project to a key focus! The default setting for most coding agents is to ask the user for confirmation for almost every action they take. In a world where an agent mistake could wipe your home folder or a malicious prompt injection attack could steal your credentials, this default makes total sense. Anyone who's tried running their agent with automatic confirmation (aka YOLO mode) has experienced the trade-off: using an agent without the safety wheels feels like a completely different product. A big benefit of asynchronous coding agents like Claude Code for web and Codex Cloud is that they can run in YOLO mode by default, since there's no personal computer to damage. I run in YOLO mode all the time, despite being deeply aware of the risks involved. It hasn't burned me yet... and that's the problem. One of my favourite pieces on LLM security this year is The Normalization of Deviance in AI by security researcher Johann Rehberger. Johann describes the "Normalization of Deviance" phenomenon, where repeated exposure to risky behaviour without negative consequences leads people and organizations to accept that risky behaviour as normal. This was originally described by sociologist Diane Vaughan as part of her work to understand the 1986 Space Shuttle Challenger disaster, caused by a faulty O-ring that engineers had known about for years. Plenty of successful launches led NASA culture to stop taking that risk seriously. Johann argues that the longer we get away with running these systems in fundamentally insecure ways, the closer we are getting to a Challenger disaster of our own. ChatGPT Plus's original $20/month price turned out to be a snap decision by Nick Turley based on a Google Form poll on Discord. That price point has stuck firmly ever since. This year a new pricing precedent has emerged: the Claude Pro Max 20x plan, at $200/month.
OpenAI have a similar $200 plan called ChatGPT Pro. Gemini have Google AI Ultra at $249/month with a $124.99/month 3-month starting discount. These plans appear to be driving some serious revenue, though none of the labs have shared figures that break down their subscribers by tier. I've personally paid $100/month for Claude in the past and will upgrade to the $200/month plan once my current batch of free allowance (from previewing one of their models - thanks, Anthropic) runs out. I've heard from plenty of other people who are happy to pay these prices too. You have to use models a lot in order to spend $200 of API credits, so you would think it would make economic sense for most people to pay by the token instead. It turns out tools like Claude Code and Codex CLI can burn through enormous amounts of tokens once you start setting them more challenging tasks, to the point that $200/month offers a substantial discount. 2024 saw some early signs of life from the Chinese AI labs mainly in the form of Qwen 2.5 and early DeepSeek. They were neat models but didn't feel world-beating. This changed dramatically in 2025. My ai-in-china tag has 67 posts from 2025 alone, and I missed a bunch of key releases towards the end of the year (GLM-4.7 and MiniMax-M2.1 in particular.) Here's the Artificial Analysis ranking for open weight models as-of 30th December 2025 : GLM-4.7, Kimi K2 Thinking, MiMo-V2-Flash, DeepSeek V3.2, MiniMax-M2.1 are all Chinese open weight models. The highest non-Chinese model in that chart is OpenAI's gpt-oss-120B (high), which comes in sixth place. The Chinese model revolution really kicked off on Christmas day 2024 with the release of DeepSeek 3 , supposedly trained for around $5.5m. DeepSeek followed that on 20th January with DeepSeek R1 which promptly triggered a major AI/semiconductor selloff : NVIDIA lost ~$593bn in market cap as investors panicked that AI maybe wasn't an American monopoly after all. 
The panic didn't last - NVIDIA quickly recovered and today are up significantly from their pre-DeepSeek R1 levels. It was still a remarkable moment. Who knew an open weight model release could have that kind of impact? DeepSeek were quickly joined by an impressive roster of Chinese AI labs. I've been paying attention to these ones in particular. Most of these models aren't just open weight, they are fully open source under OSI-approved licenses: Qwen use Apache 2.0 for most of their models, DeepSeek and Z.ai use MIT. Some of them are competitive with Claude 4 Sonnet and GPT-5! Sadly none of the Chinese labs have released their full training data or the code they used to train their models, but they have been putting out detailed research papers that have helped push forward the state of the art, especially when it comes to efficient training and inference. One of the most interesting recent charts about LLMs is Time-horizon of software engineering tasks different LLMs can complete 50% of the time from METR: The chart shows tasks that take humans up to 5 hours, and plots the evolution of models that can achieve the same goals working independently. As you can see, 2025 saw some enormous leaps forward here with GPT-5, GPT-5.1 Codex Max and Claude Opus 4.5 able to perform tasks that take humans multiple hours - 2024’s best models tapped out at under 30 minutes. METR conclude that “the length of tasks AI can do is doubling every 7 months”. I'm not convinced that pattern will continue to hold, but it's an eye-catching way of illustrating current trends in agent capabilities. The most successful consumer product launch of all time happened in March, and the product didn't even have a name. One of the signature features of GPT-4o in May 2024 was meant to be its multimodal output - the "o" stood for "omni" and OpenAI's launch announcement included numerous "coming soon" features where the model output images in addition to text. Then... nothing.
The image output feature failed to materialize. In March we finally got to see what this could do - albeit in a shape that felt more like the existing DALL-E. OpenAI made this new image generation available in ChatGPT with the key feature that you could upload your own images and use prompts to tell it how to modify them. This new feature was responsible for 100 million ChatGPT signups in a week. At peak they saw 1 million account creations in a single hour! Tricks like "ghiblification" - modifying a photo to look like a frame from a Studio Ghibli movie - went viral time and time again. OpenAI released an API version of the model called "gpt-image-1", later joined by a cheaper gpt-image-1-mini in October and a much improved gpt-image-1.5 on December 16th . The most notable open weight competitor to this came from Qwen with their Qwen-Image generation model on August 4th followed by Qwen-Image-Edit on August 19th . This one can run on (well equipped) consumer hardware! They followed with Qwen-Image-Edit-2511 in November and Qwen-Image-2512 on 30th December, neither of which I've tried yet. The even bigger news in image generation came from Google with their Nano Banana models, available via Gemini. Google previewed an early version of this in March under the name "Gemini 2.0 Flash native image generation". The really good one landed on August 26th , where they started cautiously embracing the codename "Nano Banana" in public (the API model was called " Gemini 2.5 Flash Image "). Nano Banana caught people's attention because it could generate useful text ! It was also clearly the best model at following image editing instructions. In November Google fully embraced the "Nano Banana" name with the release of Nano Banana Pro . This one doesn't just generate text, it can output genuinely useful detailed infographics and other text and information-heavy images. It's now a professional-grade tool. 
Max Woolf published the most comprehensive guide to Nano Banana prompting, and followed that up with an essential guide to Nano Banana Pro in December. I've mainly been using it to add kākāpō parrots to my photos. Given how incredibly popular these image tools are, it's a little surprising that Anthropic haven't released or integrated anything similar into Claude. I see this as further evidence that they're focused on AI tools for professional work, but Nano Banana Pro is rapidly proving itself to be of value to anyone whose work involves creating presentations or other visual materials. In July reasoning models from both OpenAI and Google Gemini achieved gold medal performance in the International Math Olympiad, a prestigious mathematical competition held annually (bar 1980) since 1959. This was notable because the IMO poses challenges that are designed specifically for that competition. There's no chance any of these were already in the training data! It's also notable because neither of the models had access to tools - their solutions were generated purely from their internal knowledge and token-based reasoning capabilities. Turns out sufficiently advanced LLMs can do math after all! In September OpenAI and Gemini pulled off a similar feat for the International Collegiate Programming Contest (ICPC) - again notable for having novel, previously unpublished problems. This time the models had access to a code execution environment but otherwise no internet access. I don't believe the exact models used for these competitions have been released publicly, but Gemini's Deep Think and OpenAI's GPT-5 Pro should provide close approximations. With hindsight, 2024 was the year of Llama. Meta's Llama models were by far the most popular open weight models - the original Llama kicked off the open weight revolution back in 2023 and the Llama 3 series, in particular the 3.1 and 3.2 dot-releases, were huge leaps forward in open weight capability.
Llama 4 had high expectations, and when it landed in April it was... kind of disappointing. There was a minor scandal where the model tested on LMArena turned out not to be the model that was released, but my main complaint was that the models were too big . The neatest thing about previous Llama releases was that they often included sizes you could run on a laptop. The Llama 4 Scout and Maverick models were 109B and 400B, so big that even quantization wouldn't get them running on my 64GB Mac. They were trained using the 2T Llama 4 Behemoth which seems to have been forgotten now - it certainly wasn't released. It says a lot that none of the most popular models listed by LM Studio are from Meta, and the most popular on Ollama is still Llama 3.1, which is low on the charts there too. Meta's AI news this year mainly involved internal politics and vast amounts of money spent hiring talent for their new Superintelligence Labs . It's not clear if there are any future Llama releases in the pipeline or if they've moved away from open weight model releases to focus on other things. Last year OpenAI remained the undisputed leader in LLMs, especially given o1 and the preview of their o3 reasoning models. This year the rest of the industry caught up. OpenAI still have top tier models, but they're being challenged across the board. In image models they're still being beaten by Nano Banana Pro. For code a lot of developers rate Opus 4.5 very slightly ahead of GPT-5.2 Codex. In open weight models their gpt-oss models, while great, are falling behind the Chinese AI labs. Their lead in audio is under threat from the Gemini Live API . Where OpenAI are winning is in consumer mindshare. Nobody knows what an "LLM" is but almost everyone has heard of ChatGPT. Their consumer apps still dwarf Gemini and Claude in terms of user numbers. Their biggest risk here is Gemini. 
In December OpenAI declared a Code Red in response to Gemini 3, delaying work on new initiatives to focus on the competition with their key products. Google Gemini had a really good year . They posted their own victorious 2025 recap here . 2025 saw Gemini 2.0, Gemini 2.5 and then Gemini 3.0 - each model family supporting audio/video/image/text input of 1,000,000+ tokens, priced competitively and proving more capable than the last. They also shipped Gemini CLI (their open source command-line coding agent, since forked by Qwen for Qwen Code ), Jules (their asynchronous coding agent), constant improvements to AI Studio, the Nano Banana image models, Veo 3 for video generation, the promising Gemma 3 family of open weight models and a stream of smaller features. Google's biggest advantage lies under the hood. Almost every other AI lab trains with NVIDIA GPUs, which are sold at a margin that props up NVIDIA's multi-trillion dollar valuation. Google use their own in-house hardware, TPUs, which they've demonstrated this year work exceptionally well for both training and inference of their models. When your number one expense is time spent on GPUs, having a competitor with their own, optimized and presumably much cheaper hardware stack is a daunting prospect. It continues to tickle me that Google Gemini is the ultimate example of a product name that reflects the company's internal org-chart - it's called Gemini because it came out of the bringing together (as twins) of Google's DeepMind and Google Brain teams. I first asked an LLM to generate an SVG of a pelican riding a bicycle in October 2024 , but 2025 is when I really leaned into it. It's ended up a meme in its own right. I originally intended it as a dumb joke. Bicycles are hard to draw, as are pelicans, and pelicans are the wrong shape to ride a bicycle. 
I was pretty sure there wouldn't be anything relevant in the training data, so asking a text-output model to generate an SVG illustration of one felt like a somewhat absurdly difficult challenge. To my surprise, there appears to be a correlation between how good the model is at drawing pelicans on bicycles and how good it is overall. I don't really have an explanation for this. The pattern only became clear to me when I was putting together a last-minute keynote (they had a speaker drop out) for the AI Engineer World's Fair in July. You can read (or watch) the talk I gave here: The last six months in LLMs, illustrated by pelicans on bicycles. My full collection of illustrations can be found on my pelican-riding-a-bicycle tag - 89 posts and counting. There is plenty of evidence that the AI labs are aware of the benchmark. It showed up (for a split second) in the Google I/O keynote in May, got a mention in an Anthropic interpretability research paper in October and I got to talk about it in a GPT-5 launch video filmed at OpenAI HQ in August. Are they training specifically for the benchmark? I don't think so, because the pelican illustrations produced by even the most advanced frontier models still suck! In What happens if AI labs train for pelicans riding bicycles? I confessed to my devious objective: Truth be told, I’m playing the long game here. All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one. My favourite is still this one that I got from GPT-5. I started my tools.simonwillison.net site last year as a single location for my growing collection of vibe-coded / AI-assisted HTML+JavaScript tools. I wrote several longer pieces about this throughout the year. The new browse all by month page shows I built 110 of these in 2025!
I really enjoy building in this way, and I think it's a fantastic way to practice and explore the capabilities of these models. Almost every tool is accompanied by a commit history that links to the prompts and transcripts I used to build them. I'll highlight a few of my favourites from the past year: A lot of the others are useful tools for my own workflow like svg-render and render-markdown and alt-text-extractor . I built one that does privacy-friendly personal analytics against localStorage to keep track of which tools I use the most often. Anthropic's system cards for their models have always been worth reading in full - they're full of useful information, and they also frequently veer off into entertaining realms of science fiction. The Claude 4 system card in May had some particularly fun moments - highlights mine: Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts. This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users , given access to a command line, and told something in the system prompt like “ take initiative ,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. In other words, Claude 4 might snitch you out to the feds. This attracted a great deal of media attention and a bunch of people decried Anthropic as having trained a model that was too ethical for its own good. Then Theo Browne used the concept from the system card to build SnitchBench - a benchmark to see how likely different models were to snitch on their users. It turns out they almost all do the same thing ! Theo made a video , and I published my own notes on recreating SnitchBench with my LLM too . 
The key prompt that makes this work is: I recommend not putting that in your system prompt! Anthropic's original Claude 4 system card said the same thing: We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable. In a tweet in February Andrej Karpathy coined the term "vibe coding", with an unfortunately long definition (I miss the 140 character days) that many people failed to read all the way to the end: There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works. The key idea here was "forget that the code even exists" - vibe coding captured a new, fun way of prototyping software that "mostly works" through prompting alone. I don't know if I've ever seen a new term catch on - or get distorted - so quickly in my life. A lot of people instead latched on to vibe coding as a catch-all for anything where LLM is involved in programming. 
I think that's a waste of a great term, especially since it's becoming clear that most programming will involve some level of AI-assistance in the near future. Because I'm a sucker for tilting at linguistic windmills I tried my best to encourage the original meaning of the term: I don't think this battle is over yet. I've seen reassuring signals that the better, original definition of vibe coding might come out on top. I should really get a less confrontational linguistic hobby! Anthropic introduced their Model Context Protocol specification in November 2024 as an open standard for integrating tool calls with different LLMs. In early 2025 it exploded in popularity. There was a point in May where OpenAI , Anthropic , and Mistral all rolled out API-level support for MCP within eight days of each other! MCP is a sensible enough idea, but the huge adoption caught me by surprise. I think this comes down to timing: MCP's release coincided with the models finally getting good and reliable at tool-calling, to the point that a lot of people appear to have confused MCP support as a pre-requisite for a model to use tools. For a while it also felt like MCP was a convenient answer for companies that were under pressure to have "an AI strategy" but didn't really know how to do that. Announcing an MCP server for your product was an easily understood way to tick that box. The reason I think MCP may be a one-year wonder is the stratospheric growth of coding agents. It appears that the best possible tool for any situation is Bash - if your agent can run arbitrary shell commands, it can do anything that can be done by typing commands into a terminal. Since leaning heavily into Claude Code and friends myself I've hardly used MCP at all - I've found CLI tools like and libraries like Playwright to be better alternatives to the GitHub and Playwright MCPs. 
Anthropic themselves appeared to acknowledge this later in the year with their release of the brilliant Skills mechanism - see my October post Claude Skills are awesome, maybe a bigger deal than MCP . MCP involves web servers and complex JSON payloads. A Skill is a Markdown file in a folder, optionally accompanied by some executable scripts. Then in November Anthropic published Code execution with MCP: Building more efficient agents - describing a way to have coding agents generate code to call MCPs in a way that avoided much of the context overhead from the original specification. (I'm proud of the fact that I reverse-engineered Anthropic's skills a week before their announcement , and then did the same thing to OpenAI's quiet adoption of skills two months after that .) MCP was donated to the new Agentic AI Foundation at the start of December. Skills were promoted to an "open format" on December 18th . Despite the very clear security risks, everyone seems to want to put LLMs in your web browser. OpenAI launched ChatGPT Atlas in October, built by a team including long-time Google Chrome engineers Ben Goodger and Darin Fisher. Anthropic have been promoting their Claude in Chrome extension, offering similar functionality as an extension as opposed to a full Chrome fork. Chrome itself now has a little "Gemini" button in the top right called Gemini in Chrome , though I believe that's just for answering questions about content and doesn't yet have the ability to drive browsing actions. I remain deeply concerned about the safety implications of these new tools. My browser has access to my most sensitive data and controls most of my digital life. A prompt injection attack against a browsing agent that can exfiltrate or modify that data is a terrifying prospect. 
So far the most detail I've seen on mitigating these concerns came from OpenAI's CISO Dane Stuckey , who talked about guardrails and red teaming and defense in depth but also correctly called prompt injection "a frontier, unsolved security problem". I've used these browser agents a few times now ( example ), under very close supervision. They're a bit slow and janky - they often miss with their efforts to click on interactive elements - but they're handy for solving problems that can't be addressed via APIs. I'm still uneasy about them, especially in the hands of people who are less paranoid than I am. I've been writing about prompt injection attacks for more than three years now. An ongoing challenge I've found is helping people understand why they're a problem that needs to be taken seriously by anyone building software in this space. This hasn't been helped by semantic diffusion , where the term "prompt injection" has grown to cover jailbreaking as well (despite my protestations ), and who really cares if someone can trick a model into saying something rude? So I tried a new linguistic trick! In June I coined the term the lethal trifecta to describe the subset of prompt injection where malicious instructions trick an agent into stealing private data on behalf of an attacker. A trick I use here is that people will jump straight to the most obvious definition of any new term that they hear. "Prompt injection" sounds like it means "injecting prompts". "The lethal trifecta" is deliberately ambiguous: you have to go searching for my definition if you want to know what it means! It seems to have worked. I've seen a healthy number of examples of people talking about the lethal trifecta this year with, so far, no misinterpretations of what it is intended to mean. I wrote significantly more code on my phone this year than I did on my computer. Through most of the year this was because I leaned into vibe coding so much. 
My tools.simonwillison.net collection of HTML+JavaScript tools was mostly built this way: I would have an idea for a small project, prompt Claude Artifacts or ChatGPT or (more recently) Claude Code via their respective iPhone apps, then either copy the result and paste it into GitHub's web editor or wait for a PR to be created that I could then review and merge in Mobile Safari. Those HTML tools are often ~100-200 lines of code, full of uninteresting boilerplate and duplicated CSS and JavaScript patterns - but 110 of them adds up to a lot! Up until November I would have said that I wrote more code on my phone, but the code I wrote on my laptop was clearly more significant - fully reviewed, better tested and intended for production use. In the past month I've grown confident enough in Claude Opus 4.5 that I've started using Claude Code on my phone to tackle much more complex tasks, including code that I intend to land in my non-toy projects. This started with my project to port the JustHTML HTML5 parser from Python to JavaScript , using Codex CLI and GPT-5.2. When that worked via prompting-alone I became curious as to how much I could have got done on a similar project using just my phone. So I attempted a port of Fabrice Bellard's new MicroQuickJS C library to Python, run entirely using Claude Code on my iPhone... and it mostly worked ! Is it code that I'd use in production? Certainly not yet for untrusted code , but I'd trust it to execute JavaScript I'd written myself. The test suite I borrowed from MicroQuickJS gives me some confidence there. This turns out to be the big unlock: the latest coding agents against the ~November 2025 frontier models are remarkably effective if you can give them an existing test suite to work against. 
I call these conformance suites and I've started deliberately looking out for them - so far I've had success with the html5lib tests , the MicroQuickJS test suite and a not-yet-released project against the comprehensive WebAssembly spec/test collection . If you're introducing a new protocol or even a new programming language to the world in 2026 I strongly recommend including a language-agnostic conformance suite as part of your project. I've seen plenty of hand-wringing that the need to be included in LLM training data means new technologies will struggle to gain adoption. My hope is that the conformance suite approach can help mitigate that problem and make it easier for new ideas of that shape to gain traction. Towards the end of 2024 I was losing interest in running local LLMs on my own machine. My interest was re-kindled by Llama 3.3 70B in December , the first time I felt like I could run a genuinely GPT-4 class model on my 64GB MacBook Pro. Then in January Mistral released Mistral Small 3 , an Apache 2 licensed 24B parameter model which appeared to pack the same punch as Llama 3.3 70B using around a third of the memory. Now I could run a ~GPT-4 class model and have memory left over to run other apps! This trend continued throughout 2025, especially once the models from the Chinese AI labs started to dominate. That ~20-32B parameter sweet spot kept getting models that performed better than the last. I got small amounts of real work done offline! My excitement for local LLMs was very much rekindled. The problem is that the big cloud models got better too - including those open weight models that, while freely available, were far too large (100B+) to run on my laptop. Coding agents changed everything for me. Systems like Claude Code need more than a great model - they need a reasoning model that can perform reliable tool calling invocations dozens if not hundreds of times over a constantly expanding context window. 
I have yet to try a local model that handles Bash tool calls reliably enough for me to trust that model to operate a coding agent on my device. My next laptop will have at least 128GB of RAM, so there's a chance that one of the 2026 open weight models might fit the bill. For now though I'm sticking with the best available frontier hosted models as my daily drivers. I played a tiny role helping to popularize the term "slop" in 2024, writing about it in May and landing quotes in the Guardian and the New York Times shortly afterwards. This year Merriam-Webster crowned it word of the year ! slop ( noun ): digital content of low quality that is produced usually in quantity by means of artificial intelligence I like that it represents a widely understood feeling that poor quality AI-generated content is bad and should be avoided. I'm still holding hope that slop won't end up as bad a problem as many people fear. The internet has always been flooded with low quality content. The challenge, as ever, is to find and amplify the good stuff. I don't see the increased volume of junk as changing that fundamental dynamic much. Curation matters more than ever. That said... I don't use Facebook, and I'm pretty careful at filtering or curating my other social media habits. Is Facebook still flooded with Shrimp Jesus or was that a 2024 thing? I heard fake videos of cute animals getting rescued are the latest trend. It's quite possible the slop problem is a growing tidal wave that I'm innocently unaware of. I nearly skipped writing about the environmental impact of AI for this year's post (here's what I wrote in 2024 ) because I wasn't sure if we had learned anything new this year - AI data centers continue to burn vast amounts of energy and the arms race to build them continues to accelerate in a way that feels unsustainable. What's interesting in 2025 is that public opinion appears to be shifting quite dramatically against new data center construction. 
Here's a Guardian headline from December 8th: More than 200 environmental groups demand halt to new US datacenters . Opposition at the local level appears to be rising sharply across the board too. I've been convinced by Andy Masley that the water usage issue is mostly overblown, which is a problem mainly because it acts as a distraction from the very real issues around energy consumption, carbon emissions and noise pollution. AI labs continue to find new efficiencies to help serve increased quality of models using less energy per token, but the impact of that is classic Jevons paradox - as tokens get cheaper we find more intense ways to use them, like spending $200/month on millions of tokens to run coding agents. As an obsessive collector of neologisms, here are my own favourites from 2025. You can see a longer list in my definitions tag . If you've made it this far, I hope you've found this useful! You can subscribe to my blog in a feed reader or via email , or follow me on Bluesky or Mastodon or Twitter . If you'd like a review like this on a monthly basis instead I also operate a $10/month sponsors only newsletter with a round-up of the key developments in the LLM space over the past 30 days. Here are preview editions for September , October , and November - I'll be sending December's out some time tomorrow. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . 
- The year of "reasoning"
- The year of agents
- The year of coding agents and Claude Code
- The year of LLMs on the command-line
- The year of YOLO and the Normalization of Deviance
- The year of $200/month subscriptions
- The year of top-ranked Chinese open weight models
- The year of long tasks
- The year of prompt-driven image editing
- The year models won gold in academic competitions
- The year that Llama lost its way
- The year that OpenAI lost their lead
- The year of Gemini
- The year of pelicans riding bicycles
- The year I built 110 tools
- The year of the snitch!
- The year of vibe coding
- The (only?) year of MCP
- The year of alarmingly AI-enabled browsers
- The year of the lethal trifecta
- The year of programming on my phone
- The year of conformance suites
- The year local models got good, but cloud models got even better
- The year of slop
- The year that data centers got extremely unpopular
- My own words of the year
- That's a wrap for 2025

Claude Code Mistral Vibe

- Alibaba Qwen (Qwen3)
- Moonshot AI (Kimi K2)
- Z.ai (GLM-4.5/4.6/4.7)
- MiniMax (M2)
- MetaStone AI (XBai o4)

- Here’s how I use LLMs to help me write code
- Adding AI-generated descriptions to my tools collection
- Building a tool to copy-paste share terminal sessions using Claude Code for web
- Useful patterns for building HTML tools - my favourite post of the bunch.

- blackened-cauliflower-and-turkish-style-stew is ridiculous. It's a custom cooking timer app for anyone who needs to prepare Green Chef's Blackened Cauliflower and Turkish-style Spiced Chickpea Stew recipes at the same time. Here's more about that one.
- is-it-a-bird takes inspiration from xkcd 1425, loads a 150MB CLIP model via Transformers.js and uses it to say if an image or webcam feed is a bird or not.
- bluesky-thread lets me view any thread on Bluesky with a "most recent first" option to make it easier to follow new posts as they arrive.
- Not all AI-assisted programming is vibe coding (but vibe coding rocks) in March
- Two publishers and three authors fail to understand what “vibe coding” means in May (one book subsequently changed its title to the much better "Beyond Vibe Coding").
- Vibe engineering in October, where I tried to suggest an alternative term for what happens when professional engineers use AI assistance to build production-grade software.
- Your job is to deliver code you have proven to work in December, about how professional software development is about code that demonstrably works, no matter how you built it.

- Vibe coding, obviously.
- Vibe engineering - I'm still on the fence about whether I should try to make this happen!
- The lethal trifecta, my one attempted coinage of the year that seems to have taken root.
- Context rot, by Workaccount2 on Hacker News, for the thing where model output quality falls as the context grows longer during a session.
- Context engineering as an alternative to prompt engineering that helps emphasize how important it is to design the context you feed to your model.
- Slopsquatting by Seth Larson, where an LLM hallucinates an incorrect package name which is then maliciously registered to deliver malware.
- Vibe scraping - another of mine that didn't really go anywhere, for scraping projects implemented by coding agents driven by prompts.
- Asynchronous coding agent for Claude for web / Codex cloud / Google Jules
- Extractive contributions by Nadia Eghbal for open source contributions where "the marginal cost of reviewing and merging that contribution is greater than the marginal benefit to the project’s producers".

neilzone 2 months ago

#FreeSoftwareAdvent 2025

Here are all the toots I posted for this year’s #FreeSoftwareAdvent. Today’s #FreeSoftwareAdvent - 24 days of toots about Free software that I value - is off to a strong start, since you get not one, not two, not three, but four pieces of software: yt-dlp: command line tools for downloading from various online sources: https://github.com/yt-dlp/yt-dlp Handbrake: ripping DVDs you’ve bought: https://handbrake.fr/ SoundJuicer: ripping CDs you’ve bought: https://gitlab.gnome.org/GNOME/sound-juicer Libation: downloading audiobooks you’ve bought: https://github.com/rmcrackan/Libation Each is a useful starting point for building your own media library. Today’s #FreeSoftwareAdvent is Libreboot. Libreboot is a Free and Open Source BIOS/UEFI boot firmware. It is much to my chagrin that I do not have a Free software BIOS/UEFI on my machines :( https://libreboot.org/ Today’s #FreeSoftwareAdvent is Cryptomator. I use Cryptomator for end-to-end encryption of files which I store/sync on Nextcloud. It means that, if my server were compromised, all the attacker would get is encrypted objects. (In my case, my own instance of Nextcloud, which is not publicly accessible, so this might be overkill, but there we go…) https://cryptomator.org/ Today’s #FreeSoftwareAdvent is tmux, the terminal multiplexer. I know that tmux will be second nature to many of you, but I’ve been gradually learning it over the last year or so ( https://neilzone.co.uk/2025/06/more-tinkering-with-tmux/) , and I find it invaluable. Although I had used screen for years, I wish that I had known about tmux sooner. I use tmux both on local machines, for a convenient terminal environment (from which I can do an awful lot of what I want to do, computing-wise), and also on remote machines for persistent / recoverable terminal sessions. https://github.com/tmux/tmux Today’s #FreeSoftwareAdvent is the Lyrion Music Server. 
While there are plenty of great FOSS media services out there, Lyrion solves one particular problem for me: it breathes life back into old Logitech Squeezebox units. What’s not to love about a multi-room music setup using Free software and reusing old hardware! :) https://lyrion.org/ Today’s #FreeSoftwareAdvent is forgejo, a self-hostable code forge. Although I use a headless git server for some things, I’ve been happily experimenting with a public, self-hosted instance of forgejo this year. I look forward to federation functionality in due course! https://forgejo.org/ Today’s #FreeSoftwareAdvent is Vaultwarden. Vaultwarden is a readily self-hosted alternative implementation of Bitwarden, a password / credential manager. It works with the Bitwarden clients. Sandra and I use it to share passwords with each other, as well as keeping our own separate vaults in sync across multiple devices. https://github.com/dani-garcia/vaultwarden Today’s #FreeSoftwareAdvent is all about XMPP. Snikket ( https://snikket.org/ ) is an easy-to-install, and easy-to-administer, XMPP server. It is designed for families and other small groups. The apps for Android and iOS are great. Dino ( https://dino.im/ ) is my desktop XMPP client of choice. Profanity ( https://profanity-im.github.io/ ) is a terminal / console XMPP client, which is incredibly convenient. Why not have a fun festive project of setting up an XMPP-based chat server for you and your family and friends? Today’s #FreeSoftwareAdvent is a double whammy of shell-based “checking” utilities. linkchecker ( https://linux.die.net/man/1/linkchecker; probably get it from your distro) is a tool for checking websites (recursively, as you wish) for broken links. shellcheck ( https://www.shellcheck.net/; available in browser or as a terminal tool) is a tool for checking bash and other shell scripts for bugs. 
Today’s #FreeSoftwareAdvent (better late than never) is “WG Tunnel”, an application for Android for bringing up and tearing down WireGuard tunnels. I use it to keep a WireGuard tunnel running all the time on my phone. I don’t do anything fancy with it, but if one wanted to use different tunnels for different things, or when connected to different networks, or wanted a SOCKS5 proxy for other apps, it is definitely worth a look. https://wgtunnel.com/ (And available via a custom F-Droid repo.) Today’s #FreeSoftwareAdvent is another bundle, of four amazing terminal utilities for keeping organised: mutt, an incredibly versatile email client: http://www.mutt.org/ vdirsyncer, a tool for synchronising calendars and address books: https://vdirsyncer.pimutils.org/en/stable/ khard, for interacting with contacts (vCards): https://github.com/lucc/khard khal, for interacting with calendars: https://github.com/pimutils/khal Use with tmux to get all you want to see on one screen / via one command :) Today’s #FreeSoftwareAdvent is LibreOffice. I know that I’ve included this in previous #FreeSoftwareAdvent toots, but I use LibreOffice, particularly Writer, every day. Better still, this one isn’t just for Linux users, so if you are thinking about moving away from your Microsoft or Apple office suite, you can give LibreOffice a try, for free, alongside whatever you use currently, and give it a good go before you make a decision. Today’s #FreeSoftwareAdvent is Kimai, a self-hostable time tracking tool, which I use to keep records of all the legal work that I do, as well as time spent on committees etc. It has excellent, and flexible, reporting options - which I use for generating timesheets each month - and there is an integrated invoicing tool (which I do not use). https://www.kimai.org/ Today’s #FreeSoftwareAdvent is another one which might appeal for #freelance work: InvoicePlane which is - you’ve guessed it - an invoicing tool. 
I have used InvoicePlane for decoded.legal’s invoicing for coming up on ten years now, and it has been great. Dead easy to back up and restore too, which is always welcome. I think that it is designed to be exposed to customers (i.e. customers interact with it), but I do not do this. Kimai (yesterday’s entry) also offers invoicing, but for whatever reason - I don’t remember - I went with a separate tool when I set things up. (I’ve also heard good things about InvoiceNinja, but I have not used it myself.) https://www.invoiceplane.com/ Today’s #FreeSoftwareAdvent is EspoCRM, a self-hostable customer relationship management tool. I use it to keep track of my clients, their matters, and the tasks for each matter, and I’ve integrated various regulatory compliance requirements into different steps / activities. It took me a fair amount of time to customise to my needs, and I’m really pleased with it. https://www.espocrm.com/ Today’s #FreeSoftwareAdvent is espanso, a text expansion tool. You specify a trigger word/phrase and a replacement, and, when you type that trigger, espanso replaces it automatically with the replacement text. So, for instance, I have a trigger for “.date”, and whenever I type that, it replaces it with the current date (“2025-12-16”). It also has a “forms” tool, so I can trigger a form, fill it in, and then espanso puts in all the content - so great for structured paragraphs with placeholders, basically. I would be absolutely lost without espanso, as it saves me so many keystrokes. https://espanso.org/ Today’s #FreeSoftwareAdvent is text-mode web browser, links. Because, sometimes, you just want to be able to browse without all the cruft of the modern web. You don’t have to stop using your current GUI browser to give links, or another text-mode browser, a try. I find browsing with links more relaxing, with fewer distractions and annoyances, although it is not a good choice for some sites. Bonus: no-one is trying to force AI into it. 
https://links.twibright.com/user_en.html Today’s #FreeSoftwareAdvent is reveal.js, my tool of choice for writing presentations. I write the content in Markdown, and… that’s it. Thanks to CSS, tada, I have a perfectly-formatted, ready-to-go, presentation. It has speaker view, allows for easy PDF exports, and, because it is html, css, and JavaScript, you can upload the presentation to your website, so that people can click/swipe through it. Plus, no proprietary / vendor lock-in format. I have used reveal.js for a few years now, and I absolutely love it. I can’t imagine faffing around with slides these days. https://revealjs.com/ Today’s #FreeSoftwareAdvent is perhaps a very obvious one: Mastodon. I’ve used Mastodon for a reasonably long time; this account is coming up 8 years old. I run my own instance using the glitch-soc fork ( https://glitch-soc.github.io/docs/) , and it is my only social networking presence. “Vanilla” Mastodon instructions are here: https://docs.joinmastodon.org/admin/install/ If you want your own instance but don’t want to host it, a few places offer managed hosting. Today’s #FreeSoftwareAdvent is greenbone/OpenVAS, a vulnerability scanner. I use it to help me spot (mis)configuration issues, out of date packages (even with automatic upgrades), and vulnerabilities. It is, unfortunately, a bit of a pain to install / get working, but this script ( https://github.com/martinboller/gse ) is useful, if you don’t mind sticking (for now, anyway) with Debian Bookworm. https://greenbone.github.io/docs/latest/index.html Today’s #FreeSoftwareAdvent is F-Droid, a Free software app store for Android. I use the GrapheneOS flavour of Android, and F-Droid is my app installation tool of choice. https://f-droid.org/ / @[email protected] Today’s #FreeSoftwareAdvent are some extensions which I find particularly useful. For Thunderbird: For Firefox: Today’s #FreeSoftwareAdvent is Elementary OS. 
If I were looking for a first-time Linux operating system for a friend or family member, I’d happily give Elementary OS a try, as I have heard nothing but good things about it. Plus, supporting small projects doing impressive things FTW. https://elementary.io/ / @[email protected]. Today’s #FreeSoftwareAdvent is paperless-ngx, a key part of keeping us, well, paperless. It is a document management tool, but I use it in a very basic way: it is hooked up to our scanner, and anything we scan gets automatically converted to PDF and OCRd. We then shred the paper. It is particularly useful around tax return time, as it means I can easily get the information I need from stuff which people have posted to us. https://github.com/paperless-ngx/paperless-ngx F-Droid is an app, which provides an app store interface F-Droid, the project, also runs the default repository used by the F-Droid app But anyone can run their own repository, and users can add that repo and install apps from there (e.g. Bitwarden does this) Quick Folder Move lets me file email (including into folders in different accounts) via the keyboard. I use this goodness knows how many tens of times a day. 
https://services.addons.thunderbird.net/en-US/thunderbird/addon/quick-folder-move/ Consent-o-matic automatically suppresses malicious compliance cookie banners: https://addons.mozilla.org/en-GB/firefox/addon/consent-o-matic/ uBlock Origin blocks all sorts of problematic stuff, and helps make for a much nicer browsing experience: https://addons.mozilla.org/en-GB/firefox/addon/ublock-origin/ Recipe Filter foregrounds the actual recipe on recipe pages: https://addons.mozilla.org/en-GB/firefox/addon/recipe-filter/ Shut Up is a comments blocker, which makes the web far more pleasant, with fewer random people’s opinions: https://addons.mozilla.org/en-GB/firefox/addon/shut-up-comment-blocker/

Evan Schwartz 2 months ago

Short-Circuiting Correlated Subqueries in SQLite

I recently added domain exclusion lists and paywalled content filtering to Scour . This blog post describes a small but useful SQL(ite) query optimization I came across between the first and final drafts of these features: using an uncorrelated scalar subquery to skip a correlated subquery (if you don't know what that means, I'll explain it below). Scour searches noisy sources for content related to users' interests. At the time of writing, it ingests between 1 and 3 million pieces of content from over 15,000 sources each month. For better and for worse, Scour does ranking on the fly, so the performance of the ranking database query directly translates to page load time. The main SQL query Scour uses for ranking applies a number of filters and streams the item embeddings through the application code for scoring. Scour uses brute force search rather than a vector database, which works well enough for now because of three factors: A simplified version of the query looks something like: The query plan shows that this makes good use of indexes: To add user-specified domain blocklists, I created the table and added this filter clause to the main ranking query: The domain exclusion table uses as a primary key, so the lookup is efficient. However, this lookup is done for every row returned from the first part of the query. This is a correlated subquery : A problem with the way we just added this feature is that most users don't exclude any domains, but we've added a check that is run for every row anyway. To speed up the queries for users who aren't using the feature, we could first check the user's settings and then dynamically build the query. But we don't have to, because we can accomplish the same effect within one static query. We can change our domain exclusion filter to first check whether the user has any excluded domains: Since the short-circuits, if the first returns (when the user has no excluded domains), SQLite never evaluates the correlated subquery at all. 
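The short-circuit pattern described above can be sketched in a few lines of SQLite, here driven from Python's built-in sqlite3 module. Note that the table and column names (`items`, `excluded_domains`, `user_id`, `domain`) are hypothetical stand-ins for illustration, not Scour's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, domain TEXT);
    CREATE TABLE excluded_domains (user_id INTEGER, domain TEXT,
                                   PRIMARY KEY (user_id, domain));
    INSERT INTO items (domain) VALUES ('a.com'), ('b.com'), ('c.com');
    -- user 2 excludes b.com; user 1 excludes nothing
    INSERT INTO excluded_domains VALUES (2, 'b.com');
""")

QUERY = """
    SELECT domain FROM items
    WHERE
        -- Uncorrelated: references no column of items, so it is evaluated
        -- once. For a user with no exclusions it is TRUE, and OR
        -- short-circuits past the per-row check entirely.
        NOT EXISTS (SELECT 1 FROM excluded_domains WHERE user_id = :user_id)
        OR NOT EXISTS (
            -- Correlated: references items.domain, so it runs per row.
            SELECT 1 FROM excluded_domains
            WHERE user_id = :user_id AND domain = items.domain
        )
    ORDER BY domain
"""

def visible(user_id):
    return [row[0] for row in conn.execute(QUERY, {"user_id": user_id})]

print(visible(1))  # user 1 has no exclusions, so sees everything
print(visible(2))  # user 2 sees b.com filtered out
```

Both branches return the correct rows; the payoff is that for users with no exclusions the correlated branch is never evaluated.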
The first clause does not reference any column in , so SQLite can evaluate it once and reuse the boolean result for all of the rows. This "uncorrelated scalar subquery" is extremely cheap to evaluate and, when it returns , lets us short-circuit and skip the more expensive correlated subquery that checks each item's domain against the exclusion list. Here is the query plan for this updated query. Note how the second subquery says , whereas the third one is a . The latter is the per-row check, but it can be skipped by the second subquery. To test the performance of each of these queries, I replaced the with and used a simple bash script to invoke the binary 100 times for each query on my laptop. Starting up the process each time adds overhead, but we're comparing relative differences. At the time of this benchmark, the last week had 235,975 items, 144,229 of which were in English. The two example users I tested this for below only look for English content. This test represents most users, who have not configured any excluded domains: This shows that the short-circuit query adds practically no overhead for users without excluded domains, whereas the correlated subquery alone makes queries 17% slower for these users. This test uses an example user that has excluded content from 2 domains: In this case, we do need to check each row against the domain filter. But this shows that the short-circuit still adds no overhead on top of the query. When using SQL subqueries to filter down result sets, it's worth thinking about whether each subquery is really needed for most users or most queries. If the check is needed most of the time, this approach won't help. However, if the per-row check isn't always needed, using an uncorrelated scalar subquery to short-circuit a condition can dramatically speed up the average case with practically zero overhead. This is extra important because the slow-down from each additional subquery compounds. 
In this blog post, I described and benchmarked a single additional filter. But this is only one of multiple subquery filters. Earlier, I also mentioned that users had asked for a way to filter out paywalled content. This works similarly to filtering out content from excluded domains. Some users opt in to hiding paywalled content. For those users, we check if each item is paywalled. If so, we check if it comes from a site the user has specifically allowed paywalled content from (because they have a subscription). I used the same uncorrelated subquery approach to first check if the feature is enabled for the user; only then does SQLite need to check each row. Concretely, the paywalled content filter subquery looks like: In short, a trivial uncorrelated scalar subquery can help us short-circuit and avoid a more expensive per-row check when we don't need it. There are multiple ways to exclude rows from an SQL query. Here are the results from the same benchmark I ran above, but with two other ways of checking whether an item comes from an excluded domain. The version of the query uses the subquery: The variation joins with and then checks for : And here are the full benchmarks: For users without excluded domains, we can see that the query using the short-circuit wins and adds no overhead. For users who do have excluded domains, the is faster than the version. However, this version raises the exact problem this whole blog post is designed to address. Since joins happen no matter what, we cannot use the short-circuit to avoid the overhead for users without excluded domains. At least for now, this is why I've gone with the subquery using the short-circuit. Discuss on Hacker News, Lobsters, r/programming, r/sqlite. Scour uses SQLite, so the data is colocated with the application code. It uses binary-quantized vector embeddings with Hamming Distance comparisons, which only take ~5 nanoseconds each. 
We care most about recent posts so we can significantly narrow the search set by publish date.

Hugo 2 months ago

Managing Custom Domains and Dynamic SSL with Coolify and Traefik

I recently published an article on [managing SSL certificates for a multi-tenant application on Coolify](https://eventuallymaking.io/p/2025-10-coolify-wildcard-ssl). The title might sound intimidating, but basically it lets an application handle tons of subdomains (over HTTPS) dynamically for its users. For example:

- `https://hugo.writizzy.com`
- `https://thomas-sanlis.writizzy.com`

That's a solid foundation, but users often want more: **custom domains**. The ability to use their own address, like:

- `https://eventuallymaking.io`
- `https://thomas-sanlis.com`

So let's dive into how to manage custom domains for a multi-tenant app with dynamic SSL certificates. *(I feel like I'm going for the world record in longest blog post title!)*

## Custom Domains

The first step for a custom domain is routing traffic from that (sub)domain to your application (Writizzy in my case). It all comes down to **DNS records** on the user's side. Only they can configure their domain to point to your application's subdomain. They have several options: **CNAME**, **ALIAS**, or **A** records.

### 1. CNAME (the simplest)

A CNAME is an alias for another domain. It basically says that `www.eventuallymaking.io` is an alias for `hugo.writizzy.com`. All traffic trying to resolve the `www` address gets automatically forwarded to your application domain.

**Major limitation:** You can't use a CNAME for a root domain (e.g., `eventuallymaking.io` without the `www`). Adding a CNAME on an apex domain would conflict with other records (A, MX, TXT, etc.).

### 2. ALIAS

Some DNS providers offer **ALIAS** records (or *CNAME flattening*). It's essentially a CNAME that can coexist with other records on a root domain. Great option if the user's provider supports it (OVH doesn't, for instance).

### 3. A Record

Here, the user directly enters Writizzy's server IP address.

**Warning:** This approach is risky.
If you change servers (and therefore IPs), all your users' custom domains break until they update their config. To use this method safely, you need a **floating IP** that can be reassigned to your new server or web frontend.

Alright, that's a good start, but if we stop here, things won't work well—the site won't serve over HTTPS.

## HTTPS on Custom Domains

In my previous post, I showed how Traefik could route all traffic hitting `*.writizzy.com` subdomains to the same application using these lines:

```yaml
traefik.http.routers.https-custom.rule=HostRegexp(`^.+$`)
traefik.http.routers.https-custom.entryPoints=https
traefik.http.routers.https-custom.service=https-0-zgwokkcwwcwgcc4gck440o88
traefik.http.routers.https-custom.tls.certresolver=letsencrypt
traefik.http.routers.https-custom.tls=true
traefik.http.routers.https-custom.priority=1
```

Since the regex catches everything, you might think SSL would just follow along. Unfortunately, no. Traefik knows it needs to handle a wildcard certificate for `*.writizzy.com`, but it has no clue about the external domains it'll need to serve. We need to help it by dynamically providing the list of custom domains. Our constraints:

- Obviously, no application restart
- We want to drive it programmatically

### Dynamic Traefik Configuration with File Providers

This is where Traefik's [File Providers](https://doc.traefik.io/traefik/reference/install-configuration/providers/others/file/) come in. In Coolify, this feature is enabled by default via [dynamic configurations](https://coolify.io/docs/knowledge-base/proxy/traefik/dynamic-config). Under the hood, Traefik watches a specific directory:

```bash
- '--providers.file.directory=/traefik/dynamic/'
- '--providers.file.watch=true'
```

Drop a `.yml` file defining the rule for a new domain, and Traefik picks it up on the fly and triggers the HTTP challenge with Let's Encrypt to get the SSL certificate.
**Example dynamic configuration:**

```yaml
http:
  routers:
    eventuallymaking-https:
      rule: "Host(`eventuallymaking.io`)"
      entryPoints:
        - https
      service: https-0-writizzy
      tls:
        certResolver: letsencrypt
      priority: 10
    eventuallymaking-http:
      rule: "Host(`eventuallymaking.io`)"
      entryPoints:
        - http
      middlewares:
        - redirect-to-https@docker
      service: https-0-writizzy
      priority: 10
```

So with this approach, we've solved our first constraint: no restart needed to configure a new custom domain. Now let's tackle the second one and make it programmatic.

### Automation via the Application

The idea is straightforward: the application creates these files directly in the directory Traefik watches.

1. **Coolify configuration:** Go to `Configuration -> Persistent Storage` and add a `Directory mount` to make the `/traefik/dynamic/` directory visible to your application container. ![directory mount](https://writizzy.b-cdn.net/blogs/48b77143-02ee-4316-9d68-0e6e4857c5ce/1765961966661-wymdzej.png)
2. **Code (Kotlin in my case):** The application generates a file based on the template above as soon as a user configures their domain.

> **Important note on timing:** If you create the Traefik config before the user has pointed their domain to your IP, the SSL challenge will fail. Traefik will automatically retry (with backoff), but it can take a while. Ideally, validate the DNS pointing (via a background job) **before** generating the file.

> **Important note on security:** The generated file contains user input (the domain name). It's crucial to sanitize and validate this data to prevent a malicious user from injecting arbitrary Traefik directives into your configuration.

## Wrapping Up

This post wraps up what I hope has been a useful series for SaaS builders running multi-tenant apps on Coolify. We've covered:

1. [Tenant management (Nuxt) and basic Traefik configuration](https://eventuallymaking.io/p/2025-10-coolify-subdomain)
2. [SSL management with dynamic subdomains](https://eventuallymaking.io/p/2025-10-coolify-wildcard-ssl)
3. **Custom domains with SSL** (this post)

You might wonder if this solution scales with tens of thousands of sites all using custom domains—specifically whether Traefik can handle monitoring that many files in a directory. I don't have the answer yet. Writizzy is nowhere near that scale for now. There might be other solutions, like using Caddy instead of Traefik in Coolify, but it's way too early to explore that.
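As a sketch of those two "important notes" (DNS validation and input sanitization), here's roughly what a background job could do before writing the file. Everything here (the names, the regex, the server IP) is illustrative rather than Writizzy's actual code, and it's in Python rather than Kotlin for brevity:

```python
import re
import socket

SERVER_IPS = {"203.0.113.10"}  # illustrative: your server's (floating) IP

# Hostname shape only: lowercase labels, no leading/trailing hyphens.
# Crucially, this rejects backticks, spaces, and parens, so a malicious
# "domain" cannot inject Traefik rule syntax into the generated YAML.
DOMAIN_RE = re.compile(
    r"^(?!-)[a-z0-9-]{1,63}(?<!-)(\.(?!-)[a-z0-9-]{1,63}(?<!-))+$"
)

def domain_is_safe(domain: str) -> bool:
    return bool(DOMAIN_RE.match(domain.lower()))

def domain_points_at_us(domain: str) -> bool:
    """Resolve the domain (CNAME chains included) and check it reaches us."""
    try:
        infos = socket.getaddrinfo(domain, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    return any(info[4][0] in SERVER_IPS for info in infos)

def traefik_config(domain: str) -> str:
    """Render the dynamic-config YAML for a validated custom domain."""
    if not domain_is_safe(domain):
        raise ValueError(f"invalid domain: {domain!r}")
    name = domain.replace(".", "-")
    return f"""http:
  routers:
    {name}-https:
      rule: "Host(`{domain}`)"
      entryPoints: [https]
      service: https-0-writizzy
      tls: {{ certResolver: letsencrypt }}
      priority: 10
"""
```

Only once both checks pass would the job write `traefik_config(domain)` into `/traefik/dynamic/`, so the Let's Encrypt challenge doesn't fire against a domain that isn't pointed yet.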

Anton Zhiyanov 2 months ago

Timing 'Hello, world'

Here's a little unscientific chart showing the compile/run times of a "hello world" program in different languages: For interpreted languages, the times shown are only for running the program, since there's no separate compilation step. I had to shorten the Kotlin bar a bit to make it fit within 80 characters. All measurements were done in single-core, containerized sandboxes on an ancient CPU, and the timings include the overhead of . So the exact times aren't very interesting, especially for the top group (Bash to Ruby) — they all took about the same amount of time. Here is the program source code in C: Other languages: Bash · C# · C++ · Dart · Elixir · Go · Haskell · Java · JavaScript · Kotlin · Lua · Odin · PHP · Python · R · Ruby · Rust · Swift · V · Zig Of course, this ranking will be different for real-world projects with lots of code and dependencies. Still, I found it curious to see how each language performs on a simple "hello world" task.

Armin Ronacher 2 months ago

Skills vs Dynamic MCP Loadouts

I’ve been moving all my MCPs to skills, including the remaining one I still used: the Sentry MCP 1 . Previously I had already moved entirely away from Playwright to a Playwright skill. In the last month or so there have been discussions about using dynamic tool loadouts to defer loading of tool definitions until later. Anthropic has also been toying around with the idea of wiring together MCP calls via code, something I have experimented with . I want to share my updated findings with all of this and why the deferred tool loading that Anthropic came up with does not fix my lack of love for MCP. Maybe they are useful for someone else. When the agent encounters a tool definition through reinforcement learning or otherwise, it is encouraged to emit tool calls through special tokens when it encounters a situation where that tool call would be appropriate. For all intents and purposes, tool definitions can only appear between special tool definition tokens in a system prompt. Historically this means that you cannot emit tool definitions later in the conversation state. So your only real option is for a tool to be loaded when the conversation starts. In agentic uses, you can of course compress your conversation state or change the tool definitions in the system message at any point. But the consequence is that you will lose the reasoning traces and also the cache. In the case of Anthropic, for instance, this will make your conversation significantly more expensive. You would basically start from scratch and pay full token rates plus cache write cost, compared to cache read. One recent innovation from Anthropic is deferred tool loading. You still declare tools ahead of time in the system message, but they are not injected into the conversation when the initial system message is emitted. Instead they appear at a later point. The tool definitions however still have to be static for the entire conversation, as far as I know. 
So the tools that could exist are defined when the conversation starts. The way Anthropic discovers the tools is purely by regex search. This is all quite relevant because even though MCP with deferred loading feels like it should perform better, it actually requires quite a bit of engineering on the LLM API side. The skill system gets away without any of that and, at least from my experience, still outperforms it. Skills are really just short summaries of which skills exist and in which file the agent can learn more about them. These are proactively loaded into the context. So the agent understands in the system context (or maybe somewhere later in the context) what capabilities it has and gets a link to the manual for how to use them. Crucially, skills do not actually load a tool definition into the context. The tools remain the same: bash and the other tools the agent already has. All it learns from the skill are tips and tricks for how to use these tools more effectively. Because the main thing it learns is how to use other command line tools and similar utilities, the fundamentals of how to chain and coordinate them together do not actually change. The reinforcement learning that made the Claude family of models very good tool callers just helps with these newly discovered tools. So that obviously raises the question: if skills work so well, can I move the MCP outside of the context entirely and invoke it through the CLI in a similar way as Anthropic proposes? The answer is yes, you can, but it doesn’t work well. One option here is Peter Steinberger’s mcporter . In short, it reads the files and exposes the MCPs behind it as callable tools: And yes, it looks very much like a command line tool that the LLM can invoke. The problem however is that the LLM does not have any idea about what tools are available, and now you need to teach it that. So you might think: why not make some skills that teach the LLM about the MCPs? 
Here the issue for me comes from the fact that MCP servers have no desire to maintain API stability. They are increasingly starting to trim down tool definitions to the bare minimum to preserve tokens. This makes sense, but for the skill pattern it’s not what you want. For instance, the Sentry MCP server at one point switched the query syntax entirely to natural language. A great improvement for the agent, but my suggestions for how to use it became a hindrance and I did not discover the issue straight away. This is in fact quite similar to Anthropic’s deferred tool loading: there is no information about the tool in the context at all. You need to create a summary. The eager loading of MCP tools we have done in the past now has ended up with an awkward compromise: the description is both too long to eagerly load it, and too short to really tell the agent how to use it. So at least from my experience, you end up maintaining these manual skill summaries for MCP tools exposed via mcporter or similar. This leads me to my current conclusion: I tend to go with what is easiest, which is to ask the agent to write its own tools as a skill. Not only does it not take all that long, but the biggest benefit is that the tool is largely under my control. Whenever it breaks or needs some other functionality, I ask the agent to adjust it. The Sentry MCP is a great example. I think it’s probably one of the better designed MCPs out there, but I don’t use it anymore. In part because when I load it into the context right away I lose around 8k tokens out of the box, and I could not get it to work via mcporter. On the other hand, I have Claude maintain a skill for me. And yes, that skill is probably quite buggy and needs to be updated, but because the agent maintains it, it works out better. It’s quite likely that all of this will change, but at the moment manually maintained skills and agents writing their own tools have become my preferred way. 
I suspect that dynamic tool loading with MCP will become a thing, but it will probably require quite some protocol changes to bring in skill-like summaries and built-in manuals for the tools. I also suspect that MCP would greatly benefit from protocol stability. The fact that MCP servers keep changing their tool descriptions at will does not work well with materialized calls and external tool descriptions in READMEs and skill files. Keen readers will remember that last time, the last MCP I used was Playwright. In the meantime I added and removed two more MCPs: Linear and Sentry, mostly because of authentication issues and neither having a great command line interface. ↩


AMD GPU Debugger

I’ve always wondered why we don’t have a GPU debugger similar to the ones used for CPUs: a tool that allows pausing execution and examining the current state. This capability feels essential, especially since the GPU’s concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD’s ROCm environment. Unfortunately, its scope is limited to that environment. Still, this shows it’s technically possible. I then found a helpful series of blog posts by Marcell Kiss, detailing how he achieved this, which inspired me to try to recreate the process myself. The best place to start learning about this is RADV. By tracing what it does, we can find out how to do it. Our goal here is to run the most basic shader without using Vulkan, aka RADV in our case. First of all, we need to open the DRM file to establish a connection with the KMD, using a simple open("/dev/dri/cardX"). Then we find that it’s calling , which is a function defined in , a library that acts as middleware between user-mode drivers (UMD) like and , and kernel-mode drivers (KMD) like the amdgpu driver. When we try to do some actual work, we have to create a context, which can be achieved by calling from again. Next up, we need to allocate 2 buffers: one for our code and the other for writing our commands into. We do this by calling a couple of functions; here’s how I do it: Here we’re choosing the domain and assigning flags based on the params; some buffers we will need uncached, as we will see: Now that we have the memory, we need to map it. I opt to map anything that can be CPU-mapped for ease of use. We have to map the memory to both the GPU and the CPU virtual address space. The KMD creates the page table when we open the DRM file, as shown here . So map it to the GPU VM and, if possible, to the CPU VM as well. 
At this point, there’s a libdrm function that does all of this setup for us and maps the memory, but I found that even when specifying , it doesn’t always tag the page as uncached. I’m not quite sure if it’s a bug in my code or something in ; anyway, the function is . I opted to do it manually here and issue the IOCTL call myself: Now we have the context and 2 buffers. Next, fill those buffers and send our commands to the KMD, which will then forward them to the Command Processor (CP) in the GPU for processing. Let’s compile our code. We can use the clang assembler for that, like this: The bash script compiles the code, and then we’re only interested in the actual machine code, so we use objdump to figure out the offset and the size of the section and copy it to a new file called asmc.bin. Then we can just load the file and write its bytes to the CPU-mapped address of the code buffer. Next up, filling in the commands. This was extremely confusing for me because it’s not well documented; it was mostly learning how does things and trying to do similar things. Also, shout-out to the folks on the Graphics Programming Discord server for helping me, especially Picoduck. The commands are encoded in a special format called , which has multiple types. We only care about : each packet has an opcode and the number of dwords it contains. The first thing we need to do is program the GPU registers, then dispatch the shader. Some of those registers are ; those registers are responsible for a number of configurations, pgm_[lo/hi], which hold the pointer to the code buffer, and ; those are responsible for the number of threads inside a work group. 
All of those are set using the packets, and here is how to encode them: {% mark %} It’s worth mentioning that we can set multiple registers in 1 packet if they’re consecutive. {% /mark %} Then we append the dispatch command: Now we want to write those commands into our buffer and send them to the KMD: {% mark %} Here is a good point to make a more complex shader that outputs something, for example, writing 1 to a buffer. {% /mark %} No GPU hangs?! Nothing happened?! Cool, cool, now we have a shader that runs on the GPU. What’s next? Let’s try to hang the GPU by pausing the execution, aka make the GPU trap. The RDNA3 ISA manual does mention 2 registers, ; here’s how it describes them, respectively: Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit [62] is replicated into address bit [63]). Accessed via S_SENDMSG_RTN. Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler. {%mark%} You can configure the GPU to enter the trap handler when encountering certain exceptions listed in the RDNA3 ISA manual. {%/mark%}
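Stepping back to the command encoding for a second: a PM4 type-3 header packs the packet type, the payload size, and the opcode into a single dword. Here is a sketch of that layout in Python; the opcode values are assumptions taken from Mesa's `sid.h`-style headers, so double-check them against your target generation:

```python
# PM4 type-3 header: bits [31:30] = packet type (3), [29:16] = count,
# [15:8] = IT opcode. "count" is the number of payload dwords minus one.
PKT3_SET_SH_REG = 0x76        # assumed opcode values (see Mesa's headers)
PKT3_DISPATCH_DIRECT = 0x15

def pkt3(opcode: int, num_payload_dwords: int) -> int:
    return (3 << 30) | ((num_payload_dwords - 1) << 16) | (opcode << 8)

def set_sh_reg(reg_offset: int, *values: int) -> list:
    # Payload: register offset (relative to the SH register base), then one
    # value per register; consecutive registers can share a single packet.
    return [pkt3(PKT3_SET_SH_REG, 1 + len(values)), reg_offset, *values]

def dispatch_direct(x: int, y: int, z: int, initiator: int) -> list:
    # Workgroup counts in each dimension plus the dispatch-initiator dword.
    return [pkt3(PKT3_DISPATCH_DIRECT, 4), x, y, z, initiator]
```

The resulting dwords are what get written into the CPU-mapped command buffer before submission.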
It works by simply opening the file, seeking to the register’s offset, and then writing; it also performs some synchronisation and writes the value correctly. We need to provide more parameters about the register before writing to the file, tho, and we do that by using an ioctl call. Here are the ioctl arguments: There are 2 structs because there are 2 types of registers, GRBM and SRBM, each of which is banked by different constructs; you can learn more about some of them here in the Linux kernel documentation . Turns out our registers here are SRBM registers, banked by VMIDs, meaning each VMID has its own TBA and TMA registers. Cool, now we need to figure out the VMID of our process. As far as I understand, VMIDs are a way for the GPU to identify a specific process context, including the page table base address, so the address translation unit can translate a virtual memory address. The context is created when we open the DRM file. They get assigned dynamically at dispatch time, which is a problem for us; we want to write to those registers before dispatch. We can obtain the VMID of the dispatched process by querying the register with s_getreg_b32. I do a hack here, by enabling the trap handler in every VMID. There are 16 of them: the first is special and used by the KMD, and the last 8 are allocated to the amdkfd driver. We loop over the remaining VMIDs and write to those registers. This can cause issues for other processes using other VMIDs, by enabling trap handlers in them and writing the virtual address of our trap handler, which is only valid within our virtual memory address space. It’s relatively safe tho, since most other processes won’t cause a trap[^1]. [^1]: Other processes need to have a s_trap instruction or have trap-on-exception flags set, which is not true for most normal GPU processes. Now we can write to TMA and TBA; here’s the code: And here’s how we write to and : {%mark%} If you noticed, I’m using bitfields. 
I use them because working with them is much easier than macros, and while the byte order is not guaranteed by the C spec, it’s guaranteed by the System V ABI, which Linux adheres to. {%/mark%} Anyway, now that we can write to those registers, if we enable the trap handler correctly, the GPU should hang when we launch our shader, if we added instruction to it or enabled the bit in the rsrc3[^2] register. [^2]: Available since RDNA3, if I’m not mistaken. Now, let’s try to write a trap handler. {%mark%} If you wrote a different shader that outputs to a buffer, you can try writing to that buffer from the trap handler, which is nice to make sure it’s actually being run. {%/mark%} We need 2 things: our trap handler and some scratch memory to use when needed, whose address we will store in the TMA register. The trap handler is just a normal program running in a privileged state, meaning we have access to special registers like TTMP[0-15]. When we enter a trap handler, we first need to ensure that the state of the GPU registers is saved, just as the kernel does for CPU processes when context-switching, by saving a copy of the stable registers and the program counter, etc. The problem, tho, is that we don’t have a stable ABI for GPUs, or at least not one I’m aware of, and compilers use all the registers they can, so we need to save everything. AMD GPUs’ Command Processors (CPs) have context-switching functionality, and the amdkfd driver does implement some context-switching shaders . The problem is they’re not documented, and we have to figure them out from the amdkfd driver source and from other parts of the driver stack that interact with it, which is a pain in the ass. I kinda did a workaround here, since I didn’t have luck understanding how it works, and for some other reasons I’ll discuss later in the post. 
The workaround here is to use only TTMP registers and a combination of specific instructions to copy the values of some registers, allowing us to use more instructions to copy the remaining registers. The main idea is to make use of the instruction, which adds the index of the current thread within the wave to the write address, aka $$ ID_{thread} * 4 + address $$ This allows us to write a unique value per thread using only TTMP registers, which are unique per wave, not per thread[^3], so we can save the context of a single wave. [^3]: VGPRs are unique per thread, and SGPRs are unique per wave. The problem is that if we have more than 1 wave, they will overlap, and we will have a race condition. Here is the code: Now that we have those values in memory, we need to tell the CPU: hey, we got the data, and pause the GPU’s execution until the CPU issues a command. Also, notice we can just modify those from the CPU. Before we tell the CPU, we need to write some values that might help it. Here they are: Now the GPU should just wait for the CPU, and here’s the spin code; it’s implemented as described by Marcell Kiss here : The main loop on the CPU is: enable the trap handler, dispatch the shader, wait for the GPU to write a specific value at a specific address to signal that all the data is there, then examine and display it, and tell the GPU: all clear, go ahead. Now that our uncached buffers are in play, we just keep looping and checking whether the GPU has written the register values. When it does, the first thing we do is halt the wave by writing into the register, to allow us to do whatever we want with the wave without causing any issues. Tho if we halt for too long, the GPU CP will reset the command queue and kill the process, but we can change that behaviour by adjusting the lockup_timeout parameter of the amdgpu kernel module: From here on, we can do whatever we want with the data we have. All the data we need to build a proper debugger. 
We will come back to what to do with the data in a bit; let’s assume we did what was needed for now. Now that we’re done on the CPU side, we need to write to the first byte in our TMA buffer, since the trap handler checks for that, then resume the wave, and the trap handler should pick it up. We can resume by writing to the register again: Then the GPU should continue. We need to restore everything and return the program counter to the original address. Based on whether it’s a hardware trap or not, the program counter may point to the instruction before or the instruction itself. The ISA manual and Marcell Kiss’s posts explain that well, so refer to them. Now we can run compiled code directly, but we don’t want people to compile their code manually, then extract the text section, and give it to us. The plan is to take SPIR-V code, compile it correctly, then run it, or, even better, integrate with RADV and let RADV give us more information to work with. My main plan was to fork RADV and make it report the Vulkan calls for us; then we’d have a better view of the GPU work and know the buffers/textures it’s using, etc. This seems like a lot more work tho, so I’ll keep it in mind, but I’m not doing that for now unless someone is willing to pay me for it ;). For now, let’s just use RADV’s compiler . Luckily, RADV has a mode, aka it will not do actual work or open DRM files, just a fake Vulkan device, which is perfect for our case here, since we care about nothing other than compiling code. We can enable it by setting the env var , then we just call what we need like this: Now that we have a well-structured loop and communication between the GPU and the CPU, we can run SPIR-V binaries to some extent. Let’s see how we can make it an actual debugger. 
We talked earlier about CPs natively supporting context-switching; this appears to be a compute-specific feature, which prevents us from implementing it for other types of shaders. Tho, it appears that mesh shaders and raytracing shaders are just compute shaders under the hood, which will allow us to use that functionality. For now, debugging one wave feels like enough; also, we can modify the wave parameters to debug some specific indices. Here are some of the features: For stepping, we can use 2 bits: one in and the other in . They’re and , respectively. The former enters the trap handler after each instruction, and the latter enters before the first instruction. This means we can automatically enable instruction-level stepping. Regarding breakpoints, I haven’t implemented them, but they’re rather simple to implement here: having the base address of the code buffer and knowing the size of each instruction, we can calculate the program counter locations ahead of time, have a list of them available to the GPU, and do a binary search in the trap handler. The ACO shader compiler does generate instruction-level source code mapping, which is good enough for our purposes here. By taking the offset[^4] of the current program counter and indexing into the code buffer, we can retrieve the current instruction and disassemble it, as well as find the source code mapping from the debug info. [^4]: We can get that by subtracting the address of the code buffer from the current program counter. We can implement this by marking the GPU page as protected. On a GPU fault, we enter the trap handler, check whether the address is within the range of our buffers and textures, and then act accordingly. Also, looking at the registers, we can find these: which suggests that the hardware already supports this natively, so we don’t even need to do that dance. It needs more investigation on my part, tho, since I didn’t implement this. 
This needs some serious plumbing, since we need to make NIR (Mesa’s intermediate representation) optimisation passes propagate debug info correctly. I already started on this here . Then we need to make ACO track variables and store that information. This requires ditching the simple UMD we made earlier and using RADV, which is what should happen eventually. We could then have our custom driver pause before a specific frame, or be triggered by a key, and ask before each dispatch whether to attach to it, or something similar. Since we would have a full, proper Vulkan implementation, we would already have all the information we need (buffers, textures, push constants, types, variable names, etc.), and that would be a much better and more pleasant debugger to use. Finally, here’s some live footage: ::youtube{url=“ https://youtu.be/HDMC9GhaLyc ”} Here is some incomplete user-mode page-walking code for gfx11, aka the RX 7900 XTX

Kaushik Gopal 2 months ago

Combating AI coding atrophy with Rust

It’s no secret that I’ve fully embraced AI for my coding. A valid concern ( and one I’ve been thinking about deeply ) is the atrophying of the part of my brain that helps me code. To push back on that, I’ve been learning Rust on the side for the last few months. I am absolutely loving it. Kotlin remains my go-to language. It’s the language I know like the back of my hand. If someone sends me a swath of Kotlin code, whether handwritten or AI generated, I can quickly grok it and form a strong opinion on how to improve it. But Kotlin is a high-level language that runs on a JVM. There are structural limits to the performance you can eke out of it, and for most of my career 1 I’ve worked with garbage-collected languages. For a change, I wanted a systems-level language, one without the training wheels of a garbage collector. I also wanted a language with a different core philosophy, something that would force me to think in new ways. I picked up Go casually but it didn’t feel like a big enough departure from the languages I already knew. It just felt more useful to ask AI to generate Go code than to learn it myself. With Rust, I could get code translated, but then I’d stare at the generated code and realize I was missing some core concepts and fundamentals. I loved that! The first time I hit a lifetime error, I had no mental model for it. That confusion was exactly what I was looking for. Coming from a GC world, memory management is an afterthought — if it requires any thought at all. Rust really pushes you to think through the ownership and lifespan of your data, every step of the way. In a bizarre way, AI made this gap obvious. It showed me where I didn’t understand things and pointed me toward something worth learning. Here’s some software that’s either built entirely in Rust or uses it in fundamental ways: Many of the most important tools I use daily are built with Rust. Can’t hurt to know the language they’re written in. Rust is quite similar to Kotlin in many ways. 
Both use strict static typing with advanced type inference. Both support null safety and provide compile-time guarantees. The compile-time strictness and higher-level constructs made it fairly easy for me to pick up the basics. Syntactically, it feels very familiar. I started by rewriting a couple of small CLI tools I used to keep in Bash or Go. Even in these tiny programs, the borrow checker forced me to be clear about who owns what and when data goes away. It can be quite the mental workout at times, which is perfect for keeping that atrophy from setting in. After that, I started to graduate to slightly larger programs and small services. There are two main resources I keep coming back to: There are times when the book or course mentions a concept and I want to go deeper. Typically, I’d spend time googling, searching Stack Overflow, finding references, diving into code snippets, and trying to clear up small nuances. But that’s changed dramatically with AI. One of my early aha moments with AI was how easy it made ramping up on code. The same is true for learning a new language like Rust. For example, what’s the difference 2 between these two: Another thing I loved doing is asking AI: what are some idiomatic ways people use these concepts? Here’s a prompt I gave Gemini while learning: Here’s an abbreviated response (the full response was incredibly useful): It’s easy to be doom and gloom about AI in coding — the “we’ll all forget how to program” anxiety is real. But I hope this offers a more hopeful perspective. If you’re an experienced developer worried about skill atrophy, learn a language that forces you to think differently. AI can help you cross that gap faster. Use it as a tutor, not just a code generator. I did a little C/C++ in high school, but nowhere close to proficiency.  ↩︎ Think mutable var to a “shared reference” vs. immutable var to an “exclusive reference”.  
↩︎

fd (my tool of choice for finding files)
ripgrep (my tool of choice for searching files)
Fish shell (my shell of choice, recently rewritten in Rust)
Zed (my text/code editor of choice)
Firefox (my browser of choice)
Android?! That’s right: Rust now powers some of the internals of the OS, including the recent Quick Share feature.

Fondly referred to as “ The Book ”. There’s also a convenient YouTube series following the book .
Google’s Comprehensive Rust course, presumably created to ramp up their Android team. It even has a dedicated Android chapter .

This worked beautifully for me.
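For what it’s worth, the shared-vs-exclusive distinction from footnote 2 can be sketched in a few lines of Rust (the variable names are my own):

```rust
// A mutable variable holding a *shared* reference can be re-pointed but not
// mutated through; an immutable variable holding an *exclusive* reference can
// be mutated through but not re-pointed.
fn main() {
    let a = String::from("a");
    let b = String::from("b");
    let mut v = String::from("v");

    // Mutable variable, shared reference: the binding can be reassigned...
    let mut shared: &String = &a;
    shared = &b;
    // ...but `shared.push('!')` would not compile: no mutation through `&`.

    // Immutable variable, exclusive reference: we can mutate through it...
    let excl: &mut String = &mut v;
    excl.push('!');
    // ...but `excl = &mut b` would not compile: the binding itself is immutable.

    println!("{} {}", shared, excl); // b v!
}
```

The mutability of the binding and the mutability of the reference are independent axes, which is exactly the kind of nuance AI is good at explaining on demand.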

neilzone 2 months ago

Adding a button to Home Assistant to run a shell command

Now that I have fixed my garage door controller , I wanted to see if I could use it from within Home Assistant, primarily so that I can then have a widget on my phone screen. In this case, the shell command is a simple invocation to a remote machine. (I am not aware of a way to add a button to an Android home screen to run a shell command or bash script directly.) Adding a button to run a shell command or bash script in Home Assistant was pretty straightforward, following the Home Assistant documentation for shell commands . To my configuration.yaml file, I added something like: The quirk here was that reloading the configuration.yaml file from within the Home Assistant UI was insufficient: I needed to completely restart Home Assistant to pick up the changes. Once I had restarted Home Assistant, the shell commands were available. To add buttons, I needed to create “helpers”. I did this from the Home Assistant UI, via Settings / Devices & services / Helpers / Create helper. One helper for each button. After I had created each helper, I went back into the helper’s settings to add zone information, so that it appeared in the right place in the dashboard. Having created the button/helper and the shell command, I used an automation to link the two together. I did this via the Home Assistant UI, under Settings / Automations & scenes. The automation is triggered by a change in state of the button, with no parameters specified; the “then do” action is the shell command for the door in question.
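The original configuration.yaml snippet isn’t shown above; as a hedged sketch, a shell_command entry for this kind of remote invocation might look like the following (the command names, key path, host, and remote command are all hypothetical):

```yaml
# Hypothetical shell_command entries; names, key path, and host are illustrative.
shell_command:
  open_garage_door: >-
    ssh -i /config/.ssh/id_ed25519 -o StrictHostKeyChecking=no
    user@garage-controller.local 'garage-door open'
  close_garage_door: >-
    ssh -i /config/.ssh/id_ed25519 -o StrictHostKeyChecking=no
    user@garage-controller.local 'garage-door close'
```

Each key under shell_command becomes a shell_command.&lt;name&gt; action that an automation can then call.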


I want a better action graph serialization

This post is part 3/4 of a series about build systems . The next and last post is I want a better build executor . As someone who ends up getting the ping on "my build is weird" after it has gone through a round of "poke it with a stick", I would really appreciate the mechanisms for [correct dependency edges] rolling out sooner rather than later. In a previous post , I talked about various approaches in the design space of build systems. In this post, I want to zero in on one particular area: action graphs. First, let me define "action graph". If you've ever used CMake, you may know that there are two steps involved: a "configure" step ( ) and a build step ( or ). What I am interested in here is what generates: the Makefiles it has created. As the creator of ninja writes , this is a serialization of all build steps at a given moment in time, with the ability to regenerate the graph by rerunning the configure step. This post explores that design space, with the goal of sketching a format that improves on the current state while also enabling incremental adoption. When I say "design space", I mean a serialization format where files are machine-generated by a configure step, and have few enough and restricted enough features that it's possible to make a fast build executor . Not all build systems serialize their action graph. and run persistent servers that store it in memory and allow querying it, but never serialize it to disk. For large graphs, this requires a lot of memory; has actually started serializing parts of its graph to reduce memory usage and startup time . The nix evaluator doesn’t allow querying its graph at all; nix has a very strange model where it never rebuilds because each change to your source files is a new “ input-addressed derivation ” and therefore requires a reconfigure. This is the main reason it’s only used to package software, not as an “inner” build system, because that reconfigure can be very slow.
I’ve talked to a couple Nix maintainers and they’ve considered caching parts of the configure step, without caching its outputs (because there are no outputs, other than derivation files!) in order to speed this up. This is much trickier because it requires serializing parts of the evaluator state. Tools that do serialize their graph include CMake, Meson, and the Chrome build system ( GN ). Generally, serializing the graph comes in handy when: In the last post I talked about 4 things one might want from a build system: For a serialization format, we have slightly different constraints. Throughout this post, I'll dive into detail on how these 3 overarching goals apply to the serialization format, and how well various serializations achieve that goal. The first one we'll look at, because it's the default for CMake, is and Makefiles. Make is truly in the Unix spirit: easy to implement 2 , very hard to use correctly. Make is ambiguous , complicated , and makes it very easy to implicitly do a bunch of file system lookups . It supports running shell commands at the top-level, which makes even loading the graph very expensive. It does do pretty well on minimizing reconfigurations, since the language is quite flexible. Ninja is the other generator supported by CMake. Ninja is explicitly intended to work on a serialized action graph; it's the only tool I'm aware of that is. It solves a lot of the problems of Make : it removes many of the ambiguities; it doesn't have any form of globbing; and generally it's a much simpler and smaller language. Unfortunately, Ninja's build file format still has some limitations. First, it has no support for checksums. It's possible to work around that by using and having a wrapper script that doesn't overwrite files unless they've changed, but that's a lot of extra work and is annoying to make portable between operating systems. Ninja files also have trouble expressing correct dependency edges. Let's look at a few examples, one by one. 
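An aside on the checksum workaround mentioned above, before we get to those examples: the wrapper’s whole job is "don’t touch the output unless its contents changed", so that a restat-style check can prune downstream work. A hedged, portable sketch (standard library only; the function name is my own):

```rust
// Write-if-changed wrapper: leave the output file's mtime alone when the new
// contents are identical, so the build tool can skip dependent edges.
use std::fs;
use std::io;
use std::path::Path;

fn write_if_changed(path: &Path, new_contents: &[u8]) -> io::Result<bool> {
    match fs::read(path) {
        // Existing file with identical bytes: do nothing, keep the old mtime.
        Ok(old) if old == new_contents => Ok(false),
        // Missing, unreadable, or different: (re)write the file.
        _ => {
            fs::write(path, new_contents)?;
            Ok(true)
        }
    }
}

fn main() -> io::Result<()> {
    let p = Path::new("out.txt");
    let first = write_if_changed(p, b"hello")?;  // writes (assuming no prior file)
    let second = write_if_changed(p, b"hello")?; // identical: mtime untouched
    println!("{} {}", first, second);
    fs::remove_file(p)
}
```

Doing this in a real program rather than a shell script is exactly what makes the trick portable between operating systems, at the cost of the extra wrapper the post complains about.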
In each of these cases, we either have to reconfigure more often than we wish, or we have no way at all of expressing the dependency edge. See my previous post about negative dependencies. The short version is that build files need to specify not just the files they expect to exist, but also the files they expect not to exist. There's no way to express this in a ninja file, short of reconfiguring every time a directory that might contain a negative dependency is modified, which itself has a lot of downsides. Say that you have a C project with just a . You rename it to and ninja gives you an error that main.c no longer exists. Tired of editing ninja files by hand, you decide to write a generator 3 : Note that this registers an implicit dependency on the current directory. This should automatically detect that you renamed your file and rebuild for you. Oh. Right. Generating build.ninja also modifies the current directory, which creates an infinite loop. It's possible to work around this by putting your C file in a source directory: There's still a problem here, though—did you notice it? Our old target is still lying around. Ninja actually has enough information recorded to fix this: . But it's not run automatically. The other problem is that this approach rebuilds far too often. In this case, we wanted to support renames, so in Ninja's model we need to depend on the whole directory. But that's not what we really depended on—we only care about files. I would like to see an action graph format that has an event-based system, where it says "this file was created, make any changes to the action graph necessary", and cuts the build short if the graph wasn't changed. For flower , I want to go further and support deletions : source files and targets that are optional, that should not fail the build if they aren't present, but should cause a rebuild if they are created, modified, or deleted. Ninja has no way of expressing this.
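The self-regenerating setup sketched above is usually expressed with ninja’s generator rule; a minimal, illustrative version (the configure.py name and the src directory are assumptions) might look like:

```ninja
rule configure
  command = python configure.py
  generator = 1

# Implicit dependency (after "|") on a separate source directory, not on ".",
# so that writing build.ninja itself doesn't re-dirty this edge in a loop.
build build.ninja: configure | src
```

Marking the rule with generator = 1 tells ninja these outputs are build files themselves, so they get rebuilt before anything else and are left alone by ninja -t clean.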
Ninja has no way to express “this node becomes dirty when an environment variable changes”. The closest you can get is hacks with and the checksum wrapper/restat hack, but it’s a pain to express and it gets much worse if you want to depend on multiple variables. At this point, we have a list of constraints for our file format: Ideally, it would even be possible to mechanically translate existing .ninja files to this new format. This sketches out a new format that could improve over Ninja files. It could look something like this: I’d call this language Magma, since a magma is the simplest kind of algebraic structure: a set closed under an operation. Here's a sample action graph in Magma: Note some things about Magma: Kinda. Magma itself has a lot in common with Starlark: it's deterministic, hermetic, immutable, and can be evaluated in parallel. The main difference between the languages themselves is that Clojure has (equivalent to sympy symbolic variables) and Python doesn't. Some of these could be rewritten to keyword arguments, and others could be rewritten to structs, or string keys for a hashmap, or enums; but I'm not sure how much benefit there is to literally using Starlark when these files are being generated by a configure step in any case. Probably it's possible to make a 1-1 mapping between the two in any case. Buck2 has support for metadata that describes how to execute a built artifact. I think this is really interesting; is a much nicer interface than , partly because of shell quoting and word splitting issues, and partly just because it's more discoverable. I don't have a clean idea for how to fit this into a serialization layer. "Don't put it there and use a instead" works, but makes it hard to do things like allow the build graph to say that an artifact needs set or something like that; you end up duplicating the info in both files. Perhaps one option could be to attach a key/value pair to s. Well, yes and no. Yes, in that this has basically all the features of ninja and then some.
But no, because the rules here are all carefully constrained to avoid needing to do expensive file I/O to load the build graph. The most expensive new feature is , and it's intended to avoid an even more expensive step (rerunning the configuration step). It's also limited to changed files; it can't do arbitrary globbing on the contents of the directory the way that Make pattern rules can. Note that this also removes some features in ninja: shell commands are gone, process spawning is much less ambiguous, files are no longer parsed automatically. And because this embeds a clojure interpreter, many things that were hard-coded in ninja can instead be library functions: , response files, , . In this post, we have learned some downsides of Make and Ninja's build file formats, sketched out how they could possibly be fixed, and designed a language called Magma that has those characteristics. In the next post, I'll describe the features and design of a tool that evaluates and queries this language. see e.g. this description of how it works in buck2 ↩ at least a basic version—although several of the features of GNU Make get rather complicated. ↩ is https://github.com/ninja-build/ninja/blob/231db65ccf5427b16ff85b3a390a663f3c8a479f/misc/ninja_syntax.py . ↩ technically these aren't true monadic builds because they're constrained a lot more than e.g. Shake rules, they can't fabricate new rules from whole cloth. but they still allow you to add more outputs to the graph at runtime. ↩ This goes all the way around the configuration complexity clock and skips the "DSL" phase to simply give you a real language. ↩ This is totally based and not at all a terrible idea. ↩ This has a whole bunch of problems on Windows, where arguments are passed as a single string instead of an array, and each command has to reimplement its own parsing. But it will work “most” of the time, and at least avoids having to deal with Powershell or CMD quoting. 
↩ To make it possible to distinguish the two on the command line, could unambiguously refer to the group, like in Bazel. ↩

You don’t have a persistent server to store it in memory. When you don’t have a server, serializing makes your startup times much faster, because you don’t have to rerun the configure step each time.
You don’t have a remote build cache. When you have a remote cache, the rules for loading that cache can be rather complicated because they involve network queries 1 . When you have a local cache, loading it doesn’t require special support because it’s just opening a file.
You want to support querying, process spawning, and progress updates without rewriting the logic yourself for every OS (i.e. you don't want to write your own build executor).

a "real" language in the configuration step
reflection (querying the build graph)
file watching
support for discovering incorrect dependency edges

We care about it being simple and unambiguous to load the graph from the file, so we get fast incremental rebuild speed and graph queries. In particular, we want to touch the filesystem as little as possible while loading.
We care about supporting "weird" dependency edges, like dynamic dependencies and the depfiles emitted by a compiler after the first run, so that we're able to support more kinds of builds.
And finally, we care about minimizing reconfigurations : we want to be able to express as many things as possible in the action graph so we don't have to pay the cost of rerunning the configure step. This tends to be at odds with fast graph loading; adding features at this level of the stack is very expensive!
Negative dependencies
File rename dependencies
Optional file dependencies
Optional checksums to reduce false positives
Environment variable dependencies
"all the features of ninja" (depfiles, monadic builds through 4 , a statement, order-only dependencies)

A very very small clojure subset (just , , EDN , and function calls) for the text itself, no need to make loading the graph harder than necessary 5 . If people really want an equivalent of or I suppose this could learn support for and , but it would have much simpler rules than Clojure's classpath. It would not have support for looping constructs, nor most of clojure's standard library.
-inspired dependency edges: (for changes in file attributes), (for changes in the checksum), (for optional dependencies), ; plus our new edge.
A input function that can be used anywhere a file path can (e.g. in calls to ) so that the kind of edge does not depend on whether the path is known in advance or not.
Runtime functions that determine whether the configure step needs to be re-run based on file watch events 6 . Whether there is actually a file watcher or the build system just calculates a diff on its next invocation is an implementation detail; ideally, one that's easy to slot in and out.
“phony” targets would be replaced by a statement. Groups are sets of targets. Groups cannot be used to avoid “input not found” errors; that niche is filled by .
Command spawning is specified as an array 7 . No more dependency on shell quoting rules. If people want shell scripts they can put that in their configure script. Redirecting stdout no longer requires bash syntax, it's supported natively with the parameter of .
Build parameters can be referred to in rules through the argument.
is a thunk ; it only registers an intent to add edges in the future, it does not eagerly require to exist.
Our input edge is generalized and can apply to any rule, not just to the configure step.
It executes when a file is modified (or if the tool doesn’t support file watching, on each file in the calculated diff in the next tool invocation). Our edge provides the file event type, but not the file contents. This allows ronin to automatically map results to one of the three edge kinds: , , . and are not available through this API.
We naturally distinguish between “phony targets” and files because the former are s and the latter are s. No more accidentally failing to build if an file is created. 8
We naturally distinguish between “groups of targets” and “commands that always need to be rerun”; the latter just uses .
Data can be transformed in memory using clojure functions without needing a separate process invocation. No more need to use in your build system.
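The array-based spawning rule above is the same choice std::process-style APIs make; a quick illustration of why passing arguments as an array sidesteps shell quoting and word splitting entirely:

```rust
// Array-based command spawning: the argument reaches the child process
// verbatim, even with spaces, semicolons, and $(...) that a shell would
// split, interpret, or expand.
use std::process::Command;

fn main() {
    let out = Command::new("echo")
        .arg("a file with spaces; and $(a subshell)")
        .output()
        .expect("failed to spawn echo");
    let s = String::from_utf8_lossy(&out.stdout);
    // Printed literally; no shell ever sees the string, so nothing expands.
    println!("{}", s.trim());
}
```

With a single command string, the configure step would instead have to escape every argument correctly for the remote shell, which is exactly the class of bug the ninja/Make world keeps reinventing.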

マリウス 2 months ago

disable-javascript.org

With several posts on this website attracting significant views in the last few months I had come across plenty of feedback on the tab gimmick implemented last quarter . While the replies that I came across on platforms like the Fediverse and Bluesky were lighthearted and often humorous, the visitors coming from traditional link aggregators sadly weren’t as amused about it. Obviously a large majority of the people disagreeing with the core message behind this prank appear to be web developers, whose very existence quite literally depends on JavaScript, and who didn’t hold back in expressing their anger in the comment sections as well as through direct emails. Unfortunately, most commenters are missing the point. This email exchange is just one example of feedback that completely misses the point: I just found it a bit hilarious that your site makes notes about ditching and disable Javascript, and yet Google explicitly requires it for the YouTube embeds. Feels weird. The email contained the following attachment: Given the lack of context I assume that the author was referring to the YouTube embeds on this website (e.g. on the keyboard page). Here is my reply: Simply click the link on the video box that says “Try watching this video on www.youtube.com” and you should be directed to YouTube (or a frontend of your choosing with LibRedirect [1]) where you can watch it. Sadly, I don’t have the influence to convince YouTube to make their video embeds work without JavaScript enabled. ;-) However, if more people disabled JavaScript by default, maybe there would be a higher incentive for server-side rendering, and video embeds would at the very least show a thumbnail of the video (which YouTube could easily do, from a technical point of view). Kind regards!
[1]: https://libredirect.github.io It also appears that many of the people disliking the feature didn’t care to properly read the highlighted part of the popover that says “Turn JavaScript off, now, and only allow it on websites you trust!” : Indeed - and the author goes on to show a screenshot of Google Trends which, I’m sure, won’t work without JavaScript turned on. This comment perfectly encapsulates the flawed rhetoric. Google Trends (like YouTube in the previous example) is a website that is unlikely to exploit 0-days in your JavaScript engine, or at least that’s the general consensus. However, when you click on a link that looks like someone typed it by putting their head on the keyboard , and it leads you to a website you obviously didn’t know beforehand, it’s a different story. What I’m advocating for is to have JavaScript disabled by default for everything unknown to you , and only enable it for websites that you know and trust . Not only is this approach going to protect you from jump-scares , regardless of whether that’s a changing tab title, a popup, or an actual exploit, but it will hopefully pivot the thinking of web developers in particular back from “Let’s render the whole page using JavaScript and display nothing if it’s disabled” towards “Let’s make the page as functional as possible without the use of JavaScript and only sprinkle it on top as a way to make the experience better for anyone who chooses to enable it” . It is mind-boggling how this simple take is perceived as militant techno-minimalism and can provoke such salty feedback. I keep wondering whether these are the same people that consider to be a generally okay way to install software …? One of the many commenters who did agree with the approach that I’m taking on this site, however, put it fairly nicely: About as annoying as your friend who bumped key’ed his way into your flat in 5 seconds waiting for you in the living room. Or the protest blocking the highway making you late for work.
Many people don’t realize that JavaScript means running arbitrary untrusted code on your machine. […] Maybe the hacker ethos has changed, but I for one miss the days of small pranks and nudges to illustrate security flaws, instead of ransomware and exploits for cash. A gentle reminder that we can all do better, and the world isn’t always all that friendly. As the author of this comment correctly hints, the hacker ethos has in fact changed. My guess is that only a tiny fraction of the people that are actively commenting on platforms like Hacker News or Reddit these days know about, let’s say, cDc ’s Back Orifice , the BOFH stories, bash.org , and all the kickme.to/* links that would trigger a disconnect in AOL ’s dialup desktop software. Hence, the understanding of how far pranks in the 90s and early 2000s really went simply isn’t there. And with most things these days required to be politically correct , having the tab change to what looks like a Google image search for “sam bankman-fried nudes” is therefore frowned upon by many, even when the reason behind it is to inform. Frankly, it seems that conformism has eaten not only the internet, but to an extent the whole world, when an opinion that goes ever so slightly against the status quo is labelled as some sort of extreme view . To feel even just a “tiny bit violated by” something as mundane as a changing text and icon in the browser’s tab bar seems absurd, especially when it is you that allowed my website to run arbitrary code on your computer!
Because I’m convinced that a principled stance against the insanity that is the modern web is necessary, I am doubling down on this effort by making it an actual initiative: disable-javascript.org disable-javascript.org is a website that informs the average user about some of the most severe issues affecting the JavaScript ecosystem and browsers/users all over the world, and explains in simple terms how to disable JavaScript in various browsers and only enable it for specific, trusted websites. The site is linked on the JavaScript popover that appears on this website, so that visitors aren’t only pranked into hopefully disabling JavaScript, but can also easily find out how to do so. disable-javascript.org offers a JavaScript snippet that is almost identical to the one in use by this website, in case you would like to participate in the cause. Of course, you can also simply link to disable-javascript.org from anywhere on your website to show your support. If you’d like to contribute to the initiative by extending the website with valuable info, you can do so through its Git repository . Feel free to open pull-requests with the updates that you would like to see on disable-javascript.org . :-)
