Posts in Json (20 found)
Corrode 1 week ago

Cloudsmith

Rust adoption can be loud, like when companies such as Microsoft, Meta, and Google announce their use of Rust in high-profile projects. But there are countless smaller teams quietly using Rust to solve real-world problems, sometimes even without noticing. This episode tells one such story. Cian and his team at Cloudsmith have been adopting Rust in their Python monolith not because they wanted to rewrite everything in Rust, but because Rust extensions were simply best-in-class for the specific performance problems they were trying to solve in their Django application. As they had these initial successes, they gained more confidence in Rust and started using it in more and more areas of their codebase.

CodeCrafters helps you become proficient in Rust by building real-world, production-grade projects. Learn hands-on by creating your own shell, HTTP server, Redis, Kafka, Git, SQLite, or DNS service from scratch. Start for free today and enjoy 40% off any paid plan by using this link.

Made with love in Belfast and trusted around the world. Cloudsmith is the fully-managed solution for controlling, securing, and distributing software artifacts. They analyze every package, container, and ML model in an organization’s supply chain, allow blocking bad packages before they reach developers, and build an ironclad chain of custody.

Cian is a Service Reliability Engineer located in Dublin, Ireland. He has been working with Rust for 10 years and has a history of helping companies build reliable and efficient software. He has a BA in Computer Programming from Dublin City University.
Lee Skillen’s blog - The blog of Lee Skillen, Cloudsmith’s co-founder and CTO
Django - Python on Rails
Django Mixins - Great for scaling up, not great for long-term maintenance
SBOM - Software Bill of Materials
Microservice vs Monolith - Martin Fowler’s canonical explanation
Jaeger - “Debugger” for microservices
PyO3 - Rust-to-Python and Python-to-Rust FFI crate
orjson - Pretty fast JSON handling in Python using Rust
drf-orjson-renderer - Simple orjson wrapper for Django REST Framework
Rust in Python cryptography - Parsing complex data formats is just safer in Rust!
jsonschema-py - jsonschema in Python with Rust, mentioned in the PyO3 docs
WSGI - Python’s standard for HTTP server interfaces
uWSGI - An application server providing a WSGI interface
rustimport - Simply import Rust files as modules in Python, great for prototyping
granian - WSGI application server written in Rust with tokio and hyper
hyper - HTTP parsing and serialization library for Rust
HAProxy - Feature rich reverse proxy with good request queue support
nginx - Very common reverse proxy with very nice and readable config
locust - Fantastic load-test tool with configuration in Python
goose - Locust, but in Rust
Podman - Daemonless container engine
Docker - Container platform
buildx - Docker CLI plugin for extended build capabilities with BuildKit
OrbStack - Faster Docker for Desktop alternative
Rust in Production: curl with Daniel Stenberg - Talking about hyper’s strictness being at odds with curl’s permissive design
axum - Ergonomic and modular web framework for Rust
rocket - Web framework for Rust
Cloudsmith Website
Cian Butler’s Website
Cian’s E-Mail

Ahead of AI 1 week ago

Components of A Coding Agent

In this article, I want to cover the overall design of coding agents and agent harnesses: what they are, how they work, and how the different pieces fit together in practice. Readers of my Build a Large Language Model (From Scratch) and Build a Reasoning Model (From Scratch) books often ask about agents, so I thought it would be useful to write a reference I can point to.

More generally, agents have become an important topic because much of the recent progress in practical LLM systems is not just about better models, but about how we use them. In many real-world applications, the surrounding system, such as tool use, context management, and memory, plays as much of a role as the model itself. This also helps explain why systems like Claude Code or Codex can feel significantly more capable than the same models used in a plain chat interface.

In this article, I lay out six of the main building blocks of a coding agent.

You are probably familiar with Claude Code or the Codex CLI, but just to set the stage, they are essentially agentic coding tools that wrap an LLM in an application layer, a so-called agentic harness, to be more convenient and better-performing for coding tasks.

Figure 1: Claude Code CLI, Codex CLI, and my Mini Coding Agent.

Coding agents are engineered for software work where the notable parts are not only the model choice but the surrounding system, including repo context, tool design, prompt-cache stability, memory, and long-session continuity. That distinction matters because when we talk about the coding capabilities of LLMs, people often collapse the model, the reasoning behavior, and the agent product into one thing.

But before getting into the coding agent specifics, let me briefly provide a bit more context on the difference between the broader concepts: LLMs, reasoning models, and agents.

On The Relationship Between LLMs, Reasoning Models, and Agents

An LLM is the core next-token model.
A reasoning model is still an LLM, but usually one that was trained and/or prompted to spend more inference-time compute on intermediate reasoning, verification, or search over candidate answers.

An agent is a layer on top, which can be understood as a control loop around the model. Typically, given a goal, the agent layer (or harness) decides what to inspect next, which tools to call, how to update its state, when to stop, etc.

Roughly, we can think about the relationship as this: the LLM is the engine, a reasoning model is a beefed-up engine (more powerful, but more expensive to use), and an agent harness helps us use the model. The analogy is not perfect, because we can also use conventional and reasoning LLMs as standalone models (in a chat UI or Python session), but I hope it conveys the main point.

Figure 2: The relationship between conventional LLM, reasoning LLM (or reasoning model), and an LLM wrapped in an agent harness.

In other words, the agent is the system that repeatedly calls the model inside an environment. So, in short, we can summarize it like this:

LLM: the raw model
Reasoning model: an LLM optimized to output intermediate reasoning traces and to verify itself more
Agent: a loop that uses a model plus tools, memory, and environment feedback
Agent harness: the software scaffold around an agent that manages context, tool use, prompts, state, and control flow
Coding harness: a special case of an agent harness; i.e., a task-specific harness for software engineering that manages code context, tools, execution, and iterative feedback

As listed above, in the context of agents and coding tools, we also have the two popular terms agent harness and (agentic) coding harness. A coding harness is the software scaffold around a model that helps it write and edit code effectively. An agent harness is a bit broader and not specific to coding (e.g., think of OpenClaw). Codex and Claude Code can be considered coding harnesses.
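To make the "control loop around the model" idea concrete, here is a minimal sketch of such a loop. It is an illustration only, not the Mini Coding Agent's actual code: `scripted_model` is a hypothetical stand-in for an LLM call, and the tool names and action format (`{"tool": ..., "args": ...}`) are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    goal: str
    # Each entry is an (action, observation) pair from one loop iteration.
    history: list = field(default_factory=list)


def scripted_model(state):
    """Hypothetical stand-in for an LLM: choose the next action from the state."""
    if not state.history:
        return {"tool": "list_files", "args": {}}
    if len(state.history) == 1:
        return {"tool": "read_file", "args": {"path": "README.md"}}
    return {"tool": "stop", "args": {"answer": "done"}}


# Hypothetical tools that return canned strings instead of touching the disk.
TOOLS = {
    "list_files": lambda args: "README.md\nmain.py",
    "read_file": lambda args: f"(contents of {args['path']})",
}


def run_agent(goal, model, tools, max_steps=10):
    """Repeatedly ask the model for an action, execute it, and record the result."""
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        action = model(state)                       # choose the next step
        if action["tool"] == "stop":                # the model decides when to stop
            return action["args"]["answer"], state
        observation = tools[action["tool"]](action["args"])  # act
        state.history.append((action, observation))          # update state
    return None, state                              # step budget exhausted


answer, final_state = run_agent("summarize the repo", scripted_model, TOOLS)
```

In a real harness, the model call would go to an LLM and the tools would touch the file system, but the overall shape (choose, act, record, repeat until stop or a step limit) would likely look similar.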
Anyways, a better LLM provides a better foundation for a reasoning model (which involves additional training), and a harness gets more out of this reasoning model.

Sure, LLMs and reasoning models are also capable of solving coding tasks by themselves (without a harness), but coding work is only partly about next-token generation. A lot of it is about repo navigation, search, function lookup, diff application, test execution, error inspection, and keeping all the relevant information in context. (Coders may know that this is hard mental work, which is why we don’t like to be disrupted during coding sessions :)).

Figure 3: A coding harness combines three layers: the model family, an agent loop, and runtime supports. The model provides the “engine”, the agent loop drives iterative problem solving, and the runtime supports provide the plumbing. Within the loop, “observe” collects information from the environment, “inspect” analyzes that information, “choose” selects the next step, and “act” executes it.

The takeaway here is that a good coding harness can make both reasoning and non-reasoning models feel much stronger than they do in a plain chat box, because it helps with context management and more.

The Coding Harness

As mentioned in the previous section, when we say harness, we typically mean the software layer around the model that assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches stable prefixes, stores memory, and much more.

Today, when using LLMs, this layer shapes most of the user experience compared to prompting the model directly or using a web chat UI (which is closer to “chat with uploaded files”).

Since, in my view, the vanilla versions of LLMs nowadays have very similar capabilities (e.g., the vanilla versions of GPT-5.4, Opus 4.6, and GLM-5 or so), the harness can often be the distinguishing factor that makes one LLM work better than another.
This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code. That said, some harness-specific post-training is usually beneficial. For example, OpenAI historically maintained separate GPT-5.3 and GPT-5.3-Codex variants.

In the next section, I want to go more into the specifics and discuss the core components of a coding harness using my Mini Coding Agent: https://github.com/rasbt/mini-coding-agent.

Figure 4: Main harness features of a coding agent / coding harness that will be discussed in the following sections.

By the way, in this article, I use the terms “coding agent” and “coding harness” somewhat interchangeably for simplicity. (Strictly speaking, the agent is the model-driven decision-making loop, while the harness is the surrounding software scaffold that provides context, tools, and execution support.)

Figure 5: Minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python)

Anyways, below are six main components of coding agents. You can check out the source code of my minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python), for more concrete code examples. The code annotates the six components discussed below via code comments.

1. Live Repo Context

This is maybe the most obvious component, but it is also one of the most important ones. When a user says “fix the tests” or “implement xyz,” the model should know whether it is inside a Git repo, what branch it is on, which project documents might contain instructions, and so on. That’s because those details often change or affect what the correct action is.

For example, “Fix the tests” is not a self-contained instruction. If the agent sees AGENTS.md or a project README, it may learn which test command to run, etc. If it knows the repo root and layout, it can look in the right places instead of guessing.
Also, the git branch, status, and commits can help provide more context about what changes are currently in progress and where to focus.

Figure 6: The agent harness first builds a small workspace summary that gets combined with the user request for additional project context.

The takeaway is that the coding agent collects info (“stable facts” as a workspace summary) upfront before doing any work, so that it is not starting from zero, without context, on every prompt.

2. Prompt Shape And Cache Reuse

Once the agent has a repo view, the next question is how to feed that information to the model. The previous figure showed a simplified view of this (“Combined prompt: prefix + request”), but in practice, it would be relatively wasteful to combine and re-process the workspace summary on every user query.

That is, coding sessions are repetitive, and the agent rules usually stay the same. The tool descriptions usually stay the same, too. And even the workspace summary usually stays (mostly) the same. The main changes are usually the latest user request, the recent transcript, and maybe the short-term memory.

“Smart” runtimes don’t rebuild everything as one giant undifferentiated prompt on every turn, as illustrated in the figure below.

Figure 7: The agent harness builds a stable prompt prefix, adds the changing session state, and then feeds that combined prompt to the model.

The main difference from section 1 is that section 1 was about gathering repo facts. Here, we are now interested in packaging and caching those facts efficiently for repeated model calls.

The “stable” in “Stable prompt prefix” means that the information contained there doesn’t change too much. It usually contains the general instructions, tool descriptions, and the workspace summary. We don’t want to waste compute on rebuilding it from scratch in each interaction if nothing important has changed.

The other components are updated more frequently (usually each turn).
This includes short-term memory, the recent transcript, and the newest user request.

In short, the caching aspect for the “Stable prompt prefix” is simply that a smart runtime tries to reuse that part.

3. Tool Access and Use

Tool access and tool use are where it starts to feel less like chat and more like an agent.

A plain model can suggest commands in prose, but an LLM in a coding harness should do something narrower and more useful: it should actually be able to execute the command and retrieve the results (versus us calling the command manually and pasting the results back into the chat).

But instead of letting the model improvise arbitrary syntax, the harness usually provides a pre-defined list of allowed and named tools with clear inputs and clear boundaries. (But of course, something like Python can be part of this so that the agent could also execute an arbitrarily wide range of shell commands.)

The tool-use flow is illustrated in the figure below.

Figure 8: The model emits a structured action, the harness validates it, optionally asks for approval, executes it, and feeds the bounded result back into the loop.

To illustrate this, below is an example of how this usually looks to the user using my Mini Coding Agent. (This is not as pretty as Claude Code or Codex because it is very minimal and uses plain Python without any external dependencies.)

Figure 9: Illustration of a tool call approval request in the Mini Coding Agent.

Here, the model has to choose an action that the harness recognizes, like list files, read a file, search, run a shell command, write a file, etc. It also has to provide arguments in a shape that the harness can check.

So when the model asks to do something, the runtime can stop and run programmatic checks like “Is this a known tool?”, “Are the arguments valid?”, “Does this need user approval?”, and “Is the requested path even inside the workspace?” Only after those checks pass does anything actually run.
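The four checks listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Mini Coding Agent's actual implementation: the `TOOL_SPECS` registry, the tool names, and the action format (`{"tool": ..., "args": ...}`) are all hypothetical.

```python
from pathlib import Path

# Hypothetical tool registry: name -> (required argument names, needs approval?)
TOOL_SPECS = {
    "list_files": ((), False),
    "read_file": (("path",), False),
    "write_file": (("path", "content"), True),
    "run_shell": (("command",), True),
}


def validate_action(action, workspace):
    """Return (ok, reason, needs_approval) for a model-proposed action."""
    name = action.get("tool")
    args = action.get("args", {})

    # Check 1: is this a known tool?
    if name not in TOOL_SPECS:
        return False, f"unknown tool: {name}", False
    required, needs_approval = TOOL_SPECS[name]

    # Check 2: are the arguments valid (here: are all required ones present)?
    missing = [key for key in required if key not in args]
    if missing:
        return False, f"missing arguments: {missing}", False

    # Check 3: is the requested path inside the workspace?
    if "path" in args:
        root = Path(workspace).resolve()
        target = (root / args["path"]).resolve()  # also normalizes ".." segments
        if target != root and root not in target.parents:
            return False, "path escapes the workspace", False

    # Check 4: report whether the caller should ask the user for approval.
    return True, "ok", needs_approval


print(validate_action({"tool": "read_file", "args": {"path": "../secrets.txt"}}, "/tmp/ws"))
```

The function only decides; executing the tool and prompting the user for approval would happen in the caller. Keeping validation separate from execution makes it easy to test the checks without running anything.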
While running coding agents carries some risk, of course, the harness checks also improve reliability because the model doesn’t execute totally arbitrary commands.

Also, besides rejecting malformed actions and approval gating, file access can be kept inside the repo by checking file paths. In a sense, the harness is giving the model less freedom, but it also improves the usability at the same time.

Context bloat is not unique to coding agents but an issue for LLMs in general. Sure, LLMs are supporting longer and longer contexts these days (and I recently wrote about the attention variants that make it computationally more feasible), but long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info).

Coding agents are even more susceptible to context bloat than regular LLMs during multi-turn chats, because of repeated file reads, lengthy tool outputs, logs, etc. If the runtime keeps all of that at full fidelity, it will run out of available context tokens pretty quickly. So, a good coding harness is usually pretty sophisticated about handling context bloat beyond just cutting or summarizing information like regular chat UIs.

Conceptually, the context compaction in coding agents might work as summarized in the figure below. Specifically, we are zooming a bit further into the clip (step 6) part of Figure 8 in the previous section.

Figure 10: Large outputs are clipped, older reads are deduplicated, and the transcript is compressed before it goes back into the prompt.

A minimal harness uses at least two compaction strategies to manage that problem. The first is clipping, which shortens long document snippets, large tool outputs, memory notes, and transcript entries. In other words, it prevents any one piece of text from taking over the prompt budget just because it happened to be verbose.
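As a rough illustration of clipping, the sketch below keeps the head and tail of an overly long text and replaces the middle with a short marker. The function name `clip`, the character-based budget, and the marker format are assumptions for this example; a real harness might budget in tokens rather than characters.

```python
def clip(text, max_chars=2000, marker="\n...[clipped {n} chars]...\n"):
    """Keep the head and tail of an overly long text and note how much was cut."""
    if len(text) <= max_chars:
        return text
    head = max_chars // 2
    tail = max_chars - head
    omitted = len(text) - max_chars
    # Slice from an explicit start index so that tail == 0 yields an empty string.
    return text[:head] + marker.format(n=omitted) + text[len(text) - tail:]


long_output = "a" * 3000 + "b" * 3000
clipped = clip(long_output, max_chars=100)
```

Keeping both ends is a design choice: for logs and test output, the beginning often shows the command and setup, while the end often shows the error. Note that the result is slightly longer than `max_chars` because of the marker itself.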
The second strategy is transcript reduction or summarization, which turns the full session history (more on that in the next section) into a smaller promptable summary.

A key trick here is to keep recent events richer because they are more likely to matter for the current step. And we compress older events more aggressively because they are likely less relevant.

Additionally, we also deduplicate older file reads so the model does not keep seeing the same file content over and over again just because it was read multiple times earlier in the session.

Overall, I think this is one of the underrated, boring parts of good coding-agent design. A lot of apparent “model quality” is really context quality.

5. Structured Session Memory

In practice, all these 6 core concepts covered here are highly intertwined, and the different sections and figures cover them with different focuses or zoom levels.

In the previous section, we covered prompt-time use of history and how we build a compact transcript. The question there is: how much of the past should go back into the model on the next turn? So the emphasis is compression, clipping, deduplication, and recency.

Now, this section, structured session memory, is about the storage-time structure of history. The question here is: what does the agent keep over time as a permanent record? So the emphasis is that the runtime keeps a fuller transcript as a durable state, alongside a lighter memory layer that is smaller and gets modified and compacted rather than just appended to.

To summarize, a coding agent separates state into (at least) two layers:

working memory: the small, distilled state the agent keeps explicitly
a full transcript: this covers all the user requests, tool outputs, and LLM responses

Figure 11: New events get appended to a full transcript and summarized in a working memory. The session files on disk are usually stored as JSON files.
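The two-layer split can be sketched as a small class that appends every event to a full transcript while updating a compact working memory in place, with both persisted as JSON. This is a sketch under assumptions, not the Mini Coding Agent's actual code: the class name `SessionStore`, the file names `transcript.json` and `memory.json`, and the event fields (`role`, `content`, `path`, `note`) are hypothetical.

```python
import json
from pathlib import Path


class SessionStore:
    """Append-only full transcript plus a small working memory that is rewritten."""

    def __init__(self, directory):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.transcript_path = self.dir / "transcript.json"  # hypothetical file name
        self.memory_path = self.dir / "memory.json"          # hypothetical file name
        # Loading existing files is what makes a session resumable.
        self.transcript = self._load(self.transcript_path, [])
        self.memory = self._load(
            self.memory_path, {"task": "", "files": [], "notes": []}
        )

    @staticmethod
    def _load(path, default):
        return json.loads(path.read_text()) if path.exists() else default

    def record(self, event, max_notes=5):
        # Layer 1: the full transcript only ever grows.
        self.transcript.append(event)
        # Layer 2: the working memory is modified and compacted in place.
        if event.get("role") == "user":
            self.memory["task"] = event["content"]
        path = event.get("path")
        if path and path not in self.memory["files"]:
            self.memory["files"].append(path)
        if event.get("note"):
            notes = self.memory["notes"] + [event["note"]]
            self.memory["notes"] = notes[-max_notes:]  # keep only recent notes
        self.transcript_path.write_text(json.dumps(self.transcript, indent=2))
        self.memory_path.write_text(json.dumps(self.memory, indent=2))
```

Creating a second `SessionStore` on the same directory picks up where the first left off, which mirrors the resumability of the full transcript.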
Figure 11 illustrates the two main session files, the full transcript and the working memory, that usually get stored as JSON files on disk. As mentioned before, the full transcript stores the whole history, and it’s resumable if we close the agent. The working memory is more of a distilled version with the currently most important info, which is somewhat related to the compact transcript.

But the compact transcript and working memory have slightly different jobs. The compact transcript is for prompt reconstruction. Its job is to give the model a compressed view of recent history so it can continue the conversation without seeing the full transcript every turn. The working memory is meant more for task continuity. Its job is to keep a small, explicitly maintained summary of what matters across turns, things like the current task, important files, and recent notes.

Following step 4 in Figure 11, the latest user request, together with the LLM response and tool output, would then be recorded as a “new event” in both the full transcript and working memory in the next round, which is not shown to reduce clutter in the figure.

Once an agent has tools and state, one of the next useful capabilities is delegation.

The reason is that it allows us to split certain work into subtasks that subagents can handle in parallel, which speeds up the main task. For example, the main agent may be in the middle of one task and still need a side answer, such as which file defines a symbol, what a config says, or why a test is failing. It is useful to split that off into a bounded subtask instead of forcing one loop to carry every thread of work at once. (In my mini coding agent, the implementation is simpler, and the child still runs synchronously, but the underlying idea is the same.)

A subagent is only useful if it inherits enough context to do real work.
But if we don’t restrict it, we now have multiple agents duplicating work, touching the same files, or spawning more subagents, and so on.

So the tricky design problem is not just how to spawn a subagent but also how to bound one :).

Figure 12: The subagent inherits enough context to be useful, but it runs inside tighter boundaries than the main agent.

The trick here is that the subagent inherits enough context to be useful, but is also constrained (for example, read-only and restricted in recursion depth).

Claude Code has supported subagents for a long time, and Codex added them more recently. Codex does not generally force subagents into read-only mode. Instead, they usually inherit much of the main agent’s sandbox and approval setup. So, the boundary is more about task scoping, context, and depth.

The sections above tried to cover the main components of coding agents. As mentioned before, they are more or less deeply intertwined in their implementation. However, I hope that covering them one by one helps with the overall mental model of how coding harnesses work, and why they can make the LLM more useful compared to simple multi-turn chats.

Figure 13: Six main features of a coding harness discussed in previous sections.

If you are interested in seeing these implemented in clean, minimalist Python code, you may like my Mini Coding Agent.

OpenClaw may be an interesting comparison, but it is not quite the same kind of system. OpenClaw is more like a local, general agent platform that can also code, rather than being a specialized (terminal) coding assistant.

There are still several overlaps with a coding harness:

it uses prompt and instruction files in the workspace, such as AGENTS.md, SOUL.md, and TOOLS.md
it keeps JSONL session files and includes transcript compaction and session management
it can spawn helper sessions and subagents

However, as mentioned above, the emphasis is different.
Coding agents are optimized for a person working in a repository and asking a coding assistant to inspect files, edit code, and run local tools efficiently. OpenClaw is more optimized for running many long-lived local agents across chats, channels, and workspaces, with coding as one important workload among several others.

I am excited to share that I finished writing Build a Reasoning Model (From Scratch) and all chapters are in early access now. The publisher is currently working on the layouts, and it should be available this summer.

This is probably my most ambitious book so far. I spent about 1.5 years writing it, and a large number of experiments went into it. It is also probably the book I worked hardest on in terms of time, effort, and polish, and I hope you’ll enjoy it.

Build a Reasoning Model (From Scratch) on Manning and Amazon.

The main topics are:

evaluating reasoning models
inference-time scaling
self-refinement
reinforcement learning
distillation

There is a lot of discussion around “reasoning” in LLMs, and I think the best way to understand what it really means in the context of LLMs is to implement one from scratch!

Amazon (pre-order)
Manning (complete book in early access, pre-final layout, 528 pages)
A reasoning model is still an LLM, but usually one that was trained and/or prompted to spend more inference-time compute on intermediate reasoning, verification, or search over candidate answers. An agent is a layer on top, which can be understood as a control loop around the model. Typically, given a goal, the agent layer (or harness) decides what to inspect next, which tools to call, how to update its state, and when to stop, etc. Roughly, we can think about the relationship as this: the LLM is the engine, a reasoning model is a beefed-up engine (more powerful, but more expensive to use), and an agent harness helps us the model. The analogy is not perfect, because we can also use conventional and reasoning LLMs as standalone models (in a chat UI or Python session), but I hope it conveys the main point. Figure 2: The relationship between conventional LLM, reasoning LLM (or reasoning model), and an LLM wrapped in an agent harness. In other words, the agent is the system that repeatedly calls the model inside an environment. So, in short, we can summarize it like this: LLM: the raw model Reasoning model : an LLM optimized to output intermediate reasoning traces and to verify itself more Agent: a loop that uses a model plus tools, memory, and environment feedback Agent harness: the software scaffold around an agent that manages context, tool use, prompts, state, and control flow Coding harness: a special case of an agent harness; i.e., a task-specific harness for software engineering that manages code context, tools, execution, and iterative feedback Figure 3. A coding harness combines three layers: the model family, an agent loop, and runtime supports. The model provides the “engine”, the agent loop drives iterative problem solving, and the runtime supports provide the plumbing. Within the loop, “observe” collects information from the environment, “inspect” analyzes that information, “choose” selects the next step, and “act” executes it. 
The takeaway here is that a good coding harness can make a reasoning and a non-reasoning model feel much stronger than it does in a plain chat box, because it helps with context management and more. The Coding Harness As mentioned in the previous section, when we say harness , we typically mean the software layer around the model that assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches stable prefixes, stores memory, and many more. Today, when using LLMs, this layer shapes most of the user experience compared to prompting the model directly or using web chat UI (which is closer to “chat with uploaded files”). Since, in my view, the vanilla versions of LLMs nowadays have very similar capabilities (e.g., the vanilla versions of GPT-5.4, Opus 4.6, and GLM-5 or so), the harness can often be the distinguishing factor that makes one LLM work better than another. This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code. That said, some harness-specific post-training is usually beneficial. For example, OpenAI historically maintained separate GPT-5.3 and GPT-5.3-Codex variants. In the next section, I want to go more into the specifics and discuss the core components of a coding harness using my Mini Coding Agent : https://github.com/rasbt/mini-coding-agent . Figure 4: Main harness features of a coding agent / coding harness that will be discussed in the following sections. By the way, in this article, I use the terms “coding agent” and “coding harness” somewhat interchangeably for simplicity. (Strictly speaking, the agent is the model-driven decision-making loop, while the harness is the surrounding software scaffold that provides context, tools, and execution support.) 
Figure 5: Minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python) Anyways, below are six main components of coding agents. You can check out the source code of my minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python), for more concrete code examples. The code annotates the six components discussed below via code comments: 1. Live Repo Context This is maybe the most obvious component, but it is also one of the most important ones. When a user says “fix the tests” or “implement xyz,” the model should know whether it is inside a Git repo, what branch it is on, which project documents might contain instructions, and so on. That’s because those details often change or affect what the correct action is. For example, “Fix the tests” is not a self-contained instruction. If the agent sees AGENTS.md or a project README, it may learn which test command to run, etc. If it knows the repo root and layout, it can look in the right places instead of guessing. Also, the git branch, status, and commits can help provide more context about what changes are currently in progress and where to focus. Figure 6: The agent harness first builds a small workspace summary that gets combined with the user request for additional project context. The takeaway is that the coding agent collects info (”stable facts” as a workspace summary) upfront before doing any work, so that it’s is not starting from zero, without context, on every prompt. 2. Prompt Shape And Cache Reuse Once the agent has a repo view, the next question is how to feed that information to the model. The previous figure showed a simplified view of this (“Combined prompt: prefix + request”), but in practice, it would be relatively wasteful to combine and re-process the workspace summary on every user query. I.e., coding sessions are repetitive, and the agent rules usually stay the same. The tool descriptions usually stay the same, too. 
And even the workspace summary usually stays (mostly) the same. The main changes are usually the latest user request, the recent transcript, and maybe the short-term memory. “Smart” runtimes don’t rebuild everything as one giant undifferentiated prompt on every turn, as illustrated in the figure below. Figure 7: The agent harness builds a stable prompt prefix, adds the changing session state, and then feeds that combined prompt to the model. The main difference from section 1 is that section 1 was about gathering repo facts. Here, we are now interested in packaging and caching those facts efficiently for repeated model calls. The “stable” “Stable prompt prefix” means that the information contained there doesn’t change too much. It usually contains the general instructions, tool descriptions, and the workspace summary. We don’t want to waste compute on rebuilding it from scratch in each interaction if nothing important has changed. The other components are updated more frequently (usually each turn). This includes short-term memory, the recent transcript, and the newest user request. In short, the caching aspect for the “Stable prompt prefix” is simply that a smart runtime tries to reuse that part. 3. Tool Access and Use Tool access and tool use are where it starts to feel less like chat and more like an agent. A plain model can suggest commands in prose, but an LLM in a coding harness should do something narrower and more useful and be actually able to execute the command and retrieve the results (versus us calling the command manually and pasting the results back into the chat). But instead of letting the model improvise arbitrary syntax, the harness usually provides a pre-defined list of allowed and named tools with clear inputs and clear boundaries. (But of course, something like Python can be part of this so that the agent could also execute an arbitrary wide list of shell commands.) The tool-use flow is illustrated in the figure below. 
Figure 8: The model emits a structured action, the harness validates it, optionally asks for approval, executes it, and feeds the bounded result back into the loop.
To illustrate this, below is an example of how this usually looks to the user using my Mini Coding Agent. (This is not as pretty as Claude Code or Codex because it is very minimal and uses plain Python without any external dependencies.)
Figure 9: Illustration of a tool call approval request in the Mini Coding Agent.
Here, the model has to choose an action that the harness recognizes, like list files, read a file, search, run a shell command, write a file, etc. It also has to provide arguments in a shape that the harness can check. So when the model asks to do something, the runtime can stop and run programmatic checks like “Is this a known tool?”, “Are the arguments valid?”, “Does this need user approval?”, and “Is the requested path even inside the workspace?”
Figure 10: Large outputs are clipped, older reads are deduplicated, and the transcript is compressed before it goes back into the prompt.
A minimal harness uses at least two compaction strategies to manage that problem. The first is clipping, which shortens long document snippets, large tool outputs, memory notes, and transcript entries. In other words, it prevents any one piece of text from taking over the prompt budget just because it happened to be verbose. The second strategy is transcript reduction or summarization, which turns the full session history (more on that in the next section) into a smaller promptable summary. A key trick here is to keep recent events richer because they are more likely to matter for the current step. And we compress older events more aggressively because they are likely less relevant. Additionally, we also deduplicate older file reads so the model does not keep seeing the same file content over and over again just because it was read multiple times earlier in the session.
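The clipping, recency, and deduplication ideas above can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the Mini Coding Agent's actual code: the `Event` fields, the `clip` and `compact` helpers, and all size limits are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str       # e.g. "user", "assistant", "read_file", "shell" (hypothetical kinds)
    text: str       # message text or tool output
    path: str = ""  # only set for file reads (hypothetical field)


def clip(text: str, limit: int) -> str:
    """Shorten text to at most `limit` characters and mark the cut."""
    if len(text) <= limit:
        return text
    marker = " ...[clipped]"
    return text[: max(0, limit - len(marker))] + marker


def compact(events: list[Event], recent: int = 3,
            recent_limit: int = 400, old_limit: int = 80) -> list[str]:
    """Build a prompt-sized transcript from the full event list."""
    # Index of the latest read per file, so earlier reads of the same
    # path can be replaced by a one-line stub (deduplication).
    last_read = {e.path: i for i, e in enumerate(events) if e.kind == "read_file"}
    lines = []
    for i, e in enumerate(events):
        if e.kind == "read_file" and last_read[e.path] != i:
            lines.append(f"[read_file] {e.path} (superseded by a later read)")
            continue
        # Recent events keep more detail; older ones are clipped harder.
        is_recent = i >= len(events) - recent
        limit = recent_limit if is_recent else old_limit
        lines.append(f"[{e.kind}] {clip(e.text, limit)}")
    return lines
```

A real harness would likely also summarize older events with the model itself; the sketch only shows the cheap, purely programmatic part.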
Overall, I think this is one of the underrated, boring parts of good coding-agent design. A lot of apparent “model quality” is really context quality.
5. Structured Session Memory
In practice, all six core concepts covered here are highly intertwined, and the different sections and figures cover them with different focuses or zoom levels. In the previous section, we covered prompt-time use of history and how we build a compact transcript. The question there is: how much of the past should go back into the model on the next turn? So the emphasis is compression, clipping, deduplication, and recency. Now, this section, structured session memory, is about the storage-time structure of history. The question here is: what does the agent keep over time as a permanent record? So the emphasis is that the runtime keeps a fuller transcript as a durable state, alongside a lighter memory layer that is smaller and gets modified and compacted rather than just appended to. To summarize, a coding agent separates state into (at least) two layers: the working memory, which is the small, distilled state the agent keeps explicitly, and a full transcript, which covers all the user requests, tool outputs, and LLM responses.
Figure 11: New events get appended to a full transcript and summarized in a working memory. The session files on disk are usually stored as JSON files.
The figure above illustrates the two main session files, the full transcript and the working memory, that usually get stored as JSON files on disk. As mentioned before, the full transcript stores the whole history, and it’s resumable if we close the agent. The working memory is more of a distilled version with the currently most important info, which is somewhat related to the compact transcript. But the compact transcript and working memory have slightly different jobs. The compact transcript is for prompt reconstruction.
Its job is to give the model a compressed view of recent history so it can continue the conversation without seeing the full transcript every turn. The working memory is more meant for task continuity. Its job is to keep a small, explicitly maintained summary of what matters across turns, things like the current task, important files, and recent notes. Following step 4 in the figure above, the latest user request, together with the LLM response and tool output, would then be recorded as a “new event” in both the full transcript and working memory in the next round; this is not shown, to reduce clutter in the figure.
6. Delegation With (Bounded) Subagents
Once an agent has tools and state, one of the next useful capabilities is delegation. The reason is that it allows us to parallelize certain work into subtasks via subagents and speed up the main task. For instance, the main agent may be in the middle of one task and still need a side answer, such as which file defines a symbol, what a config says, or why a test is failing. It is useful to split that off into a bounded subtask instead of forcing one loop to carry every thread of work at once. (In my mini coding agent, the implementation is simpler, and the child still runs synchronously, but the underlying idea is the same.) A subagent is only useful if it inherits enough context to do real work. But if we don’t restrict it, we now have multiple agents duplicating work, touching the same files, or spawning more subagents, and so on. So the tricky design problem is not just how to spawn a subagent but also how to bound one :).
Figure 12: The subagent inherits enough context to be useful, but it runs inside tighter boundaries than the main agent.
The trick here is that the subagent inherits enough context to be useful, but is also constrained (for example, read-only and restricted in recursion depth). Claude Code has supported subagents for a long time, and Codex added them more recently.
Codex does not generally force subagents into read-only mode. Instead, they usually inherit much of the main agent’s sandbox and approval setup. So, the boundary is more about task scoping, context, and depth.
Components Summary
The sections above tried to cover the main components of coding agents. As mentioned before, they are more or less deeply intertwined in their implementation. However, I hope that covering them one by one helps with the overall mental model of how coding harnesses work, and why they can make the LLM more useful compared to simple multi-turn chats.
Figure 13: Six main features of a coding harness discussed in previous sections.
If you are interested in seeing these implemented in clean, minimalist Python code, you may like my Mini Coding Agent.
How Does This Compare To OpenClaw?
OpenClaw may be an interesting comparison, but it is not quite the same kind of system. OpenClaw is more like a local, general agent platform that can also code, rather than being a specialized (terminal) coding assistant. There are still several overlaps with a coding harness: it uses prompt and instruction files in the workspace, such as AGENTS.md, SOUL.md, and TOOLS.md; it keeps JSONL session files and includes transcript compaction and session management; and it can spawn helper sessions and subagents.
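The programmatic checks from the tool-use section (“Is this a known tool?”, “Are the arguments valid?”, “Does this need user approval?”, “Is the requested path even inside the workspace?”) can be sketched as a single validation function. This is only a rough sketch under stated assumptions: the `TOOLS` registry, the tool names, the action shape (`{"tool": ..., "args": ...}`), and `validate_action` are all hypothetical and not taken from the Mini Coding Agent, Claude Code, or Codex.

```python
from pathlib import Path

# Hypothetical registry: tool name -> (required argument names, needs approval)
TOOLS = {
    "list_files": ({"path"}, False),
    "read_file": ({"path"}, False),
    "write_file": ({"path", "content"}, True),
    "run_shell": ({"command"}, True),
}


def validate_action(action: dict, workspace: Path) -> tuple[bool, str]:
    """Return (ok, detail). For ok=True, detail is "run" or "approve"."""
    name = action.get("tool")
    args = action.get("args", {})
    # Is this a known tool?
    if name not in TOOLS:
        return False, f"unknown tool: {name!r}"
    required, needs_approval = TOOLS[name]
    # Are the arguments valid (here: exactly the expected names)?
    if not isinstance(args, dict) or set(args) != required:
        return False, f"expected arguments {sorted(required)}"
    # Is the requested path inside the workspace?
    if "path" in args:
        root = workspace.resolve()
        target = (root / args["path"]).resolve()
        if target != root and root not in target.parents:
            return False, "path is outside the workspace"
    # Does this need user approval before execution?
    return True, ("approve" if needs_approval else "run")
```

The harness would call this before executing anything, and only prompt the user when the result is `(True, "approve")`.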

A Room of My Own 1 weeks ago

Craving Quiet: Stepping Away for a While

Lately I've realised that even though I'm barely on social media, my life still feels 95% digital. I don't post on LinkedIn. My Instagram account mostly exists so I can open links people send me when I absolutely have to. I only keep a fake Facebook account for Marketplace and I use my real account (I've had it since the beginning of Facebook and all my friends live there so it stays) for Messenger only. But there is more than social media to occupy our time now. My days are still full of feeds, links, apps, messages (WhatsApp groups and such), digital projects, and little things I feel like I should be keeping up with. And they are easy to keep up with, my phone is always in my hand anyway. RELATED: I Choose Living Over Documenting On the Compulsion to Record The Journal Project I Can’t Quit The Art of Organizing (Things That Don’t Need to Be Organized) At work we showcase our AI agents and I wonder (from my anecdotal experience) if we are creating more busy work for ourselves and replacing reflection and, with it, the actual productivity and output and good old “getting the job done.” Most of our work meetings now have extensive transcripts that turn into minutes, notes, action points and insights. I remember when the output of such a meeting would be 2-3 points that we actually remembered. AI Generated Workslop certainly is a thing now. I need a break from it all. And from all the self-imposed shoulds such as scanning my old journals into Day One. Backing up Day One, which hasn't been backed up in a while. An external hard drive backup that's probably a year overdue. A Trello board full of things I want to do but don't really want to or have to do, or maybe I want to do them but can't justify the time when I already feel so busy. After a full day of work and virtual meetings, I feel completely depleted. Those self-imposed obligations, things that used to be fun because they were few and far between, are no longer acceptable.
I used to sneak in 15 minutes of personal things at work. Now when I have a break, I'd rather grab a coffee with someone or go for a walk. I crave analog. I crave nature. I crave quiet thinking time (not with a meditation app). I have made some changes already and they seem to be sticking. We have dinner at the table now, which has been good, at least we get some family time before everyone retreats to their own corners. We used to eat while watching a show together as a family, which is fine every now and then, but it was too much of it all. But still my phone is somewhere nearby, and I'm half-watching TV and half-checking a message or voice journaling into an app. None of it is thoughtful. It's just me blabbering. My brain feels like it's all over the place. I used to be able to sit with my own thoughts. I haven't been able to do that in a long time. My daughter broke her arm two weeks ago. She has a purple cast all her friends signed, and she was wondering whether to keep it when it comes off. I told her how I broke my arm as a kid, and she asked if I kept my cast. I said I would have liked to, but what we have now is better. I can take a clear photo of hers and she'll have that memory without keeping the physical thing. Then she asked if I had a photo of mine. I didn't. It never even occurred to me. Back then we took maybe 20 photos a year, if that, and they were all the more precious for it. Now I'm struggling to keep my monthly saves under 150 photos and screenshots, most of which I probably don't need. RELATED: My Photo Management and Memory Keeping Workflow I love my Day One journals , I really do. I just exported all of 2025 to PDF and JSON. But reading back through it, it's every tiny minutia of my life. I like to think it'll be interesting to me one day. Probably not to anyone else. And I wonder whether the time I spent on it was worth it. Yes, there are some insights there , but nothing that I didn’t already know. 
Had I allowed myself that thinking time instead of outsourcing it to AI. RELATED: Committing to the Thinking Life If my house burned down and I lost everything, the memories that matter are still in my head. I'm a cumulative experience of all of it. Do I need the artifact to know who I am? I still have journals from my 20s and 30s sitting back home in Bosnia. Thick ones, full of pasted tickets and stubs and mementos. I haven't looked at them in years but I can't let them go. My plan is to eventually scan them, maybe pay one of my kids to do it since they won't be able to read my handwriting anyway. RELATED: Letting Go of Old Journals and Mementos But anyway. The point is, I just need a break. From reading things online, from note-keeping, from digital journaling, blogging, saving notes and highlights (even my Readwise subscription feels intrusive now), from all of it. I've decided to do a 30-day digital detox. Within reason, because I still have to work. But I'm off until Tuesday, so I have a few days to ease into it. I'm lucky and privileged that I can do this. That I can shut down for a while and stop following things I can't influence and let go of expectations I put on myself. So that's what I'm doing. Simplifying my phone, deleting apps, putting the phone away when I get home. If we're watching something as a family, fine. One episode. But otherwise, even if I'm bored and restless, I'll go for a walk or play a board game, read a book. Journal (on paper). I'll do nothing, like I used to. Go to bed early. Meet a friend for coffee (and be more proactive about that). It's all become too hard because easy distractions that scratch the itch of everything are too easy. Calm my mind. Slow down. It's been too much. Time to reclaim myself. And if you've gotten this far, the world is reminding me once again of E.M. Forster's The Machine Stops , which I wrote about in 2020 . It feels eerily even more relevant now.

Evan Hahn 2 weeks ago

Notes from March 2026

March always seems to be my life’s busiest month. “The two kinds of error”: in my mind, software errors are divided into two categories: expected and unexpected errors. I finally wrote up this idea I’ve had for a long time. “All tests pass” is a short story about a strange, and sorta sad, experience I had with a coding agent. Inspired by others, I published a disclaimer about how I use generative AI to write this blog. My main rule of thumb: the final product must be word-for-word what I would’ve written without AI, given enough time. And I have discomfort about its use. Built llm-eliza, a plugin for LLM that lets you use the ELIZA chatbot at the command line. I think this is my first satirical software project. (Also the first thing I’ve published to the Python package registry, PyPI.) Found the human.json standard, which is “a protocol for humans to assert authorship of their site content and vouch for the humanity of others.” I added it to my site this month. Scraped Rosetta Code and built a stupid little website that picks a random programming language. At work, I helped with a project to improve the editor for Ghost’s “welcome emails” feature. This month marked the one year anniversary of my first post on Zelda Dungeon. I celebrated by writing more articles, including a treatise on the difference between 2D and 3D games and a personal piece about Ocarina of Time. I also wrote my first article that contained an interview, which was a skill I’m totally new to. It’s a small change, but I fixed a little bug in fzf. From a tale about vibe coding: “I’d be embarrassed to show it at a code review. I’d also be embarrassed to admit how many times I failed to ship the ‘clean’ version.” “Claude is the only AI model that has actually been deployed inside classified [American] military systems. So to the extent that AI is having an effect in Iran, it is probably Claude.” From a Hard Fork podcast episode.
From “AI’s Enthusiasm Chasm” : “people—well, again, most people—don’t enjoy existing in a strict state of quantification. Pursuits and pastimes—joy—are underpinned by qualitative thought, and those considerations make people less likely to want to involve AI just to get something at a tenth of the cost or five times faster.” “The Cognitive Dark Forest” posits that AI forces us, socially, to close down the open web. “The sheer act of thinking outside the box makes the box bigger.” This post has a good—if incomplete—list of all the downsides of generative AI: perpetuation of bias, erosion of critical thinking, harm to artists, and more. Uber used to be inexpensive because it was subsidized by VC money. Now it’s more costly because they needed to stop losing money. “Don’t get used to cheap AI” posits that the same will happen with AI. Similar ideas are presented in “Is the Future of AI Local?” . From “It’s time to embrace climate conspiracy” : “the actual story of climate change—the one we’ve reported exhaustively—is one about coordinated power, deliberate deception, and a bought-off government that repeatedly acts to promote an industry that is poisoning humans and the environment for profit. It just so happens to be a real conspiracy.” Really liked this short piece about what’s lost when new technology becomes commonplace . Few people today remember what we lost when we switched from candles to lightbulbs. “we don’t need more ram, we need better software” had me whispering “hell yeah” to myself. I’ve long pondered a blog post called “Why I’m afraid of YAML”. This post from a former colleague says it better than I ever could. “Costs of War” highlights the costs, financial and otherwise, of the United States’s wars. The US FBI is buying location data for surveillance , as is our Secret Service . This review of the new Marathon shooter game was surprisingly poignant. 
“It’s just thoughts and if I don’t get them out, my tummy hurts.” As a Legend of Zelda fan and programmer, I was happy to discover YouTuber Skawo . Their videos explain Zelda quirks by delving into real source code. I especially liked this explanation of why some players were experiencing rumble in a game that shouldn’t have it . The US effectively bans foreign-made routers. Hope you had a good March.

neilzone 2 weeks ago

Implementing the somewhat whimsical human.json protocol on my website

Terence blogged about adding a human.json file to his website. I wanted to do the same. The specification for human.json describes itself as a lightweight protocol for humans to assert authorship of their site content and vouch for the humanity of others. It uses URL ownership as identity, and trust propagates through a crawlable web of vouches between sites. A bit like signing each other’s PGP keys, really. There are a few steps: add a json file to your webserver, with some basic information; update that file when you “vouch” for someone else’s site, as being created by a human and free of AI; add some header material to your website, to reference the source of your human.json file; set a couple of web server headers (below); and use a browser extension to surface that file on other people’s websites if they have implemented human.json. I made a simple bash script to simplify the process of creating the json to vouch for someone: I am sure that there are better ways of doing this, but it works for me. I am using a separate directory for this json file, as it wants specific headers. I am using apache, so in the file in , I have: Using the Firefox browser extension, which is probably available for other browsers too, I can see if a site offers a human.json file, or is vouched for by another person whose own human.json file I already trust. Will it catch on? I doubt it. It is a bit of whimsy, and that is no bad thing. I have only included URLs where the site owner has consented for me to do so. If you are such a person and wish me to remove the “vouch” from my site, then please do just let me know. Consent is sexy. Because I am low-key “vouching” for people, I’ve only vouched for people that I know, even for a relatively limited definition of “know”. Not strangers, but not limited to the most intimate of relationships either. Mostly fedi friends, which is nice. Is it bad? I don’t think so. I have seen a couple of comments about it being a useful thing for AI scrapers to follow, but frankly they seem to be doing just fine anyway. If signalling to fellow humans also attracts unwanted traffic, well, in this case, so be it.

Max Bernstein 2 weeks ago

Using Perfetto in ZJIT

Originally published on Rails At Scale . Look! A trace of slow events in a benchmark! Hover over the image to see it get bigger. Now read on to see what the slow events are and how we got this pretty picture. The first rule of just-in-time compilers is: you stay in JIT code. The second rule of JIT is: you STAY in JIT code! When control leaves the compiled code to run in the interpreter—what the ZJIT team calls either a “side-exit” or a “deopt”, depending on who you talk to—things slow down. In a well-tuned system, this should happen pretty rarely. Right now, because we’re still bringing up the compiler and runtime system, it happens more than we would like. We’re reducing the number of exits over time. We can track our side-exit reduction progress with , which, on process exit, prints out a tidy summary of the counters for all of the bad stuff we track. It’s got side-exits. It’s got calls to C code. It’s got calls to slow-path runtime helpers. It’s got everything. Here is a chopped-up sample of stats output for the Lobsters benchmark, which is a large Rails app: (I’ve cut out significant chunks of the stats output and replaced them with because it’s overwhelming the first time you see it.) The first thing you might note is that the thing I just described as terrible for performance is happening over twelve million times . The second thing you might notice is that despite this, we’re staying in JIT code seemingly a high percentage of the time. Or are we? Is 80% high? Is a 4.5% class guard miss ratio high? What about 11% for shapes? It’s hard to say. The counters are great because they’re quick and they’re reasonably stable proxies for performance. There’s no substitute for painstaking measurements on a quiet machine but if the counter for Bad Slow Thing goes down (and others do not go up), we’re probably doing a good job. But they’re not great for building intuition. For intuition, we want more tangible feeling numbers. We want to see things. 
The third thing is that you might ask yourself “self, where are these exits coming from?” Unfortunately, counters cannot tell you that. For that, we want stack traces. This lets us know where in the guest (Ruby) code triggers an exit. Ideally we would also want some notion of time: we would want to know not just where these events happen but also when. Are the exits happening early, at application boot? At warmup? Even during what should be steady state application time? Hard to say. So we need more tools. Thankfully, Perfetto exists. Perfetto is a system for visualizing and analyzing traces and profiles that your application generates. It has both a web UI and a command-line UI. We can emit traces for Perfetto and visualize them there. Take a look at this sample ZJIT Perfetto trace generated by running Ruby with 1 . What do you see? I see a couple arrows on the left. Arrows indicate “instant” point-in-time events. Then I see a mess of purple to the right of that until the end of the trace. Hover over an arrow. Find out that each arrow is a side-exit. Scream silently. But it’s a friendly arrow. It tells you what the side-exit reason is. If you click it, it even tells you the stack trace in the pop-up panel on the bottom. If we click a couple of them, maybe we can learn more. We can also zoom by mousing over the track, holding Ctrl, and scrolling. That will let us look closer. But there are so many… Fortunately, Perfetto also provides a SQL interface to the traces. We can write a query to aggregate all of the side exit events from the table and line them up with the topmost method from the backtrace arguments in the table: This pulls up a query box at the bottom showing us that there are a couple big hotspots: It even has a helpful option to export the results as a Markdown table so I can paste (an edited version) into this blog post: Looks like we should figure out why we’re having shape misses so much and that will clear up a lot of exits.
(Hint: it’s because once we make our first guess about what we think the object shape will be, we don’t re-assess… yet.) This has been a taste of Perfetto. There’s probably a lot more to explore. Please join the ZJIT Zulip and let us know if you have any cool tracing or exploring tricks. Now I’ll explain how you too can use Perfetto from your system. Adding support to ZJIT was pretty straightforward. The first thing is that you’ll need some way to get trace data out of your system. We write to a file with a well-known location ( ), but you could do any number of things. Perhaps you can stream events over a socket to another process, or to a server that aggregates them, or store them internally and expose a webserver that serves them over the internet, or… anything, really. Once you have that, you need a couple lines of code to emit the data. Perfetto accepts a number of formats. For example, in his excellent blog post, Tristan Hume opens with such a simple snippet of code for logging Chromium Trace JSON-formatted events (lightly modified by me): This snippet is great. It shows, end-to-end, writing a stream of one event. It is a complete (X) event, as opposed to either two discrete timestamped begin (B) and end (E) events that book-end something, or an instant (i) event that has no duration, or a couple other event types in the Chromium Trace Event Format doc. It was enough to get me started. Since it’s JSON, and we have a lot of side exits, the trace quickly ballooned to 8GB for a several second benchmark. Not great. Now, part of this is our fault (we should side exit less) and part of it is just the verbosity of JSON. Thankfully, Perfetto ingests more compact binary formats, such as the Fuchsia trace format. In addition to being more compact, FXT even supports string interning. After modifying the tracer to emit FXT, we ended up with closer to 100MB for the same benchmark. We can reduce further by sampling: not writing every exit to the trace, but instead every K exits (for some (probably prime) K). This is why we provide the option. Check out the trace writer implementation from the point this article was written.
We could trace: when methods get compiled; how big the generated code is; how long each compile phase takes; when (and where) invalidation events happen; when (and where) allocations happen from JITed code; and garbage collection events. Visualizations are awesome. Get your data in the right format so you can ask the right questions easily. Thanks for Perfetto! Also, looks like visualizations are now available in Perfetto canary. Time to go make some fun histograms…
This is also sampled/strobed, so not every exit is in there. This is just 1/K of them for some K that I don’t remember.  ↩
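To make the JSON event shapes and the every-K sampling idea concrete, here is a minimal sketch in Python. It is not Tristan Hume's snippet and not ZJIT's trace writer (which emits FXT rather than JSON); the `SampledTracer` class, its method names, and the `guard_shape_failure` reason string used in the usage note are hypothetical. The field names (`name`, `ph`, `ts`, `dur`, `pid`, `tid`, `s`, `traceEvents`) follow the Chromium Trace Event Format, with timestamps in microseconds.

```python
import json
from typing import Optional


def complete_event(name: str, ts_us: int, dur_us: int,
                   pid: int = 1, tid: int = 1,
                   args: Optional[dict] = None) -> dict:
    """One complete ("X") event: a start timestamp plus a duration."""
    event = {"name": name, "ph": "X", "ts": ts_us, "dur": dur_us,
             "pid": pid, "tid": tid}
    if args:
        event["args"] = args
    return event


def instant_event(name: str, ts_us: int, pid: int = 1, tid: int = 1,
                  args: Optional[dict] = None) -> dict:
    """One instant ("i") event with thread scope; it has no duration."""
    event = {"name": name, "ph": "i", "s": "t", "ts": ts_us,
             "pid": pid, "tid": tid}
    if args:
        event["args"] = args
    return event


class SampledTracer:
    """Record only every K-th side exit to keep the trace small."""

    def __init__(self, every: int):
        self.every = every
        self.seen = 0
        self.events = []

    def side_exit(self, reason: str, ts_us: int) -> None:
        self.seen += 1
        if self.seen % self.every == 0:
            self.events.append(instant_event(reason, ts_us))

    def dump(self) -> str:
        # JSON object form of the trace: {"traceEvents": [...]}
        return json.dumps({"traceEvents": self.events})
```

For example, `SampledTracer(every=3)` fed ten `side_exit("guard_shape_failure", ts)` calls keeps three events, and the string returned by `dump()` can be written to a file and opened in the Perfetto UI.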

Simon Willison 3 weeks ago

Profiling Hacker News users based on their comments

Here's a mildly dystopian prompt I've been experimenting with recently: "Profile this user", accompanied by a copy of their last 1,000 comments on Hacker News. Obtaining those comments is easy. The Algolia Hacker News API supports listing comments sorted by date that have a specific tag, and the author of a comment is tagged there as . Here's a JSON feed of my ( ) most recent comments, for example: https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_simonw&hitsPerPage=1000 The Algolia API is served with open CORS headers, which means you can access the API from JavaScript running on any web page. Last August I had ChatGPT build me a simple tool for hitting that API for any user which fetches their comments and gives me a mobile-friendly "copy to clipboard" button. I've since tweaked it a few times with Claude. I can then paste the whole lot into any LLM - these days I mostly use Claude Opus 4.6 for this - and prompt "profile this user" to see what it can figure out. It's startlingly effective! It feels invasive to quote the profile of another user here, so I'll show you what it produces for me: This is Simon Willison — a prolific, independent software developer, blogger, and one of the most visible voices in the AI-assisted coding space. Here's a profile drawn from his comments: Professional identity: Independent developer and writer. Co-creator of Django, creator of Datasette and many other open source tools. On the board of the Python Software Foundation. Runs a heavily-trafficked personal blog at simonwillison.net with multiple content types (entries, link blog, notes, quotations). Monetizes through GitHub sponsors, ethical ads, and occasional consulting — not through any AI company payroll, a point he defends frequently. Core thesis on AI coding: He's the leading evangelist for what he calls "agentic engineering" — using coding agents (Claude Code, Codex CLI) as productivity multipliers for skilled developers. 
His central argument: LLMs don't replace programmers, they amplify existing expertise. He's writing a guide/book on this at simonwillison.net/guides. He insists these tools are deceptively difficult to use well, and that most people who report bad results simply haven't learned the craft yet. Working style: He does most of his programming from his iPhone via Claude Code for web, often while riding BART, walking the dog, or sitting in the garden. He embraces "YOLO mode" (auto-approving agent actions) and frequently runs 2-3 agent sessions in parallel. He starts every session with "run uv run pytest" to anchor the agent in TDD. He's on the $200/month Claude Max plan. Key technical interests: Sandboxing and security (WebAssembly, Pyodide, sandbox-exec, Firecracker), SQLite (deeply), Python packaging and tooling (uv, PyPI distribution tricks), browser-in-a-browser experiments (v86, WASM Linux), and local LLM inference. He's been learning Go "by osmosis" through coding agents. Security consciousness: He coined the term "prompt injection" and the "lethal trifecta" (access to private data + exposure to untrusted input + ability to take actions). He's genuinely alarmed about OpenClaw/Clawdbot security risks and has been predicting a headline-grabbing prompt injection attack "in the next six months" for over two years — acknowledging it keeps not happening while insisting the risk is real. He references the "normalization of deviance" pattern frequently. Personality and debate style: Energetic, combative when challenged, but generally good-natured. He engages heavily in HN comment threads — sometimes dozens of replies in a single discussion. He's transparent about his biases and maintains a public disclosures page. He pushes back hard against what he sees as motivated reasoning from AI skeptics, but also against AGI hype ("I refer to that as the science fiction version of AI"). He gets visibly frustrated by sock-puppet accounts and bad-faith criticism. 
He has a self-deprecating streak ("my stupid pelican benchmark").

Recurring themes across hundreds of comments:
"Two things can be true at the same time" (he holds nuanced positions)
Tests are for productivity, not just quality
The November 2025 model releases (Opus 4.5, GPT-5.2) were a genuine inflection point
Code review is the biggest bottleneck in agent-assisted workflows
"Cognitive debt" is a real and unsolved problem
The best engineering practices (tests, docs, CI/CD, clean code) also make agents work better
He's deliberately trying to "teach people good software engineering while tricking them into thinking the book is about AI"

Personal interests mentioned: Niche museums (runs niche-museums.com), New Zealand kākāpō parrots, cooking food from countries he's never visited, chickens in his garden. Lives in Half Moon Bay, CA near a major pelican roost. The pelican-on-a-bicycle SVG test is his signature LLM benchmark.

In a sentence: A deeply experienced, independently-minded developer who's genuinely excited about AI coding tools, fights hard against both uncritical hype and reflexive dismissal, and is trying to professionalize how the industry uses these tools, while also worrying loudly about the security implications almost nobody else takes seriously enough.

This all checks out! I ran this in Claude incognito mode to hopefully prevent Claude from guessing that I was evaluating myself and sycophantically glazing me. The tone of the response it gave here is similar to the tone I've seen against other accounts. I expect it guessed my real name due to my habit of linking to my own writing from some of my comments, which provides plenty of simonwillison.net URLs for it to associate with my public persona. I haven't seen it take a guess at a real name for any of the other profiles I've generated.

It's a little creepy to be able to derive this much information about someone so easily, even when they've shared that freely in a public (and API-available) place. I mainly use this to check that I'm not getting embroiled in an extensive argument with someone who has a history of arguing in bad faith. Thankfully that's rarely the case. Hacker News continues to be a responsibly moderated online space.

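The Algolia query described in the post above can be sketched in a few lines of Python. This is only an illustration, not the tool the author built: the function names are made up here, and the assumption that each hit stores its body under a "comment_text" key should be checked against a real response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://hn.algolia.com/api/v1/search_by_date"


def comments_url(username: str, limit: int = 1000) -> str:
    """Build the search_by_date URL for one user's most recent comments."""
    query = urlencode(
        {"tags": f"comment,author_{username}", "hitsPerPage": limit},
        safe=",",  # keep the comma between the two tags readable
    )
    return f"{API}?{query}"


def fetch_comment_texts(username: str, limit: int = 1000) -> list[str]:
    """Download the comments and return their text (requires network access)."""
    with urlopen(comments_url(username, limit)) as response:
        payload = json.load(response)
    # Assumption: each hit carries its body in a "comment_text" field.
    return [hit.get("comment_text") or "" for hit in payload.get("hits", [])]


if __name__ == "__main__":
    print(comments_url("simonw"))
```

Joining the returned strings with blank lines gives a single block of text that can be pasted into an LLM along with the "profile this user" prompt.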
devansh 1 months ago

Four Vulnerabilities in Parse Server

Parse Server is one of those projects that sits quietly beneath a lot of production infrastructure. It powers the backend of a meaningful number of mobile and web applications, particularly those that started on Parse's original hosted platform before it shut down in 2017 and needed somewhere to migrate. The project currently has more than 21,000 stars on GitHub.

I recently spent some time auditing its codebase and found four security vulnerabilities. Three of them share a common root: a fundamental gap between what readOnlyMasterKey is documented to do and what the server actually enforces. The fourth is an independent issue in the social authentication adapters that is arguably more severe: a JWT validation bypass that allows an attacker to authenticate as any user on a target server using a token issued for an entirely different application. The Parse Server team was responsive throughout and coordinated fixes promptly. All four issues have been patched.

Parse Server is an open-source Node.js backend framework that provides a complete application backend out of the box: a database abstraction layer (typically over MongoDB or PostgreSQL), a REST and GraphQL API, user authentication, file storage, push notifications, Cloud Code for serverless functions, and a real-time event system. It is primarily used as the backend for mobile applications and is the open-source successor to Parse's original hosted backend-as-a-service platform.

Parse Server authenticates API requests using one of several key types. The master key grants full administrative access to all data, bypassing all object-level and class-level permission checks. It is intended for trusted server-side operations only. Parse Server also exposes a readOnlyMasterKey option. Per its documentation, this key grants master-level read access: it can query any data, bypass ACLs for reading, and perform administrative reads, but is explicitly intended to deny all write operations.
It is the kind of credential you might hand to an analytics service, a monitoring agent, or a read-only admin dashboard: enough power to see everything, but no ability to change anything. That contract is what three of these four vulnerabilities break.

The implementation checks whether a request carries master-level credentials by testing a single flag on the auth object. The problem is that readOnlyMasterKey authentication sets both that master flag and a separate read-only flag, and a large number of route handlers only check the former. The read-only flag is set but never consulted, which means the read-only restriction exists in concept but not in enforcement.

Cloud Hooks are server-side webhooks that fire when specific Parse Server events occur: object creation, deletion, user signup, and so on. Cloud Jobs are scheduled or manually triggered background tasks that can execute arbitrary Cloud Code functions. Both are powerful primitives: Cloud Hooks can exfiltrate any data passing through the server's event stream, and Cloud Jobs can execute arbitrary logic on demand.

The routes that manage Cloud Hooks and Cloud Jobs (creating new hooks, modifying existing ones, deleting them, and triggering job execution) are all guarded by master key access checks. Those checks verify only that the requesting credential has master-level access. Because readOnlyMasterKey satisfies that condition, a caller holding only the read-only credential can fully manage the Cloud Hook lifecycle and trigger Cloud Jobs at will.

The practical impact is data exfiltration via Cloud Hook. An attacker who knows the readOnlyMasterKey can register a new Cloud Hook pointing to an external endpoint they control, then watch as every matching Parse Server event (user signups, object writes, session creation) is delivered to them in real time. The read-only key, intended to allow passive observation, can be turned into an active wiretap on the entire application's event stream. The fix adds explicit read-only rejection checks to the Cloud Hook and Cloud Job handlers.
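The guard pattern behind these three issues can be modelled in a few lines. The sketch below is written in Python rather than Parse Server's JavaScript, and every name in it (Auth, is_master, is_read_only, the two guard functions) is hypothetical. It only illustrates the shape of the bug and of the fix as described above, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Auth:
    is_master: bool = False      # hypothetical name for the master-level flag
    is_read_only: bool = False   # hypothetical name for the read-only flag


class Forbidden(Exception):
    """Raised when a request is not allowed to reach a handler."""


def auth_for_key(key: str, master_key: str, read_only_master_key: str) -> Auth:
    """Build an auth object; the read-only key sets BOTH flags."""
    if key == master_key:
        return Auth(is_master=True)
    if key == read_only_master_key:
        return Auth(is_master=True, is_read_only=True)
    return Auth()


def vulnerable_write_guard(auth: Auth) -> None:
    """Checks only the master flag, so the read-only key slips through."""
    if not auth.is_master:
        raise Forbidden("master key required")


def fixed_write_guard(auth: Auth) -> None:
    """Also rejects read-only credentials before any write happens."""
    if not auth.is_master:
        raise Forbidden("master key required")
    if auth.is_read_only:
        raise Forbidden("read-only master key cannot perform writes")
```

With a read-only auth object, vulnerable_write_guard returns silently while fixed_write_guard raises Forbidden, which is the difference between the behaviour reported here and the patched behaviour.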
Parse Server's Files API exposes endpoints for uploading and deleting files. Both routes are guarded by a middleware that checks whether the incoming request has master-level credentials. Like the Cloud Hooks routes, this check only tests the master flag and never consults the read-only flag. The root cause traces through three locations in the codebase. At lines 267–278 of one file, the read-only auth object is constructed with the master flag set. At lines 107–113 of a second file, the delete route applies the master-key middleware as its only guard. At lines 586–602 of that same second file, the delete handler calls through to the deletion logic without any additional read-only check in the call chain.

The consequence is that a caller with only readOnlyMasterKey can upload arbitrary files to the server's storage backend or permanently delete any existing file by name. The upload vector is primarily an integrity concern: poisoning stored assets. The deletion vector is a serious availability concern: an attacker can destroy application data (user avatars, documents, media) that may not have backups, and depending on how the application is structured, deletion of certain files could cause cascading application failures. The fix adds read-only rejection to both the file upload and file delete handlers.

This is the most impactful of the three issues. The /loginAs endpoint is a privileged administrative route intended for master-key workflows: it accepts a user ID parameter and returns a valid, usable session token for that user. The design intent is to allow administrators to impersonate users for debugging or support purposes. It is the digital equivalent of a master key that can open any door.

The route's handler is located at lines 339–345 of its source file and is mounted at lines 706–708. The guard condition rejects requests where the master flag is false. Because readOnlyMasterKey produces an auth object where the master flag is true, and because there is no read-only check anywhere in the handler or its middleware chain, the read-only credential passes the gate and the endpoint returns a fully usable session token for any user ID provided. That session token is not a read-only token.
It is a normal user session token, indistinguishable from one obtained by logging in with a password. It grants full read and write access to everything that user's ACL and role memberships permit. An attacker with the readOnlyMasterKey and knowledge of any user's object ID can silently mint a session as that user and then act as them with complete write access: modifying their data, making purchases, changing their email address, deleting their account, or doing anything else the application allows its users to do. There is no workaround other than removing readOnlyMasterKey from the deployment or upgrading. The fix is a single guard added to the handler that rejects the request when the read-only flag is true.

This vulnerability is independent of the readOnlyMasterKey theme and is the most severe of the four. It sits in Parse Server's social authentication layer, specifically in the adapters that validate identity tokens for Sign in with Google, Sign in with Apple, and Facebook Login.

When a user authenticates via one of these providers, the client receives a JSON Web Token signed by the provider. Parse Server's authentication adapters are supposed to verify this token: they check the signature, the expiry, and critically, the audience claim, the field that specifies which application the token was issued for. Audience validation is what prevents a token issued for one application from being used to authenticate against a different application. Without it, a validly signed token from any Google, Apple, or Facebook application in the world can be used to authenticate against any Parse Server that trusts the same provider.

The vulnerability arises from how the adapters handle missing configuration. For the Google and Apple adapters, the audience is passed to JWT verification via a configuration option. When that option is not set, the adapters do not reject the configuration as incomplete; they silently skip audience validation entirely. The JWT is verified for signature and expiry only, and any valid Google or Apple token from any app will be accepted.
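The difference between skipping and enforcing the audience check can be sketched on already-decoded claims. This Python sketch assumes signature and expiry have been verified elsewhere. The function names and the way configuration is passed in are illustrative only and do not reflect the adapters' real interfaces. The only standard detail relied on is that a JWT's "aud" claim may be a single string or a list of strings.

```python
from typing import Optional


def audience_matches(claims: dict, expected: str) -> bool:
    """True if the token's "aud" claim names the expected application."""
    aud = claims.get("aud")
    values = [aud] if isinstance(aud, str) else list(aud or [])
    return expected in values


def lenient_check(claims: dict, expected_audience: Optional[str] = None) -> bool:
    """Mirrors the flawed behaviour: no configured audience means no check."""
    if expected_audience is None:
        return True  # any validly signed token from any app is accepted
    return audience_matches(claims, expected_audience)


def strict_check(claims: dict, expected_audience: Optional[str]) -> bool:
    """Treats a missing audience setting as a configuration error."""
    if not expected_audience:
        raise ValueError("an expected audience must be configured")
    return audience_matches(claims, expected_audience)
```

A token whose audience is some unrelated application passes lenient_check when nothing is configured, but is rejected by strict_check, which also refuses to run at all without a configured audience.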
For Facebook Limited Login, the situation is worse: the vulnerability exists regardless of configuration. The Facebook adapter validates its configured audience value as the expected audience for the Standard Login (Graph API) flow. However, the Limited Login path, which uses JWTs rather than Graph API tokens, never passes that value to JWT verification at all. The code path simply does not include the audience parameter in the verification call, meaning no configuration value, however correct, can prevent the bypass on the Limited Login path.

The attack is straightforward. An attacker creates or uses any existing Google, Apple, or Facebook application they control, signs in to obtain a legitimately signed JWT, and then presents that token to a vulnerable Parse Server's authentication endpoint. Because audience validation is skipped, the token passes verification. Combined with the ability to specify which Parse Server user account to associate the token with, this becomes full pre-authentication account takeover for any user on the server, with no credentials, no brute force, and no interaction from the victim.

The fix makes the audience settings for the Google, Apple, and Facebook adapters mandatory configuration and passes them correctly to JWT verification for both the Standard Login and Limited Login paths on all three adapters.

Vulnerabilities
CVE-2026-29182: Cloud Hooks and Cloud Jobs bypass readOnlyMasterKey
CVE-2026-30228: File Creation and Deletion bypass readOnlyMasterKey
CVE-2026-30229: /loginAs allows readOnlyMasterKey to gain full access as any user
CVE-2026-30863: JWT Audience Validation Bypass in Google, Apple, and Facebook Adapters

Disclosure Timeline
CVE-2026-29182: GHSA-vc89-5g3r-cmhh, fixed in 8.6.4, 9.4.1-alpha.3
CVE-2026-30228: GHSA-xfh7-phr7-gr2x, fixed in 8.6.5, 9.5.0-alpha.3
CVE-2026-30229: GHSA-79wj-8rqv-jvp5, fixed in 8.6.6, 9.5.0-alpha.4
CVE-2026-30863: GHSA-x6fw-778m-wr9v, fixed in 8.6.10, 9.5.0-alpha.11

Parse Server repository: github.com/parse-community/parse-server

Stone Tools 1 months ago

Lotus 1-2-3 on the PC w/DOS

What would a piece of software have to do today to make you cheer and applaud upon seeing a demo? I don't mean the "I'm attending a keynote and this is expected, please don't glower at me Mr. Pichai," polite-company type of applause. I mean the "Everything's different now." kind. For that, the bar is pretty high these days. "Photorealistic" fight scenes between Brad Pitt and Tom Cruise against an apocalyptic cityscape are generated out of nothing but a wish, and social media, smelling the cynical desperation, can offer no more than a clenched-teeth grimace. Within 48 hours the cold light of the epic battle has faded, leaving no residual heat. A sense of awe was easier to elicit back in the golden era. Bill Atkinson scrubbed out some pixels with an eraser in MacPaint to thunderous applause. Andy Warhol did a flood fill on an image capture of Debbie Harry, leaving an audience enraptured. Perhaps miracles work best when they're minor. Mitch Kapor has been on the receiving end of the adulation. As CEO of newly-formed Lotus Corporation, demos of their flagship product 1-2-3 generated significant light and heat with the crowds. In a 2004 interview with the Computer History Museum, Kapor said, "You could with one-click see the graph from your spreadsheet. You could not do that before. That was the killer feature when we demo’d it. I mean, literally, people used to applaud – as hard as it is to believe." He knew all too well the struggles of the VisiCalc crowd, having previously built VisiPlot and VisiTrend for VisiCorp. Those programs worked with VisiCalc data to draw graphs, but required a lot of disk swapping to move in and out of the various programs when fine-tuning charts and graphs. 48K on the Apple 2 made it essentially impossible to fit all of the software into memory at once, but they could at least put everything onto the same diskette, Kapor reasoned. Eliminating that song and dance would be useful to the customers. 
Depicted as a literal song-and-dance in their advertising. In an interview in Founders at Work, Kapor said, "At various times I raised a number of ideas with the publisher about combining ( VisiCalc and VisiPlot onto one disk) and they weren't interested at all. I don't think they really saw me as an equal. They saw me, when I was there as a product manager, as an annoyance—as a marginal person without experience or credentials who was kind of a pest. And I suppose I was kind of a pest." He said the feeling was mutual, and that was basically it for his employment with Personal Software and the VisiCalc team. He let them buy him out (i.e. the juicy royalties he was receiving for VisiPlot and VisiTrend ) for $1.2M, then took that money and went off to build the better mousetrap he had tried to pitch. Lotus 1-2-3 would quickly become the "killer app" for the nascent IBM-PC, doing for that system what VisiCalc had done earlier for Apple. 1-2-3 's success (and corporate in-fighting between Personal Software and VisiCorp) drove VisiCalc sales into the ground almost immediately. Two years later, Lotus would buy out Personal Software. One year later, Lotus would kill VisiCalc . Today, Microsoft Excel documentation still references Lotus 1-2-3 , not VisiCalc . I have no 1-2-3 experience going into this. I always thought "1-2-3" referred to its relationship to numbers. "1, 2, 3. Row numbers. Numbers in a spreadsheet. Mathy number stuff. I get it." I honestly had no idea "1-2-3" indicated something more. I'm learning that VisiCalc walked so 1-2-3 could run (over VisiCalc's ashes in a Sherman tank) . I have one goal in learning Lotus 1-2-3 . I want to understand what it did that was so superior to my beloved VisiCalc that it practically wiped them out in the first year of launch. Kapor had projected first year 1-2-3 sales of US$1M, but did US$53M instead. That's not just a little better than VisiCalc, that's " VisiWho ?" dominance. 
VisiCalc is a spreadsheet and 1-2-3 is a spreadsheet, so what's the big fuss? First, the platform of choice, the IBM-PC running PC-DOS (MS-DOS, to those buying it separately), affords two big wins right off the bat. 80-column text mode makes the Apple 2's 40-columns feel claustrophobic (and perhaps a bit un-business-like?). The greatly expanded memory of the 16-bit PC, max 640K vs. the 8-bit Apple 2's 48K, lets far more complex worksheets fill out those roomy 80-columns. As Lotus Corporation and magazines and Wikipedia pages and other blogs love to point out, the true game-changer is contained in the program's very name. "1-2-3" refers to the three components of this "integrated software" package. "1" is the spreadsheet capability, which surpassed most contemporaries handily in speed, being written in x86 assembly (until Release 3). "2" is for those graphing tools which had Kapor's audiences applauding. "3" was intended to be a word processor, but according to programmer Jonathan Sachs, "I was a few weeks into working on the word processing part, and I was getting bogged down. That's about when Context MBA came out, and I got a look at what they had done." "What they had done" was integrate a word processor, communications, and database, along with the spreadsheet and graphics components. Context 1-2-3-4-5 , as it were. When Sachs saw the database, that felt to him like a more natural fit and "3" was re-implemented as a database. "It would be a heck of a lot easier to implement," he noted. Woz bless our lazy programmers. The upshot is 1-2-3 plays nicely with last post's focus, dBase , which feels like a particularly powerful combination. I feel a tingle when skills picked up on a previous exploration pay dividends later. Deluxe Paint + Scala paid off similarly. Is this what it feels like to "level up?" Obtaining literature on Lotus 1-2-3 is only difficult in the " overchoice " sense. 
I expected to find a lot of books, but perhaps not the "What have I gotten myself into?" existential dread of 1,000 hits on archive.org. It wasn't just books, that period had an interesting side phenomenon of "software vendor published enthusiast magazines." Companies like Aldus, Corel and Oracle all had self-titled publications on newsstands. Lotus Corporation did as well with LOTUS Magazine . Published monthly by Lotus Corporation, it debuted with the May 1985 issue (probably on newsstands late March, early April). The tagline, "Computing for Managers and Professionals," oriented itself toward the decision makers, the ones with purchasing power. A poll of Lotus software users revealed, "Most of you see the computer primarily as a tool and are not interested in computing, per se." Toward that end, the magazine took a different tack than the BYTE s and PC Magazine s of the time. It was to be no-nonsense, non-techno-babble, short, easy-to-digest articles about computing from the manager's perspective. "What's all this I keep hearing about 'floopy disks' and 'rams' and 'memories' and such and so on? It's enough to drive a reasonable business computerist straight to distraction!" says the frazzled corporate executive trope. There there, fret not! LOTUS Magazine feels your pain and addresses it with the cover story of issue 1. "The world of computer memory has enough complexity and high-tech jargon to drive the most reasonable business computerist straight to distraction," leads in to "An Inside Look at Computer Memory" by T.R. Reid. The article explains the differences between RAM and ROM, floppies and hard disks, and so on, unfurrowing the knitted brows of befuddled mid-80's business executives. When it got into the 1-2-3 of it all, LOTUS Magazine didn't pull its punches. Articles were short, around four pages, and assumed a higher level of analytical aptitude than IT aptitude. 
Lots of charts of formulas, macro definitions with explanations, tips and tricks for faster data entry, and so on fill out the pages. That ran for about seven years, until the December 1992 issue, when publishing duties transferred to PC Magazine as PC Magazine: LOTUS Edition . It was PC Magazine with a mini-magazine's worth of Lotus-specific content appended each month, as a special imprint. That ran until August 1995 , marking a 10-year publication run which would have exceeded my prediction by about eight years. After judging books entirely by their covers, I've chosen the official Lotus manuals for 1.0A, 2.2, and 3.4, and two compilations of tips and tricks previously published in LOTUS Magazine . I flip through other stuff as well, but honestly nothing is holding my attention this time around; they all read the same, "dry and boring." 1,000 pages or more for some of those books and they didn't have room for even one joke? I promise at least seven in this post alone. See if you can spot them all! Launching into the program proper brings me to the expected "I'm a spreadsheet!" grid layout, with column and row labels, arrow-key controllable cell cursor, and a blank area at the top for VisiCalc -y stuff. Let's go. As an intermediate level VisiCalc user, I am delighted my menu muscle memory pays immediate dividends. Clearly Lotus welcomes defectors and even makes life easier on everyone by taking advantage of the 80-column display. VisiCalc 's single-letter menu mnemonics are enhanced in 1-2-3 by simply spelling it all out on-screen. Full menu item names are always visible, yet still accessible by single-letter commands. From the jump, 1-2-3 makes a strong case for itself, providing improved usability and discoverable tools. Before digging in too deeply, I should note that 1-2-3 does all of the VisiCalc things. 
A1-style cell references, slash menu, fixed and relative cell references, @ functions including transcendentals, range specifier, prefix for values, and on and on. It adds, it subtracts, it calculates interest. 1-2-3 "Yes, and..."s VisiCalc from there. We gain a lot, but there is a notable absence: the upper-right status check. VisiCalc shows calculation order, arrow-key toggle, and free memory in that spot. Those are all gone in 1-2-3 and good riddance, frankly. On the PC I have full arrow keys and more RAM than Woz; 1-2-3 sees my full 16MB of DOS Extended memory. There is no stopping me. 1-2-3 also says nuts to VisiCalc 's "calculation order" (by row or by column) hoo-hah and introduces "minimal recalculation." From the almost comically-straightforward named book Lotus 1-2-3, Release 2.3 , "When 1-2-3 recalculates a worksheet, only those formulas directly affected by a change in the data are recalculated." I am living large here in 1989, or 1991, or whatever year I'm pretending it is this week. Even VisiCalc 's gets a glow up. You know it today as and , both of which were present in 1-2-3 Release 1 back in 1983. At this rate, 1-2-3 is flirting dangerously close to "expected spreadsheet behavior in 2026." Don't get my hopes up, Lotus. There's only down from there. The more I encounter this, the more I wonder if we gave up on it too soon. This could be "blogger overly immersed in their subject matter" brain, but I'm growing to oftentimes prefer two-line horizontal menus over modern GUI menus. I find the left-right, up-down, left-right, up-down, scanning through GUI menus kind of tiring. With the two-line menu, I can step through top-level options with the left/right arrow keys, eyes focused on line two as I scan sub-menu items. It also provides something GUI menus don't: an immediate explanation of a menu item before committing its action to the document. If a menu item is not a sub-menu, line two describes it. It's easy to audit features in an unknown program. 
Also, every menu item has a keyboard shortcut; just type the first letter. This requires creativity by the developer when naming menu items such that each has a unique first letter, but it also creates a de-facto mnemonic for the user. Don't discount muscle memory! There's one "drawback," but I'll try to make a case for it. Specifically, it is probably impossible to fit everything in a modern GUI menu into a two-line scheme. There's just too much! I suggest the horizontal menu-bar solves this precisely because of that design constraint. If there's too much, the menu needs to be simplified. "Problem solved," the author asserted. This has to be one of 1-2-3 's greatest contributions to modern spreadsheets. It still exists, just open up your modern spreadsheet of choice and try it. Enter 1 through 5 down the A column. Starting with B2, enter the formula and copy it down a few rows. Old hands know that a symbol in a cell reference fixes that row or column of the reference, otherwise references are relative. That's a huge step up from VisiCalc 's "all or nothing" approach to cell references. Put in a formula and copy it through to other cells. For every cell reference, in every copy of the formula, VisiCalc prompts the user for "relative or fixed?" It is a complete drag, and Woz help you the day that formula needs updating. The approach is superior, allowing us to embed relativity into the formula itself. Then, copying a formula across cells copies our intent as a natural course. It's simple to understand and hard to mess up: my favorite combination. While it can't load non- 1-2-3 documents natively, Lotus does provide a nice translation tool for helping us get data out of the heavy hitters of the day. From a Stone Tools perspective, this handles everything I need so far, as VisiCalc and dBase are both accounted for and work as advertised. 
Translation works both ways, so bringing in dBase data, messing around with it in 1-2-3, and going back out to dBase is possible, though there are cautions in doing so. One notable thing to watch out for is "deleted" records. dBase only "marks for deletion" (until a .PACK command), and that flag won't survive transit. A small inconvenience, all things considered.

In the top-level menu is the shiny new Graph option, the "2" in "1-2-3." I know exactly what I want: a pie chart of game software genres imported from dBase II. The options for Graph are straightforward, and the limitations are self-evident. Notably, look at the "Ranges" settings. One range sets the value labels which will appear along the X-axis. The remaining ranges define six, and only six, ranges of data to plot on the graph. That's it. Everything else you see is "make it pretty."

Within the confines of my self-imposed time capsule, my only point of reference thus far is VisiCalc and its clones. Through that lens, I'm blown away by Lotus 1-2-3. I mean, come on, 3-D bar charts?! Am I living in the world of TRON right now?! The applause is well-earned, Mitch. Bravo! Encore, even! Now, Mr. Kapor, if you'll excuse me a moment, I need to have a quick, private chat with my readers. Yes, sorry, I'll only be a moment.

Hello dear readers. Mitch can't hear us, yeah? We're safe? OK, between you and me, that graphing tool is a little underwhelming, huh? There's a lot we can do to make a graph look as pretty as possible for screens and printers of the time, but the core graphing options themselves are kind of anemic. Here's Google Sheets making the pie chart I'd hoped 1-2-3 could generate. However, 1-2-3 cannot do this because it can only graph strict numeric values; strings, like "genre" types, return blank charts. 1-2-3 also can't coalesce data, like we see Sheets doing above. To achieve my goal, I'll need to figure out a different approach. (Plus, maybe I've discovered a DOSBox-X bug?)
It's not fair to judge past tools as being "inferior" just because they don't live up to 2026 standards. Still, what I'm trying to do must have been one of the first things many business owners wanted to do, right? Am I storing my data in a style that hadn't been popularized yet? Is my 2026 brain making life more difficult for my 1991 doppelgänger unnecessarily? How does one graph out the count of each unique genre? Alright, this is going to get complicated, so I think a diagram is in order. This actually explains a lot about the Lotus 1-2-3 approach to data in general, how to manipulate it, how to query it, and generally how to interface with the more complex functions of the program. Having imported the dBase list of CP/M games from the dBase article, let's extract a list of all titles that are of genre "Simulation." I'll use a subset of the total data so everything fits on screen for demonstration purposes and perform (aka , aka The Notorious DQU, aka Query's L'il Helper) A worksheet is not just rows and columns of data. It also serves as a control mechanism for defining interactions with the data. A worksheet has columns up to IV (256) and rows up to 8192. What do we do with 2,000,000+ cells? In true Dwarf Fortress fashion, we section off areas ("ranges" in 1-2-3 speak) and designate functions to those areas. First, I have my data as the main table, field names at top. Then, I need to set up my query criteria. This is a separate portion of the worksheet, with the fields I want to query against and room below to accept the criteria definition. Think of it like building a little query request form. Then, Lotus needs a place to spit out the results. Again, I set up a little "form" to receive the data. Put in whichever field names are of interest in the final data capture. Now, what if there are multiple queries I want to re-use from time to time? Painful as it sounds, I must set up multiple query forms, one for each query I expect to re-use. 
So, re-copy all of the field headers of interest into a new portion of the worksheet. Re-copy the field headers for the output range. Put in the new query criteria. Do another extraction. Keep dividing the worksheet up into all of the various queries one might need to reuse. Each lives in its own little area of the worksheet, so maybe now's a good time to start labeling things? Maybe mentally divide the worksheet into "my queries live over here, in Q-Town" and "my results live over there, in Resultsville" and so on. For my stated goal, I need the unique list of genres for my game list and the count of each genre within the data set. From the previous section, I know how to extract a list of unique genres. To count them, can count all non-empty records which match my criteria. Lemme draw up another diagram here. After extracting the list of unique values for "Genre", I get a column of results as seen at in the image above. Notice the criteria at is empty? By not specifying anything, that equates to matching any "Genre". Next, I need to reformat that column into countable criteria for . Just like in a query, criteria consists of two vertically contiguous cells, the top of which is the field name and the bottom holds the parameter. The field name must be physically, immediately above each and every genre I want to count. will transpose a range of vertical or horizontal cells into their mirror universe opposite. That's how I generated the horizontal list at . A of the field name across row 15 generated nice pairings, perfect for use with . The cell formula outlined in yellow is essentially the same across , each lightly modified to point to a different criteria range. That calculates the count for each genre in column , and column holds my titles. Now I have what I need to generate the chart I wanted (aforementioned pie chart drawing bug notwithstanding). Here it is in glorious 3-D from the future (of the past)! 
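For comparison, the same two-step recipe (extract the unique genres, then count the records matching each one) can be sketched in Python. The rows below are made-up placeholders, not the author's dBase data, and the helper names are illustrative; they stand in for the worksheet's extract-unique query and its per-criteria counting formula.

```python
# Hypothetical sample rows standing in for the imported game list.
games = [
    {"Title": "Game A", "Genre": "Simulation"},
    {"Title": "Game B", "Genre": "Adventure"},
    {"Title": "Game C", "Genre": "Simulation"},
    {"Title": "Game D", "Genre": "Strategy"},
    {"Title": "Game E", "Genre": "Adventure"},
    {"Title": "Game F", "Genre": "Simulation"},
]


def unique_values(rows, field):
    """Step 1: list each distinct value once, in first-seen order."""
    seen = []
    for row in rows:
        if row[field] not in seen:
            seen.append(row[field])
    return seen


def count_matching(rows, field, value):
    """Step 2: count the records whose field equals one criterion."""
    return sum(1 for row in rows if row[field] == value)


genres = unique_values(games, "Genre")
counts = {genre: count_matching(games, "Genre", genre) for genre in genres}

for genre, total in counts.items():
    print(f"{genre}: {total}")
```

The resulting pairs of label and number are the purely numeric series plus titles that a pie chart needs, which is what the transposed criteria cells and counting formulas build up by hand in the worksheet.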
Frustratingly, figuring all of that out took the better part of a day. But now I know! If only there were some way to make it easier.

There are issues with my solution thus far, many of which boil down to the physical spaces assigned to hold queries and results and transformations and data. If I bring in new data with new genres, new result lists could physically lengthen and overlap one another. Planning a physical map for the worksheet is a priority. Building out the sheet, especially keeping cell references flexible to changes in data, is a drag. I'd also like to generate a graph from the new sheet arrangement, with just a simple hot-key.

Like all great developers, I want to be lazy. The first step toward the promised land of laziness is "hard work," unfortunately. Hard work can be captured and reused, luckily, as Lotus 1-2-3 features "Friend of the Blog": macros. VisiCalc didn't have it, and 1-2-3's implementation is robust enough that many books were devoted to understanding and taming it. Here's a simple macro, which hints at its latent power.

Custom menus are easy to build. Selecting an option could trigger a longer automation task, simplifying a multi-step process, or something as simple as a help menu.

Macros are stored... (say it with me now) ...in the worksheet. Yep, whatever map you had in mind for dividing up the worksheet into query-related fiefdoms, redistrict once more to hold macro definitions.

Custom menus are an easy way to illustrate macro structure. Here's a dumb example. The text in column A is mostly comments to organize our worksheet and thoughts. represents the keyboard shortcut assigned to the macro, accessed by . is a reference to a named cell range. Named ranges are an important improvement over VisiCalc. Once defined, a range can be invoked by name anywhere a range is expected. Assuming a cell range as has been assigned a name like , is totally valid. is a range defined as . is a range defined as .
Notice a range only needs to define the start of a macro definition. Macro execution will read each cell in order down a given column until the first empty cell. range names are interpreted by 1-2-3 as macro keyboard shortcuts automatically. The convention shown, of a human-readable label to the immediate left of a range by the same name, is so common it has its own menu shortcut. applied to column A will auto-assign column B cells to the names in A.

To a certain extent, a named range can function like a programming "goto". In the macro case, it's saying "Goto the range named and continue executing the macro from there." Programmers in the readership are salivating at the deviously complex ways this "goto labeling" could be abused. Combine it with decision making through and iteration through and the possibility space opens wide.

After doing dBase work last post, I noted that I had accidentally become a dBase developer without even trying; the dBase scripting language was precisely equivalent to the commands issued at the dot prompt. I'm not so lucky with 1-2-3. Setting up a macro which issues a simple string of commands is easy enough, and reads (mostly) like how I'd type it at the menu, akin to Bank Street Writer's approach to macros. For example, will issue to bring up the slash menu, access the (W)orksheet menu, then the (C)olumn sub-menu, and finally (H)ide a column. ~ issues "enter", which at this point in the menu navigation will commit the prompt default, i.e. the current position of the cursor. Just like that, hiding the current column just became a single keystroke.

There is also a menu tool which is "record every keystroke I do from now." That recording will be output into the worksheet. Apply a range name to that and it transforms into a macro. Very nice! That said, 1-2-3 macros go from zero to 100 pretty quickly and are visually difficult to parse and reason out.
One must be super-duper intimately familiar with every command in the slash menu, plus the macro-specific vocabulary. Lotus understood things could get hairy pretty quickly and added a debugging tool to help make sense of things. enters mode, which executes macros one line at a time. The status bar at the bottom of the screen explains what is being run, so when something goes wrong I know who to blame.

OK, are you ready to dig in and implement macros which simplify the queries and procedure discussed earlier? <cracking knuckles> Well, I'm not. <uncracks knuckles back to stiffness> The macro system has proven too complicated to feel any sense of control or mastery beyond Baby's First Macro™. With a couple of more weeks' study I think I could achieve my goal. Unfortunately, for this post, I am defeated.

The "3" in "1-2-3": 1-2-3 can function as a database. A very simple, limited, one-row-equals-one-record, 8192 record max, 256 field max, flat database. Let's be honest, oftentimes that's more than enough. I showed examples of querying earlier, and that's as fancy as it gets for this. We can sort records ascending/descending by up to two keys, find and replace values, find records which match a search query, and extract those records into another area of the spreadsheet. And nothing else (at least for Releases 2.x).

Sorting dBase II data by genre.

It may seem I'm giving this aspect of the program short-shrift, but so did Lotus. In their own manual for Release 2.2, macros have 300 pages devoted to them. Database functionality has 50, and the first 20 of those are instructions for typing in dummy data. Sorting, querying, finding, and extracting, the meat and potatoes of database-ing, warrant a mere 20 pages total. It's a useful feature and I'm glad it's here. It's enough to handle most of my meager needs. Beyond that, there's not much to say, except to note its legacy.
It was an obvious idea to anyone who touched VisiCalc for more than five minutes, so its development feels inevitable. Do some database work in Excel tonight and light a candle for 1-2-3 . A very nice feature of 1-2-3 that fits right in with its "integrated" approach, is what we would call today "plug-ins" or "extensions," but which Lotus calls "add-ins." 1-2-3 shipped with a few. For example, one expanded macros by letting them live in-memory, for use across worksheets. Normally the only macros accessible to a worksheet are those defined within itself. Man, VisiCalc is just getting lapped by 1-2-3 's ingenuity, huh? According to a PC Magazine article about the state of add-ins, many business-people lived inside 1-2-3 all day long and wanted to do everything from within its confines . The 3rd party add-in after-market happily commodified those desires. In addition to obvious ideas, like automated save/backup utilities, or industry-specific analysis tools, add-ins could mold 1-2-3 into almost anything. Complete word processors, entire graphic subsystem replacements for complicated graphing needs, expert system logic, and non-linear function solvers were injected into the program. Oracle offered a way to connect to their external SQL databases from within the snugly confines of 1-2-3 's security blanket. The Lotus approach, being a product of lower-memory days, is both annoying and useful. Add-ins can be, though are not by default, loaded at app startup. Add-ins must be "activated" one-by-one to gain access to their extended powers, or "deactivated" to make room for other add-ins or a larger worksheet. I have enough memory, so I'm not in trouble here, though I'm sure it's easy to imagine on a 512K system that manual memory management was a real thing. Between macros and add-ins, 1-2-3 becomes an ecosystem unto itself, like dBase or HyperCard . One thing I don't like about Lotus's approach is how it can bifurcate the user experience. 
That's seen clearly with their own WYSIWYG add-in. With Release 2.3, Lotus included this add-in to help a world transitioning from textual interfaces into the flash and sizzle of OS/2, Windows, and Mac GUI interfaces. It's DOS for the GUI envious and frankly, I'm cold on it. It's not integrated elegantly, feels sluggish, and makes the program more difficult to use. Activating WYSIWYG switches the application from terminal mode to graphics mode, so already as a DOSBox-X user I'm annoyed at losing my lovely TrueType text. That's not Lotus's fault, but a blogger's gotta have his standards. The big usability problem is how the functionality of the program now splits in two. The menu works as before, but we also have a new menu for all things WYSIWYG. So, when you want to use a menu command, you must remember which menu holds that command. Many options appear at first blush to be the same as their counterparts, but they control WYSIWYG-specific parameters of those functions. Usually. That's not to say the add-in isn't useful for cell styling, or placing graphs into a worksheet directly. Making documents look nice is important after all. The boss needs to be impressed with those Q3 projection charts, even when they forecast doom. Especially then, probably! Release 3 embraced WYSIWYG as its main and only interface, no add-in required, which is probably why I keep gravitating to the 2.x releases. I'd chalk it up to being a stubborn old man, but the recent embrace of TUI interfaces by the Hacker News crowd seems to have me in good company. I'm writing this part on February 22. Two days prior, a project called "Pi for Excel: AI sidebar add-in for Excel" released and got good traction on Hacker News. As I noted in the XPER column , our current "AI" boom is the biggest, but not the first. 
English language interactions, first by keyboard and fingers-crossed-one-day-by-voice-if-AI-technology-continues-along-our-projected-path-of-wishes-and-dreams, were available as add-ins to various programs. Databases in particular were a notable target for those experiments. Consider how English-like dBase's user interface is, and it doesn't take a huge leap to understand why developers felt something closer to true English was within reach. Symantec's Q&A had its natural language "Intelligent Assistant" built right in. R:BASE tried it with their CLOUT add-in, promising a user could query, "Which warehouses shipped more red and green argyle socks than planned?" The spreadsheet Silk promised built-in English language control over its tools. Like those self-published magazines at the start of this article, Lotus didn't want to miss out on this English parser party either.

(For this exploration I must drop down into R2.01)

Released for US$150 in late 1986, HAL is a memory-resident wrapper to 1-2-3. We launch HAL directly, which in turn launches 1-2-3. Its advertising explains the gimmick well enough. "Lotus HAL gives you the ability to perform 1-2-3 tasks using simple English phrases." What I've seen in my early time with it can honestly feel kind of magical. Look at how easily it generates monthly column headers.

That's pretty slick, I can't deny it. Similarly tedious actions are promised to be eased greatly by "requesting" HAL to do the heavy lifting. Here, I'm stepping through a quick tutorial to have HAL build an entire spreadsheet. I never touch the formula; I only describe it by intent.

HAL only recognizes the first three letters of anything. "Name" and "Names" and "Namaste" are all the same to well-meaning, but a bit dimwitted, HAL. As is the case for all such English-like languages of the time, it's English only within a generous definition of the word.
Ultimately, we're learning to speak 1-2-3's specific dialect and vocabulary. PC Magazine made HAL its cover story in February 1987, and the review noted, "HAL comes with a 250-page manual. It is as important to read this manual as it is to read the 1-2-3 manual. All the commands are described as rigidly as the syntax of any command-line interface." That it takes a 250 page manual to explain how to speak "English" with HAL perhaps makes an argument against its own existence?

The base 640K of DOS must hold both programs in memory at the same time, so this is a nice piece of corroborating history for those who think software today is too bloated. An industry-defining spreadsheet with graphing and database capabilities close to modern expectations, an online help system, plus a natural language interface, all run together in less than 1MB of RAM. There's the retro-computing dopamine hit I've been hoping for!

HAL doesn't just provide an English-language interface to 1-2-3's native tools, it brings its own unique toys to the Release 2.01 sandbox. I do need to emphasize the release version here, because some of these tools were later worked into the product proper over time. That said, HAL worked hard to be your friend.

Even though HAL controls 1-2-3, interfacing with it still feels bolted on. brings up the HAL dialog box, which isn't hard to remember, but never feels natural. Even after setting the HAL request dialog to remain on screen, it feels tenuous. Sometimes it toggles off after navigating a menu option, or the request box will intercept commands I wanted to do through the normal slash menu. It's in the way more than I expected, and I couldn't find a balance between "when I want it" and "when I don't."

PC Magazine also felt that HAL is a bit of a kludge. Charles Petzold wrote in his review, "Is HAL really a natural-language interface for 1-2-3? Is it useful? Will it revolutionize the computer industry? Are menus dead? My answers are: Not really. Often.
Give me a break. No way."

This is all academic, because Lotus killed HAL. It has been difficult to find sales figures, though in a Raymond Chen post we catch a glimpse of the Softsel Hot List for December 1986. HAL hit the top 10 (along with other, future blog subjects), moving up the charts over the previous three weeks. On the other hand, it was only available for Releases 1A through 2.01, the pre-WYSIWYG releases, and never returned.

Earlier I poked at macros, hoping to make charting "count by genre" easier, and failed. Then I got to ponderin' if HAL might be able to do it for me. Shockingly, HAL can, through its special vocabulary word "tabulate." It makes those previously complex actions, the ones I diagrammed earlier, so simple to perform I don't really need a macro (though I could make one). Check out this 80's magic.

We are supposed to be able to execute HAL requests via to have the system output the 1-2-3 commands HAL puts together to get the job done. It's a peek inside HAL's brain, basically. If I watch HAL think, maybe it can teach me a better way to do all of the busywork I slogged through earlier?

In 1962's Diffusion of Innovations, author Everett Rogers described five characteristics individuals consider when adopting new solutions to existing problems. If VisiCalc was the "existing problem," how well did Lotus 1-2-3 make its case as the "new solution?"

In the VisiCalc post I talked about how much of its DNA is seen in modern spreadsheets. I see now that an equal case can be made for Lotus 1-2-3. I'd phrase it as VisiCalc contributed the "look," and 1-2-3 contributed the "feel" we've come to expect. Where VisiCalc was life-changing for number crunchers, 1-2-3 positioned itself as an engine for business and executed that vision almost perfectly. Having gotten to know 1-2-3 over the past weeks, I can now say, "I get it." I see what the fuss was about and, truth be told, I'm a convert. Sorry, VisiCalc, you know I love you!
But the next time I reach for a spreadsheet, I'm reaching for 1-2-3 . Ways to improve the experience, notable deficiencies, workarounds, and notes about incorporating the software into modern workflows (if possible). Obviously, it depends on what you're trying to do. For business work, it doesn't play well in groups unless you're the CEO and can dictate, "OK people, we're all switching to DOS now." For personal projects, it meets many common needs and doesn't feel too much like compromise, aside from the graphing. Heck, the DOS version supports mouse control, and you can always turn on WYSIWYG mode to approximate modernity. We're also in luck with Y2K compatibility. Even Release 1.0 supports dates up to the year 2099. Let's take a moment of silent appreciation for yet another 1-2-3 foresight which keeps its spirit alive and kicking here in the 21st century. DOSBox-X 2026.01.02, Windows x64 build. I updated from the 2025.12 build mid-investigation. CPU set to 286 DOS reports as v6.22 Windows folder mounted as drive C:\ holds multiple Lotus installations 2x (forced) scaling; 80 columns x 25 lines I flipped back and forth with TrueType text mode (this is moot for 1-2-3 's WYSIWYG mode) Lotus 1-2-3 Releases 2.01, 2.2, 2.3, 2.4, and 3.4 all get exercised to some extent; you'll see that reflected in the screenshots. I mostly gravitate toward R2.3; it does what I need without bogging me down in feature creep. "Sharpening the Stone" explains getting DOSBox-X to work with R3.x. dBase III Plus for compatibility testing with 1-2-3 . Undoing your last action. It's almost worth installing HAL just for this, though it is a little dangerous that is the keyboard shortcut. Entering a sequential list of days, months, letters, or numbers automatically, though I wonder if macros could duplicate this to a certain degree. Linking a cell in one worksheet to data in another. Release 2.3 has this. Referring to columns and rows by name is a very neat trick. 
In fact, it's so neat I'm going to ask you to remember this fact for a later article. Just keep it tucked away in the part of your mind devoted to spreadsheet history, as we all have. The cell-row-bellum, I think it's called? (I refuse to apologize.)
Worksheet "auditing" can identify cell relationships/dependencies, or list out all formulas in use by a table in natural English. Auditing would become an add-in in later 2.x releases.
Find and replace; change all instances of a product name, for example.
Macros can mix HAL English with native 1-2-3 macro commands.

"Relative advantage is the degree to which an innovation is perceived as better than the idea it supersedes." 1-2-3 received applause for one-button graphing. Check.
"Compatibility is the degree to which an innovation is perceived as being consistent with...past experiences, and needs of potential adopters." 1-2-3 shipped with a VisiCalc translation tool and its interface is clearly built to make VisiCalc users comfortable. Check.
"Complexity is the degree to which an innovation is perceived as difficult to understand and use." 1-2-3 was initially praised for the simplicity with which a user could get up to speed. Its adoption of high-level VisiCalc concepts, like the slash menu, @ functions, and A1 cell references, helped. Check.
"Trialability is the degree to which an innovation may be experimented with on a limited basis." Trial disks for software during the 80's and 90's weren't so prevalent; there was a lot of "blind faith" in software purchasing. I can't find any widespread cases of 1-2-3 demo disks circulating. No check.
"Observability is the degree to which the results of an innovation are visible to others." If the live demos, prevalent advertising, and magazine write-ups didn't convince you, 1-2-3 made it clear in the product name itself that you're getting 3x what VisiCalc delivers. Check.

As with ThinkTank, DOSBox-X provided a simple, pain-free experience to get Lotus running.
Multi-disk installs are handled well, but could be improved. Specifically, the "Swap Disk" option when loading up a stack of disks into the A: drive could use a selector and/or indicator of which disk is currently loaded.
in autoexec.bat to auto-mount at launch.
Release 3.4 would not run until I explicitly set in DOSBox-X.

I noted the pie graph bug in Release 2.x. I suspect, but cannot prove, that some x86 assembly call is being mangled by DOSBox-X. 86Box, which strives to be as pedantically accurate a simulation of real-world hardware as possible, does not exhibit this issue. However, setting up 86Box comes with a whole day of learning about the parts and pieces of assembling one's own raw DOS system from virtual components, installing from diskettes, and all of the old-school troubleshooting that entails. It's a commitment, is what I'm saying.

I found that DOSBox-X would run the for Release 2.2, but failed to run it for Releases 2.3 and 2.4. can launch and run without issue. is a front-end utility to launch auxiliary programs like GraphPrint.

If you're mounting a system folder as a "hard drive" in DOSBox-X, it is trivial to extract your data files. The Lotus utility "Translate" is handy for moving data between formats. I found that native .wk1 files open in LibreOffice, as-is. From there, you have any number of modern exporting options, though you might find some quirks from time to time. Check your formulas, just in case!

I'd recommend checking out Tavis Ormandy's site. He's smarter than me and performs magic I didn't think possible, like pulling live stock data as JSON into 1-2-3. He also got the Unix build to work natively in Linux.

(think) 1 month ago

Learning OCaml: PPX for Mere Mortals

When I started learning OCaml I kept running into code like this: My first reaction was “what the hell is ?” Coming from languages like Ruby and Clojure, where metaprogramming is either built into the runtime (reflection) or baked into the language itself (macros), OCaml’s approach felt alien. There’s no runtime reflection, no macro system in the Lisp sense – just this mysterious syntax that somehow generates code at compile time. That mystery is PPX (PreProcessor eXtensions), and once you understand it, a huge chunk of the OCaml ecosystem suddenly makes a lot more sense. This article is my attempt to demystify PPX for people like me – developers who want to use PPX effectively without necessarily becoming PPX authors themselves. OCaml is a statically typed language with no runtime reflection. That means you can’t do things like “iterate over all fields of a record at runtime” or “automatically serialize any type to JSON.” The type information simply isn’t available at runtime – it’s erased during compilation. One of my biggest frustrations as a newcomer was not being able to just print arbitrary data for debugging – there’s no generic or that works on any type. That frustration was probably my first real interaction with PPX. PPX solves this by generating code at compile time . When the OCaml compiler parses your source code, it builds an Abstract Syntax Tree (AST) – a tree data structure that represents the syntactic structure of your program. PPX rewriters are programs that receive this AST, transform it, and return a modified AST back to the compiler. The compiler then continues as if you had written the generated code by hand. In practical terms, this means that when you write: The PPX rewriter generates something like this behind the scenes: You get a pretty-printer for free, derived from the type definition. No boilerplate, no manual work, and it stays in sync with your type automatically. If you’ve used Rust’s or Haskell’s , the idea is very similar. 
The syntax is different, but the motivation is identical – generating repetitive code from type definitions. If you’re coming from Rust, you might wonder why OCaml doesn’t just have a built-in macro system like . It’s a fair question, and the answer says a lot about OCaml’s design philosophy. OCaml has always favored a small, stable language core . The compiler is famously lean and fast, and the language team is conservative about adding complexity to the specification. A full macro system baked into the compiler would be a significant undertaking – it would need to be designed, specified, maintained, and kept compatible across versions, forever. Instead, OCaml took a more minimal approach: the compiler provides just two things – extension points and attributes – as syntactic hooks in the AST. Everything else lives in the ecosystem. The actual PPX rewriters are ordinary OCaml programs that happen to transform ASTs. The ppxlib framework that ties it all together is a regular library, not part of the compiler. This has some real advantages: The trade-offs are real, though. Rust’s proc macros are more tightly integrated – you get better error messages pointing at macro-generated code, better IDE support for macro expansions, and the macro system is a documented, stable part of the language. With PPX, you’re sometimes left staring at cryptic type errors in generated code and reaching for to figure out what went wrong. That said, OCaml’s approach feels very OCaml – pragmatic, minimal, and trusting the ecosystem to build what’s needed on top of a simple foundation. And in practice, it works remarkably well. PPX wasn’t OCaml’s first metaprogramming system. Before PPX, there was Camlp4 (and its fork Camlp5 ) – a powerful but complex preprocessor that maintained its own parser, separate from the compiler’s parser. Camlp4 could extend OCaml’s syntax in arbitrary ways, which sounds great in theory but was a maintenance nightmare in practice. 
Every OCaml release risked breaking Camlp4, and code using Camlp4 extensions often couldn’t be processed by standard tools like editors and documentation generators. OCaml 4.02 (2014) introduced extension points and attributes directly into the language grammar – syntactic hooks specifically designed for preprocessor extensions. This was a much simpler and more maintainable approach: PPX rewriters use the compiler’s own AST, the syntax is valid OCaml (so tools can still parse your code), and the whole thing is conceptually just “AST in, AST out.” Camlp4 was officially retired in 2019. Today, the PPX ecosystem is built on ppxlib , a unified framework that provides a stable API across OCaml versions and handles all the plumbing for PPX authors. Before diving into specific libraries, let’s decode the bracket soup. PPX uses two syntactic mechanisms built into OCaml: Extension nodes are placeholders that a PPX rewriter must replace with generated code (compilation fails if no PPX handles them): Attributes attach metadata to existing code. Unlike extension nodes, the compiler silently ignores attributes that no PPX handles: The one you’ll see most often is on type declarations. The distinction between , , and is about scope – one for the innermost node, two for the enclosing declaration, three for the whole module-level. Tip: Don’t worry about memorizing all of this upfront. In practice, you’ll mostly use and occasionally or – and the specific PPX library’s documentation will tell you exactly which syntax to use. To use a PPX library in your project, you add it to the stanza in your file: That’s it. List all the PPX rewriters you need after , and Dune takes care of the rest (it even combines them into a single binary for performance). For plugins specifically, you use dotted names like . Let’s look at the PPX libraries that cover probably 90% of real-world use cases. ppx_deriving is the community’s general-purpose deriving framework. 
It comes with several built-in plugins: is the one you’ll reach for first – it’s essentially the answer to “how do I just print this thing?” that every OCaml newcomer asks sooner or later. The most commonly used plugins: A neat convention: if your type is named (as is idiomatic in OCaml), the generated functions drop the type name suffix – you get , , , instead of , , etc. You can also customize behavior per field with attributes: And you can derive for anonymous types inline: ppx_deriving_yojson generates JSON serialization and deserialization functions using the Yojson library: You can use or if you only need one direction. This is incredibly useful in practice – writing JSON serializers by hand for complex types is tedious and error-prone. If you’re using Jane Street’s Core library, you’ll encounter S-expression serialization everywhere. ( Tip: Jane Street bundles most of their PPXs into a single ppx_jane package, so you can add just to your instead of listing each one individually.) ppx_sexp_conv generates converters between OCaml types and S-expressions: The attributes here are quite handy – provides a default value during deserialization, and means the field is represented as a present/absent atom rather than . Two more Jane Street PPXs that you’ll see a lot in Core-based codebases. ppx_fields_conv generates first-class accessors and iterators for record fields: ppx_variants_conv does something similar for variant types – generating constructors as functions, fold/iter over all variants, and more. These Jane Street PPXs let you write tests directly in your source files: ppx_expect is particularly nice – it captures printed output and compares it against expected output: If the output doesn’t match, the test fails and you can run to automatically update the expected output in your source file. It’s a very productive workflow for testing functions that produce output. 
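To make the deriving discussion above a little more concrete, here is a rough, hand-written sketch of the kind of functions a "show"- or "eq"-style deriver saves you from writing. It deliberately uses no PPX at all, so it runs with the plain compiler. The `person` type and the function names (`pp_person`, `show_person`, `equal_person`) are hypothetical, and the exact names and output format produced by a real deriver may well differ.

```ocaml
(* A hypothetical record type; in a PPX-using project this is where a
   deriving attribute would be attached instead of writing the code below. *)
type person = { name : string; age : int }

(* Hand-written formatter: the sort of function a "show"-style deriver
   would otherwise generate from the type definition. *)
let pp_person fmt p =
  Format.fprintf fmt "{ name = %S; age = %d }" p.name p.age

(* String-returning wrapper built on top of the formatter. *)
let show_person p = Format.asprintf "%a" pp_person p

(* Field-by-field equality, as an "eq"-style deriver might produce. *)
let equal_person a b =
  String.equal a.name b.name && Int.equal a.age b.age

let () =
  let alice = { name = "Alice"; age = 30 } in
  print_endline (show_person alice);
  assert (equal_person alice { name = "Alice"; age = 30 })
```

Every new field added to `person` means touching all three functions by hand, which is exactly the upkeep that generating them from the type definition avoids.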
ppx_let provides syntactic sugar for working with monads and other “container” types: How does know which to call? It looks for a module in scope that provides the underlying and functions. In practice, you’ll typically open a module that defines before using : Note: Since OCaml 4.08, the language has built-in binding operators ( , , , ) that cover the basic use cases of without needing a preprocessor. If you’re not using Jane Street’s ecosystem, binding operators are probably the simpler choice. still offers extra features like , , and optimized though. ppx_blob is beautifully simple – it embeds a file’s contents as a string at compile time: No more worrying about file paths at runtime or packaging data files with your binary. The file contents become part of your compiled program. One thing that’s always bugged me about OCaml is the lack of string interpolation. ppx_string fills that gap: The suffix tells the PPX to convert the value using . You can use any module that provides a function. Most OCaml developers will never need to write a PPX, but understanding the basics helps demystify the whole system. Let’s build a very simple one. Say we want an extension that converts a string literal to uppercase at compile time. Here’s the complete implementation using ppxlib : The dune file: The key pieces are: For more complex PPXs (especially derivers), you’ll also want to use Metaquot ( ), which lets you write AST-constructing code using actual OCaml syntax instead of manual AST builder calls: The ppxlib documentation has excellent tutorials if you want to go deeper. One practical tip: when something goes wrong with PPX-generated code and you’re staring at a confusing type error, you can inspect what the PPX actually generated: Seeing the expanded code often makes the error immediately obvious. Most of the introductory PPX content out there was written around 2018-2019, so it’s worth noting how things have evolved since then. 
The big story has been ppxlib’s consolidation of the ecosystem. Back in 2019, some PPX rewriters still used the older (OMP) library, creating fragmentation. By 2021, nearly all PPXs had migrated to ppxlib, effectively ending the split. Today ppxlib is the way to write PPX rewriters – there’s no real alternative to consider.

The transition hasn’t always been smooth, though. In 2025, ppxlib 0.36.0 bumped its internal AST to match OCaml 5.2, which changed how functions are represented in the parse tree. This broke many downstream PPXs and temporarily split the opam universe between packages that worked with the new version and those that didn’t. The community worked through it with proactive patching, but it highlighted an ongoing tension in the PPX world: ppxlib shields you from most compiler changes, but major AST overhauls still ripple through the ecosystem.

On the API side, ppxlib is gradually deprecating its copy of in favor of , with plans to remove entirely in a future 1.0.0 release. If you’re writing a new PPX today, use exclusively.

Meanwhile, OCaml 4.08’s built-in binding operators ( , , etc.) have reduced the need for in projects that don’t use Jane Street’s ecosystem. It’s a nice example of the language absorbing a pattern that PPX pioneered. Perhaps one day we’ll see more of this (e.g. native string interpolation).

This article covers a lot of ground, but the PPX topic is pretty deep and complex, so depending on how far you want to go you might want to read more on it. Here are some of the best resources I’ve found on PPX:

Preprocessors and PPXs – the official OCaml documentation on metaprogramming. A solid reference, though it assumes some comfort with the compiler internals.
An Introduction to OCaml PPX Ecosystem – Nathan Rebours’ 2019 deep dive for Tarides. This is the most thorough tutorial on writing PPX rewriters I’ve seen. Some API details have changed since 2019 (notably the → shift), but the concepts and approach are still excellent.
ppxlib Quick Introduction – ppxlib’s own getting-started guide. The best place to begin if you want to write your own PPX.
A Guide to PreProcessor eXtensions – OCamlverse’s reference page with a comprehensive list of available PPX libraries.
A Guide to Extension Points in OCaml – Whitequark’s original 2014 guide that introduced many developers to PPX. Historically interesting as a snapshot of the early PPX days.

I was amused to see whitequark’s name pop up while I was doing research for this article – we collaborated quite a bit back in the day on her Ruby parser project, which was instrumental to RuboCop. Seems you can find (former) Rubyists in pretty much every language community.

This article turned out to be a beast! I’ve wanted to write something on the subject for quite a while now, but I’ve kept postponing it because I was too lazy to do all the necessary research. I’ll feel quite relieved to put it behind me!

PPX might look intimidating at first – all those brackets and symbols can feel like line noise. But the core idea is simple: PPX generates boilerplate code from your type definitions at compile time. You annotate your types with what you want ( , , , , etc.), and the PPX rewriter produces the code you’d otherwise have to write by hand.

For day-to-day OCaml programming, you really only need to know:

on type declarations to generate useful functions
How to add PPX libraries to your dune file with
Which PPX libraries exist for common tasks (serialization, testing, pretty-printing)

The “writing your own PPX” part is there for when you need it, but honestly most OCaml developers get by just fine using the existing ecosystem. That’s all I have for you today. Keep hacking!

Advantages of keeping PPX outside the compiler:

The ecosystem can evolve independently. ppxlib can ship new features, fix bugs, and improve APIs without waiting for a compiler release. Compare this to Rust, where changes to the proc macro system require the full RFC process and a compiler update.
Tooling stays simple. Because and are valid OCaml syntax, every tool – editors, formatters, documentation generators – can parse PPX-annotated code without knowing anything about the specific PPX. The code is always syntactically valid OCaml, even before preprocessing.
The compiler stays lean. No macro expander, no hygiene system, no special compilation phases – just a hook that says “here, transform this AST before I type-check it.”

Key pieces of the example PPX:

– registers an extension with a name, the context where it can appear (expressions, patterns, types, etc.), the expected payload pattern, and an expansion function.
– a pattern-matching DSL for destructuring AST nodes. Here matches a string literal and captures its value.
– helpers for constructing AST nodes. builds a string literal expression.
– registers the rule with ppxlib’s driver.


You can't always fix it

I have some weird hobbies, and one of those is opening up the network tab on just about anything I'm using. Sometimes, I find egregious problems. Usually, this is something that can be fixed, when responsibly reported. But over time, I learned a bitter lesson: sometimes, you can't get it fixed. Recently, I was waiting for a time-sensitive delivery of medication. It used a courier company which focused on just delivering prescription medications. I opened up the tracking page on my computer, and saw the information I wanted: the medication would probably arrive around 6 PM. But... what if there's more? And what are they doing with my data? Can anyone else see it? So I peeked at the network tools, and was disappointed by what I saw. The first time this happened, I was surprised. By now, I expect to see this. And what I saw was every customer's address along the delivery route. I also saw how much the courier would get paid per stop, what their hourly rate was, and the driver's GPS coordinates (though these were sometimes missing). After the package was delivered, the tracking page changed and displayed a feedback form, my signature, and a picture of my porch. The JSON payload no longer included the entire route, but it included my address, and the payload from an easily guessable related endpoint did still contain the entire route. And that route? It included other recipients' ids, which can be used to find their home addresses, names, contents of the package (sometimes), a photo of their porch, and a copy of their signature. Um. This is bad, right? I've actually found approximately this vulnerability in two separate couriers' tracking pages (and they're using different software). One of them was even worse for them, it included their Stripe private key, I suppose as a bug bounty for people without ethics. And each time I find it, I try to report it. And I fail. They don't let me report it. These companies don't list security contacts. 
The staff I can find on LinkedIn or their website don't have email addresses that I can find or guess. Mail sent to the addresses I do find listed has all bounced. I tried going through back channels. I messaged the pharmacy which was using this courier. I talked to my prescriber, who was shocked at this issue. And the next time I got a delivery, it came via UPS instead (they do not have a leaky sieve for a tracking page, but they did "lose" my prescription once). But I don't know if they just did that for me , the miscreant who looks at her network tools? Or did they switch everyone over to a different courier? Either way, at least my data was safe now, right? It was, until I started using a different pharmacy, and this one is back to using the leaky couriers again. Sigh. I got pretty upset about this at one point. There's a security issue! Data is being leaked, I must get this fixed! And someone told me something really wise: "it's not your responsibility to fix this, and you've done everything you can (and more than you had to)." And ultimately, she was right. I was getting myself worked up about it, but it's not my responsibility to fix. Sometimes there will be things like this that are bad, that I cannot fix, and that I have to accept. So, where do I go from here? I could probably publicly name-and-shame the couriers, but it would not do anything productive. It would not get their attention to fix it, and it wouldn't be seen by the folks who need to know (pharmacists and prescribers). So I'm not going to disclose the specific company, because the main thing it would do is risk me getting in legal trouble, for dubious benefit. I've already notified the pharmacists and prescribers that I know; it's on them, if they want to let anyone else know.

Binary Igor 1 months ago

JSON Documents Performance, Storage and Search: MongoDB vs PostgreSQL

Does MongoDB still have an edge as a document-oriented database for JSON in particular? Or is Postgres better? Or at least good-enough to stick with it, since it is a more universal database, offering a richer feature set and wider applicability?

Brain Baking 1 months ago

Managing Multiple Development Ecosystem Installs

In the past year, I occasionally required another Java Development Kit besides the usual one defined in to build certain modules against older versions and certain modules against bleeding edge versions. In the Java world, that’s rather trivial thanks to IntelliJ’s project settings: you can just interactively click through a few panels to install another JDK flavour and get on with your life. The problem starts once you close IntelliJ and want to do some command line work. Luckily, SDKMan, “The Software Development Kit Manager”, has got you covered. Want to temporarily change the Java compiler for the current session? . Want to change the default? . Easy! will point to , a symlink that gets rewired by SDKMan. A Java project still needs a dependency management system such as Gradle, but you don’t need to install a specific Gradle version globally. Instead, just points to the jar living at . Want another one? Change the version number in and it’ll be auto-downloaded. Using Maven instead? Tough luck! Just kidding: don’t use but , the Maven Wrapper that works exactly the same. .NET comes with built-in support to change the toolchain (and specify the runtime target), more or less equal to a typical Gradle project. Actually, the command can both build and list its own installed toolchains: . Yet installing a new one is done by hand. You switch toolchains by specifying the SDK version in a global.json file and telling the compiler to target a runtime in the file. In Python, the concept of virtual environments should solve that problem: each project creates its own that points to a specific version of Python. Yet I never really enjoyed working with this system: you’ve got , , , , , … That confusing mess is solved with a relatively new kid in town: uv, “An extremely fast Python package and project manager, written in Rust.” It’s more than as it also manages your multiple development ecosystems. Want to install a new Python distribution? . 
Want to temporarily change the Python binary for the current session? . Creating a new project with will also create a virtual environment, meaning you don’t run your stuff with but with that auto-selects the correct version. Lovely! What about JS/TS and Node? Of course, the options there are many: there’s nvm (but that’s been semi-abandoned?), and of course someone built a Rust alternative called fnm, but you can also manage Node versions with . I personally don’t care and use bun instead, which is aimed at not managing but replacing the Node JS runtime. But who will manage the bun versions? PHP is more troublesome because it’s tied to a web server. Solutions such as Laravel Herd combine both PHP and web server dependency management into a sleek looking tool that’s “free”. Of course you can let your OS package manager manage your SDK packages: and then . That definitely feels a bit more hacky. For PHP, I’d even consider Mise. Speaking of which… Why use a tool that limits the scope to one specific development environment? If you’re a full-stack developer you’ll still need to know how to manage both your backend and frontend dev environment. That’s not needed with Mise-en-place, a tool that manages all these things. Asdf is another popular one that manages any development environment that doesn’t have its own dedicated tool. I personally think that’s an abstraction layer too far. You’ll still need to dissect these tools separately in case things go wrong. Some ecosystems come with built-in multi-toolkit support, such as Go: simply installs into your directory 1. That means you’ve installed the compiler (!) in exactly the same way as any other (global) dependency, how cool is that? The downside of this is that you’ll have to remember to type instead of so there’s no symlink rewiring involved. or can do that, or the above Mise. But wait, I hear you think, why not just use containers to isolate everything? 
Spinning up containers to build in an isolated environment: sure, that’s standard practice in continuous integration servers, but locally? Really? Really. Since the inception of Dev Containers by Microsoft, specifically designed for VS Code, working “inside” a container is as easy as opening up the project and “jumping inside the container”. From that moment on, your terminal, IntelliSense, … runs inside that container. That means you won’t have to wrestle Node/PHP versions on your local machine, and you can even use the same container to build your stuff on the CI server. That also means your newly onboarded juniors don’t need to wrestle through a week of “installing stuff”. Microsoft open sourced the Dev Container specification and the JetBrains folks jumped the gun: it has support for but I have yet to try it out. Of course the purpose was to integrate this into GitHub: their cloud-based IDE Codespaces makes heavy use of the idea (and yes, there’s an open-source alternative). Is there Emacs support for Dev Containers? Well, Tramp allows you to remotely open and edit any file, also inside a container. So just install the Dev Container CLI, run it and point Emacs to a source file inside it. From then on, everything Emacs does, including the LSP server, compilation, …, happens inside that container. That means you’ll also have to install your LSP binaries in there. devcontainer.el just wraps compilation commands to execute inside the container whilst still letting you edit everything locally in case you prefer a hybrid approach. And then there’s Nix and devenv. Whatever that does, it goes way over my head!

1. You’ll still have to execute after that. ↩︎

Related topics: / containers / By Wouter Groeneveld on 26 February 2026.

Evan Schwartz 1 months ago

Great RSS Feeds That Are Too Noisy to Read Manually

Some RSS feeds are fantastic but far too noisy to add to most RSS readers directly. Without serious filtering, you'd get swamped with more posts than you could possibly read, while missing the hidden gems. I built Scour specifically because I wanted to find the great articles I was missing in noisy feeds like these, without feeling like I was drowning in unread posts. If you want to try it, you can add all of these sources in one click . But these feeds are worth knowing about regardless of what reader you use. Feed: https://hnrss.org/newest Thousands of posts are submitted to Hacker News each week. While the front page gives a sense of what matches the tech zeitgeist, there are plenty of interesting posts that get buried simply because of the randomness of who happens to be reading the Newest page and voting in the ~20 minutes after posts are submitted. (You can try searching posts that were submitted but never made the front page in this demo I built into the Scour docs.) Feed: https://feeds.pinboard.in/rss/recent/ Pinboard describes itself as "Social Bookmarking for Introverts". The recent page is a delightfully random collection of everything one of the 30,000+ users has bookmarked. Human curated, without curation actually being the goal. Feed: https://bearblog.dev/discover/feed/?newest=True Bear is "A privacy-first, no-nonsense, super-fast blogging platform". This post is published on it, and I'm a big fan. The Discovery feed gives a snapshot of blogs that users have upvoted on the platform. But, even better than that, the Most Recent feed gives you every post published on it. There are lots of great articles, and plenty of blogs that are just getting started. Feed: https://feedle.world/rss Feedle is a search engine for blogs and podcasts. You can search for words or phrases among their curated collection of blogs, and every search can become an RSS feed. An empty search will give you a feed of every post published by any one of their blogs. 
Feed: https://kagi.com/api/v1/smallweb/feed/ Kagi, the search engine, maintains an open source list of around 30,000 "small web" websites that are personal and non-commercial sites. Their Small Web browser lets you browse random posts one at a time. The RSS feed gives you every post published by any one of those websites. Feed: https://threadreaderapp.com/rss.xml Thread Reader is a Twitter/X bot that lets users "unroll" threads into an easier-to-read format. While getting RSS feeds out of Twitter/X content is notoriously difficult, Thread Reader provides an RSS feed of all threads that users have used them to unroll. Like the content on that platform, the threads are very hit-or-miss, but there are some gems in there. Not an RSS feed: https://minifeed.net/global Minifeed is a nice "curated blog reader and search engine". They have a Global page that shows every post published by one of the blogs they've indexed. While this isn't technically an RSS feed, I thought it deserved a mention. Note that Scour can add some websites that don't have RSS feeds. It treats pages with repeated structures that look like blogs (e.g. they have links, titles, and publish dates) as if they were RSS feeds. Minifeed's Global view is one such page, so you can also get every post published from any one of their collected blogs. Feeds galore: https://info.arxiv.org/help/rss.html arXiv has preprint academic articles for technical fields ranging from Computer Science and Mathematics to Physics and Quantitative Biology. Like many of the feeds listed above, most of the categories are very noisy. But, if you're into reading academic articles, there is also plenty of great new research hidden in the noise. Every field and sub-field has its own RSS feed. (You can browse them and subscribe on Scour here ). 
While reading my Scour feed, I'll often check which feeds an article I liked came from (see what this looks like here ), and I'm especially delighted when it comes from some source I had no idea existed. These types of noisy feeds are great ways of discovering new content and new blogs, but you definitely need some good filters to make use of them. I hope you'll give Scour a try! P.S. Scour makes all of the feeds it creates consumable as RSS/Atom/JSON feeds , so you can add your personalized feed or each of your interests-specific feeds to your favorite feed reader. Read more in this guide for RSS users .
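To make the "good filters" point concrete, here is a minimal sketch of the crudest possible filter: a keyword match over the items of an RSS 2.0 feed, using only the Python standard library. This is not how Scour ranks posts; the function name `filter_items`, the sample feed, and the keyword list are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def filter_items(rss_xml: str, keywords: list[str]) -> list[tuple[str, str]]:
    """Return (title, link) for items whose title or description mentions a keyword."""
    wanted = [k.lower() for k in keywords]
    matches = []
    # RSS 2.0 puts each post in an <item> element under <channel>.
    for item in ET.fromstring(rss_xml).iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        haystack = (title + " " + description).lower()
        if any(k in haystack for k in wanted):
            matches.append((title, item.findtext("link", default="")))
    return matches


# Hypothetical two-item feed, standing in for a real noisy feed.
SAMPLE = """<rss version="2.0"><channel>
<item><title>Writing a PPX in OCaml</title><link>https://example.com/a</link><description>Metaprogramming notes</description></item>
<item><title>My weekend</title><link>https://example.com/b</link><description>Hiking photos</description></item>
</channel></rss>"""

interesting = filter_items(SAMPLE, ["ocaml"])
```

A plain substring match like this misses anything phrased differently from your keywords, which is part of why a noisy feed needs something smarter than a keyword list.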

(think) 1 months ago

Supercharging Claude Code with the Right (CLI) Tools

I’ve been using Claude Code quite a bit lately, and I got curious – what if I asked it directly which tools would make it more productive? Not the usual suspects like , or , but tools it wishes it had access to, tools that would genuinely extend its capabilities. So I did exactly that. I asked Claude Code: “What are the most valuable CLI tools I could install for you, outside of the ones you already have?” The answer was surprisingly thoughtful and insightful, so I figured I’d share it here along with my own commentary. Here are 10 tools, ranked by how useful they’d be for an AI coding assistant. Note: I write all my blog posts old-school, but this time around I took the liberty to just extend with my comments the output generated by Claude Code. Note also that the post includes some installation instructions that are macOS-specific. That’s what I got from Claude on my local machine (a Mac mini), and I felt it didn’t make much sense to tweak them given how many combinations of operating systems and package managers exist. This was Claude’s number one pick, and I can see why. ast-grep does structural code search and refactoring using AST patterns. Instead of fumbling with regex to find “all calls to function X with 3 arguments”, you write patterns that look like actual code: This is the kind of thing where regex is fragile and error-prone, but AST matching just works. Supports 20+ languages via tree-sitter . A structural diff tool that understands syntax. difftastic compares files by AST nodes rather than lines, so it won’t flag whitespace changes or reformatting as meaningful diffs. This makes reviewing AI-generated changes much clearer – and let’s be honest, reviewing changes is half the job when working with an AI assistant. AI assistants generate a lot of shell commands, and shell scripting is notoriously full of pitfalls (unquoted variables, vs. , POSIX compatibility…). ShellCheck catches these before they blow up. 
Given that shell bugs can be destructive (e.g., expanding to ), having a safety net here is valuable. A modern replacement with sane regex syntax – no more escaping nightmares. Uses standard PCRE-style regex and has a string-literal mode ( ) for replacing code strings full of metacharacters. Simple, but it eliminates a whole class of errors when generating substitution commands. Sloc Cloc and Code – a fast code counter that gives you an instant overview of a codebase: languages, lines of code, complexity estimates. Understanding the shape of a project before diving in is genuinely useful context for an AI assistant, and this is hard to replicate by manually scanning files. Note: I was under the impression that cloc is a better tool, but perhaps I was mistaken. 1 for YAML (and JSON, TOML, XML). Modern projects are drowning in YAML – GitHub Actions workflows, Kubernetes manifests, Docker Compose files. yq can programmatically query and update YAML while preserving comments and formatting, which is much more reliable than text-based editing that can break indentation. Structural search and replace that works across languages without needing a full parser. Complements ast-grep for simpler pattern matching – it understands delimiters (braces, parens, quotes) but doesn’t need tree-sitter grammar support. Great for quick refactoring across less common languages or config files. Note: I was happy to see that was written in OCaml, but when I installed it I got a warning that the project was deprecated and doesn’t support OCaml 5, so I’m not sure about its future. A command-line benchmarking tool that runs commands multiple times and gives you proper statistical analysis. When you ask an AI to optimize something, it’s nice to have real numbers. The flag produces results ready for a PR description. A file watcher that executes commands when files change. 
Useful for setting up persistent feedback loops – rerun tests on save, rebuild docs when markdown changes, restart a dev server after config edits. One command instead of cobbling together something with and shell scripts. A syntax-highlighting pager for and friends. Provides word-level diff highlighting, so when only a variable name changes in a long line, you see exactly that. Mostly benefits the human reviewing the AI’s work, but that’s arguably where it matters most. If you only install one tool from this list, make it ast-grep. It’s the biggest capability gap – an AI assistant limited to regex-based search and replace is like a carpenter limited to a hand saw. Everything else is nice to have, but structural code understanding is a genuine superpower. You can install everything at once if you’re feeling adventurous: I’m not ashamed to admit that I had never heard of some of the tools (e.g. , and ), and I had only one of them installed ( ). 2 It’s never too late to learn something new! By the way, keep in mind that depending on the programming languages that you’re using there are other language-specific tools that you can benefit from, so make sure to ask your favorite AI coding tool about those. That’s all I have for you today. Keep hacking!

1. I asked Claude about this as well and it told me that it prefers scc because it’s written in Go (as opposed to Perl) and therefore it’s much faster than cloc. ↩

2. Of course, I didn’t really have it installed - I only thought I did, otherwise Claude wouldn’t have suggested it. (I switch between computers and my setup on all of them is not exactly the same) ↩

Simon Willison 1 months ago

Two new Showboat tools: Chartroom and datasette-showboat

I introduced Showboat a week ago - my CLI tool that helps coding agents create Markdown documents that demonstrate the code that they have created. I've been finding new ways to use it on a daily basis, and I've just released two new tools to help get the best out of the Showboat pattern. Chartroom is a CLI charting tool that works well with Showboat, and datasette-showboat lets Showboat's new remote publishing feature incrementally push documents to a Datasette instance. I normally use Showboat in Claude Code for web (see note from this morning ). I've used it in several different projects in the past few days, each of them with a prompt that looks something like this: Here's the resulting document . Just telling Claude Code to run is enough for it to learn how to use the tool - the help text is designed to work as a sort of ad-hoc Skill document. The one catch with this approach is that I can't see the new Showboat document until it's finished. I have to wait for Claude to commit the document plus embedded screenshots and push that to a branch in my GitHub repo - then I can view it through the GitHub interface. For a while I've been thinking it would be neat to have a remote web server of my own which Claude instances can submit updates to while they are working. Then this morning I realized Showboat might be the ideal mechanism to set that up... Showboat v0.6.0 adds a new "remote" feature. It's almost invisible to users of the tool itself, instead being configured by an environment variable. Set a variable like this: And every time you run a or or or command the resulting document fragments will be POSTed to that API endpoint, in addition to the Showboat Markdown file itself being updated. There are full details in the Showboat README - it's a very simple API format, using regular POST form variables or a multipart form upload for the image attached to . 
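As a rough sketch of what the receiving side of such an API has to do, the snippet below decodes a form-encoded POST body and appends the fragment to an in-memory document. The field names `uuid` and `markdown` are assumptions made for illustration, not the documented Showboat format (the README has the real one), and multipart image uploads are not handled here.

```python
from urllib.parse import parse_qs


def parse_fragment(body: bytes) -> dict[str, str]:
    """Decode an application/x-www-form-urlencoded POST body into a flat dict."""
    fields = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    # parse_qs returns a list of values per key; keep the last one for each field.
    return {key: values[-1] for key, values in fields.items()}


def append_fragment(documents: dict[str, list[str]], fragment: dict[str, str]) -> None:
    """Append one fragment's Markdown to the document it belongs to."""
    doc_id = fragment["uuid"]  # hypothetical field name
    documents.setdefault(doc_id, []).append(fragment.get("markdown", ""))  # hypothetical field name


# Two fragments for the same document arriving one after the other.
documents: dict[str, list[str]] = {}
append_fragment(documents, parse_fragment(b"uuid=abc&markdown=%23+Title"))
append_fragment(documents, parse_fragment(b"uuid=abc&markdown=Some+notes"))
```

Keying fragments by a document identifier is what lets a server show a document growing while the agent is still working on it.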
It's simple enough to build a webapp to receive these updates from Showboat, but I needed one that I could easily deploy and would work well with the rest of my personal ecosystem. So I had Claude Code write me a Datasette plugin that could act as a Showboat remote endpoint. I actually had this building at the same time as the Showboat remote feature, a neat example of running parallel agents . datasette-showboat is a Datasette plugin that adds a endpoint to Datasette for viewing documents and a endpoint for receiving updates from Showboat. Here's a very quick way to try it out: Click on the sign in as root link that shows up in the console, then navigate to http://127.0.0.1:8001/-/showboat to see the interface. Now set your environment variable to point to this instance: And run Showboat like this: Refresh that page and you should see this: Click through to the document, then start Claude Code or Codex or your agent of choice and prompt: The command assigns a UUID and title and sends those up to Datasette. The best part of this is that it works in Claude Code for web. Run the plugin on a server somewhere (an exercise left up to the reader - I use Fly.io to host mine) and set that environment variable in your Claude environment, then any time you tell it to use Showboat the document it creates will be transmitted to your server and viewable in real time. I built Rodney , a CLI browser automation tool, specifically to work with Showboat. It makes it easy to have a Showboat document load up web pages, interact with them via clicks or injected JavaScript and captures screenshots to embed in the Showboat document and show the effects. This is wildly useful for hacking on web interfaces using Claude Code for web, especially when coupled with the new remote publishing feature. 
I only got this stuff working this morning and I've already had several sessions where Claude Code has published screenshots of its work in progress, which I've then been able to provide feedback on directly in the Claude session while it's still working. A few days ago I had another idea for a way to extend the Showboat ecosystem: what if Showboat documents could easily include charts? I sometimes fire up Claude Code for data analysis tasks, often telling it to download a SQLite database and then run queries against it to figure out interesting things from the data. With a simple CLI tool that produced PNG images I could have Claude use Showboat to build a document with embedded charts to help illustrate its findings. Chartroom is exactly that. It's effectively a thin wrapper around the excellent matplotlib Python library, designed to be used by coding agents to create charts that can be embedded in Showboat documents. Here's how to render a simple bar chart: It can also do line charts, bar charts, scatter charts, and histograms - as seen in this demo document that was built using Showboat. Chartroom can also generate alt text. If you add to the above it will output the alt text for the chart instead of the image: Or you can use or to get the image tag with alt text directly: I added support for Markdown images with alt text to Showboat in v0.5.0 , to complement this feature of Chartroom. Finally, Chartroom has support for different matplotlib styles . I had Claude build a Showboat document to demonstrate these all in one place - you can see that at demo/styles.md . I started the Chartroom repository with my click-app cookiecutter template, then told a fresh Claude Code for web session: We are building a Python CLI tool which uses matplotlib to generate a PNG image containing a chart. It will have multiple sub commands for different chart types, controlled by command line options. 
Everything you need to know to use it will be available in the single "chartroom --help" output. It will accept data from files or standard input as CSV or TSV or JSON, similar to how sqlite-utils accepts data - clone simonw/sqlite-utils to /tmp for reference there. Clone matplotlib/matplotlib for reference as well It will also accept data from --sql path/to/sqlite.db "select ..." which runs in read-only mode Start by asking clarifying questions - do not use the ask user tool though it is broken - and generate a spec for me to approve Once approved proceed using red/green TDD running tests with "uv run pytest" Also while building maintain a demo/README.md document using the "uvx showboat --help" tool - each time you get a new chart type working commit the tests, implementation, root level README update and a new version of that demo/README.md document with an inline image demo of the new chart type (which should be a UUID image filename managed by the showboat image command and should be stored in the demo/ folder Make sure "uv build" runs cleanly without complaining about extra directories but also ensure dist/ and uv.lock are in gitignore This got most of the work done. You can see the rest in the PRs that followed. The Showboat family of tools now consists of Showboat itself, Rodney for browser automation, Chartroom for charting and datasette-showboat for streaming remote Showboat documents to Datasette. I'm enjoying how these tools can operate together based on a very loose set of conventions. If a tool can output a path to an image Showboat can include that image in a document. Any tool that can output text can be used with Showboat. I'll almost certainly be building more tools that fit this pattern. They're very quick to knock out! 
The environment variable mechanism for Showboat's remote streaming is a fun hack too - so far I'm just using it to stream documents somewhere else, but it's effectively a webhook extension mechanism that could likely be used for all sorts of things I haven't thought of yet.

Ankur Sethi 2 months ago

I used a local LLM to analyze my journal entries

In 2025, I wrote 162 journal entries totaling 193,761 words. In December, as the year came to a close and I found myself in a reflective mood, I wondered if I could use an LLM to comb through these entries and extract useful insights. I’d had good luck extracting structured data from web pages using Claude, so I knew this was a task LLMs were good at. But there was a problem: I write about sensitive topics in my journal entries, and I don’t want to share them with the big LLM providers. Most of them have at least a thirty-day data retention policy, even if you call their models using their APIs, and that makes me uncomfortable. Worse, all of them have safety and abuse detection systems that get triggered if you talk about certain mental health issues. This can lead to account bans or human review of your conversations. I didn’t want my account to get banned, and the very idea of a stranger across the world reading my journal mortifies me. So I decided to use a local LLM running on my MacBook for this experiment. Writing the code was surprisingly easy. It took me a few evenings of work—and a lot of yelling at Claude Code—to build a pipeline of Python scripts that would extract structured JSON from my journal entries. I then turned that data into boring-but-serviceable visualizations. This was a fun side-project, but the data I extracted didn’t quite lead me to any new insights. That’s why I consider this a failed experiment. The output of my pipeline only confirmed what I already knew about my year. Besides, I didn’t have the hardware to run the larger models, so some of the more interesting analyses I wanted to run were plagued with hallucinations. Despite how it turned out, I’m writing about this experiment because I want to try it again in December 2026. I’m hoping I won’t repeat my mistakes again. Selfishly, I’m also hoping that somebody who knows how to use LLMs for data extraction tasks will find this article and suggest improvements to my workflow. 
I’ve pushed my data extraction and visualization scripts to GitHub. It’s mostly LLM-generated slop, but it works. The most interesting and useful parts are probably the prompts.

Now let’s look at some graphs. I ran 12 different analyses on my journal, but I’m only including the output from 6 of them here. Most of the others produced nonsensical results or were difficult to visualize. For privacy, I’m not using any real names in these graphs.

[Graph] Here’s how I divided time between my hobbies through the year.
[Graph] Here are my most mentioned hobbies.
[Graph] This one is media I engaged with. There isn’t a lot of data for this one.
[Graph] How many mental health issues I complained about each day across the year.
[Graph] How many physical health issues I complained about each day across the year.
[Graph] The big events of 2025.
[Graph] The communities I spent most of my time with.
[Graph] Top mentioned people throughout the year.

I ran all these analyses on my MacBook Pro with an M4 Pro and 48GB RAM. This hardware can just barely manage to run some of the more useful open-weights models, as long as I don’t run anything else. For running the models, I used Apple’s package.

Picking a model took me longer than putting together the data extraction scripts. People on /r/LocalLlama had a lot of strong opinions, but there was no clear “best” model when I ran this experiment. I just had to try out a bunch of them and evaluate their outputs myself. If I had more time and faster hardware, I might have looked into building a small-scale LLM eval for this task. But for this scenario, I picked a few popular models, ran them on a subset of my journal entries, and picked one based on vibes.

This project finally gave me an excuse to learn all the technical terms around LLMs. What’s quantization? What does the number of parameters do? What does it mean when a model has , , , or in its name? What is a reasoning model? What’s MoE? What are active parameters? This was fun, even if my knowledge will be obsolete in six months.
In the beginning, I ran all my scripts with Qwen 2.5 Instruct 32b at 8-bit quantization as the model. This fit in my RAM with just enough room left over for a browser, text editor, and terminal. But Qwen 2.5 didn’t produce the best output and hallucinated quite a bit, so I ran my final analyses using Llama-3.3 70B Instruct at 3-bit quantization. This could just about fit in my RAM if I quit every other app and increased the amount of GPU RAM a process was allowed to use. While quickly iterating on my Python code, I used a tiny model: Qwen 3 4b Instruct quantized to 4 bits.

A major reason this experiment didn’t yield useful insights was that I didn’t know what questions to ask the LLM. I couldn’t do a qualitative analysis of my writing (the kind of analysis a therapist might be able to do) because I’m not a trained psychologist. Even if I could figure out the right prompts, I wouldn’t want to do this kind of work with an LLM. The potential for harm is too great, and the cost of mistakes is too high. With a few exceptions, I limited myself to extracting quantitative data only.

From each journal entry, I extracted the following information:

None of the models was as accurate as I had hoped at extracting this data. In many cases, I noticed hallucinations and examples from my system prompt leaking into the output, which I had to clean up afterwards. Qwen 2.5 was particularly susceptible to this. Some of the analyses (e.g. list of new people I met) produced nonsensical results, but that wasn’t really the fault of the models. They were all operating on a single journal entry at a time, so they had no sense of the larger context of my life.

I couldn’t run all my journal entries through the LLM at once. I didn’t have that kind of RAM and the models didn’t have that kind of context window. I had to run the analysis one journal entry at a time.
Even then, my computer choked on some of the larger entries, and I had to write my scripts in a way that I could run partial analyses or continue failed analyses. Trying to extract all the information listed above in one pass produced low-quality output. I had to split my analysis into multiple prompts and run them one at a time. Surprisingly, none of the models I tried had an issue with the instruction . Even the really tiny models had no problems following the instruction. Some of them occasionally threw in a Markdown fenced code block, but it was easy enough to strip using a regex. My prompts were divided into two parts: The task-specific prompts included detailed instructions and examples that made the structure of the JSON output clear. Every model followed the JSON schema mentioned in the prompt, and I rarely ever ran into JSON parsing issues. But the one issue I never managed to fix was the examples from the prompts leaking into the extracted output. Every model insisted that I had “dinner with Sarah” several times last year, even though I don’t know anybody by that name. This name came from an example that formed part of one of my prompts. I just had to make sure the examples I used stood out—e.g., using names of people I didn’t know at all or movies I hadn’t watched—so I could filter them out using plain old Python code afterwards. Here’s what my prompt looked like: To this prompt, I appended task-specific prompts. Here’s the prompt for extracting health issues mentioned in an entry: You can find all the prompts in the GitHub repository . The collected output from all the entries looked something like this: Since my model could only look at one journal entry at a time, it would sometimes refer to the same health issue, gratitude item, location, or travel destination using different synonyms. For example, “exhaustion” and “fatigue” should refer to the same health issue, but they would appear in the output as two different issues. 
My first attempt at de-duplicating these synonyms was to keep a running tally of unique terms discovered during each analysis and append them to the end of the prompt for each subsequent entry. Something like this: But this quickly led to some really strange hallucinations. I still don’t understand why. This list of terms wasn’t even that long, maybe 15-20 unique terms for each analysis. My second attempt at solving this was a separate normalization pass for each analysis. After an analysis finished running, I extracted a unique list of terms from its output file and collected them into a prompt. Then asked the LLM to produce a mapping to de-duplicate the terms. This is what the prompt looked like: There were better ways to do this than using an LLM. But you know what happens when all you have is a hammer? Yep, exactly. The normalization step was inefficient, but it did its job. This was the last piece of the puzzle. With all the extraction scripts and their normalization passes working correctly, I left my MacBook running the pipeline of scripts all day. I’ve never seen an M-series MacBook get this hot. I was worried that I’d damage my hardware somehow, but it all worked out fine. There was nothing special about this step. I just decided on a list of visualizations for the data I’d extracted, then asked Claude to write some code to generate them for me. Tweak, rinse, repeat until done. I’m underwhelmed by the results of this experiment. I didn’t quite learn anything new or interesting from the output, at least nothing I didn’t already know. This was only partly because of LLM limitations. I believe I didn’t quite know what questions to ask in the first place. What was I hoping to discover? What kinds of patterns was I looking for? What was the goal of the experiment besides producing pretty graphs? 
I went into the project with a cool new piece of tech to try out, but skipped the important up-front human-powered thinking work required to extract good insights from data. I neglected to sit down and design a set of initial questions I wanted to answer and assumptions I wanted to test before writing the code. Just goes to show that no amount of generative AI magic will produce good results unless you can define what success looks like. Maybe this year I’ll learn more about data analysis and visualization and run this experiment again in December to see if I can go any further.

I did learn one thing from all of this: if you have access to state-of-the-art language models and know the right set of questions to ask, you can process your unstructured data to find needles in some truly massive haystacks. This allows you to analyze datasets that would take human reviewers months to comb through. A great example is how the NYT monitors hundreds of podcasts every day using LLMs.

For now, I’m putting a pin in this experiment. Let’s try again in December.

Information extracted from each journal entry:

- List of things I was grateful for, if any
- List of hobbies or side-projects mentioned
- List of locations mentioned
- List of media mentioned (including books, movies, games, or music)
- A boolean answer to whether it was a good or bad day for my mental health
- List of mental health issues mentioned, if any
- A boolean answer to whether it was a good or bad day for my physical health
- List of physical health issues mentioned, if any
- List of things I was proud of, if any
- List of social activities mentioned
- Travel destinations mentioned, if any
- List of friends, family members, or acquaintances mentioned
- List of new people I met that day, if any

The two parts of each prompt:

- A “core” prompt that was common across analyses
- Task-specific prompts for each analysis
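A minimal sketch of the cleanup steps this post describes: stripping a stray Markdown code fence before parsing the JSON, filtering out names that leaked from the prompt examples, and applying a synonym mapping from the normalization pass. This is not the code from the GitHub repository. The function names, the `PROMPT_EXAMPLE_NAMES` set, and the shape of the mapping are all assumptions made for illustration.

```python
import json
import re

# Hypothetical: placeholder names used in prompt examples. They should never
# occur in the real journal, so anything matching them is treated as a leak.
PROMPT_EXAMPLE_NAMES = {"sarah"}

# Built from single backticks so this snippet can itself sit inside a fence.
FENCE = "`" * 3
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(raw: str) -> str:
    """Return the text inside a Markdown code fence, or the trimmed input."""
    match = FENCE_RE.match(raw)
    return match.group(1) if match else raw.strip()


def parse_entry_output(raw: str) -> dict:
    """Parse one model response into a dict, tolerating a wrapping fence."""
    return json.loads(strip_code_fence(raw))


def drop_leaked_examples(items, leaked=PROMPT_EXAMPLE_NAMES):
    """Remove items that match a known prompt-example placeholder."""
    return [item for item in items if item.strip().lower() not in leaked]


def normalize(items, mapping):
    """Map synonyms to one canonical term and drop repeats, keeping order."""
    seen = []
    for item in items:
        canonical = mapping.get(item.lower(), item.lower())
        if canonical not in seen:
            seen.append(canonical)
    return seen


if __name__ == "__main__":
    raw = FENCE + 'json\n{"people": ["Sarah", "Alex"]}\n' + FENCE
    people = parse_entry_output(raw)["people"]
    print(drop_leaked_examples(people))
    print(normalize(["Fatigue", "exhaustion"], {"fatigue": "exhaustion"}))
```

Filtering by exact name only works if the placeholder names are chosen to be distinctive, which is the reason given above for using names of people the author doesn’t know.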


The LLM Context Tax: Best Tips for Tax Avoidance

Every token you send to an LLM costs money. Every token increases latency. And past a certain point, every additional token makes your agent dumber. This is the triple penalty of context bloat: higher costs, slower responses, and degraded performance through context rot, where the agent gets lost in its own accumulated noise.

Context engineering is very important. The difference between a $0.50 query and a $5.00 query is often just how thoughtfully you manage context.

Here’s what I’ll cover:

- Stable Prefixes for KV Cache Hits - The single most important optimization for production agents
- Append-Only Context - Why mutating context destroys your cache hit rate
- Store Tool Outputs in the Filesystem - Cursor’s approach to avoiding context bloat
- Design Precise Tools - How smart tool design reduces token consumption by 10x
- Clean Your Data First (Maximize Your Deductions) - Strip the garbage before it enters context
- Delegate to Cheaper Subagents (Offshore to Tax Havens) - Route token-heavy operations to smaller models
- Reusable Templates Over Regeneration (Standard Deductions) - Stop regenerating the same code
- The Lost-in-the-Middle Problem - Strategic placement of critical information
- Server-Side Compaction (Depreciation) - Let the API handle context decay automatically
- Output Token Budgeting (Withholding Tax) - The most expensive tokens are the ones you generate
- The 200K Pricing Cliff (The Tax Bracket) - The tax bracket that doubles your bill overnight
- Parallel Tool Calls (Filing Jointly) - Fewer round trips, less context accumulation
- Application-Level Response Caching (Tax-Exempt Status) - The cheapest token is the one you never send

With Claude Opus 4.6, the math is brutal:

That’s a 10x difference between cached and uncached inputs. Output tokens cost 5x more than uncached inputs. Most agent builders focus on prompt engineering while hemorrhaging money on context inefficiency.
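To make the arithmetic concrete, here’s a rough cost estimate for a single agent task. The $5 input and $25 output prices per million tokens are the Opus figures quoted in this article, and the cached rate is simply derived from the stated 10x difference. Everything else (the `task_cost` helper, the context sizes, the 90% hit rate) is an assumption chosen for illustration, and the sketch ignores any surcharge for writing to the cache.

```python
# USD per million tokens. Input and output prices are the figures quoted in
# this article; the cached price is derived from the stated 10x difference.
PRICE_INPUT = 5.00
PRICE_CACHED_INPUT = PRICE_INPUT / 10
PRICE_OUTPUT = 25.00


def task_cost(steps, base_context, growth_per_step, output_per_step, cache_hit_rate):
    """Estimate the cost of an agent task that re-sends its context each step.

    cache_hit_rate is the fraction of each step's input tokens served from
    the KV cache (0.0 = nothing cached, 1.0 = everything cached).
    """
    total = 0.0
    for step in range(steps):
        # Context grows linearly as tool results accumulate (an assumption).
        context = base_context + step * growth_per_step
        cached = context * cache_hit_rate
        uncached = context - cached
        total += cached * PRICE_CACHED_INPUT / 1e6
        total += uncached * PRICE_INPUT / 1e6
        total += output_per_step * PRICE_OUTPUT / 1e6
    return total


# Hypothetical task: 50 tool calls, 10K starting context, 2K added per step,
# 300 output tokens per step.
cold = task_cost(50, 10_000, 2_000, 300, cache_hit_rate=0.0)
warm = task_cost(50, 10_000, 2_000, 300, cache_hit_rate=0.9)
print(f"no cache hits:  ${cold:.2f}")
print(f"90% cache hits: ${warm:.2f}")
```

Under these made-up numbers, input tokens dominate the bill, which is the reason the rest of this article concentrates on the input side.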
In most agent workflows, context grows substantially with each step while outputs remain compact. This makes input token optimization critical: a typical agent task might involve 50 tool calls, each accumulating context. The performance penalty is equally severe. Research shows that past 32K tokens, most models show sharp performance degradation. Your agent isn’t just getting expensive. It’s getting confused. This is the single most important metric for production agents: KV cache hit rate. The Manus team considers this the most important optimization for their agent infrastructure, and I agree completely. The principle is simple: LLMs process prompts autoregressively, token by token. If your prompt starts identically to a previous request, the model can reuse cached key-value computations for that prefix. The killer of cache hit rates? Timestamps. A common mistake is including a timestamp at the beginning of the system prompt. It’s a simple mistake but the impact is massive. The key is granularity: including the date is fine. Including the hour is acceptable since cache durations are typically 5 minutes (Anthropic default) to 10 minutes (OpenAI default), with longer options available. But never include seconds or milliseconds. A timestamp precise to the second guarantees every single request has a unique prefix. Zero cache hits. Maximum cost. Move all dynamic content (including timestamps) to the END of your prompt. System instructions, tool definitions, few-shot examples, all of these should come first and remain identical across requests. For distributed systems, ensure consistent request routing. Use session IDs to route requests to the same worker, maximizing the chance of hitting warm caches. Context should be append-only. Any modification to earlier content invalidates the KV cache from that point forward. This seems obvious but the violations are subtle: The tool definition problem is particularly insidious. 
If you dynamically add or remove tools based on context, you invalidate the cache for everything after the tool definitions. Manus solved this elegantly: instead of removing tools, they mask token logits during decoding to constrain which actions the model can select. The tool definitions stay constant (cache preserved), but the model is guided toward valid choices through output constraints. For simpler implementations, keep your tool definitions static and handle invalid tool calls gracefully in your orchestration layer.

Deterministic serialization matters too. Python dicts preserve insertion order rather than a canonical order, so the same data assembled in a different sequence serializes differently. If you’re serializing tool definitions or context as JSON, use sort_keys=True or a library that guarantees deterministic output. A different key order = different tokens = cache miss.

Cursor’s approach to context management changed how I think about agent architecture. Instead of stuffing tool outputs into the conversation, write them to files. In their A/B testing, this reduced total agent tokens by 46.9% for runs using MCP tools. The insight: agents don’t need complete information upfront. They need the ability to access information on demand. Files are the perfect abstraction for this.

We apply this pattern everywhere:

- Shell command outputs: Write to files, let agent tail or grep as needed
- Search results: Return file paths, not full document contents
- API responses: Store raw responses, let agent extract what matters
- Intermediate computations: Persist to disk, reference by path

When context windows fill up, Cursor triggers a summarization step but exposes chat history as files. The agent can search through past conversations to recover details lost in the lossy compression. Clever.

A vague tool returns everything. A precise tool returns exactly what the agent needs. Consider an email search tool:

The two-phase pattern: search returns metadata, separate tool returns full content. The agent decides which items deserve full retrieval.
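One way the two-phase email tool could look, as a minimal sketch. The `Email` dataclass, the in-memory `MAILBOX`, and both function names are hypothetical stand-ins for a real mail backend. Only the pattern itself (metadata first, full content on request) comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    id: str
    sender: str
    subject: str
    body: str
    has_attachment: bool = False


# Hypothetical in-memory mailbox standing in for a real mail backend.
MAILBOX = [
    Email("m1", "ir@example.com", "Q4 earnings call", "Full transcript of the call.", True),
    Email("m2", "news@example.com", "Weekly digest", "Markets were mixed this week."),
    Email("m3", "ir@example.com", "Earnings date change", "The call moves to Thursday."),
]


def search_emails(query: str, sender: Optional[str] = None,
                  has_attachment: Optional[bool] = None, limit: int = 20) -> list:
    """Phase 1: return lightweight metadata only, never message bodies."""
    results = []
    for email in MAILBOX:
        haystack = (email.subject + " " + email.body).lower()
        if query.lower() not in haystack:
            continue
        if sender is not None and email.sender != sender:
            continue
        if has_attachment is not None and email.has_attachment != has_attachment:
            continue
        results.append({
            "id": email.id,
            "sender": email.sender,
            "subject": email.subject,
            "has_attachment": email.has_attachment,
        })
        if len(results) >= limit:
            break
    return results


def read_email(email_id: str) -> str:
    """Phase 2: return the full body of one message chosen by the agent."""
    for email in MAILBOX:
        if email.id == email_id:
            return email.body
    raise KeyError(f"no email with id {email_id!r}")


if __name__ == "__main__":
    hits = search_emails("earnings", has_attachment=True)
    print(hits)
    print(read_email(hits[0]["id"]))
```

The search result carries just enough for the agent to choose, so bodies only enter context for the messages it actually opens.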
This is exactly how our conversation history tool works at Fintool. The agent passes date ranges or search terms, and the tool returns up to 100-200 results with only user messages and metadata. The agent then reads specific conversations by passing the conversation ID. Filter parameters like has_attachment, time_range, and sender let the agent narrow results before reading anything.

The same pattern applies everywhere:

- Document search: Return titles and snippets, not full documents
- Database queries: Return row counts and sample rows, not full result sets
- File listings: Return paths and metadata, not contents
- API integrations: Return summaries, let agent drill down

Each parameter you add to a tool is a chance to reduce returned tokens by an order of magnitude.

Garbage tokens are still tokens. Clean your data before it enters context. For emails, this means:

For HTML content, the gains are even larger. A typical webpage might be 100KB of HTML but only 5KB of actual content. CSS selectors that extract semantic regions (article, main, section) and discard navigation, ads, and tracking can reduce token counts by 90%+. Markdown uses significantly fewer tokens than HTML, making conversion valuable for any web content entering your pipeline.

For financial data specifically:

- Strip SEC filing boilerplate (every 10-K has the same legal disclaimers)
- Collapse repeated table headers across pages
- Remove watermarks and page numbers from extracted text
- Normalize whitespace (multiple spaces, tabs, excessive newlines)
- Convert HTML tables to markdown tables

The principle: remove noise at the earliest possible stage, not after tokenization. Every preprocessing step that runs before the LLM call saves money and improves quality.

Not every task needs your most expensive model. The Claude Code subagent pattern processes 67% fewer tokens overall due to context isolation.
Instead of stuffing every intermediate search result into a single global context, workers keep only what’s relevant inside their own window and return distilled outputs.

Tasks perfect for cheaper subagents:

- Data extraction: Pull specific fields from documents
- Classification: Categorize emails, documents, or intents
- Summarization: Compress long documents before main agent sees them
- Validation: Check outputs against criteria
- Formatting: Convert between data formats

The orchestrator sees condensed results, not raw context. This prevents hitting context limits and reduces the risk of the main agent getting confused by irrelevant details.

Scope subagent tasks tightly. The more iterations a subagent requires, the more context it accumulates and the more tokens it consumes. Design for single-turn completion when possible.

Every time an agent generates code from scratch, you’re paying for output tokens. Output tokens cost 5x input tokens with Claude. Stop regenerating the same patterns.

Our document generation workflow used to be painfully inefficient:

OLD APPROACH:
User: “Create a DCF model for Apple”
Agent: *generates 2,000 lines of Excel formulas from scratch*
Cost: ~$0.50 in output tokens alone

NEW APPROACH:
User: “Create a DCF model for Apple”
Agent: *loads DCF template, fills in Apple-specific values*
Cost: ~$0.05

The template approach:

- Skill references template: dcf_template.xlsx in /public/skills/dcf/
- Agent reads template once: Understands structure and placeholders
- Agent fills parameters: Company-specific values, assumptions
- WriteFile with minimal changes: Only modified cells, not full regeneration

For code generation, the same principle applies. If your agent frequently generates similar Python scripts, data processing pipelines, or analysis frameworks, create reusable functions:

    # Instead of regenerating this every time:
    def process_earnings_transcript(path):
        # 50 lines of parsing code...
        ...

    # Reference a skill with reusable utilities:
    from skills.earnings import parse_transcript, extract_guidance

The agent imports and calls rather than regenerates. Fewer output tokens, faster responses, more consistent results.

LLMs don’t process context uniformly. Research shows a consistent U-shaped attention pattern: models attend strongly to the beginning and end of prompts while “losing” information in the middle.

Strategic placement matters:

- System instructions: Beginning (highest attention)
- Current user request: End (recency bias)
- Critical context: Beginning or end, never middle
- Lower-priority background: Middle (acceptable loss)

For retrieval-augmented generation, this means reordering retrieved documents. The most relevant chunks should go at the beginning and end. Lower-ranked chunks fill the middle.

Manus uses an elegant hack: they maintain a todo.md file that gets updated throughout task execution. This “recites” current objectives at the end of context, combating the lost-in-the-middle effect across their typical 50-tool-call trajectories. We use a similar architecture at Fintool.

As agents run, context grows until it hits the window limit. You used to have two options: build your own summarization pipeline, or implement observation masking (replacing old tool outputs with placeholders). Both require significant engineering.

Now you can let the API handle it. Anthropic’s server-side compaction automatically summarizes your conversation when it approaches a configurable token threshold. Claude Code uses this internally, and it’s the reason you can run 50+ tool call sessions without the agent losing track of what it’s doing.

The key design decisions:

- Trigger threshold: Default is 150K tokens. Set it lower if you want to stay under the 200K pricing cliff, or higher if you need more raw context before summarizing.
- Custom instructions: You can replace the default summarization prompt entirely.
For financial workflows, something like “Preserve all numerical data, company names, and analytical conclusions” prevents the summary from losing critical details.
- Pause after compaction: The API can pause after generating the summary, letting you inject additional context (like preserving the last few messages verbatim) before continuing. This gives you control over what survives the compression.

Compaction also stacks well with prompt caching. Add a cache breakpoint on your system prompt so it stays cached separately. When compaction occurs, only the summary needs to be written as a new cache entry. Your system prompt cache stays warm.

The beauty of this approach: context depreciates in value over time, and the API handles the depreciation schedule for you.

Output tokens are the most expensive tokens. With Claude Sonnet, outputs cost 5x inputs. With Opus, they cost 5x inputs that are already expensive. Yet most developers set max_tokens to one large catch-all value and hope for the best.

    # BAD: One large limit for every task
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=8192,  # Model might use all of this
        messages=[...]
    )

    # GOOD: Task-appropriate limits
    TASK_LIMITS = {
        "classification": 50,
        "extraction": 200,
        "short_answer": 500,
        "analysis": 2000,
        "code_generation": 4000,
    }

Structured outputs reduce verbosity. JSON responses use fewer tokens than natural language explanations of the same information.

Natural language: “The company’s revenue was 94.5 billion dollars, which represents a year-over-year increase of 12.3 percent compared to the previous fiscal year’s revenue of 84.2 billion dollars.”

Structured: {"revenue": 94.5, "unit": "B", "yoy_change": 12.3}

For agents specifically, consider response chunking.
Instead of generating a 10,000-token analysis in one shot, break it into phases:

- Outline phase: Generate structure (500 tokens)
- Section phases: Generate each section on demand (1000 tokens each)
- Review phase: Check and refine (500 tokens)

This gives you control points to stop early if the user has what they need, rather than always generating the maximum possible output.

With Claude Opus 4.6 and Sonnet 4.5, crossing 200K input tokens triggers premium pricing. Your input cost doubles and your output cost rises by half: Opus goes from $5 to $10 per million input tokens, and output jumps from $25 to $37.50. This isn’t gradual. It’s a cliff.

This is the LLM equivalent of a tax bracket. And just like tax planning, the right strategy is to stay under the threshold when you can.

For agent workflows that risk crossing 200K, implement a context budget. Track cumulative input tokens across tool calls. When you approach the cliff, trigger aggressive compression: observation masking, summarization of older turns, or pruning low-value context. The cost of a compression step is far less than paying the premium rate for the rest of the conversation.

Every sequential tool call is a round trip. Each round trip re-sends the full conversation context. If your agent makes 20 tool calls sequentially, the context gets transmitted and billed 20 times.

The Anthropic API supports parallel tool calls: the model can request multiple independent tool calls in a single response, and you execute them simultaneously. This means fewer round trips for the same amount of work.

The savings compound. With fewer round trips, you accumulate less intermediate context, which means each subsequent round trip is also cheaper. Design your tools so that independent operations can be identified and batched by the model.

The cheapest token is the one you never send to the API. Before any LLM call, check if you’ve already answered this question.
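A minimal sketch of that check, assuming an in-process dictionary is enough. The `ResponseCache` class, its `get_or_compute` method, and the `ttl_seconds` parameter are hypothetical names, and a production system would more likely use a shared store. The `compute` callback stands in for whatever function makes the real LLM call.

```python
import hashlib
import json
import time


class ResponseCache:
    """In-memory response cache keyed on a normalized query plus parameters."""

    def __init__(self, clock=time.monotonic):
        self._entries = {}   # key -> (expires_at, value)
        self._clock = clock  # injectable so expiry can be tested

    @staticmethod
    def _key(query, params):
        # sort_keys makes the key independent of dict insertion order.
        payload = json.dumps(
            {"query": query.strip().lower(), "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, query, params, ttl_seconds, compute):
        """Return a cached answer if still fresh, else call compute() once."""
        key = self._key(query, params)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no LLM call, no tokens sent
        value = compute()
        self._entries[key] = (now + ttl_seconds, value)
        return value


if __name__ == "__main__":
    cache = ResponseCache()

    def expensive_llm_call():
        # Stand-in for a real API request.
        print("calling the model...")
        return "Summary of the latest earnings call."

    for _ in range(3):
        # Only the first iteration reaches expensive_llm_call().
        cache.get_or_compute(
            "Apple latest earnings summary", {"ticker": "AAPL"},
            ttl_seconds=3600, compute=expensive_llm_call,
        )
```

Normalizing the query before hashing is a deliberate simplification: it only catches trivially different phrasings, not paraphrases.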
At Fintool, we cache aggressively for earnings call summarizations and common queries. When a user asks for Apple’s latest earnings summary, we don’t regenerate it from scratch for every request. The first request pays the full cost. Every subsequent request is essentially free.

This operates above the LLM layer entirely. It’s not prompt caching or KV cache. It’s your application deciding that this query has a valid cached response and short-circuiting the API call.

Good candidates for application-level caching:

- Factual lookups: Company financials, earnings summaries, SEC filings
- Common queries: Questions that many users ask about the same data
- Deterministic transformations: Data formatting, unit conversions
- Stable analysis: Any output that won’t change until the underlying data changes

The cache invalidation strategy matters. For financial data, earnings call summaries are stable once generated. Real-time price data obviously isn’t. Match your cache TTL to the volatility of the underlying data.

Even partial caching helps. If an agent task involves five tool calls and you can cache two of them, you’ve cut roughly 40% of your tool-related token costs (assuming the calls cost about the same) without touching the LLM.

The Meta Lesson

Context engineering isn’t glamorous. It’s not the exciting part of building agents. But it’s the difference between a demo that impresses and a product that scales with decent gross margin.

The best teams building sustainable agent products are obsessing over token efficiency the same way database engineers obsess over query optimization. Because at scale, every wasted token is money on fire.

The context tax is real. But with the right architecture, it’s largely avoidable.
This is the triple penalty of context bloat: higher costs, slower responses, and degraded performance through context rot, where the agent gets lost in its own accumulated noise. Context engineering is very important. The difference between a $0.50 query and a $5.00 query is often just how thoughtfully you manage context. Here’s what I’ll cover: Stable Prefixes for KV Cache Hits - The single most important optimization for production agents Append-Only Context - Why mutating context destroys your cache hit rate Store Tool Outputs in the Filesystem - Cursor’s approach to avoiding context bloat Design Precise Tools - How smart tool design reduces token consumption by 10x Clean Your Data First (Maximize Your Deductions) - Strip the garbage before it enters context Delegate to Cheaper Subagents (Offshore to Tax Havens) - Route token-heavy operations to smaller models Reusable Templates Over Regeneration (Standard Deductions) - Stop regenerating the same code The Lost-in-the-Middle Problem - Strategic placement of critical information Server-Side Compaction (Depreciation) - Let the API handle context decay automatically Output Token Budgeting (Withholding Tax) - The most expensive tokens are the ones you generate The 200K Pricing Cliff (The Tax Bracket) - The tax bracket that doubles your bill overnight Parallel Tool Calls (Filing Jointly) - Fewer round trips, less context accumulation Application-Level Response Caching (Tax-Exempt Status) - The cheapest token is the one you never send That’s a 10x difference between cached and uncached inputs. Output tokens cost 5x more than uncached inputs. Most agent builders focus on prompt engineering while hemorrhaging money on context inefficiency. In most agent workflows, context grows substantially with each step while outputs remain compact. This makes input token optimization critical: a typical agent task might involve 50 tool calls, each accumulating context. The performance penalty is equally severe. 
Research shows that past 32K tokens, most models show sharp performance degradation. Your agent isn’t just getting expensive. It’s getting confused. Stable Prefixes for KV Cache Hits This is the single most important metric for production agents: KV cache hit rate. The Manus team considers this the most important optimization for their agent infrastructure, and I agree completely. The principle is simple: LLMs process prompts autoregressively, token by token. If your prompt starts identically to a previous request, the model can reuse cached key-value computations for that prefix. The killer of cache hit rates? Timestamps. A common mistake is including a timestamp at the beginning of the system prompt. It’s a simple mistake but the impact is massive. The key is granularity: including the date is fine. Including the hour is acceptable since cache durations are typically 5 minutes (Anthropic default) to 10 minutes (OpenAI default), with longer options available. But never include seconds or milliseconds. A timestamp precise to the second guarantees every single request has a unique prefix. Zero cache hits. Maximum cost. Move all dynamic content (including timestamps) to the END of your prompt. System instructions, tool definitions, few-shot examples, all of these should come first and remain identical across requests. For distributed systems, ensure consistent request routing. Use session IDs to route requests to the same worker, maximizing the chance of hitting warm caches. Append-Only Context Context should be append-only. Any modification to earlier content invalidates the KV cache from that point forward. This seems obvious but the violations are subtle: The tool definition problem is particularly insidious. If you dynamically add or remove tools based on context, you invalidate the cache for everything after the tool definitions. 
Manus solved this elegantly: instead of removing tools, they mask token logits during decoding to constrain which actions the model can select. The tool definitions stay constant (cache preserved), but the model is guided toward valid choices through output constraints. For simpler implementations, keep your tool definitions static and handle invalid tool calls gracefully in your orchestration layer. Deterministic serialization matters too. Python dicts don’t guarantee order. If you’re serializing tool definitions or context as JSON, use sort_keys=True or a library that guarantees deterministic output. A different key order = different tokens = cache miss. Store Tool Outputs in the Filesystem Cursor’s approach to context management changed how I think about agent architecture. Instead of stuffing tool outputs into the conversation, write them to files. In their A/B testing, this reduced total agent tokens by 46.9% for runs using MCP tools. The insight: agents don’t need complete information upfront. They need the ability to access information on demand. Files are the perfect abstraction for this. We apply this pattern everywhere: Shell command outputs : Write to files, let agent tail or grep as needed Search results : Return file paths, not full document contents API responses : Store raw responses, let agent extract what matters Intermediate computations : Persist to disk, reference by path The two-phase pattern: search returns metadata, separate tool returns full content. The agent decides which items deserve full retrieval. This is exactly how our conversation history tool works at Fintool. It passes date ranges or search terms and returns up to 100-200 results with only user messages and metadata. The agent then reads specific conversations by passing the conversation ID. Filter parameters like has_attachment, time_range, and sender let the agent narrow results before reading anything. 
The same pattern applies everywhere:

Document search: Return titles and snippets, not full documents
Database queries: Return row counts and sample rows, not full result sets
File listings: Return paths and metadata, not contents
API integrations: Return summaries, let agent drill down

For HTML content, the gains are even larger. A typical webpage might be 100KB of HTML but only 5KB of actual content. CSS selectors that extract semantic regions (article, main, section) and discard navigation, ads, and tracking can reduce token counts by 90%+. Markdown uses significantly fewer tokens than HTML, making conversion valuable for any web content entering your pipeline.

For financial data specifically:

Strip SEC filing boilerplate (every 10-K has the same legal disclaimers)
Collapse repeated table headers across pages
Remove watermarks and page numbers from extracted text
Normalize whitespace (multiple spaces, tabs, excessive newlines)
Convert HTML tables to markdown tables

The Claude Code subagent pattern processes 67% fewer tokens overall due to context isolation. Instead of stuffing every intermediate search result into a single global context, workers keep only what’s relevant inside their own window and return distilled outputs.

Tasks perfect for cheaper subagents:

Data extraction: Pull specific fields from documents
Classification: Categorize emails, documents, or intents
Summarization: Compress long documents before main agent sees them
Validation: Check outputs against criteria
Formatting: Convert between data formats

Scope subagent tasks tightly. The more iterations a subagent requires, the more context it accumulates and the more tokens it consumes. Design for single-turn completion when possible.

Reusable Templates Over Regeneration (Standard Deductions)

Every time an agent generates code from scratch, you’re paying for output tokens. Output tokens cost 5x input tokens with Claude. Stop regenerating the same patterns.
Our document generation workflow used to be painfully inefficient:

OLD APPROACH:
User: “Create a DCF model for Apple”
Agent: *generates 2,000 lines of Excel formulas from scratch*
Cost: ~$0.50 in output tokens alone

NEW APPROACH:
User: “Create a DCF model for Apple”
Agent: *loads DCF template, fills in Apple-specific values*
Cost: ~$0.05

The template approach:

Skill references template: dcf_template.xlsx in /public/skills/dcf/
Agent reads template once: Understands structure and placeholders
Agent fills parameters: Company-specific values, assumptions
WriteFile with minimal changes: Only modified cells, not full regeneration

Strategic placement matters:

System instructions: Beginning (highest attention)
Current user request: End (recency bias)
Critical context: Beginning or end, never middle
Lower-priority background: Middle (acceptable loss)

The key design decisions:

Trigger threshold: Default is 150K tokens. Set it lower if you want to stay under the 200K pricing cliff, or higher if you need more raw context before summarizing.
Custom instructions: You can replace the default summarization prompt entirely. For financial workflows, something like “Preserve all numerical data, company names, and analytical conclusions” prevents the summary from losing critical details.
Pause after compaction: The API can pause after generating the summary, letting you inject additional context (like preserving the last few messages verbatim) before continuing. This gives you control over what survives the compression.

Outline phase: Generate structure (500 tokens)
Section phases: Generate each section on demand (1000 tokens each)
Review phase: Check and refine (500 tokens)

This is the LLM equivalent of a tax bracket. And just like tax planning, the right strategy is to stay under the threshold when you can.

For agent workflows that risk crossing 200K, implement a context budget. Track cumulative input tokens across tool calls.
When you approach the cliff, trigger aggressive compression: observation masking, summarization of older turns, or pruning low-value context. The cost of a compression step is far less than doubling your per-token rate for the rest of the conversation.

Parallel Tool Calls (Filing Jointly)

Every sequential tool call is a round trip. Each round trip re-sends the full conversation context. If your agent makes 20 tool calls sequentially, the context gets transmitted and billed 20 times.

The Anthropic API supports parallel tool calls: the model can request multiple independent tool calls in a single response, and you execute them simultaneously. This means fewer round trips for the same amount of work.

The savings compound. With fewer round trips, you accumulate less intermediate context, which means each subsequent round trip is also cheaper. Design your tools so that independent operations can be identified and batched by the model.

Application-Level Response Caching (Tax-Exempt Status)

The cheapest token is the one you never send to the API. Before any LLM call, check if you’ve already answered this question.

At Fintool, we cache aggressively for earnings call summarizations and common queries. When a user asks for Apple’s latest earnings summary, we don’t regenerate it from scratch for every request. The first request pays the full cost. Every subsequent request is essentially free.

This operates above the LLM layer entirely. It’s not prompt caching or KV cache. It’s your application deciding that this query has a valid cached response and short-circuiting the API call.

Good candidates for application-level caching:

Factual lookups: Company financials, earnings summaries, SEC filings
Common queries: Questions that many users ask about the same data
Deterministic transformations: Data formatting, unit conversions
Stable analysis: Any output that won’t change until the underlying data changes
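As a rough illustration of application-level caching, here is a minimal in-memory sketch. The `ResponseCache` class, its `data_version` parameter, and the `fake_llm_call` stand-in are assumptions made for this example, not Fintool's implementation. A real deployment would more likely use a shared store and an invalidation signal tied to the underlying data.

```python
import hashlib
import json
import time
from typing import Callable


class ResponseCache:
    """In-memory cache that sits in front of the LLM call."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(task: str, params: dict, data_version: str) -> str:
        # data_version changes whenever the underlying data changes,
        # so stale answers are never served for new filings.
        payload = json.dumps(
            {"task": task, "params": params, "data_version": data_version},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, task: str, params: dict, data_version: str,
                       compute: Callable[[], str]) -> str:
        k = self.key(task, params, data_version)
        now = time.monotonic()
        hit = self._entries.get(k)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # short-circuit: no API call at all
        value = compute()
        self._entries[k] = (now, value)
        return value


calls = {"count": 0}


def fake_llm_call() -> str:
    """Stand-in for an expensive model request."""
    calls["count"] += 1
    return "Summary of the latest earnings call"


cache = ResponseCache()
a = cache.get_or_compute("earnings_summary", {"ticker": "AAPL"}, "q3", fake_llm_call)
b = cache.get_or_compute("earnings_summary", {"ticker": "AAPL"}, "q3", fake_llm_call)
```

Keying on a data version rather than relying on the TTL alone matches the "stable analysis" criterion above: the entry stays valid exactly as long as the inputs do.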

Simon Willison 2 months ago

Introducing Showboat and Rodney, so agents can demo what they’ve built

A key challenge working with coding agents is having them both test what they’ve built and demonstrate that software to you, their overseer. This goes beyond automated tests - we need artifacts that show their progress and help us see exactly what the agent-produced software is able to do. I’ve just released two new tools aimed at this problem: Showboat and Rodney . I recently wrote about how the job of a software engineer isn't to write code, it's to deliver code that works . A big part of that is proving to ourselves and to other people that the code we are responsible for behaves as expected. This becomes even more important - and challenging - as we embrace coding agents as a core part of our software development process. The more code we churn out with agents, the more valuable tools are that reduce the amount of manual QA time we need to spend. One of the most interesting things about the StrongDM software factory model is how they ensure that their software is well tested and delivers value despite their policy that "code must not be reviewed by humans". Part of their solution involves expensive swarms of QA agents running through "scenarios" to exercise their software. It's fascinating, but I don't want to spend thousands of dollars on QA robots if I can avoid it! I need tools that allow agents to clearly demonstrate their work to me, while minimizing the opportunities for them to cheat about what they've done. Showboat is the tool I built to help agents demonstrate their work to me. It's a CLI tool (a Go binary, optionally wrapped in Python to make it easier to install) that helps an agent construct a Markdown document demonstrating exactly what their newly developed code can do. It's not designed for humans to run, but here's how you would run it anyway: Here's what the result looks like if you open it up in VS Code and preview the Markdown: Here's that demo.md file in a Gist . 
So a sequence of Showboat commands constructs a Markdown document one section at a time, with the output of those commands automatically added to the document directly following the commands that were run. The image command is a little special - it looks for a file path to an image in the output of the command and copies that image to the current folder and references it in the file. That's basically the whole thing! There's a command to remove the most recently added section if something goes wrong, a command to re-run the document and check nothing has changed (I'm not entirely convinced by the design of that one) and a command that reverse-engineers the CLI commands that were used to create the document. It's pretty simple - just 172 lines of Go. I packaged it up with my go-to-wheel tool which means you can run it without even installing it first like this: The help command is really important: it's designed to provide a coding agent with everything it needs to know in order to use the tool. Here's that help text in full . This means you can pop open Claude Code and tell it: And that's it! The help text acts a bit like a Skill . Your agent can read the help text and use every feature of Showboat to create a document that demonstrates whatever it is you need demonstrated. Here's a fun trick: if you set Claude off to build a Showboat document you can pop that open in VS Code and watch the preview pane update in real time as the agent runs through the demo. It's a bit like having your coworker talk you through their latest work in a screensharing session. And finally, some examples. Here are documents I had Claude create using Showboat to help demonstrate features I was working on in other projects: row-state-sql CLI Demo shows a new command I added to that same project. Change grouping with Notes demonstrates another feature where groups of changes within the same transaction can have a note attached to them.
I've now used Showboat often enough that I've convinced myself of its utility. (I've also seen agents cheat! Since the demo file is Markdown the agent will sometimes edit that file directly rather than using Showboat, which could result in command outputs that don't reflect what actually happened. Here's an issue about that .) Many of the projects I work on involve web interfaces. Agents often build entirely new pages for these, and I want to see those represented in the demos. Showboat's image feature was designed to allow agents to capture screenshots as part of their demos, originally using my shot-scraper tool or Playwright . The Showboat format benefits from CLI utilities. I went looking for good options for managing a multi-turn browser session from a CLI and came up short, so I decided to try building something new. Claude Opus 4.6 pointed me to the Rod Go library for interacting with the Chrome DevTools protocol. It's fantastic - it provides a comprehensive wrapper across basically everything you can do with automated Chrome, all in a self-contained library that compiles to a few MBs. All Rod was missing was a CLI. I built the first version as an asynchronous report prototype , which convinced me it was worth spinning out into its own project. I called it Rodney as a nod to the Rod library it builds on and a reference to Only Fools and Horses - and because the package name was available on PyPI. You can run Rodney using or install it like this: (Or grab a Go binary from the releases page .) Here's a simple example session: Here's what that looks like in the terminal: As with Showboat, this tool is not designed to be used by humans! The goal is for coding agents to be able to run the help command and see everything they need to know to start using the tool. You can see that help output in the GitHub repo.
Here are three demonstrations of Rodney that I created using Showboat: After being a career-long skeptic of the test-first, maximum test coverage school of software development (I like tests included development instead) I've recently come around to test-first processes as a way to force agents to write only the code that's necessary to solve the problem at hand. Many of my Python coding agent sessions start the same way: Telling the agents how to run the tests doubles as an indicator that tests on this project exist and matter. Agents will read existing tests before writing their own so having a clean test suite with good patterns makes it more likely they'll write good tests of their own. The frontier models all understand that "red/green TDD" means they should write the test first, run it and watch it fail and then write the code to make it pass - it's a convenient shortcut. I find this greatly increases the quality of the code and the likelihood that the agent will produce the right thing with the smallest amount of prompts to guide it. But anyone who's worked with tests will know that just because the automated tests pass doesn't mean the software actually works! That’s the motivation behind Showboat and Rodney - I never trust any feature until I’ve seen it running with my own eye. Before building Showboat I'd often add a “manual” testing step to my agent sessions, something like: Both Showboat and Rodney started life as Claude Code for web projects created via the Claude iPhone app. Most of the ongoing feature work for them happened in the same way. I'm still a little startled at how much of my coding work I get done on my phone now, but I'd estimate that the majority of code I ship to GitHub these days was written for me by coding agents driven via that iPhone app. I initially designed these two tools for use in asynchronous coding agent environments like Claude Code for the web. So far that's working out really well. 
Proving code actually works
Showboat: Agents build documents to demo their work
Rodney: CLI browser automation designed to work with Showboat
Test-driven development helps, but we still need manual testing
I built both of these tools on my phone

shot-scraper: A Comprehensive Demo runs through the full suite of features of my shot-scraper browser automation tool, mainly to exercise the command.
sqlite-history-json CLI demo demonstrates the CLI feature I added to my new sqlite-history-json Python library.
row-state-sql CLI Demo shows a new command I added to that same project.
Change grouping with Notes demonstrates another feature where groups of changes within the same transaction can have a note attached to them.
krunsh: Pipe Shell Commands to an Ephemeral libkrun MicroVM is a particularly convoluted example where I managed to get Claude Code for web to run a libkrun microVM inside a QEMU emulated Linux environment inside the Claude gVisor sandbox.

Rodney's original feature set , including screenshots of pages and executing JavaScript.
Rodney's new accessibility testing features , built during development of those features to show what they could do.
Using those features to run a basic accessibility audit of a page . I was impressed at how well Claude Opus 4.6 responded to the prompt "Use showboat and rodney to perform an accessibility audit of https://latest.datasette.io/fixtures " - transcript here .

Armin Ronacher 2 months ago

A Language For Agents

Last year I first started thinking about what the future of programming languages might look like now that agentic engineering is a growing thing. Initially I felt that the enormous corpus of pre-existing code would cement existing languages in place but now I’m starting to think the opposite is true. Here I want to outline my thinking on why we are going to see more new programming languages and why there is quite a bit of space for interesting innovation. And just in case someone wants to start building one, here are some of my thoughts on what we should aim for! Does an agent perform dramatically better on a language that it has in its weights? Obviously yes. But there are less obvious factors that affect how good an agent is at programming in a language: how good the tooling around it is and how much churn there is. Zig seems underrepresented in the weights (at least in the models I’ve used) and also changing quickly. That combination is not optimal, but it’s still passable: you can program even in the upcoming Zig version if you point the agent at the right documentation. But it’s not great. On the other hand, some languages are well represented in the weights but agents still don’t succeed as much because of tooling choices. Swift is a good example: in my experience the tooling around building a Mac or iOS application can be so painful that agents struggle to navigate it. Also not great. So, just because it exists doesn’t mean the agent succeeds and just because it’s new also doesn’t mean that the agent is going to struggle. I’m convinced that you can build yourself up to a new language if you don’t want to depart everywhere all at once. The biggest reason new languages might work is that the cost of coding is going down dramatically. The result is the breadth of an ecosystem matters less. I’m now routinely reaching for JavaScript in places where I would have used Python. 
Not because I love it or the ecosystem is better, but because the agent does much better with TypeScript. The way to think about this: if important functionality is missing in my language of choice, I just point the agent at a library from a different language and have it build a port. As a concrete example, I recently built an Ethernet driver in JavaScript to implement the host controller for our sandbox. Implementations exist in Rust, C, and Go, but I wanted something pluggable and customizable in JavaScript. It was easier to have the agent reimplement it than to make the build system and distribution work against a native binding. New languages will work if their value proposition is strong enough and they evolve with knowledge of how LLMs train. People will adopt them despite being underrepresented in the weights. And if they are designed to work well with agents, then they might be designed around familiar syntax that is already known to work well. So why would we want a new language at all? The reason this is interesting to think about is that many of today’s languages were designed with the assumption that punching keys is laborious, so we traded certain things for brevity. As an example, many languages, particularly modern ones, lean heavily on type inference so that you don’t have to write out types. The downside is that you now need an LSP or the resulting compiler error messages to figure out what the type of an expression is. Agents struggle with this too, and it’s also frustrating in pull request review where complex operations can make it very hard to figure out what the types actually are. Fully dynamic languages are even worse in that regard. The cost of writing code is going down, but because we are also producing more of it, understanding what the code does is becoming more important. We might actually want more code to be written if it means there is less ambiguity when we perform a review.
I also want to point out that we are heading towards a world where some code is never seen by a human and is only consumed by machines. Even in that case, we still want to give an indication to a user, who is potentially a non-programmer, about what is going on. We want to be able to explain to a user what the code will do without going into the details of how. So the case for a new language comes down to: given the fundamental changes in who is programming and what the cost of code is, we should at least consider one. It’s tricky to say what an agent wants because agents will lie to you and they are influenced by all the code they’ve seen. But one way to estimate how they are doing is to look at how many changes they have to perform on files and how many iterations they need for common tasks. There are some things I’ve found that I think will be true for a while. The language server protocol lets an IDE infer information about what’s under the cursor or what should be autocompleted based on semantic knowledge of the codebase. It’s a great system, but it comes at one specific cost that is tricky for agents: the LSP has to be running. There are situations when an agent just won’t run the LSP — not because of technical limitations, but because it’s also lazy and will skip that step if it doesn’t have to. If you give it an example from documentation, there is no easy way to run the LSP because it’s a snippet that might not even be complete. If you point it at a GitHub repository and it pulls down individual files, it will just look at the code. It won’t set up an LSP for type information. A language that doesn’t split into two separate experiences (with-LSP and without-LSP) will be beneficial to agents because it gives them one unified way of working across many more situations. It pains me as a Python developer to say this, but whitespace-based indentation is a problem. 
The underlying token efficiency of getting whitespace right is tricky, and a language with significant whitespace is harder for an LLM to work with. This is particularly noticeable if you try to make an LLM do surgical changes without an assisted tool. Quite often they will intentionally disregard whitespace, add markers to enable or disable code and then rely on a code formatter to clean up indentation later. On the other hand, braces that are not separated by whitespace can cause issues too. Depending on the tokenizer, runs of closing parentheses can end up split into tokens in surprising ways (a bit like the “strawberry” counting problem), and it’s easy for an LLM to get Lisp or Scheme wrong because it loses track of how many closing parentheses it has already emitted or is looking at. Fixable with future LLMs? Sure, but also something that was hard for humans to get right too without tooling. Readers of this blog might know that I’m a huge believer in async locals and flow execution context — basically the ability to carry data through every invocation that might only be needed many layers down the call chain. Working at an observability company has really driven home the importance of this for me. The challenge is that anything that flows implicitly might not be configured. Take for instance the current time. You might want to implicitly pass a timer to all functions. But what if a timer is not configured and all of a sudden a new dependency appears? Passing all of it explicitly is tedious for both humans and agents and bad shortcuts will be made. One thing I’ve experimented with is having effect markers on functions that are added through a code formatting step. A function can declare that it needs the current time or the database, but if it doesn’t mark this explicitly, it’s essentially a linting warning that auto-formatting fixes. 
The LLM can start using something like the current time in a function and any existing caller gets the warning; formatting propagates the annotation. This is nice because when the LLM builds a test, it can precisely mock out these side effects — it understands from the error messages what it has to supply. For instance: Agents struggle with exceptions, they are afraid of them. I’m not sure to what degree this is solvable with RL (Reinforcement Learning), but right now agents will try to catch everything they can, log it, and do a pretty poor recovery. Given how little information is actually available about error paths, that makes sense. Checked exceptions are one approach, but they propagate all the way up the call chain and don’t dramatically improve things. Even if they end up as hints where a linter tracks which errors can fly by, there are still many call sites that need adjusting. And like the auto-propagation proposed for context data, it might not be the right solution. Maybe the right approach is to go more in on typed results, but that’s still tricky for composability without a type and object system that supports it. The general approach agents use today to read files into memory is line-based, which means they often pick chunks that span multi-line strings. One easy way to see this fall apart: have an agent work on a 2000-line file that also contains long embedded code strings — basically a code generator. The agent will sometimes edit within a multi-line string assuming it’s the real code when it’s actually just embedded code in a multi-line string. For multi-line strings, the only language I’m aware of with a good solution is Zig, but its prefix-based syntax is pretty foreign to most people. Reformatting also often causes constructs to move to different lines. In many languages, trailing commas in lists are either not supported (JSON) or not customary. 
If you want diff stability, you’d aim for a syntax that requires less reformatting and mostly avoids multi-line constructs. What’s really nice about Go is that you mostly cannot import symbols from another package into scope without every use being prefixed with the package name. Eg: instead of . There are escape hatches (import aliases and dot-imports), but they’re relatively rare and usually frowned upon. That dramatically helps an agent understand what it’s looking at. In general, making code findable through the most basic tools is great — it works with external files that aren’t indexed, and it means fewer false positives for large-scale automation driven by code generated on the fly (eg: , invocations). Much of what I’ve said boils down to: agents really like local reasoning. They want it to work in parts because they often work with just a few loaded files in context and don’t have much spatial awareness of the codebase. They rely on external tooling like grep to find things, and anything that’s hard to grep or that hides information elsewhere is tricky. What makes agents fail or succeed in many languages is just how good the build tools are. Many languages make it very hard to determine what actually needs to rebuild or be retested because there are too many cross-references. Go is really good here: it forbids circular dependencies between packages (import cycles), packages have a clear layout, and test results are cached. Agents often struggle with macros. It was already pretty clear that humans struggle with macros too, but the argument for them was mostly that code generation was a good way to have less code to write. Since that is less of a concern now, we should aim for languages with less dependence on macros. There’s a separate question about generics and comptime . I think they fare somewhat better because they mostly generate the same structure with different placeholders and it’s much easier for an agent to understand that. 
Related to greppability: agents often struggle to understand barrel files and they don’t like them. Not being able to quickly figure out where a class or function comes from leads to imports from the wrong place, or missing things entirely and wasting context by reading too many files. A one-to-one mapping from where something is declared to where it’s imported from is great. And it does not have to be overly strict either. Go kind of goes this way, but not too extreme. Any file within a directory can define a function, which isn’t optimal, but it’s quick enough to find and you don’t need to search too far. It works because packages are forced to be small enough to find everything with grep. The worst case is free re-exports all over the place that completely decouple the implementation from any trivially reconstructable location on disk. Or worse: aliasing. Agents often hate it when aliases are involved. In fact, you can get them to even complain about it in thinking blocks if you let them refactor something that uses lots of aliases. Ideally a language encourages good naming and discourages aliasing at import time as a result. Nobody likes flaky tests, but agents even less so. Ironic given how particularly good agents are at creating flaky tests in the first place. That’s because agents currently love to mock and most languages do not support mocking well. So many tests end up accidentally not being concurrency safe or depend on development environment state that then diverges in CI or production. Most programming languages and frameworks make it much easier to write flaky tests than non-flaky ones. That’s because they encourage indeterminism everywhere. In an ideal world the agent has one command, that lints and compiles and it tells the agent if all worked out fine. Maybe another command to run all tests that need running. In practice most environments don’t work like this. For instance in TypeScript you can often run the code even though it fails type checks . 
That can gaslight the agent. Likewise different bundler setups can cause one thing to succeed just for a slightly different setup in CI to fail later. The more uniform the tooling the better. Ideally it either runs or doesn’t and there is mechanical fixing for as many linting failures as possible so that the agent does not have to do it by hand. I think we will. We are writing more software now than we ever have — more websites, more open source projects, more of everything. Even if the ratio of new languages stays the same, the absolute number will go up. But I also truly believe that many more people will be willing to rethink the foundations of software engineering and the languages we work with. That’s because while for some years it has felt you need to build a lot of infrastructure for a language to take off, now you can target a rather narrow use case: make sure the agent is happy and extend from there to the human. I just hope we see two things. First, some outsider art: people who haven’t built languages before trying their hand at it and showing us new things. Second, a much more deliberate effort to document what works and what doesn’t from first principles. We have actually learned a lot about what makes good languages and how to scale software engineering to large teams. Yet, finding it written down, as a consumable overview of good and bad language design, is very hard to come by. Too much of it has been shaped by opinion on rather pointless things instead of hard facts. Now though, we are slowly getting to the point where facts matter more, because you can actually measure what works by seeing how well agents perform with it. No human wants to be subject to surveys, but agents don’t care . We can see how successful they are and where they are struggling.
