Latest Posts (20 found)
Simon Willison 6 days ago

Claude can write complete Datasette plugins now

This isn't necessarily surprising, but it's worth noting anyway. Claude Sonnet 4.5 is capable of building a full Datasette plugin now. I've seen models complete aspects of this in the past, but today is the first time I've shipped a new plugin where every line of code and test was written by Claude, with minimal prompting from myself. The plugin is called datasette-os-info . It's a simple debugging tool - all it does is add a JSON page which dumps out as much information as it can about the OS it's running on. Here's a live demo on my TIL website. I built it to help experiment with changing the Docker base container that Datasette uses to publish images to one that uses Python 3.14. Here's the full set of commands I used to create the plugin. I started with my datasette-plugin cookiecutter template: This created a directory with the initial and and files. Here's an example of that starter template . I created a virtual environment for it, installed the initial test dependencies and ran to check that worked: Then I fired up Claude Code in that directory in YOLO mode: (I actually used my shortcut which runs the above.) Then, in Claude, I told it how to run the tests: When that worked, I told it to build the plugin: ... and that was it! Claude implemented the plugin using Datasette's register_routes() plugin hook to add the page,and then without me prompting it to do so built this basic test as well . It ran the new test, spotted a bug (it had guessed a non-existent parameter), fixed the bug and declared itself done. I built myself a wheel: Then uploaded that to an S3 bucket and deployed it to test it out using . It did exactly what I had hoped - here's what that page looked like: I dread to think how long it would have taken me to figure out all of those different useful OS details! To finish up, I had Claude fill out the README via a couple of extra prompts: Then had a brief conversation about how safe the output was: Claude said: Yes, there are several potentially sensitive items in the output: Most sensitive: Moderately sensitive: Considerations: Suggestions: You might want to: Would you like me to add any of these protections to the plugin? In this case I'm fine with the current behavior provided people understand the risks, so I told it: And then (because the warning it added was quite long): And requested an extra README note: Three last prompts: ... and that was the project finished. I pushed it to GitHub , configured Trusted Publishing for it on PyPI and posted the 0.1 release , which ran this GitHub Actions publish.yml and deployed that release to datasette-os-info on PyPI . Now that it's live you can try it out without even installing Datasette using a one-liner like this: That's using the CLI option to show what that path in the Datasette instance would return, as described in the Datasette documentation . I've shared my full Claude Code transcript in a Gist. A year ago I'd have been very impressed by this. Today I wasn't even particularly surprised that this worked - the coding agent pattern implemented by Claude Code is spectacularly effective when you combine it with pre-existing templates, and Datasette has been aroung for long enough now that plenty of examples of plugins have made it into the training data for the leading models. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . hostname - Could reveal internal naming schemes or organization details Environment variables: HOME - Reveals user's home directory path USER - Reveals the username PATH - Reveals filesystem structure and software installation locations VIRTUAL_ENV - Could reveal project paths and directory structure python_executable - Reveals filesystem paths and user directories CPU count and memory info - Could reveal infrastructure capacity Kernel version details - Could potentially be used to identify specific vulnerabilities This endpoint is accessible without authentication by default (at /-/os) If you're running Datasette on a public server, this information would be visible to anyone The information could help attackers understand your infrastructure setup Add authentication requirements to this endpoint Make certain fields optional/configurable Add a warning in the README that this exposes system information Consider filtering out or redacting sensitive paths (replace usernames with ***, etc.)

1 views
Simon Willison 1 weeks ago

Vibe engineering

I feel like vibe coding is pretty well established now as covering the fast, loose and irresponsible way of building software with AI - entirely prompt-driven, and with no attention paid to how the code actually works. This leaves us with a terminology gap: what should we call the other end of the spectrum, where seasoned professionals accelerate their work with LLMs while staying proudly and confidently accountable for the software they produce? I propose we call this vibe engineering , with my tongue only partially in my cheek. One of the lesser spoken truths of working productively with LLMs as a software engineer on non-toy-projects is that it's difficult . There's a lot of depth to understanding how to use the tools, there are plenty of traps to avoid, and the pace at which they can churn out working code raises the bar for what the human participant can and should be contributing. The rise of coding agents - tools like Claude Code (released February 2025), OpenAI's Codex CLI (April) and Gemini CLI (June) that can iterate on code, actively testing and modifying it until it achieves a specified goal, has dramatically increased the usefulness of LLMs for real-world coding problems. I'm increasingly hearing from experienced, credible software engineers who are running multiple copies of agents at once, tackling several problems in parallel and expanding the scope of what they can take on. I was skeptical of this at first but I've started running multiple agents myself now and it's surprisingly effective, if mentally exhausting! This feels very different from classic vibe coding, where I outsource a simple, low-stakes task to an LLM and accept the result if it appears to work. Most of my tools.simonwillison.net collection ( previously ) were built like that. Iterating with coding agents to produce production-quality code that I'm confident I can maintain in the future feels like a different process entirely. It's also become clear to me that LLMs actively reward existing top tier software engineering practices: If you're going to really exploit the capabilities of these new tools, you need to be operating at the top of your game . You're not just responsible for writing the code - you're researching approaches, deciding on high-level architecture, writing specifications, defining success criteria, designing agentic loops , planning QA, managing a growing army of weird digital interns who will absolutely cheat if you give them a chance, and spending so much time on code review . Almost all of these are characteristics of senior software engineers already! AI tools amplify existing expertise . The more skills and experience you have as a software engineer the faster and better the results you can get from working with LLMs and coding agents. Is this a stupid name? Yeah, probably. "Vibes" as a concept in AI feels a little tired at this point. "Vibe coding" itself is used by a lot of developers in a dismissive way. I'm ready to reclaim vibes for something more constructive. I've never really liked the artificial distinction between "coders" and "engineers" - that's always smelled to me a bit like gatekeeping. But in this case a bit of gatekeeping is exactly what we need! Vibe engineering establishes a clear distinction from vibe coding. It signals that this is a different, harder and more sophisticated way of working with AI tools to build production software. I like that this is cheeky and likely to be controversial. This whole space is still absurd in all sorts of different ways. We shouldn't take ourselves too seriously while we figure out the most productive ways to apply these new tools. I've tried in the past to get terms like AI-assisted programming to stick, with approximately zero success. May as well try rubbing some vibes on it and see what happens. I also really like the clear mismatch between "vibes" and "engineering". It makes the combined term self-contradictory in a way that I find mischievous and (hopefully) sticky. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Automated testing . If your project has a robust, comprehensive and stable test suite agentic coding tools can fly with it. Without tests? Your agent might claim something works without having actually tested it at all, plus any new change could break an unrelated feature without you realizing it. Test-first development is particularly effective with agents that can iterate in a loop. Planning in advance . Sitting down to hack something together goes much better if you start with a high level plan. Working with an agent makes this even more important - you can iterate on the plan first, then hand it off to the agent to write the code. Comprehensive documentation . Just like human programmers, an LLM can only keep a subset of the codebase in its context at once. Being able to feed in relevant documentation lets it use APIs from other areas without reading the code first. Write good documentation first and the model may be able to build the matching implementation from that input alone. Good version control habits . Being able to undo mistakes and understand when and how something was changed is even more important when a coding agent might have made the changes. LLMs are also fiercely competent at Git - they can navigate the history themselves to track down the origin of bugs, and they're better than most developers at using git bisect . Use that to your advantage. Having effective automation in place. Continuous integration, automated formatting and linting, continuous deployment to a preview environment - all things that agentic coding tools can benefit from too. LLMs make writing quick automation scripts easier as well, which can help them then repeat tasks accurately and consistently next time. A culture of code review . This one explains itself. If you're fast and productive at code review you're going to have a much better time working with LLMs than if you'd rather write code yourself than review the same thing written by someone (or something) else. A very weird form of management . Getting good results out of a coding agent feels uncomfortably close to getting good results out of a human collaborator. You need to provide clear instructions, ensure they have the necessary context and provide actionable feedback on what they produce. It's a lot easier than working with actual people because you don't have to worry about offending or discouraging them - but any existing management experience you have will prove surprisingly useful. Really good manual QA (quality assurance) . Beyond automated tests, you need to be really good at manually testing software, including predicting and digging into edge-cases. Strong research skills . There are dozens of ways to solve any given coding problem. Figuring out the best options and proving an approach has always been important, and remains a blocker on unleashing an agent to write the actual code. The ability to ship to a preview environment . If an agent builds a feature, having a way to safely preview that feature (without deploying it straight to production) makes reviews much more productive and greatly reduces the risk of shipping something broken. An instinct for what can be outsourced to AI and what you need to manually handle yourself. This is constantly evolving as the models and tools become more effective. A big part of working effectively with LLMs is maintaining a strong intuition for when they can best be applied. An updated sense of estimation . Estimating how long a project will take has always been one of the hardest but most important parts of being a senior engineer, especially in organizations where budget and strategy decisions are made based on those estimates. AI-assisted coding makes this even harder - things that used to take a long time are much faster, but estimations now depend on new factors which we're all still trying to figure out.

0 views
Simon Willison 1 weeks ago

OpenAI DevDay 2025 live blog

I'm at OpenAI DevDay in Fort Mason, San Francisco today. As I did last year , I'm going to be live blogging the announcements from the kenote. Unlike last year, this year there's a livestream . Disclosure: OpenAI provides me with a free ticket and reserved me a seat in the press/influencer section for the keynote. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Simon Willison 1 weeks ago

Embracing the parallel coding agent lifestyle

For a while now I've been hearing from engineers who run multiple coding agents at once - firing up several Claude Code or Codex CLI instances at the same time, sometimes in the same repo, sometimes against multiple checkouts or git worktrees . I was pretty skeptical about this at first. AI-generated code needs to be reviewed, which means the natural bottleneck on all of this is how fast I can review the results. It's tough keeping up with just a single LLM given how fast they can churn things out, where's the benefit from running more than one at a time if it just leaves me further behind? Despite my misgivings, over the past few weeks I've noticed myself quietly starting to embrace the parallel coding agent lifestyle. I can only focus on reviewing and landing one significant change at a time, but I'm finding an increasing number of tasks that can still be fired off in parallel without adding too much cognitive overhead to my primary work. Here are some patterns I've found for applying parallel agents effectively. The first category of tasks I've been applying this pattern to is research . Research tasks answer questions or provide recommendations without making modifications to a project that you plan to keep. A lot of software projects start with a proof of concept. Can Yjs be used to implement a simple collaborative note writing tool with a Python backend? The libraries exist , but do they work when you wire them together? Today's coding agents can build a proof of concept with new libraries and resolve those kinds of basic questions. Libraries too new to be in the training data? Doesn't matter: tell them to checkout the repos for those new dependencies and read the code to figure out how to use them. If you need a reminder about how a portion of your existing system works, modern "reasoning" LLMs can provide a detailed, actionable answer in just a minute or two. It doesn't matter how large your codebase is: coding agents are extremely effective with tools like grep and can follow codepaths through dozens of different files if they need to. Ask them to make notes on where your signed cookies are set and read, or how your application uses subprocesses and threads, or which aspects of your JSON API aren't yet covered by your documentation. These LLM-generated explanations are worth stashing away somewhere, because they can make excellent context to paste into further prompts in the future. Now we're moving on to code edits that we intend to keep, albeit with very low-stakes. It turns out there are a lot of problems that really just require a little bit of extra cognitive overhead which can be outsourced to a bot. Warnings are a great example. Is your test suite spitting out a warning that something you are using is deprecated? Chuck that at a bot - tell it to run the test suite and figure out how to fix the warning. No need to take a break from what you're doing to resolve minor irritations like that. There is a definite knack to spotting opportunities like this. As always, the best way to develop that instinct is to try things - any small maintenance task is something that's worth trying with a coding agent. You can learn from both their successes and their failures. Reviewing code that lands on your desk out of nowhere is a lot of work. First you have to derive the goals of the new implementation: what's it trying to achieve? Is this something the project needs? Is the approach taken the best for this current project, given other future planned changes? A lot of big questions before you can even start digging into the details of the code. Code that started from your own specification is a lot less effort to review. If you already decided what to solve, picked the approach and worked out a detailed specification for the work itself, confirming it was built to your needs can take a lot less time. I described my more authoritarian approach to prompting models for code back in March. If I tell them exactly how to build something the work needed to review the resulting changes is a whole lot less taxing. My daily drivers are currently Claude Code (on Sonnet 4.5), Codex CLI (on GPT-5-Codex), and Codex Cloud (for asynchronous tasks, frequently launched from my phone.) I'm also dabbling with GitHub Copilot Coding Agent (the agent baked into the GitHub.com web interface in various places) and Google Jules , Google's currently-free alternative to Codex Cloud. I'm still settling into patterns that work for me. I imagine I'll be iterating on my processes for a long time to come, especially as the landscape of coding agents continues to evolve. I frequently have multiple terminal windows open running different coding agents in different directories. These are currently a mixture of Claude Code and Codex CLI, running in YOLO mode (no approvals) for tasks where I'm confident malicious instructions can't sneak into the context. (I need to start habitually running my local agents in Docker containers to further limit the blast radius if something goes wrong.) I haven't adopted git worktrees yet: if I want to run two agents in isolation against the same repo I do a fresh checkout, often into . For riskier tasks I'm currently using asynchronous coding agents - usually Codex Cloud - so if anything goes wrong the worst that can happen is my source code getting leaked (since I allow it to have network access while running). Most of what I work on is open source anyway so that's not a big concern for me. I occasionally use GitHub Codespaces to run VS Code's agent mode, which is surprisingly effective and runs directly in my browser. This is particularly great for workshops and demos since it works for anyone with GitHub account, no extra API key necessary. This category of coding agent software is still really new, and the models have only really got good enough to drive them effectively in the past few months - Claude 4 and GPT-5 in particular. I plan to write more as I figure out the ways of using them that are most effective. I encourage other practitioners to do the same! Jesse Vincent wrote How I'm using coding agents in September, 2025 which describes his workflow for parallel agents in detail, including having an architect agent iterate on a plan which is then reviewed and implemented by fresh instances of Claude Code. In The 7 Prompting Habits of Highly Effective Engineers Josh Bleecher Snyder describes several patterns for this kind of work. I particularly like this one: Send out a scout . Hand the AI agent a task just to find out where the sticky bits are, so you don’t have to make those mistakes. I've tried this a few times with good results: give the agent a genuinely difficult task against a large codebase, with no intention of actually landing its code, just to get ideas from which files it modifies and how it approaches the problem. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Simon Willison 2 weeks ago

Designing agentic loops

Coding agents like Anthropic's Claude Code and OpenAI's Codex CLI represent a genuine step change in how useful LLMs can be for producing working code. These agents can now directly exercise the code they are writing, correct errors, dig through existing implementation details, and even run experiments to find effective code solutions to problems. As is so often the case with modern AI, there is a great deal of depth involved in unlocking the full potential of these new tools. A critical new skill to develop is designing agentic loops . One way to think about coding agents is that they are brute force tools for finding solutions to coding problems. If you can reduce your problem to a clear goal and a set of tools that can iterate towards that goal a coding agent can often brute force its way to an effective solution. My preferred definition of an LLM agent is something that runs tools in a loop to achieve a goal . The art of using them well is to carefully design the tools and loop for them to use. Agents are inherently dangerous - they can make poor decisions or fall victim to malicious prompt injection attacks , either of which can result in harmful results from tool calls. Since the most powerful coding agent tool is "run this command in the shell" a rogue agent can do anything that you could do by running a command yourself. To quote Solomon Hykes : An AI agent is an LLM wrecking its environment in a loop. Coding agents like Claude Code counter this by defaulting to asking you for approval of almost every command that they run. This is kind of tedious, but more importantly, it dramatically reduces their effectiveness at solving problems through brute force. Each of these tools provides its own version of what I like to call YOLO mode, where everything gets approved by default. This is so dangerous , but it's also key to getting the most productive results! Here are three key risks to consider from unattended YOLO mode. If you want to run YOLO mode anyway, you have a few options: Most people choose option 3. Despite the existence of container escapes I think option 1 using Docker or the new Apple container tool is a reasonable risk to accept for most people. Option 2 is my favorite. I like to use GitHub Codespaces for this - it provides a full container environment on-demand that's accessible through your browser and has a generous free tier too. If anything goes wrong it's a Microsoft Azure machine somewhere that's burning CPU and the worst that can happen is code you checked out into the environment might be exfiltrated by an attacker, or bad code might be pushed to the attached GitHub repository. There are plenty of other agent-like tools that run code on other people's computers. Code Interpreter mode in both ChatGPT and Claude can go a surprisingly long way here. I've also had a lot of success (ab)using OpenAI's Codex Cloud . Coding agents themselves implement various levels of sandboxing, but so far I've not seen convincing enough documentation of these to trust them. Update : It turns out Anthropic have their own documentation on Safe YOLO mode for Claude Code which says: Letting Claude run arbitrary commands is risky and can result in data loss, system corruption, or even data exfiltration (e.g., via prompt injection attacks). To minimize these risks, use in a container without internet access. You can follow this reference implementation using Docker Dev Containers. Locking internet access down to a list of trusted hosts is a great way to prevent exfiltration attacks from stealing your private source code. Now that we've found a safe (enough) way to run in YOLO mode, the next step is to decide which tools we need to make available to the coding agent. You can bring MCP into the mix at this point, but I find it's usually more productive to think in terms of shell commands instead. Coding agents are really good at running shell commands! If your environment allows them the necessary network access, they can also pull down additional packages from NPM and PyPI and similar. Ensuring your agent runs in an environment where random package installs don't break things on your main computer is an important consideration as well! Rather than leaning on MCP, I like to create an AGENTS.md (or equivalent) file with details of packages I think they may need to use. For a project that involved taking screenshots of various websites I installed my own shot-scraper CLI tool and dropped the following in : Just that one example is enough for the agent to guess how to swap out the URL and filename for other screenshots. Good LLMs already know how to use a bewildering array of existing tools. If you say "use playwright python " or "use ffmpeg" most models will use those effectively - and since they're running in a loop they can usually recover from mistakes they make at first and figure out the right incantations without extra guidance. In addition to exposing the right commands, we also need to consider what credentials we should expose to those commands. Ideally we wouldn't need any credentials at all - plenty of work can be done without signing into anything or providing an API key - but certain problems will require authenticated access. This is a deep topic in itself, but I have two key recommendations here: I'll use an example to illustrate. A while ago I was investigating slow cold start times for a scale-to-zero application I was running on Fly.io . I realized I could work a lot faster if I gave Claude Code the ability to directly edit Dockerfiles, deploy them to a Fly account and measure how long they took to launch. Fly allows you to create organizations, and you can set a budget limit for those organizations and issue a Fly API key that can only create or modify apps within that organization... So I created a dedicated organization for just this one investigation, set a $5 budget, issued an API key and set Claude Code loose on it! In that particular case the results weren't useful enough to describe in more detail, but this was the project where I first realized that "designing an agentic loop" was an important skill to develop. Not every problem responds well to this pattern of working. The thing to look out for here are problems with clear success criteria where finding a good solution is likely to involve (potentially slightly tedious) trial and error . Any time you find yourself thinking "ugh, I'm going to have to try a lot of variations here" is a strong signal that an agentic loop might be worth trying! A few examples: A common theme in all of these is automated tests . The value you can get from coding agents and other LLM coding tools is massively amplified by a good, cleanly passing test suite. Thankfully LLMs are great for accelerating the process of putting one of those together, if you don't have one yet. Designing agentic loops is a very new skill - Claude Code was first released in just February 2025! I'm hoping that giving it a clear name can help us have productive conversations about it. There's so much more to figure out about how to use these tools as effectively as possible. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . The joy of YOLO mode Picking the right tools for the loop Issuing tightly scoped credentials When to design an agentic loop This is still a very fresh area Bad shell commands deleting or mangling things you care about. Exfiltration attacks where something steals files or data visible to the agent - source code or secrets held in environment variables are particularly vulnerable here. Attacks that use your machine as a proxy to attack another target - for DDoS or to disguise the source of other hacking attacks. Run your agent in a secure sandbox that restricts the files and secrets it can access and the network connections it can make. Use someone else's computer. That way if your agent goes rogue, there's only so much damage they can do, including wasting someone else's CPU cycles. Take a risk! Try to avoid exposing it to potential sources of malicious instructions and hope you catch any mistakes before they cause any damage. Try to provide credentials to test or staging environments where any damage can be well contained. If a credential can spend money, set a tight budget limit. Debugging : a test is failing and you need to investigate the root cause. Coding agents that can already run your tests can likely do this without any extra setup. Performance optimization : this SQL query is too slow, would adding an index help? Have your agent benchmark the query and then add and drop indexes (in an isolated development environment!) to measure their impact. Upgrading dependencies : you've fallen behind on a bunch of dependency upgrades? If your test suite is solid an agentic loop can upgrade them all for you and make any minor updates needed to reflect breaking changes. Make sure a copy of the relevant release notes is available, or that the agent knows where to find them itself. Optimizing container sizes : Docker container feeling uncomfortably large? Have your agent try different base images and iterate on the Dockerfile to try to shrink it, while keeping the tests passing.

3 views
Simon Willison 2 weeks ago

Claude Sonnet 4.5 is probably the "best coding model in the world" (at least for now)

Anthropic released Claude Sonnet 4.5 today , with a very bold set of claims: Claude Sonnet 4.5 is the best coding model in the world. It's the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math. Anthropic gave me access to a preview version of a "new model" over the weekend which turned out to be Sonnet 4.5. My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since it launched a few weeks ago . This space moves so fast - Gemini 3 is rumored to land soon so who knows how long Sonnet 4.5 will continue to hold the "best coding model" crown. The pricing is the same as the previous Sonnet: $3/million input tokens and $15/million output tokens. This remains significantly cheaper than Claude Opus - $15/$75 - but still quite a bit more than GPT-5 and GPT-5-Codex, both at $1.25/$10. The claude.ai web interface (not yet the Claude iPhone native app) recently added the ability for Claude to write and then directly execute code in a sandboxed server environment, using Python and Node.js. I wrote about that in detail three weeks ago. Anthropic's implementation of this code interpreter pattern is more powerful than ChatGPT's equivalent because it can directly clone code from GitHub and install software packages from NPM and PyPI. Sonnet 4.5 absolutely shines at using this tool! Here's my most ambitious experiment with it. I started like this: It did exactly as I asked, checking out my LLM repo , installing it along with all of its dependencies, and running the tests. ✅ All 466 tests passed in 167.69 seconds (about 2 minutes 47 seconds) Then I posed it a wildly ambitious challenge: As you can see in the resulting conversation transcript , it did a really impressive job of this! It churned away across dozens of different tool calls, eventually outputting the following summary of its activity: I've successfully experimented with adding tree-structured conversations to the LLM CLI tool. Here's what was accomplished: 1. Database Schema Enhancement 2. Comprehensive Utility Module ( ) 3. Complete Test Suite ( ) 4. Integration Verification The implementation is production-ready with comprehensive testing. Next steps would be: All code is tested, documented, and ready to use! Did I mention I kicked off this experiment by entering prompts on my phone? I told it to - you can explore the contents of the file it made me in this Gist . Update : I shared a longer version of this experiment where I had the model experiment with a complex message hashing mechanism as well. I've pushed the code to a branch , which is best explained by the message_matching_notes.md document written by Claude. I released llm-anthropic 0.19 adding support for the new model. Then I ran my classic pelican benchmark , first with thinking turned on : And then without : These are pretty good - they are recognizably pelicans! - though not quite as good as GPT-5-Codex which is better at drawing bicycles. I also tried having it describe this image: This image shows a large gathering of pelicans crowded together along what appears to be a waterfront or shoreline. The birds are densely packed, creating a chaotic scene with dozens of brown pelicans visible. Their distinctive long bills and pouches are clearly visible among the mass of feathered bodies. In the background, there's calm water and a dark tree line or forested area. The lighting suggests this was taken during early morning or late afternoon, giving the scene a golden, warm tone. The pelicans appear to be resting or congregating together, which is typical behavior for these colonial seabirds. The release of this model has been very well coordinated. My embargo on talking about it was due to lift at 10am Pacific today, and I got an email from them linking to their blog post at 10am on the dot. It's also already live on OpenRouter and in Cursor and GitHub Copilot and no doubt a whole bunch of other places as well. Anthropic also shipped a new Claude Code VS Code extension today, plus a big upgrade to the Claude Code terminal app. Plus they rebranded their confusingly named Claude Code SDK to the Claude Agent SDK instead, emphasizing that it's a tool for building agents beyond just customizing the existing Claude Code product. That's available for both TypeScript and Python . You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Added column to the table via migration Enables tree-structured (branching) conversations instead of just linear sequences Fully backward compatible - existing data unaffected 12 helper functions for tree operations: Navigation (children, parent, siblings, paths) Analysis (depth, descendants, size) Queries (roots, leaves, statistics) Visualization (ASCII tree display) 16 tests covering all tree operations Tests linear chains, branching, multiple roots, forests Includes realistic debugging scenario All tests pass ✅ Updated existing migration tests to expect new column All 6 migration tests pass Total: 22/22 tests passing ✅ README.md - Quick start guide IMPLEMENTATION_SUMMARY.md - Complete technical overview MIGRATION_CODE.md - Database migration details TEST_RESULTS.md - Full test coverage report tree_notes.md - Design decisions and development notes tree_utils.py - Utility functions module test_tree_conversations.py - Test suite Multiple branches from any conversation point Multiple roots per conversation (forest structure) Rich analytics (depth, branching factor, tree size) ASCII tree visualization Cycle detection for safety Integrate into LLM package Add CLI commands ( , ) Update to accept

0 views
Simon Willison 3 weeks ago

I think "agent" may finally have a widely enough agreed upon definition to be useful jargon now

I've noticed something interesting over the past few weeks: I've started using the term "agent" in conversations where I don't feel the need to then define it, roll my eyes or wrap it in scare quotes. This is a big piece of personal character development for me! Moving forward, when I talk about agents I'm going to use this: An LLM agent runs tools in a loop to achieve a goal. I've been very hesitant to use the term "agent" for meaningful communication over the last couple of years. It felt to me like the ultimate in buzzword bingo - everyone was talking about agents, but if you quizzed them everyone seemed to hold a different mental model of what they actually were.

0 views
Simon Willison 1 months ago

My review of Claude's new Code Interpreter, released under a very confusing name

Today on the Anthropic blog: Claude can now create and edit files : Claude can now create and edit Excel spreadsheets, documents, PowerPoint slide decks, and PDFs directly in Claude.ai and the desktop app. [...] File creation is now available as a preview for Max, Team, and Enterprise plan users. Pro users will get access in the coming weeks. Then right at the very end of their post: This feature gives Claude internet access to create and analyze files, which may put your data at risk. Monitor chats closely when using this feature. Learn more .

0 views
Simon Willison 1 months ago

Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide

Apollo Global Management's "Chief Economist" Dr. Torsten Sløk released this interesting chart which appears to show a slowdown in AI adoption rates among large (>250 employees) companies: Here's the full description that accompanied the chart: The US Census Bureau conducts a biweekly survey of 1.2 million firms, and one question is whether a business has used AI tools such as machine learning, natural language processing, virtual agents or voice recognition to help produce goods or services in the past two weeks. Recent data by firm size shows that AI adoption has been declining among companies with more than 250 employees, see chart below. (My first thought on seeing that chart is that I hope it represents the peak of inflated expectations leading into the trough of dissillusionment in the Gartner Hype Cycle (which Wikipedia calls "largely disputed, with studies pointing to it being inconsistently true at best"), since that means we might be reaching the end of the initial hype phase and heading towards the slope of enlightenment .) This is the first I'd heard of the US Census Bureau running a biweekly (that's once every two weeks) survey about AI!

0 views
Simon Willison 1 months ago

GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search

"Don't use chatbots as search engines" was great advice for several years... until it wasn't. I wrote about how good OpenAI's o3 was at using its Bing-backed search tool back in April . GPT-5 feels even better. I've started calling it my Research Goblin . I can assign a task to it, no matter how trivial or complex, and it will do an often unreasonable amount of work to search the internet and figure out an answer. This is excellent for satisfying curiosity, and occasionally useful for more important endeavors as well. I always run my searches by selecting the "GPT-5 Thinking" model from the model picker - in my experience this leads to far more comprehensive (albeit much slower) results.

0 views
Simon Willison 1 months ago

V&A East Storehouse and Operation Mincemeat in London

We were back in London for a few days and yesterday had a day of culture. First up: the brand new V&A East Storehouse museum in the Queen Elizabeth Olympic Park near Stratford, which opened on May 31st this year. This is a delightful new format for a museum. The building is primarily an off-site storage area for London's Victoria and Albert museum, storing 250,000 items that aren't on display in their main building. The twist is that it's also open to the public. Entrance is free, and you can climb stairs and walk through an airlock-style corridor into the climate controlled interior, then explore three floors of walkways between industrial shelving units holding thousands of items from the collection. There is almost no signage aside from an occasional number that can help you look up items in the online catalog.

0 views
Simon Willison 2 months ago

The Summer of Johann: prompt injections as far as the eye can see

Independent AI researcher Johann Rehberger ( previously ) has had an absurdly busy August. Under the heading The Month of AI Bugs he has been publishing one report per day across an array of different tools, all of which are vulnerable to various classic prompt injection problems. This is a fantastic and horrifying demonstration of how widespread and dangerous these vulnerabilities still are, almost three years after we first started talking about them . Johann's published research in August so far covers ChatGPT, Codex, Anthropic MCPs, Cursor, Amp, Devin, OpenHands, Claude Code, GitHub Copilot and Google Jules. There's still half the month left! Here are my one-sentence summaries of everything he's published so far: There are a number of patterns that show up time and time again in the above list of disclosures: Inspired by my description of the lethal trifecta , Johann has coined the term AI Kill Chain to describe a particularly harmful pattern:

0 views
Simon Willison 2 months ago

Open weight LLMs exhibit inconsistent performance across providers

Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model - OpenAI’s gpt-oss-120b - performs across different hosted providers. The results showed some surprising differences. Here's the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each model, using gpt-oss-120b with a reasoning effort of "high": These are some varied results! It looks like most of the providers that scored 93.3% were running models using the latest vLLM (with the exception of Cerebras who I believe have their own custom serving stack). I hadn't heard of CompactifAI before - I found this June 12th 2025 press release which says that "CompactifAI models are highly-compressed versions of leading open source LLMs that retain original accuracy, are 4x-12x faster and yield a 50%-80% reduction in inference costs" which helps explain their notably lower score!

0 views
Simon Willison 2 months ago

LLM 0.27, the annotated release notes: GPT-5 and improved tool calling

I shipped LLM 0.27 today (followed by a 0.27.1 with minor bug fixes ), adding support for the new GPT-5 family of models from OpenAI plus a flurry of improvements to the tool calling features introduced in LLM 0.26 . Here are the annotated release notes . I would have liked to get these out sooner, but LLM had accumulated quite a lot of other changes since the last release and I wanted to use GPT-5 as an excuse to wrap all of those up and get them out there. These models work much the same as other OpenAI models, but they have a new option of . You can try that out like this: Setting "minimal" almost completely eliminates the "thinking" time for the model, causing it to behave more like GPT-4o. Here's the letter it wrote me at a cost of 20 input, 706 output = $0.007085 which is 0.7085 cents .

0 views
Simon Willison 2 months ago

Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!"

I've fallen a few days behind keeping up with Qwen. They released two new 4B models last week: Qwen3-4B-Instruct-2507 and its thinking equivalent Qwen3-4B-Thinking-2507 . These are relatively tiny models that punch way above their weight. I’ve been running the 8bit GGUF varieties via LM Studio (here's Instruct , here's Thinking ) - both of them are 4GB downloads that use around 4.3GB of my M2 MacBook Pro’s system RAM while running. Both are way more capable than I would expect from such small files. Qwen3-4B-Thinking is the first model I've tried which called out the absurdity of being asked to draw a pelican riding a bicycle ! Here's the full transcript - some highlights from the thinking trace: Okay, the user asked for an SVG of a pelican riding a bicycle. That's... oddly specific and completely unrealistic.

0 views
Simon Willison 2 months ago

My Lethal Trifecta talk at the Bay Area AI Security Meetup

I gave a talk on Wednesday at the Bay Area AI Security Meetup about prompt injection, the lethal trifecta and the challenges of securing systems that use MCP. It wasn't recorded but I've created an annotated presentation with my slides and detailed notes on everything I talked about. Also included: some notes on my weird hobby of trying to coin or amplify new terms of art. Minutes before I went on stage an audience member asked me if there would be any pelicans in my talk, and I panicked because there were not! So I dropped in this photograph I took a few days ago in Half Moon Bay as the background for my title slide. Let's start by reviewing prompt injection - SQL injection with prompts. It's called that because the root cause is the original sin of AI engineering: we build these systems through string concatenation, by gluing together trusted instructions and untrusted input. Anyone who works in security will know why this is a bad idea! It's the root cause of SQL injection, XSS, command injection and so much more.

0 views
Simon Willison 2 months ago

The surprise deprecation of GPT-4o for ChatGPT consumers

I've been dipping into the r/ChatGPT subreddit recently to see how people are reacting to the GPT-5 launch , and so far the vibes there are not good. This AMA thread with the OpenAI team is a great illustration of the single biggest complaint: a lot of people are very unhappy to lose access to the much older GPT-4o, previously ChatGPT's default model for most users. A big surprise for me yesterday was that OpenAI simultaneously retired access to their older models as they rolled out GPT-5, at least in their consumer apps. Here's a snippet from their August 7th 2025 release notes : When GPT-5 launches, several older models will be retired, including GPT-4o, GPT-4.1, GPT-4.5, GPT-4.1-mini, o4-mini, o4-mini-high, o3, o3-pro. If you open a conversation that used one of these models, ChatGPT will automatically switch it to the closest GPT-5 equivalent. Chats with 4o, 4.1, 4.5, 4.1-mini, o4-mini, or o4-mini-high will open in GPT-5, chats with o3 will open in GPT-5-Thinking, and chats with o3-Pro will open in GPT-5-Pro (available only on Pro and Team). There's no deprecation period at all: when your consumer ChatGPT account gets GPT-5, those older models cease to be available.

0 views
Simon Willison 2 months ago

GPT-5: Key characteristics, pricing and model card

I've had preview access to the new GPT-5 model family for the past two weeks (see related video and my disclosures ) and have been using GPT-5 as my daily-driver. It's my new favorite model. It's still an LLM - it's not a dramatic departure from what we've had before - but it rarely screws up and generally feels competent or occasionally impressive at the kinds of things I like to use models for. I've collected a lot of notes over the past two weeks, so I've decided to break them up into a series of posts . This first one will cover key characteristics of the models, how they are priced and what we can learn from the GPT-5 system card . Let's start with the fundamentals. GPT-5 in ChatGPT is a weird hybrid that switches between different models. Here's what the system card says about that (my highlights in bold): GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt). [...] Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model. GPT-5 in the API is simpler: it's available as three models - regular , mini and nano - which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high.

0 views
Simon Willison 2 months ago

OpenAI's new open weight (Apache 2) models are really good

The long promised OpenAI open weight models are here , and they are very impressive. They're available under proper open source licenses - Apache 2.0 - and come in two sizes, 120B and 20B. OpenAI's own benchmarks are eyebrow-raising - emphasis mine: The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure. o4-mini and o3-mini are really good proprietary models - I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes. That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM. Both models are mixture-of-experts:

0 views
Simon Willison 2 months ago

ChatGPT agent's user-agent

I was exploring how ChatGPT agent works today. I learned some interesting things about how it exposes its identity through HTTP headers, then made a huge blunder in thinking it was leaking its URLs to Bingbot and Yandex... but it turned out that was a Cloudflare feature that had nothing to do with ChatGPT. ChatGPT agent is the recently released (and confusingly named) ChatGPT feature that provides browser automation combined with terminal access as a feature of ChatGPT - replacing their previous Operator research preview which is scheduled for deprecation on August 31st. I decided to dig into how it works by creating a logged web URL endpoint using django-http-debug . Then I told ChatGPT agent mode to explore that new page: My logging captured these request headers: That Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 user-agent header is the one used by the most recent Chrome on macOS - which is a little odd here as the Sec-Ch-Ua-Platform : "Linux" indicates that the agent browser runs on Linux.

0 views