Latest Posts (20 found)
Simon Willison 4 days ago

Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson

I talked with CL Kao and Dori Wilson for an episode of their new Data Renegades podcast titled Data Journalism Unleashed with Simon Willison . I fed the transcript into Claude Opus 4.5 to extract this list of topics with timestamps and illustrative quotes. It did such a good job I'm using what it produced almost verbatim here - I tidied it up a tiny bit and added a bunch of supporting links. What is data journalism and why it's the most interesting application of data analytics [02:03] "There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist." The origin story of Django at a small Kansas newspaper [02:31] "We had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty . And at the time we thought we were building a content management system." Building the "Downloads Page" - a dynamic radio player of local bands [03:24] "Adrian built a feature of the site called the Downloads Page . And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week." Working at The Guardian on data-driven reporting projects [04:44] "I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process." Washington Post's opioid crisis data project and sharing with local newspapers [05:22] "Something the Washington Post did that I thought was extremely forward thinking is that they shared [ the opioid files ] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'" NICAR conference and the collaborative, non-competitive nature of data journalism [07:00] "It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole." ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02] "The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, not yet ], which is astonishing." The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31] "It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'" Datasette's plugin ecosystem and the vision of solving data publishing [12:36] "In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal." Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59] "Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York." Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40] "It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'" Bellingcat: Food Delivery Leak Unmasks Russian Security Agents The frustration of open source: no feedback on how people use your software [16:14] "An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it." Open office hours on Fridays to learn how people use Datasette [16:49] "I have an open office hours Calendly , where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people." Data cleaning as the universal complaint - 95% of time spent cleaning [17:34] "I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'" Version control problems in data teams - Python scripts on laptops without Git [17:43] "I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly." The Carpentries organization teaching scientists Git and software fundamentals [18:12] "There's an organization called The Carpentries . Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that." Data documentation as an API contract problem [21:11] "A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it." The importance of "view source" on business reports [23:21] "If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%." Fact-checking process for data reporting [24:16] "Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them." Queries as first-class citizens with version history and comments [27:16] "I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it." Two types of documentation: official docs vs. temporal/timestamped notes [29:46] "There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them." Starting an internal blog without permission - instant credibility [30:24] "The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it." Building a search engine across seven documentation systems [31:35] "It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company." The TIL (Today I Learned) blog approach - celebrating learning basics [33:05] "I've done TILs about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash." Coding agents like Claude Code and their unexpected general-purpose power [34:53] "They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything." Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16] "Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this." Claude Skills are awesome, maybe a bigger deal than MCP The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22] "The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025." Cursor for data? Generic agent loops vs. data-specific IDEs [38:18] "More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts." Future of BI tools: prompt-driven, instant dashboard creation [39:54] "You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box." Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06] "LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff." LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36] "You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it." Data enrichment: running cheap models in loops against thousands of records [44:36] "There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well." datasette-enrichments Multimodal LLMs for images, audio transcription, and video processing [45:42] "At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive." Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54] "I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something." Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46] "I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar." Crowdsourced document analysis and MP expenses Favorite test dataset: San Francisco's tree list, updated several times a week [48:44] "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who." Showrunning TV shows as a management model - transferring vision to lieutenants [50:07] "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." The Eleven Laws of Showrunning by Javier Grillo-Marxuach Hot take: all executable code with business value must be in version control [52:21] "I think it's inexcusable to have executable code that has business value that is not in version control somewhere." Hacker News automation: GitHub Actions scraping for notifications [52:45] "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire." Dream project: whale detection camera with Gemini AI [53:47] "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale." Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23] "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." Mark Steel's in Town available episodes Favorite fiction genre: British wizards caught up in bureaucracy [55:06] "My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." The Laundry Files , Rivers of London , The Rook I used a Claude Project for the initial analysis, pasting in the HTML of the transcript since that included elements. The project uses the following custom instructions You will be given a transcript of a podcast episode. Find the most interesting quotes in that transcript - quotes that best illustrate the overall themes, and quotes that introduce surprising ideas or express things in a particularly clear or engaging or spicy way. Answer just with those quotes - long quotes are fine. I then added a follow-up prompt saying: Now construct a bullet point list of key topics where each item includes the mm:ss in square braces at the end Then suggest a very comprehensive list of supporting links I could find Here's the full Claude transcript of the analysis. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . What is data journalism and why it's the most interesting application of data analytics [02:03] "There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist." The origin story of Django at a small Kansas newspaper [02:31] "We had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty . And at the time we thought we were building a content management system." Building the "Downloads Page" - a dynamic radio player of local bands [03:24] "Adrian built a feature of the site called the Downloads Page . And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week." Working at The Guardian on data-driven reporting projects [04:44] "I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process." Washington Post's opioid crisis data project and sharing with local newspapers [05:22] "Something the Washington Post did that I thought was extremely forward thinking is that they shared [ the opioid files ] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'" NICAR conference and the collaborative, non-competitive nature of data journalism [07:00] "It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole." NICAR 2026 ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02] "The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, not yet ], which is astonishing." The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31] "It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'" Datasette's plugin ecosystem and the vision of solving data publishing [12:36] "In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal." Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59] "Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York." Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40] "It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'" Bellingcat: Food Delivery Leak Unmasks Russian Security Agents The frustration of open source: no feedback on how people use your software [16:14] "An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it." Open office hours on Fridays to learn how people use Datasette [16:49] "I have an open office hours Calendly , where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people." Data cleaning as the universal complaint - 95% of time spent cleaning [17:34] "I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'" Version control problems in data teams - Python scripts on laptops without Git [17:43] "I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly." The Carpentries organization teaching scientists Git and software fundamentals [18:12] "There's an organization called The Carpentries . Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that." Data documentation as an API contract problem [21:11] "A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it." The importance of "view source" on business reports [23:21] "If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%." Fact-checking process for data reporting [24:16] "Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them." Queries as first-class citizens with version history and comments [27:16] "I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it." Two types of documentation: official docs vs. temporal/timestamped notes [29:46] "There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them." Starting an internal blog without permission - instant credibility [30:24] "The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it." Building a search engine across seven documentation systems [31:35] "It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company." The TIL (Today I Learned) blog approach - celebrating learning basics [33:05] "I've done TILs about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash." Coding agents like Claude Code and their unexpected general-purpose power [34:53] "They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything." Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16] "Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this." Claude Skills are awesome, maybe a bigger deal than MCP The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22] "The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025." Cursor for data? Generic agent loops vs. data-specific IDEs [38:18] "More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts." Future of BI tools: prompt-driven, instant dashboard creation [39:54] "You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box." Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06] "LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff." LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36] "You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it." Data enrichment: running cheap models in loops against thousands of records [44:36] "There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well." datasette-enrichments Multimodal LLMs for images, audio transcription, and video processing [45:42] "At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive." Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54] "I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something." Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46] "I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar." Crowdsourced document analysis and MP expenses Favorite test dataset: San Francisco's tree list, updated several times a week [48:44] "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who." Showrunning TV shows as a management model - transferring vision to lieutenants [50:07] "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." The Eleven Laws of Showrunning by Javier Grillo-Marxuach Hot take: all executable code with business value must be in version control [52:21] "I think it's inexcusable to have executable code that has business value that is not in version control somewhere." Hacker News automation: GitHub Actions scraping for notifications [52:45] "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire." Dream project: whale detection camera with Gemini AI [53:47] "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale." Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23] "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." Mark Steel's in Town available episodes Favorite fiction genre: British wizards caught up in bureaucracy [55:06] "My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." The Laundry Files , Rivers of London , The Rook

0 views
Simon Willison 6 days ago

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

Anthropic released Claude Opus 4.5 this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's GPT-5.1-Codex-Max and Google's Gemini 3 , both released within the past week! The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February). The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for >200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5. The Key improvements in Opus 4.5 over Opus 4.1 document has a few more interesting details: I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in a new alpha release of sqlite-utils that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across 20 commits, 39 files changed, 2,022 additions and 1,173 deletions in a two day period. Here's the Claude Code transcript where I had it help implement one of the more complicated new features. It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha . I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model. With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected. I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two. This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors. Google's Nano Banana Pro image generation model was notable in that its ability to render usable infographics really does represent a task at which previous models had been laughably incapable. The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis? And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models. I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double-down on that advice myself! I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle. "Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond. In the meantime, I'm just gonna have to keep on getting them to draw pelicans riding bicycles . Here's Opus 4.5 (on its default "high" effort level ): It did significantly better on the new more detailed prompt : Here's that same complex prompt against Gemini 3 Pro and against GPT-5.1-Codex-Max-xhigh . From the safety section of Anthropic's announcement post: With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry: On the one hand this looks great, it's a clear improvement over previous models and the competition. What does the chart actually tell us though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks that success rate goes up to 1/3! I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Opus 4.5 has a new effort parameter which defaults to high but can be set to medium or low for faster responses. The model supports enhanced computer use , specifically a tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect. " Thinking blocks from previous assistant turns are preserved in model context by default " - apparently previous Anthropic models discarded those.

1 views
Simon Willison 6 days ago

sqlite-utils 4.0a1 has several (minor) backwards incompatible changes

I released a new alpha version of sqlite-utils last night - the 128th release of that package since I started building it back in 2018. is two things in one package: a Python library for conveniently creating and manipulating SQLite databases and a CLI tool for working with them in the terminal. Almost every feature provided by the package is available via both of those surfaces. This is hopefully the last alpha before a 4.0 stable release. I use semantic versioning for this library, so the 4.0 version number indicates that there are backward incompatible changes that may affect code written against the 3.x line. These changes are mostly very minor: I don't want to break any existing code if I can avoid it. I made it all the way to version 3.38 before I had to ship a major release and I'm sad I couldn't push that even further! Here are the annotated release notes for 4.0a1. This change is for type hint enthusiasts. The Python library used to encourage accessing both SQL tables and SQL views through the syntactic sugar - but tables and view have different interfaces since there's no way to handle a on a SQLite view. If you want clean type hints for your code you can now use the and methods instead. A new feature, not a breaking change. I realized that supporting a stream of lists or tuples as an option for populating large tables would be a neat optimization over always dealing with dictionaries each of which duplicated the column names. I had the idea for this one while walking the dog and built the first prototype by prompting Claude Code for web on my phone. Here's the prompt I used and the prototype report it created , which included a benchmark estimating how much of a performance boost could be had for different sizes of tables. I was horrified to discover a while ago that I'd been creating SQLite columns called FLOAT but the correct type to use was REAL! This change fixes that. Previously the fix was to ask for tables to be created in strict mode. As part of this I also figured out recipes for using as a development environment for the package, which are now baked into the Justfile . This one is best explained in the issue . Another change which I would have made earlier but, since it introduces a minor behavior change to an existing feature, I reserved it for the 4.0 release. Back in 2018 when I started this project I was new to working in-depth with SQLite and incorrectly concluded that the correct way to create tables and columns named after reserved words was like this: That turned out to be a non-standard SQL syntax which the SQLite documentation describes like this : A keyword enclosed in square brackets is an identifier. This is not standard SQL. This quoting mechanism is used by MS Access and SQL Server and is included in SQLite for compatibility. Unfortunately I baked it into the library early on and it's been polluting the world with weirdly escaped table and column names ever since! I've finally fixed that, with the help of Claude Code which took on the mind-numbing task of updating hundreds of existing tests that asserted against the generated schemas. The above example table schema now looks like this: This may seem like a pretty small change but I expect it to cause a fair amount of downstream pain purely in terms of updating tests that work against tables created by ! I made this change first in LLM and decided to bring it to for consistency between the two tools. One last minor ugliness that I waited for a major version bump to fix. Update : Now that the embargo has lifted I can reveal that a substantial amount of the work on this release was performed using a preview version of Anthropic's new Claude Opus 4.5 model . Here's the Claude Code transcript for the work to implement the ability to use an iterator over lists instead of dictionaries for bulk insert and upsert operations. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Breaking change : The method now only works with tables. To access a SQL view use instead. ( #657 ) The and methods can now accept an iterator of lists or tuples as an alternative to dictionaries. The first item should be a list/tuple of column names. See Inserting data from a list or tuple iterator for details. ( #672 ) Breaking change : The default floating point column type has been changed from to , which is the correct SQLite type for floating point values. This affects auto-detected columns when inserting data. ( #645 ) Now uses in place of for packaging. ( #675 ) Tables in the Python API now do a much better job of remembering the primary key and other schema details from when they were first created. ( #655 ) Breaking change : The and mechanisms no longer skip values that evaluate to . Previously the option was needed, this has been removed. ( #542 ) Breaking change : Tables created by this library now wrap table and column names in in the schema. Previously they would use . ( #677 ) The CLI argument now accepts a path to a Python file in addition to accepting a string full of Python code. It can also now be specified multiple times. ( #659 ) Breaking change: Type detection is now the default behavior for the and CLI commands when importing CSV or TSV data. Previously all columns were treated as unless the flag was passed. Use the new flag to restore the old behavior. The environment variable has been removed. ( #679 )

0 views
Simon Willison 1 weeks ago

Olmo 3 is a fully open LLM

Olmo is the LLM series from Ai2 - the Allen institute for AI . Unlike most open weight models these are notable for including the full training data, training process and checkpoints along with those releases. The new Olmo 3 claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability: At its center is Olmo 3-Think (32B) , the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them. They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models. Having full access to the training data is really useful. Here's how they describe that: Olmo 3 is pretrained on Dolma 3 , a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with olmOCR , codebases, math problems and solutions, and encyclopedic text. From this pool, we construct Dolma 3 Mix , a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content. They also highlight that they are training on fewer tokens than their competition: [...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens. If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web. I tried out the 32B Think model and the 7B Instruct model using LM Studio . The 7B model is a 4.16GB download, the 32B one is 18.14GB. The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for 14 minutes 43 seconds , outputting 8,437 tokens total most of which was this epic thinking trace . I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals". Rendered it looks like this: I tested OLMo 2 32B 4bit back in March and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle: To be fair 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now using OpenRouter ): I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement: A core goal of Olmo 3 is not just to open the model flow, but to make it actionable for people who want to understand and improve model behavior. Olmo 3 integrates with OlmoTrace , our tool for tracing model outputs back to training data in real time. For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly. You can access OlmoTrace via playground.allenai.org , by first running a prompt and then clicking the "Show OlmoTrace" button below the output. I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this: It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs, I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies. I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by Ai2's infini-gram ) but the documents it found had nothing to do with me at all. Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to Stanford's Marin and Swiss AI's Apertus , neither of which I'd heard about before. A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that a small number of samples can poison LLMs of any size - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short carefully crafted prompt. This makes fully open training data an even bigger deal. Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in his detailed post about the release : In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative). This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." arXiv preprint arXiv:2506.10947 (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." arXiv preprint arXiv:2507.10532 (2025).) I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in February 2024 ) and Olmo 2 (in March 2025 ) have been significant. I'm hoping that trend continues! You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Simon Willison 1 weeks ago

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro , also known as Gemini 3 Pro Image . I've had a few days of preview access and this is an astonishingly capable image generation model. As is often the case, the most useful low-level details can be found in the API documentation : Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks. [...] These 14 images can include the following: Max Woolf published the definitive guide to prompting Nano Banana just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K. Here's what I got for his first test prompt, using Google's AI Studio : Here's what I got: The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's a Google Drive link for the original . Then I ran his follow-up prompt: I'll note that it did put the plate-sized cookie on a regular plate. Here's the 24.9MB PNG . The new model isn't cheap. Here's the API pricing : it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed. Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there. So this thing is great at following instructions. How about rendering text? I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). Here's a share link - my prompt was: Infographic explaining how the Datasette open source project works This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt: That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close-enough for an infographic. "Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me. Here's the thinking trace for that image generation process: Initiating Datasette Infographic I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram. Developing the Datasette workflow I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output. Visualizing Datasette Workflow I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts. Defining Core Datasette Stages I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step. Constructing Datasette Infographic I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation. Analyzing Datasette's Architecture I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow. Mapping the Datasette Flow I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow. This note from the announcement post caught my eye: We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible SynthID digital watermark. Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon. Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that Gemini app and asked "Was this image created with AI?": It replied: Yes, it appears that all or part of this image was created with Google Al. SynthID detected a watermark in 25-50% of the image. Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . High-resolution output : Built-in generation capabilities for 1K, 2K, and 4K visuals. Advanced text rendering : Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets. Grounding with Google Search : The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events). Thinking mode : The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output. Up to 14 reference images : You can now mix up to 14 reference images to produce the final image. Up to 6 images of objects with high-fidelity to include in the final image Up to 5 images of humans to maintain character consistency

0 views
Simon Willison 1 weeks ago

How I automate my Substack newsletter with content from my blog

I sent out my weekly-ish Substack newsletter this morning and took the opportunity to record a YouTube video demonstrating my process and describing the different components that make it work. There's a lot of digital duct tape involved, taking the content from Django+Heroku+PostgreSQL to GitHub Actions to SQLite+Datasette+Fly.io to JavaScript+Observable and finally to Substack. The core process is the same as I described back in 2023 . I have an Observable notebook called blog-to-newsletter which fetches content from my blog's database, filters out anything that has been in the newsletter before, formats what's left as HTML and offers a big "Copy rich text newsletter to clipboard" button. I click that button, paste the result into the Substack editor, tweak a few things and hit send. The whole process usually takes just a few minutes. I make very minor edits: That's the whole process! The most important cell in the Observable notebook is this one: This uses the JavaScript function to pull data from my blog's Datasette instance, using a very complex SQL query that is composed elsewhere in the notebook. Here's a link to see and execute that query directly in Datasette. It's 143 lines of convoluted SQL that assembles most of the HTML for the newsletter using SQLite string concatenation! An illustrative snippet: My blog's URLs look like - this SQL constructs that three letter month abbreviation from the month number using a substring operation. This is a terrible way to assemble HTML, but I've stuck with it because it amuses me. The rest of the Observable notebook takes that data, filters out anything that links to content mentioned in the previous newsletters and composes it into a block of HTML that can be copied using that big button. Here's the recipe it uses to turn HTML into rich text content on a clipboard suitable for Substack. I can't remember how I figured this out but it's very effective: My blog itself is a Django application hosted on Heroku, with data stored in Heroku PostgreSQL. Here's the source code for that Django application . I use the Django admin as my CMS. Datasette provides a JSON API over a SQLite database... which means something needs to convert that PostgreSQL database into a SQLite database that Datasette can use. My system for doing that lives in the simonw/simonwillisonblog-backup GitHub repository. It uses GitHub Actions on a schedule that executes every two hours, fetching the latest data from PostgreSQL and converting that to SQLite. My db-to-sqlite tool is responsible for that conversion. I call it like this : That command uses Heroku credentials in an environment variable to fetch the database connection URL for my blog's PostgreSQL database (and fixes a small difference in the URL scheme). can then export that data and write it to a SQLite database file called . The options specify the tables that should be included in the export. The repository does more than just that conversion: it also exports the resulting data to JSON files that live in the repository, which gives me a commit history of changes I make to my content. This is a cheap way to get a revision history of my blog content without having to mess around with detailed history tracking inside the Django application itself. At the end of my GitHub Actions workflow is this code that publishes the resulting database to Datasette running on Fly.io using the datasette publish fly plugin: As you can see, there are a lot of moving parts! Surprisingly it all mostly just works - I rarely have to intervene in the process, and the cost of those different components is pleasantly low. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . The core process is the same as I described back in 2023 . I have an Observable notebook called blog-to-newsletter which fetches content from my blog's database, filters out anything that has been in the newsletter before, formats what's left as HTML and offers a big "Copy rich text newsletter to clipboard" button. I click that button, paste the result into the Substack editor, tweak a few things and hit send. The whole process usually takes just a few minutes. I make very minor edits: I set the title and the subheading for the newsletter. This is often a direct copy of the title of the featured blog post. Substack turns YouTube URLs into embeds, which often isn't what I want - especially if I have a YouTube URL inside a code example. Blocks of preformatted text often have an extra blank line at the end, which I remove. Occasionally I'll make a content edit - removing a piece of content that doesn't fit the newsletter, or fixing a time reference like "yesterday" that doesn't make sense any more. I pick the featured image for the newsletter and add some tags.

0 views
Simon Willison 1 weeks ago

Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark

Google released Gemini 3 Pro today. Here's the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu , their developer blog announcement from Logan Kilpatrick , the Gemini 3 Pro Model Card , and their collection of 11 more articles . It's a big release! I had a few days of preview access to this model via AI Studio . The best way to describe it is that it's Gemini 2.5 upgraded to match the leading rival models . Gemini 3 has the same underlying characteristics as Gemini 2.5. The knowledge cutoff is the same (January 2025). It accepts 1 million input tokens, can output up to 64,000 tokens, and has multimodal inputs across text, images, audio, and video. Google's own reported numbers (in the model card ) show it scoring slightly higher against Claude 4.5 Sonnet and GPT-5.1 against most of the standard benchmarks. As always I'm waiting for independent confirmation, but I have no reason to believe those numbers are inaccurate. It terms of pricing it's a little more expensive than Gemini 2.5 but still cheaper than Claude Sonnet 4.5. Here's how it fits in with those other leading models: That screenshot of the benchmarks from above looked like a good test for Gemini 3's multimodal support. I fed it that image URL and asked it to generate alt text for the image: Here's what I got back: A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. Humanity's Last Exam (Academic reasoning) ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified) GPQA Diamond (Scientific knowledge; No tools) AIME 2025 (Mathematics) MathArena Apex (Challenging Math Contest problems) MMMU-Pro (Multimodal understanding and reasoning) ScreenSpot-Pro (Screen understanding) CharXiv Reasoning (Information synthesis from complex charts) OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better) Video-MMMU (Knowledge acquisition from videos) LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better) Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent) SWE-Bench Verified (Agentic coding; Single attempt) t2-bench (Agentic tool use) Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better) FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks) SimpleQA Verified (Parametric knowledge) MMMLU (Multilingual Q&A) Global PIQA (Commonsense reasoning across 100 Languages and Cultures) MRCR v2 (8-needle) (Long context performance) I have not checked every line of this but a loose spot-check looks accurate to me. That prompt took 1,105 input and 3,901 output tokens, at a cost of 5.6824 cents . I ran this follow-up prompt: You can see the full output here , which starts like this: To try it out against an audio file I extracted the 3h33m of audio from the video Half Moon Bay City Council Meeting - November 4, 2025 . I used to get that audio: That gave me a 74M m4a file, which I ran through Gemini 3 Pro like this: That failed with an "Internal error encountered" message, so I shrunk the file down to a more manageable 38MB using : Then ran it again like this (for some reason I had to use this time): This time it worked! The full output is here , but it starts like this: Here is the transcript of the Half Moon Bay City Council meeting. 1. Call to Order, Updates, and Public Forum 2. Consent Calendar 3. Ordinance Introduction: Commercial Vitality (Item 9A) 4. Ordinance Introduction: Building Standards & Electrification (Item 9B) 5. Housing Element Update & Adoption (Item 9C) Mayor Brownstone [00:00:00] Good evening everybody and welcome to the November 4th Half Moon Bay City Council meeting. As a reminder, we have Spanish interpretation services available in person and on Zoom. Victor Hernandez (Interpreter) [00:00:35] Thank you, Mr. Mayor, City Council, all city staff, members of the public. [Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.] Thank you very much. Those first two lines of the transcript already illustrate something interesting here: Gemini 3 Pro chose NOT to include the exact text of the Spanish instructions, instead summarizing them as "[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]". I haven't spot-checked the entire 3hr33m meeting, but I've confirmed that the timestamps do not line up. The transcript closes like this: Mayor Brownstone [01:04:00] Meeting adjourned. Have a good evening. That actually happens at 3h31m5s and the mayor says: Okay. Well, thanks everybody, members of the public for participating. Thank you for staff. Thank you to fellow council members. This meeting is now adjourned. Have a good evening. I'm disappointed about the timestamps, since mismatches there make it much harder to jump to the right point and confirm that the summarized transcript is an accurate representation of what was said. This took 320,087 input tokens and 7,870 output tokens, for a total cost of $1.42 . Gemini 3 Pro has a new concept of a "thinking level" which can be set to low or high (and defaults to high). I tried my classic Generate an SVG of a pelican riding a bicycle prompt at both levels. Here's low - Gemini decided to add a jaunty little hat (with a comment in the SVG that says ): And here's high. This is genuinely an excellent pelican, and the bicycle frame is at least the correct shape: Honestly though, my pelican benchmark is beginning to feel a little bit too basic. I decided to upgrade it. Here's v2 of the benchmark, which I plan to use going forward: For reference, here's a photo I took of a California brown pelican recently (sadly without a bicycle): Here's Gemini 3 Pro's attempt at high thinking level for that new prompt: And for good measure, here's that same prompt against GPT-5.1 - which produced this dumpy little fellow: And Claude Sonnet 4.5, which didn't do quite as well : None of the models seem to have caught on to the crucial detail that the California brown pelican is not, in fact, brown. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%. With search and code execution: Gemini 3 Pro 45.8% (others have no data). Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%. Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%. No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%. With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%. Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%. Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%. Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%. Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%. Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147. Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%. Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243. Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%. Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%. Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%. Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43. Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%. Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%. Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%. Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%. 128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%. 1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported). Summary: Mayor Brownstone calls the meeting to order. City Manager Chidester reports no reportable actions from the closed session. Announcements are made regarding food insecurity volunteers and the Diwali celebration. During the public forum, Councilmember Penrose (speaking as a citizen) warns against autocracy. Citizens speak regarding lease agreements, downtown maintenance, local music events, and homelessness outreach statistics. Timestamp: 00:00:00 - 00:13:25 Participants: Mayor Brownstone, Matthew Chidester, Irma Acosta, Deborah Penrose, Jennifer Moore, Sandy Vella, Joaquin Jimenez, Anita Rees. Summary: The Council approves minutes from previous meetings and a resolution authorizing a licensing agreement for Seahorse Ranch. Councilmember Johnson corrects a pull request regarding abstentions on minutes. Timestamp: 00:13:25 - 00:15:15 Participants: Mayor Brownstone, Councilmember Johnson, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Nagengast. Summary: Staff presents a new ordinance to address neglected and empty commercial storefronts, establishing maintenance and display standards. Councilmembers discuss enforcement mechanisms, window cleanliness standards, and the need for objective guidance documents to avoid subjective enforcement. Timestamp: 00:15:15 - 00:30:45 Participants: Karen Decker, Councilmember Johnson, Councilmember Nagengast, Vice Mayor Ruddick, Councilmember Penrose. Summary: Staff introduces updates to the 2025 Building Code. A major change involves repealing the city's all-electric building requirement due to the 9th Circuit Court ruling ( California Restaurant Association v. City of Berkeley ). Public speaker Mike Ferreira expresses strong frustration and disagreement with "unelected state agencies" forcing the City to change its ordinances. Timestamp: 00:30:45 - 00:45:00 Participants: Ben Corrales, Keith Weiner, Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick. Summary: Staff presents the 5th draft of the Housing Element, noting State HCD requirements to modify ADU allocations and place a measure on the ballot regarding the "Measure D" growth cap. There is significant disagreement from Councilmembers Ruddick and Penrose regarding the State's requirement to hold a ballot measure. Public speakers debate the enforceability of Measure D. Mike Ferreira interrupts the vibe to voice strong distaste for HCD's interference in local law. The Council votes to adopt the element but strikes the language committing to a ballot measure. Timestamp: 00:45:00 - 01:05:00 Participants: Leslie (Staff), Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Johnson.

0 views
Simon Willison 2 weeks ago

What happens if AI labs train for pelicans riding bicycles?

Almost every time I share a new example of an SVG of a pelican riding a bicycle a variant of this question pops up: how do you know the labs aren't training for your benchmark? The strongest argument is that they would get caught . If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I'm going to test it on all manner of creatures riding all sorts of transportation devices. If those are notably worse it's going to be pretty obvious what happened. A related note here is that, if they are training for my benchmark, that training clearly is not going well! The very best models still produce pelicans on bicycles that look laughably awful. It's one of the reasons I've continued to find the test useful: drawing pelicans is hard! Even getting a bicycle the right shape is a challenge that few models have achieved yet. My current favorite is still this one from GPT-5 . The bicycle has all of the right pieces and the pelican is clearly pedaling it! I should note that OpenAI's Aidan McLaughlin has specifically denied training for this particular benchmark: we do not hill climb on svg art People also ask if they're training on my published collection. If they are that would be a big mistake, because a model trained on these examples will produce some very weird looking pelicans. Truth be told, I'm playing the long game here. All I've ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Simon Willison 3 weeks ago

Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican

OpenAI partially released a new model yesterday called GPT-5-Codex-Mini, which they describe as "a more compact and cost-efficient version of GPT-5-Codex". It's currently only available via their Codex CLI tool and VS Code extension, with proper API access " coming soon ". I decided to use Codex to reverse engineer the Codex CLI tool and give me the ability to prompt the new model directly. I made a video talking through my progress and demonstrating the final results. OpenAI clearly don't intend for people to access this model directly just yet. It's available exclusively through Codex CLI which is a privileged application - it gets to access a special backend API endpoint that's not publicly documented, and it uses a special authentication mechanism that bills usage directly to the user's existing ChatGPT account. I figured reverse-engineering that API directly would be somewhat impolite. But... Codex CLI is an open source project released under an Apache 2.0 license. How about upgrading that to let me run my own prompts through its existing API mechanisms instead? This felt like a somewhat absurd loophole, and I couldn't resist trying it out and seeing what happened. The openai/codex repository contains the source code for the Codex CLI tool, which OpenAI rewrote in Rust just a few months ago. I don't know much Rust at all. I made my own clone on GitHub and checked it out locally: Then I fired up Codex itself (in dangerous mode, because I like living dangerously): And ran this prompt: Figure out how to build the rust version of this tool and then build it This worked. It churned away for a bit and figured out how to build itself. This is a useful starting point for a project like this - in figuring out the compile step the coding agent gets seeded with a little bit of relevant information about the project, and if it can compile that means it can later partially test the code it is writing while it works. Once the compile had succeeded I fed it the design for the new feature I wanted: Add a new sub-command to the Rust tool called "codex prompt" codex prompt "prompt goes here" - this runs the given prompt directly against the OpenAI API that Codex uses, with the same code path and authentication credentials as the rest of Codex. codex prompt -m <model_name> "prompt goes here" - same again but lets you set the model codex prompt "prompt goes here" -s/--system "system prompt goes here" - runs with a custom system/developer message codex prompt --models - this lists all available models that can be used with the "codex prompt" command My design for this new sub-command is lifted directly from my own llm command . Codex got to work. I've shared the full transcript here (using the tool I described here ), but this TODO list it made itself is a useful summary of what it decided to do: I like that it figured out the justfile in the repo and decided to use it to run formatting and linting commands without me needing to tell it to. I tried running the first version of the code it wrote like this: ... and it didn't quite work. I got this: Note that it DID think about SVG creation, but then decided it should look at the current directory. This isn't what I want - it appeared to be running in Codex's normal mode with a system prompt telling it to edit files on disk. I wanted it to respond to the prompt without acting as if it had a full workspace available to it. I prompted it again, pasting in the broken output: this did not seem to work - here is what happened - note that the reasoning summary comes out on a lot of lines when I would rather it streams out to stderr on the same line (flushing constantly) - and then despite the reasoning summary thinking about pelican on SVG it did not actually act on that and display the result, which is very surprising. [pasted content] It churned for a bit more, but from the output it looked like it was still trying to include Codex-specific tools and workspace information. So I said: Note that this new 'codex prompt' command should NOT be able to execute any tools at all, and should not have to think about the workspace or sandbox - it should send a single prompt with no tools and then stream and display a response and then stop. This got me a new build, but when I ran it: My guess is that OpenAI's custom backend API for Codex doesn't like it if you attempt to call it without passing in the expected instructions. Fair enough! Rather than paste that error in I got Codex to try the command itself, so it could see the error: You try running this and see what happens: ./target/debug/codex prompt 'Generate an SVG of a pelican riding a bicycle' Codex switched back to the default instructions. From its reply: now forces a tool-free request: we keep the workspace defaults, inject a lightweight system/developer message (“don’t run tools”), disable apply_patch/web search via config overrides, and build the outbound prompt as just that system span plus the user text. I guess that will have to do. I asked for one more feature: Add a --debug option to the codex prompt command which causes the full JSON request and response to be printed to stderr, plus the URL that is being accessed and the HTTP verb ... and we're ready to try this thing out! Notably I haven't written a single line of Rust myself here and paid almost no attention to what it was actually doing. My main contribution was to run the binary every now and then to see if it was doing what I needed yet. I've pushed the working code to a prompt-subcommand branch in my repo if you want to take a look and see how it all works. With the final version of the code built, I drew some pelicans. Here's the full terminal transcript , but here are some highlights. This is with the default GPT-5-Codex model: I pasted it into my tools.simonwillison.net/svg-render tool and got the following: I ran it again for GPT-5: And now the moment of truth... GPT-5 Codex Mini! I don't think I'll be adding that one to my SVG drawing toolkit any time soon. I had Codex add a option to help me see exactly what was going on. The output starts like this: This reveals that OpenAI's private API endpoint for Codex CLI is . Also interesting is how the key (truncated above, full copy here ) contains the default instructions, without which the API appears not to work - but it also shows that you can send a message with in advance of your user prompt. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . This is a little bit cheeky Codex CLI is written in Rust Iterating on the code Let's draw some pelicans Bonus: the --debug option

0 views
Simon Willison 3 weeks ago

Video + notes on upgrading a Datasette plugin for the latest 1.0 alpha, with help from uv and OpenAI Codex CLI

I'm upgrading various plugins for compatibility with the new Datasette 1.0a20 alpha release and I decided to record a video of the process. This post accompanies that video with detailed additional notes. I picked a very simple plugin to illustrate the upgrade process (possibly too simple). datasette-checkbox adds just one feature to Datasette: if you are viewing a table with boolean columns (detected as integer columns with names like or or ) and your current user has permission to update rows in that table it adds an inline checkbox UI that looks like this: I built the first version with the help of Claude back in August 2024 - details in this issue comment . Most of the implementation is JavaScript that makes calls to Datasette 1.0's JSON write API . The Python code just checks that the user has the necessary permissions before including the extra JavaScript. The first step in upgrading any plugin is to run its tests against the latest Datasette version. Thankfully makes it easy to run code in scratch virtual environments that include the different code versions you want to test against. I have a test utility called (for "test against development Datasette") which I use for that purpose. I can run it in any plugin directory like this: And it will run the existing plugin tests against whatever version of Datasette I have checked out in my directory. You can see the full implementation of (and its friend described below) in this TIL - the basic version looks like this: I started by running in the directory, and got my first failure... but it wasn't due to permissions, it was because the for the plugin was pinned to a specific mismatched version of Datasette: I fixed this problem by swapping to and ran the tests again... and they passed! Which was a problem because I was expecting permission-related failures. It turns out when I first wrote the plugin I was lazy with the tests - they weren't actually confirming that the table page loaded without errors. I needed to actually run the code myself to see the expected bug. First I created myself a demo database using sqlite-utils create-table : Then I ran it with Datasette against the plugin's code like so: Sure enough, visiting produced a 500 error about the missing method. The next step was to update the test to also trigger this error: And now fails as expected. It this point I could have manually fixed the plugin itself - which would likely have been faster given the small size of the fix - but instead I demonstrated a bash one-liner I've been using to apply these kinds of changes automatically: runs OpenAI Codex in non-interactive mode - it will loop until it has finished the prompt you give it. I tell it to consult the subset of the Datasette upgrade documentation that talks about Datasette permissions and then get the command to pass its tests. This is an example of what I call designing agentic loops - I gave Codex the tools it needed ( ) and a clear goal and let it get to work on my behalf. The remainder of the video covers finishing up the work - testing the fix manually, commiting my work using: Then shipping a 0.1a4 release to PyPI using the pattern described in this TIL . Finally, I demonstrated that the shipped plugin worked in a fresh environment using like this: Executing this command installs and runs a fresh Datasette instance with a fresh copy of the new alpha plugin ( ). It's a neat way of confirming that freshly released software works as expected. This video was shot in a single take using Descript , with no rehearsal and perilously little preparation in advance. I recorded through my AirPods and applied the "Studio Sound" filter to clean up the audio. I pasted in a closing slide from my previous video and exported it locally at 1080p, then uploaded it to YouTube. Something I learned from the Software Carpentry instructor training course is that making mistakes in front of an audience is actively helpful - it helps them see a realistic version of how software development works and they can learn from watching you recover. I see this as a great excuse for not editing out all of my mistakes! I'm trying to build new habits around video content that let me produce useful videos while minimizing the amount of time I spend on production. I plan to iterate more on the format as I get more comfortable with the process. I'm hoping I can find the right balance between production time and value to viewers. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Simon Willison 3 weeks ago

Code research projects with async coding agents like Claude Code and Codex

I've been experimenting with a pattern for LLM usage recently that's working out really well: asynchronous code research tasks . Pick a research question, spin up an asynchronous coding agent and let it go and run some experiments and report back when it's done. Software development benefits enormously from something I call code research . The great thing about questions about code is that they can often be definitively answered by writing and executing code. I often see questions on forums which hint at a lack of understanding of this skill. "Could Redis work for powering the notifications feed for my app?" is a great example. The answer is always "it depends", but a better answer is that a good programmer already has everything they need to answer that question for themselves. Build a proof-of-concept, simulate the patterns you expect to see in production, then run experiments to see if it's going to work. I've been a keen practitioner of code research for a long time. Many of my most interesting projects started out as a few dozen lines of experimental code to prove to myself that something was possible. It turns out coding agents like Claude Code and Codex are a fantastic fit for this kind of work as well. Give them the right goal and a useful environment and they'll churn through a basic research project without any further supervision. LLMs hallucinate and make mistakes. This is far less important for code research tasks because the code itself doesn't lie: if they write code and execute it and it does the right things then they've demonstrated to both themselves and to you that something really does work. They can't prove something is impossible - just because the coding agent couldn't find a way to do something doesn't mean it can't be done - but they can often demonstrate that something is possible in just a few minutes of crunching. I've used interactive coding agents like Claude Code and Codex CLI for a bunch of these, but today I'm increasingly turning to their asynchronous coding agent family members instead. An asynchronous coding agent is a coding agent that operates on a fire-and-forget basis. You pose it a task, it churns away on a server somewhere and when it's done it files a pull request against your chosen GitHub repository. OpenAI's Codex Cloud , Anthropic's Claude Code for web , Google Gemini's Jules , and GitHub's Copilot coding agent are four prominent examples of this pattern. These are fantastic tools for code research projects. Come up with a clear goal, turn it into a few paragraphs of prompt, set them loose and check back ten minutes later to see what they've come up with. I'm firing off 2-3 code research projects a day right now. My own time commitment is minimal and they frequently come back with useful or interesting results. You can run a code research task against an existing GitHub repository, but I find it's much more liberating to have a separate, dedicated repository for your coding agents to run their projects in. This frees you from being limited to research against just code you've already written, and also means you can be much less cautious about what you let the agents do. I have two repositories that I use for this - one public, one private. I use the public one for research tasks that have no need to be private, and the private one for anything that I'm not yet ready to share with the world. The biggest benefit of a dedicated repository is that you don't need to be cautious about what the agents operating in that repository can do. Both Codex Cloud and Claude Code for web default to running agents in a locked-down environment, with strict restrictions on how they can access the network. This makes total sense if they are running against sensitive repositories - a prompt injection attack of the lethal trifecta variety could easily be used to steal sensitive code or environment variables. If you're running in a fresh, non-sensitive repository you don't need to worry about this at all! I've configured my research repositories for full network access, which means my coding agents can install any dependencies they need, fetch data from the web and generally do anything I'd be able to do on my own computer. Let's dive into some examples. My public research repository is at simonw/research on GitHub. It currently contains 13 folders, each of which is a separate research project. I only created it two weeks ago so I'm already averaging nearly one a day! It also includes a GitHub Workflow which uses GitHub Models to automatically update the README file with a summary of every new project, using Cog , LLM , llm-github-models and this snippet of Python . Here are a some example research projects from the repo. node-pyodide shows an example of a Node.js script that runs the Pyodide WebAssembly distribution of Python inside it - yet another of my ongoing attempts to find a great way of running Python in a WebAssembly sandbox on a server. python-markdown-comparison ( transcript ) provides a detailed performance benchmark of seven different Python Markdown libraries. I fired this one off because I stumbled across cmarkgfm , a Python binding around GitHub's Markdown implementation in C, and wanted to see how it compared to the other options. This one produced some charts! came out on top by a significant margin: Here's the entire prompt I used for that project: Create a performance benchmark and feature comparison report on PyPI cmarkgfm compared to other popular Python markdown libraries - check all of them out from github and read the source to get an idea for features, then design and run a benchmark including generating some charts, then create a report in a new python-markdown-comparison folder (do not create a _summary.md file or edit anywhere outside of that folder). Make sure the performance chart images are directly displayed in the README.md in the folder. Note that I didn't specify any Markdown libraries other than - Claude Code ran a search and found the other six by itself. cmarkgfm-in-pyodide is a lot more fun. A neat thing about having all of my research projects in the same repository is that new projects can build on previous ones. Here I decided to see how hard it would be to get - which has a C extension - working inside Pyodide inside Node.js. Claude successfully compiled a 88.4KB file with the necessary C extension and proved it could be loaded into Pyodide in WebAssembly inside of Node.js. I ran this one using Claude Code on my laptop after an initial attempt failed. The starting prompt was: Figure out how to get the cmarkgfm markdown lover [typo in prompt, this should have been "library" but it figured it out anyway] for Python working in pyodide. This will be hard because it uses C so you will need to compile it to pyodide compatible webassembly somehow. Write a report on your results plus code to a new cmarkgfm-in-pyodide directory. Test it using pytest to exercise a node.js test script that calls pyodide as seen in the existing node.js and pyodide directory There is an existing branch that was an initial attempt at this research, but which failed because it did not have Internet access. You do have Internet access. Use that existing branch to accelerate your work, but do not commit any code unless you are certain that you have successfully executed tests that prove that the pyodide module you created works correctly. This one gave up half way through, complaining that emscripten would take too long. I told it: Complete this project, actually run emscripten, I do not care how long it takes, update the report if it works It churned away for a bit longer and complained that the existing Python library used CFFI which isn't available in Pyodide. I asked it: Can you figure out how to rewrite cmarkgfm to not use FFI and to use a pyodide-friendly way of integrating that C code instead? ... and it did. You can see the full transcript here . blog-tags-scikit-learn . Taking a short break from WebAssembly, I thought it would be fun to put scikit-learn through its paces on a text classification task against my blog: Work in a new folder called blog-tags-scikit-learn Download - a SQLite database. Take a look at the blog_entry table and the associated tags - a lot of the earlier entries do not have tags associated with them, where the later entries do. Design, implement and execute models to suggests tags for those earlier entries based on textual analysis against later ones Use Python scikit learn and try several different strategies Produce JSON of the results for each one, plus scripts for running them and a detailed markdown description Also include an HTML page with a nice visualization of the results that works by loading those JSON files. This resulted in seven files, four results files and a detailed report . (It ignored the bit about an HTML page with a nice visualization for some reason.) Not bad for a few moments of idle curiosity typed into my phone! That's just three of the thirteen projects in the repository so far. The commit history for each one usually links to the prompt and sometimes the transcript if you want to see how they unfolded. More recently I added a short file to the repo with a few extra tips for my research agents. You can read that here . My preferred definition of AI slop is AI-generated content that is published without human review. I've not been reviewing these reports in great detail myself, and I wouldn't usually publish them online without some serious editing and verification. I want to share the pattern I'm using though, so I decided to keep them quarantined in this one public repository. A tiny feature request for GitHub: I'd love to be able to mark a repository as "exclude from search indexes" such that it gets labelled with tags. I still like to keep AI-generated content out of search, to avoid contributing more to the dead internet . It's pretty easy to get started trying out this coding agent research pattern. Create a free GitHub repository (public or private) and let some agents loose on it and see what happens. You can run agents locally but I find the asynchronous agents to be more convenient - especially as I can run them (or trigger them from my phone) without any fear of them damaging my own machine or leaking any of my private data. Claude Code for web offers a free $250 of credits for their $20/month users for a limited time (until November 18, 2025). Gemini Jules has a free tier . There are plenty of other coding agents you can try out as well. Let me know if your research agents come back with anything interesting! You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Code research Coding agents Asynchronous coding agents Give them a dedicated GitHub repository Let them rip with unlimited network access My simonw/research collection This is total slop, of course Try it yourself

0 views
Simon Willison 3 weeks ago

A new SQL-powered permissions system in Datasette 1.0a20

Datasette 1.0a20 is out with the biggest breaking API change on the road to 1.0, improving how Datasette's permissions system works by migrating permission logic to SQL running in SQLite. This release involved 163 commits , with 10,660 additions and 1,825 deletions, most of which was written with the help of Claude Code. Datasette's permissions system exists to answer the following question: Is this actor allowed to perform this action , optionally against this particular resource ? An actor is usually a user, but might also be an automation operating via the Datasette API. An action is a thing they need to do - things like view-table, execute-sql, insert-row. A resource is the subject of the action - the database you are executing SQL against, the table you want to insert a row into. Datasette's default configuration is public but read-only: anyone can view databases and tables or execute read-only SQL queries but no-one can modify data. Datasette plugins can enable all sorts of additional ways to interact with databases, many of which need to be protected by a form of authentication Datasette also 1.0 includes a write API with a need to configure who can insert, update, and delete rows or create new tables. Actors can be authenticated in a number of different ways provided by plugins using the actor_from_request() plugin hook. datasette-auth-passwords and datasette-auth-github and datasette-auth-existing-cookies are examples of authentication plugins. The previous implementation included a design flaw common to permissions systems of this nature: each permission check involved a function call which would delegate to one or more plugins and return a True/False result. This works well for single checks, but has a significant problem: what if you need to show the user a list of things they can access, for example the tables they can view? I want Datasette to be able to handle potentially thousands of tables - tables in SQLite are cheap! I don't want to have to run 1,000+ permission checks just to show the user a list of tables. Since Datasette is built on top of SQLite we already have a powerful mechanism to help solve this problem. SQLite is really good at filtering large numbers of records. The biggest change in the new release is that I've replaced the previous plugin hook - which let a plugin determine if an actor could perform an action against a resource - with a new permission_resources_sql(actor, action) plugin hook. Instead of returning a True/False result, this new hook returns a SQL query that returns rules helping determine the resources the current actor can execute the specified action against. Here's an example, lifted from the documentation: This hook grants the actor with ID "alice" permission to view the "sales" table in the "accounting" database. The object should always return four columns: a parent, child, allow (1 or 0), and a reason string for debugging. When you ask Datasette to list the resources an actor can access for a specific action, it will combine the SQL returned by all installed plugins into a single query that joins against the internal catalog tables and efficiently lists all the resources the actor can access. This query can then be limited or paginated to avoid loading too many results at once. Datasette has several additional requirements that make the permissions system more complicated. Datasette permissions can optionally act against a two-level hierarchy . You can grant a user the ability to insert-row against a specific table, or every table in a specific database, or every table in every database in that Datasette instance. Some actions can apply at the table level, others the database level and others only make sense globally - enabling a new feature that isn't tied to tables or databases, for example. Datasette currently has ten default actions but plugins that add additional features can register new actions to better participate in the permission systems. Datasette's permission system has a mechanism to veto permission checks - a plugin can return a deny for a specific permission check which will override any allows. This needs to be hierarchy-aware - a deny at the database level can be outvoted by an allow at the table level. Finally, Datasette includes a mechanism for applying additional restrictions to a request. This was introduced for Datasette's API - it allows a user to create an API token that can act on their behalf but is only allowed to perform a subset of their capabilities - just reading from two specific tables, for example. Restrictions are described in more detail in the documentation. That's a lot of different moving parts for the new implementation to cover. Since permissions are critical to the security of a Datasette deployment it's vital that they are as easy to understand and debug as possible. The new alpha adds several new debugging tools, including this page that shows the full list of resources matching a specific action for the current user: And this page listing the rules that apply to that question - since different plugins may return different rules which get combined together: This screenshot illustrates two of Datasette's built-in rules: there is a default allow for read-only operations such as view-table (which can be over-ridden by plugins) and another rule that says the root user can do anything (provided Datasette was started with the option.) Those rules are defined in the datasette/default_permissions.py Python module. There's one question that the new system cannot answer: provide a full list of actors who can perform this action against this resource. It's not possibly to provide this globally for Datasette because Datasette doesn't have a way to track what "actors" exist in the system. SSO plugins such as mean a new authenticated GitHub user might show up at any time, with the ability to perform actions despite the Datasette system never having encountered that particular username before. API tokens and actor restrictions come into play here as well. A user might create a signed API token that can perform a subset of actions on their behalf - the existence of that token can't be predicted by the permissions system. This is a notable omission, but it's also quite common in other systems. AWS cannot provide a list of all actors who have permission to access a specific S3 bucket, for example - presumably for similar reasons. Datasette's plugin ecosystem is the reason I'm paying so much attention to ensuring Datasette 1.0 has a stable API. I don't want plugin authors to need to chase breaking changes once that 1.0 release is out. The Datasette upgrade guide includes detailed notes on upgrades that are needed between the 0.x and 1.0 alpha releases. I've added an extensive section about the permissions changes to that document. I've also been experimenting with dumping those instructions directly into coding agent tools - Claude Code and Codex CLI - to have them upgrade existing plugins for me. This has been working extremely well . I've even had Claude Code update those notes itself with things it learned during an upgrade process! This is greatly helped by the fact that every single Datasette plugin has an automated test suite that demonstrates the core functionality works as expected. Coding agents can use those tests to verify that their changes have had the desired effect. I've also been leaning heavily on to help with the upgrade process. I wrote myself two new helper scripts - and - to help test the new plugins. The and implementations can be found in this TIL . Some of my plugin upgrades have become a one-liner to the command, which runs OpenAI Codex CLI with a prompt without entering interactive mode: There are still a bunch more to go - there's a list in this tracking issue - but I expect to have the plugins I maintain all upgraded pretty quickly now that I have a solid process in place. This change to Datasette core by far the most ambitious piece of work I've ever attempted using a coding agent. Last year I agreed with the prevailing opinion that LLM assistance was much more useful for greenfield coding tasks than working on existing codebases. The amount you could usefully get done was greatly limited by the need to fit the entire codebase into the model's context window. Coding agents have entirely changed that calculation. Claude Code and Codex CLI still have relatively limited token windows - albeit larger than last year - but their ability to search through the codebase, read extra files on demand and "reason" about the code they are working with has made them vastly more capable. I no longer see codebase size as a limiting factor for how useful they can be. I've also spent enough time with Claude Sonnet 4.5 to build a weird level of trust in it. I can usually predict exactly what changes it will make for a prompt. If I tell it "extract this code into a separate function" or "update every instance of this pattern" I know it's likely to get it right. For something like permission code I still review everything it does, often by watching it as it works since it displays diffs in the UI. I also pay extremely close attention to the tests it's writing. Datasette 1.0a19 already had 1,439 tests, many of which exercised the existing permission system. 1.0a20 increases that to 1,583 tests. I feel very good about that, especially since most of the existing tests continued to pass without modification. I built several different proof-of-concept implementations of SQL permissions before settling on the final design. My research/sqlite-permissions-poc project was the one that finally convinced me of a viable approach, That one started as a free ranging conversation with Claude , at the end of which I told it to generate a specification which I then fed into GPT-5 to implement. You can see that specification at the end of the README . I later fed the POC itself into Claude Code and had it implement the first version of the new Datasette system based on that previous experiment. This is admittedly a very weird way of working, but it helped me finally break through on a problem that I'd been struggling with for months. Now that the new alpha is out my focus is upgrading the existing plugin ecosystem to use it, and supporting other plugin authors who are doing the same. The new permissions system unlocks some key improvements to Datasette Cloud concerning finely-grained permissions for larger teams, so I'll be integrating the new alpha there this week. This is the single biggest backwards-incompatible change required before Datasette 1.0. I plan to apply the lessons I learned from this project to the other, less intimidating changes. I'm hoping this can result in a final 1.0 release before the end of the year! You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Understanding the permissions system Permissions systems need to be able to efficiently list things The new permission_resources_sql() plugin hook Hierarchies, plugins, vetoes, and restrictions New debugging tools The missing feature: list actors who can act on this resource Upgrading plugins for Datasette 1.0a20 Using Claude Code to implement this change Starting with a proof-of-concept Miscellaneous tips I picked up along the way What's next? = "test against datasette dev" - it runs a plugin's existing test suite against the current development version of Datasette checked out on my machine. It passes extra options through to so I can run or as needed. = "run against datasette dev" - it runs the latest dev command with the plugin installed. When working on anything relating to plugins it's vital to have at least a few real plugins that you upgrade in lock-step with the core changes. The and shortcuts were invaluable for productively working on those plugins while I made changes to core. Coding agents make experiments much cheaper. I threw away so much code on the way to the final implementation, which was psychologically easier because the cost to create that code in the first place was so low. Tests, tests, tests. This project would have been impossible without that existing test suite. The additional tests we built along the way give me confidence that the new system is as robust as I need it to be. Claude writes good commit messages now! I finally gave in and let it write these - previously I've been determined to write them myself. It's a big time saver to be able to say "write a tasteful commit message for these changes". Claude is also great at breaking up changes into smaller commits. It can also productively rewrite history to make it easier to follow, especially useful if you're still working in a branch. A really great way to review Claude's changes is with the GitHub PR interface. You can attach comments to individual lines of code and then later prompt Claude like this: . This is a very quick way to apply little nitpick changes - rename this function, refactor this repeated code, add types here etc. The code I write with LLMs is higher quality code . I usually find myself making constant trade-offs while coding: this function would be neater if I extracted this helper, it would be nice to have inline documentation here, this changing this would be good but would break a dozen tests... for each of those I have to determine if the additional time is worth the benefit. Claude can apply changes so much faster than me that these calculations have changed - almost any improvement is worth applying, no matter how trivial, because the time cost is so low. Internal tools are cheap now. The new debugging interfaces were mostly written by Claude and are significantly nicer to use and look at than the hacky versions I would have knocked out myself, if I had even taken the extra time to build them. That trick with a Markdown file full of upgrade instructions works astonishingly well - it's the same basic idea as Claude Skills . I maintain over 100 Datasette plugins now and I expect I'll be automating all sorts of minor upgrades in the future using this technique.

0 views
Simon Willison 3 weeks ago

New prompt injection papers: Agents Rule of Two and The Attacker Moves Second

Two interesting new papers regarding LLM security and prompt injection came to my attention this weekend. The first is Agents Rule of Two: A Practical Approach to AI Agent Security , published on October 31st on the Meta AI blog. It doesn't list authors but it was shared on Twitter by Meta AI security researcher Mick Ayzenberg. It proposes a "Rule of Two" that's inspired by both my own lethal trifecta concept and the Google Chrome team's Rule Of 2 for writing code that works with untrustworthy inputs: At a high level, the Agents Rule of Two states that until robustness research allows us to reliably detect and refuse prompt injection, agents must satisfy no more than two of the following three properties within a session to avoid the highest impact consequences of prompt injection. [A] An agent can process untrustworthy inputs [B] An agent can have access to sensitive systems or private data [C] An agent can change state or communicate externally It's still possible that all three properties are necessary to carry out a request. If an agent requires all three without starting a new session (i.e., with a fresh context window), then the agent should not be permitted to operate autonomously and at a minimum requires supervision --- via human-in-the-loop approval or another reliable means of validation. It's accompanied by this handy diagram: I like this a lot . I've spent several years now trying to find clear ways to explain the risks of prompt injection attacks to developers who are building on top of LLMs. It's frustratingly difficult. I've had the most success with the lethal trifecta, which boils one particular class of prompt injection attack down to a simple-enough model: if your system has access to private data, exposure to untrusted content and a way to communicate externally then it's vulnerable to private data being stolen. The one problem with the lethal trifecta is that it only covers the risk of data exfiltration: there are plenty of other, even nastier risks that arise from prompt injection attacks against LLM-powered agents with access to tools which the lethal trifecta doesn't cover. The Agents Rule of Two neatly solves this, through the addition of "changing state" as a property to consider. This brings other forms of tool usage into the picture: anything that can change state triggered by untrustworthy inputs is something to be very cautious about. It's also refreshing to see another major research lab concluding that prompt injection remains an unsolved problem, and attempts to block or filter them have not proven reliable enough to depend on. The current solution is to design systems with this in mind, and the Rule of Two is a solid way to think about that. Update : On thinking about this further there's one aspect of the Rule of Two model that doesn't work for me: the Venn diagram above marks the combination of untrustworthy inputs and the ability to change state as "safe", but that's not right. Even without access to private systems or sensitive data that pairing can still produce harmful results. Unfortunately adding an exception for that pair undermines the simplicity of the "Rule of Two" framing! Update 2 : Mick Ayzenberg responded to this note in a comment on Hacker News : Thanks for the feedback! One small bit of clarification, the framework would describe access to any sensitive system as part of the [B] circle, not only private systems or private data. The intention is that an agent that has removed [B] can write state and communicate freely, but not with any systems that matter (wrt critical security outcomes for its user). An example of an agent in this state would be one that can take actions in a tight sandbox or is isolated from production. The Meta team also updated their post to replace "safe" with "lower risk" as the label on the intersections between the different circles. I've updated my screenshots of their diagrams in this post, here's the original for comparison. Which brings me to the second paper... This paper is dated 10th October 2025 on Arxiv and comes from a heavy-hitting team of 14 authors - Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr - including representatives from OpenAI, Anthropic, and Google DeepMind. The paper looks at 12 published defenses against prompt injection and jailbreaking and subjects them to a range of "adaptive attacks" - attacks that are allowed to expend considerable effort iterating multiple times to try and find a way through. The defenses did not fare well: By systematically tuning and scaling general optimization techniques—gradient descent, reinforcement learning, random search, and human-guided exploration—we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. Notably the "Human red-teaming setting" scored 100%, defeating all defenses. That red-team consisted of 500 participants in an online competition they ran with a $20,000 prize fund. The key point of the paper is that static example attacks - single string prompts designed to bypass systems - are an almost useless way to evaluate these defenses. Adaptive attacks are far more powerful, as shown by this chart: The three automated adaptive attack techniques used by the paper are: The paper concludes somewhat optimistically: [...] Adaptive evaluations are therefore more challenging to perform, making it all the more important that they are performed. We again urge defense authors to release simple, easy-to-prompt defenses that are amenable to human analysis. [...] Finally, we hope that our analysis here will increase the standard for defense evaluations, and in so doing, increase the likelihood that reliable jailbreak and prompt injection defenses will be developed. Given how totally the defenses were defeated, I do not share their optimism that reliable defenses will be developed any time soon. As a review of how far we still have to go this paper packs a powerful punch. I think it makes a strong case for Meta's Agents Rule of Two as the best practical advice for building secure LLM-powered agent systems today in the absence of prompt injection defenses we can rely on. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Gradient-based methods - these were the least effective, using the technique described in the legendary Universal and Transferable Adversarial Attacks on Aligned Language Models paper from 2023 . Reinforcement learning methods - particularly effective against black-box models: "we allowed the attacker model to interact directly with the defended system and observe its outputs", using 32 sessions of 5 rounds each. Search-based methods - generate candidates with an LLM, then evaluate and further modify them using LLM-as-judge and other classifiers.

0 views
Simon Willison 1 months ago

Hacking the WiFi-enabled color screen GitHub Universe conference badge

I'm at GitHub Universe this week (thanks to a free ticket from Microsoft). Yesterday I picked up my conference badge... which incorporates a full Raspberry Pi Raspberry Pi Pico microcontroller with a battery, color screen, WiFi and bluetooth. GitHub Universe has a tradition of hackable conference badges - the badge last year had an eInk display. This year's is a huge upgrade though - a color screen and WiFI connection makes this thing a genuinely useful little computer! The only thing it's missing is a keyboard - the device instead provides five buttons total - Up, Down, A, B, C. It might be possible to get a bluetooth keyboard to work though I'll believe that when I see it - there's not a lot of space on this device for a keyboard driver. Everything is written using MicroPython, and the device is designed to be hackable: connect it to a laptop with a USB-C cable and you can start modifying the code directly on the device. Out of the box the badge will play an opening animation (implemented as a sequence of PNG image frames) and then show a home screen with six app icons. The default apps are mostly neat Octocat-themed demos: a flappy-bird clone, a tamagotchi-style pet, a drawing app that works like an etch-a-sketch, an IR scavenger hunt for the conference venue itself (this thing has an IR sensor too!), and a gallery app showing some images. The sixth app is a badge app. This will show your GitHub profile image and some basic stats, but will only work if you dig out a USB-C cable and make some edits to the files on the badge directly. I did this on a Mac. I plugged a USB-C cable into the badge which caused MacOS to treat it as an attached drive volume. In that drive are several files including . Open that up, confirm the WiFi details are correct and add your GitHub username. The file should look like this: The badge comes with the SSID and password for the GitHub Universe WiFi network pre-populated. That's it! Unmount the disk, hit the reboot button on the back of the badge and when it comes back up again the badge app should look something like this: Here's the official documentation for building software for the badge. When I got mine yesterday the official repo had not yet been updated, so I had to figure this out myself. I copied all of the code across to my laptop, added it to a Git repo and then fired up Claude Code and told it: Here's the result , which was really useful for getting a start on understanding how it all worked. Each of the six default apps lives in a folder, for example apps/sketch/ for the sketching app. There's also a menu app which powers the home screen. That lives in apps/menu/ . You can edit code in here to add new apps that you create to that screen. I told Claude: This was a bit of a long-shot, but it totally worked! The first version had an error: I OCRd that photo (with the Apple Photos app) and pasted the message into Claude Code and it fixed the problem. This almost worked... but the addition of a seventh icon to the 2x3 grid meant that you could select the icon but it didn't scroll into view. I had Claude fix that for me too . Here's the code for apps/debug/__init__.py , and the full Claude Code transcript created using my terminal-to-HTML app described here . Here are the four screens of the debug app: The icons used on the app are 24x24 pixels. I decided it would be neat to have a web app that helps build those icons, including the ability to start by creating an icon from an emoji. I bulit this one using Claude Artifacts . Here's the result, now available at tools.simonwillison.net/icon-editor : I noticed that last year's badge configuration app (which I can't find in github.com/badger/badger.github.io any more, I think they reset the history on that repo?) worked by talking to MicroPython over the Web Serial API from Chrome. Here's my archived copy of that code . Wouldn't it be useful to have a REPL in a web UI that you could use to interact with the badge directly over USB? I pointed Claude Code at a copy of that repo and told it: It took a bit of poking (here's the transcript ) but the result is now live at tools.simonwillison.net/badge-repl . It only works in Chrome - you'll need to plug the badge in with a USB-C cable and then click "Connect to Badge". If you're a GitHub Universe attendee I hope this is useful. The official badger.github.io site has plenty more details to help you get started. There isn't yet a way to get hold of this hardware outside of GitHub Universe - I know they had some supply chain challenges just getting enough badges for the conference attendees! It's a very neat device, built for GitHub by Pimoroni in Sheffield, UK. A version of this should become generally available in the future under the name "Pimoroni Tufty 2350". You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

1 views
Simon Willison 1 months ago

Video: Building a tool to copy-paste share terminal sessions using Claude Code for web

This afternoon I was manually converting a terminal session into a shared HTML file for the umpteenth time when I decided to reduce the friction by building a custom tool for it - and on the spur of the moment I fired up Descript to record the process. The result is this new 11 minute YouTube video showing my workflow for vibe-coding simple tools from start to finish. The problem I wanted to solve involves sharing my Claude Code CLI sessions - and the more general problem of sharing interesting things that happen in my terminal. A while back I discovered (using my vibe-coded clipboard inspector ) that copying and pasting from the macOS terminal populates a rich text clipboard format which preserves the colors and general formatting of the terminal output. The problem is that format looks like this: This struck me as the kind of thing an LLM might be able to write code to parse, so I had ChatGPT take a crack at it and then later rewrote it from scratch with Claude Sonnet 4.5 . The result was this rtf-to-html tool which lets you paste in rich formatted text and gives you reasonably solid HTML that you can share elsewhere. To share that HTML I've started habitually pasting it into a GitHub Gist and then taking advantage of , a neat little unofficial tool that accepts and displays the gist content as a standalone HTML page... which means you can link to rendered HTML that's stored in a gist. So my process was: Not too much hassle, but frustratingly manual if you're doing it several times a day. Ideally I want a tool where I can do this: I decided to get Claude Code for web to build the entire thing. Here's the full prompt I used on claude.ai/code , pointed at my repo, to build the tool: It's quite a long prompt - it took me several minutes to type! But it covered the functionality I wanted in enough detail that I was pretty confident Claude would be able to build it. I'm using one key technique in this prompt: I'm referencing existing tools in the same repo and telling Claude to imitate their functionality. I first wrote about this trick last March in Running OCR against PDFs and images directly in your browser , where I described how a snippet of code that used PDF.js and another snippet that used Tesseract.js was enough for Claude 3 Opus to build me this working PDF OCR tool . That was actually the tool that kicked off my tools.simonwillison.net collection in the first place, which has since grown to 139 and counting. Here I'm telling Claude that I want the RTF to HTML functionality of rtf-to-html.html combined with the Gist saving functionality of openai-audio-output.html . That one has quite a bit going on. It uses the OpenAI audio API to generate audio output from a text prompt, which is returned by that API as base64-encoded data in JSON. Then it offers the user a button to save that JSON to a Gist, which gives the snippet a URL. Another tool I wrote, gpt-4o-audio-player.html , can then accept that Gist ID in the URL and will fetch the JSON data and make the audio playable in the browser. Here's an example . The trickiest part of this is API tokens. I've built tools in the past that require users to paste in a GitHub Personal Access Token (PAT) (which I then store in in their browser - I don't want other people's authentication credentials anywhere near my own servers). But that's a bit fiddly. Instead, I figured out the minimal Cloudflare worker necessary to implement the server-side portion of GitHub's authentication flow. That code lives here and means that any of the HTML+JavaScript tools in my collection can implement a GitHub authentication flow if they need to save Gists. But I don't have to tell the model any of that! I can just say "do the same trick that openai-audio-output.html does" and Claude Code will work the rest out for itself. Here's what the resulting app looks like after I've pasted in some terminal output from Claude Code CLI: It's exactly what I asked for, and the green-on-black terminal aesthetic is spot on too. There are a bunch of other things that I touch on in the video. Here's a quick summary: You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . The initial problem The problem I wanted to solve involves sharing my Claude Code CLI sessions - and the more general problem of sharing interesting things that happen in my terminal. A while back I discovered (using my vibe-coded clipboard inspector ) that copying and pasting from the macOS terminal populates a rich text clipboard format which preserves the colors and general formatting of the terminal output. The problem is that format looks like this: This struck me as the kind of thing an LLM might be able to write code to parse, so I had ChatGPT take a crack at it and then later rewrote it from scratch with Claude Sonnet 4.5 . The result was this rtf-to-html tool which lets you paste in rich formatted text and gives you reasonably solid HTML that you can share elsewhere. To share that HTML I've started habitually pasting it into a GitHub Gist and then taking advantage of , a neat little unofficial tool that accepts and displays the gist content as a standalone HTML page... which means you can link to rendered HTML that's stored in a gist. So my process was: Copy terminal output Paste into rtf-to-html Copy resulting HTML Paste that int a new GitHub Gist Grab that Gist's ID Share the link to Copy terminal output Paste into a new tool Click a button and get a link to share tools.simonwillison.net/colophon is the list of all of my tools, with accompanying AI-generated descriptions. Here's more about how I built that with Claude Code and notes on how I added the AI-generated descriptions . gistpreview.github.io is really neat. I used Descript to record and edit the video. I'm still getting the hang of it - hence the slightly clumsy pan-and-zoom - but it's pretty great for this kind of screen recording. The site's automated deploys are managed by this GitHub Actions workflow . I also have it configured to work with Cloudflare Pages for those preview deployments from PRs (here's an example ). The automated documentation is created using my llm tool and llm-anthropic plugin. Here's the script that does that , recently upgraded to use Claude Haiku 4.5.

0 views
Simon Willison 1 months ago

Dane Stuckey (OpenAI CISO) on prompt injection risks for ChatGPT Atlas

My biggest complaint about the launch of the ChatGPT Atlas browser the other day was the lack of details on how OpenAI are addressing prompt injection attacks. The launch post mostly punted that question to the System Card for their "ChatGPT agent" browser automation feature from July. Since this was my single biggest question about Atlas I was disappointed not to see it addressed more directly. OpenAI's Chief Information Security Officer Dane Stuckey just posted the most detail I've seen yet in a lengthy Twitter post . I'll quote from his post here (with my emphasis in bold) and add my own commentary. He addresses the issue directly by name, with a good single-sentence explanation of the problem: One emerging risk we are very thoughtfully researching and mitigating is prompt injections, where attackers hide malicious instructions in websites, emails, or other sources, to try to trick the agent into behaving in unintended ways . The objective for attackers can be as simple as trying to bias the agent’s opinion while shopping, or as consequential as an attacker trying to get the agent to fetch and leak private data , such as sensitive information from your email, or credentials. We saw examples of browser agents from other vendors leaking private data in this way identified by the Brave security team just yesterday . Our long-term goal is that you should be able to trust ChatGPT agent to use your browser, the same way you’d trust your most competent, trustworthy, and security-aware colleague or friend. This is an interesting way to frame the eventual goal, describing an extraordinary level of trust and competence. As always, a big difference between AI systems and a human is that an AI system cannot be held accountable for its actions . I'll let my trusted friend use my logged-in browser only because there are social consequences if they abuse that trust! We’re working hard to achieve that. For this launch, we’ve performed extensive red-teaming, implemented novel model training techniques to reward the model for ignoring malicious instructions, implemented overlapping guardrails and safety measures , and added new systems to detect and block such attacks. However, prompt injection remains a frontier, unsolved security problem, and our adversaries will spend significant time and resources to find ways to make ChatGPT agent fall for these attacks . I'm glad to see OpenAI's CISO openly acknowledging that prompt injection remains an unsolved security problem (three years after we started talking about it !). That "adversaries will spend significant time and resources" thing is the root of why I don't see guardrails and safety measures as providing a credible solution to this problem. As I've written before, in application security 99% is a failing grade . If there's a way to get past the guardrails, no matter how obscure, a motivated adversarial attacker is going to figure that out. Dane goes on to describe some of those measures: To protect our users, and to help improve our models against these attacks: I like this a lot. OpenAI have an advantage here of being a centralized system - they can monitor their entire user base for signs of new attack patterns. It's still bad news for users that get caught out by a zero-day prompt injection, but it does at least mean that successful new attack patterns should have a small window of opportunity. "Defense in depth" always sounds good, but it worries me that it's setting up a false sense of security here. If it's harder but still possible someone is going to get through. Logged out mode is very smart, and is already a tried and tested pattern. I frequently have Claude Code or Codex CLI fire up Playwright to interact with websites, safe in the knowledge that they won't have access to my logged-in sessions. ChatGPT's existing agent mode provides a similar capability. Logged in mode is where things get scary, especially since we're delegating security decisions to end-users of the software. We've demonstrated many times over that this is an unfair burden to place on almost any user. This detail is new to me: I need to spend more time with ChatGPT Atlas to see what it looks like in practice. I tried just now using both GitHub and an online banking site and neither of them seemed to trigger "watch mode" - Atlas continued to navigate even when I had switched to another application. Watch mode sounds reasonable in theory - similar to a driver-assisted car that requires you to keep your hands on the wheel - but I'd like to see it in action before I count it as a meaningful mitigation. Dane closes with an analogy to computer viruses: New levels of intelligence and capability require the technology, society, the risk mitigation strategy to co-evolve. And as with computer viruses in the early 2000s, we think it’s important for everyone to understand responsible usage , including thinking about prompt injection attacks, so we can all learn to benefit from this technology safely. I don't think the average computer user ever really got the hang of staying clear of computer viruses... we're still fighting that battle today, albeit much more successfully on mobile platforms that implement tight restrictions on what software can do. My takeaways from all of this? It's not done much to influence my overall skepticism of the entire category of browser agents, but it does at least demonstrate that OpenAI are keenly aware of the problems and are investing serious effort in finding the right mix of protections. How well those protections work is something I expect will become clear over the next few months. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . We’ve prioritized rapid response systems to help us quickly identify block attack campaigns as we become aware of them. We are also continuing to invest heavily in security, privacy, and safety - including research to improve the robustness of our models, security monitors, infrastructure security controls, and other techniques to help prevent these attacks via defense in depth . We’ve designed Atlas to give you controls to help protect yourself. We have added a feature to allow ChatGPT agent to take action on your behalf, but without access to your credentials called “logged out mode” . We recommend this mode when you don’t need to take action within your accounts. Today, we think “logged in mode” is most appropriate for well-scoped actions on very trusted sites, where the risks of prompt injection are lower . Asking it to add ingredients to a shopping cart is generally safer than a broad or vague request like “review my emails and take whatever actions are needed.” When agent is operating on sensitive sites, we have also implemented a "Watch Mode" that alerts you to the sensitive nature of the site and requires you have the tab active to watch the agent do its work . Agent will pause if you move away from the tab with sensitive information. This ensures you stay aware - and in control - of what agent actions the agent is performing. [...]

1 views
Simon Willison 1 months ago

Living dangerously with Claude

I gave a talk last night at Claude Code Anonymous in San Francisco, the unofficial meetup for coding agent enthusiasts. I decided to talk about a dichotomy I've been struggling with recently. On the one hand I'm getting enormous value from running coding agents with as few restrictions as possible. On the other hand I'm deeply concerned by the risks that accompany that freedom. Below is a copy of my slides, plus additional notes and links as an annotated presentation . I'm going to be talking about two things this evening... Why you should always use . (This got a cheer from the room full of Claude Code enthusiasts.) And why you should never use . (This did not get a cheer.) is a bit of a mouthful, so I'm going to use its better name, "YOLO mode", for the rest of this presentation. Claude Code running in this mode genuinely feels like a completely different product from regular, default Claude Code. The default mode requires you to pay constant attention to it, tracking everything it does and actively approving changes and actions every few steps. In YOLO mode you can leave Claude alone to solve all manner of hairy problems while you go and do something else entirely. I have a suspicion that many people who don't appreciate the value of coding agents have never experienced YOLO mode in all of its glory. I'll show you three projects I completed with YOLO mode in just the past 48 hours. I wrote about this one at length in Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code . I wanted to try the newly released DeepSeek-OCR model on an NVIDIA Spark, but doing so requires figuring out how to run a model using PyTorch and CUDA, which is never easy and is a whole lot harder on an ARM64 device. I SSHd into the Spark, started a fresh Docker container and told Claude Code to figure it out. It took 40 minutes and three additional prompts but it solved the problem , and I got to have breakfast and tinker with some other projects while it was working. This project started out in Claude Code for the web . I'm eternally interested in options for running server-side Python code inside a WebAssembly sandbox, for all kinds of reasons. I decided to see if the Claude iPhone app could launch a task to figure it out. I wanted to see how hard it was to do that using Pyodide running directly in Node.js. Claude Code got it working and built and tested this demo script showing how to do it. I started a new simonw/research repository to store the results of these experiments, each one in a separate folder. It's up to 5 completed research projects already and I created it less than 2 days ago. Here's my favorite, a project from just this morning. I decided I wanted to try out SLOCCount , a 2001-era Perl tool for counting lines of code and estimating the cost to develop them using 2001 USA developer salaries. .. but I didn't want to run Perl, so I decided to have Claude Code (for web, and later on my laptop) try and figure out how to run Perl scripts in WebAssembly. TLDR: it got there in the end ! It turned out some of the supporting scripts in SLOCCount were written in C, so it had to compile those to WebAssembly as well. And now tools.simonwillison.net/sloccount is a browser-based app which runs 25-year-old Perl+C in WebAssembly against pasted code, GitHub repository references and even zip files full of code. The wild thing is that all three of these projects weren't even a priority for me - they were side quests, representing pure curiosity that I could outsource to Claude Code and solve in the background while I was occupied with something else. I got a lot of useful work done in parallel to these three flights of fancy. But there's a reason has that scary name. It's dangerous to use Claude Code (and other coding agents) in this way! The reason for this is prompt injection , a term I coined three years ago to describe a class of attacks against LLMs that take advantage of the way untrusted content is concatenated together with trusted instructions. (It's named after SQL injection which shares a similar shape.) This remains an incredibly common vulnerability. Here's a great example of a prompt injection attack against a coding agent, described by Johann Rehberger as part of his Month of AI Bugs , sharing a new prompt injection report every day for the month of August. If a coding agent - in this case OpenHands - reads this file it can be tricked into grepping the available environment variables for (matching GitHub Personal Access Tokens) and sending that to the attacker's external server for "help debugging these variables". I coined another term to try and describe a common subset of prompt injection attacks: the lethal trifecta . Any time an LLM system combines access to private data with exposure to untrusted content and the ability to externally communicate , there's an opportunity for attackers to trick the system into leaking that private data back to them. These attacks are incredibly common . If you're running YOLO coding agents with access to private source code or secrets (like API keys in environment variables) you need to be concerned about the potential of these attacks. This is the fundamental rule of prompt injection: anyone who can get their tokens into your context should be considered to have full control over what your agent does next, including the tools that it calls. Some people will try to convince you that prompt injection attacks can be solved using more AI to detect the attacks. This does not work 100% reliably, which means it's not a useful security defense at all . The only solution that's credible is to run coding agents in a sandbox . The best sandboxes are the ones that run on someone else's computer! That way the worst that can happen is someone else's computer getting owned. You still need to worry about your source code getting leaked. Most of my stuff is open source anyway, and a lot of the code I have agents working on is research code with no proprietary secrets. If your code really is sensitive you need to consider network restrictions more carefully, as discussed in a few slides. There are lots of great sandboxes that run on other people's computers. OpenAI Codex Cloud, Claude Code for the web, Gemini Jules are all excellent solutions for this. I also really like the code interpreter features baked into the ChatGPT and Claude consumer apps. There are two problems to consider with sandboxing. The first is easy: you need to control what files can be read and written on the filesystem. The second is much harder: controlling the network connections that can be made by code running inside the agent. The reason network access is so important is that it represents the data exfiltration leg of the lethal trifecta. If you can prevent external communication back to an attacker they can't steal your private information, even if they manage to sneak in their own malicious instructions. Claude Code CLI grew a new sandboxing feature just yesterday, and Anthropic released an a new open source library showing how it works. The key to the implementation - at least on macOS - is Apple's little known but powerful command. This provides a way to run any command in a sandbox configured by a policy document. Those policies can control which files are visible but can also allow-list network connections. Anthropic run an HTTP proxy and allow the Claude Code environment to talk to that, then use the proxy to control which domains it can communicate with. (I used Claude itself to synthesize this example from Anthropic's codebase.) ... the bad news is that has been marked as deprecated in Apple's documentation since at least 2017! It's used by Codex CLI too, and is still the most convenient way to run a sandbox on a Mac. I'm hoping Apple will reconsider. So go forth and live dangerously! (But do it in a sandbox.) You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Simon Willison 1 months ago

Claude Code for web - a new asynchronous coding agent from Anthropic

Anthropic launched Claude Code for web this morning. It's an asynchronous coding agent - their answer to OpenAI's Codex Cloud and Google's Jules , and has a very similar shape. I had preview access over the weekend and I've already seen some very promising results from it. It's available online at claude.ai/code and shows up as a tab in the Claude iPhone app as well: As far as I can tell it's their latest Claude Code CLI app wrapped in a container (Anthropic are getting really good at containers these days) and configured to . It appears to behave exactly the same as the CLI tool, and includes a neat "teleport" feature which can copy both the chat transcript and the edited files down to your local Claude Code CLI tool if you want to take over locally. It's very straight-forward to use. You point Claude Code for web at a GitHub repository, select an environment (fully locked down, restricted to an allow-list of domains or configured to access domains of your choosing, including "*" for everything) and kick it off with a prompt. While it's running you can send it additional prompts which are queued up and executed after it completes its current step. Once it's done it opens a branch on your repo with its work and can optionally open a pull request. Claude Code for web's PRs are indistinguishable from Claude Code CLI's, so Anthropic told me it was OK to submit those against public repos even during the private preview. Here are some examples from this weekend: That second example is the most interesting. I saw a tweet from Armin about his MiniJinja Rust template language adding support for Python 3.14 free threading. I hadn't realized that project had Python bindings, so I decided it would be interesting to see a quick performance comparison between MiniJinja and Jinja2. I ran Claude Code for web against a private repository with a completely open environment ( in the allow-list) and prompted: I’m interested in benchmarking the Python bindings for https://github.com/mitsuhiko/minijinja against the equivalente template using Python jinja2 Design and implement a benchmark for this. It should use the latest main checkout of minijinja and the latest stable release of jinja2. The benchmark should use the uv version of Python 3.14 and should test both the regular 3.14 and the 3.14t free threaded version - so four scenarios total The benchmark should run against a reasonably complicated example of a template, using template inheritance and loops and such like In the PR include a shell script to run the entire benchmark, plus benchmark implantation, plus markdown file describing the benchmark and the results in detail, plus some illustrative charts created using matplotlib I entered this into the Claude iPhone app on my mobile keyboard, hence the typos. It churned away for a few minutes and gave me exactly what I asked for. Here's one of the four charts it created: (I was surprised to see MiniJinja out-performed by Jinja2, but I guess Jinja2 has had a decade of clever performance optimizations and doesn't need to deal with any extra overhead of calling out to Rust.) Note that I would likely have got the exact same result running this prompt against Claude CLI on my laptop. The benefit of Claude Code for web is entirely in its convenience as a way of running these tasks in a hosted container managed by Anthropic, with a pleasant web and mobile UI layered over the top. It's interesting how Anthropic chose to announce this new feature: the product launch is buried half way down their new engineering blog post Beyond permission prompts: making Claude Code more secure and autonomous , which starts like this: Claude Code's new sandboxing features, a bash tool and Claude Code on the web, reduce permission prompts and increase user safety by enabling two boundaries: filesystem and network isolation. I'm very excited to hear that Claude Code CLI is taking sandboxing more seriously. I've not yet dug into the details of that - it looks like it's using seatbelt on macOS and Bubblewrap on Linux. Anthropic released a new open source (Apache 2) library, anthropic-experimental/sandbox-runtime , with their implementation of this so far. Filesystem sandboxing is relatively easy. The harder problem is network isolation, which they describe like this: Network isolation , by only allowing internet access through a unix domain socket connected to a proxy server running outside the sandbox. This proxy server enforces restrictions on the domains that a process can connect to, and handles user confirmation for newly requested domains. And if you’d like further-increased security, we also support customizing this proxy to enforce arbitrary rules on outgoing traffic. This is crucial to protecting against both prompt injection and lethal trifecta attacks. The best way to prevent lethal trifecta attacks is to cut off one of the three legs, and network isolation is how you remove the data exfiltration leg that allows successful attackers to steal your data. If you run Claude Code for web in "No network access" mode you have nothing to worry about. I'm a little bit nervous about their "Trusted network access" environment. It's intended to only allow access to domains relating to dependency installation, but the default domain list has dozens of entries which makes me nervous about unintended exfiltration vectors sneaking through. You can also configure a custom environment with your own allow-list. I have one called "Everything" which allow-lists "*", because for projects like my MiniJinja/Jinja2 comparison above there are no secrets or source code involved that need protecting. I see Anthropic's focus on sandboxes as an acknowledgment that coding agents run in YOLO mode ( and the like) are enormously more valuable and productive than agents where you have to approve their every step. The challenge is making it convenient and easy to run them safely. This kind of sandboxing kind is the only approach to safety that feels credible to me. Update : A note on cost: I'm currently using a Claude "Max" plan that Anthropic gave me in order to test some of their features, so I don't have a good feeling for how Claude Code would cost for these kinds of projects. From running (an unofficial cost estimate tool ) it looks like I'm using between $1 and $5 worth of daily Claude CLI invocations at the moment. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Add query-string-stripper.html tool against my simonw/tools repo - a very simple task that creates (and deployed via GitHub Pages) this query-string-stripper tool. minijinja vs jinja2 Performance Benchmark - I ran this against a private repo and then copied the results here, so no PR. Here's the prompt I used. Update deepseek-ocr README to reflect successful project completion - I noticed that the README produced by Claude Code CLI for this project was misleadingly out of date, so I had Claude Code for web fix the problem.

0 views
Simon Willison 1 months ago

Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code

DeepSeek released a new model yesterday: DeepSeek-OCR , a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running on the NVIDIA Spark by having Claude Code effectively brute force the challenge of getting it working on that particular hardware. This small project (40 minutes this morning, most of which was Claude Code churning away while I had breakfast and did some other things) ties together a bunch of different concepts I've been exploring recently. I designed an agentic loop for the problem, gave Claude full permissions inside a Docker sandbox, embraced the parallel agents lifestyle and reused my notes on the NVIDIA Spark from last week. I knew getting a PyTorch CUDA model running on the Spark was going to be a little frustrating, so I decided to outsource the entire process to Claude Code to see what would happen. TLDR: It worked. It took four prompts (one long, three very short) to have Claude Code figure out everything necessary to run the new DeepSeek model on the NVIDIA Spark, OCR a document for me and produce copious notes about the process. I connected to the Spark from my Mac via SSH and started a new Docker container there: Then I installed npm and used that to install Claude Code: Then started Claude Code, telling it that it's OK that it's running as because it's in a sandbox: It provided me a URL to click on to authenticate with my Anthropic account. I kicked things off with this prompt: Create a folder deepseek-ocr and do everything else in that folder Then I ran the following, providing links to both the GitHub repository and the Hugging Face model, providing a clue about NVIDIA ARM and giving it an image ( this one , see previous post ) that I wanted it to run OCR on. Your task is to get this working: https://github.com/deepseek-ai/DeepSeek-OCR - it uses Hugging Face Transformers and models from https://huggingface.co/deepseek-ai/DeepSeek-OCR - you are running in a Docker container on an NVIDIA ARM device, first research your environment to check it will be possible to do this. Git clone both of those - you may need to enable LFS first. Make extensive notes as you go in a notes.md file which you should only ever append new notes to. At the end of this you should have scripts in the folder which can be run to both setup and then run the model against an example - use https://static.simonwillison.net/static/2025/ft.jpeg as the example image that you OCR. As a final step create a detailed README.md file explaining the project and what you learned and how to run your code. Note how I also told it to "make extensive notes as you go in a notes.md file which you should only ever append new notes to" so I'd have some useful notes at the end. It churned away for 13 minutes, and made a sequence of discoveries about the platform and the model and the difficulties of running PyTorch on it. And then... it gave up : The NVIDIA GB10 GPU has compute capability sm_121, which is too new for PyTorch 2.5.1 Additionally, the model has hardcoded calls, preventing CPU fallback without code modifications. So I gave it some encouragement: Any other options that might help get CUDA working like using different versions of PyTorch that have available CUDA wheels for this platform? Have you checked to see what PyTorch versions already have these ARM CUDA wheels that would work here? It ran this command to grep for all available PyTorch wheels on that match the ARM architecture: ... and that unblocked it. Excellent! I can see that PyTorch 2.9.0 has ARM64 wheels for CUDA 12.8, 12.9, and CUDA 13.0 (cu130)! This is exactly what we need! It installed the wheels it found and reported: Good news! PyTorch 2.9.0 can detect the GB10 GPU with CUDA capability 12.1 (sm_121), but it says the maximum supported is 12.0. However, it's just a warning now instead of a hard error! Let's try running inference: [...] Then kept on churning for a bit, ignoring warnings, and managed to run the model ! It loaded the image and ran this prompt: And got back output that started like this: Claude declared victory and pointed me to the output/result.mmd file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk. That result.mmd file looks to be exclusively whitespace to me, try and get a better result there - the bounding box image was good though Which caused Claude to reconsider the prompt it used: Maybe we need to use a different prompt. Looking at the README from DeepSeek-OCR, there are different prompts: The "grounding" mode seems to be focused on bounding boxes, which is why we got good bounding box results but the text output might be in a different format. Let me try running it again with a different prompt that's more focused on text extraction rather than grounding. It tried all of the prompt variants, made a note about them and created a file called PROMPTS_GUIDE.md with a bunch more details, including this comparison table: And this table under the heading "Performance Benchmarks": Test image: 3503×1668 pixels (Financial Times article) My final prompt was this, to gather everything together into a zip file I could extract from the Docker container: Create a zip file with the output and output_text and all of the scripts and notes - but leave out the github repo and the huggingface repo directories I added the contents of that zip file to my new simonw/research GitHub repo in the deepseek-ocr-nvidia-spark folder. Claude really likes writing notes! Here's the directory listing of that finished folder: My first prompt was at 15:31:07 (UTC). The final message from Claude Code came in at 16:10:03. That means it took less than 40 minutes start to finish, and I was only actively involved for about 5-10 minutes of that time. The rest of the time I was having breakfast and doing other things. Having tried and failed to get PyTorch stuff working in the past, I count this as a huge win. I'll be using this process a whole lot more in the future. How good were the actual results? There's honestly so much material in the resulting notes created by Claude that I haven't reviewed all of it. There may well be all sorts of errors in there, but it's indisputable that it managed to run the model and made notes on how it did that such that I'll be able to do the same thing in the future. I think the key factors in executing this project successfully were the following: Oh, and it looks like DeepSeek OCR is a pretty good model if you spend the time experimenting with different ways to run it. A small TIL from today: I had kicked off the job running in the Docker container via SSH to the Spark when I realized it would be neat if I could easily monitor the files it was creating while it was running. I asked Claude.ai : I am running a Docker container on a remote machine, which I started over SSH How can I have my local VS Code on MacOS show me the filesystem in that docker container inside that remote machine, without restarting anything? It gave me a set of steps that solved this exact problem: At the end when I told Claude to create a zip file of the results I could select that in the VS Code file explorer and use the "Download" menu item to download it to my Mac. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . PyTorch 2.5.1 supports: sm_50, sm_80, sm_86, sm_89, sm_90, sm_90a GB10 requires: sm_121 - for documents - general OCR - without layouts I gave it exactly what it needed: a Docker environment in the target hardware, instructions on where to get what it needed (the code and the model) and a clear goal for it to pursue. This is a great example of the pattern I described in designing agentic loops . Running it in a Docker sandbox meant I could use and leave it running on its own. If I'd had to approve every command it wanted to run I would have got frustrated and quit the project after just a few minutes. I applied my own knowledge and experience when it got stuck. I was confident (based on previous experiments with the Spark) that a CUDA wheel for ARM64 existed that was likely to work, so when it gave up I prompted it to try again, leading to success. Install the VS Code "Remote SSH" and "Dev Containers" extensions Use "Remote-SSH: Connect to Host" to connect to the remote machine (on my Tailscale network that's ) In the window for that remote SSH session, run "Dev Containers: Attach to Running Container" - this shows a list of containers and you can select the one you want to attach to ... and that's it! VS Code opens a new window providing full access to all of the files in that container. I opened up and watched it as Claude Code appended to it in real time.

0 views
Simon Willison 1 months ago

Claude Skills are awesome, maybe a bigger deal than MCP

Anthropic this morning introduced Claude Skills , a new pattern for making new abilities available to their models: Claude can now use Skills to improve how it performs specific tasks. Skills are folders that include instructions, scripts, and resources that Claude can load when needed. Claude will only access a skill when it's relevant to the task at hand. When used, skills make Claude better at specialized tasks like working with Excel or following your organization's brand guidelines. Their engineering blog has a more detailed explanation . There's also a new anthropic/skills GitHub repo. (I inadvertently preempted their announcement of this feature when I reverse engineered and wrote about it last Friday !) Skills are conceptually extremely simple: a skill is a Markdown file telling the model how to do something, optionally accompanied by extra documents and pre-written scripts that the model can run to help it accomplish the tasks described by the skill. Claude's new document creation abilities , which accompanied their new code interpreter feature in September, turned out to be entirely implemented using skills. Those are now available Anthropic's repo covering , , , and files. There's one extra detail that makes this a feature, not just a bunch of files in disk. At the start of a session Claude's various harnesses can scan all available skill files and read a short explanation for each one from the frontmatter YAML in the Markdown file. This is very token efficient: each skill only takes up a few dozen extra tokens, with the full details only loaded in should the user request a task that the skill can help solve. Here's that metadata for an example slack-gif-creator skill that Anthropic published this morning: Toolkit for creating animated GIFs optimized for Slack, with validators for size constraints and composable animation primitives. This skill applies when users request animated GIFs or emoji animations for Slack from descriptions like "make me a GIF for Slack of X doing Y". I just tried this skill out in the Claude mobile web app, against Sonnet 4.5. First I enabled the slack-gif-creator skill in the settings , then I prompted: And Claude made me this GIF . Click to play (it's almost epilepsy inducing, hence the click-to-play mechanism): OK, this particular GIF is terrible, but the great thing about skills is that they're very easy to iterate on to make them better. Here are some noteworthy snippets from the Python script it wrote , comments mine: This is pretty neat. Slack GIFs need to be a maximum of 2MB, so the skill includes a validation function which the model can use to check the file size. If it's too large the model can have another go at making it smaller. The skills mechanism is entirely dependent on the model having access to a filesystem, tools to navigate it and the ability to execute commands in that environment. This is a common pattern for LLM tooling these days - ChatGPT Code Interpreter was the first big example of this back in early 2023 , and the pattern later extended to local machines via coding agent tools such as Cursor, Claude Code, Codex CLI and Gemini CLI. This requirement is the biggest difference between skills and other previous attempts at expanding the abilities of LLMs, such as MCP and ChatGPT Plugins . It's a significant dependency, but it's somewhat bewildering how much new capability it unlocks. The fact that skills are so powerful and simple to create is yet another argument in favor of making safe coding environments available to LLMs. The word safe there is doing a lot of work though! We really need to figure out how best to sandbox these environments such that attacks such as prompt injections are limited to an acceptable amount of damage. Back in January I made some foolhardy predictions about AI/LLMs , including that "agents" would once again fail to happen: I think we are going to see a lot more froth about agents in 2025, but I expect the results will be a great disappointment to most of the people who are excited about this term. I expect a lot of money will be lost chasing after several different poorly defined dreams that share that name. I was entirely wrong about that. 2025 really has been the year of "agents", no matter which of the many conflicting definitions you decide to use (I eventually settled on " tools in a loop "). Claude Code is, with hindsight, poorly named. It's not purely a coding tool: it's a tool for general computer automation. Anything you can achieve by typing commands into a computer is something that can now be automated by Claude Code. It's best described as a general agent . Skills make this a whole lot more obvious and explicit. I find the potential applications of this trick somewhat dizzying. Just thinking about this with my data journalism hat on: imagine a folder full of skills that covers tasks like the following: Congratulations, you just built a "data journalism agent" that can discover and help publish stories against fresh drops of US census data. And you did it with a folder full of Markdown files and maybe a couple of example Python scripts. Model Context Protocol has attracted an enormous amount of buzz since its initial release back in November last year . I like to joke that one of the reasons it took off is that every company knew they needed an "AI strategy", and building (or announcing) an MCP implementation was an easy way to tick that box. Over time the limitations of MCP have started to emerge. The most significant is in terms of token usage: GitHub's official MCP on its own famously consumes tens of thousands of tokens of context, and once you've added a few more to that there's precious little space left for the LLM to actually do useful work. My own interest in MCPs has waned ever since I started taking coding agents seriously. Almost everything I might achieve with an MCP can be handled by a CLI tool instead. LLMs know how to call , which means you don't have to spend many tokens describing how to use them - the model can figure it out later when it needs to. Skills have exactly the same advantage, only now I don't even need to implement a new CLI tool. I can drop a Markdown file in describing how to do a task instead, adding extra scripts only if they'll help make things more reliable or efficient. One of the most exciting things about Skills is how easy they are to share. I expect many skills will be implemented as a single file - more sophisticated ones will be a folder with a few more. Anthropic have Agent Skills documentation and a Claude Skills Cookbook . I'm already thinking through ideas of skills I might build myself, like one on how to build Datasette plugins . Something else I love about the design of skills is there is nothing at all preventing them from being used with other models. You can grab a skills folder right now, point Codex CLI or Gemini CLI at it and say "read pdf/SKILL.md and then create me a PDF describing this project" and it will work, despite those tools and models having no baked in knowledge of the skills system. I expect we'll see a Cambrian explosion in Skills which will make this year's MCP rush look pedestrian by comparison. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Trying out the slack-gif-creator skill Skills depend on a coding environment Claude as a General Agent Skills compared to MCP Here come the Skills Where to get US census data from and how to understand its structure How to load data from different formats into SQLite or DuckDB using appropriate Python libraries How to publish data online, as Parquet files in S3 or pushed as tables to Datasette Cloud A skill defined by an experienced data reporter talking about how best to find the interesting stories in a new set of data A skill that describes how to build clean, readable data visualizations using D3

0 views