Posts in AI (20 found)

Meta Compute, The Meta-OpenAI Battle, The Reality Labs Sacrifice

Mark Zuckerberg announced Meta Compute, a bet that winning in AI means winning with infrastructure; this, however, means retreating from Reality Labs.

0 views
Kaushik Gopal Yesterday

AI model choices 2026-01

Which AI model do I use? This is a common question I get asked, but models evolve so rapidly that I never felt like I could give an answer that would stay relevant for more than a month or two. This year, I finally feel like I have a stable set of model choices that consistently give me good results. I’m jotting it down here to share more broadly and to trace how my own choices evolve over time.

- GPT 5.2 (High) for planning and writing, including plans
- Opus 4.5 for anything coding, task automation, and tool calling
- Gemini’s range of models for everything else:
  - Gemini 3 (Thinking) for learning and understanding concepts (underrated)
  - Gemini 3 (Flash) for quick fire questions
  - Nano Banana (obv) for image generation
- NVIDIA’s Parakeet for voice transcription

0 views

The Insecure Evangelism of LLM Maximalists

I just can't help feeling these training wheels are getting in the way of my bicycle commute.

0 views
Stratechery Yesterday

Apple and Gemini, Foundation vs. Aggregation, Universal Commerce Protocol

The deal to put Gemini at the heart of Siri is official, and it makes sense for both sides; then Google runs its classic playbook with Universal Commerce Protocol.

0 views
Simon Willison 2 days ago

First impressions of Claude Cowork, Anthropic's general agent

New from Anthropic today is Claude Cowork , a "research preview" that they describe as "Claude Code for the rest of your work". It's currently available only to Max subscribers ($100 or $200 per month plans) as part of the updated Claude Desktop macOS application. I've been saying for a while now that Claude Code is a "general agent" disguised as a developer tool. It can help you with any computer task that can be achieved by executing code or running terminal commands... which covers almost anything, provided you know what you're doing with it! What it really needs is a UI that doesn't involve the terminal and a name that doesn't scare away non-developers. "Cowork" is a pretty solid choice on the name front! The interface for Cowork is a new tab in the Claude desktop app, called Cowork. It sits next to the existing Chat and Code tabs. It looks very similar to the desktop interface for regular Claude Code. You start with a prompt, optionally attaching a folder of files. It then starts work. I tried it out against my perpetually growing "blog-drafts" folder with the following prompt: Look at my drafts that were started within the last three months and then check that I didn't publish them on simonwillison.net using a search against content on that site and then suggest the ones that are most close to being ready It started by running this command: That path instantly caught my eye. Anthropic say that Cowork can only access files you grant it access to - it looks to me like they're mounting those files into a containerized environment, which should mean we can trust Cowork not to be able to access anything outside of that sandbox. It turns out I have 46 draft files from the past three months. Claude then went to work with its search tool, running 44 individual searches against to figure out which of my drafts had already been published. Here's the eventual reply: Based on my analysis, here are your unpublished drafts that appear closest to being ready for publication : 🔥 Most Ready to Publish (substantial content, not yet published) That's a good response! It found exactly what I needed to see, although those upgrade instructions are actually published elsewhere now ( in the Datasette docs ) and weren't actually intended for my blog. Just for fun, and because I really like artifacts , I asked for a follow-up: Make me an artifact with exciting animated encouragements to get me to do it Here's what I got: I couldn't figure out how to close the right sidebar so the artifact ended up cramped into a thin column but it did work. I expect Anthropic will fix that display bug pretty quickly. I've seen a few people ask what the difference between this and regular Claude Code is. The answer is not a lot . As far as I can tell Claude Cowork is regular Claude Code wrapped in a less intimidating default interface and with a filesystem sandbox configured for you without you needing to know what a "filesystem sandbox" is. Update : It's more than just a filesystem sandbox - I had Claude Code reverse engineer the Claude app and it found out that Claude uses VZVirtualMachine - the Apple Virtualization Framework - and downloads and boots a custom Linux root filesystem. I think that's a really smart product. Claude Code has an enormous amount of value that hasn't yet been unlocked for a general audience, and this seems like a pragmatic approach. With a feature like this, my first thought always jumps straight to security. 
How big is the risk that someone using this might be hit by hidden malicious instructions somewhere that break their computer or steal their data? Anthropic touch on that directly in the announcement: You should also be aware of the risk of "prompt injections": attempts by attackers to alter Claude's plans through content it might encounter on the internet. We've built sophisticated defenses against prompt injections, but agent safety---that is, the task of securing Claude's real-world actions---is still an active area of development in the industry. These risks aren't new with Cowork, but it might be the first time you're using a more advanced tool that moves beyond a simple conversation. We recommend taking precautions, particularly while you learn how it works. We provide more detail in our Help Center. That help page includes the following tips (see "To minimize risks" at the end of this post). I do not think it is fair to tell regular non-programmer users to watch out for "suspicious actions that may indicate prompt injection"! I'm sure they have some impressive mitigations going on behind the scenes. I recently learned that the summarization applied by the WebFetch function in Claude Code and now in Cowork is partly intended as a prompt injection protection layer via this tweet from Claude Code creator Boris Cherny: Summarization is one thing we do to reduce prompt injection risk. Are you running into specific issues with it? But Anthropic are being honest here with their warnings: they can attempt to filter out potential attacks all they like but the one thing they can't provide is guarantees that no future attack will be found that sneaks through their defenses and steals your data (see the lethal trifecta for more on this). The problem with prompt injection remains that until there's a high profile incident it's really hard to get people to take it seriously. I myself have all sorts of Claude Code usage that could cause havoc if a malicious injection got in. Cowork does at least run in a filesystem sandbox by default, which is more than can be said for my habit! I wrote more about this in my 2025 round-up: The year of YOLO and the Normalization of Deviance. Security worries aside, Cowork represents something really interesting. This is a general agent that looks well positioned to bring the wildly powerful capabilities of Claude Code to a wider audience. I would be very surprised if Gemini and OpenAI don't follow suit with their own offerings in this category. I imagine OpenAI are already regretting burning the name "ChatGPT Agent" on their janky, experimental and mostly forgotten browser automation tool back in August! bashtoni on Hacker News: Simple suggestion: logo should be a cow and an orc to match how I originally read the product name. I couldn't resist throwing that one at Nano Banana:
- "Frequently Argued Questions about LLMs" (22,602 bytes) This is a meaty piece documenting common arguments about LLMs with your counterpoints Well-structured with a TL;DR and multiple sections No matching published article found on your site Very close to ready - just needs a final review pass - "Claude Code Timeline and Codex Timeline" (3,075 bytes) About viewing JSONL session logs from Claude Code and Codex You published on Dec 25, but this appears to be a different/earlier piece about timeline viewing tools Shorter but seems complete - Plugin Upgrade Guide (3,147 bytes) Technical guide for plugin authors You published the main 1.0a20 announcement but this companion upgrade guide appears unpublished Would be valuable for plugin maintainers Avoid granting access to local files with sensitive information, like financial documents. When using the Claude in Chrome extension, limit access to trusted sites. If you chose to extend Claude’s default internet access settings, be careful to only extend internet access to sites you trust. Monitor Claude for suspicious actions that may indicate prompt injection.

1 views
Manuel Moreale 2 days ago

A moment with tea

Learning to appreciate different flavors is something that comes very hard for me. And yet, for some reason, tea is one of those things that no matter how hard it is for my tastebuds, I’ll constantly come back to. Thank you for keeping RSS alive. You're awesome.

0 views
Rob Zolkos 2 days ago

So where can we use our Claude subscription then?

There’s been confusion about where we can actually use a Claude subscription. This comes after Anthropic took action to prevent third-party applications from spoofing the Claude Code harness to use Claude subscriptions. The information in this post is based on my understanding from reading various tweets, official GitHub repos and documentation (some of which may or may not be up to date). I will endeavour to keep it up to date as new information becomes available. I would love to see Anthropic themselves maintain an easily parsable page like this that shows what is and is not permitted with a Claude subscription. We've taken action to prevent third-party clients from spoofing the Claude Code agent harness to use consumer subscriptions. Consumer subscriptions and their benefits should only be used in the Anthropic experiences they support (Claude Code CLI, Claude Code web, and via sessionKey in the Agent SDK). Third-party apps can use the API. From what I can gather, consumer subscriptions work with official Anthropic tools, not third-party applications. If you want third-party integrations, you need the API. The consumer applications (desktop and mobile) are the most straightforward way to use your Claude subscription. Available at claude.com/download, these apps give you direct access to Claude for conversation, file uploads, and Projects. The official command-line interface for Claude Code is fully supported with Claude subscriptions. This is the tool Anthropic built and maintains specifically for developers who want to use Claude in their development workflow. You get the full power of Claude integrated into your terminal, with access to your entire codebase, the ability to execute commands, read and write files, and use all the specialized agents that come with Claude Code. The web version of Claude Code (accessible through your browser at claude.ai/code) provides the same capabilities as the CLI but through a browser interface. Upload your project files, or point it at a repository, and you can work with Claude on your codebase directly. Want to experiment with building custom agents? The Claude Agent SDK lets you develop and test specialized agents powered by your Claude subscription for personal development work. The SDK is available in both Python and TypeScript, with documentation here. This is for personal experiments and development (see the sketch at the end of this post). For production deployments of agents, use the API instead of your subscription. You can use your Claude subscription to run automated agents in GitHub Actions. The Claude Code Action lets you set up workflows that leverage Claude for code review, documentation generation, or automated testing analysis. Documentation is here. Any other uses of Claude would require the use of API keys. Your Claude subscription gives you:

- Claude desktop and mobile apps for general use
- Claude Code CLI for terminal-based development
- Claude Code on the web for browser-based work
- The ability to build custom agents through the official SDK (for personal development)
- Claude Code GitHub Action for CI/CD integration

Let me know if you have any corrections.
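For a sense of what the Agent SDK route looks like in practice, here is a rough Python sketch. The package and function names reflect my reading of the SDK docs (an async query() entry point in the claude_agent_sdk package) and may have drifted between versions, so treat it as illustrative and verify against the documentation linked above:

```python
# Rough sketch: a personal experiment with the Claude Agent SDK from Python.
# Assumes `pip install claude-agent-sdk` and an authenticated Claude
# subscription; the import path and function names are my reading of the
# docs and may differ in current SDK versions.
import anyio
from claude_agent_sdk import query  # assumed import path


async def main():
    # query() streams back messages: assistant text, tool activity, result.
    async for message in query(prompt="List the TODO comments in this project"):
        print(message)


anyio.run(main)
```

Remember the distinction above: this kind of personal experiment is covered by a subscription, while anything you deploy for other people should run against the API instead.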

1 views
Marc Brooker 3 days ago

Agent Safety is a Box

Keep a lid on it. Before we start, let’s cover some terms so we’re thinking about the same thing. This is a post about AI agents, which I’ll define (riffing off Simon Willison 1) as: An AI agent runs models and tools in a loop to achieve a goal. Here, goals can include coding, customer service, proving theorems, cloud operations, or many other things. These agents can be interactive or one-shot; called by humans, other agents, or traditional computer systems; local or cloud; and short-lived or long-running. What they don’t tend to be is pure. They typically achieve their goals by side effects. Side effects include modifying the local filesystem, calling another agent, calling a cloud service, making a payment, or starting a 3D print. The topic of today’s post is those side-effects. Simply, what agents can do. We should also be concerned with what agents can say, and I’ll touch on that topic a bit as I go. But the focus is on do. Agents do things with tools. These could be MCP-style tools, powers, skills, or one of many other patterns for tool calling. But, crucially, the act of doing inference doesn’t do anything. Without the do, the think seems less important. The right way to control what agents do is to put them in a box. The box is a strong, deterministic, exact layer of control outside the agent which limits which tools it can call, and what it can do with those tools. The most important one of those properties is outside the agent. Alignment and other AI safety topics are important. Steering, careful prompting, and context management help a lot. These techniques have a lot of value for liveness (success rate, cost, etc), but are insufficient for safety. They’re insufficient for safety for the same reason we’re building agents in the first place: because they’re flexible, adaptive, creative 2 problem solvers. Traditional old-school workflows are great. They’re cheap, predictable, deterministic, understandable, and well understood. But they aren’t flexible, adaptive, or creative. One change to a data representation or API, and they’re stuck. One unexpected exception case, and they can’t make progress. We’re interested in AI agents because they can make progress towards a broader range of goals without having a human think about all the edge cases beforehand. Safety approaches which run inside the agent typically run against this hard trade-off: to get value out of an agent we want to give it as much flexibility as possible, but to reason about what it can do we need to constrain that flexibility. Doing that, with strong guarantees, by trying to constrain what an agent can think, is hard. The other advantage of the box, the deterministic layer around an agent, is that it allows us to make some crisp statements about what matters and doesn’t. For example, if the box deterministically implements the policy a refund can only be for the original purchase price or less, and only one refund can be issued per order, we can exactly reason about how large refunds can be without worrying about the prompt injection attack of the week. What is the Box? The implementation of the box depends a lot on the type of agent we’re talking about. In later posts I’ll look a bit at local agents (the kind I run on my laptop), but for today I’ll start with agents in the cloud. In this cloud environment, agents implemented in code run in a secure execution environment like AgentCore Runtime.
Each agent session running inside this environment gets a secure, isolated place to run its loop, execute generated code, store things in local memory, and so on. Then, we have to add a way to interact with the outside world. To allow the agent to do things. This is where gateways (like AgentCore Gateway) come in. The gateway is the singular hole in the box. The place where tools are given to the agent, where those tools are controlled, and where policy is enforced. This scoping of tools differs from the usual concerns of authorization: typical authorization is concerned with what an actor can do with a tool; the gateway’s control is concerned with which tools are available. Agents can’t bypass the Gateway, because the Runtime stops them from sending packets anywhere else. Old-school network security controls. The Box’s Policy The simplest way this version of the box constrains what an agent can do is by constraining which tools it can access 3. Then we need to control what the agent can do with these tools. This is where authorization comes in. In the simplest case, the agent is working on behalf of a human user, and inherits a subset of its authorizations. In a future post I’ll write about other cases, where agents have their own authorization and the ability to escalate privilege, but none of that invalidates the box concept. Regardless, most of today’s authorization implementations don’t have sufficient power and flexibility to express some of the constraints we’d like to express as we control what an agent can do. And they don’t tend to compose across tools. So we need a policy layer at the gateway. AgentCore Policy gives fine-grained, deterministic control over the ways that an agent can call tools. Using the powerful Cedar policy language, AgentCore Policy is super flexible. But most people don’t want to learn Cedar, so we built on our research on converting human intent to policy to allow policies to also be expressed in natural language. Here’s what a policy looks like: By putting these policies at the edge of the box, in the gateway, we can make sure they are true no matter what the agent does. No errant prompt, context, or memory can bypass this policy. Anyway, this post has gotten very long, and there’s still some ground to cover. There’s more to say about multi-agent systems, memories, local agents, composition of policies, and many other topics. But hopefully the core point is clear: by building a deterministic, strong box around an agent we can get a level of safety and control that’s impossible to achieve without it. If this sounds interesting, and you’d like to spend an hour on it, here’s me talking about it at re:Invent ’25. Simon’s version is An LLM agent runs tools in a loop to achieve a goal, but I like to expand the definition to capture agents that may use smaller models and multiple models, and to highlight that inference is just one tool used by the larger system. I don’t love using the word creative in this sense, because it implies something is happening that really isn’t. But it’s not a terrible mental model. Which, of course, also requires that these tools are built in a way that they can’t be deputized to have their own unexpected side effects. In general, SaaS and cloud tools are built with an adversarial model which assumes that clients are badly-intentioned and so strictly scopes their access, so a lot of this work has already been done.
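To make the earlier refund example concrete, here is a minimal Python sketch of a deterministic policy check enforced at the gateway, outside the agent. The class, method, and policy names are illustrative assumptions of mine; this is not AgentCore Policy or Cedar syntax, just the shape of the idea:

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    purchase_price: float
    refunded: bool = False


class RefundPolicyGateway:
    """Deterministic policy layer sitting outside the agent: every refund
    tool call passes through here, no matter what the model decides."""

    def __init__(self, orders: dict[str, Order]):
        self.orders = orders

    def refund(self, order_id: str, amount: float) -> str:
        order = self.orders.get(order_id)
        if order is None:
            return "DENIED: unknown order"
        if order.refunded:
            return "DENIED: only one refund per order"
        if amount > order.purchase_price:
            return "DENIED: refund exceeds original purchase price"
        order.refunded = True
        return f"OK: refunded {amount:.2f} on {order_id}"


# The agent only ever sees the gateway's refund() tool, so a prompt-injected
# "refund me $10,000" still has to satisfy these deterministic checks.
gateway = RefundPolicyGateway({"A-1": Order("A-1", purchase_price=49.99)})
print(gateway.refund("A-1", 500.0))  # DENIED: refund exceeds original purchase price
print(gateway.refund("A-1", 20.0))   # OK: refunded 20.00 on A-1
print(gateway.refund("A-1", 10.0))   # DENIED: only one refund per order
```

However the real policy is expressed, the point is the same as in the post: the check lives at the edge of the box, so no prompt, context, or memory inside the agent can route around it.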

0 views

Letting Claude Play Text Adventures

The other day I went to an AI hackathon organized by my friends Lucia and Malin. The theme was mech interp, but I hardly know PyTorch so I planned to do something at the API layer rather than the model layer. Something I think about a lot is cognitive architectures (like Soar and ACT-R). This is like a continuation of GOFAI research, inspired by cognitive science. And like GOFAI it’s never yielded anything useful. But I often think: can we scaffold LLMs with cog arch-inspired harnesses to overcome their limitations? LLM agents like Claude Code are basically “accidental” cognitive architectures: they are designed and built by practitioners rather than theorists, but they have commonalities; they all need a way to manage memory, tool use, a task agenda, etc. Maybe building an agent on a more “principled” foundation, one informed by cognitive science, yields a higher-performing architecture. So I sat around a while thinking how to adapt Soar’s architecture to an LLM agent. And I sketched something out, but then I thought: how can I prove this performs better than baseline? I need an eval, a task. Math problems? Too one-shottable. A chatbot? Too interactive, I want something hands-off and long-horizon. A coding agent? That’s too freeform and requires too much tool use. And then I thought: text adventures! You have a stylized, hierarchically-structured world accessible entirely through text, long-term goals, puzzles, physical exploration and discovery of the environment. Even the data model of text adventures resembles frame-based knowledge representation systems. And there’s a vast collection of games available online. Anchorhead, which I played years ago, is a Lovecraft-inspired text adventure by Michael S. Gentry. It takes on the order of hundreds of turns to win across multiple in-game days. And the game world is huge and very open. In other words: a perfect long-horizon task. So I started hacking. The frotz interpreter runs on the command line and has a “dumb” interface called dfrotz, which takes the ncurses fluff out, and gives you a very stripped command-line experience. It is easy to write a little Python wrapper to drive the interpreter through stdin and stdout (see the sketch below). Now we can play the game from Python: send commands, get game output. Now we need the dual of this: a player. The trivial harness is basically nothing at all: treat the LLM/game interaction like a chat history. The LLM reads the game output from the interpreter, writes some reasoning tokens, and writes a command that is sent to the interpreter. And this works well enough. Haiku 4.5 would mostly wander around the game map, but Sonnet 4.5 and Opus 4.5 manage to solve the game’s first puzzle—breaking into the real estate office, and finding the keys to the mansion—readily enough. It takes about ~200 turns for Claude to get to the second in-game day. The way I thought this would fail is: attention gets smeared across the long context, the model gets confused about the geometry of the world, its goal and task state, and starts confabulating, going in circles, etc. As usual, I was outsmarting myself. The reason this fails is you run out of credits. By the time you get to day two, each turn costs tens of thousands of input tokens. No good! We need a way to save money. Ok, let’s try something that’s easier on my Claude credits.
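For reference, the wrapper really can be tiny: spawn dfrotz as a subprocess and talk to it over pipes. Here is a minimal sketch (the story file name is a placeholder, the half-second read timeout is a guess, and it assumes a Unix-style select on the pipe), not the hackathon repository's actual code:

```python
# Minimal sketch of a Python wrapper around dfrotz, the "dumb" frotz
# front end: send a command, read back whatever the game prints.
import os
import select
import subprocess


class ZMachine:
    def __init__(self, story_path: str):
        self.proc = subprocess.Popen(
            ["dfrotz", story_path],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def read_output(self, timeout: float = 0.5) -> str:
        """Collect whatever the interpreter has printed so far."""
        fd = self.proc.stdout.fileno()
        chunks = []
        while select.select([fd], [], [], timeout)[0]:
            chunk = os.read(fd, 4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks).decode(errors="replace")

    def send(self, command: str) -> str:
        """Play one turn: write a command, return the game's response."""
        self.proc.stdin.write((command + "\n").encode())
        self.proc.stdin.flush()
        return self.read_output()


game = ZMachine("anchorhead.z8")  # hypothetical story file path
print(game.read_output())         # the game's opening text
print(game.send("look"))          # one turn of play
```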
We’ll show Claude the most recent five turns (this is the perceptual working memory), and give it a simple semantic memory: a list of strings that it can append entries to, and remove entries from, using tool use. This keeps the token usage down (a sketch of this kind of harness appears at the end of this post). The problem is the narrow time horizon. With the trivial harness, Claude can break into the real estate office in ~10 turns, and does so right at the start of the game. With this new harness, Claude wanders about the town, taking copious notes, before returning to the real estate office, and it spends ~40 turns fumbling around with the garbage cans before managing to break into the real estate office. The next step, after getting the keys to the house, is to meet your husband Michael at the University and head home. Claude with the trivial harness takes about ~100 turns to find the house, with some tangential wandering about the town, and reaches day two around turn 150. Claude, with the memory harness, took ~250 turns just to get the keys to the house. And then it spends hundreds of turns just wandering in circles around the town, accumulating redundant memories, and hits the turn limit before even finding the house. Anchorhead is a long, broad game, and from the very beginning you can forget the plot and wander about most of the town. It takes a long time to see if a run with an agent goes anywhere. So I thought: I need something smaller. Unsurprisingly, Claude can make its own games. The Inform 7 package for NixOS was broken (though Mikael has fixed this recently) so I had to use Inform 6. I started with a trivial escape-the-room type game, which was less than 100 lines of code, and any Claude could beat it in less than 10 turns. Then I asked for a larger, multi-room heist game. This one was more fun. It’s short enough that Claude can win with just the trivial harness. I tried a different harness, where Claude has access to only the last five turns of the game’s history, and a read-write memory scratchpad. And this one was interesting. First, because Claude only ever adds to its own memory, it never deletes memories. I thought it would do more to trim and edit its scratchpad. Second, because Claude became fixated on this red-herring room: a garden with a well. It kept going in circles, trying to tie a rope to the well and climb down. Because of the limited game history, it only realized it was stuck when it saw that the most recent ~20 entries it wrote to its memories related to various attempts to go down the well. Then I watched Claude walk away from the garden and solve the final puzzle, and hit the turn limit just two turns short of winning. Tangent: I wonder if models are better at playing games created by other instances of the same model, by noticing tiny correlations in the text to infer what puzzles and obstacles they would have written. In the end I abandoned the “small worlds” approach because the games are too stylized, linear, and uninteresting. Anchorhead is more unwieldy, but more natural. I have a bunch of ideas I want to test, to better learn how harness implementations affect performance. But I’m short on time, so I’m cutting it here and listing these as todos (the repository is here):

- Domain-Specific Memories: Claude’s notes are all jumbled with information on tasks, locations, etc. It might be better to have separate memories: a todo list, a memory of locations and their connections, etc. This is close to the Soar approach.
- Automatic Geography: related to the above, the harness can inspect the game output and build up a graph of rooms and their connections, and format it in the context. This saves Claude having to note those things manually using a tool.
- Manual Geography: the automatic geography approach has a few drawbacks. Without integration into the Z-machine interpreter, it requires some work to implement (parsing the current location from the output, keeping track of the command history to find standard travel commands) and still isn’t 100% deterministic, so mazes and dynamic rooms (e.g. elevators) will confuse the system. So, instead of doing it manually, we could give Claude a tool for this.
- Episodic Memory: this feels like cheating, but, at the end of a run, you can show Claude the session transcript and ask it to summarize: what it accomplished and how, where it failed and why. Including a short walkthrough for how to get to the “last successful state”. This allows future runs to save time in getting up to speed.
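Here is the rough sketch promised above of the memory harness: a rolling window of the last five turns plus a list-of-strings semantic memory edited through two tools. The class and method names are mine for illustration (not from the repository), and the code that actually registers add_note/remove_note as tools for Claude is omitted:

```python
# Sketch of the memory harness described above: a five-turn perceptual
# window plus a semantic memory (list of strings) that the model edits
# via two tools. Names are illustrative, not from the actual repo.
from collections import deque


class MemoryHarness:
    def __init__(self, window: int = 5):
        self.recent_turns = deque(maxlen=window)  # perceptual working memory
        self.memory: list[str] = []               # semantic memory

    def record_turn(self, command: str, game_output: str) -> None:
        self.recent_turns.append(f"> {command}\n{game_output}")

    # The two methods below would be exposed to Claude as tools.
    def add_note(self, note: str) -> None:
        self.memory.append(note)

    def remove_note(self, index: int) -> None:
        if 0 <= index < len(self.memory):
            del self.memory[index]

    def build_context(self) -> str:
        """Assemble the prompt: long-term notes first, then recent turns."""
        notes = "\n".join(f"{i}. {note}" for i, note in enumerate(self.memory))
        turns = "\n\n".join(self.recent_turns)
        return f"MEMORY:\n{notes or '(empty)'}\n\nRECENT TURNS:\n{turns}"
```

Keeping the game history fixed at five turns is what caps the per-turn token cost; anything Claude wants to remember for longer has to survive as an explicit note.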

0 views
Simon Willison 3 days ago

My answers to the questions I posed about porting open source code with LLMs

Last month I wrote about porting JustHTML from Python to JavaScript using Codex CLI and GPT-5.2 in a few hours while also buying a Christmas tree and watching Knives Out 3. I ended that post with a series of open questions about the ethics and legality of this style of work. Alexander Petros on lobste.rs just challenged me to answer them , which is fair enough! Here's my attempt at that. You can read the original post for background, but the short version is that it's now possible to point a coding agent at some other open source project and effectively tell it "port this to language X and make sure the tests still pass" and have it do exactly that. Here are the questions I posed along with my answers based on my current thinking. Extra context is that I've since tried variations on a similar theme a few more times using Claude Code and Opus 4.5 and found it to be astonishingly effective. I decided that the right thing to do here was to keep the open source license and copyright statement from the Python library author and treat what I had built as a derivative work, which is the entire point of open source. After sitting on this for a while I've come down on yes, provided full credit is given and the license is carefully considered. Open source allows and encourages further derivative works! I never got upset at some university student forking one of my projects on GitHub and hacking in a new feature that they used. I don't think this is materially different, although a port to another language entirely does feel like a slightly different shape. Now this one is complicated! It definitely hurts some projects because there are open source maintainers out there who say things like "I'm not going to release any open source code any more because I don't want it used for training" - I expect some of those would be equally angered by LLM-driven derived works as well. I don't know how serious this problem is - I've seen angry comments from anonymous usernames, but do they represent genuine open source contributions or are they just angry anonymous usernames? If we assume this is real, does the loss of those individuals get balanced out by the increase in individuals who CAN contribute to open source because they can now get work done in a few hours that might previously have taken them a few days that they didn't have to spare? I'll be brutally honest about that question: I think that if "they might train on my code / build a derived version with an LLM" is enough to drive you away from open source, your open source values are distinct enough from mine that I'm not ready to invest significantly in keeping you. I'll put that effort into welcoming the newcomers instead. The much bigger concern for me is the impact of generative AI on demand for open source. The recent Tailwind story is a visible example of this - while Tailwind blamed LLMs for reduced traffic to their documentation resulting in fewer conversions to their paid component library, I'm suspicious that the reduced demand there is because LLMs make building good-enough versions of those components for free easy enough that people do that instead. I've found myself affected by this for open source dependencies too. The other day I wanted to parse a cron expression in some Go code. Usually I'd go looking for an existing library for cron expression parsing - but this time I hardly thought about that for a second before prompting one (complete with extensive tests) into existence instead. 
I expect that this is going to quite radically impact the shape of the open source library world over the next few years. Is that "harmful to open source"? It may well be. I'm hoping that whatever new shape comes out of this has its own merits, but I don't know what those would be. I'm not a lawyer so I don't feel credible to comment on this one. My loose hunch is that I'm still putting enough creative control in through the way I direct the models for that to count as enough human intervention, at least under US law, but I have no idea. I've come down on "yes" here, again because I never thought it was irresponsible for some random university student to slap an Apache license on some bad code they just coughed up on GitHub. What's important here is making it very clear to potential users what they should expect from that software. I've started publishing my AI-generated and not 100% reviewed libraries as alphas, which I'm tentatively thinking of as "alpha slop" . I'll take the alpha label off once I've used them in production to the point that I'm willing to stake my reputation on them being decent implementations, and I'll ship a 1.0 version when I'm confident that they are a solid bet for other people to depend on. I think that's the responsible way to handle this. That one was a deliberately provocative question, because for a new HTML5 parsing library that passes 9,200 tests you would need a very good reason to hire an expert team for two months (at a cost of hundreds of thousands of dollars) to write such a thing. And honestly, thanks to the existing conformance suites this kind of library is simple enough that you may find their results weren't notably better than the one written by the coding agent. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

1 views
Jim Nielsen 3 days ago

In The Beginning There Was Slop

I’ve been slowly reading my copy of “The Internet Phone Book” and I recently read an essay in it by Elan Ullendorff called “The New Turing Test”. Elan argues that what matters in a work isn’t the tools used to make it, but the “expressiveness” of the work itself (was it made “from someone, for someone, in a particular context”): If something feels robotic or generic, it is those very qualities that make the work problematic, not the tools used. This point reminded me that there was slop before AI came on the scene. A lot of blogging was considered a primal form of slop when the internet first appeared: content of inferior substance, generated in quantities much vaster than heretofore considered possible. And the truth is, perhaps a lot of the content in the blogosphere was “slop”. But it wasn’t slop because of the tools that made it — like Movable Type or Wordpress or Blogger. It was slop because it lacked thought, care, and intention — the “expressiveness” Elan argues for. You don’t need AI to produce slop because slop isn’t made by AI. It’s made by humans — AI is just the popular tool of choice for making it right now. Slop existed long before LLMs came onto the scene. It will doubtless exist long after too.

1 views
<antirez> 3 days ago

Don't fall into the anti-AI hype

I love writing software, line by line. It could be said that my career was a continuous effort to create well-written, minimal software, where the human touch was the fundamental feature. I also hope for a society where the last are not forgotten. Moreover, I don't want AI to economically succeed; I don't care if the current economic system is subverted (I could be very happy, honestly, if it goes in the direction of a massive redistribution of wealth). But, I would not respect myself and my intelligence if my idea of software and society impaired my vision: facts are facts, and AI is going to change programming forever. In 2020 I left my job in order to write a novel about AI, universal basic income, a society that adapted to the automation of work facing many challenges. At the very end of 2024 I opened a YouTube channel focused on AI, its use in coding tasks, its potential social and economic effects. But while I recognized what was going to happen very early, I thought that we had more time before programming would be completely reshaped, at least a few years. I no longer believe this is the case. Recently, state-of-the-art LLMs are able to complete large subtasks or medium-size projects alone, almost unassisted, given a good set of hints about what the end result should be. The degree of success you'll get is related to the kind of programming you do (the more isolated, and the more textually representable, the better: system programming is particularly apt), and to your ability to create a mental representation of the problem to communicate to the LLM. But, in general, it is now clear that for most projects, writing the code yourself is no longer sensible, if not to have fun. In the past week, just prompting, and inspecting the code to provide guidance from time to time, I did the following four tasks in hours instead of weeks:

1. I modified my linenoise library to support UTF-8, and created a framework for line editing testing that uses an emulated terminal that is able to report what is getting displayed in each character cell. Something that I always wanted to do, but it was hard to justify the work needed just to test a side project of mine. But if you can just describe your idea, and it materializes in the code, things are very different.

2. I fixed transient failures in the Redis test. This is very annoying work, timing related issues, TCP deadlock conditions, and so forth. Claude Code iterated for all the time needed to reproduce it, inspected the state of the processes to understand what was happening, and fixed the bugs.

3. Yesterday I wanted a pure C library that would be able to do the inference of BERT-like embedding models. Claude Code created it in 5 minutes. Same output and about the same speed (15% slower) as PyTorch. 700 lines of code. A Python tool to convert the GTE-small model.

4. In the past weeks I made changes to Redis Streams internals. I had a design document for the work I did. I tried to give it to Claude Code and it reproduced my work in, like, 20 minutes or less (mostly because I'm slow at checking and authorizing the commands it needed to run).

It is simply impossible not to see the reality of what is happening. Writing code is no longer needed for the most part. It is now a lot more interesting to understand what to do, and how to do it (and, about this second part, LLMs are great partners, too). It does not matter if AI companies will not be able to get their money back and the stock market will crash.
All that is irrelevant, in the long run. It does not matter if this or the other CEO of some unicorn is telling you something that is off putting, or absurd. Programming changed forever, anyway. How do I feel, about all the code I wrote that was ingested by LLMs? I feel great to be part of that, because I see this as a continuation of what I tried to do all my life: democratizing code, systems, knowledge. LLMs are going to help us to write better software, faster, and will allow small teams to have a chance to compete with bigger companies. The same thing open source software did in the 90s. However, this technology is far too important to be in the hands of a few companies. For now, you can do the pre-training better or not, you can do reinforcement learning in a much more effective way than others, but the open models, especially the ones produced in China, continue to compete (even if they are behind) with frontier models of closed labs. There is a sufficient democratization of AI, so far, even if imperfect. But: it is absolutely not obvious that it will be like that forever. I'm scared about the centralization. At the same time, I believe neural networks, at scale, are simply able to do incredible things, and that there is not enough "magic" inside current frontier AI for the other labs and teams not to catch up (otherwise it would be very hard to explain, for instance, why OpenAI, Anthropic and Google are so near in their results, for years now). As a programmer, I want to write more open source than ever, now. I want to improve certain repositories of mine abandoned for time concerns. I want to apply AI to my Redis workflow. Improve the Vector Sets implementation and then other data structures, like I'm doing with Streams now. But I'm worried for the folks that will get fired. It is not clear what the dynamic at play will be: will companies try to have more people, and to build more? Or will they try to cut salary costs, having fewer programmers that are better at prompting? And, there are other sectors where humans will become completely replaceable, I fear. What is the social solution, then? Innovation can't be taken back after all. I believe we should vote for governments that recognize what is happening, and are willing to support those who will remain jobless. And, the more people get fired, the more political pressure there will be to vote for those who will guarantee a certain degree of protection. But I also look forward to the good AI could bring: new progress in science, that could help lower the suffering of the human condition, which is not always happy. Anyway, back to programming. I have a single suggestion for you, my friend. Whatever you believe about what the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools, with care, with weeks of work, not in a five minutes test where you can just reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months. Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you, when you coded till night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouched. Comments

54 views
Xe Iaso 4 days ago

I made a simple agent for PR reviews. Don't use it.

My coworkers really like AI-powered code review tools and it seems that every time I make a pull request in one of their repos I learn about yet another AI code review SaaS product. Given that there are so many of them, I decided to see how easy it would be to develop my own AI-powered code review bot that targets GitHub repositories. I managed to hack out the core of it in a single afternoon using a model that runs on my desk. I've ended up with a little tool I call reviewbot that takes GitHub pull request information and submits code reviews in response. reviewbot is powered by a DGX Spark, llama.cpp, and OpenAI's GPT-OSS 120b. The AI model runs on my desk with a machine that pulls less power doing AI inference than my gaming tower pulls running fairly lightweight 3D games. In testing I've found that nearly all runs of reviewbot take less than two minutes, even at a rate of only 60 tokens per second generated by the DGX Spark. reviewbot is about 350 lines of Go that just feeds pull request information into the context window of the model and provides a few tools for actions like "leave pull request review" and "read contents of file". I'm considering adding other actions like "read messages in thread" or "read contents of issue", but I haven't needed them yet. To make my life easier, I distribute it as a Docker image that gets run in GitHub Actions whenever a pull review comment includes the magic phrase .

The main reason I made reviewbot is that I couldn't find anything like it that let you specify the combination of:

- Your own AI model name
- Your own AI model provider URL
- Your own AI model provider API token

I'm fairly sure that there are thousands of similar AI-powered tools on the market that I can't find because Google is a broken tool, but this one is mine. When reviewbot reviews a pull request, it assembles an AI model prompt like this: The AI model can return one of three results:

- Definite approval via the tool that approves the changes with a summary of the changes made to the code.
- Definite rejection via the tool that rejects the changes with a summary of the reason why they're being rejected.
- Comments without approving or rejecting the code.

The core of reviewbot is the "AI agent loop", or a loop that works like this (see the sketch at the end of this post):

- Collect information to feed into the AI model
- Submit information to the AI model
- If the AI model runs the tool, publish the results and exit.
- If the AI model runs any other tool, collect the information it's requesting and add it to the list of things to submit to the AI model in the next loop.
- If the AI model just returns text at any point, treat that as a noncommittal comment about the changes.

reviewbot is a hack that probably works well enough for me. It has a number of limitations including but not limited to:

- It does not work with closed source repositories due to the gitfs library not supporting cloning repositories that require authentication. Could probably fix that with some elbow grease if I'm paid enough to do so.
- A fair number of test invocations had the agent rely on unpopulated fields from the GitHub API, which caused crashes. I am certain that I will only find more such examples and need to issue patches for them.

When such an innovation as reviewbot comes to pass, people naturally have questions. In order to give you the best reading experience, I asked my friends, patrons, and loved ones for their questions about reviewbot. Here are some answers that may or may not help: Probably not! This is something I made out of curiosity, not something I made for you to actually use. It was a lot easier to make than I expected and is surprisingly useful for how little effort was put into it. Nope. Pure chaos. Let it all happen in a glorious way. How the fuck should I know? I don't even know if chairs exist. At least half as much I have wanted to use go wish for that. It's just common sense, really. When the wind can blow all the sand away. Three times daily or the netherbeast will emerge and doom all of society. We don't really want that to happen so we make sure to feed reviewbot its oatmeal. At least twelve. Not sure because I ran out of pancakes. Only if you add that functionality in a pull request. reviewbot can do anything as long as its code is extended to do that thing. Frankly, you shouldn't.

reviewbot is like 300 lines of Go hacked up by hand in an afternoon. If you really need something like this, you can likely write one yourself with little effort.
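For anyone curious what that loop looks like in practice, here is a bare-bones Python sketch of the same shape (reviewbot itself is Go and talks to llama.cpp; the model call and GitHub helpers below are stubs with invented names, not reviewbot's actual code):

```python
# Bare-bones sketch of the reviewbot-style agent loop described above.
# call_model, run_tool, publish_review and post_pr_comment are stubs with
# made-up names; swap in a real chat-completions client and GitHub calls.

def call_model(messages, tools):
    """Stub: returns either {"text": "..."} or {"tool": name, "args": {...}}."""
    raise NotImplementedError

def run_tool(name, args):
    """Stub: e.g. read the contents of a file from the pull request."""
    raise NotImplementedError

def publish_review(**kwargs):
    print("publishing review:", kwargs)

def post_pr_comment(text):
    print("comment:", text)

def review_pull_request(pr_context, tools):
    messages = [{"role": "user", "content": pr_context}]
    while True:
        result = call_model(messages, tools)

        if "text" in result:
            # Bare text with no tool call: treat it as a noncommittal comment.
            post_pr_comment(result["text"])
            return

        if result["tool"] == "leave_pull_request_review":
            # Terminal tool: publish the approval/rejection/comments and exit.
            publish_review(**result["args"])
            return

        # Any other tool: gather what the model asked for and feed it back
        # into the next iteration of the loop.
        output = run_tool(result["tool"], result["args"])
        messages.append({"role": "assistant", "content": f"[tool call] {result['tool']}"})
        messages.append({"role": "user", "content": output})
```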

2 views

Most Code is Just Cache

Claude Code has systematically begun to consume many of the SaaS apps I used to (or plan to) pay for. Why pay a subscription when I can "vibe code" a personal MVP in twenty minutes? I don’t worry about maintenance or vendor lock-in because, frankly, the code is disposable. If I need a new feature tomorrow, I don’t refactor—I just rebuild it. 1 Code is becoming just an ephemeral cache of my intent. Cartoon via Nano Banana. In this model, the ‘Source Code’ is the prompt and the context; the actual Python or Javascript that executes is just the binary. We still run the code because it’s thermodynamically efficient and deterministic, but we treat it as disposable. If the behavior needs to change, we don’t refactor the binary; we re-compile the intent. This shift has made me intolerant of static interfaces. I have stopped caring about software that doesn’t let me dump massive amounts of context into Gemini or Claude to just do the thing . If a product forces me to click buttons to execute a process that an LLM could intuit from a prompt, that product is already legacy. It forces us to question the permanence of the current model. We often make the mistake of assuming software—as we know it today—is a permanent fixture of human productivity. But if you zoom out, the era of SaaS is a blink of an eye in modern history. It is easy to overestimate how core it is to the future. In this post, I want to extrapolate these thoughts a bit and write out what could be the final stages of software. Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. The stages here might not necessarily be chronological or mutually exclusive. Instead, they are ordered from static to dynamic code generation — where more and more the intent of a customer is the software they use. This is the baseline where software is a static artifact sold as a service, built on the assumption that user problems are repetitive and predictable enough to be solved by rigid workflows. To the consumer, this looks like dashboards, CRUD forms, and hardcoded automations. The intelligence here is sourced mainly from the SaaS founder and hired domain experts, hard-coded into business logic years before the user ever logs in. When: We recognized that distributing software via the cloud was more efficient than on-premise installations. Value Loop: Customer Problem → Product Manager writes PRD → Engineers write Static Code → Deploy → Customer adapts their workflow to the tool. (Time: Months to Years | Fit: Generic / One-size-fits-none) We are seeing this now with companies adopting the Forward Deployed Engineering (FDE) . In this stage, the SaaS company hires humans to manually use AI to build bespoke solutions for the client. For the consumer, this feels like a concierge service; they don’t get a login to a generic tool, they get a custom-built outcome delivered by a human who used AI to write the glue code. The intelligence is hybrid: the human provides the architecture, the AI writes the implementation code in weeks to days. When: Companies realize AI allows their employees to build custom apps for clients faster than the clients can learn or adapt a generic tool. Value Loop: Customer Problem → SaaS Employee (FDE) Prompts AI → AI generates Custom Script/App → Employee Deploys for Customer. (Time: Days | Fit: High / Tailored to specific customer edge cases) This is the current “safe space” for most tech companies, where they bolt an LLM onto an existing application to handle unstructured data. 
Consumers experience this as a “Draft Email” button in their CRM or a “Chat” sidebar in their UI—the platform is still the main product, but AI is a feature that (hopefully) reduces friction and/or provides some extra functionality customization 2. The intelligence comes from a constrained model of product design and LLM scaffolding, providing content within a structure still strictly dictated by the SaaS platform’s code. When: People start to see AI is good at summarizing, generating content, or taking actions within existing workflows. Value Loop: Customer Problem → Static SaaS Interface AI Feature Text Box → Stochastic Result → Human Review. (Time: Minutes | Fit: Medium / Constrained by the platform’s UI) This is the tipping point where the software interface starts to disappear because the “interface” was just a way to collect context that the model can now ingest directly. Consumers move to a “Do this for me” interface where intent maps directly to an outcome rather than a button click, often realized as an agent calling a database or MCP servers 3. The intelligence is the model and its engineered input context, relegating the SaaS role, in some sense, to providing clean proprietary data via an agent-friendly interface. Software as a Service for Agents. When: People start to see AI is good at orchestrating complex decisions and using tools—across SaaS platforms—autonomously. Value Loop: Customer Problem (Prompt as ~PRD) → Runtime Code Generation → Dynamic Outcome. (Time: Real-time | Fit: Very High / Dynamically generated for the specific context) Critically, this doesn't mean the LLM acts as the CPU for every single user interaction (which would be latency-poor and energy-inefficient). Instead, the model almost acts as a Just-In-Time compiler. It generates the necessary code to execute the user’s intent, runs that code for the session, and then potentially discards it. This is the end game in some cases. If code is just a cache for intent, eventually we bypass the cache and bake the intuition directly into the model. To the consumer, the “tool” is invisible; the expert system simply exists and provides answers or actions without a login or workflow. The intelligence is in the model itself; the software platform exists solely as a distillation mechanism—a gym to train the vertical AI—and once the model learns the domain, the software is no longer needed. A company in this stage is not really even SaaS anymore, maybe more of an AI-gyms-aaS company. When: People start to see AI is good at absorbing the entire vertical’s intuition. Value Loop: Raw Domain Data → Reinforcement Learning / Fine-Tuning → Model Weights. (Time: Instant / Pre-computed | Fit: Very High / Intuitive domain mastery) This might feel unintuitive as a stage — like how could you bake some proprietary data lake into a model? How can our juicy data not be the moat? My conclusion is that most (but not all) data is a transformation of rawer upstream inputs and that these transformations (data pipelines, cross-tenant analysis, human research, etc.) are all “cache” that can be distilled into a more general model that operates on its intuition and upstream platform inputs. “But can agents run a bank?” Reliability and safety come down to distinguishing between guardrails (deterministic interfaces and scaffolding) and runtime execution (LLM code). For now, you don’t let the LLM invent the concept of a transaction ledger or rewrite the core banking loop on the fly.
In XX years, maybe we do trust AI to write core transaction logic; after all, fallible humans wrote the code for most mission-critical software that exists today. The line between human-defined determinism and agent symbolic interfaces will gradually move over time. “But enterprise SaaS is actually super complex.” Yes, but that complexity is mostly just unresolved ambiguity. Your “deep enterprise understanding” is often a collection of thousands of edge cases—permissions, policy exceptions, region-specific rules—that humans had to manually hard-code into IF/ELSE statements over a decade. Distilled to the core, this complexity collapses. The model doesn’t need 500 hard-coded features; it needs the raw data and the intent. An app built for one can also make a lot of simplifications compared to one that acts as a platform. “Customers don’t want to prompt features.” I agree. I don’t think the future looks like a chatbot. “Chat” is a skeuomorphic bridge we use because we haven’t figured out the consistent native interface yet. It might be a UI that pre-emptively changes based on your role, or it might feel like hiring a really competent employee who just “takes care of it” without you needing to specify the details. Or, as we see in Stage 2, the user never prompts at all—an FDE does it for them, and the user just gets a bespoke app that works perfectly. Stage 1, where most companies are stuck today, definitely is. Why? Because the sheer overhead of traditional SaaS—the learning curve, the rigid workflows, the "click tax" to get work done—is becoming unacceptable in a world where intent can be executed directly. It feels increasingly archaic when flexible solutions can be generated on demand. The value is moving away from the workflow logic itself and toward two specific layers that sandwich it:

- The Data Layer: Proprietary data, trust, and the “agentic scaffolding” that allows models to act safely within your domain.
- The Presentation Layer: Brand and UI. While I suspect trying to control the presentation layer long-term is futile (as users will eventually bring their own “interface agents” to interact with your data), for now, it remains a differentiator.

We are going to see companies move through these tiers. The winners IMO will be the ones who realize that the "Service" part of SaaS is being replaced by model intelligence. The SaaS that remains will be the infrastructure of truth and the engine of agency. We are transitioning from a world of static artifacts (code that persists for years) to dynamic generations (code that exists for milliseconds or for a single answer). Of course, I could be wrong. Maybe AI capability plateaus before it can fully integrate into complex verticals. Maybe traditional SaaS holds the line at Stage 2 or 3, protecting its moat through sheer inertia. Maybe the world ends up more decentralized. Some of my open questions:

- Which stage should you work on today? Is there alpha in skipping straight to Stage 4, or do you need to build the Stage 2 “vibe coding” service to bootstrap for now?
- What are the interfaces of the future? Is it MCP, curated compute sandboxes, or a yet-to-be-defined agent-to-agent-to-human protocol? What interface wins out, or does each company or consumer bring their own agentic worker?
- How fast does this happen? Are we looking at a multi-decade-long transition, or do companies today rapidly start dropping lower-stage SaaS tools?
- Does AI have a similar impact beyond software? Does medicine move from “static protocols” to “on-demand, patient-specific treatments”?
Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. Even more so than me you can see Geoffrey Huntley’s ralph-powered rampage of GitHub and many other tools . I liked this tweet by Harj Taggar, “moved away from the FDE playbook that’s become the default for fast growing AI startups. Instead they’ve built AI to covert plain English from the customer into Python code to make the product work for their use cases” . Similar to Karpathy’s “LLMs not as a chatbot, but the kernel process of a new Operating System” (2023) Cartoon via Nano Banana. In this model, the ‘Source Code’ is the prompt and the context; the actual Python or Javascript that executes is just the binary. We still run the code because it’s thermodynamically efficient and deterministic, but we treat it as disposable. If the behavior needs to change, we don’t refactor the binary; we re-compile the intent. This shift has made me intolerant of static interfaces. I have stopped caring about software that doesn’t let me dump massive amounts of context into Gemini or Claude to just do the thing . If a product forces me to click buttons to execute a process that an LLM could intuit from a prompt, that product is already legacy. It forces us to question the permanence of the current model. We often make the mistake of assuming software—as we know it today—is a permanent fixture of human productivity. But if you zoom out, the era of SaaS is a blink of an eye in modern history. It is easy to overestimate how core it is to the future. In this post, I want to extrapolate these thoughts a bit and write out what could be the final stages of software. Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. Software Evolution The stages here might not necessarily be chronological or mutually exclusive. Instead, they are ordered from static to dynamic code generation — where more and more the intent of a customer is the software they use. Stage 1. Traditional SaaS This is the baseline where software is a static artifact sold as a service, built on the assumption that user problems are repetitive and predictable enough to be solved by rigid workflows. To the consumer, this looks like dashboards, CRUD forms, and hardcoded automations. The intelligence here is sourced mainly from the SaaS founder and hired domain experts, hard-coded into business logic years before the user ever logs in. When: We recognized that distributing software via the cloud was more efficient than on-premise installations. Value Loop: Customer Problem → Product Manager writes PRD → Engineers write Static Code → Deploy → Customer adapts their workflow to the tool. (Time: Months to Years | Fit: Generic / One-size-fits-none) When: Companies realize AI allows their employees to build custom apps for clients faster than the clients can learn or adapt a generic tool. Value Loop: Customer Problem → SaaS Employee (FDE) Prompts AI → AI generates Custom Script/App → Employee Deploys for Customer. (Time: Days | Fit: High / Tailored to specific customer edge cases) When: People start to see AI is good at summarizing, generating content, or taking actions within existing workflows. Value Loop: Customer Problem → Static SaaS Interface AI Feature Text Box → Stochastic Result → Human Review. (Time: Minutes | Fit: Medium / Constrained by the platform’s UI) When: People start to see AI is good at orchestrating complex decisions and using tools—across SaaS platforms—autonomously. 
Value Loop: Customer Problem (Prompt as ~PRD) → Runtime Code Generation → Dynamic Outcome. (Time: Real-time | Fit: Very High / Dynamically generated for the specific context) Stage 5. When: People start to see AI is good at absorbing the entire vertical’s intuition. Value Loop: Raw Domain Data → Reinforcement Learning / Fine-Tuning → Model Weights. (Time: Instant / Pre-computed | Fit: Very High / Intuitive domain mastery)
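To make the Stage 4 value loop concrete, here is a minimal sketch of intent-compiled, disposable code: the prompt is the "source" and the generated Python is a throwaway "binary". The call_llm helper is a stand-in for whatever model API you'd use, and the canned snippet it returns is purely illustrative.

```python
# Minimal sketch of the Stage 4 loop: intent in, disposable code out.
# `call_llm` is a placeholder for a real model API call; here it returns a
# canned snippet so the sketch runs end to end.

def call_llm(prompt: str) -> str:
    # A real implementation would send `prompt` to a model and return its code.
    return "def solve(context):\n    return sum(context['line_items'])"

def run_intent(intent: str, context: dict):
    # 1. "Compile" the intent into code tailored to this specific request.
    source = call_llm(f"Write a Python function solve(context) that {intent}. Return only code.")
    # 2. Execute the generated code once, in its own namespace.
    #    (In practice this should run inside a sandbox, never a bare exec.)
    namespace: dict = {}
    exec(source, namespace)
    return namespace["solve"](context)

# The "software" for this request exists only for this one call; changing the
# behaviour means re-compiling the intent, not refactoring the generated code.
print(run_intent("adds up the invoice line items", {"line_items": [120, 45, 9]}))  # 174
```

The point of the sketch is the lifecycle rather than the code itself: nothing here is meant to persist beyond the single answer it produces.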

12 views
Stratechery 5 days ago

2026.02: AI Power, Now and In 100 Years

Welcome back to This Week in Stratechery! As a reminder, each week, every Friday, we’re sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone . Additionally, you have complete control over what we send to you. If you don’t want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings . On that note, here were a few of our favorites this week. This week’s Stratechery video is on Netflix and the Hollywood End Game . Will AI Replace Humans? Dorm room discussions are not generally Stratechery’s domain, but for the terminally online it’s getting harder to escape the insistence in some corners of the AI world that humans will no longer be economically necessary, and that we should make changes now to address what they insist is a foregone conclusion. I disagree : humans may be doomed, but as long as we are around, we’ll want other humans — and will build economies that reflect that fact. More generally, as we face this new paradigm, it’s essential to remember that technology is an amoral force: what is good or bad comes down to decisions we make, and trying to preemptively engineer our way around human nature guarantees a bad outcome. — Ben Thompson The Future of Power Generation. Through the second half of last year it became clear that one of the defining challenges of the modern era in the U.S. will be related to power: we need more than we can generate right now, electrical bills are skyrocketing, and that tension figures to compound as AI becomes more integral to the economy. If these topics interest you (and they should!), I heartily recommend this week’s Stratechery Interview with Jeremie Eliahou Ontiveros and Ajey Pandey , two SemiAnalysis analysts who provide a rundown on what the biggest AI labs are doing to address these challenges, the importance of natural gas, and how the market is responding to the most fundamental AI infrastructure challenge of them all. Reminder on that last point: Bubbles have benefits !  — Andrew Sharp What China Thinks of What Happened in Caracas. On this week’s Sharp China, Bill Bishop and I returned from a holiday that was busier than expected and broke down various aspects of China’s response to the upheaval in Venezuela after the CCP’s “all-weather strategic partnership” with the Nicolás Maduro regime was rendered null and void by the United States. Topics include: A likely unchanged calculus on Taiwan, a propaganda gift, questions about oil imports, and Iran looming as an even bigger wild card for PRC fortunes. Related to all this, on Sharp Text this week I wrote about the logic of the Maduro operation on the American side, and the dizzying nature of decoding U.S. foreign policy  in a modern era increasingly defined by cold war objectives but without cold war rhetoric.  — AS AI and the Human Condition — AI might replace all of the jobs; that’s only a problem if you think that humans will care, but if they care, they will create new jobs. Nvidia and Groq, A Stinkily Brilliant Deal, Why This Deal Makes Sense — Nvidia is licensing Groq’s technology and hiring most of its employees; it’s the most potent application of tech’s don’t-call-it-an-acquisition deal model yet. Nvidia at CES, Vera Rubin and AI-Native Storage Infrastructure, Alpamayo — Nvidia’s CES announcements didn’t have much for consumers, but affect them all the same.
An Interview with Jeremie Eliahou Ontiveros and Ajey Pandey About Building Power for AI — An Interview with Jeremie Eliahou Ontiveros and Ajey Pandey about how AI labs and hyperscalers are leveraging demand to build out entirely new electrical infrastructure for AI. Notes from Schrödinger’s Cold War — Why the U.S. captured Nicolás Maduro, and the challenge of decoding U.S. foreign policy in an era defined by cold war objectives, but without cold war rhetoric. CES and Humanity Tahoe Icons High-Five to the Belgrade Hand The Remarkable Computers Built Not to Fail China’s Venezuela Calculations; Japan’s Rare Earth Access; A Reported Pause on Nvidia Purchases; The Meta-Manus Deal Under Review A New Year’s MVP Ballot, The Jokic Injury and a Jaylen Brown Renaissance, The Pistons and Their Possibilities The Economy in the 22nd Century, Amoral Tech and Silicon Valley Micro-Culture, What Nvidia Is Getting From Groq

0 views
Giles's blog 5 days ago

Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI: a simple cross entropy loss over a fixed test set, and the results for an instruction fine-tune test that's covered in the book. Here were the results I got, sorted by the loss: Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern. I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now. In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change. Let's spec out the problem first. The instruction fine-tuning test trains our model on the Alpaca dataset in order to let it know how to follow instructions; that comprises a series of instruction/response training sequences (more details in this post ). In the version I've settled on, I fine-tune on a training set of 85% of the samples, epoch by epoch, bailing out when the loss on a separate validation set of 5% of the samples starts rising. I then use the weights from the previous epoch -- that is, before validation loss started rising -- to generate responses to the remaining 10% of the samples. Once that's done, the script hits the OpenAI API, using GPT-5.1, default parameters for all of the options (e.g. no explicit temperature), with queries asking it to score each model-generated response to an instruction on a scale of 0 to 100. We do that for every model-generated response in the test set, then take the average of the scores and use that as our result. To see why that's problematic, imagine a simple instruction with no separate input: a question asking who wrote "Pride and Prejudice". One response I've seen from my models was incoherent nonsense to the effect that the book wrote itself. That's obvious garbage, and should get a zero -- and GPT-5.1 consistently does that. Another response, from OpenAI's original weights for their "medium" model (larger than the ones I've been training), was a longer-winded answer that correctly identified Jane Austen. That's correct, so it deserves 100, or perhaps 95 due to being unnecessarily wordy (the answer "Jane Austen" is the suggested response in the dataset). But now how about "Sarah Palin"? One of my models came up with that gem during an earlier eval. It's completely wrong, so it deserves a 0, right? And normally the GPT-5.1 model does that -- but sometimes it's a little more generous, and gives it a low, but non-zero score. When asked for its reason for that, it makes the logical point that while it's the wrong answer, at least Sarah Palin is a real person. It's better than the "the book wrote itself" complete nonsense of the first response. The problem is that the different runs against the different models are not consistent, as they're all talking to GPT-5.1 separately. One model might find it in a harsh "mood", and get a lower rating than another model that found it at a more generous moment.
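As a rough illustration of the kind of per-response scoring call described above (a sketch only: the prompt wording, the inclusion of the dataset's expected output, and the function name are my assumptions rather than the actual script's contents, while the model string simply follows the post):

```python
# Illustrative per-response judge call (a sketch, not the original script).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_response(instruction: str, expected_output: str, model_response: str) -> str:
    # Whether the judge is shown the dataset's expected output is an assumption here.
    prompt = (
        f"Given the instruction `{instruction}` and the expected output `{expected_output}`, "
        f"score the model response `{model_response}` on a scale from 0 to 100, "
        "where 100 is the best score. Respond with the integer only."
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",  # model name as described in the post
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```

Averaging those per-response scores is what produces each model's single IFT number, which is exactly where the judge's run-to-run "mood" sneaks in.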
I came to the conclusion that the best way to fix this is to do a "batch" -- that is, fine-tune each model on the Alpaca dataset that Raschka provides, and generate responses for the test set and store them in a file. Then, once we've done that for all models, we can score them all at once, prompting GPT-5.1 with a single query per test instruction that contains every model's response to it and asks for a score for each. The theory is that doing it that way will mean that each individual query/response pair is graded consistently between models, even if there might still be inconsistencies between query/response pairs. That hopefully means we'll get more consistent results and can compare the models better. Here's the code: a script to fine-tune a model, generate test responses, and dump them into a JSON file ; and the LLM-as-a-judge code to send a bunch of models' responses to GPT-5.1 , which scrambles the order of the models in each query (to try to avoid any preference the model might have for the first one vs the last one) and stores GPT-5.1's per-response scores and comments in a new "annotated" JSON file. Running the first against each of our models, and then the second against all of the output files, gives us this updated table (with links to the annotated JSON files in case anyone else wants to take a look): (Still sorted by loss so that you can compare it more easily with the one above.) That's really interesting! The IFT score is still not correlated with the loss. But there does appear to be a pattern. It looks like we have three groups of models: the OpenAI weights and the cloud train on the 8x A100 40 GiB machine using FineWeb, which have low loss and high IFT scores; the other cloud models and the local train that used FineWeb, which have medium loss and low IFT scores; and the FineWeb-Edu local trains, which have high loss, but IFT scores that are almost as good as the first group's. I tried running the LLM-as-a-judge scoring script a few times, just to make sure this wasn't some kind of random weirdness, but the pattern was always the same: the OpenAI weights, the cloud FineWeb 8x A100 40 GiB, and the two local FineWeb-Edu models always got the best IFT scores, though sometimes they swapped positions (apart from the OpenAI medium model, which was of course always at the top). The other cloud FineWeb models and the local FineWeb one were consistently scored much lower. A hypothesis: there are two things that contribute to how good a model is at these IFT tests: the loss (models that are better at predicting the next token are inherently better at instruction-following after the fine-tuning), and the amount of information in the dataset (it doesn't matter how clever a model is, if it never saw "Jane Austen wrote 'Pride and Prejudice'" as part of its training, it will never be able to get a good score on that question). Or to put it another way -- some of these models are smart but not knowledgeable, while others are knowledgeable but not smart, and some are neither. I think that could explain what we're seeing here. While OpenAI never published their "WebText" dataset for GPT-2, the paper describes it as "a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma." Now, the FineWeb dataset is quite similar, though I think it's a tad more curated than that. But OpenAI trained their models for quite some time and did lots of tricks to get the loss as low as possible. By contrast, the FineWeb-Edu dataset is a carefully selected subset of FineWeb, with only the most "educational" data. Models trained on it, you might think, would know more facts for a given amount of training. So we can imagine the OpenAI models are smart but not knowledgeable, as we can our cloud FineWeb 8x A100 40 GiB model, which (I believe due to an accidentally-near-optimal batch size) worked out well in terms of loss. They were trained on relatively sloppy datasets but turned out reasonably well. Their intelligence makes up for some of their lack of knowledge. Our other cloud trains and the local FineWeb one are dumb and not knowledgeable; they were trained on the low-information FineWeb dataset, but they didn't wind up with a particularly amazing loss. So they get low scores. And finally, our local FineWeb-Edu models are still dumb, but they make up for it by knowing more because their training data was better.
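A minimal sketch of the batched idea, with one query per test instruction containing every model's stored response in shuffled order, might look something like this (again an illustrative reconstruction rather than the post's actual code; the prompt wording and JSON output format are assumptions):

```python
# Sketch of batched judging: every model's response to one instruction in a single query.
import random
from openai import OpenAI

client = OpenAI()

def score_models_together(instruction: str, expected_output: str,
                          responses_by_model: dict[str, str]) -> str:
    # Shuffle so the judge can't consistently favour the first or last model listed.
    items = list(responses_by_model.items())
    random.shuffle(items)
    listing = "\n".join(f"[{name}] {resp}" for name, resp in items)
    prompt = (
        f"Instruction: {instruction}\n"
        f"Expected output: {expected_output}\n\n"
        f"Responses from several different models:\n{listing}\n\n"
        "Score each response from 0 to 100, applying the same standard to all of them. "
        "Reply as JSON mapping the bracketed model name to its score."
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content  # parse downstream, e.g. with json.loads
```

Because all of the responses to a given instruction are judged inside one query, they all get the same "mood", even if different instructions are still judged in different moods.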
Well, it sounds plausible ;-) And I'd like to spend some time digging in to see if there's any indication of whether it's actually true. But after an afternoon of poking around the results, I can't really get a handle on whether it is, or indeed how you'd test that hypothesis in any real depth. TBH, I think this has zoomed so far past my "no side quests" limit that it's not even visible in the rear view mirror, so it's probably best to shelve it as a "cool idea, bro" for now. Learning about how to run sensible evals, and how to work out what they're saying, will have to be a task for another day. I will keep on doing these IFT tests for future models, though, just out of interest. So: let's get back to our regularly scheduled LLM training. Next up, how do we upload our models to Hugging Face quickly and easily so that other people can play with them.

0 views
Simon Willison 6 days ago

LLM predictions for 2026, shared with Oxide and Friends

I joined a recording of the Oxide and Friends podcast on Tuesday to talk about 1, 3 and 6 year predictions for the tech industry. This is my second appearance on their annual predictions episode, you can see my predictions from January 2025 here . Here's the page for this year's episode , with options to listen in all of your favorite podcast apps or directly on YouTube . Bryan Cantrill started the episode by declaring that he's never been so unsure about what's coming in the next year. I share that uncertainty - the significant advances in coding agents just in the last two months have left me certain that things will change significantly, but unclear as to what those changes will be. Here are the predictions I shared in the episode. I think that there are still people out there who are convinced that LLMs cannot write good code. Those people are in for a very nasty shock in 2026. I do not think it will be possible to get to the end of even the next three months while still holding on to the idea that the code they write is all junk and that it's likely any decent human programmer will write better code than they will. In 2023, saying that LLMs write garbage code was entirely correct. For most of 2024 that stayed true. In 2025 that changed, but you could be forgiven for continuing to hold out. In 2026 the quality of LLM-generated code will become impossible to deny. I base this on my own experience - I've spent more time exploring AI-assisted programming than most. The key change in 2025 (see my overview for the year ) was the introduction of "reasoning models" trained specifically against code using Reinforcement Learning. The major labs spent a full year competing with each other on who could get the best code capabilities from their models, and that problem turns out to be perfectly attuned to RL since code challenges come with built-in verifiable success conditions. Since Claude Opus 4.5 and GPT-5.2 came out in November and December respectively the amount of code I've written by hand has dropped to a single digit percentage of my overall output. The same is true for many other expert programmers I know. At this point if you continue to argue that LLMs write useless code you're damaging your own credibility. I think this year is the year we're going to solve sandboxing. I want to run code other people have written on my computing devices without it destroying my computing devices if it's malicious or has bugs. [...] It's crazy that it's 2026 and I still download random code and then execute it in a way that it can steal all of my data and delete all my files. [...] I don't want to run a piece of code on any of my devices that somebody else wrote outside of a sandbox ever again. This isn't just about LLMs, but it becomes even more important now there are so many more people writing code, often without knowing what they're doing. Sandboxing is also a key part of the battle against prompt injection. We have a lot of promising technologies in play already for this - containers and WebAssembly being the two I'm most optimistic about. There's real commercial value involved in solving this problem. The pieces are there, what's needed is UX work to reduce the friction in using them productively and securely. I think we're due a Challenger disaster with respect to coding agent security[...] I think so many people, myself included, are running these coding agents practically as root, right? We're letting them do all of this stuff. And every time I do it, my computer doesn't get wiped.
I'm like, "oh, it's fine". I used this as an opportunity to promote my favourite recent essay about AI security, the Normalization of Deviance in AI by Johann Rehberger. The Normalization of Deviance describes the phenomenon where people and organizations get used to operating in an unsafe manner because nothing bad has happened to them yet, which can result in enormous problems (like the 1986 Challenger disaster) when their luck runs out. Every six months I predict that a headline-grabbing prompt injection attack is coming soon, and every six months it doesn't happen. This is my most recent version of that prediction! (I dropped this one to lighten the mood after a discussion of the deep sense of existential dread that many programmers are feeling right now!) I think that Kākāpō parrots in New Zealand are going to have an outstanding breeding season. The reason I think this is that the Rimu trees are in fruit right now. There's only 250 of them, and they only breed if the Rimu trees have a good fruiting. The Rimu trees have been terrible since 2019, but this year the Rimu trees were all blooming. There are researchers saying that all 87 females of breeding age might lay an egg. And for a species with only 250 remaining parrots that's great news. (I just checked Wikipedia and I was right with the parrot numbers but wrong about the last good breeding season, apparently 2022 was a good year too.) In a year with precious little in the form of good news I am utterly delighted to share this story. Here's more: I don't often use AI-generated images on this blog, but the Kākāpō image the Oxide team created for this episode is just perfect : We will find out if the Jevons paradox saves our careers or not. This is a big question that anyone who's a software engineer has right now: we are driving the cost of actually producing working code down to a fraction of what it used to cost. Does that mean that our careers are completely devalued and we all have to learn to live on a tenth of our incomes, or does it mean that the demand for software, for custom software goes up by a factor of 10 and now our skills are even more valuable because you can hire me and I can build you 10 times the software I used to be able to? I think by three years we will know for sure which way that one went. The quote says it all. There are two ways this coding agents thing could go: it could turn out software engineering skills are devalued, or it could turn out we're more valuable and effective than ever before. I'm crossing my fingers for the latter! So far it feels to me like it's working out that way. I think somebody will have built a full web browser mostly using AI assistance, and it won't even be surprising. Rolling a new web browser is one of the most complicated software projects I can imagine[...] the cheat code is the conformance suites. If there are existing tests that it'll get so much easier. A common complaint today from AI coding skeptics is that LLMs are fine for toy projects but can't be used for anything large and serious. I think within 3 years that will be comprehensively proven incorrect, to the point that it won't even be controversial anymore. I picked a web browser here because so much of the work building a browser involves writing code that has to conform to an enormous and daunting selection of both formal tests and informal websites-in-the-wild. Coding agents are really good at tasks where you can define a concrete goal and then set them to work iterating in that direction. 
A web browser is the most ambitious project I can think of that leans into those capabilities. I think the job of being paid money to type code into a computer will go the same way as punching punch cards [...] in six years' time, I do not think anyone will be paid just to do the thing where you type the code. I think software engineering will still be an enormous career. I just think the software engineers won't be spending multiple hours of their day in a text editor typing out syntax. The more time I spend on AI-assisted programming the less afraid I am for my job, because it turns out building software - especially at the rate it's now possible to build - still requires enormous skill, experience and depth of understanding. The skills are changing though! Being able to read a detailed specification and transform it into lines of code is the thing that's being automated away. What's left is everything else, and the more time I spend working with coding agents the larger that "everything else" becomes. 1 year: It will become undeniable that LLMs write good code 1 year: We're finally going to solve sandboxing 1 year: A "Challenger disaster" for coding agent security 1 year: Kākāpō parrots will have an outstanding breeding season 3 years: the coding agents Jevons paradox for software engineering will resolve, one way or the other 3 years: Someone will build a new browser using mainly AI-assisted coding and it won't even be a surprise 6 years: Typing code by hand will go the way of punch cards
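As a loose illustration of the container half of the sandboxing prediction above (a minimal sketch only, with the image, resource limits and paths chosen purely for illustration):

```python
# Hypothetical sketch: run an untrusted script inside a locked-down Docker container.
import subprocess
from pathlib import Path

def run_untrusted(script_dir: Path, script_name: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network=none",     # no network access
            "--memory=256m",      # cap memory
            "--pids-limit=64",    # cap process count
            "--read-only",        # read-only root filesystem
            "-v", f"{script_dir.resolve()}:/work:ro",  # mount the code read-only
            "python:3.12-slim",
            "python", f"/work/{script_name}",
        ],
        capture_output=True,
        text=True,
        timeout=60,
    )
```

WebAssembly runtimes offer a similar isolation story with less overhead; either way the point is that code you didn't write never touches your real filesystem or network.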

0 views
Martin Fowler 6 days ago

Fragments: January 8

Anthropic report on how their AI is changing their own software development practice . Among the findings: most usage is for debugging and helping understand existing code; a notable increase in using it for implementing new features; developers using it for 59% of their work and getting a 50% productivity increase; 14% of developers are “power users” reporting much greater gains; Claude helps developers to work outside their core area; and concerns about changes to the profession, career evolution, and social dynamics. ❄                ❄                ❄                ❄                ❄ Much of the discussion about using LLMs for software development lacks details on workflow. Rather than just hear people gush about how wonderful it is, I want to understand the gritty details. What kinds of interactions occur with the LLM? What decisions do the humans make? When reviewing LLM outputs, what kinds of things are the humans looking for, what corrections do they make? Obie Fernandez has written a post that goes into these kinds of details. Over the Christmas / New Year period he used Claude to build a knowledge distillation application that takes transcripts from Claude Code sessions, Slack discussions, GitHub PR threads, etc., turns them into an RDF graph database, and provides a web app with natural language ways to query them. Not a proof of concept. Not a demo. The first cut of Nexus, a production-ready system with authentication, semantic search, an MCP server for agent access, webhook integrations for our primary SaaS platforms, comprehensive test coverage, deployed, integrated and ready for full-scale adoption at my company this coming Monday. Nearly 13,000 lines of code. The article is long, but worth the time to read it. An important feature of his workflow is relying on Test-Driven Development. Here’s what made this sustainable rather than chaotic: TDD. Test-driven development. For most of the features, I insisted that Claude Code follow the red-green-refactor cycle with me. Write a failing test first. Make it pass with the simplest implementation. Then refactor while keeping tests green. This wasn’t just methodology purism. TDD served a critical function in AI-assisted development: it kept me in the loop. When you’re directing thousands of lines of code generation, you need a forcing function that makes you actually understand what’s being built. Tests are that forcing function. You can’t write a meaningful test for something you don’t understand. And you can’t verify that a test correctly captures intent without understanding the intent yourself. The account includes a major refactoring, and much evolution of the initial version of the tool. It’s also an interesting glimpse of how AI tooling may finally make RDF useful. ❄                ❄                ❄                ❄                ❄ When thinking about requirements for software, most discussions focus on prioritization. Some folks talk about buckets such as the MoSCoW set : Must, Should, Could, and Want. (The old joke being that, in MoSCoW, the cow is silent, because hardly any requirements end up in those buckets.) Jason Fried has a different set of buckets for interface design: Obvious, Easy, and Possible . This immediately resonates with me: a good way to think about how to allocate the cognitive costs for those who use a tool. ❄                ❄                ❄                ❄                ❄ Casey Newton explains how he followed up on an interesting story of dark patterns in food delivery, and found it to be a fake story, buttressed by AI image and document creation. On one hand, it clarifies the important role reporters play in exposing lies that get traction on the internet. But time taken to do this is time not spent on investigating real stories. For most of my career up until this point, the document shared with me by the whistleblower would have seemed highly credible in large part because it would have taken so long to put together.
Who would take the time to put together a detailed, 18-page technical document about market dynamics just to troll a reporter? Who would go to the trouble of creating a fake badge? Today, though, the report can be generated within minutes, and the badge within seconds. And while no good reporter would ever have published a story based on a single document and an unknown source, plenty would take the time to investigate the document’s contents and see whether human sources would back it up. The internet has always been full of slop, and we have always needed to be wary of what we read there. AI now makes it easy to manufacture convincing-looking evidence, and this is never more dangerous than when it confirms strongly held beliefs and fears. ❄                ❄                ❄                ❄                ❄ Kent Beck : The descriptions of Spec-Driven development that I have seen emphasize writing the whole specification before implementation. This encodes the (to me bizarre) assumption that you aren’t going to learn anything during implementation that would change the specification. I’ve heard this story so many times told so many ways by well-meaning folks–if only we could get the specification “right”, the rest of this would be easy. For me, as for him, that story has been the constant background siren of my career in tech. But the learning loop of experimentation is essential to the model building that’s at the heart of any kind of worthwhile specification. As Unmesh puts it: Large Language Models give us great leverage—but they only work if we focus on learning and understanding. They make it easier to explore ideas, to set things up, to translate intent into code across many specialized languages. But the real capability—our ability to respond to change—comes not from how fast we can produce code, but from how deeply we understand the system we are shaping. When Kent defined Extreme Programming, he made feedback one of its four core values. It strikes me that the key to making full use of AI in software development is how to use it to accelerate the feedback loops. ❄                ❄                ❄                ❄                ❄ As I listen to people who are serious about AI-assisted programming, the crucial thing I hear is managing context. Programming-oriented tools are getting more sophisticated for that, but there are also efforts at providing simpler tools that allow customization. Carlos Villela recently recommended Pi, and its developer, Mario Zechner, has an interesting blog on its development. So what’s an old guy yelling at Claudes going to do? He’s going to write his own coding agent harness and give it a name that’s entirely un-Google-able, so there will never be any users. Which means there will also never be any issues on the GitHub issue tracker. How hard can it be? If I ever get the time to sit and really play with these tools, then something like Pi would be something I’d like to try out. Although as an addict to The One True Editor, I’m interested in some of the libraries that work with that, such as gptel . That would enable me to use Emacs’s inherent programmability to create my own command set to drive the interaction with LLMs. ❄                ❄                ❄                ❄                ❄ Outside of my professional work, I’ve been posting regularly about my boardgaming on the specialist site BoardGameGeek. However, its blogging environment doesn’t do a good job of providing an index to my posts, so I’ve created a list of my BGG posts on my own site.
If you’re interested in my regular posts on boardgaming, and you’re on BGG you can subscribe to me there. If you’re not on BGG you can subscribe to the blog’s RSS feed . I’ve also created a list of my favorite board games .

0 views
Stratechery 6 days ago

An Interview with Jeremie Eliahou Ontiveros and Ajey Pandey About Building Power for AI

An Interview with Jeremie Eliahou Ontiveros and Ajey Pandey about how AI labs and hyperscalers are leveraging demand to build out entirely new electrical infrastructure for AI.

0 views