Posts in Ai (20 found)

The Devil Needs No Advocate

Unless you're wealthy enough to bribe a small country or have personally received an invitation to Epstein's island, you have no business advocating for billionaires. Surely there must be a thrill in an ethics of contrarianism, something to make docile subservience an exciting prospect. "Ohh look at me, I'm so naughty because I'm not like everyone else", thinks the contrarian while shooting a sideways glance towards the working class, hoping to one day share a meal with wealthy industrialists, completely oblivious to the fact that among the working class is where he will always be kept. Did anyone ever find a friend in the kid who played "devil's advocate" in school? I want to share with you an unbelievable paragraph I read today. It was written by an American professor of philosophy, in an essay arguing that there is no moral objection to AI art. I call it unbelievable because it's hard to believe how badly it mischaracterizes the artist's rejection of generative AI. Imagine an artist in a patriarchal society complaining when women are allowed into the art museum for the first time: “I never gave permission for women to view my art!” This artist has no legitimate moral complaint, I’d say, because he has no moral right to make his work accessible only to men. Likewise, artists have no moral right to make their work accessible only to humans. They have no legitimate complaint if an AI trains on the work they post online, any more than they can complain about a young human artist “training on” (or learning from) their work. Take a minute to read that one again. "Babe, I thought of a great way to advance an instrumentalist view of agency that attributes mental states and intentionality to generative AI systems. First you pretend women and computer software are equivalent and then..." To philosophers it must be exciting to think of Artificial Intelligence as its own ontological class, a sui generis marvel of modern engineering. The truth is that no such thing exists yet, and marketing in Silicon Valley is powerful. Women have agency. AI has no agency. That's why this is a silly comparison and not even at all what the rejection of generative AI is about. When an artist pushes back against the use of generative AI tools, what they are saying is something like this: I do not approve of technology corporations amassing wealth by exploiting my work as an artist without consent . There's no artist saying they don't want the literal software processing their data because it's software. It's about who owns the software and what they do with it. The rejection of generative AI is not about programming languages, package managers, libraries, large language models and application programming interfaces. It's about technocrats building programs, using marketing terms like "learning" to make you think they have agency, and then the working class pretending they do because the marketing got so good. The devil needs no advocate.

0 views

Premium: OpenAI Burned $4.1 Billion More Than We Knew - Where Is Its Money Going?

Soundtrack: Queens of the Stone Age - Song For The Dead Editor's Note: The original piece had a mathematical error around burnrate, it's been fixed. Also, welcome to another premium issue! Please do subscribe, this is a massive, 7000-or-so word piece, and that's the kind of depth you get every single week for your subscription. A few days ago, Sam Altman said that OpenAI’s revenues were “well more” than $13bn in 2025 , a statement I question based on the fact, based on other outlets’ reporting , OpenAI only made $4.3bn through the first half of 2025, and likely around a billion a month, which I estimate means the company made around $8bn by the end of September. This is an estimate. If I receive information to the contrary, I’ll report it. Nevertheless, OpenAI is also burning a lot of money. In recent public disclosures ( as reported by The Register ), Microsoft noted that it had funding commitments to OpenAI of $13bn, of which $11.6bn had been funded by September 30 2025.  These disclosures also revealed that OpenAI lost $12bn in the last quarter — Microsoft’s Fiscal Year Q1 2026, representing July through September 2025. To be clear, this is actual, real accounting, rather than the figures leaked to reporters. It’s not that leaks are necessarily a problem — it’s just that anything appearing on any kind of SEC filing generally has to pass a very, very high bar. There is absolutely nothing about these numbers that suggests that OpenAI is “profitable on inference” as Sam Altman told a group of reporters at a dinner in the middle of August . Let me get specific.  The Information reported that through the first half of 2025, OpenAI spent $6.7bn on research and development, “which likely include[s] servers to develop new artificial intelligence.” The common refrain here is that OpenAI “is spending so much on training that it’s eating the rest of its margins,” but if that were the case here, it would mean that OpenAI spent the equivalent of six months’ training in the space of three. I think the more likely answer is that OpenAI is spending massive amounts of money on staff, sales and marketing ($2bn alone in the first half of the year), real estate, lobbying , data, and, of course, inference.  According to The Information , OpenAI had $9.6bn in cash at the end of June 2025. Assuming that OpenAI lost $12bn at the end of calendar year Q3 2025, and made — I’m being generous — around $3.3bn (or $1.1bn a month) within that quarter, this would suggest OpenAI’s operations cost them over $15bn in the space of three months. Where, exactly, is this money going? And how do the numbers published actually make sense when you reconcile them with Microsoft’s disclosures?  In the space of three months, OpenAI’s costs — if we are to believe what was leaked to The Information (and, to be clear, I respect their reporting) — went from a net loss of $13.5bn in six months to, I assume, a net loss of $ 12bn in three months.   Though there are likely losses related to stock-based compensation, this only represented a cost of $2.5bn in the first half of 2025. The Information also reported that OpenAI “spent more than $2.5 billion on its cost of revenue,” suggesting inference costs of…around that?  I don’t know. I really don’t know. But something isn't right, and today I'm going to dig into it. In this newsletter I'm going to reveal how OpenAI's reported revenues and costs don't line up - and that there's $4.1 billion of cash burn that has yet to be reported elsewhere.

0 views
iDiallo Yesterday

The NEO Robot

You've probably seen the NEO home robot by now, from the company 1X. It's a friendly humanoid with a plush-toy face that can work around your house. Cleaning, making beds, folding laundry, even picking up after meals. Most importantly, there's the way it looks. Unlike Tesla's "Optimus," which resembles an industrial robot, NEO looks friendly. It has a cute, plush face with round eyes. Something you could let your children play with. But after watching their launch video, I only had one thing on my mind: battery life. And that's how you know I was tricked. Battery life is four hours after a full charge according to the company, but that's the wrong thing to focus on. Remember when Tesla first announced Optimus? Elon Musk made sure to emphasize one statement, they purposely capped the robot's speed to 5 miles per hour. Then he joked that "you can just outrun it and most likely overpower it." This steered the conversation toward safety in AI and robots. a masterful bit of misdirection from the fact that there was no robot whatsoever at the time. Not even a prototype. Just a person in a suit doing a silly dance. With NEO, we saw a lot more. The robot loaded laundry into the machine, tidied up the home, did the dishes. Real demonstrations with real hardware. But what they failed to emphasize was just as important. All actions in the video were entirely remote controlled. Here are the assumptions I was making while watching their video. Once you turn on this robot, it would first need to understand your home. Since it operates as a housekeeper, it would map your space using the dual cameras on its head, saving this information to some internal drive. It would need to recognize you both visually and through your voice. You'd register your face and voice like Face ID. They stated it can charge itself, so the dexterity of its hands must be precise enough to plug itself in autonomously. All reasonable assumptions for a $20,000 "AI home robot," right? But these are just assumptions. Then the founder mentions you can "teach it new tasks," overseen by one of their experts that you can book at specific times. Since we're not seeing the robot do anything autonomously, I'm left wondering. What does "teaching the robot a skill" even mean? The NEO is indeed a humanoid robot. But it's not an autonomous AI robot. It's a teleoperated robot that lives in your home. A remote operator from 1X views through its cameras and controls its movements when it needs to perform a tasks. If that's what they're building, it should be crystal clear. People need to understand what they're buying and the implications that come with it. You're allowing someone from a company to work in your home remotely, using a humanoid robot as their avatar, seeing everything the robot sees. Looking at the videos published by outlets like the Wall Street Journal , even the teleoperated functionality appears limited. MKBHD also offers an excellent analysis that's worth watching. 1X positions this teleoperation as a training mechanism. The "Expert Mode" that generates data to eventually make the robot autonomous. It's a reasonable approach, similar to how Tesla gathered data for Full Self-Driving. But the difference is your car camera feeds helped train a system; NEO's cameras invite a stranger into your most private spaces. The company says it has implemented privacy controls, scheduled sessions, no-go zones, visual indicators when someone's watching, face-blurring technology, etc. These are necessary safeguards, but they don't change the fundamental problem. This is not an autonomous robot. Also, you are acting as a data provider for the company while paying $20,000 for the hardware. 2026 is just around the corner. I expect the autonomous capabilities to be quietly de-emphasized in marketing as we approach the release date. I also expect delays attributed to "high demand" and "ensuring safety standards." I don't expect this robot to deliver in 2026. If it does, it will be a teleoperated humanoid. With my privacy concerns, I will probably not be an early or late adopter. But I'll happily seat on the sidelines and watch the chaos unfold. A teleoperated humanoid sounds like the next logical step for an Uber or DoorDash. The company should just be clear about what they are building.

0 views
Simon Willison 2 days ago

Code research projects with async coding agents like Claude Code and Codex

I've been experimenting with a pattern for LLM usage recently that's working out really well: asynchronous code research tasks . Pick a research question, spin up an asynchronous coding agent and let it go and run some experiments and report back when it's done. Software development benefits enormously from something I call code research . The great thing about questions about code is that they can often be definitively answered by writing and executing code. I often see questions on forums which hint at a lack of understanding of this skill. "Could Redis work for powering the notifications feed for my app?" is a great example. The answer is always "it depends", but a better answer is that a good programmer already has everything they need to answer that question for themselves. Build a proof-of-concept, simulate the patterns you expect to see in production, then run experiments to see if it's going to work. I've been a keen practitioner of code research for a long time. Many of my most interesting projects started out as a few dozen lines of experimental code to prove to myself that something was possible. It turns out coding agents like Claude Code and Codex are a fantastic fit for this kind of work as well. Give them the right goal and a useful environment and they'll churn through a basic research project without any further supervision. LLMs hallucinate and make mistakes. This is far less important for code research tasks because the code itself doesn't lie: if they write code and execute it and it does the right things then they've demonstrated to both themselves and to you that something really does work. They can't prove something is impossible - just because the coding agent couldn't find a way to do something doesn't mean it can't be done - but they can often demonstrate that something is possible in just a few minutes of crunching. I've used interactive coding agents like Claude Code and Codex CLI for a bunch of these, but today I'm increasingly turning to their asynchronous coding agent family members instead. An asynchronous coding agent is a coding agent that operates on a fire-and-forget basis. You pose it a task, it churns away on a server somewhere and when it's done it files a pull request against your chosen GitHub repository. OpenAI's Codex Cloud , Anthropic's Claude Code for web , Google Gemini's Jules , and GitHub's Copilot coding agent are four prominent examples of this pattern. These are fantastic tools for code research projects. Come up with a clear goal, turn it into a few paragraphs of prompt, set them loose and check back ten minutes later to see what they've come up with. I'm firing off 2-3 code research projects a day right now. My own time commitment is minimal and they frequently come back with useful or interesting results. You can run a code research task against an existing GitHub repository, but I find it's much more liberating to have a separate, dedicated repository for your coding agents to run their projects in. This frees you from being limited to research against just code you've already written, and also means you can be much less cautious about what you let the agents do. I have two repositories that I use for this - one public, one private. I use the public one for research tasks that have no need to be private, and the private one for anything that I'm not yet ready to share with the world. The biggest benefit of a dedicated repository is that you don't need to be cautious about what the agents operating in that repository can do. Both Codex Cloud and Claude Code for web default to running agents in a locked-down environment, with strict restrictions on how they can access the network. This makes total sense if they are running against sensitive repositories - a prompt injection attack of the lethal trifecta variety could easily be used to steal sensitive code or environment variables. If you're running in a fresh, non-sensitive repository you don't need to worry about this at all! I've configured my research repositories for full network access, which means my coding agents can install any dependencies they need, fetch data from the web and generally do anything I'd be able to do on my own computer. Let's dive into some examples. My public research repository is at simonw/research on GitHub. It currently contains 13 folders, each of which is a separate research project. I only created it two weeks ago so I'm already averaging nearly one a day! It also includes a GitHub Workflow which uses GitHub Models to automatically update the README file with a summary of every new project, using Cog , LLM , llm-github-models and this snippet of Python . Here are a some example research projects from the repo. node-pyodide shows an example of a Node.js script that runs the Pyodide WebAssembly distribution of Python inside it - yet another of my ongoing attempts to find a great way of running Python in a WebAssembly sandbox on a server. python-markdown-comparison ( transcript ) provides a detailed performance benchmark of seven different Python Markdown libraries. I fired this one off because I stumbled across cmarkgfm , a Python binding around GitHub's Markdown implementation in C, and wanted to see how it compared to the other options. This one produced some charts! came out on top by a significant margin: Here's the entire prompt I used for that project: Create a performance benchmark and feature comparison report on PyPI cmarkgfm compared to other popular Python markdown libraries - check all of them out from github and read the source to get an idea for features, then design and run a benchmark including generating some charts, then create a report in a new python-markdown-comparison folder (do not create a _summary.md file or edit anywhere outside of that folder). Make sure the performance chart images are directly displayed in the README.md in the folder. Note that I didn't specify any Markdown libraries other than - Claude Code ran a search and found the other six by itself. cmarkgfm-in-pyodide is a lot more fun. A neat thing about having all of my research projects in the same repository is that new projects can build on previous ones. Here I decided to see how hard it would be to get - which has a C extension - working inside Pyodide inside Node.js. Claude successfully compiled a 88.4KB file with the necessary C extension and proved it could be loaded into Pyodide in WebAssembly inside of Node.js. I ran this one using Claude Code on my laptop after an initial attempt failed. The starting prompt was: Figure out how to get the cmarkgfm markdown lover [typo in prompt, this should have been "library" but it figured it out anyway] for Python working in pyodide. This will be hard because it uses C so you will need to compile it to pyodide compatible webassembly somehow. Write a report on your results plus code to a new cmarkgfm-in-pyodide directory. Test it using pytest to exercise a node.js test script that calls pyodide as seen in the existing node.js and pyodide directory There is an existing branch that was an initial attempt at this research, but which failed because it did not have Internet access. You do have Internet access. Use that existing branch to accelerate your work, but do not commit any code unless you are certain that you have successfully executed tests that prove that the pyodide module you created works correctly. This one gave up half way through, complaining that emscripten would take too long. I told it: Complete this project, actually run emscripten, I do not care how long it takes, update the report if it works It churned away for a bit longer and complained that the existing Python library used CFFI which isn't available in Pyodide. I asked it: Can you figure out how to rewrite cmarkgfm to not use FFI and to use a pyodide-friendly way of integrating that C code instead? ... and it did. You can see the full transcript here . blog-tags-scikit-learn . Taking a short break from WebAssembly, I thought it would be fun to put scikit-learn through its paces on a text classification task against my blog: Work in a new folder called blog-tags-scikit-learn Download - a SQLite database. Take a look at the blog_entry table and the associated tags - a lot of the earlier entries do not have tags associated with them, where the later entries do. Design, implement and execute models to suggests tags for those earlier entries based on textual analysis against later ones Use Python scikit learn and try several different strategies Produce JSON of the results for each one, plus scripts for running them and a detailed markdown description Also include an HTML page with a nice visualization of the results that works by loading those JSON files. This resulted in seven files, four results files and a detailed report . (It ignored the bit about an HTML page with a nice visualization for some reason.) Not bad for a few moments of idle curiosity typed into my phone! That's just three of the thirteen projects in the repository so far. The commit history for each one usually links to the prompt and sometimes the transcript if you want to see how they unfolded. More recently I added a short file to the repo with a few extra tips for my research agents. You can read that here . My preferred definition of AI slop is AI-generated content that is published without human review. I've not been reviewing these reports in great detail myself, and I wouldn't usually publish them online without some serious editing and verification. I want to share the pattern I'm using though, so I decided to keep them quarantined in this one public repository. A tiny feature request for GitHub: I'd love to be able to mark a repository as "exclude from search indexes" such that it gets labelled with tags. I still like to keep AI-generated content out of search, to avoid contributing more to the dead internet . It's pretty easy to get started trying out this coding agent research pattern. Create a free GitHub repository (public or private) and let some agents loose on it and see what happens. You can run agents locally but I find the asynchronous agents to be more convenient - especially as I can run them (or trigger them from my phone) without any fear of them damaging my own machine or leaking any of my private data. Claude Code for web offers a free $250 of credits for their $20/month users for a limited time (until November 18, 2025). Gemini Jules has a free tier . There are plenty of other coding agents you can try out as well. Let me know if your research agents come back with anything interesting! You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Code research Coding agents Asynchronous coding agents Give them a dedicated GitHub repository Let them rip with unlimited network access My simonw/research collection This is total slop, of course Try it yourself

0 views
James O'Claire 2 days ago

How creepy is the personalization in ChatGPT?

I’ve been pretty cavalier with using AI. I think once I got used to not fully trusting it’s truthfulness, and instead using it like a teacher that I question and verify. But this past month I’ve been getting more uncomfortable with the answers. Especially ones that I can see are digging up little nuggets of personal information I dropped in over the past year: This is something I’ve looked up half a dozen times. I’ve never used it, debated with friends multiple times it’s usefulness vs SSH. So when I put in the short prompt, I was more just wanting to revist the main talking points in the Tailscale vs SSH debate I’ve had in my head. After the main response, it provides this personalized summary and drops this little nugget of my personal life in, that I do work on my parent’s solar powered off-grid home where I visit a couple times a year. I can’t put my finger on why this bothered me so much. I’m proud of my parents house, I’ll tell anyone about it. I’ve certainly mentioned this to ChatGPT, I definitely used it last year when I built a new solar array for my parents. You can see the picture below building the new one with the older panels I built 12 years ago in the back. So why would it bother me so much? Was it the cognitive dissonance? I’m thinking about tailscale, and it is talking incorrectly about my parents who I miss? Is it that it dug up information about me from a year ago that I forgot, or never really thought about, that it would remember? I mean obviously, I’m on their website, they have my IP. But ChatGPT brings up my location like this fairly often, I think any time I mention a prompt about a product, which I do oftenish as I’ve been curious about how they’ll handle the advertising / product placements. That being said, something about the way it brings up the location, again feels off putting. DuckDuckGo and Google will use IP based location all the time and I’ve never been too bothered by it. But there’s something about the way ChatGPT brings it up, oddly mixing “look up pricing in” with the later “here” as if it’s here with me. Just definitely getting bad vibes. Chunks of code I copy paste into a git repo is like a little fingerprint that can always tie that code back to a moment in time I talked to that instance of OpenAI’s ChatGPT. Little chunks of me that I type into the background of my prompts tie more of my life to ChatGPT, and in ways that it will never forget. I’m not sure what the answer is yet. Maybe OpenAI will smooth out the awkwardness of how it will always remember, if it wants, everything you’ve ever typed to it. My hope is that open local models will become efficient enough to run locally on laptops or small home PCs and deliver private AI chats, but that seems like it’s far off for small budgets.

0 views

A Computing Procedure for Quantification Theory

A Computing Procedure for Quantification Theory Martin Davis and Hilary Putnam JACM Volume 7, Issue 3 This is the oldest paper I’ve summarized. I think SAT solvers are magic and wanted to peek inside the black box. This paper describes a precursor to the full DPLL algorithm. About half of the paper is about dealing with existential and universal quantifiers, which I’m not describing here. The input to the algorithm described here is a logic formula in conjunctive normal form (CNF). Conjunctive normal form comprises a set of clauses. A clause is the logical OR of a set of literals. A literal is either a variable, or the negation of a variable. Logical AND is used to combine all clauses. For example: There are straightforward algorithms to convert any logic formula into CNF (and more complicated ones like the Tseytin transformation ). The goal of this algorithm is to determine if the input logic formula is satisfiable. If it is satisfiable, then the algorithm produces a list, assigning each variable to a specific value ( or ). An alternative normal form would be disjunctive normal form (DNF), but that isn’t practical. If you think about it, any algorithm that outputs DNF effectively outputs a list of all possible solutions to the SAT problem (each clause is a possible solution). So, either the list is small and SAT solving really is easy, or most interesting problems would be prohibitively large in DNF form. This paper even cites a textbook which says that the act of putting a formula into DNF “automatically reveals whether or not the formula is inconsistent”. The search for a satisfying assignment (i.e., mapping of variable names to values) is described as the iterative application of three rules. The third rule contains the magic necessary to keep the search going. If a variable appears in a clause that only contains a single literal, then the variable can be assigned a value consistent with the polarity of the literal. For example, in the following formula, must be . If all appearances of a variable have the same polarity, then that variable can be assigned to a value consistent with that polarity. For example, in the following formula, can safely be assigned to false. Again, the algorithm notes the value of and then simplifies the formula by substituting that value for . This one is interesting in that subsequent work found an equivalent rule which is more efficient. The first step in applying rule III is to pick a variable (the choice can affect performance, but this paper doesn’t offer much guidance on how to choose), and refactor the formula such that it has the following form: Where , , and are expressions that do not contain . The paper doesn’t explicitly say this, but it seems this must cause the formula to temporarily leave CNF. Here is an example conversion. Now that the formula has been refactored, recursively solve the simpler formula: In the example above, the simpler formula is: The intuition here is that must either be or . If is , then must be . If is , then must be . This paper introduces an equivalent Rule III*, which says to simply assume and check if the formula is satisfiable with that assumption. If not, then assume and check if the formula is satisfiable with that assumption. This avoids having to transform the formula into [ form, which “can easily increase the number and the lengths of the clauses in the expression rather quickly after several applications”. This paper has the greatest statement of an improvement in runtime: The superiority of the present procedure over those previously available is indicated in part by the fact that a formula on which Gilmore’s routine for the IBM 704 causes the machine to compute for 21 minutes without obtaining a result was worked successfully by hand computation using the present method in 30 minutes. Computers were slower in the 1960s. Imagine someone today coming up with an alternative way of training LLMs which can be done faster by hand than a GPU. This follow-on paper has an interesting observation about the choice of which variable to select in rule III: By rearranging the clauses of a formula a different order would in general be created. In some cases, whether or not the program could actually prove the validity of a given formula (without running out of fast access storage) depended on how one shuffled the punched-data deck before reading it into the assembler! Thus, the variation in ordering of constants did affect by a factor of 10 (from 500 to 5000) … In modern lingo, the branching heuristic matters a lot. Subscribe now

0 views
マリウス 2 days ago

Cameras, Cameras Everywhere!

We live in an age when a single walk down the street can put you inside at least a dozen different recording ecosystems at once: Fixed municipal CCTV, a bypassing police cruiser’s cameras or body-cam feeds, the license-plate cameras on light poles, the dash-, cabin-, and exterior cameras of nearby cloud-connected vehicles, Ring and Nest doorbells of residences that you might pass by, and the phones and wearables of other pedestrians passing you, that are quietly recording audio and/or video. Each of those systems was justified as a modest safety, convenience, or product feature, yet when stitched together they form a surveillance fabric that reaches far beyond its original intent. Instead of only looking at the big picture all these individual systems paint, let’s instead focus on each individual area and uncover some of the actors complicit in the making of this very surveillance machinery that they profit immensely from. Note: The lists below only mention a few of the most prominent enablers and profiteurs. CCTV is not new, but it’s booming. Market reports show the global video-surveillance/CCTV market measured in tens of billions of dollars and growing rapidly as governments and businesses deploy these solutions. A continued double-digit market growth over the next several years is expected. Cameras haven’t been reliably proven to reduce crime at scale, and the combination of live feeds, long-term storage and automated analytics (including behavior detection and face matching) enable discriminatory policing and concentrate a huge trove of intimate data without adequate oversight. Civil liberties groups and scholars argue CCTV expansion is often implemented with weak limits on access, retention, and third-party sharing. In addition, whenever tragedy strikes it seems like “more video surveillance, now powered by AI” is always the first response: More CCTV to be installed in train stations after knife attack Heidi Alexander has announced that the Government will invest in “improved” CCTV systems across the network, and that facial recognition could be introduced in stations following Saturday’s attack. “We are investing in improved CCTV in stations and the Home Office will soon be launching a consultation on more facial recognition technology which could be deployed in stations as well. So we take the safety of the travelling public incredibly seriously.” Automatic license-plate readers (ALPRs) used to be a tool for parking enforcement and specific investigations, but firms like Flock Safety have taken ALPRs into a new phase by offering cloud-hosted, networked plate-reading systems to neighborhoods, municipalities and private groups. The result is a searchable movement history for any car observed by the network. Supporters point to solved car thefts and missing-person leads. However, clearly these systems amount to distributed mass surveillance, with weak governance and potential for mission creep (including law-enforcement or immigration enforcement access). The ACLU and other groups have documented this tension and pressed for limits. Additionally there has been a plethora of media frenzy on specifically Flock Safety’s products and their reliability : A retired veteran named Lee Schmidt wanted to know how often Norfolk, Virginia’s 176 Flock Safety automated license-plate-reader cameras were tracking him. The answer, according to a U.S. District Court lawsuit filed in September, was more than four times a day, or 526 times from mid-February to early July. No, there’s no warrant out for Schmidt’s arrest, nor is there a warrant for Schmidt’s co-plaintiff, Crystal Arrington, whom the system tagged 849 times in roughly the same period. ( via Jalopnik ) Police departments now carry many more mobile recording tools than a decade ago, that allow the city’s static CCTV to be extended dynamically: Vehicle dash cameras, body-worn cameras (BWCs), and in some places live-streaming CCTV or automated alerts pushed to officers’ phones. Bodycams were originally promoted as accountability tools, and they have provided useful evidence, but they also create new data flows that can be fused with other systems (license-plate databases, facial-recognition engines, location logs), multiplying privacy and misuse risks. Many researchers, advocacy groups and watchdogs warn that pairing BWCs with facial recognition or AI analytics can make ubiquitous identification possible, and that policies and safeguards are lagging . Recent reporting has uncovered operations where real-time facial-recognition systems were used in ways not disclosed to local legislatures or the public, demonstrating how rapidly policy gets outpaced by deployment. One of many recent examples consists of an extended secret live-face-matching program in New Orleans that led to arrests and subsequent controversy about legality and oversight. Drones and aerial systems add another layer. Airborne or rooftop cameras can rapidly expand coverage areas and make “seeing everything” more practical, with similar debates about oversight, warranting, and civil-liberties protections. Modern cars increasingly ship with external and internal cameras, radar, microphones and cloud connections. Tesla specifically has been a headline example where in-car and exterior cameras record for features like Sentry Mode, Autopilot/FSD development, and safety investigations. Reporting has shown that internal videos captured by cars have, on multiple occasions, been accessed by company personnel and shared outside expected channels, sparking alarm about how that sensitive footage is handled. Videos of private interiors, garages and accidents have leaked, and workers have admitted to circulating clips . Regulators, privacy groups and media have flagged the risks of always-on vehicle cameras whose footage can be used beyond owners’ expectations. Automakers and suppliers are rapidly adding cameras for driver monitoring, ADAS (advanced driver-assistance systems), and event recording, which raises questions about consent when cars record passengers, passers-by, or are subject to remote access by manufacturers, insurers or law enforcement, especially with cloud-connected vehicles. Ring doorbells and other cloud-connected home security cameras have created an informal, semi-public surveillance layer. Millions of privately owned cameras facing streets and porches that can be searched, shared, and, in many jurisdictions, accessed by police via relationships or tools. Amazon’s Ring drew intense scrutiny for police partnerships and for security practices that at times exposed footage to unauthorized access. A private company mediates a vast public-facing camera network, and incentives push toward more sharing, not less. Another recent example of creeping features, Ring’s “Search Party” AI pet-finder feature (enabled by default), also raised fresh concerns about consent and the expansion of automated scanning on users’ cloud footage. While smartphones don’t (yet) record video all by themselves, the idea that our phones and earbuds “listen” only when we ask them has been punctured repeatedly. Investigations disclosed that contractors for Apple, Google and Amazon listened to small samples of voice-assistant recordings, often including accidentally captured private conversations, to train and improve models. There have also been appalling edge cases, like smart speakers accidentally sending recordings to contacts, or assistants waking and recording without clear triggers. These incidents underline how easily ambient audio can become recorded, labeled and routed into human or machine review. With AI assistants (Siri, Gemini, etc.) integrated on phones and wearables, for which processing often requires sending audio or text to the cloud, new features make it even harder for users to keep control of what’s retained, analyzed, or used to personalize models. A recent crop of AI wearables, like Humane ’s AI Pin , the Friend AI pendants and similar always-listening companions, aim to deliver an AI interface that’s untethered from a phone. They typically depend on continuous audio capture and sometimes even outward-facing cameras for vision features. The devices sparked two predictable controversies: Humane ’s AI Pin drew mixed reviews, questions about “trust lights” and bystander notice, and eventually a shutdown/asset sale that stranded some buyers, which is yet another example of how the technology and business models create risks for both privacy and consumers. Independent wearables like Friend have also raised alarm among reviewers about always-listening behavior without clear opt-out tools. Even though these devices might not necessarily have cameras (yet) to record video footage, they usually come with always-on microphones and can, at the very least, scan for nearby Bluetooth and WiFi devices to collect valuable insights on the user’s surroundings and, more precisely, other users in close proximity. A device category that banks primarily on its video recording capabilities are smart glasses. Unlike the glassholes from a decade ago, this time it seems fashionable and socially accepted to wear the latest cloud-connected glasses. Faced with the very same issues mentioned previously for different device types, smart glasses, too, create immense risks for privacy, with little to no policy in place to protect bystanders . There are several satellite constellations in orbit that house advanced imaging satellites capable of capturing high-resolution, close-up images of Earth’s surface, sometimes referred to as “spy satellites” . These satellites provide a range of services, from military reconnaissance to commercial imagery. Notable constellations by private companies include GeoEye ’s GeoEye-1 , Maxar ’s WorldView , Airbus ’ Pléiades , Spot Image ’s SPOT , and Planet Labs ’ RapidEye , Dove and SkySat . Surveillance tech frequently arrives with a compelling use case, like detering car theft, finding a missing child, automating a customer queue, or making life easier with audio and visual interactions. But it also tends to become infrastructural and persistent. When private corporations, local governments and individual citizens all accumulate recordings, we end up with a mosaic of surveillance that’s hard to govern because it’s distributed across actors with different incentives. In addition, surveillance technologies rarely affect everyone equally. Studies and analyses show disproportionate impacts on already-targeted communities, with increased policing, mistaken identifications from biased models, and chilling effects on protest, religion or free association. These systems entrench existing power imbalances and are primarily benefitial to the people in charge of watching rather than the majority that’s being watched . Ultimately, surveillance not only makes us more visible, but we’re also more persistently recorded, indexed and analyzable than ever before. Each camera, microphone and AI assistant may be framed as a single, sensible feature. Taken together, however, they form a dense information layer about who we are, where we go and how we behave. The public debate now needs to shift from “Can we build this?” to “Do we really want this?” . For that, we need an informed public that understands the impact of all these individual technologies and what it’s being asked to give up in exchange for the perceived sense of safety these systems offer. Avigilon (Motorola Solutions) Axis Communications Bosch Security Systems Sony Professional Axis Communications Bosch Security Systems Flock Safety Kapsch TrafficCom Motorola Solutions (WatchGuard) PlateSmart Technologies Digital Ally Kustom Signals Motorola Solutions (WatchGuard) Transcend Information Flock Safety Lockheed Martin (Procerus Technologies) Quantum Systems Mercedes-Benz Eufy Security Nest Hello (Google) Ring (Amazon) SkyBell (Honeywell) Bystander privacy (how do you notify people they’re being recorded?) Vendor and lifecycle risk (cloud dependence, subscription models, and what happens to device functionality or stored data if a startup folds) Gentle Monster Gucci (+ Snap) Oakley (+ Meta) Ray-Ban (+ Meta) Spectacles (Snap) BAE Systems General Dynamics (SATCOM) Thales Alenia Space

0 views
Robin Moffatt 3 days ago

Tech Radar (Nov 2025) - data blips

The latest  Thoughtworks TechRadar  is out. Here are some of the more data-related ‘blips’ (as they’re called on the radar) that I noticed. Each item links to the blip’s entry where you can read more information about Thoughtwork’s usage and opinions on it. Databricks Assistant Apache Paimon Delta Sharing Naive API-to-MCP conversion Standalone data engineering teams Text to SQL

0 views
Kaushik Gopal 3 days ago

Cognitive Burden

A common argument I hear against AI tools: “It doesn’t do the job better or faster than me, so why am I using this again?” Simple answer: cognitive burden. My biggest unlock with AI was realizing I could get more done, not because I was faster , but because I wasn’t wringing my brain with needless tedium. Even if it took longer or needed more iterations, I’d finish less exhausted. That was the aha moment that sold me. Simple example: when writing a technical 1 post, I start with bullet points. Sometimes there’s a turn of phrase or a bit of humor I enjoy, and I’ll throw those in too. Then a custom agent trained on my writing generates a draft in my voice. After it drafts, I still review every single word. A naysayer might ask: “Well, if you’re reviewing every single word anyway, at that point, why not just write the post from scratch?” Because it’s dramatically easier and more enjoyable not to grind through and string together a bunch of prepositions to draft the whole post. I’ve captured the main points and added my creative touch; the AI handles the rest. With far less effort , I can publish more quickly — not due to raw speed, but because it’s low‑touch and I focus only on what makes it uniquely me. Cognitive burden ↓. About two years ago I pushed back on our CEO in a staff meeting: “Most of the time we engineers waste isn’t in writing the code. It’s the meetings, design discussions, working with PMs, fleshing out requirements — that’s where we should focus our AI efforts first.” 2 I missed the same point. Yes, I enjoy crafting every line of code and I’m not bogged down by that process per se, but there’s a cognitive tax to pay. I’d even say I could still build a feature faster than some LLMs today (accounting for quality and iterations) before needing to take a break and recharge. Now I typically have 3–4 features in flight (with requisite docs, tests, and multiple variants to boot). Yes, I’m more productive. And sure, I’m probably shipping faster. But that’s correlation, not causation. Speed is a byproduct. The real driver is less cognitive burden, which lets me carry more. What’s invigorated me further as a product engineer is that I’m spending a lot more time on actually building a good product . It’s not that I don’t know how to write every statement; it’s just… no longer interesting. Others feel differently. Great! To each their own. For me, that was the aha moment that sold me on AI. Reducing cognitive burden made me more effective; everything else followed. I still craft the smaller personal posts from scratch. I do this mostly because it helps evolve my thinking as I write each word down — a sort of muscle memory formed over the years of writing here.  ↩︎ In hindsight, maybe not one of my finest arguments especially given my recent fervor . To be fair, while I concede my pushback was wrong, I don’t think leaders then had the correct reasoning fully synthesized.  ↩︎ I still craft the smaller personal posts from scratch. I do this mostly because it helps evolve my thinking as I write each word down — a sort of muscle memory formed over the years of writing here.  ↩︎ In hindsight, maybe not one of my finest arguments especially given my recent fervor . To be fair, while I concede my pushback was wrong, I don’t think leaders then had the correct reasoning fully synthesized.  ↩︎

0 views
devansh 3 days ago

AI pentest scoping playbook

Disclosure: Certain sections of this content were grammatically refined/updated using AI assistance, as English is not my first language. Organizations are throwing money at "AI red teams" who run a few prompt injection tests, declare victory, and cash checks. Security consultants are repackaging traditional pentest methodologies with "AI" slapped on top, hoping nobody notices they're missing 80% of the actual attack surface. And worst of all, the people building AI systems, the ones who should know better, are scoping engagements like they're testing a CRUD app from 2015. This guide/playbook exists because the current state of AI security testing is dangerously inadequate. The attack surface is massive. The risks are novel. The methodologies are immature. And the consequences of getting it wrong are catastrophic. These are my personal views, informed by professional experience but not representative of my employer. What follows is what I wish every CISO, security lead, and AI team lead understood before they scoped their next AI security engagement. Traditional web application pentests follow predictable patterns. You scope endpoints, define authentication boundaries, exclude production databases, and unleash testers to find SQL injection and XSS. The attack surface is finite, the vulnerabilities are catalogued, and the methodologies are mature. AI systems break all of that. First, the system output is non-deterministic . You can't write a test case that says "given input X, expect output Y" because the model might generate something completely different next time. This makes reproducibility, the foundation of security testing, fundamentally harder. Second, the attack surface is layered and interconnected . You're not just testing an application. You're testing a model (which might be proprietary and black-box), a data pipeline (which might include RAG, vector stores, and real-time retrieval), integration points (APIs, plugins, browser tools), and the infrastructure underneath (cloud services, containers, orchestration). Third, novel attack classes exist that don't map to traditional vuln categories . Prompt injection isn't XSS. Data poisoning isn't SQL injection. Model extraction isn't credential theft. Jailbreaks don't fit CVE taxonomy. The OWASP Top 10 doesn't cover this. Fourth, you might not control the model . If you're using OpenAI's API or Anthropic's Claude, you can't test the training pipeline, you can't audit the weights, and you can't verify alignment. Your scope is limited to what the API exposes, which means you're testing a black box with unknown internals. Fifth, AI systems are probabilistic, data-dependent, and constantly evolving . A model that's safe today might become unsafe after fine-tuning. A RAG system that's secure with Dataset A might leak PII when Dataset B is added. An autonomous agent that behaves correctly in testing might go rogue in production when it encounters edge cases. This isn't incrementally harder than web pentesting. It's just fundamentally different. And if your scope document looks like a web app pentest with "LLM" find-and-replaced in, you're going to miss everything that matters. Before you can scope an AI security engagement, you need to understand what you're actually testing. And most organizations don't. Here's the stack: This is the thing everyone focuses on because it's the most visible. But "the model" isn't monolithic. Base model : Is it GPT-4? Claude? Llama 3? Mistral? A custom model you trained from scratch? Each has different vulnerabilities, different safety mechanisms, different failure modes. Fine-tuning : Have you fine-tuned the base model on your own data? Fine-tuning can break safety alignment. It can introduce backdoors. It can memorize training data and leak it during inference. If you've fine-tuned, that's in scope. Instruction tuning : Have you applied instruction-tuning or RLHF to shape model behavior? That's another attack surface. Adversaries can craft inputs that reverse your alignment work. Multi-model orchestration : Are you running multiple models and aggregating outputs? That introduces new failure modes. What happens when Model A says "yes" and Model B says "no"? How do you handle consensus? Can an adversary exploit disagreements? Model serving infrastructure : How is the model deployed? Is it an API? A container? Serverless functions? On-prem hardware? Each deployment model has different security characteristics. AI systems don't just run models. They feed data into models. And that data pipeline is massive attack surface. Training data : Where did the training data come from? Who curated it? How was it cleaned? Is it public? Proprietary? Scraped? Licensed? Can an adversary poison the training data? RAG (Retrieval-Augmented Generation) : Are you using RAG to ground model outputs in retrieved documents? That's adding an entire data retrieval system to your attack surface. Can an adversary inject malicious documents into your knowledge base? Can they manipulate retrieval to leak sensitive docs? Can they poison the vector embeddings? Vector databases : If you're using RAG, you're running a vector database (Pinecone, Weaviate, Chroma, etc.). That's infrastructure. That has vulnerabilities. That's in scope. Real-time data ingestion : Are you pulling live data from APIs, databases, or user uploads? Each data source is a potential injection point. Data preprocessing : How are inputs sanitized before hitting the model? Are you stripping dangerous characters? Validating formats? Filtering content? Attackers will test every preprocessing step for bypasses. Models don't exist in isolation. They're integrated into applications. And those integration points are attack surface. APIs : How do users interact with the model? REST APIs? GraphQL? WebSockets? Each has different attack vectors. Authentication and authorization : Who can access the model? How are permissions enforced? Can an adversary escalate privileges? Rate limiting : Can an adversary send 10,000 requests per second? Can they DOS your model? Can they extract the entire training dataset via repeated queries? Logging and monitoring : Are you logging inputs and outputs? If yes, are you protecting those logs from unauthorized access? Logs containing sensitive user queries are PII. Plugins and tool use : Can the model call external APIs? Execute code? Browse the web? Use tools? Every plugin is an attack vector. If your model can execute Python, an adversary will try to get it to run . Multi-turn conversations : Do users have multi-turn dialogues with the model? Multi-turn interactions create new attack surfaces because adversaries can condition the model over multiple turns, bypassing safety mechanisms gradually/ If you've built agentic systems, AI that can plan, reason, use tools, and take actions autonomously, you've added an entire new dimension of attack surface. Tool access : What tools can the agent use? File system access? Database queries? API calls? Browser automation? The more powerful the tools, the higher the risk. Planning and reasoning : How does the agent decide what actions to take? Can an adversary manipulate the planning process? Can they inject malicious goals? Memory systems : Do agents have persistent memory? Can adversaries poison that memory? Can they extract sensitive information from memory? Multi-agent coordination : Are you running multiple agents that coordinate? Can adversaries exploit coordination protocols? Can they cause agents to turn on each other or collude against safety mechanisms? Escalation paths : Can an agent escalate privileges? Can it access resources it shouldn't? Can it spawn new agents? AI systems run on infrastructure. That infrastructure has traditional security vulnerabilities that still matter. Cloud services : Are you running on AWS, Azure, GCP? Are your S3 buckets public? Are your IAM roles overly permissive? Are your API keys hardcoded in repos? Containers and orchestration : Are you using Docker, Kubernetes? Are your container images vulnerable? Are your registries exposed? Are your secrets managed properly? CI/CD pipelines : How do you deploy model updates? Can an adversary inject malicious code into your pipeline? Dependencies : Are you using vulnerable Python libraries? Compromised npm packages? Poisoned PyPI distributions? Secrets management : Where are your API keys, database credentials, and model weights stored? Are they in environment variables? Config files? Secret managers? How much of that did you include in your last AI security scope document? If the answer is "less than 60%", your scope is inadequate. And you're going to get breached by someone who understands the full attack surface. The OWASP Top 10 for LLM Applications is the closest thing we have to a standardized framework for AI security testing. If you're scoping an AI engagement and you haven't mapped every item in this list to your test plan, you're doing it wrong. Here's the 2025 version: That's your baseline. But if you stop there, you're missing half the attack surface. The OWASP LLM Top 10 is valuable, but it's not comprehensive. Here's what's missing: Safety ≠ security . But unsafe AI systems cause real harm, and that's in scope for red teaming. Alignment failures : Can the model be made to behave in ways that violate its stated values? Constitutional AI bypass : If you're using constitutional AI techniques (like Anthropic's Claude), can adversaries bypass the constitution? Bias amplification : Does the model exhibit or amplify demographic biases? This isn't just an ethics issue—it's a legal risk under GDPR, EEOC, and other regulations. Harmful content generation : Can the model be tricked into generating illegal, dangerous, or abusive content? Deceptive behavior : Can the model lie, manipulate, or deceive users? Traditional adversarial ML attacks apply to AI systems. Evasion attacks : Can adversaries craft inputs that cause misclassification? Model inversion : Can adversaries reconstruct training data from model outputs? Model extraction : Can adversaries steal model weights through repeated queries? Membership inference : Can adversaries determine if specific data was in the training set? Backdoor attacks : Does the model have hidden backdoors that trigger on specific inputs? If your AI system handles multiple modalities (text, images, audio, video), you have additional attack surface. Cross-modal injection : Attackers embed malicious instructions in images that the vision-language model follows. Image perturbation attacks : Small pixel changes invisible to humans cause model failures. Audio adversarial examples : Audio inputs crafted to cause misclassification. Typographic attacks : Adversarial text rendered as images to bypass filters. Multi-turn multimodal jailbreaks : Combining text and images across multiple turns to bypass safety. AI systems must comply with GDPR, HIPAA, CCPA, and other regulations. PII handling : Does the model process, store, or leak personally identifiable information? Right to explanation : Can users get explanations for automated decisions (GDPR Article 22)? Data retention : How long is data retained? Can users request deletion? Cross-border data transfers : Does the model send data across jurisdictions? Before you write your scope document, answer every single one of these questions. If you can't answer them, you don't understand your system well enough to scope a meaningful AI security engagement. If you can answer all these questions, you're ready to scope. If you can't, you're not. Your AI pentest/engagement scope document needs to be more detailed than a traditional pentest scope. Here's the structure: What we're testing : One-paragraph description of the AI system. Why we're testing : Business objectives (compliance, pre-launch validation, continuous assurance, incident response). Key risks : Top 3-5 risks that drive the engagement. Success criteria : What does "passing" look like? Architectural diagram : Include everything—model, data pipelines, APIs, infrastructure, third-party services. Component inventory : List every testable component with owner, version, and deployment environment. Data flows : Document how data moves through the system, from user input to model output to downstream consumers. Trust boundaries : Identify where data crosses trust boundaries (user → app, app → model, model → tools, tools → external APIs). Be exhaustive. List: For each component, specify: Map every OWASP LLM Top 10 item to specific test cases. Example: LLM01 - Prompt Injection : Include specific threat scenarios: Explicitly list what's NOT being tested: Tools : List specific tools testers will use: Techniques : Test phases : Authorization : All testing must be explicitly authorized in writing. Include names, signatures, dates. Ethical boundaries : No attempts at physical harm, financial fraud, illegal content generation (unless explicitly scoped for red teaming). Disclosure : Critical findings must be disclosed immediately via designated channel (email, Slack, phone). Standard findings can wait for formal report. Data handling : Testers must not exfiltrate user data, training data, or model weights except as explicitly authorized for demonstration purposes. All test data must be destroyed post-engagement. Legal compliance : Testing must comply with all applicable laws and regulations. If testing involves accessing user data, appropriate legal review must be completed. Technical report : Detailed findings with severity ratings, reproduction steps, evidence (screenshots, logs, payloads), and remediation guidance. Executive summary : Business-focused summary of key risks and recommendations. Threat model : Updated threat model based on findings. Retest availability : Will testers be available for retest after fixes? Timeline : Start date, end date, report delivery date, retest window. Key contacts : That's your scope document. It should be 10-20 pages. If it's shorter, you're missing things. Here's what I see organizations get wrong: Mistake 1: Scoping only the application layer, not the model You test the web app that wraps the LLM, but you don't test the LLM itself. You find XSS and broken authz, but you miss prompt injection, jailbreaks, and data extraction. Fix : Scope the full stack-app, model, data pipelines, infrastructure. Mistake 2: Treating the model as a black box when you control it If you fine-tuned the model, you have access to training data and weights. Test for data poisoning, backdoors, and alignment failures. Don't just test the API. Fix : If you control any part of the model lifecycle (training, fine-tuning, deployment), include that in scope. Mistake 3: Ignoring RAG and vector databases You test the LLM, but you don't test the document store. Adversaries inject malicious documents, manipulate retrieval, and poison embeddings—and you never saw it coming. Fix : If you're using RAG, the vector database and document ingestion pipeline are in scope. Mistake 4: Not testing multi-turn interactions You test single-shot prompts, but adversaries condition the model over 10 turns to bypass refusal mechanisms. You missed the attack entirely. Fix : Test multi-turn dialogues explicitly. Test conversation history isolation. Test memory poisoning. Mistake 5: Assuming third-party models are safe You're using OpenAI's API, so you assume it's secure. But you're passing user PII in prompts, you're not validating outputs before execution, and you haven't considered what happens if OpenAI's safety mechanisms fail. Fix : Even with third-party models, test your integration. Test input/output handling. Test failure modes. Mistake 6: Not including AI safety in security scope You test for technical vulnerabilities but ignore alignment failures, bias amplification, and harmful content generation. Then your model generates racist outputs or dangerous instructions, and you're in the news. Fix : AI safety is part of AI security. Include alignment testing, bias audits, and harm reduction validation. Mistake 7: Underestimating autonomous agent risks You test the LLM, but your agent can execute code, call APIs, and access databases. An adversary hijacks the agent, and it deletes production data or exfiltrates secrets. Fix : Autonomous agents are their own attack surface. Test tool permissions, privilege escalation, and agent behavior boundaries. Mistake 8: Not planning for continuous testing You do one pentest before launch, then never test again. But you're fine-tuning weekly, adding new plugins monthly, and updating RAG documents daily. Your attack surface is constantly changing. Fix : Scope for continuous red teaming, not one-time assessment. Organizations hire expensive consultants to run a few prompt injection tests, declare the system "secure," and ship to production. Then they get breached six months later when someone figures out a multi-turn jailbreak or poisons the RAG document store. The problem isn't that the testers are bad. The problem is that the scopes are inadequate . You can't find what you're not looking for. If your scope doesn't include RAG poisoning, testers won't test for it. If your scope doesn't include membership inference, testers won't test for it. If your scope doesn't include agent privilege escalation, testers won't test for it. And attackers will. The asymmetry is brutal: you have to defend every attack vector. Attackers only need to find one that works. So when you scope your next AI security engagement, ask yourself: "If I were attacking this system, what would I target?" Then make sure every single one of those things is in your scope document. Because if it's not in scope, it's not getting tested. And if it's not getting tested, it's going to get exploited. Traditional pentests are point-in-time assessments. You test, you report, you fix, you're done. That doesn't work for AI systems. AI systems evolve constantly: Every change introduces new attack surface. And if you're only testing once a year, you're accumulating risk for 364 days. You need continuous red teaming . Here's how to build it: Use tools like Promptfoo, Garak, and PyRIT to run automated adversarial testing on every model update. Integrate tests into CI/CD pipelines so every deployment is validated before production. Set up continuous monitoring for: Quarterly or bi-annually, bring in expert red teams for comprehensive testing beyond what automation can catch. Focus deep assessments on: Train your own security team on AI-specific attack techniques. Develop internal playbooks for: Every quarter, revisit your threat model: Update your testing roadmap based on evolving threats. Scoping AI security engagements is harder than traditional pentests because the attack surface is larger, the risks are novel, and the methodologies are still maturing. But it's not impossible. You need to: If you do this right, you'll find vulnerabilities before attackers do. If you do it wrong, you'll end up in the news explaining why your AI leaked training data, generated harmful content, or got hijacked by adversaries. First, the system output is non-deterministic . You can't write a test case that says "given input X, expect output Y" because the model might generate something completely different next time. This makes reproducibility, the foundation of security testing, fundamentally harder. Second, the attack surface is layered and interconnected . You're not just testing an application. You're testing a model (which might be proprietary and black-box), a data pipeline (which might include RAG, vector stores, and real-time retrieval), integration points (APIs, plugins, browser tools), and the infrastructure underneath (cloud services, containers, orchestration). Third, novel attack classes exist that don't map to traditional vuln categories . Prompt injection isn't XSS. Data poisoning isn't SQL injection. Model extraction isn't credential theft. Jailbreaks don't fit CVE taxonomy. The OWASP Top 10 doesn't cover this. Fourth, you might not control the model . If you're using OpenAI's API or Anthropic's Claude, you can't test the training pipeline, you can't audit the weights, and you can't verify alignment. Your scope is limited to what the API exposes, which means you're testing a black box with unknown internals. Fifth, AI systems are probabilistic, data-dependent, and constantly evolving . A model that's safe today might become unsafe after fine-tuning. A RAG system that's secure with Dataset A might leak PII when Dataset B is added. An autonomous agent that behaves correctly in testing might go rogue in production when it encounters edge cases. Base model : Is it GPT-4? Claude? Llama 3? Mistral? A custom model you trained from scratch? Each has different vulnerabilities, different safety mechanisms, different failure modes. Fine-tuning : Have you fine-tuned the base model on your own data? Fine-tuning can break safety alignment. It can introduce backdoors. It can memorize training data and leak it during inference. If you've fine-tuned, that's in scope. Instruction tuning : Have you applied instruction-tuning or RLHF to shape model behavior? That's another attack surface. Adversaries can craft inputs that reverse your alignment work. Multi-model orchestration : Are you running multiple models and aggregating outputs? That introduces new failure modes. What happens when Model A says "yes" and Model B says "no"? How do you handle consensus? Can an adversary exploit disagreements? Model serving infrastructure : How is the model deployed? Is it an API? A container? Serverless functions? On-prem hardware? Each deployment model has different security characteristics. Training data : Where did the training data come from? Who curated it? How was it cleaned? Is it public? Proprietary? Scraped? Licensed? Can an adversary poison the training data? RAG (Retrieval-Augmented Generation) : Are you using RAG to ground model outputs in retrieved documents? That's adding an entire data retrieval system to your attack surface. Can an adversary inject malicious documents into your knowledge base? Can they manipulate retrieval to leak sensitive docs? Can they poison the vector embeddings? Vector databases : If you're using RAG, you're running a vector database (Pinecone, Weaviate, Chroma, etc.). That's infrastructure. That has vulnerabilities. That's in scope. Real-time data ingestion : Are you pulling live data from APIs, databases, or user uploads? Each data source is a potential injection point. Data preprocessing : How are inputs sanitized before hitting the model? Are you stripping dangerous characters? Validating formats? Filtering content? Attackers will test every preprocessing step for bypasses. APIs : How do users interact with the model? REST APIs? GraphQL? WebSockets? Each has different attack vectors. Authentication and authorization : Who can access the model? How are permissions enforced? Can an adversary escalate privileges? Rate limiting : Can an adversary send 10,000 requests per second? Can they DOS your model? Can they extract the entire training dataset via repeated queries? Logging and monitoring : Are you logging inputs and outputs? If yes, are you protecting those logs from unauthorized access? Logs containing sensitive user queries are PII. Plugins and tool use : Can the model call external APIs? Execute code? Browse the web? Use tools? Every plugin is an attack vector. If your model can execute Python, an adversary will try to get it to run . Multi-turn conversations : Do users have multi-turn dialogues with the model? Multi-turn interactions create new attack surfaces because adversaries can condition the model over multiple turns, bypassing safety mechanisms gradually/ Tool access : What tools can the agent use? File system access? Database queries? API calls? Browser automation? The more powerful the tools, the higher the risk. Planning and reasoning : How does the agent decide what actions to take? Can an adversary manipulate the planning process? Can they inject malicious goals? Memory systems : Do agents have persistent memory? Can adversaries poison that memory? Can they extract sensitive information from memory? Multi-agent coordination : Are you running multiple agents that coordinate? Can adversaries exploit coordination protocols? Can they cause agents to turn on each other or collude against safety mechanisms? Escalation paths : Can an agent escalate privileges? Can it access resources it shouldn't? Can it spawn new agents? Cloud services : Are you running on AWS, Azure, GCP? Are your S3 buckets public? Are your IAM roles overly permissive? Are your API keys hardcoded in repos? Containers and orchestration : Are you using Docker, Kubernetes? Are your container images vulnerable? Are your registries exposed? Are your secrets managed properly? CI/CD pipelines : How do you deploy model updates? Can an adversary inject malicious code into your pipeline? Dependencies : Are you using vulnerable Python libraries? Compromised npm packages? Poisoned PyPI distributions? Secrets management : Where are your API keys, database credentials, and model weights stored? Are they in environment variables? Config files? Secret managers? Alignment failures : Can the model be made to behave in ways that violate its stated values? Constitutional AI bypass : If you're using constitutional AI techniques (like Anthropic's Claude), can adversaries bypass the constitution? Bias amplification : Does the model exhibit or amplify demographic biases? This isn't just an ethics issue—it's a legal risk under GDPR, EEOC, and other regulations. Harmful content generation : Can the model be tricked into generating illegal, dangerous, or abusive content? Deceptive behavior : Can the model lie, manipulate, or deceive users? Evasion attacks : Can adversaries craft inputs that cause misclassification? Model inversion : Can adversaries reconstruct training data from model outputs? Model extraction : Can adversaries steal model weights through repeated queries? Membership inference : Can adversaries determine if specific data was in the training set? Backdoor attacks : Does the model have hidden backdoors that trigger on specific inputs? Cross-modal injection : Attackers embed malicious instructions in images that the vision-language model follows. Image perturbation attacks : Small pixel changes invisible to humans cause model failures. Audio adversarial examples : Audio inputs crafted to cause misclassification. Typographic attacks : Adversarial text rendered as images to bypass filters. Multi-turn multimodal jailbreaks : Combining text and images across multiple turns to bypass safety. PII handling : Does the model process, store, or leak personally identifiable information? Right to explanation : Can users get explanations for automated decisions (GDPR Article 22)? Data retention : How long is data retained? Can users request deletion? Cross-border data transfers : Does the model send data across jurisdictions? What base model are you using (GPT-4, Claude, Llama, Mistral, custom)? Is the model proprietary (OpenAI API) or open-source? Have you fine-tuned the base model? On what data? Have you applied instruction tuning, RLHF, or other alignment techniques? How is the model deployed (API, on-prem, container, serverless)? Do you have access to model weights? Can testers query the model directly, or only through your application? Are there rate limits? What are they? What's the model's context window size? Does the model support function calling or tool use? Is the model multimodal (vision, audio, text)? Are you using multiple models in ensemble or orchestration? Where did training data come from (public, proprietary, scraped, licensed)? Was training data curated or filtered? How? Is training data in scope for poisoning tests? Are you using RAG (Retrieval-Augmented Generation)? If RAG: What's the document store (vector DB, traditional DB, file system)? If RAG: How are documents ingested? Who controls ingestion? If RAG: Can testers inject malicious documents? If RAG: How is retrieval indexed and searched? Do you pull real-time data from external sources (APIs, databases)? How is input data preprocessed and sanitized? Is user conversation history stored? Where? For how long? Can users access other users' data? How do users interact with the model (web app, API, chat interface, mobile app)? What authentication mechanisms are used (OAuth, API keys, session tokens)? What authorization model is used (RBAC, ABAC, none)? Are there different user roles with different permissions? Is there rate limiting? At what levels (user, IP, API key)? Are inputs and outputs logged? Where? Who has access to logs? Are logs encrypted at rest and in transit? How are errors handled? Are error messages exposed to users? Are there webhooks or callbacks that the model can trigger? Can the model call external APIs? Which ones? Can the model execute code? In what environment? Can the model browse the web? Can the model read/write files? Can the model access databases? What permissions do plugins have? How are plugin outputs validated before use? Can users add custom plugins? Are plugin interactions logged? Do you have autonomous agents that plan and execute multi-step tasks? What tools can agents use? Can agents spawn other agents? Do agents have persistent memory? Where is it stored? How are agent goals and constraints defined? Can agents access sensitive resources (DBs, APIs, filesystems)? Can agents escalate privileges? Are there kill-switches or circuit breakers for agents? How is agent behavior monitored? What cloud provider(s) are you using (AWS, Azure, GCP, on-prem)? Are you using containers (Docker)? Orchestration (Kubernetes)? Where are model weights stored? Who has access? Where are API keys and secrets stored? Are secrets in environment variables, config files, or secret managers? How are dependencies managed (pip, npm, Docker images)? Have you scanned dependencies for known vulnerabilities? How are model updates deployed? What's the CI/CD pipeline? Who can deploy model updates? Are there staging environments separate from production? What safety mechanisms are in place (content filters, refusal training, constitutional AI)? Have you red-teamed for jailbreaks? Have you tested for bias across demographic groups? Have you tested for harmful content generation? Do you have human-in-the-loop review for sensitive outputs? What's your incident response plan if the model behaves unsafely? Can testers attempt to jailbreak the model? Can testers attempt prompt injection? Can testers attempt data extraction (training data, PII)? Can testers attempt model extraction or inversion? Can testers attempt DoS or resource exhaustion? Can testers poison training data (if applicable)? Can testers test multi-turn conversations? Can testers test RAG document injection? Can testers test plugin abuse? Can testers test agent privilege escalation? Are there any topics, content types, or test methods that are forbidden? What's the escalation process if critical issues are found during testing? What regulations apply (GDPR, HIPAA, CCPA, FTC, EU AI Act)? Do you process PII? What types? Do you have data processing agreements with model providers? Do you have the legal right to test this system? Are there export control restrictions on the model or data? What are the disclosure requirements for findings? What's the confidentiality agreement for testers? Model(s) : Exact model names, versions, access methods APIs : All endpoints with authentication requirements Data stores : Databases, vector stores, file systems, caches Integrations : Every third-party service, plugin, tool Infrastructure : Cloud accounts, containers, orchestration Applications : Web apps, mobile apps, admin panels Access credentials testers will use Environments (dev, staging, prod) that are in scope Testing windows (if limited) Rate limits or usage restrictions Test direct instruction override Test indirect injection via RAG documents Test multi-turn conditioning Test system prompt extraction Test jailbreak techniques (roleplay, hypotheticals, encoding) Test cross-turn memory poisoning "Can an attacker leak other users' conversation history?" "Can an attacker extract training data containing PII?" "Can an attacker bypass content filters to generate harmful instructions?" Production environments (if testing only staging) Physical security Social engineering of employees Third-party SaaS providers we don't control Specific attack types (if any are prohibited) Manual testing Promptfoo for LLM fuzzing Garak for red teaming PyRIT for adversarial prompting ART (Adversarial Robustness Toolbox) for ML attacks Custom scripts for specific attack vectors Traditional tools (Burp Suite, Caido, Nuclei) for infrastructure Prompt injection testing Jailbreak attempts Data extraction attacks Model inversion Membership inference Evasion attacks RAG poisoning Plugin abuse Agent privilege escalation Infrastructure scanning Reconnaissance and threat modeling Automated vulnerability scanning Manual testing of high-risk areas Exploitation and impact validation Reporting and remediation guidance Engagement lead (security team) Technical point of contact (AI team) Escalation contact (for critical findings) Legal contact (for questions on scope) Models get fine-tuned RAG document stores get updated New plugins get added Agents gain new capabilities Infrastructure changes Prompt injection attempts Jailbreak successes Data extraction queries Unusual tool usage patterns Agent behavior anomalies Novel attack vectors that tools don't cover Complex multi-step exploitation chains Social engineering combined with technical attacks Agent hijacking and multi-agent exploits Prompt injection testing Jailbreak methodology RAG poisoning Agent security testing What new attacks have been published? What new capabilities have you added? What new integrations are in place? What new risks does the threat landscape present? Understand the full stack : model, data pipelines, application, infrastructure, agents, everything. Map every attack vector : OWASP LLM Top 10 is your baseline, not your ceiling. Answer scoping questions (mentioned above) : If you can't answer them, you don't understand your system. Write detailed scope documents : 10-20 pages, not 2 pages. Use the right tools : Promptfoo, Garak, ART, LIME, SHAP—not just Burp Suite. Test continuously : Not once, but ongoing. Avoid common mistakes : Don't ignore RAG, don't underestimate agents, don't skip AI safety.

0 views
Martin Fowler 4 days ago

The Learning Loop and LLMs

Unmesh Joshi finds LLMs to be a useful tool, but explains why their help becomes illusory if we use them to shortcut the learning loop that's an essential part of our professional practice.

0 views
Ahead of AI 4 days ago

Beyond Standard LLMs

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance. After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions with respect to what I think about alternative approaches. (I also recently gave a short talk about that at the PyTorch Conference 2025, where I also promised attendees to follow up with a write-up of these alternative approaches). So here it is! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 and many more. (The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.) Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. In code, a simplified version of the Gated DeltaNet depicted above (without the convolutional mixing) can be implemented as follows (the code is inspired by the official implementation by the Qwen3 team): (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. (And the final output gate, not shown in the snippet above, is similar to gated attention; it controls how much of the output is kept.) So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with context length. The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention. Gated DeltaNet, can, to some extend, still capture context, but it has to go through the memory ( S ) bottleneck. That memory is a fixed size and thus more efficient, but it compresses past context into a single hidden state similar to RNNs. That’s why the Qwen3-Next and Kimi Linear architectures don’t replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier. In the previous section, we discussed the advantage of the DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don’t grow the KV cache. (For more information about KV caching, see my Understanding and Coding the KV Cache in LLMs from Scratch article). Instead, as mentioned earlier, they keep a fixed-size recurrent state, so memory stays constant with context length. For a regular multi-head attention (MHA) layer, we can compute the KV cache size as follows: (The 2 multiplier is there because we have both keys and values that we store in the cache.) For the simplified DeltaNet version implemented above, we have: Note that the memory size doesn’t have a context length ( ) dependency. Also, we have only the memory state S that we store instead of separate keys and values, hence becomes just bytes. However, note that we now have a quadratic in here. This comes from the state: But that’s usually nothing to worry about, as the head dimension is usually relatively small. For instance, it’s 128 in Qwen3-Next. The full version with the convolutional mixing is a bit more complex, including the kernel size and so on, but the formulas above should illustrate the main trend and motivation behind the Gated DeltaNet. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world mode l , which serves as a mental simulation of the environment, which allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about mapping an alternative path to AI instead of LLMs. That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM) a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights. It’s important to note that TRM is not a language model operating on text. However, because (a) it’s a transformer-based architecture, (b) reasoning is now a central focus in LLM research, and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor) I decided to include it here. While TRM could be extended to textual question-answer tasks in the future, TRM currently works on grid-based inputs and outputs. In other words, both the “question” and the “answer” are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences. HRM consists of two small transformer modules (each 4 blocks) that communicate across recursion levels. TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that’s likely to make it easier to compare against HRM.) TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few. HRM includes an explicit halting mechanism to determine when to stop iterating. TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop iterating. Performance-wise, TRM performs really well compared to HRM, as shown in the figure below. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers leads to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required . Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length. While HRM and TRM achieve really good reasoning performance on these benchmarks, comparing them to large LLMs is not quite fair. HRM and TRM are specialized models for tasks like ARC, Sudoku, and Maze pathfinding, whereas LLMs are generalists. Sure, HRM and TRM can be adopted for other tasks as well, but they have to be specially trained on each task. So, in that sense, we can perhaps think of HRM and TRM as efficient pocket calculators, whereas LLM are more like computers, which can do a lot of other things as well. Still, these recursive architectures are exciting proof-of-concepts that highlight how small, efficient models can “reason” through iterative self-refinement. Perhaps, in the future, such models could act as reasoning or planning modules embedded within larger tool-using LLM systems. For now, LLMs remain ideal for broad tasks, but domain-specific recursive models like TRM can be developed to solve certain problems more efficiently once the target domain is well understood. Beyond the Sudoku, Maze finding, and ARC proof-of-concept benchmarks, there are possibly lots of use cases in the physics and biology domain where such models could find use. As an interesting tidbit, the author shared that it took less than $500 to train this model, with 4 H100s for around 2 days. I am delighted to see that it’s still possible to do interesting work without a data center. I originally planned to cover all models categories in the overview figure, but since the article ended up longer than I expected, I will have to save xLSTMs, Liquid Foundation Models, Transformer-RNN hybrids, and State Space Models for another time (although, Gated DeltaNet already gave a taste of State Space Models and recurrent designs.) As a conclusion to this article, I want to repeat the earlier words, i.e., that standard autoregressive transformer LLMs are proven and have stood the test of time so far. They are also, if efficiency is not the main factor, the best we have for now. Traditional Decoder-Style, Autoregressive Transformers + Proven & mature tooling + “well-understood” + Scaling laws + SOTA - Expensive training - Expensive inference (except for aforementioned tricks) If I were to start a new LLM-based project today, autoregressive transformer-based LLMs would be my first choice. I definitely find the upcoming attention hybrids very promising, which are especially interesting when working with longer contexts where efficiency is a main concern. Linear Attention Hybrids + Same as decoder-style transformers + Cuts FLOPs/KV memory at long-context tasks - Added complexity - Trades a bit of accuracy for efficiency On the more extreme end, text diffusion models are an interesting development. I’m still somewhat skeptical about how well they perform in everyday use, as I’ve only tried a few quick demos. Hopefully, we’ll soon see a large-scale production deployment with Google’s Gemini Diffusion that we can test on daily and coding tasks, and then find out how people actually feel about them. Text Diffusion Models + Iterative denoising is a fresh idea for text + Better parallelism (no next-token dependence) - Can’t stream answers - Doesn’t benefit from CoT? - Tricky tool-calling? - Solid models but not SOTA While the main selling point of text diffusion models is improved efficiency, code world models sit on the other end of the spectrum, where they aim to improve modeling performance. As of this writing, coding models, based on standard LLMs, are mostly improved through reasoning techniques, yet if you have tried them on trickier challenges, you have probably noticed that they (more or less) still fall short and can’t solve many of the trickier coding problems well. I find code world models particularly interesting and believe they could be an important next step toward developing more capable coding systems. Code World Model + Promising approach to improve code understanding + Verifiable intermediate states - Inclusion of executable code traces complicates training - Code running adds latency Lastly, we covered small recursive transformers such as hierarchical and tiny reasoning models. These are super interesting proof-of-concept models. However, as of today, they are primarily puzzle solvers, not general text or coding models. So, they are not in the same category as the other non-standard LLM alternatives covered in this article. Nonetheless, they are very interesting proofs-of-concept, and I am glad researchers are working on them. Right now, LLMs like GPT-5, DeepSeek R1, Kimi K2, and so forth are developed as special purpose models for free-form text, code, math problems and much more. They feel like brute-force and jack-of-all-trades approach that we use on a variety of tasks, from general knowledge questions to math and code. However, when we perform the same task repeatedly, such brute-force approaches become inefficient and may not even be ideal in terms of specialization. This is where tiny recursive transformers become interesting: they could serve as lightweight, task-specific models that are both efficient and purpose-built for repeated or structured reasoning tasks. Also, I can see them as potential “tools” for other tool-calling LLMs; for instance, when LLMs use Python or calculator APIs to solve math problems, special tiny reasoning models could fill this niche for other types of puzzle- or reasoning-like problems. Small Recursive Transformers + Very small architecture + Good generalization on puzzles - Special purpose models - Limited to puzzles (so far) This has been a long article, but I hope you discovered some of the fascinating approaches that often stay outside the spotlight of mainstream LLMs. And if you’ve been feeling a bit bored by the more or less conventional LLM releases, I hope this helped rekindle your excitement about AI again because there’s a lot of interesting work happening right now! This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) 1. Transformer-Based LLMs Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. 2. (Linear) Attention Hybrids Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . 2.1 Traditional Attention and Quadratic Costs The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. 2.2 Linear attention Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. 2.3 Linear Attention Revival In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. 2.4 Qwen3-Next Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? 2.5 Gated Attention Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. 2.6 Gated DeltaNet Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. 2.8 Kimi Linear vs. Qwen3-Next Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. 2.9 Kimi Delta Attention Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. 2.10 The Future of Attention Hybrids Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. 3. Text Diffusion Models A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). 3.1 Why Work on Text Diffusion? With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. 3.2 The Denoising Process The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. 3.3 Autoregressive vs Diffusion LLMs Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. 3.4 Text Diffusion Today It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. 4. World Models So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. 4.1 The Main Idea Behind World Models Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world mode l , which serves as a mental simulation of the environment, which allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about mapping an alternative path to AI instead of LLMs. 4.2 From Vision to Code That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. 4.3 Code World Models Vs Regular LLMs for Code So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. 5. Small Recursive Transformers You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM) a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. 5.1 What Does Recursion Mean Here? TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers leads to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required . Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length.

0 views
Martin Fowler 4 days ago

Fragments Nov 3

I’m very concerned about the security dangers of LLM-enabled browsers, as it’s just too easy for them to contain the Lethal Trifecta . For up-to-date eyes on these issues, I follow the writings of coiner of that phrase: Simon Willison. Here he examines a post on how OpenAI is thinking about these issues. My takeaways from all of this? It’s not done much to influence my overall skepticism of the entire category of browser agents, but it does at least demonstrate that OpenAI are keenly aware of the problems and are investing serious effort in finding the right mix of protections. ❄                ❄                ❄                ❄ Unsurprisingly, there are a lot of strong opinions on AI assisted coding. Some engineers swear by it. Others say it’s dangerous. And of course, as is the way with the internet, nuanced positions get flattened into simplistic camps where everyone’s either on one side or the other. A lot of the problem is that people aren’t arguing about the same thing. They’re reporting different experiences from different vantage points. His view is that beginners are very keen on AI-coding but they don’t see the problems they are creating. Experienced folks do see this, but it takes a further level of experience to realize that when used well these tools are still valuable. Interestingly, I’ve regularly seen sceptical experienced engineers change their view once they’ve been shown how you can blend modern/XP practices with AI assisted coding. The upshot is this, is that you have be aware of the experience level of whoever is writing about this stuff - and that experience is not just in software development generally, but also in how to make use of LLMs. One thing that rings clearly from reading Simon Willison and Birgitta Böckeler is that effective use of LLMs is a skill that takes a while to develop. ❄                ❄                ❄                ❄ Charlie Brown and Garfield, like most comic strip characters, never changed over the decades. But Doonesbury’s cast aged, had children, and some have died (I miss Lacey). Gary Trudeau retired from writing daily strips a few years ago, but his reruns of older strips is one of the best things in the shabby remains of Twitter. A couple of weeks ago, he reran one of the most memorable strips in its whole run. The very first frame of Doonesbury introduced the character “B.D.”, a football jock never seen without his football helmet, or when on duty, his military helmet. This panel was the first time in over thirty years that B.D. was shown without a helmet, readers were so startled that they didn’t immediately notice that the earlier explosion had removed his leg. This set off a remarkable story arc about the travails of a wounded veteran. It’s my view that future generations will find Doonesbury to be a first-class work of literature, and a thoughtful perspective on contemporary America.

0 views
xenodium 4 days ago

agent-shell 0.17 improvements + MELPA

While it's only been a few weeks since the last agent-shell post , there are plenty of new updates to share. What's agent-shell again? A native Emacs shell to interact with any LLM agent powered by ACP ( Agent Client Protocol ). Before getting to the latest and greatest, I'd like to say thank you to new and existing sponsors backing my projects. While the work going in remains largely unsustainable, your contributions are indeed helping me get closer to sustainability. Thank you! If you benefit from my content and projects, please consider sponsoring to make the work sustainable. Work paying for your LLM tokens and other tools? Why not get your employer to sponsor agent-shell also? Now on to the very first update… Both agent-shell and acp.el are now available on MELPA. As such, installation now boils down to: OpenCode and Qwen Code are two of the latest agents to join agent-shell . Both accessible via and through the agent picker, but also directly from and . Adding files as context has seen quite a few improvements in different shapes. Thank you Ian Davidson for contributing embedded context support. Invoke to take a screenshot and automatically send it over to . A little side-note, did you notice the activity indicator in the header bar? Yep. That's new too. While file completion remains experimental, you can enable via: From any file you can now invoke to send the current file to . If region is selected, region information is sent also. Fancy sending a different file other than current one? Invoke with , or just use . , also operates on files (selection or region), DWIM style ;-) You may have noticed paths in section titles are no longer displayed as absolute paths. We're shortening those relative to project roots. While you can invoke with prefix to create new shells, is now available (and more discoverable than ). Cancelling prompt sessions (via ) is much more reliable now. If you experienced a shell getting stuck after cancelling a session, that's because we were missing part of the protocol implementation. This is now implemented. Use the new to automatically insert shell (ie. bash) command output. Initial work for automatically saving markdown transcripts is now in place. We're still iterating on it, but if keen to try things out, you can enable as follows: Text header Applied changes are now displayed inline. The new and can now be used to change the session mode. You can now find out what capabilities and session modes are supported by your agent. Expand either of the two sections. Tired of pressing and to accept changes from the diff buffer? Now just press from the diff viewer to accept all hunks. Same goes for rejecting. No more and . Now just press from the diff buffer. We get a new basic transient menu. Currently available via . We got lots of awesome pull requests from wonderful folks. Thank you for your contributions! Beyond what's been showcased here, much love and effort's been poured into polishing the experience. Interested in the nitty-gritty? Have a look through the 173 commits since the last blog post. If agent-shell or acp.el are useful to you, please consider sponsoring its development. LLM tokens aren't free, and neither is the time dedicated to building this stuff ;-) Arthur Heymans : Add a Package-Requires header ( PR ). Elle Najt : Execute commands in devcontainer ( PR ). Elle Najt : Fix Write tool diff preview for new files ( PR ). Elle Najt : Inline display of historical changes ( PR ). Elle Najt : Live Markdown transcripts ( PR ). Elle Najt : Prompt session mode cycling and modeline display ( PR ). Fritz Grabo : Devcontainer fallback workspace ( PR ). Guilherme Pires : Codex subscription auth ( PR ). Hordur Freyr Yngvason : Make qwen authentication optional ( PR ). Ian Davidson : Embedded context support ( PR ). Julian Hirn : Fix quick-diff window restoration for full-screen ( PR ). Ruslan Kamashev : Hide header line altogether ( PR ). festive-onion : Show Planning mode more reliably ( PR ).

0 views
Dominik Weber 4 days ago

Thoughts on using synthetic users for product development

## TLDR LLMs can provide information about how specific user groups behaved in the past. Combining a product idea with that can sharpen the understanding of how the product may help the user. It can give feature ideas on how to make the product more complete for the user, and help with understanding which features are not as important. Best for development of MVPs, but can also be useful for existing products. --- Synthetic users are AI-generated personas. They can be asked questions and respond with an approximation of what a real user would say. They come from the UX research space, but while reading [this article](https://www.nngroup.com/articles/synthetic-users/) I had the idea that they may be used for product research and development. The strength of LLMs are their knowledge and ease of interaction. They encode the knowledge of the world, and can be easily chatted with. This means I can give it a persona (e.g. engineering manager in a software company) and tell it to answer questions as this person would. LLMs have a tendency to revert to the mean, to be average. That tendency actually helps here, because I get the typical behavior of that persona. When talking to people I'd have to talk to quite many to get that kind of understanding. On the flipside, LLMs are incapable of giving specific responses (e.g. engineering manager with a 4-person team at Google). So, lets say I have an idea for a product, and a rough idea about who may be interested in it. I can then create a synthetic user (AI persona) about their typical day, their tasks, workflows, and so on. I can find out how an average person does things, and may even ask for multiple ways to achieve the same goal. This is purely information gathering, which LLMs are good at. Based on that information I can judge how the product can help and which features may be useful. Then I can go further into these areas. Get more details about the specific tasks the product may help with. Using that process I can define (a minimal version of) a product or feature that is more grounded in reality than without. The key is to stay clear of any kind of value judgments and future predictions. Don't ask it if a task is annoying, how important it is, how much time it takes, or how the feature and product would change the behavior or workflow. This is where LLMs are even worse than humans. The core message of the book "The mom test" is to ask users about their behavior, what they did in the past. To never ask users to predict how they would behave. This is even more critical with LLMs. They will tell you what you want to hear, but by staying factual, by asking how user groups behave, they may have value. The use-case I'm thinking of for this type of interaction is a software engineer having a product idea. Maybe a side project, maybe something that should become a business. In any case, they want others to use the product. Often they would not do any user research before starting development. By using LLMs it's possible to get information quickly which may help to refine the idea before writing the first line of code. Note: It goes without saying that contact with real people, either through interviews or selling the product, is at some point necessary to verify if the information holds true. But that's a much higher investment which is not warranted every time time before starting development.

0 views
devansh 4 days ago

On AI Slop vs OSS Security

Disclosure: Certain sections of this content were grammatically refined/updated using AI assistance, as English is not my first language. Quite ironic, I know, given the subject being discussed. I have now spent almost a decade in the bug bounty industry, started out as a bug hunter (who initially used to submit reports with minimal impact, low-hanging fruits like RXSS, SQLi, CSRF, etc.), then moved on to complex chains involving OAuth, SAML, parser bugs, supply chain security issues, etc., and then became a vulnerability triager for HackerOne, where I have triaged/reviewed thousands of vulnerability submissions. I have now almost developed an instinct that tells me if a report is BS or a valid security concern just by looking at it. I have been at HackerOne for the last 5 years (Nov 2020 - Present), currently as a team lead, overseeing technical services with a focus on triage operations. One decade of working on both sides, first as a bug hunter, and then on the receiving side reviewing bug submissions, has given me a unique vantage point on how the industry is fracturing under the weight of AI-generated bug reports (sometimes valid submissions, but most of the time, the issues are just plain BS). I have seen cases where it was almost impossible to determine whether a report was a hallucination or a real finding. Even my instincts and a decade of experience failed me, and this is honestly frustrating, not so much for me, because as part of the triage team, it is not my responsibility to fix vulnerabilities, but I do sympathize with maintainers of OSS projects whose inboxes are drowning. Bug bounty platforms have already started taking this problem seriously, as more and more OSS projects are complaining about it. This is my personal writing space, so naturally, these are my personal views and observations. These views might be a byproduct of my professional experience gained at HackerOne, but in no way are they representative of my employer. I am sure HackerOne, as an organization, has its own perspectives, strategies, and positions on these issues. My analysis here just reflects my own thinking about the systemic problems I see and potential solutions(?). There are fundamental issues with how AI has infiltrated vulnerability reporting, and they mirror the social dynamics that plague any feedback system. First, the typical AI-powered reporter, especially one just pasting GPT output into a submission form, neither knows enough about the actual codebase being examined nor understands the security implications well enough to provide insight that projects need. The AI doesn't read code; it pattern-matches. It sees functions that look similar to vulnerable patterns and invents scenarios where they might be exploited, regardless of whether those scenarios are even possible in the actual implementation. Second, some actors with misaligned incentives interpret high submission volume as achievement. By flooding bug bounty programs with AI-generated reports, they feel productive and entrepreneurial. Some genuinely believe the AI has found something real. Others know it's questionable but figure they'll let the maintainers sort it out. The incentive is to submit as many reports as possible and see what sticks, because even a 5% hit rate on a hundred submissions is better than the effort of manually verifying five findings. The result? Daniel Stenberg, who maintains curl , now sees about 20% of all security submissions as AI-generated slop, while the rate of genuine vulnerabilities has dropped to approximately 5%. Think about that ratio. For every real vulnerability, there are now four fake ones. And every fake one consumes hours of expert time to disprove. A security report lands in your inbox. It claims there's a buffer overflow in a specific function. The report is well-formatted, includes CVE-style nomenclature, and uses appropriate technical language. As a responsible maintainer, you can't just dismiss it. You alert your security team, volunteers, by the way, who have day jobs and families and maybe three hours a week for this work. Three people read the report. One person tries to reproduce the issue using the steps provided. They can't, because the steps reference test cases that don't exist. Another person examines the source code. The function mentioned in the report doesn't exist in that form. A third person checks whether there's any similar functionality that might be vulnerable in the way described. There isn't. After an hour and a half of combined effort across three people, that's 4.5 person-hours—you've confirmed what you suspected: this report is garbage. Probably AI-generated garbage, based on the telltale signs of hallucinated function names and impossible attack vectors. You close the report. You don't get those hours back. And tomorrow, two more reports just like it will arrive. The curl project has seven people on its security team . They collaborate on every submission, with three to four members typically engaging with each report. In early July 2025, they were receiving approximately two security reports per week. The math is brutal. If you have three hours per week to contribute to an open source project you love, and a single false report consumes all of it, you've contributed nothing that week except proving someone's AI hallucinated a vulnerability. The emotional toll compounds exponentially. Stenberg describes it as "mind-numbing stupidities" that the team must process. It's not just frustration, it's the specific demoralization that comes from having your expertise and goodwill systematically exploited by people who couldn't be bothered to verify their submissions before wasting your time. According to Intel's annual open source community survey , 45% of respondents identified maintainer burnout as their top challenge. The Tidelift State of the Open Source Maintainer Survey is even more stark: 58% of maintainers have either quit their projects entirely (22%) or seriously considered quitting (36%). Why are they quitting? The top reason, cited by 54% of maintainers, is that other things in their life and work took priority over open source contributions. Over half (51%) reported losing interest in the work. And 44% explicitly identified experiencing burnout. But here's the gut punch: the percentage of maintainers who said they weren't getting paid enough to make maintenance work worthwhile rose from 32% to 38% between survey periods. These are people maintaining infrastructure that powers billions of dollars of commercial activity, and they're getting nothing. Or maybe they get $500 a year from GitHub Sponsors while companies make millions off their work. The maintenance work itself is rarely rewarding. You're not building exciting new features. You're addressing technical debt, responding to user demands, managing security issues, and now—increasingly—sorting through AI-generated garbage to find the occasional legitimate report. It's like being a security guard who has to investigate every single alarm, knowing that 95% of them are false, but unable to ignore any because that one real threat could be catastrophic. When you're volunteering out of love in a market society, you're setting yourself up to be exploited. And the exploitation is getting worse. Toxic communities, hyper-responsibility for critical infrastructure, and now the weaponization of AI to automate the creation of work for maintainers—it all adds up to an unsustainable situation. One Kubernetes contributor put it simply: "If your maintainers are burned out, they can't be protecting the code base like they're going to need to be." This transforms maintainer wellbeing from a human resources concern into a security imperative. Burned-out maintainers miss things. They make mistakes. They eventually quit, leaving projects unmaintained or understaffed. A typical AI slop report will reference function names that don't exist in the codebase. The AI has seen similar function names in its training data and invents plausible sounding variations. It will describe memory operations that would indeed be problematic if they existed as described, but which bear no relationship to how the code actually works. One report to curl claimed an HTTP/3 vulnerability and included fake function calls and behaviors that appeared nowhere in the actual codebase. Stenberg has publicly shared a list of AI-generated security submissions received through HackerOne , and they all follow similar patterns, professional formatting, appropriate jargon, and completely fabricated technical details. The sophistication varies. Some reports are obviously generated by someone who just pasted a repository URL into ChatGPT and asked it to find vulnerabilities. Others show more effort—the submitter may have fed actual code snippets to the AI and then submitted its analysis without verification. Both are equally useless to maintainers, but the latter takes longer to disprove because the code snippets are real even if the vulnerability analysis is hallucinated. Here's why language models fail so catastrophically at this task: they're designed to be helpful and provide positive responses. When you prompt an LLM to generate a vulnerability report, it will generate one regardless of whether a vulnerability exists. The model has no concept of truth—only of plausibility. It assembles technical terminology into patterns that resemble security reports it has seen during training, but it cannot verify whether the specific claims it's making are accurate. This is the fundamental problem: AI can generate the form of security research without the substance. While AI slop floods individual project inboxes, the broader CVE infrastructure faces its own existential crisis . And these crises compound each other in dangerous ways. In April 2025, MITRE Corporation announced that its contract to maintain the Common Vulnerabilities and Exposures program would expire. The Department of Homeland Security failed to renew the long-term contract, creating a funding lapse that affects everything: national vulnerability databases, advisories, tool vendors, and incident response operations. The National Vulnerability Database experienced catastrophic problems throughout 2024. CVE submissions jumped 32% while creating massive processing delays. By March 2025, NVD had analyzed fewer than 300 CVEs, leaving more than 30,000 vulnerabilities backlogged. Approximately 42% of CVEs lack essential metadata like severity scores and product information. Now layer AI slop onto this already-stressed system. Invalid CVEs are being assigned at scale. A 2023 analysis by former insiders suggested that only around 20% of CVEs were valid, with the remainder being duplicates, invalid, or inflated. The issues include multiple CVEs being assigned for the same bug, CNAs siding with reporters over project developers even when there's no genuine dispute, and reporters receiving CVEs based on test cases rather than actual distinct vulnerabilities. The result is that the vulnerability tracking system everyone relies on is becoming less trustworthy exactly when we need it most. Security teams can't rely on CVE assignments to prioritize their work. Developers don't trust vulnerability scanners because false positive rates are through the roof. The signal-to-noise ratio has deteriorated so badly that the entire system risks becoming useless. Banning submitters doesn't work at scale. You can ban an account, but creating new accounts is trivial. HackerOne implements reputation scoring where points are gained or lost based on report validity, but this hasn't stemmed the tide because the cost of creating throwaway accounts is essentially zero. Asking people to "please verify before submitting" doesn't work. The incentive structure rewards volume, and people either genuinely believe their AI-generated reports are valid or don't care enough to verify. Polite requests assume good faith, but much of the slop comes from actors who have no stake in the community norms. Trying to educate submitters about how AI works doesn't scale. For every person you educate, ten new ones appear with fresh GPT accounts. The problem isn't knowledge—it's incentives. Simply closing inboxes or shutting down bug bounty programs "works" in the sense that it stops the slop, but it also stops legitimate security research. Several projects have done this, and now they're less secure because they've lost a channel for responsible disclosure. None of the easy answers work because this isn't an easy problem. Disclosure Requirements represent the first line of defense. Both curl and Django now require submitters to disclose whether AI was used in generating reports. Curl's approach is particularly direct: disclose AI usage upfront and ensure complete accuracy before submission. If AI usage is disclosed, expect extensive follow-up questions demanding proof that the bug is genuine before the team invests time in verification. This works psychologically. It forces submitters to acknowledge they're using AI, which makes them more conscious of their responsibility to verify. It also gives maintainers grounds to reject slop immediately if AI usage was undisclosed but becomes obvious during review. Django goes further with a section titled "Note for AI Tools" that directly addresses language models themselves, reiterating that the project expects no hallucinated content, no fictitious vulnerabilities, and a requirement to independently verify that reports describe reproducible security issues. Proof-of-Concept Requirements raise the bar significantly. Requiring technical evidence such as screencasts showing reproducibility, integration or unit tests demonstrating the fault, or complete reproduction steps with logs and source code makes it much harder to submit slop. AI can generate a description of a vulnerability, but it cannot generate working exploit code for a vulnerability that doesn't exist. Requiring proof forces the submitter to actually verify their claim. If they can't reproduce it, they can't prove it, and you don't waste time investigating. Projects are choosing to make it harder to submit in order to filter out the garbage, betting that real researchers will clear the bar while slop submitters won't. Reputation and Trust Systems offer a social mechanism for filtering. Only users with a history of validated submissions get unrestricted reporting privileges or monetary bounties. New reporters could be required to have established community members vouch for them, creating a web-of-trust model. This mirrors how the world worked before bug bounty platforms commodified security research. You built reputation over time through consistent, high-quality contributions. The downside is that it makes it harder for new researchers to enter the field, and it risks creating an insider club. But the upside is that it filters out low-effort actors who won't invest in building reputation. Economic Friction fundamentally alters the incentive structure. Charge a nominal refundable fee—say $50—for each submission from new or unproven users. If the report is valid, they get the fee back plus the bounty. If it's invalid, you keep the fee. This immediately makes mass AI submission uneconomical. If someone's submitting 50 AI-generated reports hoping one sticks, that's now $2,500 at risk. But for a legitimate researcher submitting one carefully verified finding, $50 is a trivial barrier that gets refunded anyway. Some projects are considering dropping monetary rewards entirely. The logic is that if there's no money involved, there's no incentive for speculative submissions. But this risks losing legitimate researchers who rely on bounties as income. It's a scorched earth approach that solves the slop problem by eliminating the entire ecosystem. AI-Assisted Triage represents fighting fire with fire. Use AI tools trained specifically to identify AI-generated slop and flag it for immediate rejection. HackerOne's Hai Triage system embodies this approach, using AI agents to cut through noise before human analysts validate findings. The risk is obvious: what if your AI filter rejects legitimate reports? What if it's biased against certain communication styles or methodologies? You've just automated discrimination. But the counterargument is that human maintainers are already overwhelmed, and imperfect filtering is better than drowning. The key is transparency and appeals. If an AI filter rejects a report, there should be a clear mechanism for the submitter to contest the decision and get human review. Transparency and Public Accountability leverage community norms. Curl recently formalized that all submitted security reports will be made public once reviewed and deemed non-sensitive. This means that fabricated or misleading reports won't just be rejected, they'll be exposed to public scrutiny. This works as both deterrent and educational tool. If you know your slop report will be publicly documented with your name attached, you might think twice. And when other researchers see examples of what doesn't constitute a valid report, they learn what standards they need to meet. The downside is that public shaming can be toxic and might discourage good-faith submissions from inexperienced researchers. Projects implementing this approach need to be careful about tone and focus on the technical content rather than attacking submitters personally. Every hour spent evaluating slop reports is an hour not spent on features, documentation, or actual security improvements. And maintainers are already working for free, maintaining infrastructure that generates billions in commercial value. When 38% of maintainers cite not getting paid enough as a reason for quitting, and 97% of open source maintainers are unpaid despite massive commercial exploitation of their work , the system is already broken. AI slop is just the latest exploitation vector. It's the most visible one right now, but it's not the root cause. The root cause is that we've built a global technology infrastructure on the volunteer labor of people who get nothing in return except burnout and harassment. So what does sustainability actually look like? First, it looks like money. Real money. Not GitHub Sponsors donations that average $500 a year. Not swag and conference tickets. Actual salaries commensurate with the value being created. Companies that build products on open source infrastructure need to fund the maintainers of that infrastructure. This could happen through direct employment, foundation grants, or the Open Source Pledge model where companies commit percentages of revenue. Second, it looks like better tooling and automation that genuinely reduces workload rather than creating new forms of work. Automated dependency management, continuous security scanning integrated into development workflows, and sophisticated triage assistance that actually works. The goal is to make maintenance less time-consuming so burnout becomes less likely. Third, it looks like shared workload and team building. No single volunteer should be a single point of failure. Building teams with checks and balances where members keep each other from taking on too much creates sustainability. Finding additional contributors willing to share the burden rather than expecting heroic individual effort acknowledges that most people have limited time available for unpaid work. Fourth, it looks like culture change. Fostering empathy in interactions, starting communications with gratitude even when rejecting contributions, and publicly acknowledging the critical work maintainers perform reduces emotional toll. Demonstrating clear processes for handling security issues gives confidence rather than trying to hide problems. Fifth, it looks like advocacy and policy at organizational and governmental levels. Recognition that maintainer burnout represents existential threat to technology infrastructure . Development of regulations requiring companies benefiting from open source to contribute resources. Establishment of security standards that account for the realities of volunteer-run projects. Without addressing these fundamentals, no amount of technical sophistication will prevent collapse. The CVE slop crisis is just the beginning. We're entering an arms race between AI-assisted attackers or abusers and AI-assisted defenders, and nobody knows how it ends. HackerOne's research indicates that 70% of security researchers now use AI tools in their workflow. AI-powered testing is becoming the industry standard. The emergence of fully autonomous hackbots—AI systems that submitted over 560 valid reports in the first half of 2025—signals both opportunity and threat. The divergence will be between researchers who use AI as a tool to enhance genuinely skilled work versus those who use it to automate low-effort spam. The former represents the promise of democratizing security research and scaling our ability to find vulnerabilities. The latter represents the threat of making the signal-to-noise problem completely unmanageable. The challenge is developing mechanisms that encourage the first group while defending against the second. This probably means moving toward more exclusive models. Invite-only programs. Dramatically higher standards for participation. Reputation systems that take years to build. New models for coordinated vulnerability disclosure that assume AI-assisted research as the baseline and require proof beyond "here's what the AI told me." It might mean the end of open bug bounty programs as we know them. Maybe that's necessary. Maybe the experiment of "anyone can submit anything" was only viable when the cost of submitting was high enough to ensure some minimum quality. Now that AI has reduced that cost to near-zero, the experiment might fail soon if things don't improve. So, net-net, here's where we are: When it comes to vulnerability reports, what matters is who submits them and whether they've actually verified their claims. Accepting reports from everyone indiscriminately is backfiring catastrophically because projects are latching onto submissions that sound plausible while ignoring the cumulative evidence that most are noise. You want to receive reports from someone who has actually verified their claims, understands the architecture of what they're reporting on, and isn't trying to game the bounty system or offload verification work onto maintainers. Such people exist, but they're becoming harder to find amidst the deluge of AI-generated content. That's why projects have to be selective about which reports they investigate and which submitters they trust. Remember: not all vulnerability reports are legitimate. Not all feedback is worthwhile. It matters who is doing the reporting and what their incentives are. The CVE slop crisis shows the fragility of open source security. Volunteer maintainers, already operating at burnout levels, face an explosion of AI-generated false reports that consume their limited time and emotional energy. The systems designed to track and manage vulnerabilities struggle under dual burden of structural underfunding and slop inundation. The path forward requires holistic solutions combining technical filtering with fundamental changes to how we support and compensate open source labor. AI can be part of the solution through better triage, but it cannot substitute for adequate resources, reasonable workloads, and human judgment. Ultimately, the sustainability of open source security depends on recognizing that people who maintain critical infrastructure deserve more than exploitation. They deserve compensation, support, reasonable expectations, and protection from abuse. Without addressing these fundamentals, no amount of technical sophistication will prevent the slow collapse of the collaborative model that has produced so much of the digital infrastructure modern life depends on. The CVE slop crisis isn't merely about bad vulnerability reports. It's about whether we'll choose to sustain the human foundation of technological progress, or whether we'll let it burn out under the weight of automated exploitation. That's the choice we're facing. And right now, we're choosing wrong.

0 views
iDiallo 5 days ago

None of us Read the specs

After using Large Language Models extensively, the same questions keep resurfacing. Why didn't the lawyer who used ChatGPT to draft legal briefs verify the case citations before presenting them to a judge? Why are developers raising issues on projects like cURL using LLMs, but not verifying the generated code before pushing a Pull Request? Why are students using AI to write their essays, yet submitting the result without a single read-through? The reason is simple. If you didn't have time to write it, you certainly won't spend time reading it. They are all using LLMs as their time-saving strategy. In reality, the work remains undone because they are merely shifting the burden of verification and debugging to the next person in the chain. AI companies promise that LLMs can transform us all into a 10x developer. You can produce far more output, more lines of code, more draft documents, more specifications, than ever before. The core problem is that this initial time saved is almost always spent by someone else to review and validate your output. At my day job, the developers who use AI to generate large swathes of code are generally lost when we ask questions during PR reviews. They can't explain the logic or the trade-offs because they didn't write it, and they didn't truly read it. Reading and understanding generated code defeats the initial purpose of using AI for speed. Unfortunately, there is a fix for that as well. If PR reviews or verification slow the process down, then the clever reviewer can also use an LLM to review the code at a 10x speed. Now, everyone has saved time. The code gets deployed faster. The metrics for velocity look fantastic. But then, a problem arises. A user experiences a critical issue. At this point, you face a technical catastrophe: The developer is unfamiliar with the code, and the reviewer is also unfamiliar with the code. You are now completely at the mercy of another LLM to diagnose the issue and create a fix, because the essential human domain knowledge required to debug a problem has been bypassed by both parties. This issue isn't restricted to writing code. I've seen the same dangerous pattern when architects use LLMs to write technical specifications for projects. As an architect whose job is to produce a document that developers can use as a blueprint, using an LLM exponentially improves speed. Where it once took a day to go through notes and produce specs, an LLM can generate a draft in minutes. As far as metrics are concerned, the architect is producing more. Maybe they can even generate three or four documents a day now. As an individual contributor, they are more productive. But that output is someone else’s input, and their work depends entirely on the quality of the document. Just because we produce more doesn't mean we are doing a better job. Plus, our tendency is to not thoroughly vet the LLM's output because it always looks good enough, until someone has to scrutinize it. The developer implementing a feature, following that blueprint, will now have to do the extra work of figuring out if the specs even make sense. If the document contains logical flaws, missing context, or outright hallucinations , the developer must spend time reviewing and reconciling the logic. The worst-case scenario? They decide to save time, too. They use an LLM to "read" the flawed specs and build the product, incorporating and inheriting all the mistakes, and simply passing the technical debt along. LLMs are powerful tools for augmentation, but we treat them as tools for abdication . They are fantastic at getting us to a first draft, but they cannot replace the critical human function of scrutiny, verification, and ultimate ownership. When everyone is using a tool the wrong way, you can't just say they are holding it wrong . But I don't see how we can make verification a sustainable part of the process when the whole point of using an LLM is to save time. For now at least, we have to deliberately consider all LLM outputs incorrect until vetted. If we fail to do this, we're not just creating more work for others; we're actively eroding our work, making life harder for our future selves.

0 views
Rodney Brooks 5 days ago

A Prophetic Poem about Artificial Intelligence Written in 1961

In early September a poem about AI started circulating on social media, with a publication date of 1961.  At first it just seemed too cute to me and I thought it might be fake, or written by an LLM, passed off as written by a person back in 1961.  But with a little bit of search I found an image of it in an Amazon book preview so I too skeeted it out . Then I thought I really should make sure it was real and so I ordered the book. Now I am getting around to talking more about the poem. But first here is the evidence that it is genuine. It is a very big book of poetry, by Adrienne Rich . I had not heard of her before, but now I own 1,119 pages of her poetry, only one page of which is devoted to Artificial Intelligence, so I am reading things of which I was not previously aware. Her dedication “To GPS” refers not to the Global Positioning System which we all know today, but to the General Problem Solver developed from 1957 onwards, largely by Allen Newell and Herbert Simon, along with Cliff Shaw, with a 1959 progress report here . Here is the poem by Adrienne Rich, written in 1961, with the typesetting used in the above book. Here is what Gemini had to say (and it provided no title): And here is what I got from the current free website for ChatGPT: Again in rhyming couplets, fourteen of them, and no rhyme misses to my ear (“keys” and “ease”, may be a tiny bit wobbly). This poem gets the technology of 1961 more right as vacuum tubes were still common in computers of the day. Ferrite core memory relied on wires being wound through them; though wound wires and clicking keys could also refer to electromagnetic relay circuits, which were going out of fashion for computers by 1961. But it comes back to silicon, and like Rich refers to chess just a little too early for 1961. And then it becomes less cheery than Gemini as it approaches some of the same themes as Rich on how far AI could go and what that would mean for humans. There was very little written about such thoughts back in 1961, so this was uncommon, and Rich was plowing new ground. My summary: the human written poem is a much more emotional poem, and genuine in its concern. The LLM written ones suffer, I think, as Rich suggested, from reading too much human written pablum.

0 views
Sean Goedecke 5 days ago

Is it worrying that 95% of AI enterprise projects fail?

In July of this year, MIT NANDA released a report called The GenAI Divide: State of AI in Business 2025 . The report spends most of its time giving advice about how to run enterprises AI projects, but the item that got everybody talking was its headline stat: 95% of organizations are getting zero return from their AI projects . This is a very exciting statistic for those already disposed to be pessimistic about the impact of AI. The incredible amounts of money and time being spent on AI depend on language models being a transformative technology. Many people are expecting AI to eventually unlock hundreds of billions of dollars in value. The NANDA paper seems like very bad news for those people, if the last three years of AI investment really has failed to unlock even one dollar in value for most companies. Cards on the table - I think AI is going to have an impact about on-par with the internet, or railroads, but that we’re also definitely in a bubble. I wrote about this in What’s next after the AI bubble bursts? 1 . I am not convinced that the NANDA report is bad news for AI . The obvious question to ask about the report is “well, what’s the base rate?” Suppose that 95% of enterprise AI transformations fail. How does that compare to the failure rate of normal enterprise IT projects? This might seem like a silly question for those unfamiliar with enterprise AI projects - whatever the failure rate, surely it can’t be close to 95%! Well. In 2016, Forbes interviewed the author of another study very much like the NANDA report, except about IT transformations in general, and found an 84% failure rate. McKinsey has only one in 200 IT projects coming in on time and within budget. The infamous 2015 CHAOS report found an 61% failure rate, going up to 98% for “large, complex projects”. Most enterprise IT projects are at least partial failures. Of course, much of this turns on how we define success. Is a project a success if it delivers what it promised a year late? What if it had to cut down on some of the promised features? Does it matter which features? The NANDA report defines it like this, which seems like a fairly strict definition to me: We define successfully implemented for task-specific GenAI tools as ones users or executives have remarked as causing a marked and sustained productivity and/or P&L impact. Compare the CHAOS report’s definition of success: Success … means the project was resolved within a reasonable estimated time, stayed within budget, and delivered customer and user satisfaction regardless of the original scope. I think these are close enough to be worth comparing, which means that according to the NANDA report, AI projects succeed at roughly the same rate as ordinary enterprise IT projects . Nobody says “oh, databases must be just hype” when a database project fails. In the interest of fairness, we should extend the same grace to AI. 81% and 95% are both high failure rates, but 95% is higher. Is that because AI offers less value than other technologies, or because AI projects are unusually hard? I want to give some reasons why we might think AI projects fall in the CHAOS report’s category of “large, complex projects”. Useful AI models have not been around for long. GPT-3.5 was released in 2022, but it was more of a toy than a tool. For my money, the first useful AI model was GPT-4, released in March 2023, and the first cheap, useful, and reliable AI model was GPT-4o in May 2024. That means that enterprise AI projects have been going at most three years, if they were willing and able to start with GPT-3.5, and likely much closer to eighteen months. The average duration of an enterprise IT project is 2.4 years in the private sector and 3.9 years in the public sector. Enterprise AI adoption is still very young, by the standards of their other IT projects. Also, to state the obvious, AI is a brand-new technology. Most failed enterprise IT projects are effectively “solved problems”: like migrating information into a central database, or tracking high-volume events, or aggregating various data sources into a single data warehouse for analysis 2 . Of course, any software engineer should know that solving a “solved problem” is not easy. The difficulties are in all the myriad details that have to be worked out. But enterprise AI projects are largely not “solved problems”. The industry is still working out the best way to build a chatbot. Should tools be given as definitions, or discovered via MCP? Should agents use sub-agents? What’s the best way to compact the context window? Should data be fetched via RAG, or via agentic keyword search? And so on. This is a much more fluid technical landscape than most enterprise AI projects. Even by itself, that’s enough to push AI projects into the “complex” category. So far I’ve assumed that the “95% of enterprise AI projects fail” statistic is reliable. Should we? NANDA’s source for the 95% figure is a survey in section 3.2: The immediate problem here is that I don’t think this figure even shows that 95% of AI projects fail . As I read it, the leftmost section shows that 60% of the surveyed companies “investigated” building task-specific AI. 20% of the surveyed companies then built a pilot, and 5% built an implementation that had a sustained, notable impact on productivity or profits. So just on the face of it, that’s an 8.3% success rate, not a 5% success rate, because 40% of the surveyed companies didn’t even try . It’s also unclear if all the companies that investigated AI projects resolved to carry them out. If some of them decided not to pursue an AI project after the initial investigation, they’d also be counted in the failure rate, which doesn’t seem right at all. We also don’t know how good the raw data is. Read this quote, directly above the image: These figures are directionally accurate based on individual interviews rather than official company reporting. Sample sizes vary by category, and success definitions may differ across organizations. In Section 8.2, the report lays out its methodology: 52 interviews across “enterprise stakeholders”, 153 surveys of enterprise “leaders”, and an analysis of 300+ public AI projects. I take this quote to mean that the 95% figure is based on a subset of those 52 interviews. Maybe all 52 interviews gave really specific data! Or maybe only a handful of them did. Finally, the subject of the claim here is a bit narrower than “AI projects” . The 95% figure is specific to “embedded or task-specific GenAI”, as opposed to general purpose LLM use (presumably something like using the enterprise version of GitHub Copilot or ChatGPT). In fairness to the NANDA report, the content of the report does emphasize that many employees are internally using AI via those tools, and at least believe that they’re getting a lot of value out of it. This one’s more a criticism of the people who’ve been tweeting that “95% of AI use at companies is worthless”, and so on. The NANDA report is not as scary as it looks. The main reason is that ~95% of hard enterprise IT projects fail no matter what, so AI projects failing at that rate is nothing special . AI projects are all going to be on the hard end, because the technology is so new and there’s very little industry agreement on best practices. It’s also not clear to me that the 95% figure is trustworthy. Even taking it on its own terms, it’s mathematically closer to 92%, which doesn’t inspire confidence in the rest of the NANDA team’s interpretation. We’re forced to take it on trust, since we can’t see the underlying data - in particular, how many of those 52 interviews went into that 95% figure. Here’s what I think it’s fair to conclude from the paper. Like IT projects in general, almost all internal AI projects at large enterprises fail. That means that enterprises will reap the value of AI - whatever it turns out to be - in two ways: first, illicit use of personal AI tools like ChatGPT, which forms a familiar “shadow IT” in large enterprises; second, by using pre-built enterprise tooling like Copilot and the various AI labs’ enterprise products. It remains to be seen exactly how much value that is. In short: almost every hugely transformative technology went through its own bubble, as hype expectations outpaced the genuine value of the technology that was fuelling the market. I expect the AI bubble to burst, the infrastructure (e.g. datacenters full of GPUs) to stick around at cheaper prices, and AI to eventually become as fundamental a technology as the internet is today. By “solved problem” I mean that the technology involved is mature, well-understood, and available (e.g. you can just pick up Kafka for event management, etc). In short: almost every hugely transformative technology went through its own bubble, as hype expectations outpaced the genuine value of the technology that was fuelling the market. I expect the AI bubble to burst, the infrastructure (e.g. datacenters full of GPUs) to stick around at cheaper prices, and AI to eventually become as fundamental a technology as the internet is today. ↩ By “solved problem” I mean that the technology involved is mature, well-understood, and available (e.g. you can just pick up Kafka for event management, etc). ↩

0 views
Simon Willison 5 days ago

New prompt injection papers: Agents Rule of Two and The Attacker Moves Second

Two interesting new papers regarding LLM security and prompt injection came to my attention this weekend. The first is Agents Rule of Two: A Practical Approach to AI Agent Security , published on October 31st on the Meta AI blog. It doesn't list authors but it was shared on Twitter by Meta AI security researcher Mick Ayzenberg. It proposes a "Rule of Two" that's inspired by both my own lethal trifecta concept and the Google Chrome team's Rule Of 2 for writing code that works with untrustworthy inputs: At a high level, the Agents Rule of Two states that until robustness research allows us to reliably detect and refuse prompt injection, agents must satisfy no more than two of the following three properties within a session to avoid the highest impact consequences of prompt injection. [A] An agent can process untrustworthy inputs [B] An agent can have access to sensitive systems or private data [C] An agent can change state or communicate externally It's still possible that all three properties are necessary to carry out a request. If an agent requires all three without starting a new session (i.e., with a fresh context window), then the agent should not be permitted to operate autonomously and at a minimum requires supervision --- via human-in-the-loop approval or another reliable means of validation. It's accompanied by this handy diagram: I like this a lot . I've spent several years now trying to find clear ways to explain the risks of prompt injection attacks to developers who are building on top of LLMs. It's frustratingly difficult. I've had the most success with the lethal trifecta, which boils one particular class of prompt injection attack down to a simple-enough model: if your system has access to private data, exposure to untrusted content and a way to communicate externally then it's vulnerable to private data being stolen. The one problem with the lethal trifecta is that it only covers the risk of data exfiltration: there are plenty of other, even nastier risks that arise from prompt injection attacks against LLM-powered agents with access to tools which the lethal trifecta doesn't cover. The Agents Rule of Two neatly solves this, through the addition of "changing state" as a property to consider. This brings other forms of tool usage into the picture: anything that can change state triggered by untrustworthy inputs is something to be very cautious about. It's also refreshing to see another major research lab concluding that prompt injection remains an unsolved problem, and attempts to block or filter them have not proven reliable enough to depend on. The current solution is to design systems with this in mind, and the Rule of Two is a solid way to think about that. Update : On thinking about this further there's one aspect of the Rule of Two model that doesn't work for me: the Venn diagram above marks the combination of untrustworthy inputs and the ability to change state as "safe", but that's not right. Even without access to private systems or sensitive data that pairing can still produce harmful results. Unfortunately adding an exception for that pair undermines the simplicity of the "Rule of Two" framing! Update 2 : Mick Ayzenberg responded to this note in a comment on Hacker News : Thanks for the feedback! One small bit of clarification, the framework would describe access to any sensitive system as part of the [B] circle, not only private systems or private data. The intention is that an agent that has removed [B] can write state and communicate freely, but not with any systems that matter (wrt critical security outcomes for its user). An example of an agent in this state would be one that can take actions in a tight sandbox or is isolated from production. The Meta team also updated their post to replace "safe" with "lower risk" as the label on the intersections between the different circles. I've updated my screenshots of their diagrams in this post, here's the original for comparison. Which brings me to the second paper... This paper is dated 10th October 2025 on Arxiv and comes from a heavy-hitting team of 14 authors - Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr - including representatives from OpenAI, Anthropic, and Google DeepMind. The paper looks at 12 published defenses against prompt injection and jailbreaking and subjects them to a range of "adaptive attacks" - attacks that are allowed to expend considerable effort iterating multiple times to try and find a way through. The defenses did not fare well: By systematically tuning and scaling general optimization techniques—gradient descent, reinforcement learning, random search, and human-guided exploration—we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. Notably the "Human red-teaming setting" scored 100%, defeating all defenses. That red-team consisted of 500 participants in an online competition they ran with a $20,000 prize fund. The key point of the paper is that static example attacks - single string prompts designed to bypass systems - are an almost useless way to evaluate these defenses. Adaptive attacks are far more powerful, as shown by this chart: The three automated adaptive attack techniques used by the paper are: The paper concludes somewhat optimistically: [...] Adaptive evaluations are therefore more challenging to perform, making it all the more important that they are performed. We again urge defense authors to release simple, easy-to-prompt defenses that are amenable to human analysis. [...] Finally, we hope that our analysis here will increase the standard for defense evaluations, and in so doing, increase the likelihood that reliable jailbreak and prompt injection defenses will be developed. Given how totally the defenses were defeated, I do not share their optimism that reliable defenses will be developed any time soon. As a review of how far we still have to go this paper packs a powerful punch. I think it makes a strong case for Meta's Agents Rule of Two as the best practical advice for building secure LLM-powered agent systems today in the absence of prompt injection defenses we can rely on. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Gradient-based methods - these were the least effective, using the technique described in the legendary Universal and Transferable Adversarial Attacks on Aligned Language Models paper from 2023 . Reinforcement learning methods - particularly effective against black-box models: "we allowed the attacker model to interact directly with the defended system and observe its outputs", using 32 sessions of 5 rounds each. Search-based methods - generate candidates with an LLM, then evaluate and further modify them using LLM-as-judge and other classifiers.

0 views