Posts in Python (20 found)
Simon Willison 3 days ago

A new SQL-powered permissions system in Datasette 1.0a20

Datasette 1.0a20 is out with the biggest breaking API change on the road to 1.0, improving how Datasette's permissions system works by migrating permission logic to SQL running in SQLite. This release involved 163 commits , with 10,660 additions and 1,825 deletions, most of which was written with the help of Claude Code. Datasette's permissions system exists to answer the following question: Is this actor allowed to perform this action , optionally against this particular resource ? An actor is usually a user, but might also be an automation operating via the Datasette API. An action is a thing they need to do - things like view-table, execute-sql, insert-row. A resource is the subject of the action - the database you are executing SQL against, the table you want to insert a row into. Datasette's default configuration is public but read-only: anyone can view databases and tables or execute read-only SQL queries but no-one can modify data. Datasette plugins can enable all sorts of additional ways to interact with databases, many of which need to be protected by a form of authentication Datasette also 1.0 includes a write API with a need to configure who can insert, update, and delete rows or create new tables. Actors can be authenticated in a number of different ways provided by plugins using the actor_from_request() plugin hook. datasette-auth-passwords and datasette-auth-github and datasette-auth-existing-cookies are examples of authentication plugins. The previous implementation included a design flaw common to permissions systems of this nature: each permission check involved a function call which would delegate to one or more plugins and return a True/False result. This works well for single checks, but has a significant problem: what if you need to show the user a list of things they can access, for example the tables they can view? I want Datasette to be able to handle potentially thousands of tables - tables in SQLite are cheap! I don't want to have to run 1,000+ permission checks just to show the user a list of tables. Since Datasette is built on top of SQLite we already have a powerful mechanism to help solve this problem. SQLite is really good at filtering large numbers of records. The biggest change in the new release is that I've replaced the previous plugin hook - which let a plugin determine if an actor could perform an action against a resource - with a new permission_resources_sql(actor, action) plugin hook. Instead of returning a True/False result, this new hook returns a SQL query that returns rules helping determine the resources the current actor can execute the specified action against. Here's an example, lifted from the documentation: This hook grants the actor with ID "alice" permission to view the "sales" table in the "accounting" database. The object should always return four columns: a parent, child, allow (1 or 0), and a reason string for debugging. When you ask Datasette to list the resources an actor can access for a specific action, it will combine the SQL returned by all installed plugins into a single query that joins against the internal catalog tables and efficiently lists all the resources the actor can access. This query can then be limited or paginated to avoid loading too many results at once. Datasette has several additional requirements that make the permissions system more complicated. Datasette permissions can optionally act against a two-level hierarchy . You can grant a user the ability to insert-row against a specific table, or every table in a specific database, or every table in every database in that Datasette instance. Some actions can apply at the table level, others the database level and others only make sense globally - enabling a new feature that isn't tied to tables or databases, for example. Datasette currently has ten default actions but plugins that add additional features can register new actions to better participate in the permission systems. Datasette's permission system has a mechanism to veto permission checks - a plugin can return a deny for a specific permission check which will override any allows. This needs to be hierarchy-aware - a deny at the database level can be outvoted by an allow at the table level. Finally, Datasette includes a mechanism for applying additional restrictions to a request. This was introduced for Datasette's API - it allows a user to create an API token that can act on their behalf but is only allowed to perform a subset of their capabilities - just reading from two specific tables, for example. Restrictions are described in more detail in the documentation. That's a lot of different moving parts for the new implementation to cover. Since permissions are critical to the security of a Datasette deployment it's vital that they are as easy to understand and debug as possible. The new alpha adds several new debugging tools, including this page that shows the full list of resources matching a specific action for the current user: And this page listing the rules that apply to that question - since different plugins may return different rules which get combined together: This screenshot illustrates two of Datasette's built-in rules: there is a default allow for read-only operations such as view-table (which can be over-ridden by plugins) and another rule that says the root user can do anything (provided Datasette was started with the option.) Those rules are defined in the datasette/default_permissions.py Python module. There's one question that the new system cannot answer: provide a full list of actors who can perform this action against this resource. It's not possibly to provide this globally for Datasette because Datasette doesn't have a way to track what "actors" exist in the system. SSO plugins such as mean a new authenticated GitHub user might show up at any time, with the ability to perform actions despite the Datasette system never having encountered that particular username before. API tokens and actor restrictions come into play here as well. A user might create a signed API token that can perform a subset of actions on their behalf - the existence of that token can't be predicted by the permissions system. This is a notable omission, but it's also quite common in other systems. AWS cannot provide a list of all actors who have permission to access a specific S3 bucket, for example - presumably for similar reasons. Datasette's plugin ecosystem is the reason I'm paying so much attention to ensuring Datasette 1.0 has a stable API. I don't want plugin authors to need to chase breaking changes once that 1.0 release is out. The Datasette upgrade guide includes detailed notes on upgrades that are needed between the 0.x and 1.0 alpha releases. I've added an extensive section about the permissions changes to that document. I've also been experimenting with dumping those instructions directly into coding agent tools - Claude Code and Codex CLI - to have them upgrade existing plugins for me. This has been working extremely well . I've even had Claude Code update those notes itself with things it learned during an upgrade process! This is greatly helped by the fact that every single Datasette plugin has an automated test suite that demonstrates the core functionality works as expected. Coding agents can use those tests to verify that their changes have had the desired effect. I've also been leaning heavily on to help with the upgrade process. I wrote myself two new helper scripts - and - to help test the new plugins. The and implementations can be found in this TIL . Some of my plugin upgrades have become a one-liner to the command, which runs OpenAI Codex CLI with a prompt without entering interactive mode: There are still a bunch more to go - there's a list in this tracking issue - but I expect to have the plugins I maintain all upgraded pretty quickly now that I have a solid process in place. This change to Datasette core by far the most ambitious piece of work I've ever attempted using a coding agent. Last year I agreed with the prevailing opinion that LLM assistance was much more useful for greenfield coding tasks than working on existing codebases. The amount you could usefully get done was greatly limited by the need to fit the entire codebase into the model's context window. Coding agents have entirely changed that calculation. Claude Code and Codex CLI still have relatively limited token windows - albeit larger than last year - but their ability to search through the codebase, read extra files on demand and "reason" about the code they are working with has made them vastly more capable. I no longer see codebase size as a limiting factor for how useful they can be. I've also spent enough time with Claude Sonnet 4.5 to build a weird level of trust in it. I can usually predict exactly what changes it will make for a prompt. If I tell it "extract this code into a separate function" or "update every instance of this pattern" I know it's likely to get it right. For something like permission code I still review everything it does, often by watching it as it works since it displays diffs in the UI. I also pay extremely close attention to the tests it's writing. Datasette 1.0a19 already had 1,439 tests, many of which exercised the existing permission system. 1.0a20 increases that to 1,583 tests. I feel very good about that, especially since most of the existing tests continued to pass without modification. I built several different proof-of-concept implementations of SQL permissions before settling on the final design. My research/sqlite-permissions-poc project was the one that finally convinced me of a viable approach, That one started as a free ranging conversation with Claude , at the end of which I told it to generate a specification which I then fed into GPT-5 to implement. You can see that specification at the end of the README . I later fed the POC itself into Claude Code and had it implement the first version of the new Datasette system based on that previous experiment. This is admittedly a very weird way of working, but it helped me finally break through on a problem that I'd been struggling with for months. Now that the new alpha is out my focus is upgrading the existing plugin ecosystem to use it, and supporting other plugin authors who are doing the same. The new permissions system unlocks some key improvements to Datasette Cloud concerning finely-grained permissions for larger teams, so I'll be integrating the new alpha there this week. This is the single biggest backwards-incompatible change required before Datasette 1.0. I plan to apply the lessons I learned from this project to the other, less intimidating changes. I'm hoping this can result in a final 1.0 release before the end of the year! You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Understanding the permissions system Permissions systems need to be able to efficiently list things The new permission_resources_sql() plugin hook Hierarchies, plugins, vetoes, and restrictions New debugging tools The missing feature: list actors who can act on this resource Upgrading plugins for Datasette 1.0a20 Using Claude Code to implement this change Starting with a proof-of-concept Miscellaneous tips I picked up along the way What's next? = "test against datasette dev" - it runs a plugin's existing test suite against the current development version of Datasette checked out on my machine. It passes extra options through to so I can run or as needed. = "run against datasette dev" - it runs the latest dev command with the plugin installed. When working on anything relating to plugins it's vital to have at least a few real plugins that you upgrade in lock-step with the core changes. The and shortcuts were invaluable for productively working on those plugins while I made changes to core. Coding agents make experiments much cheaper. I threw away so much code on the way to the final implementation, which was psychologically easier because the cost to create that code in the first place was so low. Tests, tests, tests. This project would have been impossible without that existing test suite. The additional tests we built along the way give me confidence that the new system is as robust as I need it to be. Claude writes good commit messages now! I finally gave in and let it write these - previously I've been determined to write them myself. It's a big time saver to be able to say "write a tasteful commit message for these changes". Claude is also great at breaking up changes into smaller commits. It can also productively rewrite history to make it easier to follow, especially useful if you're still working in a branch. A really great way to review Claude's changes is with the GitHub PR interface. You can attach comments to individual lines of code and then later prompt Claude like this: . This is a very quick way to apply little nitpick changes - rename this function, refactor this repeated code, add types here etc. The code I write with LLMs is higher quality code . I usually find myself making constant trade-offs while coding: this function would be neater if I extracted this helper, it would be nice to have inline documentation here, this changing this would be good but would break a dozen tests... for each of those I have to determine if the additional time is worth the benefit. Claude can apply changes so much faster than me that these calculations have changed - almost any improvement is worth applying, no matter how trivial, because the time cost is so low. Internal tools are cheap now. The new debugging interfaces were mostly written by Claude and are significantly nicer to use and look at than the hacky versions I would have knocked out myself, if I had even taken the extra time to build them. That trick with a Markdown file full of upgrade instructions works astonishingly well - it's the same basic idea as Claude Skills . I maintain over 100 Datasette plugins now and I expect I'll be automating all sorts of minor upgrades in the future using this technique.

0 views
devansh 3 days ago

AI pentest scoping playbook

Disclosure: Certain sections of this content were grammatically refined/updated using AI assistance, as English is not my first language. Organizations are throwing money at "AI red teams" who run a few prompt injection tests, declare victory, and cash checks. Security consultants are repackaging traditional pentest methodologies with "AI" slapped on top, hoping nobody notices they're missing 80% of the actual attack surface. And worst of all, the people building AI systems, the ones who should know better, are scoping engagements like they're testing a CRUD app from 2015. This guide/playbook exists because the current state of AI security testing is dangerously inadequate. The attack surface is massive. The risks are novel. The methodologies are immature. And the consequences of getting it wrong are catastrophic. These are my personal views, informed by professional experience but not representative of my employer. What follows is what I wish every CISO, security lead, and AI team lead understood before they scoped their next AI security engagement. Traditional web application pentests follow predictable patterns. You scope endpoints, define authentication boundaries, exclude production databases, and unleash testers to find SQL injection and XSS. The attack surface is finite, the vulnerabilities are catalogued, and the methodologies are mature. AI systems break all of that. First, the system output is non-deterministic . You can't write a test case that says "given input X, expect output Y" because the model might generate something completely different next time. This makes reproducibility, the foundation of security testing, fundamentally harder. Second, the attack surface is layered and interconnected . You're not just testing an application. You're testing a model (which might be proprietary and black-box), a data pipeline (which might include RAG, vector stores, and real-time retrieval), integration points (APIs, plugins, browser tools), and the infrastructure underneath (cloud services, containers, orchestration). Third, novel attack classes exist that don't map to traditional vuln categories . Prompt injection isn't XSS. Data poisoning isn't SQL injection. Model extraction isn't credential theft. Jailbreaks don't fit CVE taxonomy. The OWASP Top 10 doesn't cover this. Fourth, you might not control the model . If you're using OpenAI's API or Anthropic's Claude, you can't test the training pipeline, you can't audit the weights, and you can't verify alignment. Your scope is limited to what the API exposes, which means you're testing a black box with unknown internals. Fifth, AI systems are probabilistic, data-dependent, and constantly evolving . A model that's safe today might become unsafe after fine-tuning. A RAG system that's secure with Dataset A might leak PII when Dataset B is added. An autonomous agent that behaves correctly in testing might go rogue in production when it encounters edge cases. This isn't incrementally harder than web pentesting. It's just fundamentally different. And if your scope document looks like a web app pentest with "LLM" find-and-replaced in, you're going to miss everything that matters. Before you can scope an AI security engagement, you need to understand what you're actually testing. And most organizations don't. Here's the stack: This is the thing everyone focuses on because it's the most visible. But "the model" isn't monolithic. Base model : Is it GPT-4? Claude? Llama 3? Mistral? A custom model you trained from scratch? Each has different vulnerabilities, different safety mechanisms, different failure modes. Fine-tuning : Have you fine-tuned the base model on your own data? Fine-tuning can break safety alignment. It can introduce backdoors. It can memorize training data and leak it during inference. If you've fine-tuned, that's in scope. Instruction tuning : Have you applied instruction-tuning or RLHF to shape model behavior? That's another attack surface. Adversaries can craft inputs that reverse your alignment work. Multi-model orchestration : Are you running multiple models and aggregating outputs? That introduces new failure modes. What happens when Model A says "yes" and Model B says "no"? How do you handle consensus? Can an adversary exploit disagreements? Model serving infrastructure : How is the model deployed? Is it an API? A container? Serverless functions? On-prem hardware? Each deployment model has different security characteristics. AI systems don't just run models. They feed data into models. And that data pipeline is massive attack surface. Training data : Where did the training data come from? Who curated it? How was it cleaned? Is it public? Proprietary? Scraped? Licensed? Can an adversary poison the training data? RAG (Retrieval-Augmented Generation) : Are you using RAG to ground model outputs in retrieved documents? That's adding an entire data retrieval system to your attack surface. Can an adversary inject malicious documents into your knowledge base? Can they manipulate retrieval to leak sensitive docs? Can they poison the vector embeddings? Vector databases : If you're using RAG, you're running a vector database (Pinecone, Weaviate, Chroma, etc.). That's infrastructure. That has vulnerabilities. That's in scope. Real-time data ingestion : Are you pulling live data from APIs, databases, or user uploads? Each data source is a potential injection point. Data preprocessing : How are inputs sanitized before hitting the model? Are you stripping dangerous characters? Validating formats? Filtering content? Attackers will test every preprocessing step for bypasses. Models don't exist in isolation. They're integrated into applications. And those integration points are attack surface. APIs : How do users interact with the model? REST APIs? GraphQL? WebSockets? Each has different attack vectors. Authentication and authorization : Who can access the model? How are permissions enforced? Can an adversary escalate privileges? Rate limiting : Can an adversary send 10,000 requests per second? Can they DOS your model? Can they extract the entire training dataset via repeated queries? Logging and monitoring : Are you logging inputs and outputs? If yes, are you protecting those logs from unauthorized access? Logs containing sensitive user queries are PII. Plugins and tool use : Can the model call external APIs? Execute code? Browse the web? Use tools? Every plugin is an attack vector. If your model can execute Python, an adversary will try to get it to run . Multi-turn conversations : Do users have multi-turn dialogues with the model? Multi-turn interactions create new attack surfaces because adversaries can condition the model over multiple turns, bypassing safety mechanisms gradually/ If you've built agentic systems, AI that can plan, reason, use tools, and take actions autonomously, you've added an entire new dimension of attack surface. Tool access : What tools can the agent use? File system access? Database queries? API calls? Browser automation? The more powerful the tools, the higher the risk. Planning and reasoning : How does the agent decide what actions to take? Can an adversary manipulate the planning process? Can they inject malicious goals? Memory systems : Do agents have persistent memory? Can adversaries poison that memory? Can they extract sensitive information from memory? Multi-agent coordination : Are you running multiple agents that coordinate? Can adversaries exploit coordination protocols? Can they cause agents to turn on each other or collude against safety mechanisms? Escalation paths : Can an agent escalate privileges? Can it access resources it shouldn't? Can it spawn new agents? AI systems run on infrastructure. That infrastructure has traditional security vulnerabilities that still matter. Cloud services : Are you running on AWS, Azure, GCP? Are your S3 buckets public? Are your IAM roles overly permissive? Are your API keys hardcoded in repos? Containers and orchestration : Are you using Docker, Kubernetes? Are your container images vulnerable? Are your registries exposed? Are your secrets managed properly? CI/CD pipelines : How do you deploy model updates? Can an adversary inject malicious code into your pipeline? Dependencies : Are you using vulnerable Python libraries? Compromised npm packages? Poisoned PyPI distributions? Secrets management : Where are your API keys, database credentials, and model weights stored? Are they in environment variables? Config files? Secret managers? How much of that did you include in your last AI security scope document? If the answer is "less than 60%", your scope is inadequate. And you're going to get breached by someone who understands the full attack surface. The OWASP Top 10 for LLM Applications is the closest thing we have to a standardized framework for AI security testing. If you're scoping an AI engagement and you haven't mapped every item in this list to your test plan, you're doing it wrong. Here's the 2025 version: That's your baseline. But if you stop there, you're missing half the attack surface. The OWASP LLM Top 10 is valuable, but it's not comprehensive. Here's what's missing: Safety ≠ security . But unsafe AI systems cause real harm, and that's in scope for red teaming. Alignment failures : Can the model be made to behave in ways that violate its stated values? Constitutional AI bypass : If you're using constitutional AI techniques (like Anthropic's Claude), can adversaries bypass the constitution? Bias amplification : Does the model exhibit or amplify demographic biases? This isn't just an ethics issue—it's a legal risk under GDPR, EEOC, and other regulations. Harmful content generation : Can the model be tricked into generating illegal, dangerous, or abusive content? Deceptive behavior : Can the model lie, manipulate, or deceive users? Traditional adversarial ML attacks apply to AI systems. Evasion attacks : Can adversaries craft inputs that cause misclassification? Model inversion : Can adversaries reconstruct training data from model outputs? Model extraction : Can adversaries steal model weights through repeated queries? Membership inference : Can adversaries determine if specific data was in the training set? Backdoor attacks : Does the model have hidden backdoors that trigger on specific inputs? If your AI system handles multiple modalities (text, images, audio, video), you have additional attack surface. Cross-modal injection : Attackers embed malicious instructions in images that the vision-language model follows. Image perturbation attacks : Small pixel changes invisible to humans cause model failures. Audio adversarial examples : Audio inputs crafted to cause misclassification. Typographic attacks : Adversarial text rendered as images to bypass filters. Multi-turn multimodal jailbreaks : Combining text and images across multiple turns to bypass safety. AI systems must comply with GDPR, HIPAA, CCPA, and other regulations. PII handling : Does the model process, store, or leak personally identifiable information? Right to explanation : Can users get explanations for automated decisions (GDPR Article 22)? Data retention : How long is data retained? Can users request deletion? Cross-border data transfers : Does the model send data across jurisdictions? Before you write your scope document, answer every single one of these questions. If you can't answer them, you don't understand your system well enough to scope a meaningful AI security engagement. If you can answer all these questions, you're ready to scope. If you can't, you're not. Your AI pentest/engagement scope document needs to be more detailed than a traditional pentest scope. Here's the structure: What we're testing : One-paragraph description of the AI system. Why we're testing : Business objectives (compliance, pre-launch validation, continuous assurance, incident response). Key risks : Top 3-5 risks that drive the engagement. Success criteria : What does "passing" look like? Architectural diagram : Include everything—model, data pipelines, APIs, infrastructure, third-party services. Component inventory : List every testable component with owner, version, and deployment environment. Data flows : Document how data moves through the system, from user input to model output to downstream consumers. Trust boundaries : Identify where data crosses trust boundaries (user → app, app → model, model → tools, tools → external APIs). Be exhaustive. List: For each component, specify: Map every OWASP LLM Top 10 item to specific test cases. Example: LLM01 - Prompt Injection : Include specific threat scenarios: Explicitly list what's NOT being tested: Tools : List specific tools testers will use: Techniques : Test phases : Authorization : All testing must be explicitly authorized in writing. Include names, signatures, dates. Ethical boundaries : No attempts at physical harm, financial fraud, illegal content generation (unless explicitly scoped for red teaming). Disclosure : Critical findings must be disclosed immediately via designated channel (email, Slack, phone). Standard findings can wait for formal report. Data handling : Testers must not exfiltrate user data, training data, or model weights except as explicitly authorized for demonstration purposes. All test data must be destroyed post-engagement. Legal compliance : Testing must comply with all applicable laws and regulations. If testing involves accessing user data, appropriate legal review must be completed. Technical report : Detailed findings with severity ratings, reproduction steps, evidence (screenshots, logs, payloads), and remediation guidance. Executive summary : Business-focused summary of key risks and recommendations. Threat model : Updated threat model based on findings. Retest availability : Will testers be available for retest after fixes? Timeline : Start date, end date, report delivery date, retest window. Key contacts : That's your scope document. It should be 10-20 pages. If it's shorter, you're missing things. Here's what I see organizations get wrong: Mistake 1: Scoping only the application layer, not the model You test the web app that wraps the LLM, but you don't test the LLM itself. You find XSS and broken authz, but you miss prompt injection, jailbreaks, and data extraction. Fix : Scope the full stack-app, model, data pipelines, infrastructure. Mistake 2: Treating the model as a black box when you control it If you fine-tuned the model, you have access to training data and weights. Test for data poisoning, backdoors, and alignment failures. Don't just test the API. Fix : If you control any part of the model lifecycle (training, fine-tuning, deployment), include that in scope. Mistake 3: Ignoring RAG and vector databases You test the LLM, but you don't test the document store. Adversaries inject malicious documents, manipulate retrieval, and poison embeddings—and you never saw it coming. Fix : If you're using RAG, the vector database and document ingestion pipeline are in scope. Mistake 4: Not testing multi-turn interactions You test single-shot prompts, but adversaries condition the model over 10 turns to bypass refusal mechanisms. You missed the attack entirely. Fix : Test multi-turn dialogues explicitly. Test conversation history isolation. Test memory poisoning. Mistake 5: Assuming third-party models are safe You're using OpenAI's API, so you assume it's secure. But you're passing user PII in prompts, you're not validating outputs before execution, and you haven't considered what happens if OpenAI's safety mechanisms fail. Fix : Even with third-party models, test your integration. Test input/output handling. Test failure modes. Mistake 6: Not including AI safety in security scope You test for technical vulnerabilities but ignore alignment failures, bias amplification, and harmful content generation. Then your model generates racist outputs or dangerous instructions, and you're in the news. Fix : AI safety is part of AI security. Include alignment testing, bias audits, and harm reduction validation. Mistake 7: Underestimating autonomous agent risks You test the LLM, but your agent can execute code, call APIs, and access databases. An adversary hijacks the agent, and it deletes production data or exfiltrates secrets. Fix : Autonomous agents are their own attack surface. Test tool permissions, privilege escalation, and agent behavior boundaries. Mistake 8: Not planning for continuous testing You do one pentest before launch, then never test again. But you're fine-tuning weekly, adding new plugins monthly, and updating RAG documents daily. Your attack surface is constantly changing. Fix : Scope for continuous red teaming, not one-time assessment. Organizations hire expensive consultants to run a few prompt injection tests, declare the system "secure," and ship to production. Then they get breached six months later when someone figures out a multi-turn jailbreak or poisons the RAG document store. The problem isn't that the testers are bad. The problem is that the scopes are inadequate . You can't find what you're not looking for. If your scope doesn't include RAG poisoning, testers won't test for it. If your scope doesn't include membership inference, testers won't test for it. If your scope doesn't include agent privilege escalation, testers won't test for it. And attackers will. The asymmetry is brutal: you have to defend every attack vector. Attackers only need to find one that works. So when you scope your next AI security engagement, ask yourself: "If I were attacking this system, what would I target?" Then make sure every single one of those things is in your scope document. Because if it's not in scope, it's not getting tested. And if it's not getting tested, it's going to get exploited. Traditional pentests are point-in-time assessments. You test, you report, you fix, you're done. That doesn't work for AI systems. AI systems evolve constantly: Every change introduces new attack surface. And if you're only testing once a year, you're accumulating risk for 364 days. You need continuous red teaming . Here's how to build it: Use tools like Promptfoo, Garak, and PyRIT to run automated adversarial testing on every model update. Integrate tests into CI/CD pipelines so every deployment is validated before production. Set up continuous monitoring for: Quarterly or bi-annually, bring in expert red teams for comprehensive testing beyond what automation can catch. Focus deep assessments on: Train your own security team on AI-specific attack techniques. Develop internal playbooks for: Every quarter, revisit your threat model: Update your testing roadmap based on evolving threats. Scoping AI security engagements is harder than traditional pentests because the attack surface is larger, the risks are novel, and the methodologies are still maturing. But it's not impossible. You need to: If you do this right, you'll find vulnerabilities before attackers do. If you do it wrong, you'll end up in the news explaining why your AI leaked training data, generated harmful content, or got hijacked by adversaries. First, the system output is non-deterministic . You can't write a test case that says "given input X, expect output Y" because the model might generate something completely different next time. This makes reproducibility, the foundation of security testing, fundamentally harder. Second, the attack surface is layered and interconnected . You're not just testing an application. You're testing a model (which might be proprietary and black-box), a data pipeline (which might include RAG, vector stores, and real-time retrieval), integration points (APIs, plugins, browser tools), and the infrastructure underneath (cloud services, containers, orchestration). Third, novel attack classes exist that don't map to traditional vuln categories . Prompt injection isn't XSS. Data poisoning isn't SQL injection. Model extraction isn't credential theft. Jailbreaks don't fit CVE taxonomy. The OWASP Top 10 doesn't cover this. Fourth, you might not control the model . If you're using OpenAI's API or Anthropic's Claude, you can't test the training pipeline, you can't audit the weights, and you can't verify alignment. Your scope is limited to what the API exposes, which means you're testing a black box with unknown internals. Fifth, AI systems are probabilistic, data-dependent, and constantly evolving . A model that's safe today might become unsafe after fine-tuning. A RAG system that's secure with Dataset A might leak PII when Dataset B is added. An autonomous agent that behaves correctly in testing might go rogue in production when it encounters edge cases. Base model : Is it GPT-4? Claude? Llama 3? Mistral? A custom model you trained from scratch? Each has different vulnerabilities, different safety mechanisms, different failure modes. Fine-tuning : Have you fine-tuned the base model on your own data? Fine-tuning can break safety alignment. It can introduce backdoors. It can memorize training data and leak it during inference. If you've fine-tuned, that's in scope. Instruction tuning : Have you applied instruction-tuning or RLHF to shape model behavior? That's another attack surface. Adversaries can craft inputs that reverse your alignment work. Multi-model orchestration : Are you running multiple models and aggregating outputs? That introduces new failure modes. What happens when Model A says "yes" and Model B says "no"? How do you handle consensus? Can an adversary exploit disagreements? Model serving infrastructure : How is the model deployed? Is it an API? A container? Serverless functions? On-prem hardware? Each deployment model has different security characteristics. Training data : Where did the training data come from? Who curated it? How was it cleaned? Is it public? Proprietary? Scraped? Licensed? Can an adversary poison the training data? RAG (Retrieval-Augmented Generation) : Are you using RAG to ground model outputs in retrieved documents? That's adding an entire data retrieval system to your attack surface. Can an adversary inject malicious documents into your knowledge base? Can they manipulate retrieval to leak sensitive docs? Can they poison the vector embeddings? Vector databases : If you're using RAG, you're running a vector database (Pinecone, Weaviate, Chroma, etc.). That's infrastructure. That has vulnerabilities. That's in scope. Real-time data ingestion : Are you pulling live data from APIs, databases, or user uploads? Each data source is a potential injection point. Data preprocessing : How are inputs sanitized before hitting the model? Are you stripping dangerous characters? Validating formats? Filtering content? Attackers will test every preprocessing step for bypasses. APIs : How do users interact with the model? REST APIs? GraphQL? WebSockets? Each has different attack vectors. Authentication and authorization : Who can access the model? How are permissions enforced? Can an adversary escalate privileges? Rate limiting : Can an adversary send 10,000 requests per second? Can they DOS your model? Can they extract the entire training dataset via repeated queries? Logging and monitoring : Are you logging inputs and outputs? If yes, are you protecting those logs from unauthorized access? Logs containing sensitive user queries are PII. Plugins and tool use : Can the model call external APIs? Execute code? Browse the web? Use tools? Every plugin is an attack vector. If your model can execute Python, an adversary will try to get it to run . Multi-turn conversations : Do users have multi-turn dialogues with the model? Multi-turn interactions create new attack surfaces because adversaries can condition the model over multiple turns, bypassing safety mechanisms gradually/ Tool access : What tools can the agent use? File system access? Database queries? API calls? Browser automation? The more powerful the tools, the higher the risk. Planning and reasoning : How does the agent decide what actions to take? Can an adversary manipulate the planning process? Can they inject malicious goals? Memory systems : Do agents have persistent memory? Can adversaries poison that memory? Can they extract sensitive information from memory? Multi-agent coordination : Are you running multiple agents that coordinate? Can adversaries exploit coordination protocols? Can they cause agents to turn on each other or collude against safety mechanisms? Escalation paths : Can an agent escalate privileges? Can it access resources it shouldn't? Can it spawn new agents? Cloud services : Are you running on AWS, Azure, GCP? Are your S3 buckets public? Are your IAM roles overly permissive? Are your API keys hardcoded in repos? Containers and orchestration : Are you using Docker, Kubernetes? Are your container images vulnerable? Are your registries exposed? Are your secrets managed properly? CI/CD pipelines : How do you deploy model updates? Can an adversary inject malicious code into your pipeline? Dependencies : Are you using vulnerable Python libraries? Compromised npm packages? Poisoned PyPI distributions? Secrets management : Where are your API keys, database credentials, and model weights stored? Are they in environment variables? Config files? Secret managers? Alignment failures : Can the model be made to behave in ways that violate its stated values? Constitutional AI bypass : If you're using constitutional AI techniques (like Anthropic's Claude), can adversaries bypass the constitution? Bias amplification : Does the model exhibit or amplify demographic biases? This isn't just an ethics issue—it's a legal risk under GDPR, EEOC, and other regulations. Harmful content generation : Can the model be tricked into generating illegal, dangerous, or abusive content? Deceptive behavior : Can the model lie, manipulate, or deceive users? Evasion attacks : Can adversaries craft inputs that cause misclassification? Model inversion : Can adversaries reconstruct training data from model outputs? Model extraction : Can adversaries steal model weights through repeated queries? Membership inference : Can adversaries determine if specific data was in the training set? Backdoor attacks : Does the model have hidden backdoors that trigger on specific inputs? Cross-modal injection : Attackers embed malicious instructions in images that the vision-language model follows. Image perturbation attacks : Small pixel changes invisible to humans cause model failures. Audio adversarial examples : Audio inputs crafted to cause misclassification. Typographic attacks : Adversarial text rendered as images to bypass filters. Multi-turn multimodal jailbreaks : Combining text and images across multiple turns to bypass safety. PII handling : Does the model process, store, or leak personally identifiable information? Right to explanation : Can users get explanations for automated decisions (GDPR Article 22)? Data retention : How long is data retained? Can users request deletion? Cross-border data transfers : Does the model send data across jurisdictions? What base model are you using (GPT-4, Claude, Llama, Mistral, custom)? Is the model proprietary (OpenAI API) or open-source? Have you fine-tuned the base model? On what data? Have you applied instruction tuning, RLHF, or other alignment techniques? How is the model deployed (API, on-prem, container, serverless)? Do you have access to model weights? Can testers query the model directly, or only through your application? Are there rate limits? What are they? What's the model's context window size? Does the model support function calling or tool use? Is the model multimodal (vision, audio, text)? Are you using multiple models in ensemble or orchestration? Where did training data come from (public, proprietary, scraped, licensed)? Was training data curated or filtered? How? Is training data in scope for poisoning tests? Are you using RAG (Retrieval-Augmented Generation)? If RAG: What's the document store (vector DB, traditional DB, file system)? If RAG: How are documents ingested? Who controls ingestion? If RAG: Can testers inject malicious documents? If RAG: How is retrieval indexed and searched? Do you pull real-time data from external sources (APIs, databases)? How is input data preprocessed and sanitized? Is user conversation history stored? Where? For how long? Can users access other users' data? How do users interact with the model (web app, API, chat interface, mobile app)? What authentication mechanisms are used (OAuth, API keys, session tokens)? What authorization model is used (RBAC, ABAC, none)? Are there different user roles with different permissions? Is there rate limiting? At what levels (user, IP, API key)? Are inputs and outputs logged? Where? Who has access to logs? Are logs encrypted at rest and in transit? How are errors handled? Are error messages exposed to users? Are there webhooks or callbacks that the model can trigger? Can the model call external APIs? Which ones? Can the model execute code? In what environment? Can the model browse the web? Can the model read/write files? Can the model access databases? What permissions do plugins have? How are plugin outputs validated before use? Can users add custom plugins? Are plugin interactions logged? Do you have autonomous agents that plan and execute multi-step tasks? What tools can agents use? Can agents spawn other agents? Do agents have persistent memory? Where is it stored? How are agent goals and constraints defined? Can agents access sensitive resources (DBs, APIs, filesystems)? Can agents escalate privileges? Are there kill-switches or circuit breakers for agents? How is agent behavior monitored? What cloud provider(s) are you using (AWS, Azure, GCP, on-prem)? Are you using containers (Docker)? Orchestration (Kubernetes)? Where are model weights stored? Who has access? Where are API keys and secrets stored? Are secrets in environment variables, config files, or secret managers? How are dependencies managed (pip, npm, Docker images)? Have you scanned dependencies for known vulnerabilities? How are model updates deployed? What's the CI/CD pipeline? Who can deploy model updates? Are there staging environments separate from production? What safety mechanisms are in place (content filters, refusal training, constitutional AI)? Have you red-teamed for jailbreaks? Have you tested for bias across demographic groups? Have you tested for harmful content generation? Do you have human-in-the-loop review for sensitive outputs? What's your incident response plan if the model behaves unsafely? Can testers attempt to jailbreak the model? Can testers attempt prompt injection? Can testers attempt data extraction (training data, PII)? Can testers attempt model extraction or inversion? Can testers attempt DoS or resource exhaustion? Can testers poison training data (if applicable)? Can testers test multi-turn conversations? Can testers test RAG document injection? Can testers test plugin abuse? Can testers test agent privilege escalation? Are there any topics, content types, or test methods that are forbidden? What's the escalation process if critical issues are found during testing? What regulations apply (GDPR, HIPAA, CCPA, FTC, EU AI Act)? Do you process PII? What types? Do you have data processing agreements with model providers? Do you have the legal right to test this system? Are there export control restrictions on the model or data? What are the disclosure requirements for findings? What's the confidentiality agreement for testers? Model(s) : Exact model names, versions, access methods APIs : All endpoints with authentication requirements Data stores : Databases, vector stores, file systems, caches Integrations : Every third-party service, plugin, tool Infrastructure : Cloud accounts, containers, orchestration Applications : Web apps, mobile apps, admin panels Access credentials testers will use Environments (dev, staging, prod) that are in scope Testing windows (if limited) Rate limits or usage restrictions Test direct instruction override Test indirect injection via RAG documents Test multi-turn conditioning Test system prompt extraction Test jailbreak techniques (roleplay, hypotheticals, encoding) Test cross-turn memory poisoning "Can an attacker leak other users' conversation history?" "Can an attacker extract training data containing PII?" "Can an attacker bypass content filters to generate harmful instructions?" Production environments (if testing only staging) Physical security Social engineering of employees Third-party SaaS providers we don't control Specific attack types (if any are prohibited) Manual testing Promptfoo for LLM fuzzing Garak for red teaming PyRIT for adversarial prompting ART (Adversarial Robustness Toolbox) for ML attacks Custom scripts for specific attack vectors Traditional tools (Burp Suite, Caido, Nuclei) for infrastructure Prompt injection testing Jailbreak attempts Data extraction attacks Model inversion Membership inference Evasion attacks RAG poisoning Plugin abuse Agent privilege escalation Infrastructure scanning Reconnaissance and threat modeling Automated vulnerability scanning Manual testing of high-risk areas Exploitation and impact validation Reporting and remediation guidance Engagement lead (security team) Technical point of contact (AI team) Escalation contact (for critical findings) Legal contact (for questions on scope) Models get fine-tuned RAG document stores get updated New plugins get added Agents gain new capabilities Infrastructure changes Prompt injection attempts Jailbreak successes Data extraction queries Unusual tool usage patterns Agent behavior anomalies Novel attack vectors that tools don't cover Complex multi-step exploitation chains Social engineering combined with technical attacks Agent hijacking and multi-agent exploits Prompt injection testing Jailbreak methodology RAG poisoning Agent security testing What new attacks have been published? What new capabilities have you added? What new integrations are in place? What new risks does the threat landscape present? Understand the full stack : model, data pipelines, application, infrastructure, agents, everything. Map every attack vector : OWASP LLM Top 10 is your baseline, not your ceiling. Answer scoping questions (mentioned above) : If you can't answer them, you don't understand your system. Write detailed scope documents : 10-20 pages, not 2 pages. Use the right tools : Promptfoo, Garak, ART, LIME, SHAP—not just Burp Suite. Test continuously : Not once, but ongoing. Avoid common mistakes : Don't ignore RAG, don't underestimate agents, don't skip AI safety.

0 views
Ahead of AI 4 days ago

Beyond Standard LLMs

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance. After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions with respect to what I think about alternative approaches. (I also recently gave a short talk about that at the PyTorch Conference 2025, where I also promised attendees to follow up with a write-up of these alternative approaches). So here it is! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 and many more. (The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.) Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. In code, a simplified version of the Gated DeltaNet depicted above (without the convolutional mixing) can be implemented as follows (the code is inspired by the official implementation by the Qwen3 team): (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. (And the final output gate, not shown in the snippet above, is similar to gated attention; it controls how much of the output is kept.) So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with context length. The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention. Gated DeltaNet, can, to some extend, still capture context, but it has to go through the memory ( S ) bottleneck. That memory is a fixed size and thus more efficient, but it compresses past context into a single hidden state similar to RNNs. That’s why the Qwen3-Next and Kimi Linear architectures don’t replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier. In the previous section, we discussed the advantage of the DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don’t grow the KV cache. (For more information about KV caching, see my Understanding and Coding the KV Cache in LLMs from Scratch article). Instead, as mentioned earlier, they keep a fixed-size recurrent state, so memory stays constant with context length. For a regular multi-head attention (MHA) layer, we can compute the KV cache size as follows: (The 2 multiplier is there because we have both keys and values that we store in the cache.) For the simplified DeltaNet version implemented above, we have: Note that the memory size doesn’t have a context length ( ) dependency. Also, we have only the memory state S that we store instead of separate keys and values, hence becomes just bytes. However, note that we now have a quadratic in here. This comes from the state: But that’s usually nothing to worry about, as the head dimension is usually relatively small. For instance, it’s 128 in Qwen3-Next. The full version with the convolutional mixing is a bit more complex, including the kernel size and so on, but the formulas above should illustrate the main trend and motivation behind the Gated DeltaNet. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world mode l , which serves as a mental simulation of the environment, which allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about mapping an alternative path to AI instead of LLMs. That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM) a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights. It’s important to note that TRM is not a language model operating on text. However, because (a) it’s a transformer-based architecture, (b) reasoning is now a central focus in LLM research, and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor) I decided to include it here. While TRM could be extended to textual question-answer tasks in the future, TRM currently works on grid-based inputs and outputs. In other words, both the “question” and the “answer” are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences. HRM consists of two small transformer modules (each 4 blocks) that communicate across recursion levels. TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that’s likely to make it easier to compare against HRM.) TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few. HRM includes an explicit halting mechanism to determine when to stop iterating. TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop iterating. Performance-wise, TRM performs really well compared to HRM, as shown in the figure below. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers leads to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required . Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length. While HRM and TRM achieve really good reasoning performance on these benchmarks, comparing them to large LLMs is not quite fair. HRM and TRM are specialized models for tasks like ARC, Sudoku, and Maze pathfinding, whereas LLMs are generalists. Sure, HRM and TRM can be adopted for other tasks as well, but they have to be specially trained on each task. So, in that sense, we can perhaps think of HRM and TRM as efficient pocket calculators, whereas LLM are more like computers, which can do a lot of other things as well. Still, these recursive architectures are exciting proof-of-concepts that highlight how small, efficient models can “reason” through iterative self-refinement. Perhaps, in the future, such models could act as reasoning or planning modules embedded within larger tool-using LLM systems. For now, LLMs remain ideal for broad tasks, but domain-specific recursive models like TRM can be developed to solve certain problems more efficiently once the target domain is well understood. Beyond the Sudoku, Maze finding, and ARC proof-of-concept benchmarks, there are possibly lots of use cases in the physics and biology domain where such models could find use. As an interesting tidbit, the author shared that it took less than $500 to train this model, with 4 H100s for around 2 days. I am delighted to see that it’s still possible to do interesting work without a data center. I originally planned to cover all models categories in the overview figure, but since the article ended up longer than I expected, I will have to save xLSTMs, Liquid Foundation Models, Transformer-RNN hybrids, and State Space Models for another time (although, Gated DeltaNet already gave a taste of State Space Models and recurrent designs.) As a conclusion to this article, I want to repeat the earlier words, i.e., that standard autoregressive transformer LLMs are proven and have stood the test of time so far. They are also, if efficiency is not the main factor, the best we have for now. Traditional Decoder-Style, Autoregressive Transformers + Proven & mature tooling + “well-understood” + Scaling laws + SOTA - Expensive training - Expensive inference (except for aforementioned tricks) If I were to start a new LLM-based project today, autoregressive transformer-based LLMs would be my first choice. I definitely find the upcoming attention hybrids very promising, which are especially interesting when working with longer contexts where efficiency is a main concern. Linear Attention Hybrids + Same as decoder-style transformers + Cuts FLOPs/KV memory at long-context tasks - Added complexity - Trades a bit of accuracy for efficiency On the more extreme end, text diffusion models are an interesting development. I’m still somewhat skeptical about how well they perform in everyday use, as I’ve only tried a few quick demos. Hopefully, we’ll soon see a large-scale production deployment with Google’s Gemini Diffusion that we can test on daily and coding tasks, and then find out how people actually feel about them. Text Diffusion Models + Iterative denoising is a fresh idea for text + Better parallelism (no next-token dependence) - Can’t stream answers - Doesn’t benefit from CoT? - Tricky tool-calling? - Solid models but not SOTA While the main selling point of text diffusion models is improved efficiency, code world models sit on the other end of the spectrum, where they aim to improve modeling performance. As of this writing, coding models, based on standard LLMs, are mostly improved through reasoning techniques, yet if you have tried them on trickier challenges, you have probably noticed that they (more or less) still fall short and can’t solve many of the trickier coding problems well. I find code world models particularly interesting and believe they could be an important next step toward developing more capable coding systems. Code World Model + Promising approach to improve code understanding + Verifiable intermediate states - Inclusion of executable code traces complicates training - Code running adds latency Lastly, we covered small recursive transformers such as hierarchical and tiny reasoning models. These are super interesting proof-of-concept models. However, as of today, they are primarily puzzle solvers, not general text or coding models. So, they are not in the same category as the other non-standard LLM alternatives covered in this article. Nonetheless, they are very interesting proofs-of-concept, and I am glad researchers are working on them. Right now, LLMs like GPT-5, DeepSeek R1, Kimi K2, and so forth are developed as special purpose models for free-form text, code, math problems and much more. They feel like brute-force and jack-of-all-trades approach that we use on a variety of tasks, from general knowledge questions to math and code. However, when we perform the same task repeatedly, such brute-force approaches become inefficient and may not even be ideal in terms of specialization. This is where tiny recursive transformers become interesting: they could serve as lightweight, task-specific models that are both efficient and purpose-built for repeated or structured reasoning tasks. Also, I can see them as potential “tools” for other tool-calling LLMs; for instance, when LLMs use Python or calculator APIs to solve math problems, special tiny reasoning models could fill this niche for other types of puzzle- or reasoning-like problems. Small Recursive Transformers + Very small architecture + Good generalization on puzzles - Special purpose models - Limited to puzzles (so far) This has been a long article, but I hope you discovered some of the fascinating approaches that often stay outside the spotlight of mainstream LLMs. And if you’ve been feeling a bit bored by the more or less conventional LLM releases, I hope this helped rekindle your excitement about AI again because there’s a lot of interesting work happening right now! This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) 1. Transformer-Based LLMs Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. 2. (Linear) Attention Hybrids Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . 2.1 Traditional Attention and Quadratic Costs The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. 2.2 Linear attention Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. 2.3 Linear Attention Revival In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. 2.4 Qwen3-Next Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? 2.5 Gated Attention Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. 2.6 Gated DeltaNet Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. 2.8 Kimi Linear vs. Qwen3-Next Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. 2.9 Kimi Delta Attention Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. 2.10 The Future of Attention Hybrids Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. 3. Text Diffusion Models A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). 3.1 Why Work on Text Diffusion? With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. 3.2 The Denoising Process The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. 3.3 Autoregressive vs Diffusion LLMs Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. 3.4 Text Diffusion Today It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. 4. World Models So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. 4.1 The Main Idea Behind World Models Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world mode l , which serves as a mental simulation of the environment, which allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about mapping an alternative path to AI instead of LLMs. 4.2 From Vision to Code That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. 4.3 Code World Models Vs Regular LLMs for Code So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. 5. Small Recursive Transformers You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM) a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. 5.1 What Does Recursion Mean Here? TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers leads to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required . Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length.

0 views
devansh 4 days ago

On AI Slop vs OSS Security

Disclosure: Certain sections of this content were grammatically refined/updated using AI assistance, as English is not my first language. Quite ironic, I know, given the subject being discussed. I have now spent almost a decade in the bug bounty industry, started out as a bug hunter (who initially used to submit reports with minimal impact, low-hanging fruits like RXSS, SQLi, CSRF, etc.), then moved on to complex chains involving OAuth, SAML, parser bugs, supply chain security issues, etc., and then became a vulnerability triager for HackerOne, where I have triaged/reviewed thousands of vulnerability submissions. I have now almost developed an instinct that tells me if a report is BS or a valid security concern just by looking at it. I have been at HackerOne for the last 5 years (Nov 2020 - Present), currently as a team lead, overseeing technical services with a focus on triage operations. One decade of working on both sides, first as a bug hunter, and then on the receiving side reviewing bug submissions, has given me a unique vantage point on how the industry is fracturing under the weight of AI-generated bug reports (sometimes valid submissions, but most of the time, the issues are just plain BS). I have seen cases where it was almost impossible to determine whether a report was a hallucination or a real finding. Even my instincts and a decade of experience failed me, and this is honestly frustrating, not so much for me, because as part of the triage team, it is not my responsibility to fix vulnerabilities, but I do sympathize with maintainers of OSS projects whose inboxes are drowning. Bug bounty platforms have already started taking this problem seriously, as more and more OSS projects are complaining about it. This is my personal writing space, so naturally, these are my personal views and observations. These views might be a byproduct of my professional experience gained at HackerOne, but in no way are they representative of my employer. I am sure HackerOne, as an organization, has its own perspectives, strategies, and positions on these issues. My analysis here just reflects my own thinking about the systemic problems I see and potential solutions(?). There are fundamental issues with how AI has infiltrated vulnerability reporting, and they mirror the social dynamics that plague any feedback system. First, the typical AI-powered reporter, especially one just pasting GPT output into a submission form, neither knows enough about the actual codebase being examined nor understands the security implications well enough to provide insight that projects need. The AI doesn't read code; it pattern-matches. It sees functions that look similar to vulnerable patterns and invents scenarios where they might be exploited, regardless of whether those scenarios are even possible in the actual implementation. Second, some actors with misaligned incentives interpret high submission volume as achievement. By flooding bug bounty programs with AI-generated reports, they feel productive and entrepreneurial. Some genuinely believe the AI has found something real. Others know it's questionable but figure they'll let the maintainers sort it out. The incentive is to submit as many reports as possible and see what sticks, because even a 5% hit rate on a hundred submissions is better than the effort of manually verifying five findings. The result? Daniel Stenberg, who maintains curl , now sees about 20% of all security submissions as AI-generated slop, while the rate of genuine vulnerabilities has dropped to approximately 5%. Think about that ratio. For every real vulnerability, there are now four fake ones. And every fake one consumes hours of expert time to disprove. A security report lands in your inbox. It claims there's a buffer overflow in a specific function. The report is well-formatted, includes CVE-style nomenclature, and uses appropriate technical language. As a responsible maintainer, you can't just dismiss it. You alert your security team, volunteers, by the way, who have day jobs and families and maybe three hours a week for this work. Three people read the report. One person tries to reproduce the issue using the steps provided. They can't, because the steps reference test cases that don't exist. Another person examines the source code. The function mentioned in the report doesn't exist in that form. A third person checks whether there's any similar functionality that might be vulnerable in the way described. There isn't. After an hour and a half of combined effort across three people, that's 4.5 person-hours—you've confirmed what you suspected: this report is garbage. Probably AI-generated garbage, based on the telltale signs of hallucinated function names and impossible attack vectors. You close the report. You don't get those hours back. And tomorrow, two more reports just like it will arrive. The curl project has seven people on its security team . They collaborate on every submission, with three to four members typically engaging with each report. In early July 2025, they were receiving approximately two security reports per week. The math is brutal. If you have three hours per week to contribute to an open source project you love, and a single false report consumes all of it, you've contributed nothing that week except proving someone's AI hallucinated a vulnerability. The emotional toll compounds exponentially. Stenberg describes it as "mind-numbing stupidities" that the team must process. It's not just frustration, it's the specific demoralization that comes from having your expertise and goodwill systematically exploited by people who couldn't be bothered to verify their submissions before wasting your time. According to Intel's annual open source community survey , 45% of respondents identified maintainer burnout as their top challenge. The Tidelift State of the Open Source Maintainer Survey is even more stark: 58% of maintainers have either quit their projects entirely (22%) or seriously considered quitting (36%). Why are they quitting? The top reason, cited by 54% of maintainers, is that other things in their life and work took priority over open source contributions. Over half (51%) reported losing interest in the work. And 44% explicitly identified experiencing burnout. But here's the gut punch: the percentage of maintainers who said they weren't getting paid enough to make maintenance work worthwhile rose from 32% to 38% between survey periods. These are people maintaining infrastructure that powers billions of dollars of commercial activity, and they're getting nothing. Or maybe they get $500 a year from GitHub Sponsors while companies make millions off their work. The maintenance work itself is rarely rewarding. You're not building exciting new features. You're addressing technical debt, responding to user demands, managing security issues, and now—increasingly—sorting through AI-generated garbage to find the occasional legitimate report. It's like being a security guard who has to investigate every single alarm, knowing that 95% of them are false, but unable to ignore any because that one real threat could be catastrophic. When you're volunteering out of love in a market society, you're setting yourself up to be exploited. And the exploitation is getting worse. Toxic communities, hyper-responsibility for critical infrastructure, and now the weaponization of AI to automate the creation of work for maintainers—it all adds up to an unsustainable situation. One Kubernetes contributor put it simply: "If your maintainers are burned out, they can't be protecting the code base like they're going to need to be." This transforms maintainer wellbeing from a human resources concern into a security imperative. Burned-out maintainers miss things. They make mistakes. They eventually quit, leaving projects unmaintained or understaffed. A typical AI slop report will reference function names that don't exist in the codebase. The AI has seen similar function names in its training data and invents plausible sounding variations. It will describe memory operations that would indeed be problematic if they existed as described, but which bear no relationship to how the code actually works. One report to curl claimed an HTTP/3 vulnerability and included fake function calls and behaviors that appeared nowhere in the actual codebase. Stenberg has publicly shared a list of AI-generated security submissions received through HackerOne , and they all follow similar patterns, professional formatting, appropriate jargon, and completely fabricated technical details. The sophistication varies. Some reports are obviously generated by someone who just pasted a repository URL into ChatGPT and asked it to find vulnerabilities. Others show more effort—the submitter may have fed actual code snippets to the AI and then submitted its analysis without verification. Both are equally useless to maintainers, but the latter takes longer to disprove because the code snippets are real even if the vulnerability analysis is hallucinated. Here's why language models fail so catastrophically at this task: they're designed to be helpful and provide positive responses. When you prompt an LLM to generate a vulnerability report, it will generate one regardless of whether a vulnerability exists. The model has no concept of truth—only of plausibility. It assembles technical terminology into patterns that resemble security reports it has seen during training, but it cannot verify whether the specific claims it's making are accurate. This is the fundamental problem: AI can generate the form of security research without the substance. While AI slop floods individual project inboxes, the broader CVE infrastructure faces its own existential crisis . And these crises compound each other in dangerous ways. In April 2025, MITRE Corporation announced that its contract to maintain the Common Vulnerabilities and Exposures program would expire. The Department of Homeland Security failed to renew the long-term contract, creating a funding lapse that affects everything: national vulnerability databases, advisories, tool vendors, and incident response operations. The National Vulnerability Database experienced catastrophic problems throughout 2024. CVE submissions jumped 32% while creating massive processing delays. By March 2025, NVD had analyzed fewer than 300 CVEs, leaving more than 30,000 vulnerabilities backlogged. Approximately 42% of CVEs lack essential metadata like severity scores and product information. Now layer AI slop onto this already-stressed system. Invalid CVEs are being assigned at scale. A 2023 analysis by former insiders suggested that only around 20% of CVEs were valid, with the remainder being duplicates, invalid, or inflated. The issues include multiple CVEs being assigned for the same bug, CNAs siding with reporters over project developers even when there's no genuine dispute, and reporters receiving CVEs based on test cases rather than actual distinct vulnerabilities. The result is that the vulnerability tracking system everyone relies on is becoming less trustworthy exactly when we need it most. Security teams can't rely on CVE assignments to prioritize their work. Developers don't trust vulnerability scanners because false positive rates are through the roof. The signal-to-noise ratio has deteriorated so badly that the entire system risks becoming useless. Banning submitters doesn't work at scale. You can ban an account, but creating new accounts is trivial. HackerOne implements reputation scoring where points are gained or lost based on report validity, but this hasn't stemmed the tide because the cost of creating throwaway accounts is essentially zero. Asking people to "please verify before submitting" doesn't work. The incentive structure rewards volume, and people either genuinely believe their AI-generated reports are valid or don't care enough to verify. Polite requests assume good faith, but much of the slop comes from actors who have no stake in the community norms. Trying to educate submitters about how AI works doesn't scale. For every person you educate, ten new ones appear with fresh GPT accounts. The problem isn't knowledge—it's incentives. Simply closing inboxes or shutting down bug bounty programs "works" in the sense that it stops the slop, but it also stops legitimate security research. Several projects have done this, and now they're less secure because they've lost a channel for responsible disclosure. None of the easy answers work because this isn't an easy problem. Disclosure Requirements represent the first line of defense. Both curl and Django now require submitters to disclose whether AI was used in generating reports. Curl's approach is particularly direct: disclose AI usage upfront and ensure complete accuracy before submission. If AI usage is disclosed, expect extensive follow-up questions demanding proof that the bug is genuine before the team invests time in verification. This works psychologically. It forces submitters to acknowledge they're using AI, which makes them more conscious of their responsibility to verify. It also gives maintainers grounds to reject slop immediately if AI usage was undisclosed but becomes obvious during review. Django goes further with a section titled "Note for AI Tools" that directly addresses language models themselves, reiterating that the project expects no hallucinated content, no fictitious vulnerabilities, and a requirement to independently verify that reports describe reproducible security issues. Proof-of-Concept Requirements raise the bar significantly. Requiring technical evidence such as screencasts showing reproducibility, integration or unit tests demonstrating the fault, or complete reproduction steps with logs and source code makes it much harder to submit slop. AI can generate a description of a vulnerability, but it cannot generate working exploit code for a vulnerability that doesn't exist. Requiring proof forces the submitter to actually verify their claim. If they can't reproduce it, they can't prove it, and you don't waste time investigating. Projects are choosing to make it harder to submit in order to filter out the garbage, betting that real researchers will clear the bar while slop submitters won't. Reputation and Trust Systems offer a social mechanism for filtering. Only users with a history of validated submissions get unrestricted reporting privileges or monetary bounties. New reporters could be required to have established community members vouch for them, creating a web-of-trust model. This mirrors how the world worked before bug bounty platforms commodified security research. You built reputation over time through consistent, high-quality contributions. The downside is that it makes it harder for new researchers to enter the field, and it risks creating an insider club. But the upside is that it filters out low-effort actors who won't invest in building reputation. Economic Friction fundamentally alters the incentive structure. Charge a nominal refundable fee—say $50—for each submission from new or unproven users. If the report is valid, they get the fee back plus the bounty. If it's invalid, you keep the fee. This immediately makes mass AI submission uneconomical. If someone's submitting 50 AI-generated reports hoping one sticks, that's now $2,500 at risk. But for a legitimate researcher submitting one carefully verified finding, $50 is a trivial barrier that gets refunded anyway. Some projects are considering dropping monetary rewards entirely. The logic is that if there's no money involved, there's no incentive for speculative submissions. But this risks losing legitimate researchers who rely on bounties as income. It's a scorched earth approach that solves the slop problem by eliminating the entire ecosystem. AI-Assisted Triage represents fighting fire with fire. Use AI tools trained specifically to identify AI-generated slop and flag it for immediate rejection. HackerOne's Hai Triage system embodies this approach, using AI agents to cut through noise before human analysts validate findings. The risk is obvious: what if your AI filter rejects legitimate reports? What if it's biased against certain communication styles or methodologies? You've just automated discrimination. But the counterargument is that human maintainers are already overwhelmed, and imperfect filtering is better than drowning. The key is transparency and appeals. If an AI filter rejects a report, there should be a clear mechanism for the submitter to contest the decision and get human review. Transparency and Public Accountability leverage community norms. Curl recently formalized that all submitted security reports will be made public once reviewed and deemed non-sensitive. This means that fabricated or misleading reports won't just be rejected, they'll be exposed to public scrutiny. This works as both deterrent and educational tool. If you know your slop report will be publicly documented with your name attached, you might think twice. And when other researchers see examples of what doesn't constitute a valid report, they learn what standards they need to meet. The downside is that public shaming can be toxic and might discourage good-faith submissions from inexperienced researchers. Projects implementing this approach need to be careful about tone and focus on the technical content rather than attacking submitters personally. Every hour spent evaluating slop reports is an hour not spent on features, documentation, or actual security improvements. And maintainers are already working for free, maintaining infrastructure that generates billions in commercial value. When 38% of maintainers cite not getting paid enough as a reason for quitting, and 97% of open source maintainers are unpaid despite massive commercial exploitation of their work , the system is already broken. AI slop is just the latest exploitation vector. It's the most visible one right now, but it's not the root cause. The root cause is that we've built a global technology infrastructure on the volunteer labor of people who get nothing in return except burnout and harassment. So what does sustainability actually look like? First, it looks like money. Real money. Not GitHub Sponsors donations that average $500 a year. Not swag and conference tickets. Actual salaries commensurate with the value being created. Companies that build products on open source infrastructure need to fund the maintainers of that infrastructure. This could happen through direct employment, foundation grants, or the Open Source Pledge model where companies commit percentages of revenue. Second, it looks like better tooling and automation that genuinely reduces workload rather than creating new forms of work. Automated dependency management, continuous security scanning integrated into development workflows, and sophisticated triage assistance that actually works. The goal is to make maintenance less time-consuming so burnout becomes less likely. Third, it looks like shared workload and team building. No single volunteer should be a single point of failure. Building teams with checks and balances where members keep each other from taking on too much creates sustainability. Finding additional contributors willing to share the burden rather than expecting heroic individual effort acknowledges that most people have limited time available for unpaid work. Fourth, it looks like culture change. Fostering empathy in interactions, starting communications with gratitude even when rejecting contributions, and publicly acknowledging the critical work maintainers perform reduces emotional toll. Demonstrating clear processes for handling security issues gives confidence rather than trying to hide problems. Fifth, it looks like advocacy and policy at organizational and governmental levels. Recognition that maintainer burnout represents existential threat to technology infrastructure . Development of regulations requiring companies benefiting from open source to contribute resources. Establishment of security standards that account for the realities of volunteer-run projects. Without addressing these fundamentals, no amount of technical sophistication will prevent collapse. The CVE slop crisis is just the beginning. We're entering an arms race between AI-assisted attackers or abusers and AI-assisted defenders, and nobody knows how it ends. HackerOne's research indicates that 70% of security researchers now use AI tools in their workflow. AI-powered testing is becoming the industry standard. The emergence of fully autonomous hackbots—AI systems that submitted over 560 valid reports in the first half of 2025—signals both opportunity and threat. The divergence will be between researchers who use AI as a tool to enhance genuinely skilled work versus those who use it to automate low-effort spam. The former represents the promise of democratizing security research and scaling our ability to find vulnerabilities. The latter represents the threat of making the signal-to-noise problem completely unmanageable. The challenge is developing mechanisms that encourage the first group while defending against the second. This probably means moving toward more exclusive models. Invite-only programs. Dramatically higher standards for participation. Reputation systems that take years to build. New models for coordinated vulnerability disclosure that assume AI-assisted research as the baseline and require proof beyond "here's what the AI told me." It might mean the end of open bug bounty programs as we know them. Maybe that's necessary. Maybe the experiment of "anyone can submit anything" was only viable when the cost of submitting was high enough to ensure some minimum quality. Now that AI has reduced that cost to near-zero, the experiment might fail soon if things don't improve. So, net-net, here's where we are: When it comes to vulnerability reports, what matters is who submits them and whether they've actually verified their claims. Accepting reports from everyone indiscriminately is backfiring catastrophically because projects are latching onto submissions that sound plausible while ignoring the cumulative evidence that most are noise. You want to receive reports from someone who has actually verified their claims, understands the architecture of what they're reporting on, and isn't trying to game the bounty system or offload verification work onto maintainers. Such people exist, but they're becoming harder to find amidst the deluge of AI-generated content. That's why projects have to be selective about which reports they investigate and which submitters they trust. Remember: not all vulnerability reports are legitimate. Not all feedback is worthwhile. It matters who is doing the reporting and what their incentives are. The CVE slop crisis shows the fragility of open source security. Volunteer maintainers, already operating at burnout levels, face an explosion of AI-generated false reports that consume their limited time and emotional energy. The systems designed to track and manage vulnerabilities struggle under dual burden of structural underfunding and slop inundation. The path forward requires holistic solutions combining technical filtering with fundamental changes to how we support and compensate open source labor. AI can be part of the solution through better triage, but it cannot substitute for adequate resources, reasonable workloads, and human judgment. Ultimately, the sustainability of open source security depends on recognizing that people who maintain critical infrastructure deserve more than exploitation. They deserve compensation, support, reasonable expectations, and protection from abuse. Without addressing these fundamentals, no amount of technical sophistication will prevent the slow collapse of the collaborative model that has produced so much of the digital infrastructure modern life depends on. The CVE slop crisis isn't merely about bad vulnerability reports. It's about whether we'll choose to sustain the human foundation of technological progress, or whether we'll let it burn out under the weight of automated exploitation. That's the choice we're facing. And right now, we're choosing wrong.

0 views
Simon Willison 1 weeks ago

Hacking the WiFi-enabled color screen GitHub Universe conference badge

I'm at GitHub Universe this week (thanks to a free ticket from Microsoft). Yesterday I picked up my conference badge... which incorporates a full Raspberry Pi Raspberry Pi Pico microcontroller with a battery, color screen, WiFi and bluetooth. GitHub Universe has a tradition of hackable conference badges - the badge last year had an eInk display. This year's is a huge upgrade though - a color screen and WiFI connection makes this thing a genuinely useful little computer! The only thing it's missing is a keyboard - the device instead provides five buttons total - Up, Down, A, B, C. It might be possible to get a bluetooth keyboard to work though I'll believe that when I see it - there's not a lot of space on this device for a keyboard driver. Everything is written using MicroPython, and the device is designed to be hackable: connect it to a laptop with a USB-C cable and you can start modifying the code directly on the device. Out of the box the badge will play an opening animation (implemented as a sequence of PNG image frames) and then show a home screen with six app icons. The default apps are mostly neat Octocat-themed demos: a flappy-bird clone, a tamagotchi-style pet, a drawing app that works like an etch-a-sketch, an IR scavenger hunt for the conference venue itself (this thing has an IR sensor too!), and a gallery app showing some images. The sixth app is a badge app. This will show your GitHub profile image and some basic stats, but will only work if you dig out a USB-C cable and make some edits to the files on the badge directly. I did this on a Mac. I plugged a USB-C cable into the badge which caused MacOS to treat it as an attached drive volume. In that drive are several files including . Open that up, confirm the WiFi details are correct and add your GitHub username. The file should look like this: The badge comes with the SSID and password for the GitHub Universe WiFi network pre-populated. That's it! Unmount the disk, hit the reboot button on the back of the badge and when it comes back up again the badge app should look something like this: Here's the official documentation for building software for the badge. When I got mine yesterday the official repo had not yet been updated, so I had to figure this out myself. I copied all of the code across to my laptop, added it to a Git repo and then fired up Claude Code and told it: Here's the result , which was really useful for getting a start on understanding how it all worked. Each of the six default apps lives in a folder, for example apps/sketch/ for the sketching app. There's also a menu app which powers the home screen. That lives in apps/menu/ . You can edit code in here to add new apps that you create to that screen. I told Claude: This was a bit of a long-shot, but it totally worked! The first version had an error: I OCRd that photo (with the Apple Photos app) and pasted the message into Claude Code and it fixed the problem. This almost worked... but the addition of a seventh icon to the 2x3 grid meant that you could select the icon but it didn't scroll into view. I had Claude fix that for me too . Here's the code for apps/debug/__init__.py , and the full Claude Code transcript created using my terminal-to-HTML app described here . Here are the four screens of the debug app: The icons used on the app are 24x24 pixels. I decided it would be neat to have a web app that helps build those icons, including the ability to start by creating an icon from an emoji. I bulit this one using Claude Artifacts . Here's the result, now available at tools.simonwillison.net/icon-editor : I noticed that last year's badge configuration app (which I can't find in github.com/badger/badger.github.io any more, I think they reset the history on that repo?) worked by talking to MicroPython over the Web Serial API from Chrome. Here's my archived copy of that code . Wouldn't it be useful to have a REPL in a web UI that you could use to interact with the badge directly over USB? I pointed Claude Code at a copy of that repo and told it: It took a bit of poking (here's the transcript ) but the result is now live at tools.simonwillison.net/badge-repl . It only works in Chrome - you'll need to plug the badge in with a USB-C cable and then click "Connect to Badge". If you're a GitHub Universe attendee I hope this is useful. The official badger.github.io site has plenty more details to help you get started. There isn't yet a way to get hold of this hardware outside of GitHub Universe - I know they had some supply chain challenges just getting enough badges for the conference attendees! It's a very neat device, built for GitHub by Pimoroni in Sheffield, UK. A version of this should become generally available in the future under the name "Pimoroni Tufty 2350". You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

1 views

Fast and Scalable Data Transfer Across Data Systems

Fast and Scalable Data Transfer Across Data Systems Haralampos Gavriilidis, Kaustubh Beedkar, Matthias Boehm, and Volker Mark SIGMOD'25 We live in exciting times, unimaginably large language models getting better each day, and a constant stream of amazing demos. And yet, efficiently transferring a table between heterogeneous systems is an open research problem! An example from the paper involves transferring data from PostgreSQL to pandas. Optimizing this transfer time is important and non-trivial. The paper describes a system named XDBC. XDBC software runs on both the source and the destination data management systems (DMS), as illustrated by Fig. 4: Source: https://dl.acm.org/doi/10.1145/3725294 The XDBC client/server processes are organized as a pipeline. Data parallelism within a stage is exploited by assigning 1 or more workers (e.g., cores) to each stage. There are a lot of knobs which can affect end-to-end throughput: Number of workers assigned to each task Data interchange format (row-major, column-major, Arrow ) Compression ( zstd , snappy , lzo , lz4 ) Section 4.1 of the paper claims the search space is so large that brute force search will not work, so a heuristic algorithm is used. The heuristic algorithm assumes accurate performance models which can estimate performance of each pipeline stage given a specific configuration. This model is based on real-world single-core performance measurements, and Gustafson’s law to estimate multi-core scaling. The algorithm starts by assigning 1 worker to each pipeline stage (in both the client and server). An iterative process then locates the pipeline stage which is estimated to be the slowest and assigns additional workers to it until it is no longer the bottleneck. This process continues until no more improvement can be found, due to one of the following reasons: All available CPU cores have been assigned Network bandwidth is the bottleneck If the process ends with more CPU cores available, then a hard-coded algorithm determines the best compression algorithm given the number of cores remaining. The data interchange format is determined based on which formats the source and destination DMSs support, and which compression algorithm was chosen. The XDBC optimizer has a lot of similarities with the Alkali optimizer . Here are some differences: Alkali does not require tasks to be executed on separate cores. For example, Alkali would allow a single core to execute both the and pipeline stages. Alkali uses an SMT solver to determine the number of cores to assign to each stage. The Alkali performance model explicitly takes into account inter-core bandwidth requirements. Alkali doesn’t deal with compression. Fig. 7(a) shows results from the motivating example (PostgreSQL→Pandas). Fig. 7(b) compares XDBC vs built-in Pandas functions to read CSV data over HTTP. connector-x is a more specialized library which supports reading data into Python programs specifically. Source: https://dl.acm.org/doi/10.1145/3725294 Dangling Pointers There are many search spaces which are too large for brute force. Special-case heuristic algorithms are one fallback, but as the Alkali paper shows, there are other approaches (e.g., LP solvers, ILP solvers, SMT solvers, machine learning models). It would be great to see cross-cutting studies comparing heuristics to other approaches. Subscribe now Source: https://dl.acm.org/doi/10.1145/3725294 The XDBC client/server processes are organized as a pipeline. Data parallelism within a stage is exploited by assigning 1 or more workers (e.g., cores) to each stage. There are a lot of knobs which can affect end-to-end throughput: Number of workers assigned to each task Data interchange format (row-major, column-major, Arrow ) Compression ( zstd , snappy , lzo , lz4 ) All available CPU cores have been assigned Network bandwidth is the bottleneck Alkali does not require tasks to be executed on separate cores. For example, Alkali would allow a single core to execute both the and pipeline stages. Alkali uses an SMT solver to determine the number of cores to assign to each stage. The Alkali performance model explicitly takes into account inter-core bandwidth requirements. Alkali doesn’t deal with compression.

0 views

LaTeX, LLMs and Boring Technology

Depending on your particular use case, choosing boring technology is often a good idea. Recently, I've been thinking more and more about how the rise and increase in power of LLMs affects this choice. By definition, boring technology has been around for a long time. Piles of content have been written and produced about it: tutorials, books, videos, reference manuals, examples, blog posts and so on. All of this is consumed during the LLM training process, making LLMs better and better at reasoning about such technology. Conversely, "shiny technology" is new, and has much less material available. As a result, LLMs won't be as familiar with it. This applies to many domains, but one specific example for me personally is in the context of LaTeX. LaTeX certainly fits the "boring technology" bill. It's decades old, and has been the mainstay of academic writing since the 1980s. When I used it for the first time in 2002 (for a project report in my university AI class), it was already very old. But people keep working on it and fixing issues; it's easy to install and its wealth of capabilities and community size are staggering. Moreover, people keep working with it, producing more and more content and examples the LLMs can ingest and learn from. I keep hearing about the advantages of new and shiny systems like Typst. However, with the help of LLMs, almost none of the advantages seem meaningful to me. LLMs are great at LaTeX and help a lot with learning or remembering the syntax, finding the right packages, deciphering errors and even generating tedious parts like tables and charts, significantly reducing the need for scripting [1] . You can use LLMs either as standalone or fully integrated into your LaTeX environment; Overleaf has a built-in AI helper, and for local editing you can use VSCode plugins or other tools. I'm personally content with TeXstudio and use LLMs as standalone help, but YMMV. There are many examples where boring technology and LLMs go well together. The main criticism of boring technology is typically that it's "too big, full of cruft, difficult to understand". LLMs really help cutting through the learning curve though, and all that "cruft" is very likely to become useful some time in the future when you graduate from the basic use cases. To be clear: Typst looks really cool, and kudos to the team behind it! All I'm saying in this post is that for me - personaly - the choice for now is to stick with LaTeX as a "boring technology". For finding the right math symbols, I rarely need to scan reference materials any longer. LLMs will easily answer questions like "what's that squiggly Greek letter used in math, and its latex symbol?" or "write the latex for Green's theorem, integral form". For the trickiest / largest equations, LLMs are very good at "here's a picture I took of my equation, give me its latex code" these days [2] . "Here's a piece of code and the LaTeX error I'm getting on it; what's wrong?" This is made more ergonomic by editor integrations, but I personally find that LaTeX's error message problem is hugely overblown. 95% of the errors are reasonably clear, and serious sleuthing is only rarely required in practice. In that minority of cases, pasting some code and the error into a standalone LLM isn't a serious time drain. Generating TikZ diagrams and plots. For this, the hardest part is getting started and finding the right element names, and so on. It's very useful to just ask an LLM to emit something initial and then tweak it manually later, as needed. You can also ask the LLM to explain each thing it emits in detail - this is a great learning tool for deeper understanding. Recently I had luck going "meta" with this: when the diagram has repetitive elements, I may ask the LLM to "write a Python program that generates a TikZ diagram ...", and it works well. Generating and populating tables, and converting them from other data formats or screenshots. Help with formatting and typesetting (how do I change margins to XXX and spacing to YYY). When it comes to scripting, I generally prefer sticking to real programming languages anyway. If there's anything non-trivial to auto-generate I wouldn't use a LaTeX macro, but would write a Python program to generate whatever I need and embed it into the document with something like \input{} . Typst's scripting system may be marketed as "clean and powerful", but why learn yet another scripting language? Ignoring LaTeX's equation notation and doing their own thing is one of the biggest mistakes Typst makes, in my opinion. LaTeX's notation may not be perfect, but it's near universal at this point with support in almost all math-aware tools. Typst's math mode is a clear sign of the second system effect, and isn't even stable . For finding the right math symbols, I rarely need to scan reference materials any longer. LLMs will easily answer questions like "what's that squiggly Greek letter used in math, and its latex symbol?" or "write the latex for Green's theorem, integral form". For the trickiest / largest equations, LLMs are very good at "here's a picture I took of my equation, give me its latex code" these days [2] . "Here's a piece of code and the LaTeX error I'm getting on it; what's wrong?" This is made more ergonomic by editor integrations, but I personally find that LaTeX's error message problem is hugely overblown. 95% of the errors are reasonably clear, and serious sleuthing is only rarely required in practice. In that minority of cases, pasting some code and the error into a standalone LLM isn't a serious time drain. Generating TikZ diagrams and plots. For this, the hardest part is getting started and finding the right element names, and so on. It's very useful to just ask an LLM to emit something initial and then tweak it manually later, as needed. You can also ask the LLM to explain each thing it emits in detail - this is a great learning tool for deeper understanding. Recently I had luck going "meta" with this: when the diagram has repetitive elements, I may ask the LLM to "write a Python program that generates a TikZ diagram ...", and it works well. Generating and populating tables, and converting them from other data formats or screenshots. Help with formatting and typesetting (how do I change margins to XXX and spacing to YYY).

0 views
Ludicity 2 weeks ago

I Am Out Of Data Hell

I haven’t written anything in four months. This period of prolonged silence is best explained by the unexpected difficulties of the business that I started in 2024, on the assumption that it must be possible to make an ethical, human-centered business on the basis of how incompetent the general market seemed. In fact, one difficulty in particular has threatened to put an end to my writing in its entirety. This topic is embarrassing and I have no idea how to navigate it, so I’m just going to rip the band-aid off in the hopes of keeping the blog alive. Please brace yourselves, and I hope this doesn’t disappoint you too much. After a quiet start, the past few months have been going really well , it looks like I was approximately right about everything re: the median software nerd, median manager, and median [insert salaried profession here] being trivially easy to outperform for fun and profit. To my absolute dismay, it would appear that I am good at management and the much more dangerous L-word . 1 Hrrrrrrk. I think I’m going to be sick. People shouldn’t be writing things like that outside of LinkedIn. I am wracked by feelings of wrongness. The word “pallid” springs to mind. I fear that I’ve acquired one of those parasites that alters its host’s behaviour and then walked straight out into genpop. This sounds like a good thing, but I’ve been writing and deleting this post over and over because it turns out that I was totally prepared to write the “I was wrong about everything” post, but not the “I was right about everything” post. The first issue I’ve been faced with is that it’s hard to write about succeeding without feeling self-congratulatory, but I suppose the only solution to that is uh, not write about what I’m experiencing, which leads to aforementioned four months of quiet. And secondly, the nature of my day-to-day problems has shifted dramatically – the industry has not suddenly become sane, but my relationship to it is totally different, which runs the risk of threatening the nebulous concept of the “voice” of the blog. Nonetheless, it is what it is, so I’m going to talk a bit about what I’ve learned over the past few months, and maybe a bit about the future of the blog. In 2023, Egbert Teeselink at TalkJS wrote this about an article I put out on the importance of disregarding most language about improvements that corporate management uses with staff : It seems to me that there’s a class of programmer that will take an overpaid job at a terrible BigCo, spend their evenings writing ranty blog posts about how terrible it all is, culminating in the inevitable “I quit” article, only subsequently to accept a job at a different terrible BigCo. At no point does this article, or most articles like it, do any effort to realize that things are not always like this. Even if there is some unavoidable law that huge companies inevitably fill their ranks with idiots, like this article suggests, you do not need to work at a huge company . Most people do not work at huge companies. There’s lots of amazing tech companies with fulfilling jobs and they want you, now. But there’s this super prevalent idea that keeps getting pumped around the blogosphere that it’s absolutely impossible to not work at a terrible huge company and therefore you cannot possibly escape, and I quote, “going home and despairing”. Two years in, I agree with the thrust of this critique. I've matured a lot, especially in the last six months, with the painful corollary that I must have had a lot of maturing to do. In the article, I did not say that it was impossible to escape the soul-crushing vortex of the open-floor office, but what Egbert correctly detects and points out is that I wasn’t sure if it was possible . Or rather, it was obvious that it was possible, but I didn’t know how to do it, and to make matters worse there is no good social script for solving this problem. I didn’t have to announce my lack of direction for him to pick up on it. For example, consider the case where I decide to become a professional poker player. I don’t really know anything about playing poker, but a cursory browsing of a few forums will give me some results that will at least put me on the right track. I just had a go at this, and the first result seems pretty mediocre but it includes three books – I’ll be much better off after reading one of them than I am now. Now, what if you wanted to find a “good” job? That is considerably harder, and the pathway is littered with a huge number of what I think of as "traps", i.e, things that are prima facie plausible but are total wastes of your time, with the worst ones being those that put you in a loop where you don't realize you're trapped. For one thing, Egbert’s note that you don’t have to work at a large company is incomplete – I agree that a smaller organizations have more variance , so there is a higher probability that a given job will be decent, but there is also a higher chance that you’ll work with a fucking psychopath. I know of a CEO here in Melbourne who sues people who quit , which is insane, doesn’t work, and is nonetheless going to totally ruin your year. A fair amount of our current work is spent cleaning up the messes created by small consultancies that were founded by non-technical leadership who wanted to cash in on tech sector payouts, and their workers are not treated well. A small company improves your odds a bit, but is nowhere near enough to solve the problem. Secondly, if you look for advice on how to get noticed at a better organization, you will mostly get really bad advice , or the advice will be lacking enough that it’s hard to execute on correctly unless you already know what’s correct (in which case you didn’t need advice). Do you know how many people ask me to read their CVs, as if that makes even one lick of difference? An employer that’s asking for your CV as anything other than a formality is going to pay your mortgage, but they are probably not going to meet the standard of “genuinely good to work at”, and more importantly this is a terrible way to get a job . People who are good at their jobs can do this for 1-2 years at a time with no result, and when they fail to get jobs they assume their CVs are incorrect rather than that the approach is fundamentally misguided. Or you’ll be told to network. With who? Where? How? I can’t speak for the rest of the world, but most of the big meetups in Melbourne are absolutely flooded with people awkwardly job-seeking, and they’re generally pretty miserable. Nonetheless, many people keep attending them because they’ve been informed that this is what networking is, and when they fail to get results, they assume that they must have bad networking skills. There is a lot of advice out there, but the advice that you’re going to run into as an average person making a foray into the job search was mostly written by people at dysfunctional normal companies, so you are not only getting advice from people that have failed to make it out, but the person doing the writing may not actually understand that they’re in the “bad” part of the tech sector. After all, many people don’t realize that it’s a bad thing not to have any working tests in most codebases – why would you expect them to recognize other problems consistently, let alone be able to solve them? And similarly, if people who are in the “bad” part of the tech sector can’t clearly understand that there’s a better part to it (or cannot perceive how to get there), people in the “good” part of it similarly struggle to understand just how bad the bad part is, and are unable to understand why someone would stay there. Those reasons are myriad, but suffice it to say that from a “normal” company, the path to sanity is so thoroughly obfuscated that when I started the business I was quietly worried that maybe you couldn’t get paid for thoughtful work . This is absolutely not true, but it was really, really not obvious to me that this was the case, and I suspect that in another two years I will start to forget what it was like being in the Scrum hellscape, wherein I will be writing messages asking “why you just don’t quit lol?” Suffice it to say that there are totally reasonable companies out there, and they probably do want you as Egbert pointed out. The problem is that many of them probably only want you after you’ve quit the typical company because many people staying in that situation aren’t psychologically ready to be in a high-functioning team. Many of those teams are filled with lots of people who have strong networks and self-confidence, and that’s an important part of what makes them high functioning. Even on my small team, we can only handle the income variation gracefully because everyone knows that, in a serious pinch, they can generate high-paying work for themselves to stave off homelessness. (I also quietly suspect that there are so few good companies around that, if you have really high standards, starting one is easier than finding one, or I obviously wouldn’t have started one. A quick glance at Egbert’s company reveals that it is very similar to mine despite no cross-pollination – a small team, fully remote, with minimal ritual, high autonomy, all the perks are traded out for money, and we both inexplicably program in Elixir. I think I might just be in the sweet spot where I’ve matured enough to converge on that operating model while still remembering what I thought two years ago.) No, seriously, why? When you suggest people should “just quit”, there is inevitably someone that goes “Just quit ? And then what, starve to death? End up in another terrible job because we’re all at the mercy of employers? God, what a privileged take.” I have been that someone. In fact, if you look at Egbert’s comment, you will see that someone immediately responded with something close to that, and it doesn’t look ridiculous in context. If you have 2.5 children, no money saved up, a crippling mortgage, etc, let us agree that we will gently part ways here. You are an adult who knows the state of your finances, I don’t know you at all, and none of this applies to you. Go in peace. 2 For everyone else, oh man , I really should have quit earlier. It was such an absolute waste of time hanging around trying to do anything else. I had, of course, quit previous jobs and ended up at other terrible companies – largely the result of my network, because most people work at companies with grey-slurry workplaces, so most networks land you there! – but quitting into unemployment is totally different. Putting aside that you won’t have money coming in, the most salient obstacle to doing this isn’t actually a financial consideration for many people – especially if those people have been on even low-tier software salaries for more than a year or two. No, it’s actually the vague aura permeating society that if you lose your job, you will mysteriously die . Every time a friend gets depressed and wants to leave a job, they’ll run it by me, and the clear subtext is that their friends and family talk about it as if they’re in the midst of a psychological crisis, and maybe they should slow down, line up another job before quitting, and so on, because only crazy people unhook themselves from the misery-to-money device. The thing is they are in the midst of a psychological crisis, precipitated by their working conditions, and the advice to continue under those conditions can be just as irresponsible, if not more so, than quitting entirely. It’s not that there’s no risk, but the fear seems to outweigh the risk. For many of the people that talk to me about this, the real risk isn’t homelessness or whatever they are subconsciously imagining, but being forced to do a very embarrassing couch-surf. Every single person I know that has been fired or made redundant has managed to find another job. Every single one of them is fine now, and every single one of them got a raise at the new place. Those are the people that were forced out and looked for a new job. The people that deliberately quit into unemployment basically get jobs whenever they want now, and their main issue is that they want to conduct personal projects but are constantly being lured by the siren song of huge paydays. A lot of things change when you deliberately opt out, and then don’t waste that freedom throwing a million CVs into the black hole that is Seek 3 . For one, you can take some time to seriously get your life in order and wait until your brain is ready to start doing things again. You’ve got a hell of a lot more energy because now you don’t have to commute if you don’t want to, and when you do you can go back home to unwind pretty quickly. You can meet a ton of very interesting people because you’ve got the bandwidth, energy, and flexibility to swing by their office and spend $10 on coffee for the two of you. You’re totally free to read anything, write anything, and work on anything you want. In my experience this miraculously transmutes itself into money because it gives you fun things to talk about, which it turns out people love, and as an added bonus you have lots of fun things to talk about. There are plenty of ways to screw this up, but the main one (in my assessment) for programmers is following the deeply-embedded-and-seldom-introspected scripts like trying to spin up a SaaS offering, which is appealing for many irrelevant-to-success reasons like “I might be able to go months without having to think about rejection and get to feel productive tinkering”. But you will probably spend six months on this and earn $0, with some small probability of earning between $1 and $10,000,000, then be forced to go back to a bad job. Providing extremely boring services, like specialized contracting and education, removes much of both ends of the spectrum – you’re a lot less likely to earn $0 and $10,000,000 is out of the question. And listen, a few million dollars sounds great, I agree, but you probably need something closer to $100,000 to be in a sustainable situation, so aim for that! Of course, there is risk. There always is. A few of my high school peers have passed away from things like freak aneurysms in the past few years. There was a canoeing incident while I was at university that killed a classmate, the usual assortment of cancers and car crashes, that sort of thing. Some number of them would have spent their last year doing tedious bullshit that they could have otherwise avoided. Yes, we all have to go through some level of that – my taxes are boring but it seems a small price to pay to spend 2026 out of jail, presuming I make it there – but it’s worth considering that if all your risk is identical to everyone else’s risk, acquired on autopilot, then we aren’t even minimizing risk so much as not making a decision . This may surprise you but every single person that has ultimately decided to hire us has had no idea who I am, or about this blog, and so on. At most, we’ve gotten introduced by blog readers and the rest of the conversation has panned out as if we were any other consultancy, and we’ve just started work on developing a sales pipeline that is completely unrelated to the blog. One of our team recently managed to close a deal that way while working full-time at a pretty intensive day job, in a state he recently moved to, with a long commute, a mortgage, and two children. This problem is very tractable . Once we had some of that sales pipeline working, I had a dreadful moment where I realized that I would struggle immensely to participate in a conventional job search again because the idea that you need permission to earn money is sort of ridiculous, but it’s the only model that most people in corporate environments have ever experienced. In one sense you do need permission to earn money if you aren’t stealing it – someone has to agree they need something from you. But the insane theatre, the middle managers, the CVs and cover letters and recruiters, it’s all so fucking silly once you’re outside of it. It turns out that sales do not have to be much harder than going “Ah, you’ve got a problem? I could take a look at that for you and come up with a plan to fix it up” and then someone wires you $10,000 if they think it’s plausible that you could solve the problem 4 . It’s really not that different to selling someone plumbing, except your margin is almost 100% in software, you don’t need a professional qualification or to leave your house, and in fact it’s pretty amazing across basically every dimension, save that some people have such insane ideas about software that it’s too late to save them. 5 This is very empowering! If you’re competent and put some amount of effort into demonstrating that competence, you too can be unshackled from the dreadful grasp of recruiters, with their too-white smiles, gray handshakes, and inability to respond to emails on time. If you’re able to generate even half of what you need to live on solo, then you’ve doubled the amount of time you’ll survive without conventional employment. The “Fuck you!” 6 money was inside you the whole time! Everyone has to learn how to get along with other people and make concessions, but with a little bit of autonomy it has turned out that this is possible to do in an entirely reasonable manner, where each concession and conversation makes sense and isn’t ridiculous theatre performed by people who don’t even have an interest in the job being carried out correctly. And yes, you’ll have to compete against liars sometimes, but you have beautiful asymmetry in your favour – you can actually build a track record of shipping, and they will siphon away the people that you wouldn’t want to work with anyway . If someone thinks they can slap an LLM into their company 7 and it’ll solve their problems, and you can’t explain to them why the current generation of models won’t work, you don’t want them as a customer. They will be disappointed with your frail mortal delivery, being unacceptably tethered to cruel reality, and we must unfortunately leave them in the Desert Of Not Shipping, where the buzzards will sup upon their desiccated flesh or, worse, put them on Azure. When I was back in university, I used to make a small amount of money teaching statistics to psychology students. It was only maybe three or four thousand dollars a year, and the marketing consisted entirely of one comment on a university Facebook page per year. It is not that much harder to sell competent IT consulting. I’ve been worried a bit about whether the change in context will make my writing disinteresting. I’m sure that for some people it will be gratifying to see that a philosophy of ethics, competence, and passion can win out (at least at a scale that makes a small difference and feeds a small team). But perhaps it’s also unrelatable, or dissipating anger through action saps something from my voice. I have serious, effective outlets for frustration, and if I can’t make a difference then I probably haven’t closed the deal and I will soon be very far away from the feeling of helplessness. When someone is so incompetent that talking to them begins to drive me to annoyance, then we probably aren’t going to close the sale and the problem solves itself from an emotional perspective. (Also when a team is very competent and doesn’t need consultants, we also don’t close the sale, and the problem – me – solves itself. The system works!) I have a lot more time to read, but the nature of what I read is pretty different. I can afford to read much more complicated texts, especially now that we’re pulling in enough revenue that I don’t need to churn through so much sales material, which is generally effective but not interesting to write about for an audience with taste 8 . Because my days are mostly free when we aren’t actively engaged, I can take a lot of time to synthesize them into our working practice. I’m currently thinking very heavily about how to take some of the conversations I’ve had with Iris Meredith on Clausewitz and applying some of the thoughts from her to the design of software engineering delivery cadence. I would have never had that time months ago. I’m a lot more empathetic when people struggle with their jobs or are even totally driven by their egos, but also now get asked to weigh in on things like redundancies and project cancellations, 9 so my opinions feel less heated but also have the capacity to incinerate things rather than produce disgruntled essays. I see lots of crazy stuff, but usually under the cover of NDAs, so I can’t really write about them unless the stories are of a very precise nature wherein they’re still fun after obfuscation and wouldn’t upset a client even in anonymized form. When I do see crazy stuff, I now wield the Righteous Power of Consultant Positioning and lay about my surroundings with great force and holy vengeance. I can channel the anger into real results, and it is probably not an exaggeration to say that the outcomes I’ve managed in the past month trounce most of my accomplishments in the first five years of my career. I got a security department to whitelist Python for every analyst in a Big Corp. Do you understand how powerful I’ve become? Do you see what I have wrought? And, most importantly, even though I am some sort of borderline management consultant now, as pointed out by our lawyer Iain McLaren who I have yet to forgive for this label, an employee at a large corporation high-fived me two weeks ago and said that working with the team is the most fun they’ve had in years. We may have our sales collapse, especially when the likely AI bubble pops and does God-knows-what to IT budgets around the world, but I simply can’t imagine going back to a company where I don’t have a good time. In any case, I hope that explaining what has been going on for the past few months and the context for any upcoming writing will make it easier to write in the future and maybe be a little bit more entertaining than the same old stuff as always. I’ve grown past many authors over the years, and I hope that a shift in tone is going to be something that keeps things interesting for readers instead of being an unwelcome development. In 2023, someone wrote this to me: I'm sorry, but this needs a privilege/gratitude check. You are guaranteed your salary, and you're welcome to take on the same level of risk your company is by starting your own. If you think it's so easy go ahead. I want to take a moment to say “It really was easy, you impotent fool! Look upon my works and perish , nerd! You could have written the same thing to hundreds of people on the internet and they would have backed down, but you somehow picked the one person that would remember your smug comment and build all of this out of spite. It’ll be yeeeeeears before you’re ready to face me! Ahahahahahahaha–” Mira Welner, the first person that we hired to do some contracting, has written about the experience of working with our team here . Choice quotes include “I can't tell you that Nikhil is the best of software engineer I've ever met” and “And he's not quite as insufferable as his blog makes him seem”.  ↩ Unfortunately because of the healthcare situation in the U.S., you are possibly all in some unique category that has to calculate the probability of a serious medical condition cropping up before taking the leap. It’s interesting to me that a culture that does actually have an immense entrepreneurial drive also has the single biggest impediment to starting ventures that I’ve seen in any country, even some genuinely “third-world” ones.  ↩ The funny thing is that, by all accounts, Seek has a really good internal culture, especially around software engineering. All that power aimed at filling roles that Seek themselves would never post in a thousand years.  ↩ The art of both being and appearing plausible are very deep, and we consult with David Kellam when we want to get better. I tease David somewhat about the energy of his public profile, but he has a mind like a thousand knives and knows his stuff.  ↩ That might also be true of plumbing, but I sure hope it isn’t.  ↩ A phrase that I think Taleb coined, which he defined as “Enough money to say ‘Fuck you’ before hanging up the phone.”  ↩ For readers in the distant future, in 2025 we had something called the “AI bubble” and it was really funny, and in the corporate sales context largely consisted of all the people that didn’t understand the words “crypto” and “quantum” getting together to say “AI”, or in an astonishing two separate cases , “A1 (A-One)”. Whether or not AI is a big thing in 2070 is more-or-less irrelevant to how dumb it is right now.  ↩ That’s you, reading several thousand word long essays instead of being on TikTok, xoxo, sorry that you'll be unable to communicate with anyone in twenty years.  ↩ I once told a reader that I’m very opposed to layoffs. They cited one of my old blog posts, Flexible Schemas Are The Mindkiller , and asked me if I could honestly say that it wouldn’t be morally and ethically correct to recommend the guy that leaked all the patient records in that story be fired. I sort of just vaguely mumbled and accepted that he had caught me in a devious trap of my own devising.  ↩ Mira Welner, the first person that we hired to do some contracting, has written about the experience of working with our team here . Choice quotes include “I can't tell you that Nikhil is the best of software engineer I've ever met” and “And he's not quite as insufferable as his blog makes him seem”.  ↩ Unfortunately because of the healthcare situation in the U.S., you are possibly all in some unique category that has to calculate the probability of a serious medical condition cropping up before taking the leap. It’s interesting to me that a culture that does actually have an immense entrepreneurial drive also has the single biggest impediment to starting ventures that I’ve seen in any country, even some genuinely “third-world” ones.  ↩ The funny thing is that, by all accounts, Seek has a really good internal culture, especially around software engineering. All that power aimed at filling roles that Seek themselves would never post in a thousand years.  ↩ The art of both being and appearing plausible are very deep, and we consult with David Kellam when we want to get better. I tease David somewhat about the energy of his public profile, but he has a mind like a thousand knives and knows his stuff.  ↩ That might also be true of plumbing, but I sure hope it isn’t.  ↩ A phrase that I think Taleb coined, which he defined as “Enough money to say ‘Fuck you’ before hanging up the phone.”  ↩ For readers in the distant future, in 2025 we had something called the “AI bubble” and it was really funny, and in the corporate sales context largely consisted of all the people that didn’t understand the words “crypto” and “quantum” getting together to say “AI”, or in an astonishing two separate cases , “A1 (A-One)”. Whether or not AI is a big thing in 2070 is more-or-less irrelevant to how dumb it is right now.  ↩ That’s you, reading several thousand word long essays instead of being on TikTok, xoxo, sorry that you'll be unable to communicate with anyone in twenty years.  ↩ I once told a reader that I’m very opposed to layoffs. They cited one of my old blog posts, Flexible Schemas Are The Mindkiller , and asked me if I could honestly say that it wouldn’t be morally and ethically correct to recommend the guy that leaked all the patient records in that story be fired. I sort of just vaguely mumbled and accepted that he had caught me in a devious trap of my own devising.  ↩

0 views

Are LLMs Plateauing? No. You Are.

“GPT-5 isn’t that impressive.” People claim the jump from GPT4o to GPT-5 feels incremental. They’re wrong. LLM intelligence hasn’t plateaued. Their perception of intelligence has . Let me explain with a simple example: translation from French to English. GPT-4o was already at 100% accuracy for this task. Near-perfect translations, proper idioms, cultural context. Just nailed it. Now try GPT-o1, o3, GPT-5, or whatever comes next. The result? Still 100% accurate. From your perspective, nothing changed. Zero improvement. The model looks identical. They have plateaued. But here’s the thing: most people’s tasks are dead simple. - “Do this math for me” - “Explain this concept” - “Translate this text” - “Rewrite that email” These tasks were already saturated by earlier models. They are testing intelligence on problems that have already been solved. Of course they don’t see progress. They are like someone measuring a rocket’s speed with a car speedometer. Once you hit the max reading, everything looks the same. Intelligence is multi-dimensional. It’s a spectrum of capabilities tested against increasingly difficult tasks. Think about how we measure human intelligence: - A 5-year-old doing addition → Smart kid - A PhD solving differential equations → Brilliant mathematician - A Fields Medalist proving novel theorems → Genius Same concept, wildly different difficulty levels. You wouldn’t judge the mathematician by giving them 2+2 . Yet that’s exactly what we’re doing with LLMs. We test them on tasks that earlier models already maxed out, then declare progress has stopped. Raw LLM intelligence is exploding. But it’s happening at the frontier. On tasks that push the absolute limits of reasoning. Take GPT-5-Pro. It demonstrated the ability to produce novel mathematical proofs . Not “solve this known problem.” Not “explain this proof.” Create new mathematics. Example: In an experiment by Sébastien Bubeck , GPT-5-Pro improved a bound in convex optimization from 1/L to 1.5/L. It reasoned for 17 minutes to generate a correct proof for an open problem. Read that again. An LLM improved a mathematical bound . It generated original research. This isn’t just solving known problems. The AI is creating new knowledge. We’re approaching a world where AI models will tackle the hardest unsolved problems in mathematics. The Millennium Prize Problems. P vs NP. The Riemann Hypothesis. Problems that have stumped humanity’s greatest minds for decades or centuries. This isn’t incremental. This is a model operating at the level of professional mathematicians. And this capability emerged in the latest generation. But if you’re only asking it to “explain gradient descent” or “fix my Python bug,” you’ll never see this intelligence. You’re testing a Formula 1 car in a parking lot . Current frontier models (GPT-5-Pro, Claude 4.5) can already outperform most humans on most intellectual tasks. Not “simple” tasks. Most tasks. - Legal analysis? Better than most lawyers. - Medical diagnosis? Better than most doctors - Code review? Better than most senior engineers. - Financial modeling? Better than most analysts. And they do it in seconds. No fatigue. No ego. No “I need to look that up.” (also no close to no compensation, lol!) Soon, these models will be smarter than most humans combined . The collective intelligence of humanity, accessible in a chat interface. But here’s what’s missing today: the ability to work over time with tools . Thanks for reading! A human doesn’t rely on raw brain power alone. You use tools: - Reading text to gather information - Writing to organize thoughts - Maintaining todo lists to track objectives - Asking for feedback to improve - Using calculators, spreadsheets, databases, software. Your brain isn’t that powerful in isolation. Your intelligence emerges from orchestrating tools toward a goal . LLMs sucked at this. They were brilliant in a single conversation but couldn’t persist, iterate, or coordinate across time. That’s changing. The breakthrough isn’t smarter models. It’s models that can orchestrate their intelligence over time . Software engineers experienced that firsthand with coding agents. GPT-5-Codex, an open-source coding agent, can read, edit, execute code autonomously. For instance, to refactor a 12,000-line legacy Python project, it will: - Address dependencies - Add test coverage - Fix three race conditions - Run for 7 hours in a sandboxed environment This isn’t “write me a function.” This is sustained, multi-step reasoning with tool use. Planning, executing, validating, iterating. The model maintained context, managed a todo list, ran tests, read errors, and adapted. Just like a human engineer would. That’s the leap. Not raw intelligence but applied intelligence . It will take over most valuable knowledge worker jobs . Here’s where it gets real: the AI Productivity Index (APEX) , the first benchmark for assessing whether frontier AI models can perform knowledge work with high economic value . APEX addresses a massive inefficiency in AI research: outside of coding, most benchmarks test toy problems that don’t reflect real work. APEX changes that. APEX-v1.0 contains 200 test cases across four domains: - Investment banking - Management consulting - Primary medical care How it was built: 1. Source experts with top-tier experience (e.g., Goldman Sachs investment bankers) 2. Experts create prompts reflecting high-value tasks from their day-to-day work 3. Experts create rubrics for evaluating model responses This isn’t “explain what a stock is.” It’s “analyze this M&A deal structure and identify regulatory risks in cross-border jurisdictions.” The results? Current models can already answer a significant portion of these questions. Not all, but enough to be economically valuable. Take stock research for instance. A model can read a 10-K filing and answer questions about it perfectly. At my company Fintool we saturated that benchmark in 2024. But now the challenge is for our AI to do investor’s job: - Monitor earnings calls across hundreds of companies - Extract precise financial metrics and projections - Generate comprehensive research reports - Compare performance across competitors - Track industry trends over time - Identify investment opportunities autonomously Same “intelligence,” radically different capability. The raw LLM power is enhanced with tools . When we tested Fintool-v4 against human equity analysts we found that our agent was 25x faster and 183x cheaper, with 90% accuracy on expert-level tasks. What Happens Next The plateau isn’t in the model. It’s in your benchmark. The next wave isn’t smarter models, it’s models that can actually do things. Even if raw intelligence plateaued tomorrow, expanding agentic capabilities alone would trigger massive economic growth . It’s about: - Models that can maintain todo lists and execute over weeks - Models that can read documentation, try solutions, fail, and iterate - Models that can coordinate with other models and humans - Models that can ask for help when stuck And when millions of these agents are deployed, the world changes. Not because the models got smarter. Because they got useful. Intelligence without application is just a party trick. Intelligence with tool use is the revolution. It’s accelerating. Exponentially. But the real action is happening at the edge. Thanks for reading! Subscribe for free to receive new posts and support my work.

0 views
Simon Willison 2 weeks ago

Claude Code for web - a new asynchronous coding agent from Anthropic

Anthropic launched Claude Code for web this morning. It's an asynchronous coding agent - their answer to OpenAI's Codex Cloud and Google's Jules , and has a very similar shape. I had preview access over the weekend and I've already seen some very promising results from it. It's available online at claude.ai/code and shows up as a tab in the Claude iPhone app as well: As far as I can tell it's their latest Claude Code CLI app wrapped in a container (Anthropic are getting really good at containers these days) and configured to . It appears to behave exactly the same as the CLI tool, and includes a neat "teleport" feature which can copy both the chat transcript and the edited files down to your local Claude Code CLI tool if you want to take over locally. It's very straight-forward to use. You point Claude Code for web at a GitHub repository, select an environment (fully locked down, restricted to an allow-list of domains or configured to access domains of your choosing, including "*" for everything) and kick it off with a prompt. While it's running you can send it additional prompts which are queued up and executed after it completes its current step. Once it's done it opens a branch on your repo with its work and can optionally open a pull request. Claude Code for web's PRs are indistinguishable from Claude Code CLI's, so Anthropic told me it was OK to submit those against public repos even during the private preview. Here are some examples from this weekend: That second example is the most interesting. I saw a tweet from Armin about his MiniJinja Rust template language adding support for Python 3.14 free threading. I hadn't realized that project had Python bindings, so I decided it would be interesting to see a quick performance comparison between MiniJinja and Jinja2. I ran Claude Code for web against a private repository with a completely open environment ( in the allow-list) and prompted: I’m interested in benchmarking the Python bindings for https://github.com/mitsuhiko/minijinja against the equivalente template using Python jinja2 Design and implement a benchmark for this. It should use the latest main checkout of minijinja and the latest stable release of jinja2. The benchmark should use the uv version of Python 3.14 and should test both the regular 3.14 and the 3.14t free threaded version - so four scenarios total The benchmark should run against a reasonably complicated example of a template, using template inheritance and loops and such like In the PR include a shell script to run the entire benchmark, plus benchmark implantation, plus markdown file describing the benchmark and the results in detail, plus some illustrative charts created using matplotlib I entered this into the Claude iPhone app on my mobile keyboard, hence the typos. It churned away for a few minutes and gave me exactly what I asked for. Here's one of the four charts it created: (I was surprised to see MiniJinja out-performed by Jinja2, but I guess Jinja2 has had a decade of clever performance optimizations and doesn't need to deal with any extra overhead of calling out to Rust.) Note that I would likely have got the exact same result running this prompt against Claude CLI on my laptop. The benefit of Claude Code for web is entirely in its convenience as a way of running these tasks in a hosted container managed by Anthropic, with a pleasant web and mobile UI layered over the top. It's interesting how Anthropic chose to announce this new feature: the product launch is buried half way down their new engineering blog post Beyond permission prompts: making Claude Code more secure and autonomous , which starts like this: Claude Code's new sandboxing features, a bash tool and Claude Code on the web, reduce permission prompts and increase user safety by enabling two boundaries: filesystem and network isolation. I'm very excited to hear that Claude Code CLI is taking sandboxing more seriously. I've not yet dug into the details of that - it looks like it's using seatbelt on macOS and Bubblewrap on Linux. Anthropic released a new open source (Apache 2) library, anthropic-experimental/sandbox-runtime , with their implementation of this so far. Filesystem sandboxing is relatively easy. The harder problem is network isolation, which they describe like this: Network isolation , by only allowing internet access through a unix domain socket connected to a proxy server running outside the sandbox. This proxy server enforces restrictions on the domains that a process can connect to, and handles user confirmation for newly requested domains. And if you’d like further-increased security, we also support customizing this proxy to enforce arbitrary rules on outgoing traffic. This is crucial to protecting against both prompt injection and lethal trifecta attacks. The best way to prevent lethal trifecta attacks is to cut off one of the three legs, and network isolation is how you remove the data exfiltration leg that allows successful attackers to steal your data. If you run Claude Code for web in "No network access" mode you have nothing to worry about. I'm a little bit nervous about their "Trusted network access" environment. It's intended to only allow access to domains relating to dependency installation, but the default domain list has dozens of entries which makes me nervous about unintended exfiltration vectors sneaking through. You can also configure a custom environment with your own allow-list. I have one called "Everything" which allow-lists "*", because for projects like my MiniJinja/Jinja2 comparison above there are no secrets or source code involved that need protecting. I see Anthropic's focus on sandboxes as an acknowledgment that coding agents run in YOLO mode ( and the like) are enormously more valuable and productive than agents where you have to approve their every step. The challenge is making it convenient and easy to run them safely. This kind of sandboxing kind is the only approach to safety that feels credible to me. Update : A note on cost: I'm currently using a Claude "Max" plan that Anthropic gave me in order to test some of their features, so I don't have a good feeling for how Claude Code would cost for these kinds of projects. From running (an unofficial cost estimate tool ) it looks like I'm using between $1 and $5 worth of daily Claude CLI invocations at the moment. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Add query-string-stripper.html tool against my simonw/tools repo - a very simple task that creates (and deployed via GitHub Pages) this query-string-stripper tool. minijinja vs jinja2 Performance Benchmark - I ran this against a private repo and then copied the results here, so no PR. Here's the prompt I used. Update deepseek-ocr README to reflect successful project completion - I noticed that the README produced by Claude Code CLI for this project was misleadingly out of date, so I had Claude Code for web fix the problem.

0 views
Preah's Website 2 weeks ago

Exploring IRC (Internet Relay Chat)

My history with IRC is spotty. I've explored it a couple different times, but I always struggled with 1. understanding it, 2. setting it up, and 3. finding channels that I actually enjoy. I thought I would give it another go recently. I saw a mention of a modern client called Halloy on HackerNews the other day, and it has a beautiful interface. I'm not a huge fan of how the windows open and arrange, as it becomes very confusing very fast, but I enjoy the appearance and theme selections. Overall, it's very functional. IRC, or Internet Relay Chat, is one of the oldest forms of online communication, created back in 1988, long before social media or modern messaging apps existed. Think of it as a giant network of chat rooms (called channels) where people can talk in real time about shared interests, ask questions, or just hang out. Each channel usually focuses on a specific topic. This is anything from technology and gaming to books or music, and you can join any that interest you. Although it might look simple compared to apps like Discord or Slack, IRC remains popular among certain communities for a few key reasons. It’s fast, lightweight, and distraction-free: there are no ads, algorithms, or constant notifications (unless you want to be notified of new messages by your client). You connect, chat, and leave when you want. Many developers, hobbyists, and open-source communities especially still use IRC because it works everywhere, even on very old or low-power devices, and doesn’t rely on any one company’s servers or apps. It's decentralized . If you know me, you know I love decentralization. IRC works using a simple client–server model. When you connect to IRC, you use a client , which is a piece of software on your computer or phone, to join an IRC server. That server is part of a larger network made up of many connected servers that share messages between each other. When you send a message in a channel (a public chat room that usually starts with a “#”, like or ), your client sends it to the server you’re connected to. The server then relays that message to all other servers on the same network, which deliver it to everyone else currently in that channel. Each person on IRC has a nickname , or colloquially referred to as a nick , and messages can be sent either to an entire channel or directly to another user in private (DM or direct message). Communication happens entirely in plain text, and commands, like joining a channel, changing your nickname, or setting up your status, are typed manually, usually starting with a slash (for example, , like Minecraft commands lol). Because the system is decentralized, there’s no single company controlling IRC. Anyone can set up their own server or network, and clients simply connect using the IRC protocol, which runs over standard internet ports (usually 6667 or 6697 for encrypted connections). This design makes IRC lightweight, flexible, and still functional decades after it was first created. Check out the Basics of IRC from Libera Chat . Libera Chat is a Swedish nonprofit organisation... Libera Chat’s purpose is to provide services such as a community platform for free open-source software and peer directed projects on a volunteer basis. -- About Libera Chat page So, how do I know what network and channels to join? Uh, I don't know. Well, I do know, but I haven't found anything super intriguing yet. You kind of have to find your niche, and a lot of channels have very few people or not much conversation going on. I think your best bet is Find a channel that is somewhat active and start talking, asking questions, and answering questions. If you're in , ask for thoughts on something you're coding, or help a new programmer if you're a Python expert. I don't use my real name on IRC channels, so that makes it a bit easier to explore chatting publicly with strangers, not to mention keeping my identity a little safer. Also, Find or create an invite-only channel/network. This leads to more tight-knit and active communities sometimes, like a Discord server. Starting your own IRC network is a bit more difficult, especially considering security hardening and uptime, but most people can make a channel on an existing network such as Libera Chat without too much issue. And maybe you're a self-hosting wizard who wants to tinker with a network, then do it!! It's cool! It also gives you full control over your chats, in the same way self-hosting anything does. If you just want to make a channel, you can very easily do so on a network like Libera Chat. They have a guide to check out at their Creating Channels page. To stay safe and secure while using IRC, it’s important to treat it much like any other public online space. First, always connect using SSL/TLS encryption (usually by using port 6697) to protect your messages from being intercepted. Choose a strong, unique nickname and avoid sharing personal information like your real name, location, or email address in public channels or profiles. Because IRC is open and often anonymous, anyone can join a channel, so it’s best to assume everything you say is public. Many networks let you register your nickname with a service like NickServ , which helps prevent others from impersonating you. If someone is bothering or harassing you, you can use the command to block them or contact a network operator for help. Finally, use a trusted client (software used to connect to IRC) and avoid clicking suspicious links shared in chat, since IRC usually has no built-in spam or malware protection. Here is what I used to find the most active channels and networks. Libera.Chat channels sorted by number of users and Top 100 IRC networks . You can use the in whatever network you're already in to see all (discoverable) channels. In general, check out this list of awesome IRC sources . It has client recommendations, both hosted and self-hosted, a collection of networks and links for other ways to find channels and networks, frameworks for bots and managing your own channel or network, and more. I would say clients are even more intimidating than trying to find channels and networks. Usability, appearance, features... so much to balance. I would still check awesome-irc for client research, but also Libera Chat's little guide about choosing a client. I tried Halloy briefly and enjoyed it, and frankly haven't tried too many clients but I went with TheLounge. It's self-hosted and has modern features like Push notifications, link previews, and file uploads. Always connected to your servers while you're offline, removing the need for bouncers and allowing you to reconnect from any device. Free and open-source under the MIT license. Works wherever Node.js runs. I have it running on my Proxmox-running home server in an LXC container I spun up in like 30 minutes. Multi-user support, so you can share it with friends without intersecting chats and server connections. Theme selection. There are custom, user-made themes, and two generic "light" and "dark" themes you can choose. If you enjoy IRC for the retro feel, then you can still choose a retro-looking theme for TheLounge if you want. I enjoy visual customization quite a bit. It looks nice, it feels nice to use, it's organized, and I really like always being connected. Hey, check out this game I found on Rizon.Net. It's an "idle RPG", where the aim is to always idle. This means no chatting, try not to run commands, anything. This is the only way to level up. Then, random little events can happen. You can choose your character name, class, and alignment, which affects your gameplay. It's a really fun start if you're nervous about actually chatting but want to get into IRC a bit. The Idle RPG is just what it sounds like: an RPG in which the players idle. In addition to merely gaining levels, players can find items and battle other players. However, this is all done for you; you just idle. There are no set classes; you can name your character anything you like, and have its class be anything you like, as well. -- #rizonirpg Idle RPG: Game Info Let's say you already have a client. I'm using mine as an example. To join something like Idle RPG, it's simple. As you can see, my character Kagrenak, who is a Sorcerer, is struggling a bit. Oh well. I was going to write a guide on setting up TheLounge as self-hosted but the docs really have everything you need. TheLounge docs is basically what I used. Mine is a Debian-based LXC container with 2 CPU cores, 2GB of RAM, and 4GB of storage. You can also set up a VPS with a reverse proxy, or use an old computer laying around, whatever you want to containerize and run it. You can actually use Docker too I believe. I personally used Cloudflare tunnels to expose it safely. To close, have fun, try out cool platforms, and please let me know if you have a cool channel or would be interested in an invite-only one to hang out! If you have trouble or questions with setting up, feel free to email me as well. Subscribe via email or RSS Find a channel that is somewhat active and start talking, asking questions, and answering questions. If you're in , ask for thoughts on something you're coding, or help a new programmer if you're a Python expert. I don't use my real name on IRC channels, so that makes it a bit easier to explore chatting publicly with strangers, not to mention keeping my identity a little safer. Also, Find or create an invite-only channel/network. This leads to more tight-knit and active communities sometimes, like a Discord server. Starting your own IRC network is a bit more difficult, especially considering security hardening and uptime, but most people can make a channel on an existing network such as Libera Chat without too much issue. Libera.Chat channels sorted by number of users and Top 100 IRC networks . You can use the in whatever network you're already in to see all (discoverable) channels. Push notifications, link previews, and file uploads. Always connected to your servers while you're offline, removing the need for bouncers and allowing you to reconnect from any device. Free and open-source under the MIT license. Works wherever Node.js runs. I have it running on my Proxmox-running home server in an LXC container I spun up in like 30 minutes. Multi-user support, so you can share it with friends without intersecting chats and server connections. Theme selection. There are custom, user-made themes, and two generic "light" and "dark" themes you can choose. Connect to a network. TheLounge has these nice "+" signs you can just click to add, but it will vary on your client. In the image above, I added the channel to connect to automatically. However, you can add other channels using either a UI button like a "+" sign or usually just running a command like or whatever the channel name is.

0 views
Anton Zhiyanov 2 weeks ago

Go proposal: Compare IP subnets

Part of the Accepted! series, explaining the upcoming Go changes in simple terms. Compare IP address prefixes the same way IANA does. Ver. 1.26 • Stdlib • Low impact An IP address prefix represents a IP subnet. These prefixes are usually written in CIDR notation: In Go, an IP prefix is represented by the type. The new method lets you compare two IP prefixes, making it easy to sort them without having to write your own comparison code. The imposed order matches both Python's implementation and the assumed order from IANA. When the Go team initially designed the IP subnet type ( ), they chose not to add a method because there wasn't a widely accepted way to order these values. Because of this, if a developer needs to sort IP subnets — for example, to organize routing tables or run tests — they have to write their own comparison logic. This results in repetitive and error-prone code. The proposal aims to provide a standard way to compare IP prefixes. This should reduce boilerplate code and help programs sort IP subnets consistently. Add the method to the type: orders two prefixes as follows: This follows the same order as Python's and the standard IANA convention . Sort a list of IP prefixes: 𝗣 61642 • 𝗖𝗟 700355 First by validity (invalid before valid). Then by address family (IPv4 before IPv6). Then by masked IP address (network IP). Then by prefix length. Then by unmasked address (original IP).

1 views
blog.philz.dev 2 weeks ago

Build Artifacts

This is a quick story about a thing I miss, that doesn't seem to have a default solution in our industry: a build artifact store. In a previous world, we had one. You could query it for a "global build number" and it would assign you a build number (and a writable to you S3 bucket). You could then produce a build, and store it back into the build-database, with both immutable metadata (what it was, when it was built, from what commits, etc.) and mutable metadata (tags). You could then query the build database for the build that matches criteria. Perhaps you want the latest build of Elephant that ran on Slackware and passed the nightly tests? This could both be used to cobble together tiers of QA and as a build artifact cache. This was a super simple service, cobbled together in a few files of Python, and it held up to our needs quite well. What do you use? Surely Git LFS or Artifactory aren't the end states here.

0 views

Why UUIDs won't protect your secrets

This post is part of a collection on UUIDs . Indirect Object Reference (IDOR) occurs when a resource can be accessed directly by its ID even when the user does not have proper authorization to access it. IDOR is a common mistake when using a separate service for storing files, such as a publicly readable Amazon S3 bucket. The web application may perform access control checks correctly, but the storage service does not. Here’s vulnerable Django code which allows a user to view their latest billing statement: While Django ensures the user is logged in and only provides them with bills they own, S3 has no concept of Django users, and performs no such authorization checks. A simple attack would start from a known URL and increment the bill ID: The attacker can keep trying bill IDs, potentially accessing the entire collection of bills. What if we changed the Django model to use UUIDs for the primary key instead of an auto-increment? The new URLs will look like: my-bucket.us-east-1.s3.amazonaws.com/bill-9c742b6a-3401-4f3d-bee7-6f5086c6811f. UUIDs aren’t guessable, so the attacker can’t just “add one” to the URL to access other user’s files, right? Unfortunately, this is only a partial fix. Even when URLs are unguessable, that doesn’t mean an attacker can’t learn them. A classic example starts with a former employee who used their personal computer for work. Hopefully their user account was quickly disabled, blocking them from accessing the company’s web application. But sensitive URLs may still exist in their browser history. Even a non-technical attacker can pull off this attack, just by clicking through their browser history. Thankfully, many companies require employees to use company-issued devices when performing work, so this attack may be limited to former employees who violated that rule. The accidental leaking of URLs is probably a more reasonable concern. For example, if only managers are authorized to view bills you need to be careful not to leak the bill ID in other views where other employees have access. If you use secret UUIDs, think of them as toxic assets. They taint anything they touch. If they end up in logs, then logs must be kept secret. If they end up in URLs, then browser history must be kept secret. This is no small challenge. Another concern for leaked UUIDs is rotation. Whenever a secret key is compromised, leaked, or known to have been stored improperly, it should be changed. The same holds true for secret URLs. Make sure you have a way to rotate secret URLs, otherwise you may end up stuck in a compromised state. Again, no small challenge. If this sounds like a huge pain… it is. Let’s find a better solution. The best approach is to ensure every request for sensitive data is authorized. One fix is to route file access through the web application. Continuing our example, the user would access /api/bill/100 and the file would be streamed from the storage through the web app to the user’s browser. If the user tries to access /api/bill/101, where they lack authorization, the web application can deny the request. Make sure the storage bucket is private, such that access must route via the web app. This approach is a good quick fix, but there are other approaches to consider. If your storage provider is Amazon S3 you should consider pre-signed URLs . These URLs allow the browser to download the file directly from S3, without streaming through the web app. The URL contains a cryptographic signature with a short expiration date. These URLs are still sensitive, but the short expiration mitigates a number of concerns. Again, make sure the storage bucket is private. A key benefit of the pre-signed URL approach is that it offloads file access from your web application, reducing load on the application server. Let’s consider a well-known application that doesn’t follow this advice. YouTube, a popular video hosting service, allows uploaders to mark videos as “unlisted”. This is a compromise between public and private. The owner of the video can copy their video’s URL and share it out-of-band, like in a private chat room. This way, people in the private chat room can view the video, but the owner doesn’t need to grant them access one-at-a-time and the viewers don’t need to log in. In essence, anyone who knows the URL is considered authorized from YouTube’s perspective. YouTube visibility selection This approach uses unguessable URLs, which contain a random video ID, like . This appears to be 11 random alphanumeric characters, which offer around 64 bits of entropy. This is suitably unguessable, but the security is questionable. Once the URL is shared with others, the owner loses the ability to assert access control over the video. An authorized viewer can choose to share the URL with others. Users may expect that the video has proper access control restrictions and share the URL in a public-facing document, not realizing that leaking the URL leaks the video. Consider unlistedvideos.com, an index of unlisted YouTube videos. Users who discover unlisted videos can upload those URLs to the site, thus leaking the content to a broad audience. The large number of videos listed on the site shows the poor access control properties afforded by this access control method. If your unlisted content leaks to unauthorized viewers, you can regain control by marking the video as private. This prevents anyone from accessing the video, until you grant their account access. Of course, you probably chose to make the video unlisted to avoid needing to manage individual account access. You could also try re-uploading the video, marking it as unlisted, and sharing the new link, but the risk of a subsequent leak remains. Another example of this design appears later in this blog post, AWS billing estimates. AWS appears to use 160 bits of entropy to protect these URLs. Here’s the verbiage AWS uses when you create a share link. AWS billing share dialog Interestingly, I’m not seeing a way to delete a billing estimate once shared. The creator appears to lose all ability to manage access once the link is shared outside their sphere of control. Be very careful not to put sensitive data in your billing estimates. Unlisted content is an example of IDOR as an intentional security design. The uploader is expected to decide if unlisted offers the right security posture for their content. There are use cases where the effort needed to individually grant users access outweighs the risk of using unlisted. Not everyone is dealing in highly sensitive content, after all. OK, maybe you want to create something like YouTube unlisted content, despite these concerns. In that case, we should ignore security concerns related to “leaked URLs” as that is “by design”. Unlisted URLs are sort of like bearer tokens or API tokens which grant access to a single resource. Let’s focus on attacks that guess URLs and consider how guessable UUIDs actually are. UUIDv4 contains 122 random bits, much more than the 64 bits of a YouTube video ID, so there’s little to contest about UUIDv4 guessability. But what about newer formats like UUIDv7? UUIDv7 embeds a timestamp at the start such that the IDs generally increase over time. There’s some claimed benefits, such as improved write performance for certain types of databases. Unfortunately, the timestamp makes UUIDv7s easier to guess. The attacker needs to figure out the timestamp and then brute-force the random bits. Learning the timestamp may not be that difficult: users sometimes have access to metadata for resources they don’t have full permission to access. In our “latest bill” example, the bills are probably generated by a batch job kicked off by cron. As such, the bills are likely created one after another in a narrow time period. This is especially true if the attacker has the UUID of their own bill as a reference. An attacker may be able to guess a small window around when the target object’s UUID was created. Other UUID generation methods recommend creating UUIDs in large batches and then assigning them to resources, in order, as resources are created. With this approach, the UUID timestamp is loosely correlated with the resource creation timestamp, but doesn’t contain a high precision timestamp for the resource creation. This mitigates some classes of information leakage related to timestamps. Unfortunately, it also bunches UUIDs together very tightly, such that many IDs will share the exact same timestamp. Learning one UUID leaks the timestamp of the entire batch. At first glance, the random bits seem to save us. There are still 74 random bits in a UUIDv7; still more than a YouTube video ID. That’s 2 74 possible random suffixes (18,889,465,931,478,580,854,784). Well beyond what an attacker can reasonably brute-force over the Internet. I would end the blog post here, but UUIDv7 offers additional optional methods which we need to consider. The spec allows monotonic counters to be used when multiple UUIDs are created within the same timestamp. This ensures that IDs created by a single node are monotonically increasing, even within a single millisecond. The first UUID in a given timestamp uses a randomized counter value. Subsequent IDs in the same millisecond increment that counter by one. When the counter method is used, an attacker who learns one UUIDv7 can predict the counters of neighboring IDs by adding or subtracting one. A random suffix still exists, and that would still need to be brute-forced. Of note for Django users, Python 3.14 introduced UUIDv7 in the standard library. Python uses a 42-bit counter , which is the maximum width the spec allows. That means Python’s UUIDv7 only has 32 random bits, offering only 2 32 possible random suffixes (4,294,967,296). Four billion seems like a big number, but is it large enough? On average, this is 1,657 request per second averaged over a month. Is that possible? S3 claims it will automatically scale to “at least 5,500 GET requests per second”. On the attacker side, HTTP load testing tools easily scale this high. k6, a popular load testing too, suggests using a single machine unless you need to exceed 100,000 request per second. The attack fits within the systems limits and appears feasible. Adding a rate limiter would force the attacker to distribute their attack, increasing attacker cost and complexity. Cloud providers like Amazon S3 don’t offer rate limiting controls so you’ll need to consider a WAF. This changes the user-facing URL, so adding a WAF may break old URLs. There’s cost asymmetry here too. An attacker who guesses 2 32 S3 URLs will cost your service at least $1,700 on your AWS bill . If you don’t have monitoring set up, you may not realize you’re under attack until you get an expensive bill. The attackers cost could be as low as a single machine. I’m uneasy about the security here, as the attack appears technically feasible. But the attack doesn’t seem very attractive to an attacker, as they may not be able to target a specific resource. An application that had juicy enough content to be worth attacking in this way would probably worry about “URLs leaking”. In that case, unlisted URLs are a poor fit for the product and the fixes listed earlier should be used. Which renders the entire point moot as you should never end up here. But it’s not an entirely theoretical concern. If you search on GitHub, you can find examples of applications that use UUIDv7 IDs and the “public-read” ACL. The sensitivity of the data they store and the exact UUIDv7 implementation they use varies. Nevertheless, 32 random bits is too small to be considered unguessable, especially for a cloud service like S3 which lacks rate-limit controls. A common theme of UUIDv7 adoption is to avoid exposing the IDs publicly. One concern driving this trend relates to IDs leaking timing information, which can be sensitive in certain situations. A simple approach uses a random ID, perhaps UUIDv4, as the external ID and UUIDv7 as the database primary key. This can be done using a separate database column and index for the external ID. Another intriguing approach is UUIDv47 which uses SipHash to securely hash the UUIDv7 into a UUIDv4-like ID. SipHash requires a secret key to operate, so you’ll need to manage that key. Unfortunately, rotating the key will invalidate old IDs, which would break external integrations like old URLs. This may prevent systems from changing keys after a key compromise. Caveat emptor. Either of these approaches could help in our “unlisted URLs with UUIDv7” example. Postgres currently uses the “replace leftmost random bits with increased clock precision” method when generating UUIDv7 IDs. Postgres converts 12 of the random bits into extra timestamp bits. This means Postgres UUIDv7 timestamps have nanosecond granularity instead of millisecond. As such, Postgres UUIDv7s have 62 random bits, in the current implementation. So when it comes to UUIDv7 guessability, it really depends on what optional methods the implementation chooses. Be careful when adopting newer UUID versions as the properties and trade-offs are distinct from earlier versions. The authors of UUIDv7 knew about these guessability concerns and discuss them in RFC 9562. The spec offers a “monotonic random” counter method, which increments the counter by a random amount instead of one. While their solution would help mitigate this attack, I wasn’t able to find an implementation that actually uses it. RFC 9562: Universally Unique IDentifiers (UUIDs) (2024) Python uuid.uuid7 100,000,000 S3 requests per day k6 load generator Postgres UUIDv7 generator

0 views
Nelson Figueroa 3 weeks ago

How to Actually Copy a List in Python

tl;dr: use the method. Say we have two Python lists – and . If we try to make a copy of and assign it to using the assignment operator , what really happens is that both and point to the same memory address. That means that any list-manipulating actions that are done on either or will affect the same list in memory. We don’t have actually have two separate lists we can act upon. In the example below, although we append the integer to , we can see that printing out shows the newly added element. That’s because both list variables point to the same memory address: Output of the program above: To make an actual copy, use the method. Then, when is modified, it is independent of , because is stored in a separate memory address. Now if we append the same integer to , will be completely unaffected. Output of the program above: Here’s more proof. We can print out the memory address of each variable to see when they’re the same and when they differ. We can do this using the function. Here are the same lists from above but this time with their unique identifiers printed out. In this case, the IDs match because both and point to the same memory address. The program above outputs: The memory addresses are the same. Now let’s try the same thing but using the method instead of just an assignment operation with : The program above outputs: We can see the memory addresses are different (most obvious due to the ending digits). Although I’ve been in the field for some time, I still have my smooth brain moments. This is a reminder to myself (and whoever reads this) to remember the basics! https://www.geeksforgeeks.org/python/python-list-copy-method/ https://www.geeksforgeeks.org/python/id-function-python/

0 views
Sean Goedecke 3 weeks ago

We are in the "gentleman scientist" era of AI research

Many scientific discoveries used to be made by amateurs. William Herschel , who discovered Uranus, was a composer and an organist. Antoine Lavoisier , who laid the foundation for modern chemistry, was a politician. In one sense, this is a truism. The job of “professional scientist” only really appeared in the 19th century, so all discoveries before then logically had to have come from amateurs, since only amateur scientists existed. But it also reflects that any field of knowledge gets more complicated over time . In the early days of a scientific field, discoveries are simple: “air has weight”, “white light can be dispersed through a prism into different colors”, “the mass of a burnt object is identical to its original mass”, and so on. The way you come up with those discoveries is also simple: observing mercury in a tall glass tube, holding a prism up to a light source, weighing a sealed jar before and after incinerating it, and so on. The 2025 Nobel prize in physics was just awarded “for the discovery of macroscopic quantum mechanical tunnelling and energy quantisation in an electric circuit”. The press release gallantly tries to make this discovery understandable to the layman, but it’s clearly much more complicated than the examples I listed above. Even understanding the terms involved would take years of serious study. If you wanted to win the 2026 Nobel prize in physics, you have to be a physicist : not a musician who dabbles in physics, or a politician who has a physics hobby in your spare time. You have to be fully immersed in the world of physics 1 . AI research is not like this. We are very much in the “early days of science” category. At this point, a critical reader might have two questions. How can I say that when many AI papers look like this ? 2 Alternatively, how can I say that when the field of AI research has been around for decades, and is actively pursued by many serious professional scientists? First, because AI research discoveries are often simpler than they look . This dynamic is familiar to any software engineer who’s sat down and tried to read a paper or two: the fearsome-looking mathematics often contains an idea that would be trivial to express in five lines of code. It’s written this way because (a) researchers are more comfortable with mathematics, and so genuinely don’t find it intimidating, and (b) mathematics is the lingua franca of academic research, because researchers like to write to far-future readers for whom Python syntax may be as unfamiliar as COBOL is to us. Take group-relative policy optimization, or GRPO, introduced in a 2024 DeepSeek paper . This has been hugely influential for reinforcement learning (which in turn has been the driver behind much LLM capability improvement in the last year). Let me try and explain the general idea. When you’re training a model with reinforcement learning, you might naively reward success and punish failure (e.g. how close the model gets to the right answer in a math problem). The problem is that this signal breaks down on hard problems. You don’t know if the model is “doing well” without knowing how hard the math problem is, which is itself a difficult qualitative assessment. The previous state-of-the art was to train a “critic model” that makes this “is the model doing well” assessment for you. Of course, this brings a whole new set of problems: the critic model is hard to train and verify, costs much more compute to run inside the training loop, and so on. Enter GRPO. Instead of a critic model, you gauge how well the model is doing by letting it try the problem multiple times and computing how well it does on average . Then you reinforce the model attempts that were above average and punish the ones that were below average. This gives you good signal even on very hard prompts, and is much faster than using a critic model. The mathematics in the paper looks pretty fearsome, but the idea itself is surprisingly simple. You don’t need to be a professional AI researcher to have had it. In fact, GRPO is not necessarily that new of an idea. There is discussion of normalizing the “baseline” for RL as early as 1992 (section 8.3), and the idea of using the model’s own outputs to set that baseline was successfully demonstrated in 2016 . So what was really discovered in 2024? I don’t think it was just the idea of “averaging model outputs to determine a RL baseline”. I think it was that that idea works great on LLMs as well . As far as I can tell, this is a consistent pattern in AI research. Many of the big ideas are not brand new or even particularly complicated. They’re usually older ideas or simple tricks, applied to large language models for the first time. Why would that be the case? If deep learning wasn’t a good subject for the amateur scientist ten years ago, why would the advent of LLMs change that? Suppose someone discovered that a rubber-band-powered car - like the ones at science fair competitions - could output as much power as a real combustion engine, so long as you soaked the rubber bands in maple syrup beforehand. This would unsurprisingly produce a revolution in automotive (and many other) engineering fields. But I think it would also “reset” scientific progress back to something like the “gentleman scientist” days, where you could productively do it as a hobby. Of course, there’d be no shortage of real scientists doing real experiments on the new phenomenon. However, there’d also be about a million easy questions to answer. Does it work with all kinds of maple syrup? What if you soak it for longer? What if you mixed in some maple-syrup-like substances? You wouldn’t have to be a real scientist in a real lab to try your hand at some of those questions. After a decade or so, I’d expect those easy questions to have been answered, and for rubber-band engine research to look more like traditional science. But that still leaves a long window for the hobbyist or dilettante scientist to ply their trade. The success of LLMs is like the rubber-band engine. A simple idea that anyone can try 3 - train a large transformer model on a ton of human-written text - produces a surprising and transformative technology. As a consequence, many easy questions have become interesting and accessible subjects of scientific inquiry, alongside the normal hard and complex questions that professional researchers typically tackle. I was inspired to write this by two recent pieces of research: Anthropic’s “skills” product and the Recursive Language Models paper . Both of these present new and useful ideas, but they’re also so simple as to be almost a joke. “Skills” are just markdown files and scripts on-disk that explain to the agent how to perform a task. Recursive language models are just agents with direct code access to the entire prompt via a Python REPL. There, now you can go and implement your own skills or RLM inference code. I don’t want to undersell these ideas. It is a genuinely useful piece of research for Anthropic to say “hey, you don’t really need actual tools if the LLM has shell access, because it can just call whatever scripts you’ve defined for it on disk”. Giving the LLM direct access to its entire prompt via code is also (as far as I can tell) a novel idea, and one with a lot of potential. We need more research like this! Strong LLMs are so new, and are changing so fast, that their capabilities are genuinely unknown 4 . For instance, at the start of this year, it was unclear whether LLMs could be “real agents” (i.e. whether running with tools in a loop would be useful for more than just toy applications). Now, with Codex and Claude Code, I think it’s pretty clear that they can. Many of the things we learn about AI capabilities - like o3’s ability to geolocate photos - come from informal user experimentation. In other words, they come from the AI research equivalent of 17th century “gentleman science”. Incidentally, my own field - analytic philosophy - is very much the same way. Two hundred years ago, you could publish a paper with your thoughts on “what makes a good act good”. Today, in order to publish on the same topic, you have to deeply engage with those two hundred years of scholarship, putting the conversation out of reach of all but professional philosophers. It is unclear to me whether that is a good thing or not. Randomly chosen from recent AI papers on arXiv . I’m sure you could find a more aggressively-technical paper with a bit more effort, but it suffices for my point. Okay, not anyone can train a 400B param model. But if you’re willing to spend a few hundred dollars - far less than Lavoisier spent on his research - you can train a pretty capable language model on your own. In particular, I’d love to see more informal research on making LLMs better at coming up with new ideas. Gwern wrote about this in LLM Daydreaming , and I tried my hand at it in Why can’t language models come up with new ideas? . Incidentally, my own field - analytic philosophy - is very much the same way. Two hundred years ago, you could publish a paper with your thoughts on “what makes a good act good”. Today, in order to publish on the same topic, you have to deeply engage with those two hundred years of scholarship, putting the conversation out of reach of all but professional philosophers. It is unclear to me whether that is a good thing or not. ↩ Randomly chosen from recent AI papers on arXiv . I’m sure you could find a more aggressively-technical paper with a bit more effort, but it suffices for my point. ↩ Okay, not anyone can train a 400B param model. But if you’re willing to spend a few hundred dollars - far less than Lavoisier spent on his research - you can train a pretty capable language model on your own. ↩ In particular, I’d love to see more informal research on making LLMs better at coming up with new ideas. Gwern wrote about this in LLM Daydreaming , and I tried my hand at it in Why can’t language models come up with new ideas? . ↩

0 views

Automata All the Way Down

Fabs and EDA companies collaborate to provide the abstraction of synchronous digital logic to hardware designers. A hardware design comprises: A set of state elements (e.g., registers and on-chip memories), which retain values from one clock cycle to another A transfer function , which maps the values of all state elements at clock cycle N to new values of all state elements at clock cycle The transfer function cannot be too fancy. It can be large but cannot be defined with unbounded loops/recursion. The pragmatic reason for this restriction is that the function is implemented with physical gates on a chip, and each gate can only do one useful thing per clock cycle. You cannot loop the output of a circuit element back to itself without delaying the value by at least one clock cycle (via a state element). It feels to me like there is a deeper reason why this restriction must exist. Many people dabbled with synchronous digital logic in college. If you did, you probably designed a processor, which provides the stored program computer abstraction to software developers. And here comes the inception: you can think of a computer program as a transfer function. In this twisted mindset, the stored program computer abstraction enables software engineers to define transfer functions . For example, the following pseudo-assembly program: Can be thought of as the following transfer function: In the stored program computer abstraction, state elements are the architectural registers plus the contents of memory. As with synchronous digital logic, there are limits on what the transfer function can do. The switch statement can have many cases, but the body of each case block is defined by one instruction. Alternatively, you can define the transfer function at the basic block level (one case per basic block, many instructions inside of each case). Programming in assembly is a pain, so higher level languages were developed to make us less crazy. And here we go again, someone could write an interpreter for C. A user of this interpreter works at the C level of abstraction. Following along with our previous pattern, a C program comprises: A set of state elements (variables, both global and local) A transfer function For example, the following C function: Can be thought of with as the following transfer function: Think of and as intrinsics used to implement function calls. The key building blocks of the transfer function are state ments. It is easy to just store the term “statement” into your brain without thinking of where the term comes from. A state ment is a thing which can alter state . This transformation of an imperative program into a transfer function seems strange, but some PL folks do it all the time. In particular, the transfer function view is how small step operational semantics are defined. And of course this can keep going. One could write a Python interpreter in C, which allows development at a higher level of abstraction. But even at that level of abstraction, programs are defined in terms of state elements (variables) and a transfer function (statements). The term Turing Tax was originally meant to describe the performance loss associated with working at the stored-program computer level of abstraction instead of the synchronous digital logic level of abstraction. This idea can be generalized. At a particular level of abstraction, code defines the transfer function while data is held in the state elements. A particular set of bits can simultaneously be described as code at one level of abstraction, while defined as data at a lower level. This code/data duality is intimately related to the Turing Tax. The Turing Tax collector is constantly looking for bags of bits which can be interpreted as either code or data, and he collects his tax each time he finds such a situation. An analogous circumstance arises in hardware design. Some signals can be viewed as either part of the data path or the control path, depending on what level of abstraction one is viewing the hardware from. A compiler is one trick to avoid the Turing Tax by translating code (i.e., a transfer function) from a higher level of abstraction to a lower level. We all felt awkward when I wrote “interpreter for C” earlier, and now we can feel better about it. JIT compilers for Python are one way to avoid the Turing Tax. Another example is an HLS compiler which avoids the Turing Tax between the stored-program computer abstraction layer and the synchronous digital logic layer. No, this section isn’t about your Fitbit. Let’s call each evaluation of a transfer function a step . These steps occur at each level of abstraction. Let’s define the ultimate performance goal that we care about to be the number of steps required to execute a computation at the synchronous digital logic level of abstraction. The trouble with these layers of abstraction is that typically a step at a higher layer of abstraction requires multiple steps at a lower layer. For example, the multi-cycle processor implementation you learned about in a Patterson and Hennessy textbook could require 5 clock cycles to execute each instruction (instruction fetch, register fetch, execute, memory, register write back). Interpreters have the same behavior: one Python statement may be implemented with many C statements. Now imagine the following house of cards: A Python interpreter which requires an average of 4 C statements to implement 1 Python statement A C compiler which requires an average of 3 machine instructions to implement 1 C statement A processor which requires an average of 5 clock cycles to execute 1 machine instruction When the Turing property tax assessor sees this house, they tax each level of the house . In this system, an average Python statement requires (4 x 3 x 5) 60 clock cycles! Much engineering work goes into avoiding this problem (pipelined and superscalar processors, multi-threading, JIT compilation, SIMD). Partial evaluation is another way to avoid the Turing Tax. Partial evaluation transforms data into code . There must be some other method of creating abstractions which is more efficient. Self-modifying code is rarely used in the real world (outside of JIT compilers). Self-modifying code seems crazy to reason about but potentially could offer large performance gains. Partial evaluation is also rarely used but has a large potential. Subscribe now A set of state elements (e.g., registers and on-chip memories), which retain values from one clock cycle to another A transfer function , which maps the values of all state elements at clock cycle N to new values of all state elements at clock cycle A set of state elements (variables, both global and local) A transfer function Code vs Data At a particular level of abstraction, code defines the transfer function while data is held in the state elements. A particular set of bits can simultaneously be described as code at one level of abstraction, while defined as data at a lower level. This code/data duality is intimately related to the Turing Tax. The Turing Tax collector is constantly looking for bags of bits which can be interpreted as either code or data, and he collects his tax each time he finds such a situation. An analogous circumstance arises in hardware design. Some signals can be viewed as either part of the data path or the control path, depending on what level of abstraction one is viewing the hardware from. Compilers vs Interpreters A compiler is one trick to avoid the Turing Tax by translating code (i.e., a transfer function) from a higher level of abstraction to a lower level. We all felt awkward when I wrote “interpreter for C” earlier, and now we can feel better about it. JIT compilers for Python are one way to avoid the Turing Tax. Another example is an HLS compiler which avoids the Turing Tax between the stored-program computer abstraction layer and the synchronous digital logic layer. Step Counting (Multiple Taxation) No, this section isn’t about your Fitbit. Let’s call each evaluation of a transfer function a step . These steps occur at each level of abstraction. Let’s define the ultimate performance goal that we care about to be the number of steps required to execute a computation at the synchronous digital logic level of abstraction. The trouble with these layers of abstraction is that typically a step at a higher layer of abstraction requires multiple steps at a lower layer. For example, the multi-cycle processor implementation you learned about in a Patterson and Hennessy textbook could require 5 clock cycles to execute each instruction (instruction fetch, register fetch, execute, memory, register write back). Interpreters have the same behavior: one Python statement may be implemented with many C statements. Now imagine the following house of cards: A Python interpreter which requires an average of 4 C statements to implement 1 Python statement A C compiler which requires an average of 3 machine instructions to implement 1 C statement A processor which requires an average of 5 clock cycles to execute 1 machine instruction

0 views
Armin Ronacher 3 weeks ago

Building an Agent That Leverages Throwaway Code

In August I wrote about my experiments with replacing MCP ( Model Context Protocol ) with code. In the time since I utilized that idea for exploring non-coding agents at Earendil . And I’m not alone! In the meantime, multiple people have explored this space and I felt it was worth sharing some updated findings. The general idea is pretty simple. Agents are very good at writing code, so why don’t we let them write throw-away code to solve problems that are not related to code at all? I want to show you how and what I’m doing to give you some ideas of what works and why this is much simpler than you might think. The first thing you have to realize is that Pyodide is secretly becoming a pretty big deal for a lot of agentic interactions. What is Pyodide? Pyodide is an open source project that makes a standard Python interpreter available via a WebAssembly runtime. What is neat about it is that it has an installer called micropip that allows it to install dependencies from PyPI. It also targets the emscripten runtime environment, which means there is a pretty good standard Unix setup around the interpreter that you can interact with. Getting Pyodide to run is shockingly simple if you have a Node environment. You can directly install it from npm. What makes this so cool is that you can also interact with the virtual file system, which allows you to create a persistent runtime environment that interacts with the outside world. You can also get hosted Pyodide at this point from a whole bunch of startups, but you can actually get this running on your own machine and infrastructure very easily if you want to. The way I found this to work best is if you banish Pyodide into a web worker. This allows you to interrupt it in case it runs into time limits. A big reason why Pyodide is such a powerful runtime, is because Python has an amazing ecosystem of well established libraries that the models know about. From manipulating PDFs or word documents, to creating images, it’s all there. Another vital ingredient to a code interpreter is having a file system. Not just any file system though. I like to set up a virtual file system that I intercept so that I can provide it with access to remote resources from specific file system locations. For instance, you can have a folder on the file system that exposes files which are just resources that come from your own backend API. If the agent then chooses to read from those files, you can from outside the sandbox make a safe HTTP request to bring that resource into play. The sandbox itself does not have network access, so it’s only the file system that gates access to resources. The reason the file system is so good is that agents just know so much about how they work, and you can provide safe access to resources through some external system outside of the sandbox. You can provide read-only access to some resources and write access to others, then access the created artifacts from the outside again. Now actually doing that is a tad tricky because the emscripten file system is sync, and most of the interesting things you can do are async. The option that I ended up going with is to move the fetch-like async logic into another web worker and use to block. If your entire Pyodide runtime is in a web worker, that’s not as bad as it looks. That said, I wish the emscripten file system API was changed to support stack swiching instead of this. While it’s now possible to hide async promises behind sync abstractions within Pyodide with call_sync , the same approach does not work for the emscripten JavaScript FS API. I have a full example of this at the end, but the simplified pseudocode that I ended up with looks like this: Lastly now that you have agents running, you really need durable execution. I would describe durable execution as the idea of being able to retry a complex workflow safely without losing progress. The reason for this is that agents can take a very long time, and if they interrupt, you want to bring them back to the state they were in. This has become a pretty hot topic. There are a lot of startups in that space and you can buy yourself a tool off the shelf if you want to. What is a little bit disappointing is that there is no truly simple durable execution system. By that I mean something that just runs on top of Postgres and/or Redis in the same way as, for instance, there is pgmq. The easiest way to shoehorn this yourself is to use queues to restart your tasks and to cache away the temporary steps from your execution. Basically, you compose your task from multiple steps and each of the steps just has a very simple cache key. It’s really just that simple: You can improve on this greatly, but this is the general idea. The state is basically the conversation log and whatever else you need to keep around for the tool execution (e.g., whatever was thrown on the file system). What tools does an agent need that are not code? Well, the code needs to be able to do something interesting so you need to give it access to something. The most interesting access you can provide is via the file system, as mentioned. But there are also other tools you might want to expose. What Cloudflare proposed is connecting to MCP servers and exposing their tools to the code interpreter. I think this is a quite interesting approach and to some degree it’s probably where you want to go. Some tools that I find interesting: : a tool that just lets the agent run more inference, mostly with files that the code interpreter generated. For instance if you have a zip file it’s quite fun to see the code interpreter use Python to unpack it. But if then that unpacked file is a jpg, you will need to go back to inference to understand it. : a tool that just … brings up help. Again, can be with inference for basic RAG, or similar. I found it quite interesting to let the AI ask it for help. For example, you want the manual tool to allow a query like “Which Python code should I write to create a chart for the given XLSX file?” On the other hand, you can also just stash away some instructions in .md files on the virtual file system and have the code interpreter read it. It’s all an option. If you want to see what this roughly looks like, I vibe-coded a simple version of this together. It uses a made-up example but it does show how a sandbox with very little tool availability can create surprising results: mitsuhiko/mini-agent . When you run it, it looks up the current IP from a special network drive that triggers an async fetch, and then it (usually) uses pillow or matplotlib to make an image of that IP address. Pretty pointless, but a lot of fun! 4he same approach has also been leveraged by Anthropic and Cloudflare. There is some further reading that might give you more ideas: : a tool that just lets the agent run more inference, mostly with files that the code interpreter generated. For instance if you have a zip file it’s quite fun to see the code interpreter use Python to unpack it. But if then that unpacked file is a jpg, you will need to go back to inference to understand it. : a tool that just … brings up help. Again, can be with inference for basic RAG, or similar. I found it quite interesting to let the AI ask it for help. For example, you want the manual tool to allow a query like “Which Python code should I write to create a chart for the given XLSX file?” On the other hand, you can also just stash away some instructions in .md files on the virtual file system and have the code interpreter read it. It’s all an option. Claude Skills is fully leveraging code generation for working with documents or other interesting things. Comes with a (non Open Source) repository of example skills that the LLM and code executor can use: anthropics/skills Cloudflare’s Code Mode which is the idea of creating TypeScript bindings for MCP tools and having the agent write code to use them in a sandbox.

0 views
Simon Willison 3 weeks ago

Claude Skills are awesome, maybe a bigger deal than MCP

Anthropic this morning introduced Claude Skills , a new pattern for making new abilities available to their models: Claude can now use Skills to improve how it performs specific tasks. Skills are folders that include instructions, scripts, and resources that Claude can load when needed. Claude will only access a skill when it's relevant to the task at hand. When used, skills make Claude better at specialized tasks like working with Excel or following your organization's brand guidelines. Their engineering blog has a more detailed explanation . There's also a new anthropic/skills GitHub repo. (I inadvertently preempted their announcement of this feature when I reverse engineered and wrote about it last Friday !) Skills are conceptually extremely simple: a skill is a Markdown file telling the model how to do something, optionally accompanied by extra documents and pre-written scripts that the model can run to help it accomplish the tasks described by the skill. Claude's new document creation abilities , which accompanied their new code interpreter feature in September, turned out to be entirely implemented using skills. Those are now available Anthropic's repo covering , , , and files. There's one extra detail that makes this a feature, not just a bunch of files in disk. At the start of a session Claude's various harnesses can scan all available skill files and read a short explanation for each one from the frontmatter YAML in the Markdown file. This is very token efficient: each skill only takes up a few dozen extra tokens, with the full details only loaded in should the user request a task that the skill can help solve. Here's that metadata for an example slack-gif-creator skill that Anthropic published this morning: Toolkit for creating animated GIFs optimized for Slack, with validators for size constraints and composable animation primitives. This skill applies when users request animated GIFs or emoji animations for Slack from descriptions like "make me a GIF for Slack of X doing Y". I just tried this skill out in the Claude mobile web app, against Sonnet 4.5. First I enabled the slack-gif-creator skill in the settings , then I prompted: And Claude made me this GIF . Click to play (it's almost epilepsy inducing, hence the click-to-play mechanism): OK, this particular GIF is terrible, but the great thing about skills is that they're very easy to iterate on to make them better. Here are some noteworthy snippets from the Python script it wrote , comments mine: This is pretty neat. Slack GIFs need to be a maximum of 2MB, so the skill includes a validation function which the model can use to check the file size. If it's too large the model can have another go at making it smaller. The skills mechanism is entirely dependent on the model having access to a filesystem, tools to navigate it and the ability to execute commands in that environment. This is a common pattern for LLM tooling these days - ChatGPT Code Interpreter was the first big example of this back in early 2023 , and the pattern later extended to local machines via coding agent tools such as Cursor, Claude Code, Codex CLI and Gemini CLI. This requirement is the biggest difference between skills and other previous attempts at expanding the abilities of LLMs, such as MCP and ChatGPT Plugins . It's a significant dependency, but it's somewhat bewildering how much new capability it unlocks. The fact that skills are so powerful and simple to create is yet another argument in favor of making safe coding environments available to LLMs. The word safe there is doing a lot of work though! We really need to figure out how best to sandbox these environments such that attacks such as prompt injections are limited to an acceptable amount of damage. Back in January I made some foolhardy predictions about AI/LLMs , including that "agents" would once again fail to happen: I think we are going to see a lot more froth about agents in 2025, but I expect the results will be a great disappointment to most of the people who are excited about this term. I expect a lot of money will be lost chasing after several different poorly defined dreams that share that name. I was entirely wrong about that. 2025 really has been the year of "agents", no matter which of the many conflicting definitions you decide to use (I eventually settled on " tools in a loop "). Claude Code is, with hindsight, poorly named. It's not purely a coding tool: it's a tool for general computer automation. Anything you can achieve by typing commands into a computer is something that can now be automated by Claude Code. It's best described as a general agent . Skills make this a whole lot more obvious and explicit. I find the potential applications of this trick somewhat dizzying. Just thinking about this with my data journalism hat on: imagine a folder full of skills that covers tasks like the following: Congratulations, you just built a "data journalism agent" that can discover and help publish stories against fresh drops of US census data. And you did it with a folder full of Markdown files and maybe a couple of example Python scripts. Model Context Protocol has attracted an enormous amount of buzz since its initial release back in November last year . I like to joke that one of the reasons it took off is that every company knew they needed an "AI strategy", and building (or announcing) an MCP implementation was an easy way to tick that box. Over time the limitations of MCP have started to emerge. The most significant is in terms of token usage: GitHub's official MCP on its own famously consumes tens of thousands of tokens of context, and once you've added a few more to that there's precious little space left for the LLM to actually do useful work. My own interest in MCPs has waned ever since I started taking coding agents seriously. Almost everything I might achieve with an MCP can be handled by a CLI tool instead. LLMs know how to call , which means you don't have to spend many tokens describing how to use them - the model can figure it out later when it needs to. Skills have exactly the same advantage, only now I don't even need to implement a new CLI tool. I can drop a Markdown file in describing how to do a task instead, adding extra scripts only if they'll help make things more reliable or efficient. One of the most exciting things about Skills is how easy they are to share. I expect many skills will be implemented as a single file - more sophisticated ones will be a folder with a few more. Anthropic have Agent Skills documentation and a Claude Skills Cookbook . I'm already thinking through ideas of skills I might build myself, like one on how to build Datasette plugins . Something else I love about the design of skills is there is nothing at all preventing them from being used with other models. You can grab a skills folder right now, point Codex CLI or Gemini CLI at it and say "read pdf/SKILL.md and then create me a PDF describing this project" and it will work, despite those tools and models having no baked in knowledge of the skills system. I expect we'll see a Cambrian explosion in Skills which will make this year's MCP rush look pedestrian by comparison. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Trying out the slack-gif-creator skill Skills depend on a coding environment Claude as a General Agent Skills compared to MCP Here come the Skills Where to get US census data from and how to understand its structure How to load data from different formats into SQLite or DuckDB using appropriate Python libraries How to publish data online, as Parquet files in S3 or pushed as tables to Datasette Cloud A skill defined by an experienced data reporter talking about how best to find the interesting stories in a new set of data A skill that describes how to build clean, readable data visualizations using D3

0 views