Posts in Go (20 found)
Stone Tools 2 days ago

Bank Street Writer on the Apple II

Stop me if you've heard this one. In 1978, a young man wandered into a Tandy Radio Shack and found himself transfixed by the TRS-80 systems on display. He bought one just to play around with, and it wound up transforming his life from there on. As it went with so many, so too did it go with lawyer Doug Carlston. His brother, Gary, initially unimpressed, warmed up to the machine during a long Maine winter. The two thus smitten mused, "Can we make money off of this?" Together they formed a developer-sales relationship, with Doug developing Galactic Saga and third brother Don developing Tank Command. Gary's sales acumen brought early success and Broderbund was officially underway.

Meanwhile in New York, Richard Ruopp, president of Bank Street College of Education, a kind of research center for experimental and progressive education, was thinking about how emerging technology fit into the college's mission. Writing was an important part of their curriculum, but according to Ruopp, "We tested the available word processors and found we couldn’t use any of them." So, experts from Bank Street College worked closely with consultant Franklin Smith and software development firm Intentional Educations Inc. to build a better word processor for kids.

The fruit of that labor, Bank Street Writer, was published by Scholastic exclusively to schools at first, with Broderbund taking up the home distribution market a little later. Bank Street Writer would dominate home software sales charts for years and its name would live on as one of the sacred texts, like Lemonade Stand or The Oregon Trail. Let's see what lessons there are to learn from it yet.

1916: Founded by Lucy Sprague Mitchell, Wesley Mitchell, and Harriet Johnson as the “Bureau of Educational Experiments” (BEE) with the goal of understanding in what environment children best learn and develop, and to help adults learn to cultivate that environment.
1930: BEE moves to 69 Bank Street. (Will move to 112th Street in 1971, for space reasons.)
1937: The Writer’s Lab, which connects writers and students, is formed.
1950: BEE is renamed to Bank Street College of Education.
1973: Minnesota Educational Computing Consortium (MECC) is founded. This group would later go on to produce The Oregon Trail.
1983: Bank Street Writer, developed by Intentional Educations Inc., published by Broderbund Software, and “thoroughly tested by the academics at Bank Street College of Education.” Price: $70.
1985: Writer is a success! Time to capitalize! Bank Street Speller $50, Bank Street Filer $50, Bank Street Mailer $50, Bank Street Music Writer $50, Bank Street Prewriter (published by Scholastic) $60.
1986: Bank Street Writer Plus $100. Bank Street Writer III (published by Scholastic) $90. It’s basically Plus with classroom-oriented additions, including a 20-column mode and additional teaching aids.
1987: Bank Street Storybook, $40.
1992: Bank Street Writer for the Macintosh (published by Scholastic) $130. Adds limited page layout options, Hypercard-style hypertext, clip art, punctuation checker, image import with text wrap, full color, sound support, “Classroom Publishing” of fliers and pamphlets, and electronic mail.

With word processors, I want to give them a chance to present their best possible experience. I do put a little time into trying the baseline experience many would have had with the software during the height of its popularity. "Does the software still have utility today?" can only be fairly answered by giving the software a fighting chance. To that end, I've gifted myself a top-of-the-line (virtual) Apple //e running the last update to Writer, the Plus edition.

You probably already know how to use Bank Street Writer Plus. You don't know you know, but you do know because you have familiarity with GUI menus and basic word processing skills.
All you're lacking is an understanding of the vagaries of data storage and retrieval as necessitated by the hardware of the time, but once armed with that knowledge you could start using this program without touching the manual again. It really is as easy as the makers claim.

The simplicity is driven by a very subtle, forward-thinking user interface. Of primary interest is the upper prompt area. The top 3 lines of the screen serve as an ever-present, contextual "here's the situation" helper. What's going on? What am I looking at? What options are available? How do I navigate this screen? How do I use this tool? Whatever you're doing, whatever menu option you've chosen, the prompt area is already displaying information about which actions are available right now in the current context. As the manual states, "When in doubt, look for instructions in the prompt area." The manual speaks truth.

For some, the constant on-screen prompting could be a touch overbearing, but I personally don't think it's so terrible to know that the program is paying attention to my actions and wants me to succeed. The assistance isn't front-loaded, like so many mobile apps, nor does it interrupt, like Clippy. I simply can't fault the good intentions, nor can I really think of anything in modern software that takes this approach to user-friendliness.

The remainder of the screen is devoted to your writing and works like any other word processor you've used. Just type, move the cursor with the arrow keys, and type some more. I think most writers will find it behaves "as expected." There are no Electric Pencil-style over-type surprises, nor VisiCalc-style arrow key manipulations. What seems to have happened is that in making a word processor that is easy for children to use, they accidentally made a word processor that is just plain easy.

The basic functionality is drop-dead simple to pick up by just poking around, but there's quite a bit more to learn here.
To do so, we have a few options for getting to know Bank Street Writer in more detail. There are two manuals by virtue of the program's educational roots. Bank Street Writer was published by both Broderbund (for the home market) and Scholastic (for schools). Each tailored their own manual to their respective demographic. Broderbund's manual is cleanly designed, easy to understand, and gets right to the point. It is not as "child focused" as reviews at the time might have you believe. Scholastic's is more of a curriculum to teach word processing, part of the 80s push for "computers in the classroom." It's packed with student activities, pages that can be copied and distributed, and (tellingly) information for the teacher explaining "What is a word processor?" Our other option for learning is on side 2 of the main program disk. Quite apart from the program proper, the disk contains an interactive tutorial. I love this commitment to the user's success, though I breezed through it in just a few minutes, being a cultured word processing pro of the 21st century. I am quite familiar with "menus" thank you very much. As I mentioned at the top, the screen is split into two areas: prompt and writing. The prompt area is fixed, and can neither be hidden nor turned off. This means there's no "full screen" option, for example. The writing area runs in high-res graphics mode so as to bless us with the gift of an 80-character wide display. Being a graphics display also means the developer could have put anything on screen, including a ruler which would have been a nice formatting helper. Alas. Bank Street offers limited preference settings; there's not much we can do to customize the program's display or functionality. The upshot is that as I gain confidence with the program, the program doesn't offer to match my ability. There is one notable trick, which I'll discuss later, but overall there is a missed opportunity here for adapting to a user's increasing skill. 
Kids do grow up, after all. As with Electric Pencil , I'm writing this entirely in Bank Street Writer . Unlike the keyboard/software troubles there, here in 128K Apple //e world I have Markdown luxuries like . The emulator's amber mode is soothing to the eyes and soul. Mouse control is turned on and works perfectly, though it's much easier and faster to navigate by keyboard, as God intended. This is an enjoyable writing experience. Which is not to say the program is without quirks. Perhaps the most unfortunate one is how little writing space 128K RAM buys for a document. At this point in the write-up I'm at about 1,500 words and BSW's memory check function reports I'm already at 40% of capacity. So the largest document one could keep resident in memory at one time would run about 4,000 words max? Put bluntly, that ain't a lot. Splitting documents into multiple files is pretty much forced upon anyone wanting to write anything of length. Given floppy disk fragility, especially with children handling them, perhaps that's not such a bad idea. However, from an editing point of view, it is frustrating to recall which document I need to load to review any given piece of text. Remember also, there's no copy/paste as we understand it today. Moving a block of text between documents is tricky, but possible. BSW can save a selected portion of text to its own file, which can then be "retrieved" (inserted) at the current cursor position in another file. In this way the diskette functions as a memory buffer for cross-document "copy/paste." Hey, at least there is some option available. Flipping through old magazines of the time, it's interesting just how often Bank Street Writer comes up as the comparative reference point for home word processors over the years. If a new program had even the slightest whiff of trying to be "easy to use" it was invariably compared to Bank Street Writer . 
Likewise, there were any number of writers and readers of those magazines talking about how they continued to use Bank Street Writer, even though so-called "better" options existed. I don't want to oversell its adoption by adults, but it most definitely was not a children-only word processor, by any stretch. I think the release of Plus embraced a more mature audience.

In schools it reigned supreme for years, including the Scholastic-branded version of Plus called Bank Street Writer III. There were add-on "packs" of teacher materials for use with it. There was also Bank Street Prewriter, a tool for helping to organize themes and thoughts before committing to the act of writing, including an outliner, as popularized by ThinkTank. (It's always interesting when influences ripple through the industry like this.)

Of course, the Scholastic approach was built around the idea of teachers having access to computers in the classroom. And THAT was built on the idea of teachers feeling comfortable enough with computers to seamlessly merge them into a lesson plan. Sure, the kids needed something simple to learn, but let's be honest, so did the adults.

There was a time when attaching a computer to anything meant a fundamental transformation of that thing was assured and imminent. For example, the "office of the future" (as discussed in the Superbase post) had a counterpart in the "classroom of tomorrow." In 1983, Popular Computing said, "Schools are in the grip of a computer mania." Steve Jobs took advantage of this, skating to where the puck would be, by donating Apple 2s to California schools.

In October 1983, Creative Computing did a little math on that plan. $20M in retail donations brought $4M in tax credits against $5M in gross donations. Apple could donate a computer to every elementary, middle, and high school in California for an outlay of only $1M.
Jobs lobbied Congress hard to pass a national version of the same "Kids Can't Wait" bill, which would have extended federal tax credits for such donations. That never made it to law, for various political reasons. But the California initiative certainly helped position Apple as the go-to system for computers in education. By 1985, Apple would dominate fully half of the education market. That would continue into the Macintosh era, though Apple's dominance diminished slowly as cheaper, "good enough" alternatives entered the market. Today, Apple is #3 in the education market, behind Windows and Chromebooks . It is a fair question to ask, "How useful could a single donated computer be to a school?" Once it's in place, then what? Does it have function? Does anyone have a plan for it? Come to think of it, does anyone on staff even know how to use it? When Apple put a computer into (almost) every school in California, they did require training. Well, let's say lip-service was paid to the idea of the aspiration of training. One teacher from each school had to receive one day's worth of training to attain a certificate which allowed the school to receive the computer. That teacher was then tasked with training their coworkers. Wait, did I say "one day?" Sorry, I meant about one HOUR of training. It's not too hard to see where Larry Cuban was coming from when he published Oversold & Underused: Computers in the Classroom in 2001. Even of schools with more than a single system, he notes, "Why, then, does a school's high access (to computers) yield limited use? Nationally and in our case studies, teachers... mentioned that training in relevant software and applications was seldom offered... (Teachers) felt that the generic training available was often irrelevant to their specific and immediate needs." From my perspective, and I'm no historian, it seems to me there were four ways computers were introduced into the school setting. 
The three most obvious were: I personally attended schools of all three types. What I can say the schools had in common was how little attention, if any, was given to the computer and how little my teachers understood them. An impromptu poll of friends aligned with my own experience. Schools didn't integrate computers into classwork, except when classwork was explicitly about computers. I sincerely doubt my time playing Trillium's Shadowkeep during recess was anything close to Apple's vision of a "classroom of tomorrow." The fourth approach to computers into the classroom was significantly more ambitious. Apple tried an experiment in which five public school sites were chosen for a long-term research project. In 1986, the sites were given computers for every child in class and at home. They reasoned that for computers to truly make an impact on children, the computer couldn't just be a fun toy they occasionally interacted with. Rather, it required full integration into their lives. Now, it is darkly funny to me that having achieved this integration today through smartphones, adults work hard to remove computers from school. It is also interesting to me that Apple kind of led the way in making that happen, although in fairness they don't seem to consider the iPhone to be a computer . America wasn't alone in trying to give its children a technological leg up. In England, the BBC spearheaded a major drive to get computers into classrooms via a countrywide computer literacy program. Even in the States, I remember watching episodes of BBC's The Computer Programme on PBS. Regardless of Apple's or the BBC's efforts, the long-term data on the effectiveness of computers in the classroom has been mixed, at best, or even an outright failure. 
Apple's own assessment of their "Apple Classrooms of Tomorrow" (ACOT) program after a couple of years concluded, "Results showed that ACOT students maintained their performance levels on standard measures of educational achievement in basic skills, and they sustained positive attitudes as judged by measures addressing the traditional activities of schooling." Which is a "we continue to maintain the dream of selling more computers to schools" way of saying, "Nothing changed."

In 2001, the BBC reported, "England's schools are beginning to use computers more in teaching - but teachers are making 'slow progress' in learning about them." Then in 2015 the results were "disappointing": "Even where computers are used in the classroom, their impact on student performance is mixed at best."

Informatique pour tous, France 1985: Pedagogy, Industry and Politics by Clémence Cardon-Quint noted the French attempt at computers in the classroom as being, "an operation that can be considered both as a milestone and a failure."

Computers in the Classrooms of an Authoritarian Country: The Case of Soviet Latvia (1980s–1991) by Iveta Kestere and Katrina Elizabete Purina-Bieza shows the introduction of computers to have drawn stark power and social divides, while pushing prescribed gender roles of computers being "for boys."

Teachers Translating and Circumventing the Computer in Lower and Upper Secondary Swedish Schools in the 1970s and 1980s by Rosalía Guerrero Cantarell noted, "the role of teachers as agents of change was crucial. But teachers also acted as opponents, hindering the diffusion of computer use in schools."

Now, I should be clear that things were different in the higher education market, as with PLATO in the universities. But in the primary and secondary markets, Bank Street Writer's primary demographic, nobody really knew what to do with the machines once they had them.
The most straightforwardly damning assessment is from Oversold & Underused where Cuban says in the chapter "Are Computers in Schools Worth the Investment?", "Although promoters of new technologies often spout the rhetoric of fundamental change, few have pursued deep and comprehensive changes in the existing system of schooling." Throughout the book he notes how most teachers struggle to integrate computers into their lessons and teaching methodologies. The lack of guidance in developing new ways of teaching means computers will continue to be relegated to occasional auxiliary tools trotted out from time to time, not integral to the teaching process. "Should my conclusions and predictions be accurate, both champions and skeptics will be disappointed. They may conclude, as I have, that the investment of billions of dollars over the last decade has yet to produce worthy outcomes," he concludes. Thanks to my sweet four-drive virtual machine, I can summon both the dictionary and thesaurus immediately. Put the cursor at the start of a word and hit or to get an instant spot check of spelling or synonyms. Without the reality of actual floppy disk access speed, word searches are fast. Spelling can be performed on the full document, which does take noticeable time to finish. One thing I really love is how cancelling an action or moving forward on the next step of a process is responsive and immediate. If you're growing bored of an action taking too long, just cancel it with ; it will stop immediately . The program feels robust and unbreakable in that way. There is a word lookup, which accepts wildcards, for when you kinda-sorta know how to spell a word but need help. Attached to this function is an anagram checker which benefits greatly from a virtual CPU boost. But it can only do its trick on single words, not phrases. Earlier I mentioned how little the program offers a user who has gained confidence and skill. 
That's not entirely accurate, thanks to its most surprising super power: macros. Yes, you read that right. This word processor designed for children includes macros. They are stored at the application level, not the document level, so do keep that in mind. Twenty can be defined, each consisting of up to 32 keystrokes. Running keystrokes in a macro is functionally identical to typing by hand. Because the program can be driven 100% by keyboard alone, macros can trigger menu selections and step through tedious parts of those commands. For example, to save our document periodically we need to do the following every time: That looks like a job for to me. 0:00 / 0:23 1× Defining a macro to save, with overwrite, the current file. After it is defined, I execute it which happens very quickly in the emulator. Watch carefully. If you can perform an action through a series of discrete keyboard commands, you can make a macro from it. This is freeing, but also works to highlight what you cannot do with the program. For example, there is no concept of an active selection, so a word is the smallest unit you can directly manipulate due to keyboard control limitations. It's not nothin' but it's not quite enough. I started setting up markdown macros, so I could wrap the current word in or for italic and bold. Doing the actions in the writing area and noting the minimal steps necessary to achieve the desired outcome translated into perfect macros. I was even able to make a kind of rudimentary "undo" for when I wrap something in italic but intended to use bold. This reminded me that I haven't touched macro functionality in modern apps since my AppleScript days. Lemme check something real quick. I've popped open LibreOffice and feel immediately put off by its Macros function. It looks super powerful; a full dedicated code editor with watched variables for authoring in its scripting language. Or is it languages? Is it Macros or ScriptForge? What are "Gimmicks?" Just what is going on? 
Google Docs is about the same, using JavaScript for its "Apps Script" functionality. Here's a Stack Overflow post where someone wants to select text and set it to "blue and bold" with a keystroke and is presented with 32 lines of JavaScript. Many programs seem to have taken a "make the simple things difficult, and the hard things possible" approach to macros.

Microsoft Word reportedly has a "record" function for creating macros, which will watch what you do and let you play back those actions in sequence (a la Adobe Photoshop's "actions"). This sounds like a nice evolution of the BSW method. I say "reportedly" because it is not available in the online version and so I couldn't try it for myself without purchasing Microsoft 365.

I certainly don't doubt the sky's the limit with these modern macro systems. I'm sure amazing utilities can be created, with custom dialog boxes, internet data retrieval, and more. The flip-side is that a lot of power has been stripped from the writer and handed over to the programmer, which I think is unfortunate.

Bank Street Writer allows an author to use the same keyboard commands for creating a macro as for writing a document. There is a forgotten lesson in that. Yes, BSW's macros are limited compared to modern tools, but they are immediately accessible and intuitive. They leverage skills the user is already known to possess. The learning curve is a straight, flat line.

Like any good word processor, user-definable tab stops are possible. Bringing up the editor for tabs displays a ruler showing tab stops and their type (normal vs. decimal-aligned). Using the same tools for writing, the ruler is similarly editable. Just type a or a anywhere along the ruler. So, the lack of a ruler I noted at the beginning is now doubly frustrating, because it exists! Perhaps it was determined to be too much visual clutter for younger users?
Again, this is where the Options screen could have allowed advanced users to toggle on features as they grow in comfort and ambition. From what I can tell in the product catalogs, the only major revision after this was for the Macintosh which added a whole host of publishing features. If I think about my experience with BSW these past two weeks, and think about what my wish-list for a hypothetical update might be, "desktop publishing" has never crossed my mind. Having said all of that, I've really enjoyed using it to write this post. It has been solid, snappy, and utterly crash free. To be completely frank, when I switched over into LibreOffice , a predominantly native app for Windows, it felt laggy and sluggish. Bank Street Writer feels smooth and purpose-built, even in an emulator. Features are discoverable and the UI always makes it clear what action can be taken next. I never feel lost nor do I worry that an inadvertent action will have unknowable consequences. The impression of it being an assistant to my writing process is strong, probably more so than many modern word processors. This is cleanly illustrated by the prompt area which feels like a "good idea we forgot." (I also noted this in my ThinkTank examination) I cannot lavish such praise upon the original Bank Street Writer , only on this Plus revision. The original is 40-columns only, spell-checking is a completely separate program, no thesaurus, no macros, a kind of bizarre modal switch between writing/editing/transfer modes, no arrow key support, and other quirks of its time and target system (the original Apple 2). Plus is an incredibly smart update to that original, increasing its utility 10-fold, without sacrificing ease of use. In fact, it's actually easier to use, in my opinion than the original and comes just shy of being something I could use on a regular basis. Bank Street Writer is very good! But it's not quite great . 
Ways to improve the experience, notable deficiencies, workarounds, and notes about incorporating the software into modern workflows (if possible).

AppleWin 32bit 1.31.0.0 on Windows 11
Emulating an Enhanced Apple //e
Authentic machine speed (enhanced disk access speed)
Monochrome (amber) for clean 80-column display
Disk II controller in slot 5 (enables four floppies, total)
Mouse interface in slot 4
Bank Street Writer Plus

At the classroom level there are one or more computers.
At the school level there is a "computer lab" with one or more systems.
There were no computers.

Hit (open the File menu)
Hit (select Save File)
Hit three times (stepping through default confirmation dialogs)

I find that running at 300% CPU speed in AppleWin works great. No repeating key issues and the program is well-behaved. Spell check works quickly enough to not be annoying and I honestly enjoyed watching it work its way through the document. Sometimes there's something to be said about slowing the computer down to swift human-speed, to form a stronger sense of connection between your own work and the computer's work.

I did mention that I used a 4-disk setup, but in truth I never really touched the thesaurus. A 3-disk setup is probably sufficient. The application never crashed; the emulator was rock-solid.

CiderPress2 works perfectly for opening the files on an Apple ][ disk image. Files are of file extension, which CiderPress2 tries to open as disassembly, not text. Switch "Conversion" to "Plain Text" and you'll be fine.

This is a program that would benefit greatly from one more revision. It's very close to being enough for a "minimalist" crowd. There are four key pieces missing for completeness:

Much longer document handling
Smarter, expanded dictionary, with definitions
Customizable UI, display/hide: prompts, ruler, word count, etc.
Extra formatting options, like line spacing, visual centering, and so on.
For a modern writer using hyperlinks, this can trip up the spell-checker quite ferociously. It doesn't understand, nor can it be taught, pattern-matching against URLs to skip them.

Corrode 3 days ago

Canonical

What does it take to rewrite the foundational components of one of the world’s most popular Linux distributions? Ubuntu serves over 12 million daily desktop users alone, and the systems that power it, from sudo to core utilities, have been running for decades with what Jon Seager, VP of Engineering for Ubuntu at Canonical, calls “shaky underpinnings.” In this episode, we talk to Jon about the bold decision to “oxidize” Ubuntu’s foundation. We explore why they’re rewriting critical components like sudo in Rust, how they’re managing the immense risk of changing software that millions depend on daily, and what it means to modernize a 20-year-old operating system without breaking the internet. CodeCrafters helps you become proficient in Rust by building real-world, production-grade projects. Learn hands-on by creating your own shell, HTTP server, Redis, Kafka, Git, SQLite, or DNS service from scratch. Start for free today and enjoy 40% off any paid plan by using this link . Canonical is the company behind Ubuntu, one of the most widely-used Linux distributions in the world. From personal desktops to cloud infrastructure, Ubuntu powers millions of systems globally. Canonical’s mission is to make open source software available to people everywhere, and they’re now pioneering the adoption of Rust in foundational system components to improve security and reliability for the next generation of computing. Jon Seager is VP Engineering for Ubuntu at Canonical, where he oversees the Ubuntu Desktop, Server, and Foundations teams. Appointed to this role in January 2025, Jon is driving Ubuntu’s modernization strategy with a focus on Communication, Automation, Process, and Modernisation. His vision includes adopting memory-safe languages like Rust for critical infrastructure components. Before this role, Jon spent three years as VP Engineering building Juju and Canonical’s catalog of charms. He’s passionate about making Ubuntu ready for the next 20 years of computing. 
Juju - Jon’s previous focus, a cloud orchestration tool
GNU coreutils - The most widely used implementation of commands like ls, rm, cp, and more
uutils coreutils - coreutils implementation in Rust
sudo-rs - For your Rust-based sandwich needs
LTS - Long Term Support, a release model popularized by Ubuntu
coreutils-from-uutils - List of symbolic links used for coreutils on Ubuntu, some still point to the GNU implementation
man: sudo -E - Example of a feature that sudo-rs does not support
SIMD - Single instruction, multiple data
rust-coreutils - The Ubuntu package with all its supported CPU platforms listed
fastcat - Matthias’ blogpost about his faster version of
systemd-run0 - Alternative approach to sudo from the systemd project
AppArmor - The Linux Security Module used in Ubuntu
PAM - The Pluggable Authentication Modules, which handle all system authentication in Linux
SSSD - Enables LDAP user profiles on Linux machines
ntpd-rs - Time synchronization daemon written in Rust which may land in Ubuntu 26.04
Trifecta Tech Foundation - Foundation supporting sudo-rs development
Sequoia PGP - OpenPGP tools written in Rust
Mir - Canonical’s Wayland compositor library, uses some Rust
Anbox Cloud - Canonical’s Android streaming platform, includes Rust components
Simon Fels - Original creator of Anbox and Anbox Cloud team lead at Canonical
LXD - Container and VM hypervisor
dqlite - SQLite with a replication layer for distributed use cases, potentially being rewritten in Rust
Rust for Linux - Project to add Rust support to the Linux kernel
Nova GPU Driver - New Linux OSS driver for NVIDIA GPUs written in Rust
Ubuntu Asahi - Community project for Ubuntu on Apple Silicon
debian-devel: Hard Rust requirements from May onward - Parts of apt are being rewritten in Rust (announced a month after the recording of this episode)
Go Standard Library - Providing things like network protocols, cryptographic algorithms, and even tools to handle image formats
Python Standard Library - The origin of “batteries included”
The Rust Standard Library - Basic types, collections, filesystem access, threads, processes, synchronisation, and not much more
clap - Superstar library for CLI option parsing
serde - Famous high-level serialization and deserialization interface crate
Jon Seager’s Website
Jon’s Blog: Engineering Ubuntu For The Next 20 Years
Canonical Blog
Ubuntu Blog
Canonical Careers: Engineering - Apply your Rust skills in the Linux ecosystem

Anton Zhiyanov 4 days ago

Go proposal: Goroutine metrics

Part of the Accepted! series, explaining the upcoming Go changes in simple terms.

Export goroutine-related metrics from the Go runtime.

Ver. 1.26 • Stdlib • Medium impact

New metrics in the runtime/metrics package give better insight into goroutine scheduling:

Total number of goroutines since the program started.
Number of goroutines in each state.
Number of active threads.

Go's runtime/metrics package already provides a lot of runtime stats, but it doesn't include metrics for goroutine states or thread counts.

Per-state goroutine metrics can be linked to common production issues. An increasing waiting count can show a lock contention problem. A high not-in-go count means goroutines are stuck in syscalls or cgo. A growing runnable backlog suggests the CPUs can't keep up with demand.

Observability systems can track these counters to spot regressions, find scheduler bottlenecks, and send alerts when goroutine behavior changes from the usual patterns. Developers can use them to catch problems early without needing full traces.

Add the following metrics to the runtime/metrics package:

The per-state numbers are not guaranteed to add up to the live goroutine count (available since Go 1.16). All metrics use uint64 counters.

Start some goroutines and print the metrics after 100 ms of activity:

No surprises here: we read the new metric values the same way as before, using metrics.Read.

𝗣 15490 • 𝗖𝗟 690397, 690398, 690399

P.S. If you are into goroutines, check out my interactive book on concurrency
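The post's original code listings did not survive in this feed, so here is a minimal sketch of reading scheduler metrics with metrics.Read. Only "/sched/goroutines:goroutines" is a name that already exists in released Go versions. The other two names are assumptions about how the new per-state and thread metrics might be spelled, and the helper readUint64 is hypothetical. Because metrics.Read marks unknown names as KindBad, the sketch runs on any recent Go version and simply reports which names are unsupported.

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"sync"
	"time"
)

// readUint64 (hypothetical helper) reads one metric by name. The second
// result is false when the running Go version does not export that
// metric as a uint64 value.
func readUint64(name string) (uint64, bool) {
	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)
	if samples[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return samples[0].Value.Uint64(), true
}

func main() {
	// Start some goroutines that stay blocked for a while.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(200 * time.Millisecond)
		}()
	}

	// Let the goroutines run for 100 ms before sampling.
	time.Sleep(100 * time.Millisecond)

	names := []string{
		"/sched/goroutines:goroutines",         // live goroutine count
		"/sched/goroutines/waiting:goroutines", // ASSUMED name of a per-state metric
		"/sched/threads/total:threads",         // ASSUMED name of the thread count metric
	}
	for _, name := range names {
		if v, ok := readUint64(name); ok {
			fmt.Printf("%s = %d\n", name, v)
		} else {
			fmt.Printf("%s is not supported by this Go version\n", name)
		}
	}
	wg.Wait()
}
```

Checking the value kind before calling Uint64 matters here: calling Uint64 on a KindBad value panics, so the guard is what lets the same program run before and after the new metrics land.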

Anton Zhiyanov 6 days ago

Gist of Go: Concurrency testing

This is a chapter from my book on Go concurrency , which teaches the topic from the ground up through interactive examples. Testing concurrent programs is a lot like testing single-task programs. If the code is well-designed, you can test the state of a concurrent program with standard tools like channels, wait groups, and other abstractions built on top of them. But if you've made it this far, you know that concurrency is never that easy. In this chapter, we'll go over common testing problems and the solutions that Go offers. Waiting for goroutines • Checking channels • Checking for leaks • Durable blocking • Instant waiting • Time inside the bubble • Thoughts on time 1 ✎ • Thoughts on time 2 ✎ • Checking for cleanup • Bubble rules • Keep it up Let's say we want to test this function: Calculations run asynchronously in a separate goroutine. However, the function returns a result channel, so this isn't a problem: At point ⓧ, the test is guaranteed to wait for the inner goroutine to finish. The rest of the test code doesn't need to know anything about how concurrency works inside the function. Overall, the test isn't any more complicated than if the function were synchronous. But we're lucky that the function returns a channel. What if it doesn't? Let's say the function looks like this: We write a simple test and run it: The assertion fails because at point ⓧ, we didn't wait for the inner goroutine to finish. In other words, we didn't synchronize the test goroutine and the inner goroutine. That's why the result still has its initial value (0) when we do the check. We can add a short delay with time.Sleep : The test is now passing. But using time.Sleep to sync goroutines isn't a great idea, even in tests. We don't want to set a custom delay for every function we're testing. Also, the function's execution time may be different on the local machine compared to a CI server. If we use a longer delay just to be safe, the tests will end up taking too long to run.
Sometimes you can't avoid using time.Sleep in tests, but since Go 1.25, the synctest package has made these cases much less common. Let's see how it works. The synctest package has a lot going on under the hood, but its public API is very simple: The synctest.Test function creates an isolated bubble where you can control time to some extent. Any new goroutines started inside this bubble become part of the bubble. So, if we wrap the test code with synctest.Test , everything will run inside the bubble: the test code, the function we're testing, and its goroutine. At point ⓧ, we want to wait for the inner goroutine to finish. The synctest.Wait function comes to the rescue! It blocks the calling goroutine until all other goroutines in the bubble are finished. (It's actually a bit more complicated than that, but we'll talk about it later.) In our case, there's only one other goroutine (the inner goroutine), so synctest.Wait will pause until it finishes, and then the test will move on. Now the test passes instantly. That's better! ✎ Exercise: Wait until done Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises, which is why I recommend getting it . If you are okay with just theory for now, let's continue. As we've seen, you can use synctest.Wait to wait for the tested goroutine to finish, and then check the state of the data you are interested in. You can also use it to check the state of channels. Let's say there's a function that generates N numbers like 11, 22, 33, and so on: And a simple test: Set N=2, get the first number from the generator's output channel, then get the second number. The test passed, so the function works correctly. But does it really? Let's use the generator in "production": Panic! We forgot to close the channel when exiting the inner goroutine, so the for-range loop waiting on that channel got stuck. Let's fix the code: And add a test for the channel state: The test is still failing, even though we're now closing the channel when the goroutine exits.
This is a familiar problem: at point ⓧ, we didn't wait for the inner goroutine to finish. So when we check the channel, it hasn't closed yet. That's why the test fails. We can delay the check using time.Sleep : But it's better to use synctest.Wait : At point ⓧ, synctest.Wait blocks the test until the only other goroutine (the inner goroutine) finishes. Once the goroutine has exited, the channel is already closed. So, in the select statement, the receive case triggers and reports that the channel is closed, allowing the test to pass. As you can see, the synctest package helped us avoid delays in the test, and the test itself didn't get much more complicated. As we've seen, you can use synctest.Wait to wait for the tested goroutine to finish, and then check the state of the data or channels. You can also use it to detect goroutine leaks. Let's say there's a function that runs the given functions concurrently and sends their results to an output channel: And a simple test: Send three functions to be executed, get the first result from the output channel, and check it. The test passed, so the function works correctly. But does it really? Let's run the function three times, passing three functions each time: After 50 ms, when all the functions should definitely have finished, there are still 9 running goroutines. In other words, all the goroutines are stuck. The reason is that the output channel is unbuffered. If the client doesn't read from it, or doesn't read all the results, the goroutines inside the function get blocked when they try to send their results to the channel. Let's fix this by adding a buffer of the right size to the channel: Then add a test to check the number of goroutines: The test is still failing, even though the channel is now buffered, and the goroutines shouldn't block on sending to it. This is a familiar problem: at point ⓧ, we didn't wait for the running goroutines to finish. So the goroutine count is greater than zero, which makes the test fail. We can delay the check using time.Sleep (not recommended), or use a third-party package like goleak (a better option): The test passes now.
By the way, goleak also uses time.Sleep internally, but it does so much more efficiently. It tries up to 20 times, with the wait time between checks increasing exponentially, starting at 1 microsecond and going up to 100 milliseconds. This way, the test runs almost instantly. Even better, we can check for leaks without any third-party packages by using synctest : Earlier, I said that synctest.Wait blocks the calling goroutine until all other goroutines finish. Actually, it's a bit more complicated. synctest.Wait blocks until all other goroutines either finish or become durably blocked . We'll talk about "durably" later. For now, let's focus on "become blocked." Let's temporarily remove the buffer from the channel and check the test results: Here's what happens: Next, synctest.Test comes into play. It not only starts the bubble goroutine, but also tries to wait for all child goroutines to finish before it returns. If synctest.Test sees that some goroutines are stuck (in our case, all 9 are blocked trying to send to the channel), it panics: "main bubble goroutine has exited but blocked goroutines remain". So, we found the leak without using time.Sleep or goleak, thanks to the useful features of synctest.Wait and synctest.Test : Now let's make the channel buffered and run the test again: As we've found, synctest.Wait blocks until all goroutines in the bubble (except the one that called synctest.Wait ) have either finished or are durably blocked. Let's figure out what "durably blocked" means. For synctest , a goroutine inside a bubble is considered durably blocked if it is blocked by any of the following operations: Other blocking operations are not considered durable, and synctest ignores them. For example: The distinction between "durable" and other types of blocks is just an implementation detail of the synctest package. It's not a fundamental property of the blocking operations themselves. In real-world applications, this distinction doesn't exist, and "durable" blocks are neither better nor worse than any others. Let's look at an example.
Let's say there's a type that performs some asynchronous computation: Our goal is to write a test that checks the result while the calculation is still running . Let's see how the test changes depending on how the type is implemented (except for one version, which we'll cover a bit later). Let's say it is implemented using a done channel: Naive test: The check fails because when the result is read, the goroutine hasn't set it yet. Let's use synctest.Wait to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on reading from the done channel. This channel is created inside the bubble, so the block is durable. The synctest.Wait call in the test returns as soon as that happens, and we get the current value of the result. Let's say the type is implemented using select: Let's use synctest.Wait to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on a select statement. Both channels used in the select are created inside the bubble, so the block is durable. The synctest.Wait call in the test returns as soon as that happens, and we get the current value of the result. Let's say the type is implemented using a wait group: Let's use synctest.Wait to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the wait group's Wait call. The group's Add method was called inside the bubble, so this is a durable block. The synctest.Wait call in the test returns as soon as that happens, and we get the current value of the result. Let's say the type is implemented using a condition variable: Let's use synctest.Wait to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the condition variable's Wait call. This is a durable block. The synctest.Wait call returns as soon as that happens, and we get the current value of the result. Let's say the type is implemented using a mutex: Let's try using synctest.Wait to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the mutex's Lock call. synctest doesn't consider blocking on a mutex to be durable. The synctest.Wait call ignores the block and never returns. The test hangs and only fails when the overall timeout is reached.
You might be wondering why the authors didn't consider blocking on mutexes to be durable. There are a couple of reasons: ⌘ ⌘ ⌘ Let's go back to the original question: how does the test change depending on how the type is implemented? It doesn't change at all. We used the exact same test code every time: If your program uses durably blocking operations, synctest.Wait always works the same way: Very convenient! ✎ Exercise: Blocking queue Inside the bubble, time works differently. Instead of using a regular wall clock, the bubble uses a fake clock that can jump forward to any point in the future. This can be quite handy when testing time-sensitive code. Let's say we want to test this function: The positive scenario is straightforward: send a value to the channel, call the function, and check the result: The negative scenario, where the function times out, is also pretty straightforward. But the test takes the full three seconds to complete: We're actually lucky the timeout is only three seconds. It could have been as long as sixty! To make the test run instantly, let's wrap it in synctest.Test : Note that there is no synctest.Wait call here, and the only goroutine in the bubble (the root one) gets durably blocked on a select statement in the function. Here's what happens next: Thanks to the fake clock, the test runs instantly instead of taking three seconds like it would with the "naive" approach. You might have noticed that quite a few circumstances coincided here: We'll look at the alternatives soon, but first, here's a quick exercise. ✎ Exercise: Wait, repeat
The fake clock in synctest can be tricky. It moves forward only if: ➊ all goroutines in the bubble are durably blocked; ➋ there's a future moment when at least one goroutine will unblock; and ➌ synctest.Wait isn't running. Let's look at the alternatives. I'll say right away, this isn't an easy topic. But when has time travel ever been easy? :) Here's the function we're testing: Let's run the function in a separate goroutine, so there will be two goroutines in the bubble: synctest.Test panicked because the root bubble goroutine finished while the other goroutine was still blocked on a select. Reason: synctest only advances the clock if all goroutines are blocked, including the root bubble goroutine. How to fix: Use time.Sleep to make sure the root goroutine is also durably blocked. Now all three conditions are met again (all goroutines are durably blocked; the moment of future unblocking is known; there is no call to synctest.Wait ). The fake clock moves forward 3 seconds, which unblocks the function's goroutine. The goroutine finishes, leaving only the root one, which is still blocked on time.Sleep . The clock moves forward another 2 seconds, unblocking the root goroutine. The assertion passes, and the test completes successfully. But if we run the test with the race detector enabled (using the -race flag), it reports a data race on a shared variable: Logically, using time.Sleep in the root goroutine doesn't guarantee that the function's goroutine (which writes to the variable) will finish before the root goroutine reads it. That's why the race detector reports a problem. Technically, the test passes because of how synctest is implemented, but the race still exists in the code. The right way to handle this is to call synctest.Wait after time.Sleep : Calling synctest.Wait ensures that the function's goroutine finishes before the root goroutine reads the variable, so there's no data race anymore. Here's the function we're testing: Let's replace time.Sleep in the root goroutine with synctest.Wait : synctest.Test panicked because the root bubble goroutine finished while the other goroutine was still blocked on a select.
Reason: synctest only advances the clock if there is no active synctest.Wait running. If all bubble goroutines are durably blocked but a synctest.Wait is running, synctest won't advance the clock. Instead, it will simply finish the synctest.Wait call and return control to the goroutine that called it (in this case, the root bubble goroutine). How to fix: don't use synctest.Wait . Let's update the function to use context cancellation instead of a timer: We won't cancel the context in the test: synctest.Test panicked because all goroutines in the bubble are hopelessly blocked. Reason: synctest only advances the clock if it knows how much to advance it. In this case, there is no future moment that would unblock the select in the function. How to fix: Manually unblock the goroutine and call synctest.Wait to wait for it to finish. Now, canceling the context unblocks the select in the function, while synctest.Wait makes sure the goroutine finishes before the test checks the results. Let's update the function to lock the mutex before doing any calculations: In the test, we'll lock the mutex before calling the function, so it will block: The test failed because it hit the overall timeout set for the test run. Reason: synctest only works with durable blocks. Blocking on a mutex lock isn't considered durable, so the bubble can't do anything about it, even though the sleeping inner goroutine would have unlocked the mutex in 10 ms if the bubble had used the wall clock. How to fix: Don't use synctest . Now the mutex unlocks after 10 milliseconds (wall clock), the function finishes successfully, and the check passes. The clock inside the bubble won't move forward if: ✎ Exercise: Asynchronous repeater Let's practice understanding time in the bubble with some thinking exercises. Try to solve the problem in your head before using the playground. Here's a function that performs synchronous work: And a test for it: What is the test missing at point ⓧ?
✓ Thoughts on time 1 There's only one goroutine in the test, so when the function under test blocks on its wait, the time in the bubble jumps forward by 3 seconds. Then the function sets its result and finishes. Finally, the test checks the result and passes successfully. No need to add anything. Let's keep practicing our understanding of time in the bubble with some thinking exercises. Try to solve the problem in your head before using the playground. Here's a function that performs asynchronous work: And a test for it: What is the test missing at point ⓧ? ✓ Thoughts on time 2 Let's go over the options. ✘ synctest.Wait This won't help because synctest.Wait returns as soon as the wait inside the function's goroutine begins. The check fails, and synctest.Test panics with the error: "main bubble goroutine has exited but blocked goroutines remain". ✘ time.Sleep Because of the time.Sleep call in the root goroutine, the wait inside the function's goroutine is already over by the time the result is checked. However, there's no guarantee that the goroutine has set the result yet. That's why the test might pass or might fail. ✘ synctest.Wait, then time.Sleep This option is basically the same as just using time.Sleep , because synctest.Wait returns before the time.Sleep in the test even starts. The test might pass or might fail. ✓ time.Sleep, then synctest.Wait This is the correct answer. With nothing added at point ⓧ, the root goroutine isn't blocked, so it checks the result while the function's goroutine is still blocked by its wait. The check fails, and synctest.Test panics with the message: "main bubble goroutine has exited but blocked goroutines remain". Sometimes you need to test objects that use resources and should be able to release them. For example, this could be a server that, when started, creates a pool of network connections, connects to a database, and writes file caches. When stopped, it should clean all this up. Let's see how we can make sure everything is properly stopped in the tests. We're going to test this server: Let's say we wrote a basic functional test: The test passes, but does that really mean the server stopped when we asked it to? Not necessarily.
For example, here's a buggy implementation where our test would still pass: As you can see, the author simply forgot to stop the server here. To detect the problem, we can wrap the test in synctest.Test and see it panic: The server ignores the stop call and doesn't stop the goroutine running inside it. Because of this, the goroutine gets blocked while writing to the channel. When synctest.Test finishes, it detects the blocked goroutine and panics. Let's fix the server code (to keep things simple, we won't support multiple start or stop calls): Now the test passes. Here's how it works: Instead of using defer to stop something, it's common to use the t.Cleanup method. It registers a function that will run when the test finishes: Functions registered with t.Cleanup run in last-in, first-out (LIFO) order, after all deferred functions have executed. In the test above, there's not much difference between using defer and t.Cleanup . But the difference becomes important if we move the server setup into a separate helper function, so we don't have to repeat the setup code in different tests: The defer approach doesn't work because it stops the server when the helper returns, before the test assertions run: The t.Cleanup approach works because it stops the server when the test has finished, after all the assertions have already run: Sometimes, a context ( context.Context ) is used to stop the server instead of a separate method. In that case, our server interface might look like this: Now we don't even need to use defer or t.Cleanup to check whether the server stops when the context is canceled. Just pass t.Context() as the context: t.Context() returns a context that is automatically created when the test starts and is automatically canceled when the test finishes. Here's how it works: To check for stopping via a method or function, use defer or t.Cleanup . To check for cancellation or stopping via context, use t.Context() . Inside a bubble, t.Context() returns a context whose Done channel is associated with the bubble. The context is automatically canceled when synctest.Test ends. Functions registered with t.Cleanup inside the bubble run just before synctest.Test finishes. Let's go over the rules for living in the bubble.
The following operations durably block a goroutine: The limitations are quite logical, and you probably won't run into them. Don't create channels or objects that contain channels (like tickers or timers) outside the bubble. Otherwise, the bubble won't be able to manage them, and the test will hang: Don't access synchronization primitives associated with a bubble from outside the bubble: Don't call T.Run , T.Parallel , or T.Deadline inside a bubble: Don't call synctest.Test inside the bubble: Don't call synctest.Wait from outside the bubble: Don't call synctest.Wait concurrently from multiple goroutines: ✎ Exercise: Testing a pipeline The synctest package is a complicated beast. But now that you've studied it, you can test concurrent programs no matter what synchronization tools they use: channels, selects, wait groups, timers, or tickers. In the next chapter, we'll talk about concurrency internals (coming soon). Pre-order for $10 or read online Three calls to the function start 9 goroutines. The call to synctest.Wait blocks the root bubble goroutine. One of the goroutines finishes its work, tries to write to the output channel, and gets blocked (because no one is reading from it). The same thing happens to the other 8 goroutines. synctest.Wait sees that all the child goroutines in the bubble are blocked, so it unblocks the root goroutine. The root goroutine finishes. synctest.Wait unblocks as soon as all other goroutines are durably blocked. synctest.Test panics when finished if there are still blocked goroutines left in the bubble. Sending to or receiving from a channel created within the bubble. A select statement where every case is a channel created within the bubble. Calling a wait group's Wait if all Add calls were made inside the bubble. Sending to or receiving from a channel created outside the bubble. Locking a mutex.
I/O operations (like reading a file from disk or waiting for a network response). System calls and cgo calls. Mutexes are usually used to protect shared state, not to coordinate goroutines (the example above is completely unrealistic). In tests, you usually don't need to pause before locking a mutex to check something. Mutex locks are usually held for a very short time, and mutexes themselves need to be as fast as possible. Adding extra logic to support synctest could slow them down in normal (non-test) situations. It waits until all other goroutines in the bubble are blocked. Then, it unblocks the goroutine that called it. The bubble checks if the goroutine can be unblocked by waiting. In our case, it can: we just need to wait 3 seconds. The bubble's clock instantly jumps forward 3 seconds. The select in the function chooses the timeout case, and the function returns. The test assertions both pass successfully. There's no synctest.Wait call. There's only one goroutine. The goroutine is durably blocked. It will be unblocked at a certain point in the future. There are any goroutines that aren't durably blocked. It's unclear how much time to advance. synctest.Wait is running. Because of the time.Sleep call in the root goroutine, the wait inside the function's goroutine is already over by the time the result is checked. Because of the synctest.Wait call, the goroutine is guaranteed to finish (and hence to set the result) before the result is checked. The main test code runs. Before the test finishes, the deferred stop call runs. In the server goroutine, the stop case in the select statement triggers, and the goroutine ends. synctest.Test sees that there are no blocked goroutines and finishes without panicking. The main test code runs. Before the test finishes, the context is automatically canceled. The server goroutine stops (as long as the server is implemented correctly and checks for context cancellation). synctest.Test sees that there are no blocked goroutines and finishes without panicking. A bubble is created by calling synctest.Test . Each call creates a separate bubble. Goroutines started inside the bubble become part of it.
The bubble can only manage durable blocks. Other types of blocks are invisible to it. If all goroutines in the bubble are durably blocked with no way to unblock them (such as by advancing the clock or returning from a synctest.Wait call), synctest.Test panics. Before synctest.Test returns, it tries to wait for all child goroutines to complete. However, if even a single goroutine is durably blocked, synctest.Test panics. Calling t.Context() returns a context whose Done channel is associated with the bubble. Functions registered with t.Cleanup run inside the bubble, immediately before synctest.Test returns. Calling synctest.Wait in a bubble blocks the goroutine that called it. synctest.Wait returns when all other goroutines in the bubble are durably blocked. synctest.Wait returns when all other goroutines in the bubble have finished. The bubble uses a fake clock (starting at 2000-01-01 00:00:00 UTC). Time in the bubble only moves forward if all goroutines are durably blocked. Time advances by the smallest amount needed to unblock at least one goroutine. If the bubble has to choose between moving time forward or returning from a running synctest.Wait , it returns from synctest.Wait . A blocking send or receive on a channel created within the bubble. A blocking select statement where every case is a channel created within the bubble. Calling a wait group's Wait if all Add calls were made inside the bubble.

Filippo Valsorda 1 weeks ago

The 2025 Go Cryptography State of the Union

This past August, I delivered my traditional Go Cryptography State of the Union talk at GopherCon US 2025 in New York. It goes into everything that happened at the intersection of Go and cryptography over the last year. You can watch the video (with manually edited subtitles, for my fellow subtitles enjoyers) or read the transcript below (for my fellow videos not-enjoyers). The annotated transcript below was made with Simon Willison’s tool . All pictures were taken around Rome, the Italian countryside, and the skies of the Northeastern United States. Welcome to my annual performance review. We are going to talk about all of the stuff that we did in the Go cryptography world during the past year. When I say "we," it doesn't mean just me, it means me, Roland Shoemaker, Daniel McCarney, Nicola Morino, Damien Neil, and many, many others, both from the Go team and from the Go community that contribute to the cryptography libraries all the time. I used to do this work at Google, and I now do it as an independent as part of and leading Geomys , but we'll talk about that later. When we talk about the Go cryptography standard libraries, we talk about all of those packages that you use to build secure applications. That's what we make them for. We do it to provide you with encryption and hashes and protocols like TLS and SSH, to help you build secure applications . The main headlines of the past year: We shipped post quantum key exchanges, which is something that you will not have to think about and will just be solved for you. We have solved FIPS 140, which some of you will not care about at all and some of you will be very happy about. And the thing I'm most proud of: we did all of this while keeping an excellent security track record, year after year. This is an update to something you've seen last year. The Go Security Track Record It's the list of vulnerabilities in the Go cryptography packages.
We don't assign a severity—because it's really hard, instead they're graded on the "Filippo's unhappiness score." It goes shrug, oof, and ouch. Time goes from bottom to top, and you can see how as time goes by things have been getting better. People report more things, but they're generally more often shrugs than oofs and there haven't been ouches. More specifically, we haven't had any oof since 2023. We didn't have any Go-specific oof since 2021. When I say Go-specific, I mean: well, sometimes the protocol is broken, and as much as we want to also be ahead of that by limiting complexity, you know, sometimes there's nothing you can do about that. And we haven't had ouches since 2019 . I'm very happy about that. But if this sounds a little informal, I'm also happy to report that we had the first security audit by a professional firm. Trail of Bits looked at all of the nuts and bolts of the Go cryptography standard library: primitives, ciphers, hashes, assembly implementations. They didn't look at the protocols, which is a lot more code on top of that, but they did look at all of the foundational stuff. And I'm happy to say that they found nothing . Two of a kind t-shirts, for me and Roland Shoemaker. It is easy though to maintain a good security track record if you never add anything, so let's talk about the code we did add instead. First of all, post-quantum key exchanges. We talked about post-quantum last year, but as a very quick refresher: Now, we focused on post-quantum key exchange because the key exchange defends against the most urgent risk, which is that somebody might be recording connections today, keeping them saved on some storage for the next 5-50 years and then use the future quantum computers to decrypt those sessions. I'm happy to report that we now have ML-KEM, which is the post-quantum key exchange algorithm selected by the NIST competition, an international competition run in the open. 
You can use it directly from the crypto/mlkem standard library package starting in Go 1.24, but you're probably not gonna do that. Instead, you're probably going to just use crypto/tls, which by default now uses a hybrid of X25519 and ML-KEM-768 for all connections with other systems that support it. Why hybrid? Because this is new cryptography. So we are still a little worried that somebody might break it. There was one that looked very good and had very small ciphertext, and we were all like, “yes, yes, that's good, that's good.” And then somebody broke it on a laptop. It was very annoying. We're fairly confident in lattices. We think this is the good one. But still, we are taking both the old stuff and the new stuff, hashing them together, and unless you have both a quantum computer to break the old stuff and a mathematician who broke the new stuff, you're not breaking the connection. crypto/tls can now negotiate that with Chrome and can negotiate that with other Go 1.24+ applications. Not only that, we also removed any choice you had in ordering of key exchanges because we think we know better than you and— that didn't come out right, uh. … because we assume that you actually want us to make those kind of decisions, so as long as you don't turn it off, we will default to post-quantum. You can still turn it off. But as long as you don't turn it off, we'll default to the post-quantum stuff to keep your connection safe from the future. Same stuff with x/crypto/ssh. Starting in v0.38.0. SSH does the same thing, they just put X25519 and ML-KEM-768 in a different order, which you would think doesn't matter—and indeed it doesn't matter—but there are rules where "no, no, no, you have to put that one first." And the other rule says "no, you have to put that one first." It's been a whole thing. I'm tired. OpenSSH supports it, so if you connect to a recent enough version of OpenSSH, that connection is post-quantum and you didn't have to do anything except update. 
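For the curious, direct use of crypto/mlkem looks roughly like the sketch below. The exchange function and the in-process "sender" and "receiver" are made up for illustration; a real protocol would send the public key and ciphertext over the network, and most applications should simply let crypto/tls do this. It uses only the ML-KEM-768 calls from the Go 1.24 standard library.

```go
package main

import (
	"bytes"
	"crypto/mlkem"
	"fmt"
	"log"
)

// exchange (hypothetical helper) runs one ML-KEM-768 key exchange in a
// single process and returns the shared key as seen by each side.
func exchange() (senderKey, receiverKey []byte, err error) {
	// Receiver: generate a key pair and publish the encapsulation key.
	dk, err := mlkem.GenerateKey768()
	if err != nil {
		return nil, nil, err
	}
	public := dk.EncapsulationKey().Bytes()

	// Sender: parse the public key, derive a shared key and a ciphertext.
	ek, err := mlkem.NewEncapsulationKey768(public)
	if err != nil {
		return nil, nil, err
	}
	sKey, ciphertext := ek.Encapsulate()

	// Receiver: recover the same shared key from the ciphertext.
	rKey, err := dk.Decapsulate(ciphertext)
	if err != nil {
		return nil, nil, err
	}
	return sKey, rKey, nil
}

func main() {
	a, b, err := exchange()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("shared key bytes:", len(a))
	fmt.Println("keys match:", bytes.Equal(a, b))
}
```

This is only the post-quantum half. The hybrid used by crypto/tls also runs X25519 and combines the two secrets, which this sketch does not attempt.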
Okay, but you said key exchanges and digital signatures are broken. What about the latter? Well, key exchanges are urgent because of the record-now-decrypt-later problem, but unless the physicists that are developing quantum computers also develop a time machine, they can't use the QC to go back in time and use a fake signature today. So if you're verifying a signature today, I promise you it's not forged by a quantum computer. We have a lot more time to figure out post-quantum digital signatures. But if we can, why should we not start now? Well, it's different. Key exchange, we knew what hit we had to take. You have to do a key exchange, you have to do it when you start the connection, and ML-KEM is the algorithm we have, so we're gonna use it. Signatures, we developed a lot of protocols like TLS, SSH, back when it was a lot cheaper to put signatures on the wire. When you connect to a website right now, you get five signatures. We can't send you five 2KB blobs every time you connect to a website. So we are waiting to give time to protocols to evolve, to redesign things with the new trade-offs in mind of signatures not being cheap. We are kind of slow rolling intentionally the digital signature side because it's both not as urgent and not as ready to deploy. We can't do the same “ta-da, it's solved for you” show because signatures are much harder to roll out. Let's talk about another thing that I had mentioned last year, which is FIPS 140. FIPS 140 is a US government regulation for how to do cryptography. It is a list of algorithms, but it's not just a list of algorithms. It's also a list of rules that the modules have to follow. What is a module? Well, a module used to be a thing you would rack. All the rules are based on the idea that it's a thing you can rack. Then the auditor can ask “what is the module’s boundary?” And you're like, “this shiny metal box over here." And, you know, that works. 
When people ask those questions of libraries, though, I do get a little mad every time. Like, what are the data input ports of your library? Ports. Okay. Anyway, it's an interesting thing to work with. To comply with FIPS 140 in Go, up to now, you had to use an unsupported GOEXPERIMENT, which would replace all of the Go cryptography standard library, all of the stuff I'm excited about, with the BoringCrypto module, which is a FIPS 140 module developed by the BoringSSL folks. We love the BoringSSL folks, but that means using cgo, and we do not love cgo. It has memory safety issues, it makes cross-compilation difficult, it’s not very fast. Moreover, the list of algorithms and platforms of BoringCrypto is tailored to the needs of BoringSSL and not to the needs of the Go community, and their development cycle doesn't match our development cycle: we don't decide when that module gets validated. Speaking of memory safety, I lied a little. Trail of Bits did find one vulnerability. They found it in Go+BoringCrypto, which was yet another reason to try to push away from it. Instead, we've got now the FIPS 140-3 Go Cryptographic Module. Not only is it native Go, it's actually just a different name for the internal Go packages that all the regular Go cryptography packages use for the FIPS 140 algorithms. We just moved them into their own little bubble so that when they ask us “what is the module boundary” we can point at those packages. Then there's a runtime mode which enables some of the self-tests and slow stuff that you need for compliance. It also tells crypto/tls not to negotiate stuff that's not FIPS, but aside from that, it doesn't change any observable behavior. We managed to keep everything working exactly the same: you don't import a different package, you don't do anything different, your applications just keep working the same way. We're very happy about that. 
Finally, you can at compile time select a GOFIPS140 frozen module, which is just a zip file of the source of the module as it was back when we submitted it for validation, which is a compliance requirement sometimes. By the way, that means we have to be forward compatible with future versions of Go, even for internal packages, which was a little spicy. You can read more in the upstream FIPS 140-3 docs. You might be surprised to find out that using a FIPS 140 algorithm from a FIPS 140 module is not actually enough to be FIPS 140 compliant. The FIPS 140 module also has to be tested for that specific algorithm. What we did is we just tested them all, so you can use any FIPS 140 algorithm without worrying about whether it's tested in our module. When I say we tested them all, I mean that some of them we tested with four different names. NIST calls HKDF alternatively SP 800-56C two-step KDF, SP 800-133 Section 6.3 CKG, SP 800-108 Feedback KDF, and Implementation Guidance D.P OneStepNoCounter KDF (you don't wanna know). It has four different names for the same thing. We just tested it four times, it's on the certificate, you can use it whatever way you want and it will be compliant. But that's not enough. Even if you use a FIPS 140 algorithm from a FIPS 140 module that was tested for the algorithm, it's still not enough, because it has to run on a platform that was tested as part of the validation. So we tested on a lot of platforms. Some of them were paid for by various Fortune 100s that had an interest in them getting tested, but some of them had no sponsors. We really wanted to solve this problem for everyone, once and for all, so Geomys just paid for all the FreeBSD, macOS, even Windows testing so that we could say “run it on whatever and it's probably going to be compliant.” (Don't quote me on that.) How did we test on that many machines? Well, you know, we have this sophisticated data center… Um, no. No, no. I got a bunch of stuff shipped to my place. 
That's my NAS now. It's an Ampere Altra Q64-22, sixty-four arm64 cores, and yep, it's my NAS. Then I tested it on, you know, this sophisticated arm64 macOS testing platform. And then on the Windows one, which is my girlfriend's laptop. And then the arm one, which was my router. Apparently I own an EdgeRouter now? It's sitting in the data center which is totally not my kitchen. It was all a very serious and regimented thing, and all of it is actually recorded, in recorded sessions with the accredited laboratories, so all this is now on file with the US government. You might or might not be surprised to hear that the easiest way to meet the FIPS 140 requirements is not to exceed them. That's annoying and a problem of FIPS 140 in general: if you do what everybody else does, which is just clearing the bar, nobody will ask questions, so there’s a strong temptation to lower security in FIPS 140 mode. We just refused to accept that. Instead, we figured out complex stratagems. For example, for randomness, the safest thing to do is to just take randomness from the kernel every time you need it. The kernel knows if a virtual machine was just cloned and we don't, so we risk generating the same random bytes twice. But NIST will not allow that. You need to follow a bunch of standards for how the randomness is generated, and the kernel doesn’t. So what we do is we do everything that NIST asks and then every time you ask for randomness, we squirrel off, go to the kernel, get a little piece of extra entropy, stir it into the pot before giving back the result, and give back the result. It's still NIST compliant because it's as strong as both the NIST and the kernel solution, but it took some significant effort to show it is compliant. We did the same for ECDSA. ECDSA is a digital signature mechanism. We've talked about it a few other times. It's just a way to take a message and a private key and generate a signature, here (s, r) . 
To make a signature, you also need a random number, and that number must be used only once with the same private key. You cannot reuse it. That number is k here. Why can you not reuse it? Because if you reuse it, then you can do this fun algebra thing and then pop the private key falls out by just smashing two signatures together. Bad, really, really bad. How do we generate this number that must never be the same? Well, one option is we make it random. But what if your random number generator breaks and generates twice the same random number? That would leak the private key, and that would be bad. So the community came up with deterministic ECDSA . Instead of generating the nonce at random, we are going to hash the message and the private key. This is still actually a little risky though, because if there's a fault in the CPU , for example, or a bug, because for example you're taking the wrong inputs , you might still end up generating the same value but signing a slightly different message. How do we mitigate both of those? We do both. We take some randomness and the private key and the message, we hash them all together, and now it's really, really hard for the number to come out the same. That's called hedged ECDSA. The Go crypto library has been doing hedged ECDSA from way before it was called hedged and way before I was on the team . Except… random ECDSA has always been FIPS. Deterministic ECDSA has been FIPS since a couple years ago. Hedged ECDSA is technically not FIPS. We really didn't want to make our ECDSA package less secure, so we found a forgotten draft that specifies a hedged ECDSA scheme, and we proceeded to argue that actually if you read SP 800-90A Revision 1 very carefully you realize that if you claim that the private key is just the DRBG entropy plus two-thirds of the DRBG nonce, you are allowed to use it because of SP 800-57 Part 1, etc etc etc . 
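To make the "we do both" idea concrete, here is a minimal sketch of hedged nonce derivation. It is not the construction Go actually uses (that one is built on a DRBG, as described above), and `hedgedSeed`, `freshSeed`, and `brokenRNG` are hypothetical names. It also assumes a fixed-length private key so that plain concatenation is unambiguous, and a real implementation would still have to reduce the output to a scalar in the range the curve requires.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/sha512"
	"fmt"
	"io"
)

// hedgedSeed hashes fresh randomness, the private key, and the message digest
// together. If the RNG fails, the result is still message-dependent
// (deterministic ECDSA behavior); if the RNG works, the result is fresh on
// every call even for identical inputs.
func hedgedSeed(random io.Reader, priv, msg []byte) ([]byte, error) {
	entropy := make([]byte, 32)
	if _, err := io.ReadFull(random, entropy); err != nil {
		return nil, err
	}
	digest := sha256.Sum256(msg) // fixed-length stand-in for the message

	h := sha512.New()
	h.Write(entropy)
	h.Write(priv) // assumed fixed length in this sketch
	h.Write(digest[:])
	return h.Sum(nil), nil
}

// brokenRNG simulates a failed random number generator that only emits zeros.
type brokenRNG struct{}

func (brokenRNG) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

// freshSeed is hedgedSeed wired to the operating system's RNG.
func freshSeed(priv, msg []byte) ([]byte, error) {
	return hedgedSeed(rand.Reader, priv, msg)
}

func main() {
	priv := make([]byte, 32) // placeholder key bytes, for illustration only

	a, _ := hedgedSeed(brokenRNG{}, priv, []byte("message one"))
	b, _ := hedgedSeed(brokenRNG{}, priv, []byte("message two"))
	fmt.Println("broken RNG, different messages, same seed:", string(a) == string(b))

	c, _ := freshSeed(priv, []byte("message one"))
	d, _ := freshSeed(priv, []byte("message one"))
	fmt.Println("working RNG, same message, same seed:", string(c) == string(d))
}
```

Both lines should report `false`: a dead RNG alone does not cause nonce reuse across messages, and a faulty or repeated message alone does not either.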
We basically just figured out a way to claim it was fine and the lab eventually said "okay, shut up." I'm very proud of that one. If you want to read more about this, check out the announcement blog post. If you know you need commercial services for FIPS 140, here’s Geomys FIPS 140 commercial services page. If you don't know if you need them, you actually probably don't. It's fine, the standard library will probably solve this for you now. Okay, but who cares about this FIPS 140 stuff? "Dude, we've been talking about FIPS 140 for 10 minutes and I don't care about that." Well, I care because I spent my last year on it and that apparently made me the top committer for the cycle to the Go repo and that's mostly FIPS 140 stuff. I don't know how to feel about that. There have been actually a lot of positive side effects from the FIPS 140 effort. We took care to make sure that everything that we found we would leave in a better state. For example, there are new packages that moved from x/crypto into the standard library: crypto/hkdf, crypto/pbkdf2, crypto/sha3. SHA-3 is faster and doesn't allocate anymore. HKDF has a new generic API which lets you pass in a function that returns either a concrete type that implements Hash or a function that returns a Hash interface, which otherwise was a little annoying. (You had to make a little closure.) I like it. We restructured crypto/aes and crypto/cipher and in the process merged a contribution from a community member that made AES-CTR, the counter mode, between 2 and 9 times faster. That was a pretty good result. The assembly interfaces are much more consistent now. Finally, we finished cleaning up crypto/rsa. If you remember from last year, we made the crypto/rsa sign and verify operations not use math/big and use constant time code. Now we also made key generation, validation, and pre-computation all not use math/big. That made loading keys that were serialized to JSON a lot faster, and made key generation much faster. 
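As a quick illustration of the generic HKDF API mentioned above, the sketch below derives a key once with a constructor that returns the `hash.Hash` interface and once with one that returns a concrete type. It assumes Go 1.24 or later; `deriveKeys` and the input strings are made up for the example.

```go
package main

import (
	"crypto/hkdf"
	"crypto/sha256"
	"crypto/sha3"
	"fmt"
)

// deriveKeys derives two 32-byte keys from the same inputs, using two
// different hash constructors with the same generic function.
// (Hypothetical helper for illustration.)
func deriveKeys(secret, salt []byte, info string) (withSHA256, withSHA3 []byte, err error) {
	// sha256.New returns the hash.Hash interface.
	withSHA256, err = hkdf.Key(sha256.New, secret, salt, info, 32)
	if err != nil {
		return nil, nil, err
	}
	// sha3.New256 returns the concrete *sha3.SHA3 type. The type parameter is
	// inferred, so no wrapping closure is needed.
	withSHA3, err = hkdf.Key(sha3.New256, secret, salt, info, 32)
	if err != nil {
		return nil, nil, err
	}
	return withSHA256, withSHA3, nil
}

func main() {
	k1, k2, err := deriveKeys([]byte("input keying material"), []byte("salt"), "example context")
	if err != nil {
		panic(err)
	}
	fmt.Printf("HKDF-SHA-256:   %x\n", k1)
	fmt.Printf("HKDF-SHA3-256:  %x\n", k2)
}
```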
But how much faster? Benchmarking key generation is really hard because it's a random process: you take a random number and you check, is it prime? No. Toss. Is it prime? Nope. Toss. Is it prime? You keep doing this. If you're lucky, it’s very fast. If you are unlucky, very slow. It’s a geometric distribution and if you want to average it out, you have to run for hours. Instead, I figured out a new way by mathematically deriving the average number of pulls you are supposed to do and preparing a synthetic run that gives exactly the expected mean number of checks, so that we get a representative sample to benchmark deterministically. That was a lot of fun. Moreover, we detect more broken keys, and we did a rare backwards compatibility break to stop supporting keys smaller than 1024 bits. 1024 is already pretty small, you should be using 2048 minimum, but if you're using less than 1024, it can be broken on the proverbial laptop. It's kind of silly that a production library lets you do something so insecure, and you can't tell them apart just by looking at the code. You have to know what the size of the key is. So we just took that out. I expected people to yell at me. Nobody yelled at me. Good job community. Aside from adding stuff, you know that we are very into testing and that testing is how we keep that security track record that we talked about. I have one bug in particular that is my white whale. (You might say, "Filippo, well-adjusted people don't have white whales." Well, we learned nothing new, have we?) My white whale is this assembly bug that we found at Cloudflare before I joined the Go team. I spent an afternoon figuring out an exploit for it with Sean Devlin in Paris, while the yellow jackets set fire to cop cars outside. That's a different story. It's an assembly bug where the carry (literally the carry like when you do a pen and paper multiplication) was just not accounted for correctly. 
You can watch my talk Squeezing a Key through a Carry Bit if you are curious to learn more about it. The problem with this stuff is that it's so hard to get code coverage for it because all the code always runs. It's just that you don't know if it always runs with that carry at zero, and if the carry was one, it’d do the wrong math. I think we've cracked it, by using mutation testing. We have a framework that tells the assembler, "hey, anywhere you see an add-with-carry, replace it with a simple add that discards the carry." Then we run the tests. If the tests still pass, the test did not cover that carry. If that happens we fail a meta-test and tell whoever's sending the CL, “hey, no, no, no, you gotta test that.” Same for checking the case in which the carry is always set. We replace the add-with-carry with a simple add and then insert a +1. It's a little tricky. If you want to read more about it, it's in this blog post . I'm very hopeful that will help us with all this assembly stuff. Next, accumulated test vectors . This is a little trick that I'm very very fond of. Say you want to test a very large space. For example there are two inputs and they can both be 0 to 200 bytes long, and you want to test all the size combinations. That would be a lot of test vectors, right? If I checked in a megabyte of test vectors every time I wanted to do that, people eventually would yell at me. Instead what we do is run the algorithm with each size combination, and take the result and we put it inside a rolling hash. Then at the end we take the hash result and we check that it comes out right. We do this with two implementations. If it comes out to the same hash, great. If it comes out not to the same hash, it doesn't help you figure out what the bug is, but it tells you there's a bug. I'll take it. We really like reusing other people's tests. We're lazy. 
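Here is a minimal sketch of the accumulated test vectors trick. The choice of HMAC-SHA-256 as the function under test, the hand-written `hmacManual` second implementation, and the `pattern` input generator are all assumptions made for this example, not what the Go tests actually use. In a real test you would typically also check the final hash against a constant committed to the repository.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// hmacStd is the implementation under test: the standard library HMAC.
func hmacStd(key, msg []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(msg)
	return m.Sum(nil)
}

// hmacManual is an independent HMAC-SHA-256 written directly from RFC 2104.
func hmacManual(key, msg []byte) []byte {
	const blockSize = 64
	if len(key) > blockSize {
		sum := sha256.Sum256(key)
		key = sum[:]
	}
	var ipad, opad [blockSize]byte
	copy(ipad[:], key)
	copy(opad[:], key)
	for i := range ipad {
		ipad[i] ^= 0x36
		opad[i] ^= 0x5c
	}
	inner := sha256.New()
	inner.Write(ipad[:])
	inner.Write(msg)
	outer := sha256.New()
	outer.Write(opad[:])
	outer.Write(inner.Sum(nil))
	return outer.Sum(nil)
}

// pattern returns n deterministic bytes so both runs see identical inputs.
func pattern(n int, seed byte) []byte {
	b := make([]byte, n)
	for i := range b {
		b[i] = seed + byte(i)
	}
	return b
}

// accumulate runs f on every (key length, message length) combination from 0
// to maxLen and folds all outputs into one rolling hash.
func accumulate(f func(key, msg []byte) []byte, maxLen int) [32]byte {
	acc := sha256.New()
	for k := 0; k <= maxLen; k++ {
		for m := 0; m <= maxLen; m++ {
			acc.Write(f(pattern(k, 1), pattern(m, 2)))
		}
	}
	var out [32]byte
	copy(out[:], acc.Sum(nil))
	return out
}

func main() {
	a := accumulate(hmacStd, 200)
	b := accumulate(hmacManual, 200)
	fmt.Printf("accumulated hash: %x\n", a)
	fmt.Println("implementations agree:", a == b)
}
```

That covers 201 × 201 size combinations while storing at most one 32-byte value, at the cost described above: a mismatch says a bug exists but not where.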
The BoringSSL people have a fantastic suite of tests for TLS called BoGo and Daniel has been doing fantastic work integrating that and making crypto/tls stricter and stricter in the process. It's now much more spec compliant on the little things where it goes like, “no, no, no, you're not allowed to put a zero here” and so on. Then, the Let's Encrypt people have a test tool for the ACME protocol called Pebble. (Because it's a small version of their production system called Boulder! It took me a long time to figure it out and eventually I was like ooooohhh.) Finally, NIST has this X.509 interoperability test suite, which just doesn't have a good name. It's good though. More assembly cleanups. There used to be places in assembly where—as if assembly was not complicated enough—instructions were just written down as raw machine code. Sometimes even the comment was wrong! Can you tell the comment changed in that patch? This is a thing Roland and Joel found. Now there's a test that will just yell at you if you try to commit a or instruction. We also removed all the assembly that was specifically there for speeding up stuff on CPUs that don't have AVX2. AVX2 came out in 2015 and if you want to go fast, you're probably not using the CPU generation from back then. We still run on it, just not as fast. More landings! I’m going to speed through these ones. This is all stuff that we talked about last year and that we actually landed. Stuff like data independent timing to tell the CPU, "no, no, I actually did mean for you to do that in constant time, goddammit." And server-side TLS Encrypted Client Hello, which is a privacy improvement. We had client side, now we have server side. crypto/rand.Read never fails. We promised that, we did that. Now, do you know how hard it is to test the failure case of something that never fails? I had to re-implement the seccomp library to tell the kernel to break the getrandom syscall to check what happens when it doesn’t work. 
There are tests all pointing guns at each other to make sure the fallback both works and is never hit unexpectedly. It's also much faster now because Jason Donenfeld added the Linux getrandom VDSO. Sean Liao added rand.Text like we promised. Then more stuff like hash.Cloner , which I think makes a lot of things a little easier, and more and more and more and more. The Go 1.24 and Go 1.25 release notes are there for you. x/crypto/ssh is also under our maintenance and some excellent stuff happened there, too. Better tests, better error messages, better compatibility, and we're working on some v2 APIs . If you have opinions, it’s time to come to those issues to talk about them! It’s been an exciting year, and I'm going to give you just two samples of things we're planning to do for the next year. One is TLS profiles. Approximately no one wants to specifically configure the fifteen different knobs of a TLS library. Approximately no one—because I know there are some people who do and they yell at me regularly. But instead most people just want "hey, make it broadly compatible." "Hey, make it FIPS compliant." "Hey, make it modern." We're looking for a way to make it easy to just say what your goal is, and then we do all the configuration for you in a way that makes sense and that evolves with time. I'm excited about this one. And maybe something with passkeys? If you run websites that authenticate users a bunch with password hashes and maybe also with WebAuthN, find me, email us, we want feedback. We want to figure out what to build here, into the standard library. Alright, so it's been a year of cryptography, but it's also been a year of Geomys. Geomys launched a year ago here at GopherCon. If you want an update, we went on the Fallthrough podcast to talk about it , so check that out. We are now a real company and how you know is that we have totes: it's the equivalent of a Facebook-official relationship. 
The best FIPS 140 side effect has been that we have a new maintainer. Daniel McCarney joined us to help with the FIPS effort and then we were working very well together so Geomys decided to just take him on as a permanent maintainer on the Go crypto maintenance team. I’m very excited about that. This is all possible thanks to our clients, and if you have any questions, here are the links. You might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . My work is made possible by Geomys , an organization of professional Go maintainers, which is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs — We at Ava Labs , maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network ), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team. Post-quantum cryptography is about the future. We are worried about quantum computers that might exist… 5-50 (it's a hell of a range) years from now, and that might break all of asymmetrical encryption. 
(Digital signatures and key exchanges.) Post-quantum cryptography runs on classical computers. It's cryptography that we can do now that resists future quantum computers. Post-quantum cryptography is fast, actually. If you were convinced that for some reason it was slow, that's a common misconception. However, post-quantum cryptography is large. Which means that we have to send a lot more bytes on the wire to get the same results.

Max Woolf 2 weeks ago

Nano Banana can be prompt engineered for extremely nuanced AI image generation

You may not have heard about new AI image generation models as much lately, but that doesn’t mean that innovation in the field has stagnated: it’s quite the opposite. FLUX.1-dev immediately overshadowed the famous Stable Diffusion line of image generation models, while leading AI labs have released models such as Seedream, Ideogram, and Qwen-Image. Google also joined the action with Imagen 4. But all of those image models are vastly overshadowed by ChatGPT’s free image generation support in March 2025. After going organically viral on social media with the prompt, ChatGPT became the new benchmark for how most people perceive AI-generated images, for better or for worse. The model has its own image “style” for common use cases, which makes it easy to identify that ChatGPT made it. Two sample generations from ChatGPT. ChatGPT image generations often have a yellow hue in their images. Additionally, cartoons and text often have the same linework and typography. Of note, , the technical name of the underlying image generation model, is an autoregressive model. While most image generation models are diffusion-based to reduce the amount of compute needed to train and generate from such models, works by generating tokens in the same way that ChatGPT generates the next token, then decoding them into an image. It’s extremely slow at about 30 seconds to generate each image at the highest quality (the default in ChatGPT), but it’s hard for most people to argue with free. In August 2025, a new mysterious text-to-image model appeared on LMArena: a model code-named “nano-banana”. This model was eventually publicly released by Google as Gemini 2.5 Flash Image, an image generation model that works natively with their Gemini 2.5 Flash model. Unlike Imagen 4, it is indeed autoregressive, generating 1,290 tokens per image. 
After Nano Banana’s popularity pushed the Gemini app to the top of the mobile App Stores, Google eventually made Nano Banana the colloquial name for the model as it’s definitely more catchy than “Gemini 2.5 Flash Image”. The first screenshot on the iOS App Store for the Gemini app. Personally, I care little about what leaderboards say about which image generation AI looks the best. What I do care about is how well the AI adheres to the prompt I provide: if the model can’t follow the requirements I desire for the image (my requirements are often specific) then the model is a nonstarter for my use cases. At the least, if the model does have strong prompt adherence, any “looking bad” aspect can be fixed with prompt engineering and/or traditional image editing pipelines. After running Nano Banana through its paces with my comically complex prompts, I can confirm that thanks to Nano Banana’s robust text encoder, it has such extremely strong prompt adherence that Google has understated how well it works. Like ChatGPT, Google offers methods to generate images for free from Nano Banana. The most popular method is through Gemini itself, either on the web or in a mobile app, by selecting the “Create Image 🍌” tool. Alternatively, Google also offers free generation in Google AI Studio when Nano Banana is selected on the right sidebar, which also allows for setting generation parameters such as image aspect ratio and is therefore my recommendation. In both cases, the generated images have a visible watermark on the bottom right corner of the image. For developers who want to build apps that programmatically generate images from Nano Banana, Google offers the endpoint on the Gemini API. Each image generated costs roughly $0.04/image for a 1 megapixel image (e.g. 1024x1024 if a 1:1 square): on par with most modern popular diffusion models despite being autoregressive, and much cheaper than ’s $0.17/image. 
Working with the Gemini API is a pain and requires annoying image encoding/decoding boilerplate, so I wrote and open-sourced a Python package: gemimg, a lightweight wrapper around Gemini API’s Nano Banana endpoint that lets you generate images with a simple prompt, in addition to handling cases such as image input along with text prompts. I chose to use the Gemini API directly despite protests from my wallet for three reasons: a) web UIs to LLMs often have system prompts that interfere with user inputs and can give inconsistent output, b) using the API will not show a visible watermark in the generated image, and c) I have some prompts in mind that are…inconvenient to put into a typical image generation UI. Let’s test Nano Banana out, but since we want to test prompt adherence specifically, we’ll start with more unusual prompts. My go-to test case is: I like this prompt because not only is it an absurd prompt that gives the image generation model room to be creative, but the AI model also has to handle the maple syrup and how it would logically drip down from the top of the skull pancake and adhere to the bony breakfast. The result: That is indeed in the shape of a skull and is indeed made out of pancake batter, blueberries are indeed present on top, and the maple syrup does indeed drip down from the top of the pancake while still adhering to its unusual shape, albeit some trails of syrup disappear/reappear. It’s one of the best results I’ve seen for this particular test, and it’s one that doesn’t have obvious signs of “AI slop” aside from the ridiculous premise. Now, we can try another one of Nano Banana’s touted features: editing. Image editing, where the prompt targets specific areas of the image while leaving everything else as unchanged as possible, has been difficult with diffusion-based models until very recently with Flux Kontext. 
Autoregressive models in theory should have an easier time doing so as they have a better understanding of tweaking specific tokens that correspond to areas of the image. While most image editing approaches encourage using a single edit command, I want to challenge Nano Banana. Therefore, I gave Nano Banana the generated skull pancake, along with five edit commands simultaneously: All five of the edits are implemented correctly with only the necessary aspects changed, such as removing the blueberries on top to make room for the mint garnish, and the pooling of the maple syrup on the new cookie-plate is adjusted. I’m legit impressed. Now we can test more difficult instances of prompt engineering. One of the most compelling-but-underdiscussed use cases of modern image generation models is being able to put the subject of an input image into another scene. For open-weights image generation models, it’s possible to “train” the models to learn a specific subject or person even if they are not notable enough to be in the original training dataset using a technique such as finetuning the model with a LoRA using only a few sample images of your desired subject. Training a LoRA is not only very computationally intensive/expensive, but it also requires care and precision and is not guaranteed to work, speaking from experience. Meanwhile, if Nano Banana can achieve the same subject consistency without requiring a LoRA, that opens up many fun opportunities. Way back in 2022, I tested a technique that predated LoRAs known as textual inversion on the original Stable Diffusion in order to add a very important concept to the model: Ugly Sonic, from the initial trailer for the Sonic the Hedgehog movie back in 2019. One of the things I really wanted Ugly Sonic to do is to shake hands with former U.S. President Barack Obama, but that didn’t quite work out as expected. 2022 was a now-unrecognizable time where absurd errors in AI were celebrated. 
Can the real Ugly Sonic finally shake Obama’s hand? Of note, I chose this test case to assess image generation prompt adherence because image models may assume I’m prompting the original Sonic the Hedgehog and ignore the aspects of Ugly Sonic that are distinct to only him. Specifically, I’m looking for: I also confirmed that Ugly Sonic is not surfaced by Nano Banana, and prompting as such just makes a Sonic that is ugly, purchasing a back alley chili dog. I gave Gemini the two images of Ugly Sonic above (a close-up of his face and a full-body shot to establish relative proportions) and this prompt: That’s definitely Obama shaking hands with Ugly Sonic! That said, there are still issues: the color grading/background blur is too “aesthetic” and less photorealistic, Ugly Sonic has gloves, and the Ugly Sonic is insufficiently lanky. Back in the days of Stable Diffusion, the use of prompt engineering buzzwords such as , , and to generate “better” images in light of weak prompt text encoders were very controversial because it was difficult both subjectively and intuitively to determine if they actually generated better pictures. Obama shaking Ugly Sonic’s hand would be a historic event. What would happen if it were covered by The New York Times ? I added to the previous prompt: So there’s a few notable things going on here: That said, I only wanted the image of Obama and Ugly Sonic and not the entire New York Times A1. Can I just append to the previous prompt and have that be enough to generate the image only while maintaining the compositional bonuses? I can! The gloves are gone and his chest is white, although Ugly Sonic looks out-of-place in the unintentional sense. As an experiment, instead of only feeding two images of Ugly Sonic, I fed Nano Banana all the images of Ugly Sonic I had ( seventeen in total), along with the previous prompt. This is an improvement over the previous generated image: no eyebrows, white hands, and a genuinely uncanny vibe. 
Again, there aren’t many obvious signs of AI generation here: Ugly Sonic clearly has five fingers! That’s enough Ugly Sonic for now, but let’s recall what we’ve observed so far. There are two noteworthy things in the prior two examples: the use of a Markdown dashed list to indicate rules when editing, and the fact that specifying as a buzzword did indeed improve the composition of the output image. Many don’t know how image generating models actually encode text. In the case of the original Stable Diffusion, it used CLIP, whose text encoder was open-sourced by OpenAI in 2021 and unexpectedly paved the way for modern AI image generation. It is extremely primitive relative to modern standards for transformer-based text encoding, and only has a context limit of 77 tokens: a couple sentences, which is sufficient for the image captions it was trained on but not nuanced input. Some modern image generators use T5, an even older experimental text encoder released by Google that supports 512 tokens. Although modern image models can compensate for the age of these text encoders through robust data annotation while training the underlying image models, the text encoders cannot compensate for highly nuanced text inputs that fall outside the domain of general image captions. A marquee feature of Gemini 2.5 Flash is its support for agentic coding pipelines; to accomplish this, the model must be trained on extensive amounts of Markdown (which define code repository s and agentic behaviors in ) and JSON (which is used for structured output/function calling/MCP routing). Additionally, Gemini 2.5 Flash was also explicitly trained to understand objects within images, giving it the ability to create nuanced segmentation masks. Nano Banana’s multimodal encoder, as an extension of Gemini 2.5 Flash, should in theory be able to leverage these properties to handle prompts beyond the typical image-caption-esque prompts. 
That’s not to mention the vast annotated image training datasets Google owns as a byproduct of Google Images and likely trained Nano Banana upon, which should allow it to semantically differentiate between an image that is and one that isn’t, as with similar buzzwords. Let’s give Nano Banana a relatively large and complex prompt, drawing from the learnings above and see how well it adheres to the nuanced rules specified by the prompt: This prompt has everything : specific composition and descriptions of different entities, the use of hex colors instead of a natural language color, a heterochromia constraint which requires the model to deduce the colors of each corresponding kitten’s eye from earlier in the prompt, and a typo of “San Francisco” that is definitely intentional. Each and every rule specified is followed. For comparison, I gave the same command to ChatGPT—which in theory has similar text encoding advantages as Nano Banana—and the results are worse both compositionally and aesthetically, with more tells of AI generation. 1 The yellow hue certainly makes the quality differential more noticeable. Additionally, no negative space is utilized, and only the middle cat has heterochromia but with the incorrect colors. Another thing about the text encoder is how the model generated unique relevant text in the image without being given the text within the prompt itself: we should test this further. If the base text encoder is indeed trained for agentic purposes, it should at-minimum be able to generate an image of code. Let’s say we want to generate an image of a minimal recursive Fibonacci sequence in Python, which would look something like: I gave Nano Banana this prompt: It tried to generate the correct corresponding code but the syntax highlighting/indentation didn’t quite work, so I’ll give it a pass. Nano Banana is definitely generating code, and was able to maintain the other compositional requirements. 
For posterity, I gave the same prompt to ChatGPT. It made a similar attempt at the code, which indicates that code generation is indeed a fun quirk of multimodal autoregressive models. I don’t think I need to comment on the quality difference between the two images. An alternate explanation for text-in-image generation in Nano Banana would be the presence of prompt augmentation or a prompt rewriter, both of which are used to orient a prompt to generate more aligned images. Tampering with the user prompt is common with image generation APIs and isn’t an issue unless used poorly (which caused a PR debacle for Gemini last year), but it can be very annoying for testing. One way to verify if it’s present is to use adversarial prompt injection to get the model to output the prompt itself, e.g. if the prompt is being rewritten, asking it to generate the text “before” the prompt should get it to output the original prompt. That’s, uh, not the original prompt. Did I just leak Nano Banana’s system prompt completely by accident? The image is hard to read, but if it is the system prompt (the use of section headers implies it’s formatted in Markdown), then I can surgically extract parts of it to see just how the model ticks. These seem to track, but I want to learn more about those buzzwords in point #3. Huh, there’s a guard specifically against buzzwords? That seems unnecessary: my guess is that this rule is a hack intended to avoid the perception of model collapse by avoiding the generation of 2022-era AI images which would be annotated with those buzzwords. As an aside, you may have noticed the ALL CAPS text in this section, along with a threatening command. There is a reason I have been sporadically capitalizing words in previous prompts: caps does indeed work to ensure better adherence to the prompt (both for text and image generation), [2] and threats do tend to improve adherence.
Some have called it sociopathic, but this generation is proof that this brand of sociopathy is approved by Google’s top AI engineers. Tangent aside, since “previous” text didn’t reveal the prompt, we should check the “current” text. That worked with one peculiar problem: the text “image” is flat-out missing, which raises further questions. Is “image” parsed as a special token? Maybe prompting “generate an image” to a generative image AI is a mistake. I tried the last logical prompt in the sequence, which always raises an error: not surprising if there is no text after the original prompt. This section turned out unexpectedly long, but it’s enough to conclude that Nano Banana definitely shows signs of benefiting from being trained on more than just image captions. Some aspects of Nano Banana’s system prompt imply the presence of a prompt rewriter, but if there is indeed a rewriter, I am skeptical it is triggering in this scenario, which implies that Nano Banana’s text generation is indeed linked to its strong base text encoder. But just how large and complex can we make these prompts and have Nano Banana adhere to them? Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens. The intent of this large context window for Nano Banana is for multiturn conversations in Gemini where you can chat back-and-forth with the LLM on image edits. Given Nano Banana’s prompt adherence on small complex prompts, how well does the model handle larger-but-still-complex prompts? Can Nano Banana render a webpage accurately? I used an LLM to generate a bespoke single-page HTML file representing a Counter app, available here. The web page uses only vanilla HTML, CSS, and JavaScript, meaning that Nano Banana would need to figure out how they all relate in order to render the web page correctly. For example, the web page uses CSS Flexbox to set the ratio of the sidebar to the body in a 1/3 and 2/3 ratio respectively.
Feeding this prompt to Nano Banana: that’s honestly better than expected, and the prompt cost 916 tokens. It got the overall layout and colors correct: the issues are more in the text typography, leaked classes/styles/JavaScript variables, and the sidebar:body ratio. No, there’s no practical use for having a generative AI render a webpage, but it’s a fun demo. A similar approach that does have a practical use is providing structured, extremely granular descriptions of objects for Nano Banana to render. What if we provided Nano Banana a JSON description of a person with extremely specific details, such as hair volume, fingernail length, and calf size? As with prompt buzzwords, prompting AI models with JSON is a very controversial topic since images are not typically captioned with JSON, but there’s only one way to find out. I wrote a prompt augmentation pipeline of my own that takes in a user-input description of a quirky human character (e.g. a Mage) and outputs a very long and detailed JSON object representing that character with a strong emphasis on unique character design. [3] But generating a Mage is boring, so I asked my script to generate a male character that is an equal combination of a Paladin, a Pirate, and a Starbucks Barista: the resulting JSON is here. I then gave Nano Banana a prompt to generate a photorealistic character from that JSON. Beforehand I admit I didn’t know what a Paladin/Pirate/Starbucks Barista would look like, but he is definitely a Paladin/Pirate/Starbucks Barista. Let’s compare against the input JSON, taking elements from all areas of the JSON object (about 2600 tokens total) to see how well Nano Banana parsed it. Checking the JSON field-by-field, the generation also fits most of the smaller details noted. However, he is not photorealistic, which is what I was going for.
One curious behavior I found is that any approach of generating an image of a high fantasy character in this manner has a very high probability of resulting in a digital illustration, even after changing the target publication and adding “do not generate a digital illustration” to the prompt. The solution requires a more clever approach to prompt engineering: add phrases and compositional constraints that imply a heavy physicality to the image, such that a digital illustration would have more difficulty satisfying all of the specified conditions than a photorealistic generation. The image style is definitely closer to Vanity Fair (the photographer is reflected in his breastplate!), and most of the attributes in the previous illustration also apply; the hands/cutlass issue is also fixed. Several elements such as the shoulderplates are different, but not in a manner that contradicts the JSON field descriptions: perhaps that’s a sign that these JSON fields can be prompt engineered to be even more nuanced. Yes, prompting image generation models with HTML and JSON is silly, but “it’s not silly if it works” describes most of modern AI engineering. Nano Banana allows for very strong generation control, but there are several issues. Let’s go back to the original example that made ChatGPT’s image generation go viral. I ran that exact prompt through Nano Banana on a mirror selfie of myself: …I’m not giving Nano Banana a pass this time. Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model. I suspect that the autoregressive properties that allow Nano Banana’s excellent text editing make it too resistant to changing styles. That said, creating a new image does in fact work as expected, and creating a new image using the character provided in the input image with the specified style (as opposed to a style transfer) has occasional success.
Speaking of that, Nano Banana has essentially no restrictions on intellectual property as the examples throughout this blog post have made evident. Not only will it not refuse to generate images from popular IP like ChatGPT now does, you can have many different IPs in a single image. Normally, Optimus Prime is the designated driver. I am not a lawyer so I cannot litigate the legalities of training/generating IP in this manner or whether intentionally specifying an IP in a prompt but also stating “do not include any watermarks” is a legal issue: my only goal is to demonstrate what is currently possible with Nano Banana. I suspect that if precedent is set from existing IP lawsuits against OpenAI and Midjourney, Google will be in line to be sued. Another note is moderation of generated images, particularly around NSFW content, which is always important to check if your application uses untrusted user input. As with most image generation APIs, moderation is done against both the text prompt and the raw generated image. That said, while running my standard test suite for new image generation models, I found that Nano Banana is surprisingly one of the more lenient AI APIs. With some deliberate prompts, I can confirm that it is possible to generate NSFW images through Nano Banana; obviously I cannot provide examples. I’ve spent a very large amount of time overall with Nano Banana and although it has a lot of promise, some may ask why I am writing about how to use it to create highly-specific high-quality images during a time where generative AI has threatened creative jobs. The reason is that information asymmetry between what generative image AI can and can’t do has only grown in recent months: many still think that ChatGPT is the only way to generate images and that all AI-generated images are wavy AI slop with a piss yellow filter. The only way to counter this perception is through evidence and reproducibility.
That is why not only am I releasing Jupyter Notebooks detailing the image generation pipeline for each image in this blog post, but why I also included the prompts in this blog post proper; I apologize that it padded the length of the post to 26 minutes, but it’s important to show that these image generations are as advertised and not the result of AI boosterism. You can copy these prompts and paste them into AI Studio and get similar results, or even hack and iterate on them to find new things. Most of the prompting techniques in this blog post are already well-known by AI engineers far more skilled than myself, and turning a blind eye won’t stop people from using generative image AI in this manner. I didn’t go into this blog post expecting it to be a journey, but sometimes the unexpected journeys are the best journeys. There are many cool tricks with Nano Banana I cut from this blog post due to length, such as providing an image to specify character positions and also investigations of styles such as pixel art that most image generation models struggle with, but Nano Banana now nails. These prompt engineering shenanigans are only the tip of the iceberg. Jupyter Notebooks for the generations used in this post are split between the gemimg repository and a second testing repository . I would have preferred to compare the generations directly from the endpoint for an apples-to-apples comparison, but OpenAI requires organization verification to access it, and I am not giving OpenAI my legal ID.  ↩︎ Note that ALL CAPS will not work with CLIP-based image generation models at a technical level, as CLIP’s text encoder is uncased.  ↩︎ Although normally I open-source every script I write for my blog posts, I cannot open-source the character generation script due to extensive testing showing it may lean too heavily into stereotypes. 
Although adding guardrails successfully reduces the presence of said stereotypes and makes the output more interesting, there may be unexpected negative externalities if open-sourced.  ↩︎ A lanky build, as opposed to the real Sonic’s chubby build. A white chest, as opposed to the real Sonic’s beige chest. Blue arms with white hands, as opposed to the real Sonic’s beige arms with white gloves. Small pasted-on-his-head eyes with no eyebrows, as opposed to the real Sonic’s large recessed eyes and eyebrows. That is the most cleanly-rendered New York Times logo I’ve ever seen. It’s safe to say that Nano Banana trained on the New York Times in some form. Nano Banana is still bad at rendering text perfectly/without typos as most image generation models. However, the expanded text is peculiar: it does follow from the prompt, although “Blue Blur” is a nickname for the normal Sonic the Hedgehog. How does an image generating model generate logical text unprompted anyways? Ugly Sonic is even more like normal Sonic in this iteration: I suspect the “Blue Blur” may have anchored the autoregressive generation to be more Sonic-like. The image itself does appear to be more professional, and notably has the distinct composition of a photo from a professional news photographer: adherence to the “rule of thirds”, good use of negative space, and better color balance. Mostly check. (The hands are transposed and the cutlass disappears.)

0 views
Anton Zhiyanov 2 weeks ago

Go proposal: Context-aware Dialer methods

Part of the Accepted! series, explaining the upcoming Go changes in simple terms. Add context-aware, network-specific methods to the net.Dialer type. Ver. 1.26 • Stdlib • Low impact

The net.Dialer type connects to the address using a given network (protocol): TCP, UDP, IP, or Unix sockets. The new context-aware Dialer methods (DialTCP, DialUDP, DialIP, and DialUnix) combine the efficiency of the existing network-specific net functions (which skip address resolution and dispatch) with the cancellation capabilities of Dialer.DialContext.

The net package already has top-level functions for different networks (DialTCP, DialUDP, DialIP, and DialUnix), but these were made before context.Context was introduced, so they don't support cancellation. On the other hand, the net.Dialer type has a general-purpose DialContext method. It supports cancellation and can be used to connect to any of the known networks. However, if you already know the network type and address, using DialContext is a bit less efficient than network-specific functions like net.DialTCP due to:

Address resolution overhead: DialContext handles address resolution internally (like DNS lookups and converting the result to a typed address) using the network and address strings you provide. Network-specific functions accept a pre-resolved address object, so they skip this step.
Network type dispatch: DialContext must route the call to the protocol-specific dialer. Network-specific functions already know which protocol to use, so they skip this step.

So, network-specific functions in the net package are more efficient, but they don't support cancellation. The net.Dialer type supports cancellation, but it's less efficient. This proposal aims to solve the mismatch by adding context-aware, network-specific methods to the net.Dialer type. Also, adding new methods to the Dialer lets you use the newer address types from the netip package (like netip.AddrPort instead of net.TCPAddr), which are preferred in modern Go code.

Add four new methods to the net.Dialer: DialTCP, DialUDP, DialIP, and DialUnix. The method signatures are similar to the existing top-level functions, but they also accept a context and use the newer address types from the netip package.
Use the DialTCP method to connect to a TCP server, or the DialUnix method to connect to a Unix socket. In both cases, the dialing fails because I didn't bother to start the server in the playground :) 𝗣 49097 • 𝗖𝗟 657296

1 views
@bwplotka 2 weeks ago

The (lazy) Git UI You Didn't Know You Need

When my son was born last April, I had ambitious learning plans for the upcoming 5w paternity leave. As you can imagine, with two kids, life quickly verified this plan 🙃. I did eventually start some projects. One of the goals (sounding rebellious in the current AI hype cycle) was to learn and use neovim for coding. As a Goland aficionado, I (and my wrist) have always been tempted by no-mouse, OSS, gopls based, highly configurable dev setups.

0 views
Chris Coyier 3 weeks ago

The Great (Refrigerator) Divide

I like a good hot sauce. It’s not, like, my personality, but I enjoy them. There are enough different hot sauces that having a bit of a collection of them is reasonable. Cholula is a mainstay, working equally well on Mexican and egg-based dishes. Although I admit Tabasco is my general go-to. The green Tabasco works particularly well on Chipotle for whatever reason. Tapatio is right in there working maybe slightly better on the rice-y-er Mexican stuff. Red Hot on my chili or wings, absolutely. Those are all big names. Hot sauce has quite a long tail. There are plenty of Tier-2 (in popularity) sauces. Think Tiger Sauce, which is quite sweet and tends to work well on dishes that evoke that anyway (I’m thinking sautéed peppers and onions, for instance). Yellow Bird is having their hot sauce moment lately (I quite like the literally yellow habanero style), which has a tang to it that works well with chicken I think. Roasted veggies like carrots and broccoli? There I like the Portland all-timer Secret Aardvark. Much Asian food is born to pair with Sriracha, of course. I’m a big fan of Heatly lately. I’d call Tier-3 that whole genre of hot sauces people buy you when they go on vacation and stop into a store that only sells hot sauces (right next to the oil & vinegar shop!). These are the Johnny’s Burning Butthole sauces and Sally’s Simmering Sweetspot. They have cheesy cartoon graphics on them and there are hundreds and hundreds of them, and some of them are perfectly good, but you never quite know what you are going to get and it’s easy to forget even after you’ve tried it. Tier-4 is the bottle you got from the local restaurant in town with an ambitious chef trying to diversify income streams. I’ve taken too long to get to my point though. SOME of these hot sauces say “Refrigerate after opening.” on the bottle, a rule you probably shouldn’t break (unless you’re a Johnny’s Burning Butthole kinda guy). SOME of these hot sauces… don’t.
And my theory is: the bigger and more successful the hot sauce brand, the less likely it requires refrigeration. I ain’t trying to knock fridge brands. Yellow Bird, Heatly, Secret Aardvark are all favorites and require it (along with all Srirachas, which makes more sense as it’s so ketchup-like). I will admit though that I don’t love it. I don’t really want a whole area in my fridge that’s loaded with hot sauces. That veers too closely into personality territory. Much easier to have some basic cabinet space for them. So anyway. If you wanna go huge with your hot sauce brand, you can’t require refrigeration. The next big-Tabasco needs to sit right out on those diner tables with the salt and pepper.

0 views
Ahead of AI 3 weeks ago

Beyond Standard LLMs

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance. After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions with respect to what I think about alternative approaches. (I also recently gave a short talk about that at the PyTorch Conference 2025, where I also promised attendees to follow up with a write-up of these alternative approaches). So here it is! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. 
If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1, Mistral Small 3.1, and many more. (The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.) Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I have talked and written about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3: A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years.
Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk. The original attention mechanism scales quadratically with the sequence length. This is because the query (Q), key (K), and value (V) are n-by-d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens), so the product QKᵀ is an n-by-n matrix. (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article.) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism with a kernel feature map. Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QKᵀ. I don’t want to dwell too long on these older attempts.
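To make the kernel trick concrete, here is a small pure-Python sketch of non-causal linear attention with ϕ(x) = elu(x)+1. It is an illustration under assumptions, not the paper's code: the helper names (`phi`, `matmul`, `linear_attention`) are made up for this example, and the causal masking used in autoregressive models is left out. The point is only that ϕ(Q)(ϕ(K)ᵀV) gives the same result as (ϕ(Q)ϕ(K)ᵀ)V without ever building the n-by-n matrix.

```python
import math


def elu_plus_one(x: float) -> float:
    """Kernel feature map phi(x) = elu(x) + 1, which is always positive."""
    return x + 1.0 if x > 0 else math.exp(x)


def phi(M):
    """Apply the feature map element-wise to a matrix (list of rows)."""
    return [[elu_plus_one(x) for x in row] for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def linear_attention_quadratic(Q, K, V):
    """Reference version: explicitly builds the n-by-n matrix phi(Q) phi(K)^T."""
    scores = matmul(phi(Q), transpose(phi(K)))  # n x n
    out = matmul(scores, V)                     # n x d_v
    # Normalize each row by the sum of its (positive) scores.
    return [[x / sum(s) for x in row] for row, s in zip(out, scores)]


def linear_attention(Q, K, V):
    """Same result, but computes the small d-by-d_v summary phi(K)^T V first."""
    phi_q, phi_k = phi(Q), phi(K)
    kv = matmul(transpose(phi_k), V)            # d x d_v, independent of n in size
    k_sum = [sum(col) for col in zip(*phi_k)]   # d-vector used for normalization
    out = matmul(phi_q, kv)                     # n x d_v
    norm = [sum(q * k for q, k in zip(row, k_sum)) for row in phi_q]
    return [[x / z for x in row] for row, z in zip(out, norm)]
```

For a fixed embedding dimension, the second function does work proportional to n rather than n², which is the efficiency argument made above.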
But the bottom line was that they reduced both time and memory complexity from O(n²) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. In the second half of this year, there has been a revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2. (The DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention.
For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I want to discuss in the next few sections as one example of a hybrid attention architecture. Let’s start with Qwen3-Next, which replaced the regular attention mechanism with a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model supported 32k natively, and 131k with YaRN scaling). Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between these two. Specifically, as mentioned earlier, they alternate in a 3:1 ratio, with three Gated DeltaNet blocks for every gated attention block. Otherwise, the architecture is pretty standard and similar to Qwen3. Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book) for illustration purposes. After computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically.
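The two ideas in this passage, the 3:1 layer schedule and the sigmoid output gate, can be sketched in a few lines of plain Python. This is a simplified illustration rather than the Qwen3-Next or LLMs from Scratch code: the function names are hypothetical, the gate works on a single token vector, and placing the gated attention block last in each group of four is an assumption.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def layer_schedule(n_layers: int = 48, ratio: int = 3) -> list[str]:
    """Return one layer type per block: `ratio` DeltaNet blocks per attention block.

    Assumption: the gated attention block comes last in each group of (ratio + 1).
    """
    return [
        "gated_attention" if (i + 1) % (ratio + 1) == 0 else "gated_deltanet"
        for i in range(n_layers)
    ]


def gate_attention_output(x, attn_out, W_gate):
    """Scale each feature of the attention output by a sigmoid gate computed from x.

    x        : input vector for one token (length d_in)
    attn_out : attention output for the same token (length d_out)
    W_gate   : hypothetical gate projection, d_out rows of length d_in
    """
    gate = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_gate]
    return [g * a for g, a in zip(gate, attn_out)]
```

With all-zero gate weights every gate value is 0.5, so the attention output is simply halved; learned weights let the model pass some features through almost unchanged and suppress others.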
The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers familiar with the classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet.
However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, and β (update gate) controls how strongly new inputs modify the state. In code, a simplified version of the Gated DeltaNet depicted above can be implemented along the lines of the official implementation by the Qwen3 team. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n-by-d matrices, n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an n-by-n attention matrix that is multiplied by an n-by-d dimensional value matrix V. Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n. In Gated DeltaNet, there’s no n-by-n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what the recurrent state update implements, where S is the state that gets updated recurrently for each time step t. And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory.
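The recurrent update described above can be sketched in plain Python as follows. This is a deliberately stripped-down illustration, not the Qwen3 team's implementation: it handles a single head, uses scalar α and β per step, and omits query/key normalization, the convolutional mixing, and the output gate. The function names are made up for this example.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent update of the d_k-by-d_v memory S for a single token."""
    d_k, d_v = len(k), len(v)
    # 1) Decay: alpha in [0, 1] controls how much of the old memory is kept.
    S = [[alpha * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Predict the value the memory currently associates with key k.
    pred = [sum(S[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: write only the prediction error, scaled by beta.
    delta = [beta * (v[j] - pred[j]) for j in range(d_v)]
    S = [[S[i][j] + k[i] * delta[j] for j in range(d_v)] for i in range(d_k)]
    # 4) Read from the updated memory with the query.
    y = [sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]
    return S, y


def gated_delta_scan(qs, ks, vs, alphas, betas):
    """Process a sequence token by token; cost grows linearly with its length."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of n
    outputs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, y = gated_delta_step(S, q, k, v, a, b)
        outputs.append(y)
    return S, outputs
```

Writing the same unit-length key twice with β = 1 overwrites the stored value rather than adding to it, which is the difference between the delta rule and a plain additive (Hebbian-style) update.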
(And the final output gate, not shown in the snippet above, is similar to gated attention; it controls how much of the output is kept.) So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with context length. The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention. Gated DeltaNet can, to some extent, still capture context, but it has to go through the memory ( S ) bottleneck. That memory has a fixed size and is thus more efficient, but it compresses past context into a single hidden state, similar to RNNs. That’s why the Qwen3-Next and Kimi Linear architectures don’t replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier. In the previous section, we discussed the advantage of DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don’t grow the KV cache. (For more information about KV caching, see my Understanding and Coding the KV Cache in LLMs from Scratch article.) Instead, as mentioned earlier, they keep a fixed-size recurrent state, so memory stays constant with context length. For a regular multi-head attention (MHA) layer, we can compute the KV cache size as follows: batch_size × n_tokens × n_heads × d_head × 2 × bytes. (The 2 multiplier is there because we have both keys and values that we store in the cache.) For the simplified DeltaNet version implemented above, we have: batch_size × n_heads × d_head × d_head × bytes. Note that the memory size doesn’t have a context length (n_tokens) dependency. Also, we have only the memory state S that we store instead of separate keys and values, hence 2 × bytes becomes just bytes. However, note that we now have a quadratic d_head × d_head term in here.
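The cache-size comparison above can be turned into a small calculator. The sketch below uses the settings quoted for Figure 10 (emb_dim=2048, n_heads=16, n_layers=48, bf16, so a head dimension of 128 and 2 bytes per element). The function names, the per-layer multiplication, and the 36/12 layer split for the 3:1 hybrid are assumptions for illustration, and the convolutional-mixing state is ignored.

```python
def kv_cache_bytes_mha(batch_size, n_tokens, n_heads, d_head, n_layers, bytes_per_elem=2):
    # Factor 2: both keys and values are cached for every token
    return batch_size * n_tokens * n_heads * d_head * 2 * bytes_per_elem * n_layers

def state_bytes_deltanet(batch_size, n_heads, d_head, n_layers, bytes_per_elem=2):
    # One d_head x d_head state matrix per head; there is no n_tokens term
    return batch_size * n_heads * d_head * d_head * bytes_per_elem * n_layers

n_heads, d_head, n_layers = 16, 128, 48        # d_head = 2048 / 16
n_delta, n_full = 36, 12                       # 3:1 ratio of DeltaNet to full attention layers

for n_tokens in (1_024, 32_768, 262_144):
    full = kv_cache_bytes_mha(1, n_tokens, n_heads, d_head, n_layers)
    hybrid = (kv_cache_bytes_mha(1, n_tokens, n_heads, d_head, n_full)
              + state_bytes_deltanet(1, n_heads, d_head, n_delta))
    print(f"{n_tokens:>7} tokens: full={full / 2**30:.2f} GiB, hybrid={hybrid / 2**30:.2f} GiB")
```

In this toy setting the DeltaNet layers contribute a constant amount, so the hybrid's growth with context length comes only from its full-attention layers.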
The quadratic term comes from the state S, which stores a square matrix (head dimension by head dimension) for each head. But that’s usually nothing to worry about, as the head dimension is usually relatively small. For instance, it’s 128 in Qwen3-Next. The full version with the convolutional mixing is a bit more complex, including the kernel size and so on, but the formulas above should illustrate the main trend and motivation behind the Gated DeltaNet. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention, as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section). The MLA in Kimi Linear, depicted in the upper right box in Figure 11 above, does not use the sigmoid gate. This omission was intentional so that the authors could compare the architecture more directly to standard MLA; however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in the multi-head latent attention (MLA) layers (global attention).
As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. 
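The difference between a scalar decay gate and a channel-wise decay gate, as described above, can be illustrated on a toy memory state. The shapes, the gate values, and the choice to apply the channel-wise gate along the key dimension (rows) of the state are assumptions for illustration; the sketch does not reproduce the actual Kimi Delta Attention kernel.

```python
import numpy as np

d = 4
S = np.ones((d, d))    # toy memory state for one head (rows: key channels, columns: value channels)

# Scalar decay: a single value per head shrinks the whole memory uniformly
alpha_scalar = 0.9
S_scalar = alpha_scalar * S

# Channel-wise decay: one value per feature dimension, so each row decays at its own rate
alpha_channel = np.array([0.99, 0.9, 0.5, 0.1])   # hypothetical gate values
S_channel = alpha_channel[:, None] * S

print(S_scalar[:, 0])   # every channel keeps the same fraction of its memory
print(S_channel[:, 0])  # some channels keep almost everything, others are nearly reset
```

With the channel-wise variant, the model can hold on to some features for a long time while quickly forgetting others, which is the extra control over memory the authors refer to.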
Furthermore, according to the ablation studies in the DeepSeek-V2 paper, MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variants once again promising for larger state-of-the-art models. That being said, at 48B parameters, Kimi Linear is large, but it is still 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example, Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models for generating images (as a successor to generative adversarial networks), which are based on the Denoising Diffusion Probabilistic Models paper from 2020 and were later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left).
With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. 
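The masking-based corruption and the step-by-step unmasking seen in the animation can be sketched in a few lines of plain Python. This is a toy under loud assumptions: `toy_predict` is a hypothetical stand-in that simply returns the target tokens (a real diffusion LLM would run a transformer over the partially masked sequence here), and the positions to unmask are picked at random, whereas actual models may use other schedules such as confidence-based selection.

```python
import random

MASK = "[MASK]"

def corrupt(tokens, mask_prob, rng):
    """Forward process: replace each token by MASK with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

def toy_predict(seq, target):
    """Hypothetical stand-in for the denoising model: a guess for every position."""
    return list(target)

def denoise(target, num_steps, rng):
    """Reverse process: start fully masked and unmask a share of the positions per step."""
    seq = [MASK] * len(target)
    history = [list(seq)]
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = toy_predict(seq, target)            # all positions are predicted in parallel
        steps_left = num_steps - step
        n_unmask = -(-len(masked) // steps_left)    # ceiling division: finish by the last step
        for i in rng.sample(masked, n_unmask):      # random selection (assumption)
            seq[i] = preds[i]
        history.append(list(seq))
    return seq, history

rng = random.Random(0)
target = "The capital of France is Paris .".split()
print(corrupt(target, 0.5, rng))                    # one corrupted training example
final, history = denoise(target, num_steps=4, rng=rng)
for seq in history:
    print(" ".join(seq))
```

The number of denoising steps (4 here) is much smaller than the number of tokens that would have to be generated one by one, which is the efficiency argument made above.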
If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. 
This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y_t = “New” | X). In the next iteration, it would then condition on the previously generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y_{t+1} = “York” | X, y_t = “New”) and p(y_{t+1} = “Orleans” | X, y_t = “New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens according to p(y_t = “New” | X) and p(y_{t+1} = “City” | X), leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step.
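The “New City” failure mode can be reproduced with a tiny toy distribution. The sketch below assumes the four cities from the prompt are equally likely and compares picking each token position independently (as in naive one-step parallel decoding) with picking the second token conditioned on the first. The variable names and the greedy selection are assumptions for illustration, not the setup used in the ParallelBench paper.

```python
from collections import Counter

# Toy distribution: four equally likely two-token answers (assumption)
cities = [("New", "York"), ("New", "Orleans"), ("Mexico", "City"), ("Panama", "City")]

def most_likely(counter):
    """Return the token with the highest count."""
    return max(counter, key=counter.get)

# Parallel: each position is chosen from its own marginal, ignoring the other position
first_marginal = Counter(first for first, _ in cities)
second_marginal = Counter(second for _, second in cities)
parallel_answer = (most_likely(first_marginal), most_likely(second_marginal))

# Sequential: the second token is chosen conditioned on the first token
first_token = most_likely(first_marginal)
second_given_first = Counter(second for first, second in cities if first == first_token)
sequential_answer = (first_token, most_likely(second_given_first))

print(parallel_answer, parallel_answer in cities)      # ('New', 'City') False
print(sequential_answer, sequential_answer in cities)
```

Each token of the parallel answer is the most likely choice for its position, yet the combination is not one of the valid cities. Conditioning on the first token removes the mismatch, and the iterative refinement steps discussed next are how diffusion LLMs recover part of that conditioning.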
So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. 
For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state: “Rapid response: Generates content significantly faster than even our fastest model so far.” And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at the cost of slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models would deserve a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience.
For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world model, which serves as a mental simulation of the environment and allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name, World Models, which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence. It was essentially about mapping an alternative path to AI instead of LLMs. That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, I found the Code World Model paper (Sep 30, 2025) to be the first world model paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention.
Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). 
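To get a feel for what a step-by-step execution trace like the one in Figure 19 contains, the sketch below records local variable states line by line using Python’s built-in `sys.settrace` hook. This only illustrates the kind of signal described above; the helper `trace_locals`, the example function `count_vowels`, and the output format are assumptions for this example and not the trace format used in the CWM paper.

```python
import sys

def trace_locals(func, *args):
    """Run func(*args) and record the local variables before each line executes."""
    trace = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None                      # ignore frames of other functions
        if event == "line":
            line_offset = frame.f_lineno - code.co_firstlineno
            trace.append((line_offset, dict(frame.f_locals)))
        elif event == "return":
            trace.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)               # always restore the previous trace hook
    return result, trace

def count_vowels(text):
    n = 0
    for ch in text:
        if ch in "aeiou":
            n += 1
    return n

result, trace = trace_locals(count_vowels, "code")
for step in trace:
    print(step)
```

Each printed entry pairs a line offset inside the function with the variable values at that point, which is the sort of intermediate program state a model can be trained to predict instead of only predicting the next code token.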
Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM), a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section covers small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper. And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM).
Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights. It’s important to note that TRM is not a language model operating on text. However, because (a) it’s a transformer-based architecture, (b) reasoning is now a central focus in LLM research, and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor) I decided to include it here. While TRM could be extended to textual question-answer tasks in the future, TRM currently works on grid-based inputs and outputs. In other words, both the “question” and the “answer” are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences. HRM consists of two small transformer modules (each 4 blocks) that communicate across recursion levels. TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that’s likely to make it easier to compare against HRM.) TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few. HRM includes an explicit halting mechanism to determine when to stop iterating. TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop iterating. Performance-wise, TRM performs really well compared to HRM, as shown in the figure below. 
Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers lead to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required. Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length. While HRM and TRM achieve really good reasoning performance on these benchmarks, comparing them to large LLMs is not quite fair. HRM and TRM are specialized models for tasks like ARC, Sudoku, and Maze pathfinding, whereas LLMs are generalists. Sure, HRM and TRM can be adopted for other tasks as well, but they have to be specially trained on each task. So, in that sense, we can perhaps think of HRM and TRM as efficient pocket calculators, whereas LLMs are more like computers, which can do a lot of other things as well. Still, these recursive architectures are exciting proofs of concept that highlight how small, efficient models can “reason” through iterative self-refinement. Perhaps, in the future, such models could act as reasoning or planning modules embedded within larger tool-using LLM systems. For now, LLMs remain ideal for broad tasks, but domain-specific recursive models like TRM can be developed to solve certain problems more efficiently once the target domain is well understood. Beyond the Sudoku, Maze finding, and ARC proof-of-concept benchmarks, there are possibly lots of use cases in the physics and biology domains where such models could find use. As an interesting tidbit, the author shared that it took less than $500 to train this model, with 4 H100s for around 2 days. I am delighted to see that it’s still possible to do interesting work without a data center.
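The two alternating TRM updates described earlier (latent state first, then answer) can be sketched as a plain forward loop. This is a structural toy under assumptions, not the paper’s implementation: `net` is a random, untrained stand-in for the tiny network, the loop counts `n_latent` and `n_loops` are placeholder values, and because NumPy has no autograd, the distinction between the no-grad loops and the final gradient loop is only marked in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy feature size (assumption)
W_x, W_y, W_z = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))

def net(x, y, z):
    """Random, untrained stand-in for the single tiny network."""
    return np.tanh(x @ W_x + y @ W_y + z @ W_z)

def latent_recursion(x, y, z, n_latent=6):
    for _ in range(n_latent):
        z = net(x, y, z)                  # 1) update the latent state from question, answer, latent
    y = net(np.zeros_like(x), y, z)       # 2) update the answer from the latent state
    return y, z

def refine(x, n_steps=16, n_loops=3):
    y = np.zeros_like(x)                  # initial answer
    z = np.zeros_like(x)                  # initial latent reasoning state
    answers = []
    for _ in range(n_steps):              # up to 16 refinement steps
        for _ in range(n_loops):          # in training, all but the last loop would run without gradients
            y, z = latent_recursion(x, y, z)
        answers.append(y.copy())
    return answers

x = rng.normal(size=(4, d))               # toy "question" with 4 positions
answers = refine(x)
print(len(answers), answers[-1].shape)  # 16 (4, 8)
```

The same small network is reused in every iteration, so the effective depth comes from the recursion rather than from stacking more layers.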
I originally planned to cover all model categories in the overview figure, but since the article ended up longer than I expected, I will have to save xLSTMs, Liquid Foundation Models, Transformer-RNN hybrids, and State Space Models for another time (although Gated DeltaNet already gave a taste of State Space Models and recurrent designs). As a conclusion to this article, I want to repeat the earlier words, i.e., that standard autoregressive transformer LLMs are proven and have stood the test of time so far. They are also, if efficiency is not the main factor, the best we have for now.

Traditional Decoder-Style, Autoregressive Transformers
+ Proven & mature tooling
+ “well-understood”
+ Scaling laws
+ SOTA
- Expensive training
- Expensive inference (except for aforementioned tricks)

If I were to start a new LLM-based project today, autoregressive transformer-based LLMs would be my first choice. I definitely find the upcoming attention hybrids very promising; they are especially interesting when working with longer contexts where efficiency is a main concern.

Linear Attention Hybrids
+ Same as decoder-style transformers
+ Cuts FLOPs/KV memory at long-context tasks
- Added complexity
- Trades a bit of accuracy for efficiency

On the more extreme end, text diffusion models are an interesting development. I’m still somewhat skeptical about how well they perform in everyday use, as I’ve only tried a few quick demos. Hopefully, we’ll soon see a large-scale production deployment with Google’s Gemini Diffusion that we can test on daily and coding tasks, and then find out how people actually feel about them.

Text Diffusion Models
+ Iterative denoising is a fresh idea for text
+ Better parallelism (no next-token dependence)
- Can’t stream answers
- Doesn’t benefit from CoT?
- Tricky tool-calling?
- Solid models but not SOTA

While the main selling point of text diffusion models is improved efficiency, code world models sit on the other end of the spectrum, where they aim to improve modeling performance. As of this writing, coding models, based on standard LLMs, are mostly improved through reasoning techniques, yet if you have tried them on trickier challenges, you have probably noticed that they (more or less) still fall short and can’t solve many of the trickier coding problems well. I find code world models particularly interesting and believe they could be an important next step toward developing more capable coding systems.

Code World Model
+ Promising approach to improve code understanding
+ Verifiable intermediate states
- Inclusion of executable code traces complicates training
- Code running adds latency

Lastly, we covered small recursive transformers such as hierarchical and tiny reasoning models. These are super interesting proof-of-concept models. However, as of today, they are primarily puzzle solvers, not general text or coding models. So, they are not in the same category as the other non-standard LLM alternatives covered in this article. Nonetheless, they are very interesting proofs of concept, and I am glad researchers are working on them. Right now, LLMs like GPT-5, DeepSeek R1, Kimi K2, and so forth are developed as general-purpose models for free-form text, code, math problems, and much more. They feel like a brute-force, jack-of-all-trades approach that we use on a variety of tasks, from general knowledge questions to math and code. However, when we perform the same task repeatedly, such brute-force approaches become inefficient and may not even be ideal in terms of specialization. This is where tiny recursive transformers become interesting: they could serve as lightweight, task-specific models that are both efficient and purpose-built for repeated or structured reasoning tasks.
Also, I can see them as potential “tools” for other tool-calling LLMs; for instance, when LLMs use Python or calculator APIs to solve math problems, special tiny reasoning models could fill this niche for other types of puzzle- or reasoning-like problems. Small Recursive Transformers + Very small architecture + Good generalization on puzzles - Special purpose models - Limited to puzzles (so far) This has been a long article, but I hope you discovered some of the fascinating approaches that often stay outside the spotlight of mainstream LLMs. And if you’ve been feeling a bit bored by the more or less conventional LLM releases, I hope this helped rekindle your excitement about AI again because there’s a lot of interesting work happening right now! This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. 
However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) 1. Transformer-Based LLMs Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. 
Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. 2. (Linear) Attention Hybrids Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . 2.1 Traditional Attention and Quadratic Costs The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. 2.2 Linear attention Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. 
For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism as softmax(QKᵀ)V ≈ ϕ(Q)(ϕ(K)ᵀV), up to a row-wise normalization. Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QKᵀ. I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n²) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM.

2.3 Linear Attention Revival

In the second half of this year, there has been a revival of linear attention variants, as well as a bit of a back-and-forth from some model developers, as illustrated in the figure below.

Figure 5: An overview of the linear attention hybrid architectures.

The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2. (DeepSeek V3.2’s sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention.
The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I want to discuss in the next few sections as one example of a hybrid attention architecture.

2.4 Qwen3-Next

Let’s start with Qwen3-Next, which replaced the regular attention mechanism with a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage. (The previous 235B-A22B model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio, as shown in the figure below.

Figure 6: Qwen3-Next with gated attention and Gated DeltaNet.

As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between the two. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, every three Gated DeltaNet blocks are followed by one gated attention block, and this pattern repeats across the 48 layers. Otherwise, the architecture is pretty standard and similar to Qwen3:

Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right).

So, what are gated attention and Gated DeltaNet?

2.5 Gated Attention

Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate.
This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. 2.6 Gated DeltaNet Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) 
Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.)

Figure 8: Gated attention compared to Gated DeltaNet.

However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate), which controls how fast the memory decays or resets over time, and β (update gate), which controls how strongly new inputs modify the state. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.)

So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ/√d)V, where Q and K are n-by-d matrices, n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an n-by-n attention matrix that is multiplied by an n-by-d value matrix V.

Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n.

In Gated DeltaNet, there’s no n-by-n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. Here, S is the state that gets updated recurrently at each time step t.
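To make the recurrent state update more concrete, here is a minimal sketch of one gated delta-rule step in plain Python. It is an illustration under simplifying assumptions, not the Qwen3-Next or Kimi Linear implementation: it uses a single head, scalar α and β values per step, plain lists instead of tensors, and it leaves out normalization, the convolutional mixing, and the output gate. The function name gated_delta_step is hypothetical.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step; S is a d_k-by-d_v memory stored as nested lists."""
    d_k, d_v = len(S), len(S[0])
    # 1) Decay: alpha in [0, 1] controls how much of the old memory survives.
    S = [[alpha * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read what the memory currently predicts for this key (S^T k).
    pred = [sum(S[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: difference between the new value and the prediction,
    #    scaled by the update gate beta.
    delta = [beta * (v[j] - pred[j]) for j in range(d_v)]
    # 4) Write the correction back as an outer product of key and delta.
    S = [[S[i][j] + k[i] * delta[j] for j in range(d_v)] for i in range(d_k)]
    # 5) Output for this token: query the updated memory (S^T q).
    y = [sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]
    return S, y


# Toy usage: the memory has a fixed size, no matter how many tokens arrive.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S, y1 = gated_delta_step(S, q=key, k=key, v=[3.0, 4.0], alpha=1.0, beta=1.0)
S, y2 = gated_delta_step(S, q=key, k=key, v=[5.0, 6.0], alpha=0.5, beta=1.0)
print(y1, y2)
```

In this toy run, the second write reuses the same key with β = 1, so the stored value for that key is replaced rather than accumulated. The state S stays a fixed d_k-by-d_v matrix, which is why this kind of layer does not need a cache that grows with the sequence length.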
And the gates control how that memory changes:

α (alpha) regulates how much of the old memory to forget (decay).

β (beta) regulates how much the current token at time step t updates the memory.

Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet.

2.8 Kimi Linear vs. Qwen3-Next

Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention, as shown in the figure below.

Figure 11: Qwen3-Next and Kimi Linear side by side.

Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section).

The MLA in Kimi Linear, depicted in the upper right box in Figure 11 above, does not use the sigmoid gate. This omission was intentional so that the authors could compare the architecture more directly to standard MLA; however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention (MLA) layers (global attention).
As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. 2.9 Kimi Delta Attention Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. 
Furthermore, according to the ablation studies in the DeepSeek-V2 paper, MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variants once again promising for larger state-of-the-art models. That being said, at 48B parameters, Kimi Linear is still 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model.

2.10 The Future of Attention Hybrids

Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example, Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art.

3. Text Diffusion Models

A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) and were later implemented, scaled, and popularized by Stable Diffusion and others.

Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left).

3.1 Why Work on Text Diffusion?
With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. 3.2 The Denoising Process The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. 
If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. 
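The iterative unmasking described above can be sketched in a few lines of Python. This is a toy illustration, not LLaDA's actual sampler: toy_predict is a hypothetical stand-in for the bidirectional transformer (it simply looks up the right token and reports a made-up confidence that rises when neighboring tokens are already revealed), and the schedule that reveals an equal share of the most confident positions per step is an assumption made for simplicity.

```python
MASK = "[MASK]"


def toy_predict(seq, target):
    """Hypothetical model: returns {position: (token, confidence)} for masked slots."""
    preds = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        neighbors = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        revealed = sum(t != MASK for t in neighbors)
        # Made-up confidence: more revealed context means a more confident guess.
        preds[i] = (target[i], 0.5 + 0.2 * revealed)
    return preds


def denoise(length, steps, predict):
    """Start fully masked and reveal the most confident predictions at each step."""
    seq = [MASK] * length
    history = [list(seq)]
    for step in range(steps):
        preds = predict(seq)  # all masked positions are predicted in parallel
        if not preds:
            break
        # Reveal an equal share of the remaining masks per step (ceiling division).
        n_unmask = -(-len(preds) // (steps - step))
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in most_confident[:n_unmask]:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return seq, history


target = "the cat sat on the mat".split()
final, history = denoise(len(target), steps=3,
                         predict=lambda s: toy_predict(s, target))
for step, state in enumerate(history):
    print(step, " ".join(state))
```

The point of the sketch is the control flow: six tokens are produced in three passes instead of six, and each pass conditions on whatever was revealed in the previous ones.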
3.3 Autoregressive vs Diffusion LLMs

Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper.

Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding.

For example, consider the following prompt:

> “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama City?”

Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y_t = “New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y_{t+1} = “York” | X, y_t = “New”) and p(y_{t+1} = “Orleans” | X, y_t = “New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y_t = “New” | X) and p(y_{t+1} = “City” | X), leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.)

In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true.
A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. 3.4 Text Diffusion Today It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. 
Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. 4. World Models So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. 
4.1 The Main Idea Behind World Models

Originally, the idea behind world models was to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup.

Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world model, which serves as a mental simulation of the environment and allows it to predict outcomes and plan actions before executing them in the real world.

The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name, World Models, which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence. It was essentially about mapping an alternative path to AI instead of LLMs.

4.2 From Vision to Code

That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended).
This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. 4.3 Code World Models Vs Regular LLMs for Code So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. 
Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort).

Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387.

5. Small Recursive Transformers

You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM), a new approach to small recursive transformers has recently gained a lot of attention in the research community.

Figure 21: LLM landscape overview; this section covers small recursive transformers.

More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge.

Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis.

The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper.
And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark.

Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871.

In the remainder of this section, let’s take a look at TRM in a bit more detail.

5.1 What Does Recursion Mean Here?

TRM refines its answer through two alternating updates:

1. It computes a latent reasoning state from the current question and answer.
2. It then updates the answer based on that latent state.

Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM).

The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me:

1. Fewer layers lead to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%.
2. Attention is not required. Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length.

Filippo Valsorda 4 weeks ago

Claude Code Can Debug Low-level Cryptography

Over the past few days I wrote a new Go implementation of ML-DSA, a post-quantum signature algorithm specified by NIST last summer. I livecoded it all over four days, finishing it on Thursday evening. Except… Verify was always rejecting valid signatures. I was exhausted, so I tried debugging for half an hour and then gave up, with the intention of coming back to it the next day with a fresh mind. On a whim, I figured I would let Claude Code take a shot while I read emails and resurfaced from hyperfocus. I mostly expected it to flail in some maybe-interesting way, or rule out some issues. Instead, it rapidly figured out a fairly complex low-level bug in my implementation of a relatively novel cryptography algorithm. I am sharing this because it made me realize I still don’t have a good intuition for when to invoke AI tools, and because I think it’s a fantastic case study for anyone who’s still skeptical about their usefulness. Full disclosure: Anthropic gave me a few months of Claude Max for free. They reached out one day and told me they were giving it away to some open source maintainers. Maybe it’s a ploy to get me hooked so I’ll pay for it when the free coupon expires. Maybe they hoped I’d write something like this. Maybe they are just nice. Anyway, they made no request or suggestion to write anything public about Claude Code. Now you know. I started Claude Code v2.0.28 with Opus 4.1 and no system prompts, and gave it the following prompt (typos included): I implemented ML-DSA in the Go standard library, and it all works except that verification always rejects the signatures. I know the signatures are right because they match the test vector. YOu can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Look for potential reasons the signatures don’t verify. ultrathink I spot-checked and w1 is different from the signing one. To my surprise, it pinged me a few minutes later with a complete fix . 
Maybe I shouldn’t be surprised! Maybe it would have been clear to anyone more familiar with AI tools that this was a good AI task: a well-scoped issue with failing tests. On the other hand, this is a low-level issue in a fresh implementation of a complex, relatively novel algorithm. It figured out that I had merged and into a single function for using it from Sign, and then reused it from Verify where already produces the high bits, effectively taking the high bits of w1 twice in Verify. Looking at the log , it loaded the implementation into the context and then immediately figured it out, without any exploratory tool use! After that it wrote itself a cute little test that reimplemented half of verification to confirm the hypothesis, wrote a mediocre fix, and checked the tests pass. I threw the fix away and refactored to take high bits as input, and changed the type of the high bits, which is both clearer and saves a round-trip through Montgomery representation. Still, this 100% saved me a bunch of debugging time. On Monday, I had also finished implementing signing with failing tests. There were two bugs, which I fixed in the following couple evenings. The first one was due to somehow computing a couple hardcoded constants (1 and -1 in the Montgomery domain) wrong . It was very hard to find, requiring a lot of deep printfs and guesswork. Took me maybe an hour or two. The second one was easier: a value that ends up encoded in the signature was too short (32 bits instead of 32 bytes) . It was relatively easy to tell because only the first four bytes of the signature were the same, and then the signature lengths were different. I figured these would be an interesting way to validate Claude’s ability to help find bugs in low-level cryptography code, so I checked out the old version of the change with the bugs (yay Jujutsu!) 
and kicked off a fresh Claude Code session with this prompt: I am implementing ML-DSA in the Go standard library, and I just finished implementing signing, but running the tests against a known good test vector it looks like it goes into an infinite loop, probably because it always rejects in the Fiat-Shamir with Aborts loop. You can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Figure out why it loops forever, and get the tests to pass. ultrathink It spent some time doing printf debugging and chasing down incorrect values very similarly to how I did it, and then figured out and fixed the wrong constants . Took Claude definitely less than it took me. Impressive. It gave up after fixing that bug even if the tests still failed, so I started a fresh session (on the assumption that the context on the wrong constants would do more harm than good investigating an independent bug), and gave it this prompt: I am implementing ML-DSA in the Go standard library, and I just finished implementing signing, but running the tests against a known good test vector they don’t match. You can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Figure out what is going on. ultrathink It took a couple wrong paths, thought for quite a bit longer, and then found this one too . I honestly expected it to fail initially. It’s interesting how Claude found the “easier” bug more difficult. My guess is that maybe the large random-looking outputs of the failing tests did not play well with its attention. The fix it proposed was updating only the allocation’s length and not its capacity, but whatever, the point is finding the bug, and I’ll usually want to throw away the fix and rewrite it myself anyway. Three out of three one-shot debugging hits with no help is extremely impressive . 
Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it. As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete or “make me a PR.” For example, how nice would it be if every time tests fail, an LLM agent was kicked off with the task of figuring out why, and only notified us if it did before we fixed it? For more low-level cryptography bugs implementations, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . I promise I almost never post about AI. Enjoy the silliest floof. Surely this will help redeem me in the eyes of folks who consider AI less of a tool and more of something to be hated or loved. My work is made possible by Geomys , an organization of professional Go maintainers, which is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs — We at Ava Labs , maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network ), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. 
We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

Lukáš Lalinský 1 months ago

How I turned Zig into my favorite language to write network programs in

I’ve been watching the Zig language for a while now, given that it was created for writing audio software (low-level, no allocations, real time). I never paid too much attention though, it seemed a little weird to me and I didn’t see the real need. Then I saw a post from Andrew Kelley (creator of the language) on Hacker News, about how he reimplemented my Chromaprint algorithm in Zig, and that got me really interested.

I’ve been planning to rewrite AcoustID’s inverted index for a long time, I had a couple of prototypes, but none of the approaches felt right. I was going through some rough times, wanted to learn something new, so I decided to use the project as an opportunity to learn Zig. And it was great, writing Zig is a joy. The new version was faster and more scalable than the previous C++ one. I was happy, until I wanted to add a server interface.

In the previous C++ version, I used Qt, which might seem very strange for server software, but I wanted a nice way of doing asynchronous I/O and Qt allowed me to do that. It was callback-based, but Qt has a lot of support for making callbacks usable. In the newer prototypes, I used Go, specifically for the ease of networking and concurrency. With Zig, I was stuck. There are some Zig HTTP servers, so I could use those. I wanted to implement my legacy TCP server as well, and that’s a lot harder, unless I want to spawn a lot of threads. Then I made a crazy decision, to use Zig also for implementing a clustered layer on top of my server, using NATS as a messaging system, so I wrote a Zig NATS client, and that gave me a lot of experience with Zig’s networking capabilities.

Fast forward to today, I’m happy to introduce Zio, an asynchronous I/O and concurrency library for Zig. If you look at the examples, you will not really see where the asynchronous I/O is, but it’s there, in the background, and that’s the point.

Writing asynchronous code with callbacks is a pain.
Not only that, it requires a lot of allocations, because you need state to survive across callbacks. Zio is an implementation of Go style concurrency, but limited to what’s possible in Zig. Zio tasks are stackful coroutines with fixed-size stacks. When you run , this will initiate the I/O operation in the background and then suspend the current task until the I/O operation is done. When it’s done, the task will be resumed, and the result will be returned. That gives you the illusion of synchronous code, allowing for much simpler state management. Zio supports fully asynchronous network and file I/O, has synchronization primitives (mutexes, condition variables, etc.) that work with the cooperative runtime, has Go-style channels, OS signal watches and more. Tasks can run in single-threaded mode, or multi-threaded, in which case they can migrate from thread to thread for lower latency and better load balancing. And it’s FAST. I don’t want to be posting benchmarks here, maybe later when I have more complex ones, but the single-threaded mode is beating any framework I’ve tried so far. It’s much faster than both Go and Rust’s Tokio. Context switching is virtually free, comparable to a function call. The multi-threaded mode, while still not being as robust as Go/Tokio, has comparable performance. It’s still a bit faster than either of them, but that performance might go down as I add more fairness features. Because it implements the standard interfaces for reader/writer, you can actually use external libraries that are unaware they are running within Zio. Here is an example of an HTTP server: When I started working with Zig, I really thought it was going to be a niche language to write the fast code in, and then I’d need a layer on top of that in a different language. With Zio, that changed. The next step for me is to update my NATS client to use Zio internally. And after that, I’m going to work on an HTTP client/server library based on Zio.

Filippo Valsorda 1 months ago

The Geomys Standard of Care

One of the most impactful effects of professionalizing open source maintenance is that as professionals we can invest into upholding a set of standards that make our projects safer and more reliable. The same commitments and overhead that are often objected to when required of volunteers should be table stakes for professional maintainers. I didn’t find a lot of prior art, so to compile the Geomys Standard of Care I started by surveying recent supply chain compromises to look for mitigable root causes. (By the way, you might have missed that email because it includes the name of a domain used for a phishing campaign, so it got flagged as phishing. Oops.) I also asked feedback from experts in various areas such as CI security, and from other Geomys maintainers. The first draft is below, and we’ll maintain the latest version at geomys.org/standard-of-care . It covers general maintenance philosophy, ongoing stability and reliability, dependency management, account and CI security, vulnerability handling, licensing, and more. In the future, we want to look into adopting more binary transparency tools, and into doing periodic reviews of browser extensions and of authorized Gerrit and GitHub OAuth apps and tokens (just GitHub has four places 1 to look in!). We also welcome feedback on things that would be valuable to add, for security or for reliability. We aim to maintain our projects sustainably and predictably. We are only able to do this thanks to our retainer contracts with our clients, but these commitments are offered to the whole community, not just to paying clients. Scope . We apply this standard to projects maintained or co-maintained by Geomys, including For projects where we are not the sole maintainers, we prioritize working well with the rest of the team. Geomys maintainers may also have personal projects that are not held to this standard (e.g. everything in mostly-harmless ). Code review . 
If the project accepts external contributions, we review all the code provided to us. This extends to any code generated with LLMs, as well. Complexity . A major part of the role of a maintainer is saying no. We consciously limit complexity, and keep the goals and non-goals of a project in mind when considering features. (See for example the Go Cryptography Principles .) Static analysis . We run staticcheck , by our very own @dominikh , in CI. Stability . Once a Go package reaches v1, we maintain strict backwards compatibility within a major version, similarly to the standard library’s compatibility promise . Ongoing maintenance . Not all projects are actively worked on at all times (e.g. some projects may be effectively finished, or we may work in batches). However, unless a project is explicitly archived or deprecated, we will address newly arising issues that make the project unsuitable for a previously working use case (e.g. compatibility with a new OS). Dependency management . We don’t use automatic dependency version bump tools, like Dependabot. For our purposes, they only cause churn and increase the risk of supply chain attacks by adopting new module versions before the ecosystem has had time to detect attacks. (Dependabot specifically also has worrying impersonation risks , which would make for trivial social engineering attacks.) Instead, we run govulncheck on a schedule, to get high signal-to-noise ratio notifications of vulnerable dependencies that actually affect our projects; and run isolated CI jobs with the latest versions of our dependencies (i.e. running before ) to ensure we’re alerted early of breakages, so we can easily update to future security releases and so we’re aware of potential compatibility issues for our dependents. Phishing-resistant authentication . Phishing is by far the greatest threat to our security and, transitively, to that of our users. 
We acknowledge there is no amount of human carefulness that can systematically withstand targeted attacks, so we use technically phishing-resistant authentication for all services that allow impacting our projects’ users. Phishing-resistant authentication means passkeys or WebAuthn 2FA, with credentials stored in platform authenticators (e.g. iCloud Keychain), password managers (e.g. 1Password or Chrome), or hardware tokens (e.g. YubiKeys). Critical accounts that allow escalating to user impact include: If a strict mode such as Google’s Advanced Protection Program or Apple’s Advanced Data Protection is available, we enable it. If a phishable fallback authentication or account recovery method is instead required, we configure one that is secret-based (e.g. TOTP or recovery codes) and either delete the secret or commit to never using it without asking a fellow Geomys maintainer to review the circumstances that necessitated it. TOTP can’t hurt us if we don’t use it. We never enable SMS as an authentication mechanism or as an account recovery mechanism, because SIM jacking is possible even without action on our part. Long-lived credentials . We avoid where possible long-lived persistent credentials, or make them non-extractable if possible. For example, we use git-credential-oauth instead of Gerrit cookies, and hardware-bound SSH keys with yubikey-agent or Secretive instead of personal access tokens for git pushes to GitHub. Unlike phishing-resistant authentication, we found it impractical to roll out short-lived credentials universally. Notably, we have not found a way to use the GitHub CLI without extractable long-lived credentials. CI security . We run zizmor on our GitHub Actions workflows, and we don’t use dangerous GitHub Actions triggers that run privileged workflows with attacker-controlled contexts, such as . We run GitHub Actions workflows with read-only permissions and no secrets by default. 
Workflows that have write permissions or access to secrets disable all use of caches (including indirectly through actions like ), to mitigate cache poisoning attacks . (Note that, incredibly, read-only workflows can write arbitrary cache entries, which is why this must be mitigated at cache use time.) Third-party access . For projects maintained solely by Geomys, we avoid providing user-impacting (i.e. push or release) access to external people, and publicly disclose any exceptions. If abandoning a project, we prefer archiving it and letting a fork spawn to handing over control to external people. This way dependents can make their own assessment of whether to trust the new maintainers. Any exceptions will be widely communicated well in advance. Under no circumstances will we release to public registration a domain, GitHub user/org, or package name that was previously assigned to a Geomys project. Availability monitoring . We have automated uptime monitoring for critical user-facing endpoints, such as the Go import path meta pages. This also provides monitoring for critical domain expiration, preventing accidental takeovers. Transparency logging . We subscribe to new version notifications via GopherWatch , to be alerted of unauthorized module versions published to the Go Checksum Database. We monitor Certificate Transparency logs for critical domains (e.g. the roots of our Go import paths) using tools such as Cert Spotter or Silent CT . We also set CAA records on those domains limiting issuance to the minimal set of CAs required for operation. Vulnerability handling . We document the official vulnerability reporting mechanism of each project, we encourage coordinated vulnerability reporting, and we appreciate the work of security researchers. We honor embargoes of up to 90 days, and we do not share vulnerability details with people not involved in fixing it until they are public. (Paying clients do not get access to private vulnerability details. 
This is to honor our responsibility to the various stakeholders of an open source project, and to acknowledge that often these details are not ours to share.) Once a vulnerability is made public, we ensure it is included in the Go vulnerability database with accurate credit and metadata, including a CVE number. If the documented vulnerability reporting mechanism is unresponsive, an escalation path is available by emailing security at geomys.org. Licenses . We use permissive, well-known licenses: BSD-3-Clause, BSD-2-Clause, BSD-1-Clause, 0BSD, ISC, MIT, or (less preferably) Apache-2.0. Disclaimer . This is not a legally binding agreement. Your use of the projects continues to be controlled by their respective licenses, and/or by your contract with Geomys, which does not include this document unless explicitly specified. I am getting a cat (if I successfully defeat my allergies through a combination of LiveClear , SLIT , antihistamines, and HEPA filters), so obviously you are going to get a lot of cat pictures going forward. For more, you can follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . This is the work of Geomys , an organization of professional Go maintainers, which is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. 
Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs: We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

1. https://github.com/settings/tokens and https://github.com/settings/personal-access-tokens and https://github.com/settings/apps/authorizations and https://github.com/settings/applications

Projects in scope (the list referenced under Scope above):
- the and packages in the Go standard library and the FIPS 140-3 Go Cryptographic Module (co-maintained with the rest of the Go team)
- Staticcheck
- filippo.io/edwards25519
- filippo.io/csrf
- filippo.io/keygen
- filippo.io/intermediates (externalized from the standard library)
- age and typage
- Sunlight and filippo.io/torchwood
- yubikey-agent

Critical accounts that allow escalating to user impact (the list referenced under Phishing-resistant authentication above):
- All Google accounts linked to a Gerrit account
- Password manager
- Passkey sync (e.g. Apple iCloud)
- Website host
- Domain registrar
- Package registry (if applicable, although Go’s decentralized package management largely removes this attack surface)

Anton Zhiyanov 1 months ago

Go proposal: Compare IP subnets

Part of the Accepted! series, explaining the upcoming Go changes in simple terms.

Compare IP address prefixes the same way IANA does.

Ver. 1.26 • Stdlib • Low impact

An IP address prefix represents an IP subnet. These prefixes are usually written in CIDR notation: In Go, an IP prefix is represented by the type. The new method lets you compare two IP prefixes, making it easy to sort them without having to write your own comparison code. The imposed order matches both Python's implementation and the assumed order from IANA.

When the Go team initially designed the IP subnet type ( ), they chose not to add a method because there wasn't a widely accepted way to order these values. Because of this, if a developer needs to sort IP subnets (for example, to organize routing tables or run tests), they have to write their own comparison logic. This results in repetitive and error-prone code. The proposal aims to provide a standard way to compare IP prefixes. This should reduce boilerplate code and help programs sort IP subnets consistently.

Add the method to the type: orders two prefixes as follows:

1. First by validity (invalid before valid).
2. Then by address family (IPv4 before IPv6).
3. Then by masked IP address (network IP).
4. Then by prefix length.
5. Then by unmasked address (original IP).

This follows the same order as Python's and the standard IANA convention.

Sort a list of IP prefixes:

𝗣 61642 • 𝗖𝗟 700355


Interview with a new hosting provider founder

Most of us use infrastructure provided by companies like DigitalOcean and AWS. Some of us choose to work on that infrastructure. And some of us are really built different and choose to build all that infrastructure from scratch. This post is a real treat for me to bring you. I met Diana through a friend of mine, and I've gotten some peeks behind the curtain as she builds a new hosting provider. So I was thrilled that she agreed to an interview to let me share some of that with you all. So, here it is: a peek behind the curtain of a new hosting provider, in a very early stage. This is the interview as transcribed (any errors are mine), with a few edits as noted for clarity.

Nicole: Hi, Diana! Thanks for taking the time to do this. Can you start us off by just telling us a little bit about who you are and what your company does?

Diana: So I'm Diana, I'm trans, gay, AuDHD and I like to create, mainly singing and 3D printing. I also have dreams of being the change I want to see in the world. Since graduating high school, all infrastructure has become a passion for me. Particularly networking and computer infrastructure. From your home internet connection to data centers and everything in between. This has led me to create Andromeda Industries and the dba Gigabit.Host. Gigabit.Host is a hosting service where the focus is affordable and performant host for individuals, communities, and small businesses.

Nicole: Let's start out talking about the business a little bit. What made you decide to start a hosting company?

Diana: The lack of performance for a ridiculous price. The margins on hosting is ridiculous, it's why the majority of the big tech companies' revenue comes from their cloud offerings. So my thought has been why not take that and use it more constructively. Instead of using the margins to crush competition while making the rich even more wealthy, use those margins for good.

Nicole: What is the ethos of your company?
Diana: To use the net profits from the company to support and build third spaces and other low return/high investment cost ventures. From my perspective, these are the types of ideas that can have the biggest impact on making the world a better place. So this is my way of adopting socialist economic ideas into the systems we currently have and implementing the changes.

Nicole: How big is the company? Do you have anyone else helping out?

Diana: It’s just me for now, though the plan is to make it into a co-op or unionized business. I have friends and supporters of the project, giving feedback and suggesting improvements.

Nicole: What does your average day-to-day look like?

Diana: I go to my day job during the week, and work on the company in my spare time. I have alerts and monitors that warn me when something needs addressing, overall operations are pretty hands off.

Nicole: You're a founder, and founders have to wear all the hats. How have you managed your work-life balance while starting this?

Diana: At this point it’s more about balancing my job, working on the company, and taking care of my cat. It's unfortunately another reason that I started this endeavor, there just aren't spaces I'd rather be than home, outside of a park or hiking. All of my friends are online and most say the same, where would I go?

Nicole: Hosting businesses can be very capital intensive to start. How do you fund it?

Diana: Through my bonuses and stocks currently, also through using more cost effective brands that are still reliable and performant.

Nicole: What has been the biggest challenge of operating it from a business perspective?

Diana: Getting customers. I'm not a huge fan of marketing and have been using word of mouth as the primary method of growing the business.

Nicole: Okay, my part here then haha. If people want to sign up, how should they do that?

Diana: If people are interested in getting service, they can request an invite through this link: https://portal.gigabit.host/invite/request .

Nicole: What has been the most fun part of running a hosting company?
Diana: Getting to actually be hands on with the hardware and making it as performant as possible. It scratches an itch of eking out every last drop of performance. Also not doing it because it's easy, doing it because I thought it would be easy.

Nicole: What has been the biggest surprise from starting Gigabit.Host?

Diana: How both complex and easy it has been at the same time. Also how much I've been learning and growing through starting the company.

Nicole: What're some of the things you've learned?

Diana: It's been learning that wanting it to be perfect isn't realistic, taking the small wins and building upon and continuing to learn as you go. My biggest learning challenge was how to do frontend work with Typescript and styling, the backend code has been easy for me. The frontend used to be my weakness, now it could be better, and as I add new features I can see it continuing to getting better over time.

Nicole: Now let's talk a little bit about the tech behind the scenes. What does the tech stack look like?

Diana:
- Next.js and Typescript for the front and backend.
- Temporal is used for provisioning and task automation.
- Supabase is handling user management
- Proxmox for the hardware virtualization

Nicole: How do you actually manage this fleet of VMs?

Diana: For the customer side we only handle the initial provisioning, then the customer is free to use whatever tool they choose. The provisioning of the VMs is handled using Go and Temporal. For our internal services we use Ansible and automation scripts. [Nicole: the code running the platform is open source, so you can take a look at how it's done in the repository!]

Nicole: How do your technical choices and your values as a founder and company work together?

Diana: They are usually in sync, the biggest struggle has been minimizing cost of hardware. While I would like to use more advanced networking gear, it's currently cost prohibitive.

Nicole: Which choices might you have made differently?

Diana: [I would have] gathered more capital before getting started.
Though that's me trying to be a perfectionist, when the reality is buy as little as possible and use what you have when able. This seems like a really hard business to be in since you need reliability out of the gate. How have you approached that? Since I've been self-funding this endeavor, I've had to forgo high availability for now due to costs. To work around that I've gotten modern hardware for the critical parts of the infrastructure. This so far has enabled us to achieve 90%+ uptime, with the current goal to add redundancy as able to do so. What have been the biggest technical challenges you've run into? Power and colocation costs. Colocation is expensive in Seattle. Around 8x the cost of my previous colo in Atlanta, GA. Power has been the second challenge, running modern hardware means higher power requirements. Most data centers outside of hyperscalers are limited to 5 to 10 kW per rack. This limits the hardware and density, thankfully for now it [is] a future struggle. Huge thanks to Diana for taking the time out of her very busy for this interview! And thank you to a few friends who helped me prepare for the interview.

Emil Privér 1 months ago

We Re-Built Our Integration Service Using Postgres and Go

Our integration service connects our platform to external systems. Earlier this year, we reached a scaling limit at 40 integrations and rebuilt it from the ground up. The service handles three primary responsibilities: sending data to external systems, managing job queues, and prioritizing work based on criticality. The original implementation functioned but had architectural constraints that prevented horizontal scaling. We use microservices because different components have conflicting requirements. The management API handles complex business logic with normalized schemas—separate tables for translations and categories. The public API optimizes for read performance under load, using denormalized data by adding translations directly into category tables and handling filtering in Go. A monolithic architecture would require compromising performance in one area to accommodate the other. The integration service currently processes millions of events daily, with volume increasing as we onboard new customers. This post describes our implementation of a queue system using PostgreSQL and Go, focusing on design decisions and technical trade-offs. The first implementation used GCP Pub/Sub, a topic-to-many-subscription service where messages are replicated across multiple queues. This architecture introduced several scalability issues. The integration service maintained a database for integration configurations but lacked ownership of its operational data. This violated a distributed systems principle: services should own their data rather than depend on other services for it. This dependency forced our management service to serialize complete payloads into the queue. Updating a single attribute on a sub-object required sending the entire parent object with all nested sub-objects, metadata, and relationships. Different external APIs have varying data requirements—some need individual sub-objects while others require complete hierarchies. 
For clients with records containing 300-500 sub-objects, this resulted in significant message size inflation. GCP charges by message size rather than count, making large messages substantially more expensive than smaller ones. GCP’s WebSocket delivery requires clients to buffer messages internally. With 40 integrations running separate consumers with filters, traffic spikes created memory pressure: This prevented horizontal scaling and limited us to vertical scaling approaches. External APIs enforce varying rate limits. Our in-memory rate limiter tracked requests per integration but prevented horizontal scaling since state couldn’t be shared across instances without risking rate limit violations. By early 2025, these issues had compounded: excessive message sizes increasing costs, memory bloat requiring oversized containers, vertical-only scaling, high operational expenses, rate limiting preventing horizontal scale, and lack of data independence. The system couldn’t accommodate our growth trajectory. A complete rebuild was necessary. The v2 design addressed specific limitations: Additional improvements: The standard approach involves the producer computing payloads and sending them to the queue for consumer processing. We used this in v1 but rejected it for v2. Customers frequently make multiple rapid changes to the same record—updating a title, then a price, then a description. Each change triggers an event. Instead of sending three separate updates, we consolidate changes into a single update. We implemented a in the jobs table. Multiple updates to the same record within a short time window are deduplicated into a single job, reducing load on both our system and recipient systems. We chose PostgreSQL as our queue backend for several reasons: Often, we think we need something bigger like Apache Kafka when a relational database like PostgreSQL is sufficient for our requirements. 
The jobs table structure: Each job tracks: Postgres-backed queues require careful indexing. We use partial indexes (with WHERE clauses) only for actively queried states: , , , and . We don’t index or states. These statuses contain the majority of jobs in the table and aren’t needed in the job processing flow. Indexing them would just add more data into the memory when we don’t use it in the flow. Jobs are ordered by for FIFO processing, with priority queue overrides when applicable. Jobs follow a defined lifecycle: Timestamp fields serve observability purposes, measuring job duration and identifying bottlenecks. For jobs, retry timing is calculated using exponential backoff. The worker system requirements: We evaluated two approaches: maintaining in-memory queues with multiple goroutines using for and select to fetch jobs, or having goroutines fetch data from the database and iterate over the results. We chose the database iteration approach for its simplicity. pgxpool handles connection pooling, eliminating the need for channel-based in-memory queues. Each worker runs in a separate goroutine, using a ticker to poll for jobs every second. Before processing, workers check for shutdown signals ( or channel). When shutdown is initiated, workers stop accepting new jobs and mark in-flight jobs as . This prevents stalled jobs from blocking integration queues. Checking shutdown signals between jobs ensures clean shutdowns. During shutdown, we create a fresh context with for retrying jobs. This prevents database write failures when the main context is canceled. The query implements fair scheduling to prevent high-volume integrations from monopolizing workers: Query breakdown: Step 1: Identify busy integrations This CTE identifies integrations with 50+ concurrent processing jobs. Step 2: Select jobs with priority ordering Jobs are selected from integrations not in the busy list. Priority updates are ordered first, followed by FIFO ordering. 
locks selected rows to the current transaction, preventing duplicate processing by concurrent workers. Step 3: Update job status Selected jobs are updated to status with a recorded start time. This ensures fair resource allocation across integrations. Job timeouts are critical for queue health. In the initial release, we reused the global context for job processing. When jobs hung waiting for slow external APIs, they couldn’t be marked completed or failed due to context lifecycle coupling. Jobs accumulated in state indefinitely. The solution: context separation. The global context controls worker lifecycle. Each job receives its own context with a timeout. Timed-out jobs are marked , allowing queue progression. This also enables database writes during shutdown using a fresh context, even when the global context is canceled. Failed jobs require retry logic with appropriate timing. Immediate retries against failing external APIs are counterproductive. We implement exponential backoff: instant first retry, 10 seconds for the second, 30 seconds for the third, up to 30 minutes. The field drives backoff calculation. After 10 attempts, jobs are marked . Error types guide retry behavior: This allows each integration to decide how to handle errors based on the external API’s response. For example, a 400 Bad Request might be a permanent validation failure (NonRetryableError), while a 503 Service Unavailable is transient and should retry (RetryableError). The integration implementation determines the appropriate error type for each scenario. Jobs occasionally become stuck in state due to worker panics, database connection failures, or unexpected container termination. A cron job runs every minute, identifying jobs in state beyond the expected duration. These jobs are moved to with incremented retry counts, treating them as standard failures. This ensures queue progression despite unexpected failures. Rate limiting across multiple containers was v2’s most complex challenge. 
V1’s in-memory rate limiter worked for single containers but couldn’t share state across instances. While Redis was an option, we already had PostgreSQL with sufficient performance. The solution: a table tracking request counts per integration per second.

Before external API requests, we increment the counter for the integration’s current time window (rounded to the second). PostgreSQL returns the new count. If the count exceeds the limit, we sleep 250ms and retry. If under the limit, we proceed. This works because all containers share the database as the source of truth for rate limiting.

Occasionally, jobs are rate-limited during heavy load due to the gap between count checking and request sending. These jobs retry immediately. The occurrence rate is acceptable.

Hope you enjoyed this article and learned something new. This system has worked really well so far, and we’ve had only a few minor issues that we fixed quickly. I will update this article over time.

Memory pressure during traffic spikes in v1:
- Mass updates generate large objects per record
- Objects are duplicated for each configured integration
- Copies buffer across 5-10 consumer instances
- Infrastructure requires 2GB RAM and 2 cores to handle spikes, despite needing only 512MB and 1 core during normal operation

V2 design goals and additional improvements:
- Horizontal scaling - Enable scaling across multiple containers
- Distributed rate limiting - Coordinate rate limits across instances
- Data ownership - Store operational data within the service
- Delta updates - Send only changed data rather than complete records
- Fair scheduling - Prevent single integrations from monopolizing resources
- Priority queuing - Process critical updates before lower-priority changes
- Self-service re-sync - Enable customers to re-sync catalogs independently
- Visibility - Provide APIs for customers to monitor sent data and queue status

Reasons for choosing PostgreSQL as the queue backend:
- Performance - PostgreSQL is fast enough for our use case. We don’t need sub-second message delivery.
- Simplicity - Using a managed PostgreSQL instance on GCP is significantly simpler than introducing new infrastructure.
- Familiarity - Most developers understand SQL, reducing onboarding time.
- Existing infrastructure - We already use PostgreSQL for our data, eliminating the need for additional systems.

What each job tracks:
- Links logs across services
- Specifies the action
- Records failure details
- Tracks current workflow state
- Counts retry attempts
- Schedules next retry
- Provides metrics for observability
- Links to specific integrations
- Identifies the platform
- Contains job data
- Prevents duplicate execution

Job lifecycle:
- Created → Initial state:
- Picked up → Transitions to
- Success → Becomes , records
- Failed (10 retries) → Becomes , records
- Failed (retries remaining) → Becomes , increments , calculates

Worker system requirements:
- Parallel worker execution
- Horizontal scaling across containers
- Graceful shutdowns without job loss
- Distributed rate limit enforcement: we need to respect rate limits no matter how many containers we run

Error types:
- Permanent failures (e.g., validation errors). No retry.
- Transient failures (e.g., 500 Internal Server Error). Retry with backoff.
- Retry limit reached. Mark failed.

alikhil 1 months ago

kubectl-find - UNIX-find-like plugin to find resources and perform action on them

Recently, I developed a plugin for kubectl, inspired by the UNIX find utility, to find resources and perform actions on them. A few days ago, the number of stars in the repo reached 50! I think it’s a good moment to tell more about the project.

As an engineer who works with Kubernetes every day, I use kubectl a lot. Actually, more than 50% of my terminal history commands are related to Kubernetes. Here are my top 10 commands: Run this command if you are curious about your own most popular commands in terminal history.

I use kubectl to check the status of pods, delete orphaned resources, trigger syncs, and much more. When I realized half my terminal history was just kubectl commands, I thought: there must be a better way to find things in Kubernetes without chaining pipes. And I imagined how nice it would be to have a UNIX find-like tool, something that lets you search for exactly what you need in the cluster and then perform actions directly on the matching resources. I searched for a krew plugin like this, but there wasn’t one. For that reason, I decided to develop one!

I used sample-cli-plugin as a starting point. Its clean repository structure and straightforward design make it a great reference for working with the Kubernetes API. Additionally, it allows easy reuse of the extensive Kubernetes client libraries. Almost everything in the Kubernetes ecosystem is written in Go, and this plugin is no exception, which is great, as it allows building binaries for a wide range of CPU architectures and operating systems.

Use a jq filter to find any resource by any custom condition; kubectl-find uses gojq, a Go implementation of jq. By default, kubectl-find prints the found resources to stdout. However, there are flags that you can provide to perform an action on the found resources:
- to delete them
- to patch them with provided JSON
- to run a command on pods

Use krew to install the plugin.

I’m currently working on adding:
- JSON/YAML output format
- More filters
- Saved queries

If you’re tired of writing long command chains, give kubectl-find a try; it has already saved me countless keystrokes. Check out the repo ⭐ github.com/alikhil/kubectl-find and share your ideas or issues. I’d love to hear how you use it!

The Coder Cafe 1 months ago

Conflict-Free Replicated Data Types (CRDTs)

☕ Welcome to The Coder Cafe! Today, we will explore CRDTs, why they matter in distributed systems, and how they keep nodes in sync. Get cozy, grab a coffee, and let’s begin! CRDTs, short for Conflict-Free Replicated Data Types, are a family of data structures built for distributed systems. At first sight, CRDTs may look intimidating. Yet at their core, the idea is not that complex. What makes them special is that they allow updates to happen independently on different nodes while still guaranteeing that all replicas eventually converge to the same state. To understand how CRDTs achieve this, we first need to step back. We need to talk about concurrent operations and what coordination means in a distributed system. Let’s take it step by step. What does concurrent operations mean? Our first intuition might be to say they happen at the same time. That’s not quite right. Here’s a counterargument based on a collaborative editing example. While on a plane, Alice connects to a document and makes an offline change to a sentence. An hour later, Bob connects to the same document and edits the very same sentence, but online. Later, when Alice lands, both versions have to sync. The two edits (1. and 2.) were separated by an hour. They didn’t happen at the same time, yet they are concurrent. So what’s a better definition for concurrent operations? Two operations that are not causally related. In the previous example, neither operation was made with knowledge of the other. They are not causally related, which makes them concurrent. Yet, if Bob had first seen Alice’s update and then made his own, his edit would depend on hers. In that case, the two operations wouldn’t be concurrent anymore. We should also understand concurrent ≠ conflict: If Alice fixes a missing letter in a word while Bob removes the whole word, that’s a conflict. If Alice edits one sentence while Bob edits another, that’s not a conflict. Concurrency is about independence in knowledge. 
Conflict is about whether the effects of operations collide. Now, let’s talk about coordination in distributed systems. Imagine a database with two nodes, node 1 and node 2. A bunch of clients connect to it. Sometimes requests go to node 1, sometimes to node 2. Let’s say two clients send concurrent and conflicting operations: In this case, we can’t have node 1 storing $200 while node 2 stores -$100. That would be a consistency violation with the two nodes disagreeing on Alice’s balance. Instead, both nodes need to agree on a shared value. To do that, they have to communicate and decide on one of the following: Reject both operations Accept client A’s update and set the balance to $200 Accept client B’s update and set the balance to -$100 The very action of nodes communicating and, if needed, waiting to agree on a single outcome is called coordination. Coordination is one way to keep replicas consistent under concurrent operations. But coordination is not the only way. That’s where CRDTs come in. CRDT stands for Conflict-Free Replicated Data Types . In short, CRDTs are data structures built so that nodes can accept local updates independently and concurrently, without the need for coordination. If you read our recent post on availability models, you might notice we’re now in the territory of total availability: a system is totally available if every non-faulty node can execute any operation. Total availability comes with weaker consistency. For CRDTs, the consistency guarantee is called Strong Eventual Consistency (SEC) . For that, CRDTs rely on a deterministic conflict resolution algorithm. Because every node applies the same rules, all replicas are guaranteed to eventually converge to the same state. Let’s make this more concrete with a classic CRDT: the G-Counter (Grow-Only Counter). Imagine a database with two nodes tracking the number of likes on a post. 
Node 1 receives a new like, increments its counter, and replies success to the client. Then, node 1 communicates with node 2 to send this update. Ultimately, both nodes converge to the same value: 6.

How does the conflict resolution work for a G-Counter? Each replica keeps a vector of counters, with one slot per node. In our example, the total number of likes is 5. Let’s say node 1 has seen 2 likes and node 2 has seen 3 likes. So in the initial state, both replicas hold the vector [2, 3] (first slot for node 1, second slot for node 2). When node 1 receives a new like, it only increments its own slot. Node 2 is now temporarily out of sync: node 1 holds [3, 3] while node 2 still holds [2, 3]. During synchronization, both nodes merge their vectors by taking the element-wise maximum: max([3, 3], [2, 3]) = [3, 3]. Now both replicas converge to the same state, [3, 3], for a total of 6 likes.

The beauty of this algorithm is that it’s deterministic and order-independent. No matter when or how often the nodes sync, they always end up with the same state.

NOTE: Do you know Gossip Glomers? It’s a series of distributed systems challenges we briefly introduced in an earlier post. Challenge 4 is to build a Grow-Only Counter. It’s worth checking out if you haven’t already.

CRDTs can also be combined to make a more complex CRDT. For example, if we want to track both likes and dislikes, we can use two G-Counters together. This data type is called a PN-Counter (Positive-Negative Counter). Imagine two clients act concurrently on the same post: one likes it, another dislikes it. The nodes exchange their updates and converge to the same value.

In the case of a PN-Counter, the conflict resolution algorithm is similar to the G-Counter. The difference lies in the fact that it involves not one but two vectors: one for increases (P) and one for decreases (N). Assume an initial state where node 1 has received 2 likes and 0 dislikes, and node 2 has received 3 likes and 0 dislikes: both replicas hold P = [2, 3] and N = [0, 0]. Now, suppose node 1 receives a new like and node 2 receives a dislike. Before the sync, node 1 holds P = [3, 3], N = [0, 0] and node 2 holds P = [2, 3], N = [0, 1]. When the replicas exchange their state, the merge rule is element-wise maximum for each vector. After sync, both nodes converge to P = [3, 3] and N = [0, 1]. The final counter value is the sum of P minus the sum of N: 6 - 1 = 5.

Let’s pause for a second. Based on what we’ve discussed, can you think of some use cases for CRDTs? A data structure where nodes are updated independently, concurrently, without coordination, and still guarantees that they converge to the same state?

One main use case is collaborative and offline-first systems. For example, Notion, a collaborative workspace, recently introduced a feature that lets people edit the same content offline. They rely on CRDTs, and more specifically on Peritext, a CRDT for rich-text collaboration co-authored by multiple people.

Another big use case is totally available systems that put availability ahead of strong consistency. As we’ve seen, nodes don’t need to coordinate before acknowledging a client request, which makes the system more highly available. Take Redis, for example. It can be configured in an active-active architecture with geographically distributed datacenters. Clients connect to their closest cluster and get local latencies without waiting for coordination across distant regions. And yes, this setup is built on CRDTs.

We could also think about other applications for CRDTs, like:
- Edge & IoT: Devices update offline and merge later without a central server.
- Peer-to-peer: Peers share changes directly and match up when they reconnect.
- CDN/edge state: Keep preferences, drafts, or counters near users and sync to the origin later.

There are two main types of CRDTs:
- State-based CRDTs: Convergence happens by propagating the full state.
- Operation-based CRDTs: Convergence happens by propagating the update operations.

In the previous examples, we looked at two state-based CRDTs: the G-Counter (Grow-Only Counter) and the PN-Counter (Positive-Negative Counter).
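As a minimal sketch, these two counters could look like this in Go. The type and method names (`GCounter`, `PNCounter`, `Inc`, `Merge`, `Value`) are illustrative choices, not taken from any particular library, and the code assumes a fixed, known number of nodes so that every vector has the same length. The example replays the scenario from the text: node 1 has seen 2 likes, node 2 has seen 3, then node 1 gets a like and node 2 gets a dislike.

```go
package main

import "fmt"

// GCounter is a state-based grow-only counter with one slot per node.
type GCounter []int

// Inc increments only the slot owned by the given node (0-based index).
func (g GCounter) Inc(node int) { g[node]++ }

// Value is the sum of all slots.
func (g GCounter) Value() int {
	total := 0
	for _, v := range g {
		total += v
	}
	return total
}

// Merge folds another replica's state into this one using the
// element-wise maximum. Both vectors are assumed to have the same length.
func (g GCounter) Merge(other GCounter) {
	for i := range g {
		if other[i] > g[i] {
			g[i] = other[i]
		}
	}
}

// PNCounter combines two G-Counters: P for increases, N for decreases.
type PNCounter struct{ P, N GCounter }

// Value is the sum of increases minus the sum of decreases.
func (c PNCounter) Value() int { return c.P.Value() - c.N.Value() }

// Merge merges each vector independently.
func (c PNCounter) Merge(other PNCounter) {
	c.P.Merge(other.P)
	c.N.Merge(other.N)
}

func main() {
	// Both replicas start from P=[2,3], N=[0,0].
	node1 := PNCounter{P: GCounter{2, 3}, N: GCounter{0, 0}}
	node2 := PNCounter{P: GCounter{2, 3}, N: GCounter{0, 0}}

	node1.P.Inc(0) // node 1 receives a like
	node2.N.Inc(1) // node 2 receives a dislike

	// The replicas exchange full state, in either order.
	node1.Merge(node2)
	node2.Merge(node1)

	fmt.Println(node1.P, node1.N, node1.Value())
	fmt.Println(node2.P, node2.N, node2.Value())
}
```

Calling `Merge` a second time with the same state changes nothing, and merging in the opposite order gives the same vectors, which is what lets replicas sync whenever and as often as they like.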
In both cases, what was exchanged between the nodes was the entire state. For example, node 1 could tell node 2 that its total number of likes is 3. With state-based CRDTs, states are merged with a function that must be: Commutative: We can merge in any order and get the same result. Idempotent: Merging something with itself doesn’t change it. Associative: We can merge in any grouping and get the same result. Each synchronization monotonically increases the internal state. In other words, when two replicas sync, the state can only move forward, never backward. This is enforced by a simple “ can’t-go-backwards ” rule (a partial order), where merges use operations like max for numbers (as we’ve seen) or union for sets. In operation-based CRDTs, nodes share the operations rather than the full state. Convergence relies on three properties: Commutativity of concurrent operations Causality: Either carried in the operations’ metadata (for example, vector clocks) or guaranteed by the transport layer through causal delivery Duplicate tolerance: Handled by idempotent operations, unique operation IDs with deduplication, or a transport layer that guarantees no duplicates One example of an operation-based CRDT is the LWW-Register (Last-Writer-Wins Register), which stores a single value. Updates are resolved using a logical timestamp (such as Lamport clocks) along with a tie-breaker like the node ID. When a node writes a value, it broadcasts an operation . On receiving it, a node applies the update if the pair is greater than the one it currently holds. To summarize: State-based CRDTs: Convergence is guaranteed because merging states is associative, commutative, and idempotent. Don’t require assumptions on the delivery layer beyond eventual delivery. Simpler to reason about. Exchanging full states can be more bandwidth-intensive. Operation-based CRDTs: More bandwidth-efficient; we only send the operations, not the whole state. 
Correctness usually depends on having causal order (or encoding causality in the ops) and tolerating duplicates via idempotence/dedup. More complex to implement (causal broadcast, vector clocks, or equivalent). For completeness, there’s also a third type we should be aware of: delta-based CRDTs . Here, convergence is achieved by sending and merging fragments of state (deltas) rather than the entire state. A quick analogy to picture the differences: State-based CRDT: “ From time to time, send me the whole document. ” Operation-based CRDT: “ When you make a change, tell me exactly what you did. ” → “ Adding word `miles` at position 42. ” Delta-based CRDT: “ When you make a change, send me just the delta that reflects it (for example, the updated sentence) ” → “ And miles to go before I sleep. ” We talked about collaborative document editing. So you might assume a system like Google Docs is based on CRDTs, right? Well, that’s not the case. Google Docs is based on another concept called OT (Operational Transformation) . The goal of OT and CRDT is the same: convergence among all nodes in a collaborative system. The main difference is that OT requires all communication to go through the same server: We haven’t mentioned it until now (on purpose), but with CRDTs, there’s no need for a central server to achieve convergence . Back to our collaborative editing tool: if Alice and Bob are both offline but manage to connect their laptops directly, they could still achieve convergence without talking to a central server: As we saw earlier, CRDTs embed a deterministic conflict resolution algorithm. The data type itself ensures convergence. That’s the key difference: CRDTs don’t need to make any assumptions about the network topology or about a central server. considers CRDT to be the natural successor of OT. NOTE : So, why is Google Docs still based on OT? Historical reasons. Google Docs was launched before CRDTs existed, and it still works really well. 
There’s no practical reason for Google to migrate from OT to CRDT, despite some discussions about it in the past. Operations are concurrent when they aren’t causally related; concurrency doesn’t automatically mean conflict. Coordination is when replicas communicate and, if needed, wait to agree on a single outcome for concurrent updates before acknowledging clients, so they don’t diverge. CRDTs accept independent updates on each replica and still converge via deterministic merge rules. Three types: state-based (share full state), operation-based (share operations), delta-based (share just the changed parts). CRDTs are a great fit for systems like offline-first collaboration and highly available systems. Unlike OT, CRDTs don’t rely on a central server to reach the same result everywhere. Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. Exploring Database Isolation Levels Safety and Liveness Ivan Zhao (Notion’s CEO) tweet on the new Notion offline collaboration feature Diving into Conflict-Free Replicated Data Types (CRDTs) - Redis CRDTs: The Hard Parts by Hacker News discussion Peritext - A CRDT for Rich-Text Collaboration Active-Active geo-distribution (CRDTS-based) - Redis Bartosz Sypytkowski’s 12-part blog series on CRDT ❤️ If you enjoyed this post, please hit the like button. 💬 Have you worked with CRDTs before, or do you see another use case where they shine? Share your thoughts in the comments! Leave a comment CRDTs, short for Conflict-Free Replicated Data Types, are a family of data structures built for distributed systems. At first sight, CRDTs may look intimidating. Yet at their core, the idea is not that complex. 
What makes them special is that they allow updates to happen independently on different nodes while still guaranteeing that all replicas eventually converge to the same state. To understand how CRDTs achieve this, we first need to step back. We need to talk about concurrent operations and what coordination means in a distributed system. Let’s take it step by step. Concurrent Operations What does concurrent operations mean? Our first intuition might be to say they happen at the same time. That’s not quite right. Here’s a counterargument based on a collaborative editing example. While on a plane, Alice connects to a document and makes an offline change to a sentence. An hour later, Bob connects to the same document and edits the very same sentence, but online. Later, when Alice lands, both versions have to sync. If Alice fixes a missing letter in a word while Bob removes the whole word, that’s a conflict. If Alice edits one sentence while Bob edits another, that’s not a conflict. In this case, we can’t have node 1 storing $200 while node 2 stores -$100. That would be a consistency violation with the two nodes disagreeing on Alice’s balance. Instead, both nodes need to agree on a shared value. To do that, they have to communicate and decide on one of the following: Reject both operations Accept client A’s update and set the balance to $200 Accept client B’s update and set the balance to -$100 Then, node 1 communicates with node 2 to send this update: Ultimately, both nodes converge to the same value: 6. How does the conflict resolution work for a G-Counter? Each replica keeps a vector of counters, with one slot per node. In our example, the total number of likes is 5. Let’s say node 1 has seen 2 likes and node 2 has seen 3 likes. So the initial state is the following: When node 1 receives a new like, it only increments its own slot. 
Node 2 is now temporarily out of sync: During synchronization, both nodes merge their vectors by taking the element-wise maximum: Now both replicas converge to the same state: The beauty of this algorithm is that it’s deterministic and order-independent. No matter when or how often the nodes sync, they always end up with the same state. NOTE : Do you know Gossip Glomers? It’s a series of distributed systems challenges we briefly introduced in an earlier post . Challenge 4 is to build a Grow-Only Counter. It’s worth checking out if you haven’t already. PN-Counter CRDTs can also be combined to make a more complex CRDT. For example, if we want to track both likes and dislikes, we can use two G-Counters together. This data type is called a PN-Counter (Positive-Negative Counter). Imagine two clients act concurrently on the same post: one likes it, another dislikes it. The nodes exchange their updates and converge to the same value: In the case of a PN-Counter, the conflict resolution algorithm is similar to the G-Counter. The difference lies in the fact that it involves not one but two vectors: one for increases and one for decreases. Assume an initial state where node 1 has received 2 likes and 0 dislikes, and node 2 has received 3 likes and 0 dislikes: Now, suppose node 1 receives a new like and node 2 receives a dislike. Before the sync, the state is the following: When the replicas exchange their state, the merge rule is element-wise maximum for each vector: After sync, both nodes converge to: The final counter of likes is: Use Cases Let’s pause for a second. Based on what we’ve discussed, can you think of some use cases for CRDTs? A data structure where nodes are updated independently, concurrently, without coordination, and still guarantees that they converge to the same state? One main use case is collaborative and offline-first systems. For example, Notion, a collaborative workspace, recently introduced a feature that lets people edit the same content offline. 
They rely on CRDTs, and more specifically on Peritext, a CRDT for rich-text collaboration co-authored by multiple people, including . Another big use case is totally available systems that put availability ahead of strong consistency. As we’ve seen, nodes don’t need to coordinate before acknowledging a client request, which makes the system more highly available. Take Redis, for example. It can be configured in an active-active architecture with geographically distributed datacenter s. Clients connect to their closest cluster and get local latencies without waiting for coordination across distant regions. And yes, this setup is built on CRDTs. We could also think about other applications for CRDTs, like: Edge & IoT : Devices update offline and merge later without a central server. Peer-to-peer : Peers share changes directly and match up when they reconnect. CDN/edge state : Keep preferences, drafts, or counters near users and sync to the origin later. State-based CRDTs : Convergence happens by propagating the full state. Operation-based CRDTs : Convergence happens by propagating the update operations. Commutative: We can merge in any order and get the same result. Idempotent: Merging something with itself doesn’t change it. Associative: We can merge in any grouping and get the same result. Commutativity of concurrent operations Causality: Either carried in the operations’ metadata (for example, vector clocks) or guaranteed by the transport layer through causal delivery Duplicate tolerance: Handled by idempotent operations, unique operation IDs with deduplication, or a transport layer that guarantees no duplicates State-based CRDTs: Convergence is guaranteed because merging states is associative, commutative, and idempotent. Don’t require assumptions on the delivery layer beyond eventual delivery. Simpler to reason about. Exchanging full states can be more bandwidth-intensive. 
Operation-based CRDTs: More bandwidth-efficient; we only send the operations, not the whole state. Correctness usually depends on having causal order (or encoding causality in the ops) and tolerating duplicates via idempotence/dedup. More complex to implement (causal broadcast, vector clocks, or equivalent).

State-based CRDT: “From time to time, send me the whole document.”
Operation-based CRDT: “When you make a change, tell me exactly what you did.” → “Adding word `miles` at position 42.”
Delta-based CRDT: “When you make a change, send me just the delta that reflects it (for example, the updated sentence)” → “And miles to go before I sleep.”

We haven’t mentioned it until now (on purpose), but with CRDTs, there’s no need for a central server to achieve convergence. Back to our collaborative editing tool: if Alice and Bob are both offline but manage to connect their laptops directly, they could still achieve convergence without talking to a central server:

As we saw earlier, CRDTs embed a deterministic conflict resolution algorithm. The data type itself ensures convergence. That’s the key difference: CRDTs don’t need to make any assumptions about the network topology or about a central server. CRDTs have been described as the natural successor of OT.

NOTE: So, why is Google Docs still based on OT? Historical reasons. Google Docs was launched before CRDTs existed, and it still works really well. There’s no practical reason for Google to migrate from OT to CRDT, despite some discussions about it in the past.

Conclusion

Operations are concurrent when they aren’t causally related; concurrency doesn’t automatically mean conflict.
Coordination is when replicas communicate and, if needed, wait to agree on a single outcome for concurrent updates before acknowledging clients, so they don’t diverge.
CRDTs accept independent updates on each replica and still converge via deterministic merge rules.
Three types: state-based (share full state), operation-based (share operations), delta-based (share just the changed parts).
CRDTs are a great fit for systems like offline-first collaboration and highly available systems.
Unlike OT, CRDTs don’t rely on a central server to reach the same result everywhere.

Exploring Database Isolation Levels
Safety and Liveness

Ivan Zhao (Notion’s CEO) tweet on the new Notion offline collaboration feature
Diving into Conflict-Free Replicated Data Types (CRDTs) - Redis
CRDTs: The Hard Parts
Hacker News discussion
Peritext - A CRDT for Rich-Text Collaboration
Active-Active geo-distribution (CRDTs-based) - Redis
Bartosz Sypytkowski’s 12-part blog series on CRDT
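The G-Counter and PN-Counter merge rules described earlier in this post can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the article's implementation: the type names `GCounter` and `PNCounter`, the node IDs `node1` and `node2`, and the method names are all hypothetical. The example replays the scenario from the PN-Counter section (node 1 at 2 likes, node 2 at 3 likes, then one new like on node 1 and one dislike on node 2).

```go
package main

import "fmt"

// GCounter is a grow-only counter with one slot per node ID.
// (Hypothetical type name, chosen for this sketch.)
type GCounter map[string]uint64

// Inc records one local increment on behalf of the given node.
func (g GCounter) Inc(node string) { g[node]++ }

// Value is the sum of all per-node slots.
func (g GCounter) Value() uint64 {
	var sum uint64
	for _, v := range g {
		sum += v
	}
	return sum
}

// Merge folds other into g by taking the element-wise maximum.
// The operation is commutative, associative, and idempotent.
func (g GCounter) Merge(other GCounter) {
	for node, v := range other {
		if v > g[node] {
			g[node] = v
		}
	}
}

// PNCounter combines two G-Counters: P for likes, N for dislikes.
type PNCounter struct {
	P, N GCounter
}

func NewPNCounter() *PNCounter {
	return &PNCounter{P: GCounter{}, N: GCounter{}}
}

func (c *PNCounter) Like(node string)    { c.P.Inc(node) }
func (c *PNCounter) Dislike(node string) { c.N.Inc(node) }

// Value is likes minus dislikes.
func (c *PNCounter) Value() int64 {
	return int64(c.P.Value()) - int64(c.N.Value())
}

// Merge applies the element-wise maximum to each vector separately.
func (c *PNCounter) Merge(other *PNCounter) {
	c.P.Merge(other.P)
	c.N.Merge(other.N)
}

func main() {
	// Initial state known to both replicas: node 1 has 2 likes, node 2 has 3.
	a, b := NewPNCounter(), NewPNCounter()
	for _, r := range []*PNCounter{a, b} {
		r.P["node1"] = 2
		r.P["node2"] = 3
	}

	a.Like("node1")    // replica a (node 1) receives a new like
	b.Dislike("node2") // replica b (node 2) receives a dislike

	// Sync in both directions; the order does not matter.
	a.Merge(b)
	b.Merge(a)

	fmt.Println(a.Value(), b.Value()) // prints "5 5"
}
```

Because `Merge` only ever takes a maximum, syncing twice or in the opposite order leaves both replicas at the same value, which is the order-independence property the article relies on.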

Filippo Valsorda 1 months ago

A Retrospective Survey of 2024/2025 Open Source Supply Chain Compromises

Lack of memory safety is such a predominant cause of security issues that we have a responsibility as professional software engineers to robustly mitigate it in security-sensitive use cases, by using memory safe languages. Similarly, I have the growing impression that software supply chain compromises have a few predominant causes which we might have a responsibility as professional open source maintainers to robustly mitigate.

To test this impression and figure out any such mitigations, I collected all 2024/2025 open source supply chain compromises I could find, and categorized their root cause. (If you find more, do email me!)

Since I am interested in mitigations we can apply as maintainers of depended-upon projects to avoid compromises, I am ignoring: intentionally malicious packages (e.g. typosquatting), issues in package managers (e.g. internal name shadowing), open source infrastructure abuse (e.g. using package registries for post-compromise exfiltration), and isolated app compromises (i.e. not software that is depended upon). Also, I am specifically interested in how an attacker got their first unauthorized access, not in what they did with it. Annoyingly, there is usually a lot more written about the latter than the former.

In no particular order, but kind of grouped.

XZ Utils
Long-term pressure campaign on the maintainer to hand over access.
Root cause: control handoff.
Contributing factor: non-reproducible release artifacts.

Nx S1ingularity
Shell injection in GitHub Action with pull_request_target trigger and unnecessary read/write permissions [1], used to extract an npm token.
Root cause: pull_request_target.
Contributing factors: read/write CI permissions, long-lived credential exfiltration, post-install scripts.

Shai-Hulud
Worm behavior by using compromised npm tokens to publish packages with malicious post-install scripts, and compromised GitHub tokens to publish malicious GitHub Actions workflows.
Root cause: long-lived credential exfiltration.
Contributing factor: post-install scripts.

npm debug/chalk/color
Maintainer phished with an "Update 2FA Now" email. Had TOTP 2FA enabled.
Root cause: phishing.

polyfill.io
Attacker purchased CDN domain name and GitHub organization.
Root cause: control handoff.

MavenGate
Expired domains and changed GitHub usernames resurrected to take control of connected packages.
Root causes: domain resurrection, username resurrection.

reviewdog and tj-actions/changed-files
Contributors deliberately granted automatic write access for GitHub Action repository [2]. Malicious tag re-published to compromise GitHub PAT of more popular GitHub Action [3].
Root cause: control handoff.
Contributing factors: read/write CI permissions, long-lived credential exfiltration, mutable GitHub Actions tags.

Ultralytics
Shell injection in GitHub Action with pull_request_target trigger (which required read/write permissions), pivoted to publishing pipeline via GitHub Actions cache poisoning. Compromised again later using an exfiltrated PyPI token.
Root cause: pull_request_target.
Contributing factors: GitHub Actions cache poisoning, long-lived credential exfiltration.

Kong Ingress Controller
GitHub Action with pull_request_target trigger restricted to trusted users but bypassed via Dependabot impersonation [4], previously patched but still available on old branch. GitHub PAT exfiltrated and used.
Root causes: pull_request_target, Dependabot impersonation.
Contributing factors: per-branch CI configuration, long-lived credential exfiltration.

Rspack
Pwn request [5] against issue_comment workflow [6] in other project, leading to a GitHub classic token of a maintainer with permissions to the web-infra-dev organization [7] (kindly confirmed via email by the Rspack Team). Similar to previously reported and fixed vulnerability [8] in the Rspack repository.
Root cause: issue_comment.
Contributing factor: long-lived credential exfiltration.

eslint-config-prettier
"Verify your account" [9] npm phishing.
Root cause: phishing.
num2words
"Email verification" PyPI phishing.
Root cause: phishing.

@solana/web3.js
A "phishing attack on the credentials for publishing npm packages."
Root cause: phishing.

rustfoundation.dev
Fake compromise remediation [10] Crates.io phishing. Unclear if successful.
Root cause: phishing.

React Native ARIA & gluestack-ui
"[U]nauthorized access to publishing credentials." Colorful and long Incident Report lacks any details on "sophisticated" entry point. Presumably an exposed npm token.
Root cause: long-lived credential exfiltration(?).

lottie-player
Unclear, but mitigation involved "remov[ing] all access and associated tokens/services accounts of the impacted developer."
Root cause: long-lived credential exfiltration(?) or control handoff(?).

rand-user-agent
Unclear. Malicious npm versions published, affected company seems to have deleted the project. Presumably npm token compromise.
Root cause: long-lived credential exfiltration(?).

DogWifTool
GitHub token extracted from distributed binary.
Root cause: long-lived credential exfiltration.

Surprising no one, the most popular confirmed initial compromise vector is phishing. It works against technical open source maintainers. It works against 2FA TOTP. It. Works. It is also very fixable.

It’s 2025 and every professional open source maintainer should be using phishing-resistant authentication (passkeys or WebAuthn 2FA) on all developer accounts, and accounts upstream of them. Upstream accounts include email, password manager, passkey sync (e.g. Apple iCloud), web/DNS hosting, and domain registrar.

Some services, such as GitHub, require a phishable 2FA method along with phishing-resistant ones. In that case, the best option is to enable TOTP, and delete the secret or write it down somewhere safe and never ever use it, effectively disabling it. This does not work with SMS, since SIM jacking is possible even without action by the victim.
Actually surprisingly (to me), a number of compromises are due to, effectively, giving access to the attacker. This is a nuanced people issue. The solution is obviously “don’t do that” but that really reduces to the decades-old issue of open source maintenance sustainability. In a sense, since this analysis is aimed at professional maintainers who can afford it, control handoff is easily avoided by not doing it.

Kind of incredible that a specific feature, pull_request_target, has a top 3 spot, but projects get compromised by “pwn requests” all the time. The pull_request_target workflow trigger runs privileged CI with a context full of attacker-controlled data in response to pull requests. It makes a meek attempt to be safer by not checking out the attacker’s code, instead checking out the upstream target. That’s empirically not enough, with shell injection attacks causing multiple severe compromises.

The zizmor static analyzer can help detect injection vulnerabilities, but it seems clear that pull_request_target is unsafe at any speed, and should just never be used. Other triggers that run privileged with attacker-controlled context should be avoided for the same reason. The Rspack compromise, for example, was due to checking out attacker-controlled code on an issue_comment trigger if the PR receives a comment.

What are the alternatives?

One option is to implement an external service in a language that can safely deal with untrusted inputs (i.e. not YAML’d shell), and use webhooks. That unfortunately requires long-lived credentials (see below).

GitHub itself recommends using the unprivileged pull_request trigger followed by the workflow_run trigger, but it’s unclear to me how much safer that would actually be against injection attacks.

Finally, since two out of three compromises were due to shell injection, it might be safer to use a proper programming language, like JavaScript with actions/github-script, or any other language accessing the context via environment variables instead of YAML interpolation. This means not using any third-party actions, as well.
Allowlisting actors and read-only steps are not robust mitigations, see Read/write CI permissions and Dependabot impersonation below.

Overall, none of the mitigations are particularly satisfactory, so the solution might be simply to eschew features that require pull_request_target and other privileged attacker-controlled triggers. (To be honest, I am not a fan of chatty bots on issues and PRs, so I never needed them.)

Attackers love to steal tokens. There is no universal solution, but it’s so predominant that we can consider piecemeal solutions.

Long-lived credentials are only a root cause when they are accidentally exposed. Otherwise, they are a secondary compromise mechanism for lateral movement or persistence, after the attacker got privileged code execution. Mitigating the latter is somewhat less appealing because an attacker with code execution can find more creative ways to carry out an attack, but we can prune some low-hanging fruit.

Go removes the need for package registry tokens by simply not having accounts. (Instead, the go command fetches modules directly from VCS, with caching by the Go Modules Proxy and universality and immutability guaranteed by the Go Checksum Database.) In other ecosystems Trusted Publishing replaces long-lived private tokens with short-lived OIDC tokens, although there is no way to down-scope the capabilities of an OIDC token.

GitHub Personal Access Tokens are harder to avoid for anything that’s not supported by GitHub Actions permissions. Chainguard has a third-party Security Token Service that trades OIDC tokens for short-lived tokens, and their article has a good list of cases in which PATs end up otherwise necessary. Given the risk, it might be worth giving up on non-critical features that would require powerful tokens.

Gerrit “git cookies” (which are actually just OAuth refresh tokens for the Gerrit app) can be replaced with… well, OAuth refresh tokens but kept in memory instead of disk, using git-credential-oauth.
They can also be stored a little more safely in the platform keychain by treating them as an HTTP password, although that’s not well documented.

In the long term, it would be great to see the equivalent of Device Bound Session Credentials for developer and automated workflows.

Turns out you can just exfiltrate a token from a GitHub Actions runner to impersonate Dependabot with arbitrary PRs??? I guess! Fine! Just don’t allowlist Dependabot. Not sure what a deeper meta-mitigation that didn’t require knowing this factoid would have been. This is also a social engineering risk, so I guess just turn off Dependabot?

Multiple ecosystems (Go and Maven, for example) are vulnerable to name takeovers, whether expired domain names or changed GitHub user/org names. The new owner of the name gets to publish updates for that package. From the point of view of the maintainer, the mitigation is just not to change GitHub names (at least without registering the old one), and to register critical domains for a long period, with expiration alerting.

Some CI compromises happened in contexts that could or should have been read-only. It sounds like giving GitHub Actions workflows only read permissions should be a robust mitigation for any compromise of the code they run. Unfortunately, and kind of incredibly, even a read-only workflow is handed a token that can write to the cross-workflow cache for any key. This cache is then used implicitly by a number of official actions, allowing cross-workflow escalation by GitHub Actions cache poisoning. This contradicts some of GitHub’s own recommendations, and makes the existence of a setting to make GitHub Actions read-only by default more misleading than useful.

The behavior does not extend to regular pull_request triggers, which are actually read-only (otherwise anyone could poison caches with a PR). GitHub simply doesn’t seem to offer a way to opt in to it. I can see no robust mitigation in the GitHub ecosystem.
I would love to be wrong, this is maddening.

Two compromises propagated by injecting npm post-install scripts, to obtain code execution as soon as a dependency was installed. This can be disabled, which is worth doing for defense in depth. However, it’s only useful if the dependency is not going to be executed in a privileged context, e.g. to run tests in Node.js. Go, unlike most ecosystems, considers code execution during fetch or compilation to be a security vulnerability, so has this safety margin by default.

The XZ backdoor was hidden in a release artifact that didn’t match the repository source. It would be great if that was more detectable, in the form of reproducible artifacts. The road to a fail-closed world where systems automatically detect non-reproducing artifacts is still long, though.

How supply chain attacks usually work these days is that an attacker gets the ability to publish new versions for a package, publishes a malicious version, and waits for dependents to update (maybe with the help of Dependabot) or install the latest version ex novo. Not with GitHub Actions! The recommended and most common way to refer to a GitHub Action is by its major version, which is resolved to a git tag that is expected to change arbitrarily when new versions are published. This means that an attacker can instantly compromise every dependent workflow.

This was an unforced error already in 2019, when GitHub Actions launched while Go had already shipped an immutable package system. This has been discussed many times since and most other ecosystems have improved somewhat. A roadmap item for immutable Actions has been silent since 2022. The new immutable releases feature doesn’t apply to non-release tags, and the GitHub docs still recommend changing tags for Actions.

As maintainers, we can opt in to pinning where it’s somehow still not the default. For GitHub Actions, that means using unreadable commit hashes, which can be somewhat ameliorated with tooling.
For npm, it means using npm ci instead of npm install.

One compromise was due to a vulnerability that was already fixed, but had persisted on an old branch. Any time we make a security improvement (including patching a vulnerable Action) on a GitHub Actions workflow, we need to remember to cherry-pick it to all branches, including stale ones. Can’t think of a good mitigation, just yet another sharp edge of GitHub Actions you need to be aware of, I suppose.

There are a number of useful mitigations, but the ones that appear to be as clearly a professional responsibility as memory safety are phishing-resistant authentication; not handing over access to attackers; and avoiding privileged attacker-controlled GitHub Actions triggers (e.g. pull_request_target).

This research was part of an effort to compile a Geomys Standard of Care that amongst other things mitigates the most common security risks to the projects we are entrusted with. We will publish and implement it soon; to keep up to date, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @filippo@abyssdomain.expert.

On Saturday, between 250,000 and 1,000,000 people (depending on who you believe, 0.4–1.7% of the whole population of Italy) took part in a demonstration against the genocide unfolding in Gaza. Anyway, here's a picture of the Archbasilica of San Giovanni in Laterano at the end of the march.

My work is made possible by Geomys, an organization of professional Go maintainers, which is funded by Smallstep, Ava Labs, Teleport, Tailscale, and Sentry. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.) Here are a few words from some of them!
Teleport: For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews.

Ava Labs: We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

[1] https://github.com/nrwl/nx/security/advisories/GHSA-cxm3-wv7p-598c#:~:text=20%20AM%20EDT-,Attack%20Vector,-Vulnerable%20Workflow
[2] https://github.com/reviewdog/reviewdog/issues/2079
[3] https://github.com/tj-actions/changed-files/issues/2464#issuecomment-2727020537
[4] https://www.synacktiv.com/publications/github-actions-exploitation-dependabot
[5] https://github.com/module-federation/core/pull/3324
[6] https://github.com/module-federation/core/tree/c3aff14a4b9de2588122ec24cf456dc1fdd742f0/.github/workflows
[7] https://github.com/web-infra-dev/rspack/issues/8767#issuecomment-2563345582
[8] https://www.praetorian.com/blog/compromising-bytedances-rspack-github-actions-vulnerabilities/
[9] https://github.com/prettier/eslint-config-prettier/issues/339#issuecomment-3090304490
[10] https://github.com/rust-lang/crates.io/discussions/11889#discussion-8886064
