Posts in Go (20 found)

Go proposal: Secret mode

Part of the Accepted! series, explaining the upcoming Go changes in simple terms. Automatically erase used memory to prevent secret leaks. Ver. 1.26 • Stdlib • Low impact

The new runtime/secret package lets you run a function in secret mode. After the function finishes, the runtime immediately erases (zeroes out) the registers and stack it used. Heap allocations made by the function are erased as soon as the garbage collector decides they are no longer reachable. This helps make sure sensitive information doesn't stay in memory longer than needed, lowering the risk of attackers getting to it. The package is experimental and is mainly for developers of cryptographic libraries, not for application developers.

Cryptographic protocols like WireGuard or TLS have a property called "forward secrecy": even if an attacker gains access to long-term secrets (like a private key in TLS), they shouldn't be able to decrypt past communication sessions. To make this work, session keys (used to encrypt and decrypt data during a specific communication session) need to be erased from memory after they're used. If there's no reliable way to clear this memory, the keys could stay there indefinitely, which would break forward secrecy.

In Go, the runtime manages memory, and it doesn't guarantee when or how memory is cleared. Sensitive data might remain in heap allocations or stack frames, potentially exposed in core dumps or through memory attacks. Developers often have to use unreliable reflection "hacks" to zero out internal buffers in cryptographic libraries. Even so, some data might still stay in memory where the developer can't reach or control it. The solution is to provide a runtime mechanism that automatically erases all temporary storage used during sensitive operations, making it easier for library developers to write secure code without workarounds.
The proposal adds the runtime/secret package, whose Do function runs the given function in secret mode. The current implementation has several limitations:

- Only supported on linux/amd64 and linux/arm64. On unsupported platforms, Do invokes the function directly.
- Protection does not cover any global variables that the function writes to.
- Trying to start a goroutine within Do causes a panic.
- If the function calls runtime.Goexit, erasure is delayed until all deferred functions are executed.
- Heap allocations are only erased if ➊ the program drops all references to them, and ➋ the garbage collector then notices that those references are gone. The program controls the first part, but the second part depends on when the runtime decides to act.
- If the function panics, the panicked value might reference memory allocated inside Do. That memory won't be erased until (at least) the panicked value is no longer reachable.
- Pointer addresses might leak into data buffers that the runtime uses for garbage collection. Do not put confidential information into pointers.

The last point might not be immediately obvious, so here's an example. If an offset in an array is itself secret (say, you have a buffer and the secret key always starts at some fixed offset), don't create a pointer to that location. Otherwise, the garbage collector might store this pointer, since it needs to know about all active pointers to do its job. If someone launches an attack to access the GC's memory, your secret offset could be exposed.

The package is mainly for developers who work on cryptographic libraries. Most apps should use higher-level libraries that call secret.Do behind the scenes. As of Go 1.26, the package is experimental and must be enabled with a GOEXPERIMENT flag at build time.

You can use secret.Do to generate a session key and encrypt a message using AES-GCM. Note that secret.Do protects not just the raw key, but also the cipher structure (which contains the expanded key schedule) created inside the function. This is a simplified example, of course — it only shows how memory erasure works, not a full cryptographic exchange. In real situations, the key needs to be shared securely with the receiver (for example, through key exchange) so decryption can work.

𝗣 21865 • 𝗖𝗟 704615 • 👥 Daniel Morsing, Dave Anderson, Filippo Valsorda, Jason A. Donenfeld, Keith Randall, Russ Cox

Anton Zhiyanov 5 days ago

Gist of Go: Concurrency internals

This is a chapter from my book on Go concurrency, which teaches the topic from the ground up through interactive examples. Here's where we started this book:

Functions launched with the go keyword are called goroutines. The Go runtime juggles these goroutines and distributes them among operating system threads running on CPU cores. Compared to OS threads, goroutines are lightweight, so you can create hundreds or thousands of them.

That's generally correct, but it's a little too brief. In this chapter, we'll take a closer look at how goroutines work. We'll still use a simplified model, but it should help you understand how everything fits together.

Concurrency • Goroutine scheduler • GOMAXPROCS • Concurrency primitives • Scheduler metrics • Profiling • Tracing • Keep it up

At the hardware level, CPU cores are responsible for running parallel tasks. If a processor has 4 cores, it can run 4 instructions at the same time — one on each core.

At the operating system level, a thread is the basic unit of execution. There are usually many more threads than CPU cores, so the operating system's scheduler decides which threads to run and which ones to pause. The scheduler keeps switching between threads to make sure each one gets a turn to run on a CPU, instead of waiting in line forever. This is how the operating system handles concurrency.

At the Go runtime level, a goroutine is the basic unit of execution. The runtime scheduler runs a fixed number of OS threads, often one per CPU core. There can be many more goroutines than threads, so the scheduler decides which goroutines to run on the available threads and which ones to pause. The scheduler keeps switching between goroutines to make sure each one gets a turn to run on a thread, instead of waiting in line forever. This is how Go handles concurrency.

The Go runtime scheduler doesn't decide which threads run on the CPU — that's the operating system scheduler's job.
The Go runtime makes sure all goroutines run on the threads it manages, but the OS controls how and when those threads actually get CPU time.

The scheduler's job is to run M goroutines on N operating system threads, where M can be much larger than N. Here's a simple way to do it: take goroutines G11–G14 and run them; when goroutine G12 gets blocked reading from a channel, put it back in the queue and replace it with G15.

But there are a few things to keep in mind. Let's say goroutines G11–G14 are running smoothly without getting blocked by mutexes or channels. Does that mean goroutines G15–G20 won't run at all and will just have to wait (starve) until one of G11–G14 finally finishes? That would be unfortunate. That's why the scheduler checks each running goroutine roughly every 10 ms to decide if it's time to pause it and put it back in the queue. This approach is called preemptive scheduling: the scheduler can interrupt running goroutines when needed so others have a chance to run too.

System calls

The scheduler can manage a goroutine while it's running Go code. But what happens if a goroutine makes a system call, like reading from disk? In that case, the scheduler can't take the goroutine off the thread, and there's no way to know how long the system call will take. For example, if goroutines G11–G14 in our example spend a long time in system calls, all worker threads will be blocked, and the program will basically "freeze".

To solve this problem, the scheduler starts new threads if the existing ones get blocked in a system call. For example, if G11 and G12 make system calls, the scheduler starts two new threads, E and F, and assigns goroutines G15 and G16 from the queue to these threads. When G11 and G12 finish their system calls, the scheduler will stop or terminate the extra threads (E and F) and keep running the goroutines on four threads: A-B-C-D.

This is a simplified model of how the goroutine scheduler works in Go.
If you want to learn more, I recommend watching the talk by Dmitry Vyukov, one of the scheduler's developers: Go scheduler: Implementing language with lightweight concurrency (video, slides).

We said that the scheduler uses N threads to run goroutines. In the Go runtime, the value of N is set by a parameter called GOMAXPROCS. The GOMAXPROCS runtime setting controls the maximum number of operating system threads the Go scheduler can use to execute goroutines concurrently. It defaults to the number of logical CPUs on the machine, as reported by runtime.NumCPU. Strictly speaking, the default is either the total number of logical CPUs or the number allowed by the CPU affinity mask, whichever is lower. It can also be adjusted by the CPU quota, as explained below. For example, on my 8-core laptop, the default value of GOMAXPROCS is also 8.

You can change GOMAXPROCS by setting the GOMAXPROCS environment variable or by calling the runtime.GOMAXPROCS function. You can also undo the manual changes and go back to the default value set by the runtime: to do this, use the runtime.SetDefaultGOMAXPROCS function (Go 1.25+).

Go programs often run in containers, like those managed by Docker or Kubernetes. These systems let you limit the CPU resources for a container using a Linux feature called cgroups. A cgroup (control group) in Linux lets you group processes together and control how much CPU, memory, and network I/O they can use by setting limits and priorities. For example, you can limit a Docker container to four CPUs with the --cpus flag.

Before version 1.25, the Go runtime didn't consider the CPU quota when setting the GOMAXPROCS value. No matter how you limited CPU resources, GOMAXPROCS was always set to the number of logical CPUs on the host machine. Starting with version 1.25, the Go runtime respects the CPU quota. So, the default GOMAXPROCS value is set to either the number of logical CPUs or the CPU limit enforced by cgroup settings for the process, whichever is lower.

Note on CPU limits. Cgroups actually offer not just one, but two ways to limit CPU resources: a quota and shares. Docker's --cpus and --cpu-period/--cpu-quota flags set the quota, while --cpu-shares sets the shares.
Kubernetes' CPU limit sets the quota, while CPU request sets the shares. Go's runtime only takes the CPU quota into account, not the shares. Fractional CPU limits are rounded up. On a machine with multiple CPUs, the minimum default value for GOMAXPROCS is 2, even if the CPU limit is set lower. The Go runtime automatically updates GOMAXPROCS if the CPU limit changes; this happens up to once per second (less frequently if the application is idle).

Let's take a quick look at the three main concurrency tools in Go: goroutines, channels, and select.

A goroutine is implemented as a pointer to a g structure. The structure has many fields, but most of its memory is taken up by the stack, which holds the goroutine's local variables. By default, each stack gets 2 KB of memory, and it grows if needed. Because goroutines use very little memory, they're much more efficient than operating system threads, which usually need about 1 MB each. Their small size lets you run tens (or even hundreds) of thousands of goroutines on a single machine.

A channel is implemented as a pointer to an hchan structure. The buffer array (buf) has a fixed size (dataqsiz, which you can get with the cap builtin). It's created when you make a buffered channel. The number of items in the channel (qcount, which you can get with the len builtin) increases when you send to the channel and decreases when you receive from it. The close builtin sets the closed field to 1. Sending an item to an unbuffered channel, or to a buffered channel that's already full, puts the goroutine into the sendq queue. Receiving from an empty channel puts the goroutine into the recvq queue.

The select logic is implemented in the selectgo function. It's a huge function that takes a list of select cases and (very simply put) works as follows:

✎ Exercise: Runtime simulator. Practice is crucial for turning abstract knowledge into skills; theory alone isn't enough. The full version of the book contains a lot of exercises — that's why I recommend getting it.
If you are okay with just theory for now, let's continue.

Metrics show how the Go runtime is performing, like how much heap memory it uses or how long garbage collection pauses take. Each metric has a unique name and a value, which can be a number or a histogram. We use the runtime/metrics package to work with metrics: metrics.All lists all available metrics with descriptions, and metrics.Read gets the value of a specific metric. Here are some goroutine-related metrics.

In real projects, runtime metrics are usually exported automatically with client libraries for Prometheus, OpenTelemetry, or other observability tools. The exported metrics are then collected by Prometheus, visualized, and used to set up alerts.

Profiling helps you understand exactly what the program is doing, what resources it uses, and where in the code this happens. Profiling is often not recommended in production because it's a "heavy" process that can slow things down. But that's not the case with Go: Go's profiler is designed for production use. It uses sampling, so it doesn't track every single operation. Instead, it takes quick snapshots of the runtime every 10 ms and puts them together to give you a full picture.

Go supports several kinds of profiles. The easiest way to add a profiler to your app is by importing the net/http/pprof package: when you import it, it automatically registers HTTP handlers (under /debug/pprof/) for collecting profiles. Or you can register the profiler handlers manually. After that, you can start profiling with a specific profile by running the go tool pprof command with the matching URL, or just open that URL in your browser. For the CPU profile, you can choose how long the profiler runs (the default is 30 seconds); other profiles are taken instantly. After running the profiler, you'll get a binary file that you can open in the browser using the same go tool pprof utility.

The pprof web interface lets you view the same profile in different ways.
In the pprof web interface, my personal favorites are the flame graph, which clearly shows the call hierarchy and resource usage, and the source view, which shows the exact lines of code.

You can also profile manually. To collect a CPU profile, use the runtime/pprof functions pprof.StartCPUProfile and pprof.StopCPUProfile. To collect other profiles, use pprof.Lookup. Profiling is a broad topic, and we've only touched the surface. To learn more, start with these articles:

Tracing records certain types of events while the program is running, mainly those related to concurrency and memory. If you enabled the profiling server as described earlier, you can collect a trace using the /debug/pprof/trace?seconds=N URL. Trace files can be quite large, so it's better to use a small N value. After tracing is complete, you'll get a binary file that you can open in the browser using the go tool trace utility. In the trace web interface, you'll see each goroutine's "lifecycle" on its own line. You can zoom in and out of the trace with the W and S keys, and you can click on any event to see more details. You can also collect a trace manually.

Flight recording is a tracing technique that collects execution data, such as function calls and memory allocations, within a sliding window that's limited by size or duration. It helps to record traces of interesting program behavior, even if you don't know in advance when it will happen. The trace.FlightRecorder type (Go 1.25+) implements a flight recorder in Go. It tracks a moving window over the execution trace produced by the runtime, always containing the most recent trace data. Using it looks like this: first, configure the sliding window; then create the recorder and start it; continue with the application code as usual; finally, save the trace snapshot to a file when an important event occurs, and use go tool trace to view it in the browser.

✎ Exercise: Comparing blocks. Practice is crucial for turning abstract knowledge into skills; theory alone isn't enough. The full version of the book contains a lot of exercises — that's why I recommend getting it.
If you are okay with just theory for now, let's continue.

Now you can see how challenging the Go scheduler's job is. Fortunately, most of the time you don't need to worry about how it works behind the scenes — sticking to goroutines, channels, select, and other synchronization primitives is usually enough.

This is the final chapter of my "Gist of Go: Concurrency" book. I invite you to read it — the book is an easy-to-understand, interactive guide to concurrency programming in Go. Pre-order for $10 or read online.

The basic scheduling algorithm described earlier:

1. Put all goroutines in a queue.
2. Take N goroutines from the queue and run them.
3. If a running goroutine gets blocked (for example, waiting to read from a channel or waiting on a mutex), put it back in the queue and run the next goroutine from the queue.

The two cgroup CPU limits:

- CPU quota — the maximum CPU time the cgroup may use within some period window.
- CPU shares — relative CPU priorities given to the kernel scheduler.

How select chooses a case:

1. Go through the cases and check if the matching channels are ready to send or receive.
2. If several cases are ready, choose one at random (to prevent starvation, where some cases are always chosen and others are never chosen).
3. Once a case is selected, perform the send or receive operation on the matching channel.
4. If there is a default case and no other cases are ready, pick the default.
5. If no cases are ready, block the goroutine and add it to the channel queue for each case.

Goroutine-related metrics:

- Count of goroutines created since program start (Go 1.26+).
- Count of live goroutines (created but not finished yet). An increase in this metric may indicate a goroutine leak.
- Approximate count of goroutines running or blocked in a system call or cgo call (Go 1.26+). An increase in this metric may indicate problems with such calls.
- Approximate count of goroutines ready to execute, but not executing (Go 1.26+). An increase in this metric may mean the system is overloaded and the CPU can't keep up with the growing number of goroutines.
- Approximate count of goroutines executing (Go 1.26+). Always less than or equal to GOMAXPROCS.
- Approximate count of goroutines waiting on a resource — I/O or sync primitives (Go 1.26+). An increase in this metric may indicate issues with mutex locks, other synchronization blocks, or I/O.
- The current count of live threads owned by the runtime (Go 1.26+).
- The current GOMAXPROCS setting — the maximum number of operating system threads the scheduler can use to execute goroutines concurrently.

Profiles supported by Go:

- CPU. Shows how much CPU time each function uses. Use it to find performance bottlenecks if your program is running slowly because of CPU-heavy tasks.
- Heap. Shows the heap memory currently used by each function. Use it to detect memory leaks or excessive memory usage.
- Allocs. Shows which functions have allocated heap memory since the profiler started (not just currently). Use it to optimize garbage collection or reduce allocations that impact performance.
- Goroutine. Shows the stack traces of all current goroutines. Use it to get an overview of what the program is doing.
- Block. Shows where goroutines block waiting on synchronization primitives like channels, mutexes, and wait groups. Use it to identify synchronization bottlenecks and issues in data exchange between goroutines. Disabled by default.
- Mutex. Shows lock contention on mutexes and internal runtime locks. Use it to find "problematic" mutexes that goroutines are frequently waiting for. Disabled by default.

Further reading: Profiling Go Programs, Diagnostics.

Events recorded by tracing: goroutine creation and state changes; system calls; garbage collection; heap size changes.

Carlos Becker 5 days ago

OpenSource Fridays Brasil

I was in a live stream with Pachi Parra , talking a bit about my background, and about GoReleaser.

Stratechery 6 days ago

An Interview with Atlassian CEO Mike Cannon-Brookes About Atlassian and AI

Good morning,

This week’s Stratechery Interview is with Atlassian founder and CEO Mike Cannon-Brookes. Cannon-Brookes and Scott Farquhar — whom I interviewed in 2017 — founded Atlassian in 2002; their first product was Jira, a project and issue-tracking tool, followed by Confluence, a team collaboration platform. Atlassian, thanks in part to their location in Australia, pioneered several critical innovations, including downloadable software and a self-serve business model; over the ensuing two decades Atlassian has moved to the cloud and greatly expanded their offering, and is now leaning into AI. In this interview we discuss that entire journey, including Cannon-Brookes’ desire to not have a job, how the absence of venture capital shaped the company, and how the company’s go-to-market approach has evolved. We then dive into AI, including why Cannon-Brookes believes that there will be more developers doing more, and why Atlassian’s position in the enterprise lets them create compelling offerings. Finally we discuss Atlassian’s sponsorship of Williams, the F1 race team, and why Cannon-Brookes thinks they can both help Williams win and also accrue big benefits for Atlassian.

To repeat a disclosure I have long made in my Ethics Statement, I did, in the earliest years of Stratechery, take on consulting work for a limited number of companies, including Atlassian. And, for what it’s worth, I’m also a huge F1 fan! Go Max. As a reminder, all Stratechery content, including interviews, is available as a podcast; click the link at the top of this email to add Stratechery to your podcast player. On to the Interview:

This interview is lightly edited for content and clarity.

Mike Cannon-Brookes, welcome to Stratechery.

MCB: Thank you for having me, Ben.

So this is admittedly a new experience for me, I’ve already interviewed the founder of Atlassian, but it wasn’t you. I’m of course referring to Scott [Farquhar]. That was eight years ago, actually, before I even had podcasts.
It was very brief, but hey, like I said, new experiences.

MCB: That’s true. That’s true. And you wrote a consulting paper for us in 2014!

I was going to disclose, yes, in the very brief period where I did consulting work, you flew me down to Sydney for a week, I had a chance to learn a lot about Atlassian. And on a personal note, that consulting contract helped me a lot, that was when I was just starting. It’s funny how small the numbers seem in retrospect, but maybe that’s why I’ve shied away from writing about you too much over the years, because it meant a lot to me. So I appreciate it, and there’s my disclosure for the interview.

MCB: Thank you. It’s a good piece of work. Don’t forget, ironically, we started as a consulting and services business and then decided that software was a better business model, so I think you did the same thing. You went the scalability route instead of the consulting work via Sydney.

Absolutely. I’m not doing anything that doesn’t scale anymore, but I did love visiting Sydney, so it was great.

MCB: Still, we pulled out the old consulting paper you wrote for us in 2014. Why are we going to win, why are we going to lose, everything else, it was classic Ben work.

Was it good?

MCB: It’s pretty good!

It’s interesting, I’d probably be embarrassed if I read it today. Anyhow, the good news is that since it’s the first time I’m interviewing you, I do get to do my favorite segment, which is learning more about you. Where did you grow up, but also, where were you born? I know they were different places. Then, how’d you get interested in technology and what’s your version of the Atlassian origin story?

MCB: Sure, I feel like I’ve heard this question 1,000 times! Where to start? My dad was in banking, he joined the glorious institution that is Citibank today, from England. Parents are both from Cambridge and bounced around the world a lot as part of that job.
Took the, “Hey, we need someone to go to this country”, and he was like, “I’ll take that”. So I was born in America, and for a period I lived in New York. To be honest, I lived there for three months before I moved to Taiwan.

Really? Whoa. I didn’t know that.

MCB: Yeah, in 1980, when it was very different than what it is today.

Yeah. Were you saving that to drop on me? I had no idea. I thought you went straight from America to Australia.

MCB: I only just thought about it about 30 seconds ago, actually. No, I went to Taiwan for a few years, lived in Hong Kong for a few years, went to Australia for a few years. So how I got into technology is actually related, because my parents were moving around so much. The logic was, being English, that they would send us to English boarding schools once we got old enough, and that would be a stable thing while they were moving. So at the mighty age of seven, I was put on Qantas and sent to England and back four times a year to go to boarding school in England for about five, six years. Because of that boarding school, I have one of the lowest frequent flyer numbers in Australia: they introduced the frequent flyer program at the end of my year one or year two. I get given this catalog by my parents saying how you’ve earned all these points, “What do you want to buy?”, and it’s like, “I don’t know, trips, winery things, booze”. I’m flicking through this catalog and I’m like, “There’s literally nothing in this catalog” of the gear I wanted, and at the back is this computer, so I was like, “I guess I’ll get that”.

The only thing that was potentially age appropriate.

MCB: That was the only thing in the catalog, I didn’t want a toaster, I didn’t want wine, so that became my first computer, the mighty Amstrad PC20. Four colors, no hard drive.
Eventually, I bought an external floppy drive, so you could put in two, and did buy magazines and type in programs and write games and stuff from magazines and play with it; played a lot of video games, basically, back in that era. I was into computers peripherally all through high school. I came back to Australia at 12, my parents had settled here by then and weren’t moving, and so I came back here, did all high school and university here.

In high school, I was always going to be an architect, that was my dream the entire way through, but come the end of grade 12, I applied for a bunch of university scholarships, ended up getting one, and so I thought, “Oh, well, maybe I’ll take that”, and it was in a course called BIT. Basically, half computer science, half finance and economics, but it was 15 grand a year, tax-free, so I was like, “Well, I’ll do that for a while and go back to the architecture thing”. Of course, famously in that scholarship, I met my first business partner of my first startup, met my second business partner of the second startup (they went in radically different directions in terms of outcome), but it was just 30 kids right at the right time, did the dot-com era thing.

Now, ironically, as a part of that scholarship, you had to spend six months in three industrial placements, so the origin story of Atlassian comes from then a little bit, because those industrial placements were so boring. Scott spent six months installing Windows at a large corporate, and he was crazy freaking smart and it was like, “Hey, go from computer to computer and upgrade to Windows 98”, or whatever it was. It was like, “Guys, this is our life, this is going to be horrible”. I worked for Nortel Bay Networks, which was, at the time, a massive competitor to Cisco that then completely disappeared, so a good tech lesson in and of itself. I basically cataloged a room full of networking gear and routers; it was mind-numbingly boring.
So towards the end of the university course, I famously sent an email to a few people saying, “Look, I don’t really want to get a real job, why don’t we start a company and we’ll try some stuff?”.

And this was after the dot-com era? This was the early 2000s?

MCB: This was after the dot-com era, yeah. So I lived through the dot-com era, actually, as a journalist and writer, an analyst in technology. I worked for a company called Internet.com, which became Jupiter Media and Jupiter Research, and that was great, that was an amazing era for me. We ran events, newsletters, what would’ve been podcasts, didn’t have them back then. And we ran events (Mobile Monday, I think one of them was called) and it was all about WAP and—

Well, the real secret is you’re not the only one. There are some founders that are very successful that are like, “Look, I just want to pontificate about technology”.

MCB: A little bit like you, I remember getting in a lot of trouble from some of the startups, because some company would launch and I wrote basically 500 words on, “This thing’s never going to work, this is a disaster of an idea”, and they would ring up and yell at my boss and he was awesome, he’d be like, “Dude, just keep writing what you think”, and it didn’t make you very popular as a journalist type. Anyway, emailed some people, tried to start a business, we didn’t actually know what we were going to do. Atlassian has, I always tell people, a terrible origin story. You should not copy us.

You just didn’t want to be installing Windows or upgrading software.

MCB: We literally did not want to get a real job. And Scott replied and said, “Yeah, sure, I’m in for trying that”. He was one of the smartest kids in our class and his nickname is Skip, because he was the president of our student association and always a leader type and Eagle Scout and everything else, so we’re like, “Yeah, okay, let’s do that, we’re good mates” — and that started Atlassian.
We picked the name in about five minutes, which, if you consulted any branding company, would not have been chosen. Ironically, originally we were going to do customer service and consulting, that was what the gig was. Hence the name, because Atlas was a Greek titan whose job was to stand on top of the Atlas Mountains and hold up the sky, that’s what he was supposed to be doing. He was a bad guy, so his punishment was to hold the sky up, and we thought that was an act of legendary service, and so we were going to provide legendary service by holding up the sky for customers. As I said, we did the service thing for about six months and decided that this is a terrible business. People were paying us $350 US to answer their questions; it didn’t scale, and it was at crazy hours of the morning and night and everything else. So in the meantime, we wrote the first version of what became Jira. We actually wrote three pieces of software: one was a knowledge basey type tool, one was a mail archiving tool for groups, so you could see each other’s email as a shared archive.

And were you seeing this as you were building tools for yourself, for your consulting business?

MCB: Literally, yes, exactly. So all three were tools that we needed for ourselves. People would email us and I couldn’t see Scott’s email and he couldn’t see mine at the time and it was like, this is silly, and we built Jira to handle questions and issues and problems that we were having ourselves. That became a teeny bit popular. There was this glimmer that someone else cared, so we poured all the effort into that.

What was that? What was the glimmer? Because this is when Agile is taking over software development, and at least the legend is Jira and Agile go hand in hand, is that a correct characterization?

MCB: A little bit, but this is actually pre-Agile. So Jira comes out before Agile is even a thing. I think it was about two or three years before we had any version of marketing or feature sets that involved Agile.
This was just, at the time, a web-based bug tracker. So the interesting evolution part of the company, obviously, is it started as a bug tracker for software developers, it became an issue tracker for technology teams, and now it’s like a business workflow for tens of millions of people every day across the world, most of whom have nothing to do with technology, so it’s gone on its own evolution.

Would anything have been different if this was the plan from the beginning, or did it have to be this organic, “We’re figuring it out as we go along as we’re running away from Windows installations”, sort of story?

MCB: I think, look, obviously, if we could choose to follow in our own footsteps, the Back to the Future skeptic in me would say it’s gone pretty well, so I’d follow every single footstep I took.

(laughing) Yep, totally.

MCB: And that would’ve become the plan. But look, we had two hunches really, which both turned out to be radically correct. Now, I would say we were following waves or whatever else, but one was that the Internet would change software distribution, which sounds ridiculous now, and when I talk to graduates nowadays, I have to put them in the right time and place and say, “Look, when we started, software was distributed on a CD”. BEA WebLogic was the bee’s knees, and you used to have to get it on a CD if you were lucky. If not, someone would come and install it for you, and that’s how software was distributed. We made that CD into a ZIP file and put it on the Internet for people to download. You didn’t access it like a SaaS application, you literally downloaded it from our website.

Right. It’s funny that when you first say that, it’s like, “Oh, it’s completely transformative”, well, but you were an on-premises software story. But actually, no, there are several steps to getting to SaaS, one of which is just downloading software.
MCB: And we had people call us before they would download to check that we were real and stuff, and I'm like, "Why don't you just download the damn ZIP file?", and that also dates us. Well, maybe I'll get to the business model part, but the second hunch was that we thought open source would change software costs. So we had this big hunch, we were both writing a bunch of open source code at the time. Open source was a massive movement, especially in the Java space. Embarrassingly, I actually wrote a book with some mates called Open Source Java Programming that you can still find on Amazon. We sold a few thousand copies, I think, but I swore I'd never write a book again, it was a very painful experience. Thank you, you're validating my life decisions . MCB: Yeah. Open source did bring the cost of building software down radically. We were writing a very small layer, 5% of the code at best, on top of masses of amazing open source libraries, and we contributed to those libraries, but we could deliver an amazing experience for a very low cost. We learned a lot about pricing and packaging. So what was the implication of that hunch though? Just that the market for developers, that would subsequently mean there was more software? MCB: A little bit, that was part of the implication of the hunch. Largely for us, it was that the cost was going down. Pre-open source, you had to write everything, so if Jira was, back then, I don't know, a million lines of code, if you added all the open source libraries together, it was 25, 30, 40 million lines of code. Software was so big that it was so expensive, because you had to write all of that. Think of Windows: they wrote everything, the networking stack, there were no libraries, there was no open source involved in the original versions, it was all written by Microsoft. So the cost of that was very high, and then you had to charge a lot of money.
So we thought, look, if we could take all these amazing open source libraries, contribute back to them — we were a great open source citizen — and build a piece of proprietary software on top of them that solved customers' problems, we could deliver that really cheaply. In fact, the original versions of Jira we sold were $800, unlimited users, unlimited use, with no lifespan. So it was just 800 bucks, one-time fee, forever, and we learned a lot about pricing and packaging firstly, but secondly, it was very simple: our goal in the early days was to sell one copy a week to stay alive, that was it. Some weeks, we'd sell two copies. $1,600 US would roll in and we'd be like, "Cool, we've got another week to survive", and then one copy a week became two and two became five and five became ten, and now it's hundreds of thousands. Well, wasn't the thing that you just didn't want to have a job? Because I love this part of the story: when I started Stratechery, I had a job at Microsoft that paid, I think, $104,000 or something like that. I'm like, "I just want to make that, because I don't want to work for a corporation, so if I could just get to there, it'll be great". MCB: We had exactly the same set of goals. There were a few things: we wanted to make somewhere that we wanted to go to work. I wanted to get up every day and think, "I want to go to work", and weirdly, almost 24 years later, I love coming to work, so a tick, achieved. We wanted to make it so we didn't have to wear a suit, neither of us really likes wearing suits at all — in fact, it's a bit of an allergic reaction often — and so tick, don't turn up to work in a suit every day. And thirdly, most of our friends, this is right when IBM bought PwC, ironically, so out of the 30-odd kids in our class, maybe 10 went to IBM as consultants and 10 went to PwC, and then they all ended up going to the same shop, and their grad salary there was $47,600.
So our goal for year one was to end the year making at least a grad salary and convince ourselves we weren't crazy, kind of thing, and we smashed that goal, so that was good. The Internet, the distribution part, is important, and I know that's one of your favorite topics. Tell me about that, along with the business model, because again, this goes back so far, I don't think people appreciate that this entire idea of self-serve or bottoms-up selling, this is really where it all started. MCB: Yes. And look, a few things. Firstly, if you come from Australia, we're an exporting nation. "We're built on the sheep's back" is a phrase, Australia's built on the sheep's back. What that really means is because we were this colony originally, then a country on the far side of the world, anything we did to make money largely had to leave the country and go somewhere else. Originally, it was a struggle to find a product that could do that. "Built on the sheep's back" is because wool was the first product that could: you could put it on a wooden boat, because it wasn't very heavy, and you could ship it a long distance, because it kept really well, so we could make sheep's wool and make money as a country by shipping it back to Europe, and it could survive the journey, and so the country was built on the sheep's back. We are a massive exporting nation. Trump brings in his tariffs, we're the only country with a negative rate of return, we have a positive trade relationship with America, and we're like, "Wait a second, why did we get taxed?". So obviously, it's rocks, technology, everything; as a country, we build and export everything we make. So our mentality was, "Well, if we're going to make money, it's going to be overseas", that was the first thing: "Okay, it's going to be somewhere else, it's not going to be Australians buying our software", and so the Internet allowed us to do this.
We put up a shopfront, an early website, and people could come to our website, download our software, and then we just needed a way to get paid for it. The problem was, in order to do that given the trust barriers of the Internet, we had to have a very low price and we had to have a fully installable offering. So we spent so much time on making it installable, documentation, "How would you get yourself up and running and try it?" — the software, as we put it, had to sell itself. Our software had to be bought, not sold. We didn't have any salespeople, we couldn't travel to your office in Sweden or London and help you out with it. For $800, we couldn't have done that, and secondly, it didn't make any sense. So the evolution was, "Okay, the only possible path we can go down is we have to figure out how to get people to do this". Now, it turns out once you have figured out how to do that, it's an incredibly powerful motor, because you have lots of people coming, you have a very cheap piece of software for its relative performance, and you get people using it in all these big businesses all over the place. I would say 50% of the customers I go meet nowadays (I probably meet a handful of customers, a couple a day on average, kind of thing), many of those have been a customer for 20 years, 22 years, 23 years. How many customers have been a customer 23 years? I'm like, that's crazy, we're only 24 years old. That's awesome. MCB: And so they downloaded very early, they didn't download as all of the company, but all of them are customers. Just one guy who's like, "I need a way to track my issues". MCB: Exactly. It was some guy in a backroom who needed to track it. I know the Cisco origin story, that was literally a guy, he's still there, he's been there 22, 23 years, he's awesome. And they started with just, "I just needed a way to manage my issues for 10 people", and now it's hundreds of thousands of people, seats that we have there, it's kind of grown over time.
How did we know that the business model was working? Again, it dates us a lot. This didn't mean we didn't answer questions, we were big on customer service and helping people, and email was the way to do that. A bit of IRC back then, we had a channel you could log into and we'd help you. But the first customer — we used to walk into the office in the morning and we had a fax machine with literally rolls of paper. So if you wanted to pay for this distributed software (this shows how old we are: there were no SSL keys, I heard you complaining about it the other day, totally agree with that era), you had to download a PDF off our website, which was pretty modern in that it was a PDF, fill in your credit card details, and fax it to us. That is how you paid when we started. So we would walk in in the morning and there'd be these rolls of paper on the ground, and you'd be like, "Ah, sweet, someone bought something", you know what I mean? It became a weird dopamine drug for us. The very first company was American Airlines… MCB: About six months in, we came in one morning and there was a fax on the ground with $800 and a credit card number written on it, and we had never talked to American Airlines, they had never emailed us, they had never asked for customer service, they'd never gone on IRC, they had never talked to us in any way, shape or form. Man, this thing could work, you just made $800 out of the air. MCB: I mean, there was a lot of pre-work to get them there, but obviously that was kind of different. Then secondarily, as you wrote, I'm just trying to finish a very long answer here, we started Confluence in 2004, and those two became the dual engines, and both of those I think were probably major moments. I often say Confluence is the bigger moment, actually. The business model was kind of established, this is two years into the business.
We made, I think, $800 grand in year one, $1.6 million in year two, maybe $5 million in year three, and $12 million in year four, if I remember the revenue numbers. So the thing was working really well. You're the company that's the Microsoft heir in some respects, in that you took venture eventually, but didn't really need to, just pure bottoms-up. You and Scott were able to keep a huge portion of the company because of that, it's an amazing story that is, I think, under-told in some respects. MCB: Yeah, well, we actually did. I mean, we did and didn't. So the venture story is one of my favorites because it describes how we think from first principles. Firstly, the first institutional capital we put on the balance sheet (I guess you could argue our initial, I don't know, $10 grand each was some money) was in the IPO . So in 2015, when we went public, that was the first capital that went into the business, all time. We took two rounds of funding, one in 2010 and one in 2013, but both of which were sales by employees: the first by the founders and the second by a large number of employees who had bought in, so both of those firms bought ordinary stock. Secondary shares basically, yeah. MCB: They bought ordinary stock, there were no preferences, there was no anything, that was kind of the way it was. And we love the Accel guys that invested, it's kind of funny because their business model was wildly wrong, we now have their original spreadsheets and stuff. We're 15 years in, we know them really, really well. They wanted us to grow: I think we had to grow at 30% for two years, 20% the year after, something like that, to double or triple their money, and at the time they put in $60 mil US . That was the largest investment I think Accel had ever made in anything in the software, digital kind of world, and it was this massive bet.
It was a one-page term sheet for ordinary stock, so credit to those two partners, who took a massive risk on us and had to fight their GC and everybody else to do this unusual funding round, and I think we did 50% growth the first year, and our CAGR since then is probably 40%. Yeah, it worked out pretty well. MCB: They did very well. I think their 2-3x was more like a 300x or something. You mentioned the Confluence moment. Why was that a big deal? Usually the story is you have one product and you need to focus, and you're two years old, you're launching a completely new product. Is that the aspect you're referring to? MCB: Yes, I think it comes down to being bootstrapped. Look, we spent nine years convinced we were going to die every day, there was just such a mentality that this thing was all going to fall over and we'd better work harder and keep going. The Confluence moment was important because, I remember, I don't know exactly, but sometime around then we understood venture capital. Firstly, on the venture capital side, because the two do relate to each other, there was no VC available in 2001 and 2002 in Australia. We're in a nuclear winter, we're two idiots with no credibility. Right. You could barely get funded in San Francisco, you're not going to get funding in Sydney. MCB: No, because in 2001, you weren't even getting San Francisco funding because the whole dot-com bust had just happened, no one was getting funded anyway. We're in Australia and we have no credibility, so we didn't even bother. Literally, 2010, when we went to the Accel thing and we talked to five VCs, was the first time we'd ever pitched the business. It was just not a thing, people don't understand. We used to say we were customer-funded when people would ask the always awkward question of, "Where's your funding come from?", we were like, "We're customer-funded". They go, "Oh, okay". Lifestyle business!
MCB: But we did understand venture capital, massive readers — I have a library full of technical books, books about technology and the industry and history and stuff from that magic era of airport bookstores. We read every issue of Red Herring and Industry Standard and Wired Magazine, I have just this huge library, so, voracious readers. One thing we understood about venture capital is that they put portfolio theory on their side — and I'm a big fan of venture capital, I should say, I'm the chair of Australia's biggest VC fund, and that's my other mate that I met in university, Niki Scevak . But we wanted portfolio theory on our side. We'd done finance and economics, we had one product, and that was highly risky if you're bootstrapped. So there was a little bit of the thinking that actually, if we have two products, our chances of total failure are less, one of them can fail and we'll be okay, and so we started a second product. Yes, arguably it was hard, but our first one was going all right, it was making, I don't know, five million bucks a year, and we had a handful of really awesome backpacker programmers. And the early people, it's like a whole band of misfits that somehow made this thing work, and we were having a lot of fun, we were working really hard, and so we made another internal tool that became Confluence. Being adjacent but very different, selling to different audiences, but having a lot of overlap — if you bought one, there was a good reason to have the other one, no matter which way you started — it became a really good symbiotic loop of these two engines that powered us for a very long time. So it was more a case of reducing our risk, actually, than anything else. Wasn't it risky to be splitting your resources, or did that not even occur to you? MCB: I don't think it occurred to us, no.
It was more about splitting our risk, and we were doing pretty well, but it changed the business, because we moved from being the Jira company to a software company, and I say that's probably the most under-understood moment, because we had to learn not how to market Jira, but how to market software, not how to build Jira, but how to build software. So now we have 20, 25 apps in 5 different categories that sell to all sorts of different teams across a business, but we had to become a software company. Microsoft — I don't know if the analogy's really that fair to them, to be honest, or fair to us, it seems massively over-glamorizing what they've achieved, which is amazing, I'm a huge fan of Microsoft — they need to understand how to sell, in their case, Minecraft, SQL Server, Azure, AI. You have to understand the building, the creation of technology, the selling of technology, the marketing of technology at a generic level; it really helped us generify the business. I think if we'd gone too much longer, everybody would've been on the Jira team, it would've been too hard to start a second thing, and instead, we've always been a multi-product company. You just mentioned selling a lot. When did you finally realize or transition away from just being self-serve to actually, "We've got to grow beyond this"? Was it almost like a pivot that came too late because your identity was so wrapped up in the, "We're the self-serve company"? MCB: Look, it's never been a pivot, I get asked this by investors all the time. I would say our go-to-market model and our process has kept evolving pretty much every year or two for 20 years, and I say evolving because we're very aware of the strengths of the model that we came up with and we're very aware of what it takes to power that, and we've been very careful, when we've evolved, changed, added to it, not to destroy the original one. So nowadays, we have two amazing business models, which we call high-touch and low-touch.
So we have the low-touch model, which is literally the same thing as it's always been: hundreds of thousands of people show up every week, they try our software, we want them to have a great experience trying the software, we want to spread it as widely as possible into as many enterprises as we can, and some of those will stick, some of those will get working, and we measure aggressively the rates of return and dollars and flows and funnels and everything else. There's a whole team whose job is to make sure that that's working, at now massive scale. But at the same time, what happened is, as customers got more and more Atlassian software deployed, they wanted a different relationship with us, they wanted a bigger relationship. In those days, as soon as someone was spending $20 grand, we were like, "Oh man, maybe we should talk to these people"; nowadays it's more like around $50 to $100 grand when we'll talk to you. So the lines kept moving for different reasons, and we actually have online sales, then inside sales in between, then the sort of classical someone-gets-on-an-airplane-and-travels-to-you sales. So it's just kept evolving. We talk about the IPO a lot, it's our 10-year anniversary coming up this month, I'm off to New York next week to ring the bell and celebrate 10 years. When we went public, as an example, we had less than 10 companies paying a million dollars a year; now we're well north of 500 in 10 years. That doesn't come without an amazing enterprise sales team, and teams that go out and help customers, and customer success, and all the trappings of a really top-flight enterprise sales organization, because for most of those customers — again, I think it's north of 85% of the Fortune 500 that are deep Atlassian customers — we've become a strategic partner to these businesses. If we go down, rockets don't take off, banks shut down, it's of real critical importance to most of these customers.
How big is your business outside of directly working with developer teams? As I recall, this was part of the consulting thing, you were wanting to do Jira for sales or Jira for all these different sorts of functions; where and how did that evolve? MCB: So it's been a continuum for a long time. Nowadays, less than half of our users are in technology teams, and probably a third of those are developers, less than half of them. So developers are a portion of our audience; it's a very important point of wording. When I talk about this, all the engineers are like, "Hey, you don't care about us anymore", and I'm like, "No, that's not true", that business is a great business, it's just that the rest of our business has grown massively around it. There are not enough developers in the world for our business. Our fundamental value has always been — and it's one of those things that took a decade to realize — firstly, we don't solve technology problems, we never have, we've never had anything that's like, "I care what code you write, which language the code is in, what the code does". We solve collaboration and people problems, we always have solved people problems; even Agile was a people problem. It's not a technology problem, actually, it's a people problem. It's, "How do we organize a group of people to build a piece of technology that best meets the customer's needs and goes off track as little as possible?". That is a collaborative people problem; we've always solved people problems. Our value actually came because there are a lot of tools for technology teams and we never wanted to be in the dev tools business. That's a road of bones, it's very hard to build sustainable competitive advantage in dev tools, the history shows this.
There's just a different company every few years, developers' tastes are fickle — our developers' tastes are fickle, this is not me sledging developers at all. We have a massive R&D arm and that group changes languages every couple of years, they change how they build software every couple of years, they're constantly moving on, they change our analytics tools and everything else, because they are tool builders and toolmakers. That makes sense, but it's a hard place to build a business. Interestingly topical today, so we'll see. But the easier place to build a business in the long term was the level above that, which is the collaboration problems, which started as, "How do we get engineers, designers, product managers, business analysts to all be on the same page about what it is that they're building, and have a repeatable process for that?". It turned out that as the world has become technology-driven, as we say, our customers are technology-driven organizations. If you're a large organization for whom technology is your key distinctive advantage — it doesn't matter whether you're making chips and databases, or whether you're making rockets or cars, or whether you're in financial services or insurance or healthcare; I would argue that for most of the businesses that are great, technology is their key competitive advantage — then you should be our customer, that is it. And what we help you do is help your technology teams and your business teams collaborate across that boundary, because that's actually the hardest boundary. Building great technology is one set of problems; making it work for your customers usually means, in different industries, a different amount of working with all sorts of business people, and that's what Jira did from the very start. Now that's what our whole portfolio, in service management, in strategy, in leadership teams, is about: doing that at different scales and different amounts in different places.
Does it bug you when you get complaints on the Internet of, "Jira's so complicated", "Hard to use", blah, blah, blah? And are you speaking to — the problem being that the problem space you're working in is not the single developer trying to track an issue, it's trying to herd a bunch of cats and get them going in the same direction, and muddling through that is a lot more difficult than it seems? MCB: It bothers me anytime people don't like our software, sure. We've worked for the last 20 years to make it better every day. We'll probably work for the next 20 years to make it better every day and people will still probably be dissatisfied, and that is our fundamental core design challenge. There are a few reasons they say that. Firstly, the on-premise business model and the cloud shift are really important, because with the cloud shift, we update the software; with the on-premise business model, we don't, so you would often be on older, dated versions — customers would upgrade once a year or every two years or something — and so we can't control that. Secondly, the challenge of Jira is that at our core, we solve a whole lot of what we call structured and unstructured workflows. Confluence is an unstructured workflow, Jira's a very structured workflow. You have a set of steps, you have permissioning and restrictions, you have fields, you have what's happening in this process. The auditor will do something and pass it to the internal accounting team, the accounting team will do this and pass it to legal, legal will do this and pass it to these people. You're defining a workflow and you're having information flow back and forth, and a Jira work item, as we call it, is a human reference to work. That's the best description of what Jira is. Work in the knowledge work era is this very ephemeral concept. Back to your development example: is the code the software? Is the idea the software?
Are the designs in Figma the software? These are all parts of what it is, this virtual thing that we've built. What we track is a human reference to that, so someone can say, it's a new admin console. Cool: here's the design for the admin console, there's the spec for the admin console, there's the code for the admin console, here's where it's been tested, here's where it's deployed. Did customers like it? We need a reference to this thing that is otherwise spread across hundreds of systems and virtualized. Once you're building a workflow system — companies, ours included, love process, we love workflows, we love control, and that control usually comes with more data. "Hey, don't fill in these three fields, fill in these 50 fields", and they're all required for some reason, and our job with customers is to say, "Do you really need 50 fields?", because you're creating a user experience- You're ruining it for us! MCB: Your users are going to have to fill in all 50 fields, and it feels like that's going to take you a while. We have customers — I went back and checked, I think almost every single person you've interviewed on your podcast is a customer of ours. I don't know if it's 100%, but it's definitely north of 95% of the last 20 guests. Stratechery is a customer of yours, so there you go. MCB: Oh, really? Well, there you go. Thank you. One of my engineers adores Jira, so I get the opposite angle from what I asked about. MCB: That's right. So look, it's a challenge for sure, but at the same time, man, the value we've created, the business value, the number of customers that run on it — it's ironic, we talk about the AI era and all these other things, but literally, no chips go out of any of the chip companies you love talking about without it, every single one of them, soup to nuts. So at what point did you realize that AI was going to impact you in a major way? Was there an "aha" moment, or has it just been in the air?
Or is there a specific time you realized, "Look, this is going to completely change what we do"? MCB: Again, I'm one of these — I've realized I've become the old man in the room. We've done machine learning for a long time in lots of ways because of our online business model, so I'd say we've done AI for a long time. Obviously, LLMs are what people refer to nowadays by AI, and agents — words that have been co-opted; the meaning changes in technology when a word comes to mean something else. The launches of the various versions of ChatGPT were very instructive, obviously, they were a moment for everybody. The optimism — and I would say we're massive AI optimists — it is the best thing that's happened to our business in 25 years. Why? Because people might look at you from the outside and say you're still characterized as — even though your business expanded far beyond developers — "Oh, you have a lot of developers". I'm skipping over the transition to the cloud just because we're running out of time, but it's an interesting story; you did announce you are finally ending the on-premises software, and I'm curious whether it was a sentimental moment to come to that decision. But people might look at you from the outside and say, "Oh, there's a company that's going to have a problem with AI, AI is going to replace developers, it's going to decrease seats . What are they going to do?" MCB: There's a few ways to take that. I'm trying to put it on a tee for you. I think I know what you want to say. MCB: There's a few ways to look at it. Firstly, I think AI is a good example where people are very concrete about the negatives and the positives are upside. I think it's a huge force multiplier personally for human creativity, problem solving, all sorts of things, it's a massive positive for society. That doesn't mean there aren't any negatives, but the net effect is really high.
And we spend a lot of time, you hear it in the media, talking about the job loss, the efficiency gains, whichever way you want to put it, that's the thing. Well, that's because it's really concrete in a spreadsheet: "I can do this process with half as many people", "Wow, look at that, that's great". What's never written in the spreadsheet is all the new processes that get created, all the new ways of doing things, the quality of the output being twice as high. If software costs half as much to write, I can either do it with half as many people, or — core competitive forces in the economy, I would argue, mean I will need the same number of people — I'll just need to do a better job of making higher-quality technology. So our view on AI overall is that it's an accelerant, not a replacement, for everything we do, and just the next era of technology change, which is really positive. We've loved technology, we love the cloud, we love all the tech changes we've been through, mobile. Look, as a business, we are in the game of knowledge work. We solve human problems, workflows, business processes, this is what we do. These largely revolve around text, or if it's video nowadays, that can be reduced to text in various ways. LLMs allow us to understand that text in a massively deeper way than we ever have been able to, and the problems we solve aren't going away. In 20 years' time, there'll be groups of people trying to solve some sort of problem as a team and working on a project, and so these things aren't going to go away. They're going to need to talk to each other and collaborate on what work's going on and how it's working, so the textual aspect of it has been amazing. The features we've been able to ship, we never could have built five years ago, it was literally impossible, so the ability to solve customer problems is so much higher than it ever has been. Secondly, our software is incredibly valuable at the core of these workflows, but it's also incredibly promiscuous.
What I mean by that is we have always been very highly interlinked with everything else. If it's a sales team, there are links to Salesforce and customer records, there are links to internal systems, there are links to maybe features that need to be built, there are links to some content and documents. So any Jira, Confluence, or Loom — you don't record a Loom unless you're talking about something, you don't have a Jira issue without pointing to all sorts of different resources, whether that's GitHub or Figma, whether it's Salesforce or Workday. That gives us really unique knowledge, which we've turned into the Teamwork Graph; that actually started pre-AI, so the irony is the Teamwork Graph is about 6 years old. Well, it started with Confluence. This is the whole thing where you look backwards, and to your point, if you had just been the Jira company — but because, from the very beginning, as you mentioned, Confluence was different but it was adjacent and you had to build the links and stuff together, and as you built all these different tools — because everyone wants to be this point of integration, and I wanted you to tell me about Rovo and this idea of being able to search across all your documents: who gets permission to do that? It's someone that's already there, and you made the critical decision to be there back in 2004 or whatever it was. MCB: That's true. Certainly back in 2004, and then in, I think, 2019, the Teamwork Graph starts, which is trying to take all of those links and turn them into a graph. The connectivity — two things link to this Figma file, five things link to this customer record — okay, cool, that means something, so we built this Graph. To be honest, it was a bit of a technology lark. We have a lot of these projects that are really cool and we're like, "We'll be able to use this somehow and it's going to grow", and now it's a hundred billion objects and connections connecting all of a company's knowledge.
It becomes the organizational memory nowadays and context and all these things nobody knew in 2019 that’s what it was going to be, it just seemed we needed it for various process connections. That turns out to be because it’s got permissions and compliance and all of the enterprise stuff built in, which is incredibly difficult, the best resource to point AI at in various forms. You still have to be good at the AI parts to get the knowledge, the context for any area, so the Teamwork Graph is our data layer. It’s not only the best kind of enterprise search engine for your content from a 10 Blue Links kind of way of thinking. If you’re chatting through your content, you still need all your organizational knowledge. I actually obviously found your Article, I was like, “Hey, what has Ben Thompson written about us last year?”, and I asked Rovo in chat and it comes back to me with he wrote this, that and the other and pulls out some snippets. I’m like, “Tell me more, do you think we’ve hit that?”, I literally got a report written by Rovo on your report as to whether it had been accurate. “Go look at the last 10 years with deep research and web search and come back and tell me, was he right or wrong?”, and it gave me a really interesting analysis of whether you were right and wrong. It’s like most AI things, it’s like 90% correct, it’s pretty good. It solved a lot of the first problem and I would not have done that work otherwise. I would have read it quickly and so I wasn’t going to put an analyst on it internally to do this work, but I could send something to do work I never would’ve done. Who’s your competitor for this spot, for this Rovo position where you have all this context, you can actually search your company in a way that just wasn’t possible previously? MCB: Who are the competitors you say? 
Yeah, because everyone is claiming they’re in this spot, “We can be the central place that you go and we have visibility everywhere”, why is Atlassian the one that’s going to win that space? MCB: A few reasons why we will. I think we have a great chance to be a great player is maybe the easiest way to say it. I think everybody loves this absolute win position, we don’t believe in enterprise technology, you usually get these absolute wins, it’s not quite the same as in the consumer world. We have a lot of business processes and workflows, millions every day that run through us, those are human collaboration workflows, so they are cool. The auditing team hands off to the accounting team, hands off to the tax team, whatever it is, sales workflows, marketing workflows, and they span lots of our applications and many others. If you’re going to go and introduce agents, these autonomous AI-driven software programs, whatever you want to call an agent, you’re going to put them into existing processes to make those processes either more efficient or more accurate. When the human picks up a task, it’s got all the information they need because something’s gone out to find it, that is an incredibly powerful position, which is why we support our agents and everybody else’s. You can assign a Jira work item to a Cursor agent in terms of code, you can assign it to a Salesforce agent. If you have your agent technology choice, I don’t think you’re going to have one agent platform, I think you’re probably going to have multiples, there are going to be a handful of organizational knowledge graphs that are powerful enough to solve these problems across multiple tools, but we have access to all those tools. We already know the information to some level, and that becomes a very unique advantage. Do you see this as a way to expand even further how much of a company you cover? 
You started with developers, then you expand to adjacent teams, and you talk about it’s now just a fraction of your user base. Do you own entire companies or could you get there? It’s like, “Okay, we still have these teams over here that are not on Jira, but Rovo’s so good that we need to bring everyone in”? MCB: Look, again, it would be great. I think it is unrealistic, and we should say “Absolutely”, right? MCB: If [Salesforce CEO Marc] Benioff was here, he’d be like, “Absolutely, we’ll own the world”, we love him, that’s the way he is, I don’t think about it as owning a customer. Our mentality has always been — I always use the subway analogy versus we have some competitors, for example, that want to be the control tower, their whole thing is we’ll be the control tower, just give us control and we’ll go and control everybody else, we’ll move the planes around. I think in enterprise IT, that’s an unrealistic view. Every CIO has been sold this for decades, it doesn’t happen because the world changes too quickly. Our philosophy and our commitment to customers has always been we will be a great citizen on all sides, we will interact with all of the applications you need, the old ones and the new ones, and we will be a valuable point of exchange in your business workflows and processes, whether those are structured like in Jira, whether unstructured like in Loom or Talent or something else. The reason for that is you have lots of systems. We want to be a valuable station on your subway network, we don’t want to be at the end of one of the lines, we want to be one of the handful of hub stations that are about moving trains around, and that is the best way to get your knowledge moving in your organization, it’s the best way to deal with your processes. Therefore, we need to have amazing AI capabilities. 
We have a massive investment in R&D, we have thousands of people working on AI tooling at the moment, and we have a huge creation bent, which is one of the reasons I think — we’ve talked a bit about the data advantage we have, I think we have a huge design advantage, and I actually think design is one of the hardest parts of building great AI experiences because it’s real fundamental design for the first time. You had a great line, you did a podcast a couple of weeks ago that I’ll put a link to, but you mentioned basically, the customer should not need to understand the difference between deterministic and probabilistic in the context of design, that’s what you’re driving at here. MCB: They should not need to understand that, they should need to understand when outcomes, outputs may be wrong or may be creative. Again, you talk a lot about the fact that hallucination is the other side of creativity, right, you can’t have one without the other. Hallucinations are a miracle. We have computers making stuff up! MCB: Our job is to explain to a customer when that happens, so it’s like this might be something you want to do, and that requires a lot of design. We have a feature in Jira called Work Breakdown which is super popular, where I can take a Jira issue and say, “Make me a bunch of sub-issues, this task has to be broken into a set of steps”. I don’t believe in the magic button theory of AI, that I’ll just hit a button and it’ll do all the things, I believe deeply in the value from AI will come from human-AI collaboration in a loop. It’s me and the AI working back and forth. You talk about yourself and Daman quite a lot , and it’s you, Daman and ChatGPT working together, but it’s not like you ask one thing and it’s done. It’s an interaction, it’s a collaboration back and forth, and that’s going to happen everywhere. 
In Work Breakdown, what it does is it says, “Hey, based on these types of documents I’ve gone to find from your whole graph in Google Docs and Confluence, whatever, I think this piece breaks down into these, is that correct?”, and it goes, “No, actually, that one doesn’t make any difference, these two are really good, you forgot about this document”, “Cool, let me go do that for you again”, and come back and say, “Is it these?”, “That’s closer”, and then you’re like, “That’s good enough, it’s 90% of what I need”, and then I go add the two that I need myself. That is a huge productivity boost but it’s not magically correct, and it requires a lot of design to tell people, “These are not the answers, these are possible answers, help us refine them and get better at it so that you get the 90% upside and the 10% downside is managed”. Are all these people pursuing these full agents that act on their own, are they just totally misguided? MCB: No, because I think, well, agents will take — there’s a snake oil sales thing going on as there always is in any bubble, and the snake oil sales is not wrong, it’s just chronologically challenged. (laughing) That’s so good. MCB: Well, customers are struggling. When I talk to customers every day, they’re like, “Is everyone else using these things to just magically transform their business with this simple, it took them five minutes and it’s replaced entire armies of people?”, and I’m like, “No, nobody’s doing that”. What they’re actually doing is taking business processes that are really important to their business and saying, “Okay, can I make this step better? This is highly error-prone. It’s compliance in a large organization, how do I make this part of the process better?”, and we’re like, “Oh, we can totally do that”, and they will replace small bits of lots of processes so that in Ship of Theseus style, five years from now, the process will look radically different. 
Occasionally, they are replacing entire processes, but this is the 1% case, what they’re actually doing is they have whole machines that are running and they’re trying to fix this cog and fix that cog, and that’s super valuable for them. That’s not a downside, that’s really, really valuable. And often, it’s work they didn’t want to do, work that wasn’t getting done, it wasn’t done at a high quality, so we got to remember that, I say this quite a lot, people shouldn’t be afraid of AI taking their job, I fundamentally believe this, they should be afraid of someone who’s really good at AI taking their job. That’s actually what’s going to happen, is someone is going to come along, in a sales sense, they’re really good at using all these AI tools to give better customer outcomes or handle more customers at one time. Is this why you’re hiring so many young people? MCB: Yes, I guess so. Yes, they’re more AI-native, they come out understanding these tools and technologies. I find the biggest irony in universities is all these people who “cheat” their way through every assignment, I use cheat in quote marks, using ChatGPT to handle these assignments, and then they’re worried AI is going to take all these jobs. I’m like, “Wait, you literally took your own job of writing the assignment, but you’ve also trained yourself on how to use these tools to get the outcome required” — now one might argue the university degree should be different, but just like when Google came along and you could look up any fact, knowing facts became far less important than the ability to look it up. I still think AI, it doesn’t create anything, maybe slightly controversial, but I argue it synthesizes information, it’s really good at processing huge amounts of information, giving it back to you, changing its form, bringing it back. Humans are still the only source of fundamental knowledge creation. 
I point out one of the flaws in the one person billion dollar company argument, and this will happen but it’ll be an anomaly. That company doesn’t get created without that one person, so there’s not AI creating companies magically. It’s like can a company eternally buy back its stock? No, because at some point, someone is going to own the final share. MCB: That’s right and I think this is missed, right? This is where we say it’s about unlocking creativity and what we do for our customers is put Rovo and these amazing data capabilities that we have alongside all the enterprise compliance and data residency, and there’s a massive amount of making this work in the enterprise with trust and probity and security. It’s very difficult. And great design to say, “What do you hire us to do? How do you get these technology and business teams to work together? What workflows do you have in your projects and your service teams, and how can we make those workflows better with more data and make your teams more informed?” That will end up with us having more share of employees in a business that use our stuff every day. Awesome. You made two big acquisitions recently, the DX acquisition, I think, makes a ton of sense to me: measuring engineering productivity, particularly in the area of AI. What actual ROI are we getting on this? MCB: And how much money am I spending? Because I’m spending suddenly a lot of money, right? This is not cheap at all, I have huge bills. Internally, we use Rovo Dev, we use Claude Code, we use GitHub Copilot, we use Cursor, we have them available to all. We have a huge R&D — again, I think we’re still number one on the NASDAQ for R&D spending as a proportion of revenue. You can take that as a good thing in the AI era or a bad thing, everyone gets to choose their own view on that, but we’ve always been incredibly high on R&D spending since day one. 
The bills that we pay though are very high, so DX is simply saying, “Okay, cool, how do I measure what I’m getting for that? Should I pay twice as much money because these bills are worthwhile, or is there a lot of it that’s really just fun and not actually leading to productivity gains?”. This is going to be a hard problem because there’s a lot of money on the line at the moment that people are paying for these tools, which is not without value, but measuring exactly what the value is is really, really hard, and that team’s done a phenomenal job. And we now have an Atlassian office in Salt Lake City, Utah, where I already spend a lot of time. Totally by coincidence, but it’s really nice. So that purchase, love it, makes a ton of sense. In perfect alignment with you. How does The Browser Company fit in? MCB: A lot of ways. So I have believed for a long time that browsers are broken. We’ve built browsers for an era of software that we don’t live in today. And I don’t, in my browser, have a bunch of tabs that represent webpages, I don’t have that. I have a bunch of tasks, I have a bunch of applications, I have a bunch of documents, and the browser was fundamentally never built to do that. That’s what Arc, the first product from The Browser Company — if you don’t use Arc every single day, you should be, it’ll increase your productivity instantly because it’s built for knowledge workers and the way that they have to actually work every day and how they manage all of these tabs and tasks and flows versus serving the New York Times or whatever. That is a browser built for knowledge workers, and there’s a lot more we can do in that era as software changes. Secondly, obviously AI has come along, and we now have chats and applications as an extra part of the browser experience, so I think we can change how enterprises use browsers, security being a big issue. 
I think AI in the browser is a really important thing, but I suspect it’s not in the basic way of just combining Chrome and ChatGPT, that’s not how it’s going to play out. I suspect it requires a massive amount of design, which The Browser Company is phenomenal at, and it requires changing how people use their day-to-day applications. From our point of view, and I’ve been an Arc fan since day one, [The Browser Company CEO] Josh [Miller] and I have known each other a long time, there’s a knowledge worker angle and there’s obviously a business angle to it in a huge way that our customers are knowledge workers. We can change the way they do their work in a meaningful way of productivity, that is exactly what we have been trying to do in a lot of different ways. The browser itself, being chromium-based, Edge being chromium-based, Chrome being chromium-based, the rendering of webpages is not the problem, it is the fundamental user experience of, “How do I take all of my SaaS applications, my agents, my chats, my tabs, my knowledge, and put it all together in ways that make my day quicker?” — that is what we are trying to do fundamentally at the start. The context that we have is incredibly important for that. And the browser has, if you think about it, my personal memory. We used to call it the browser history. Great, it shows what I’ve seen, it does not have my organizational memory, which we have a great example of in the Teamwork Graph. So if I can put these things together, I can make a much more productive browsing experience for customers fundamentally in that world. I think we have an amazing shot of doing that and of changing how knowledge workers use SaaS. We’re not trying to make a browser, as I’ve said, for my kids, we’re not trying to make a browser for my parents, we’re not trying to make a browser for shopping or for anything else. 
We’re trying to make a browser for people who spend all day living in Salesforce and Jira and Google Docs and Confluence and Figma and GitHub, and that is their life. The laptop warrior that sits in that experience, I believe we can use AI and design to make that a far better experience and build an amazing product. They’re well on the way to doing that, we can supercharge doing it. You look skeptical. No, I’m looking at the clock, I skipped over a huge section. Your whole shift to the cloud, all those sorts of things. However, there is one thing I wanted to get to: you are wearing an Atlassian Williams Racing hat , I am a big F1 fan, I was very excited about you doing this . How did that come about? How was the first year? Was this another hunch this is going to work out? I mean, Williams is looking like a pretty good bet. MCB: Yes, our world’s largest sports bet. Look, how did it come about? So how do I make a short answer? F1 is changing, I think, in a massive way. I know now being incredibly deep in the business of it, the fundamental change is that hardware is becoming less important and software is becoming more important, this is a trend that we are used to. JV, James Vowles , the Team Principal, was the first person that approached us a long while ago now to help them, and for a teeny, teeny sticker in the corner, to help them get more productive as a team. What people don’t realize about F1 is these are large organizations, right? There’s 1100 people that work for Atlassian Williams Racing. And Williams was really pared down and skinny, he was brought back in with new owners to actually rebuild the entire thing? MCB: Yes, they were in deep trouble. But in rebuilding it, he is a software engineer, software developer by trade, by history kind of thing. He’s a technically-minded person. He downloaded Jira himself in 2004 to install it, so he knows us quite well. 
So we were brought on for our ability to help them with their teamwork and their collaboration, they really needed a technical upgrade to a whole lot of their systems. Turns out they need us in almost every part of their business because the service workflow’s important. We’re now in the garage, we’re using tons of AI to try to make them better, so there’s a lot we can do to hopefully help them win, and it’s a mission you can fall in love with. Here is one of the most storied brands in Formula 1 that’s fallen on tough times, every sportsperson loves a recovery story. And I was sold early on the recovery story, I’m like, “Fuck it, let’s go help, let’s make this happen. Let’s get back to being a championship team”. So we fell in love with the mission, and JV is super compelling, he’s got a one-decade goal, and they’re very goal-driven, and we love that, but they needed a lot of help, so that’s what they asked us for help with initially. The more we looked at it, the more we learned about Formula 1, yes, it’s becoming a software-driven sport. So as an example, Atlassian Williams, I believe, has twice as many software developers as the next team on the grid. Because it’s cost-capped, you got to choose, “Do I hire a software developer or an aerodynamicist?” — it’s a very clear cost cap, you’re choosing where to put your resources. As virtualization and everything get better, it’s less, “How well can I draw a curve?” and more, “How much can I help 1100 people work together, and how can we build great software”, which really is the core of the car, right? So that then comes to us, tiny sticker, probably a founder-ish moment where I’m like, “How much is the sticker on the top?”, and they didn’t have a sticker on the top and I’m like, well, “What would that get us?” So we ran the numbers on that and the reason is twofold. You talked about our GTM, our go-to-market transformation, we have an ability to build various things. 
Firstly, branding is obviously massive, top three teams get 10 times the branding of the bottom three teams. So if you’re going to make a sports bet, you pay for a long period of time with a bottom three team, you help make them a top three team, and your sports bet pays out really well just on sheer TV time, etc. — the number of parents and others who have said to staff members, “Hey, that company you work for, it’s really great, I saw them on the TV on the weekend”, and the staff member will say, “Dude, I’ve worked there for 12 years, why do you suddenly know about it?”, “Oh, I saw them driving. Carlos [Sainz Jr.] is great”, or something. And he is! So obviously, there’s a huge marketing and branding angle that’s about their position being better. The really interesting part of what we’re doing there is we have customers all around the world, we have customers in 200-odd countries, and we can’t go and visit all of our biggest customers in a meaningful way. We certainly can’t take them to some of our best and most exciting customers, right? There are electric car companies that use our stuff that we’d love to take many customers to a factory, or rockets, or whoever, I can’t take many customers into some of your favorite chip companies and say, “Look how they use our stuff”, I can maybe get one or two customers a year into that customer and show them how they use our things. With Formula 1, what we’re building is a mobile EBC, an executive briefing center. Formula 1 goes around the world. It goes to Melbourne, it goes to Singapore, it goes to Japan, it goes to England, it goes to various parts of Northern Europe, it goes to various parts of America and you’re like, “Hey, where are our customers?” — roughly distributed like that. It comes to town, we can invite a whole lot of customers into a great experience, we can tell them a lot about Atlassian software, we can also invite them into one of our best customers. 
They can sit in the garage, and I can tell them how our service collection is helping power the assets, that when that wing’s broken, it gets known here, and they start making a new one back in the factory in Oxford, and this one gets shipped around the world and another one will get moved. And, “Here, I can show you the asset management and the service that goes along with it, I can show you how the garage is getting more efficient because of us, I can show you how we’re helping them win races”. We don’t drive cars, we help them be more productive as a team and I can do that in an environment of it’s an exciting environment. They can drink a great latte or a champagne or whatever they want, and I can explain to them how we are transforming this business in a meaningful way with our tools no matter which way they want to look at it, which is the most powerful customer story that you can go and tell a couple-hundred customers a year in their city. We come to their city, right? I was in Montreal, I took a whole bunch of Canadian customers over the three days, they were like, “This changes my view of Atlassian”, and I’m like, “That’s exactly our goal”, that is at the enterprise end of enterprise sales though, right? But that’s the ironic thing, it’s as far away from where you started as you could be. MCB: Well, they didn’t get there. I met two Canadian banks we had in Montreal as an example, both of whom had been customers for over 20 years, they started spending $800 bucks or maybe $4800 as we moved our pricing to around five grand — now they spend a million, two million dollars a year, and they could be spending ten. We have the ability to give the massive business value across a far larger swath of their business. And I can say, “What do you use from our system of work today? What could you use? Let me show you how Williams uses that piece of the system of work”, which is just a very visceral and exciting customer example to show them how they’re winning. 
And it helps, again, culturally, super aligned. They’re an awesome group of people trying really hard to win in the most ridiculously competitive sport and the highs are high, the lows are low. Any sporting fan, you’re well familiar with various different sports that we have in common, but this is technology built by a large business team that has to win a sport. That doesn’t happen anywhere else in the sporting world, I would claim. Giannis [Antetokounmpo] doesn’t make his own shoes and have a team of people making better shoes and a better basketball so he can win, that doesn’t happen in other sports. It’s all about the people on the floor in an NBA game as to who wins, and that’s great, don’t get me wrong, I love basketball. The work in Formula 1 is done by 1000 people back in Oxford. It’s a Constructor Championship. MCB: The constructor championship I do think should be more important, especially given the current exact week we’re in, which is an amazing week for Atlassian Williams Racing, second podium. You talk about that bet, I told JV at the start of the year, he’s like, “What do you think our five-year future is?”, and I said, “Look, I think, number one, we’ll get one podium this year, 2025; 2026, we’ll win a race; and by 2030, we will have won a championship, that is my OKRs [Objectives and Key Results]”, and he said, “Oh, wow, okay, yeah I think so”. It lines up, I know the team OKRs and other things. And we won two podiums this year, so I was wrong, and I think we have a great chance for 2026, and we are working hard to make the team better and the single-best customer example we have of every piece of software that we sell. Mike, I’d love to talk again. It was great talking to you again. And, hey, good luck. And I’m a Williams fan, so I’ll be cheering for you this weekend. MCB: Oh, yeah. Well, I’m not sure this weekend, but 2026, 2027- Okay. I’m kind of kissing up, I am dying for Max [Verstappen] to win is the honest truth. 
I need the McLarens to run into each other. But other than that, Williams is my second love. MCB: Do you think McLaren will issue team orders to switch them if Oscar is in second and Lando’s in fourth? Yes. And I don’t know what’s going to happen if that happens, and this will be fascinating. MCB: We will have to see. It’s going to be a huge week. But that’s what makes the sport exciting, right? The whole thing is amazing. Talk to you later. MCB: All right. Thanks, man. This Daily Update Interview is also available as a podcast. To receive it in your podcast player, visit Stratechery.

Anton Zhiyanov 1 week ago

Go proposal: Type-safe error checking

Part of the Accepted! series, explaining the upcoming Go changes in simple terms. Introducing errors.AsType — a modern, type-safe alternative to errors.As. Ver. 1.26 • Stdlib • High impact The new AsType function is a generic version of errors.As. It's type-safe, faster, and easier to use. errors.As is not deprecated (yet), but AsType is recommended for new code. The errors.As function requires you to declare a variable of the target error type and pass a pointer to it. This makes the code quite verbose, especially when checking for multiple types of errors. With the generic AsType, you can specify the error type right in the function call, which makes the code shorter and keeps error variables scoped to their blocks. Another issue with errors.As is that it uses reflection and can cause runtime panics if used incorrectly (for example, if you pass a non-pointer or a type that doesn't implement error). While static analysis tools usually catch these issues, the generic AsType has several benefits: no reflection 1 , no runtime panics, fewer allocations, and compile-time type safety. Finally, AsType can handle everything that errors.As does, so it's a drop-in improvement for new code. The proposal adds the AsType function to the errors package, recommends using AsType instead of As, and demonstrates it by opening a file and checking whether the error is related to the file path. 𝗣 51945 • 𝗖𝗟 707235 1. Unlike As, AsType doesn't use the reflect package, but it still relies on type assertions and interface checks. These operations access runtime type metadata, so AsType isn't completely "reflection-free" in the strict sense.  ↩︎

xenodium 1 week ago

At one with your code

While in the mood to goof around with Emacs, CLI, and image rendering, I've revisited an idea to generate some sort of art from your codebase (or any text really). That is, given an image, generate a textual representation, potentially using source code as input. With that, here's one: a utility to transform images into character art using text from your codebase. Rather than tell you more about it, best to see it in action. Just a bit of fun. That's all there is to it. While I've only run it on macOS, it's written in Go, so it should be fairly portable. I'd love to know if you get it running on Linux. The code's on GitHub. If you're on macOS, I've added a Homebrew tap on GitHub, so you should just be able to install it with a single command. Having fun with it? Enjoying this blog or my projects? I am an 👉 indie dev 👈. Help make it sustainable by ✨ sponsoring ✨ Need a blog? I can help with that. Maybe buy my iOS apps too ;)

Stone Tools 1 week ago

Bank Street Writer on the Apple II

Stop me if you've heard this one. In 1978, a young man wandered into a Tandy Radio Shack and found himself transfixed by the TRS-80 systems on display. He bought one just to play around with, and it wound up transforming his life from there on. As it went with so many, so too did it go with lawyer Doug Carlston. His brother, Gary, initially unimpressed, warmed up to the machine during a long Maine winter. The two thus smitten mused, "Can we make money off of this?" Together they formed a developer-sales relationship, with Doug developing Galactic Saga and third brother Don developing Tank Command. Gary's sales acumen brought early success and Broderbund was officially underway. Meanwhile in New York, Richard Ruopp, president of Bank Street College of Education, a kind of research center for experimental and progressive education, was thinking about how emerging technology fit into the college's mission. Writing was an important part of their curriculum, but according to Ruopp, "We tested the available word processors and found we couldn't use any of them." So, experts from Bank Street College worked closely with consultant Franklin Smith and software development firm Intentional Educations Inc. to build a better word processor for kids. The fruit of that labor, Bank Street Writer, was published by Scholastic exclusively to schools at first, with Broderbund taking up the home distribution market a little later. Bank Street Writer would dominate home software sales charts for years and its name would live on as one of the sacred texts, like Lemonade Stand or The Oregon Trail. Let's see what lessons there are to learn from it yet.

1916: Founded by Lucy Sprague Mitchell, Wesley Mitchell, and Harriet Johnson as the "Bureau of Educational Experiments" (BEE) with the goal of understanding in what environment children best learn and develop, and to help adults learn to cultivate that environment.
1930: BEE moves to 69 Bank Street. (Will move to 112th Street in 1971, for space reasons.)
1937: The Writer's Lab, which connects writers and students, is formed.
1950: BEE is renamed to Bank Street College of Education.
1973: Minnesota Educational Computing Consortium (MECC) is founded. This group would later go on to produce The Oregon Trail.
1983: Bank Street Writer, developed by Intentional Educations Inc., published by Broderbund Software, and "thoroughly tested by the academics at Bank Street College of Education." Price: $70.
1985: Writer is a success! Time to capitalize! Bank Street Speller $50, Bank Street Filer $50, Bank Street Mailer $50, Bank Street Music Writer $50, Bank Street Prewriter (published by Scholastic) $60.
1986: Bank Street Writer Plus $100. Bank Street Writer III (published by Scholastic) $90. It's basically Plus with classroom-oriented additions, including a 20-column mode and additional teaching aids.
1987: Bank Street Storybook, $40.
1992: Bank Street Writer for the Macintosh (published by Scholastic) $130. Adds limited page layout options, Hypercard-style hypertext, clip art, punctuation checker, image import with text wrap, full color, sound support, "Classroom Publishing" of fliers and pamphlets, and electronic mail.

With word processors, I want to give them a chance to present their best possible experience. I do put a little time into trying the baseline experience many would have had with the software during the height of its popularity. "Does the software still have utility today?" can only be fairly answered by giving the software a fighting chance. To that end, I've gifted myself a top-of-the-line (virtual) Apple //e running the last update to Writer, the Plus edition. You probably already know how to use Bank Street Writer Plus. You don't know you know, but you do know because you have familiarity with GUI menus and basic word processing skills.
All you're lacking is an understanding of the vagaries of data storage and retrieval as necessitated by the hardware of the time, but once armed with that knowledge you could start using this program without touching the manual again. It really is as easy as the makers claim. The simplicity is driven by a very subtle, forward-thinking user interface.

Of primary interest is the upper prompt area. The top 3 lines of the screen serve as an ever-present, contextual "here's the situation" helper. What's going on? What am I looking at? What options are available? How do I navigate this screen? How do I use this tool? Whatever you're doing, whatever menu option you've chosen, the prompt area is already displaying information about which actions are available right now in the current context. As the manual states, "When in doubt, look for instructions in the prompt area." The manual speaks truth. For some, the constant on-screen prompting could be a touch overbearing, but I personally don't think it's so terrible to know that the program is paying attention to my actions and wants me to succeed. The assistance isn't front-loaded, like so many mobile apps, nor does it interrupt, like Clippy. I simply can't fault the good intentions, nor can I really think of anything in modern software that takes this approach to user-friendliness.

The remainder of the screen is devoted to your writing and works like any other word processor you've used. Just type, move the cursor with the arrow keys, and type some more. I think most writers will find it behaves "as expected." There are no Electric Pencil-style over-type surprises, nor VisiCalc-style arrow key manipulations. What seems to have happened is that in making a word processor that is easy for children to use, they accidentally made a word processor that is just plain easy. The basic functionality is drop-dead simple to pick up by just poking around, but there's quite a bit more to learn here.
To do so, we have a few options for getting to know Bank Street Writer in more detail. There are two manuals by virtue of the program's educational roots. Bank Street Writer was published by both Broderbund (for the home market) and Scholastic (for schools). Each tailored their own manual to their respective demographic. Broderbund's manual is cleanly designed, easy to understand, and gets right to the point. It is not as "child focused" as reviews at the time might have you believe. Scholastic's is more of a curriculum to teach word processing, part of the 80s push for "computers in the classroom." It's packed with student activities, pages that can be copied and distributed, and (tellingly) information for the teacher explaining "What is a word processor?" Our other option for learning is on side 2 of the main program disk. Quite apart from the program proper, the disk contains an interactive tutorial. I love this commitment to the user's success, though I breezed through it in just a few minutes, being a cultured word processing pro of the 21st century. I am quite familiar with "menus" thank you very much. As I mentioned at the top, the screen is split into two areas: prompt and writing. The prompt area is fixed, and can neither be hidden nor turned off. This means there's no "full screen" option, for example. The writing area runs in high-res graphics mode so as to bless us with the gift of an 80-character wide display. Being a graphics display also means the developer could have put anything on screen, including a ruler which would have been a nice formatting helper. Alas. Bank Street offers limited preference settings; there's not much we can do to customize the program's display or functionality. The upshot is that as I gain confidence with the program, the program doesn't offer to match my ability. There is one notable trick, which I'll discuss later, but overall there is a missed opportunity here for adapting to a user's increasing skill. 
Kids do grow up, after all. As with Electric Pencil , I'm writing this entirely in Bank Street Writer . Unlike the keyboard/software troubles there, here in 128K Apple //e world I have Markdown luxuries like . The emulator's amber mode is soothing to the eyes and soul. Mouse control is turned on and works perfectly, though it's much easier and faster to navigate by keyboard, as God intended. This is an enjoyable writing experience. Which is not to say the program is without quirks. Perhaps the most unfortunate one is how little writing space 128K RAM buys for a document. At this point in the write-up I'm at about 1,500 words and BSW's memory check function reports I'm already at 40% of capacity. So the largest document one could keep resident in memory at one time would run about 4,000 words max? Put bluntly, that ain't a lot. Splitting documents into multiple files is pretty much forced upon anyone wanting to write anything of length. Given floppy disk fragility, especially with children handling them, perhaps that's not such a bad idea. However, from an editing point of view, it is frustrating to recall which document I need to load to review any given piece of text. Remember also, there's no copy/paste as we understand it today. Moving a block of text between documents is tricky, but possible. BSW can save a selected portion of text to its own file, which can then be "retrieved" (inserted) at the current cursor position in another file. In this way the diskette functions as a memory buffer for cross-document "copy/paste." Hey, at least there is some option available. Flipping through old magazines of the time, it's interesting just how often Bank Street Writer comes up as the comparative reference point for home word processors over the years. If a new program had even the slightest whiff of trying to be "easy to use" it was invariably compared to Bank Street Writer . 
Likewise, there were any number of writers and readers of those magazines talking about how they continued to use Bank Street Writer, even though so-called "better" options existed. I don't want to oversell its adoption by adults, but it most definitely was not a children-only word processor, by any stretch. I think the release of Plus embraced a more mature audience. In schools it reigned supreme for years, including the Scholastic-branded version of Plus called Bank Street Writer III. There were add-on "packs" of teacher materials for use with it. There was also Bank Street Prewriter, a tool for helping to organize themes and thoughts before committing to the act of writing, including an outliner, as popularized by ThinkTank. (always interesting when influences ripple through the industry like this)

Of course, the Scholastic approach was built around the idea of teachers having access to computers in the classroom. And THAT was built on the idea of teachers feeling comfortable enough with computers to seamlessly merge them into a lesson plan. Sure, the kids needed something simple to learn, but let's be honest, so did the adults.

There was a time when attaching a computer to anything meant a fundamental transformation of that thing was assured and imminent. For example, the "office of the future" (as discussed in the Superbase post) had a counterpart in the "classroom of tomorrow." In 1983, Popular Computing said, "Schools are in the grip of a computer mania." Steve Jobs took advantage of this, skating to where the puck would be, by donating Apple 2s to California schools. In October 1983, Creative Computing did a little math on that plan. $20M in retail donations brought $4M in tax credits against $5M in gross donations. Apple could donate a computer to every elementary, middle, and high school in California for an outlay of only $1M.
Jobs lobbied Congress hard to pass a national version of the same "Kids Can't Wait" bill, which would have extended federal tax credits for such donations. That never made it to law, for various political reasons. But the California initiative certainly helped position Apple as the go-to system for computers in education. By 1985, Apple would dominate fully half of the education market. That would continue into the Macintosh era, though Apple's dominance diminished slowly as cheaper, "good enough" alternatives entered the market. Today, Apple is #3 in the education market, behind Windows and Chromebooks.

It is a fair question to ask, "How useful could a single donated computer be to a school?" Once it's in place, then what? Does it have function? Does anyone have a plan for it? Come to think of it, does anyone on staff even know how to use it? When Apple put a computer into (almost) every school in California, they did require training. Well, let's say lip-service was paid to the aspiration of training. One teacher from each school had to receive one day's worth of training to attain a certificate which allowed the school to receive the computer. That teacher was then tasked with training their coworkers. Wait, did I say "one day?" Sorry, I meant about one HOUR of training.

It's not too hard to see where Larry Cuban was coming from when he published Oversold & Underused: Computers in the Classroom in 2001. Even in schools with more than a single system, he notes, "Why, then, does a school's high access (to computers) yield limited use? Nationally and in our case studies, teachers... mentioned that training in relevant software and applications was seldom offered... (Teachers) felt that the generic training available was often irrelevant to their specific and immediate needs."

From my perspective, and I'm no historian, it seems to me there were four ways computers were introduced into the school setting.
The three most obvious were: one or more computers at the classroom level, a school-level "computer lab" with one or more systems, or no computers at all. I personally attended schools of all three types. What I can say the schools had in common was how little attention, if any, was given to the computer and how little my teachers understood them. An impromptu poll of friends aligned with my own experience. Schools didn't integrate computers into classwork, except when classwork was explicitly about computers. I sincerely doubt my time playing Trillium's Shadowkeep during recess was anything close to Apple's vision of a "classroom of tomorrow."

The fourth approach to putting computers into the classroom was significantly more ambitious. Apple tried an experiment in which five public school sites were chosen for a long-term research project. In 1986, the sites were given computers for every child in class and at home. They reasoned that for computers to truly make an impact on children, the computer couldn't just be a fun toy they occasionally interacted with. Rather, it required full integration into their lives. Now, it is darkly funny to me that having achieved this integration today through smartphones, adults work hard to remove computers from school. It is also interesting to me that Apple kind of led the way in making that happen, although in fairness they don't seem to consider the iPhone to be a computer.

America wasn't alone in trying to give its children a technological leg up. In England, the BBC spearheaded a major drive to get computers into classrooms via a countrywide computer literacy program. Even in the States, I remember watching episodes of BBC's The Computer Programme on PBS. Regardless of Apple's or the BBC's efforts, the long-term data on the effectiveness of computers in the classroom has been mixed, at best, or even an outright failure.
Apple's own assessment of their "Apple Classrooms of Tomorrow" (ACOT) program after a couple of years concluded, "Results showed that ACOT students maintained their performance levels on standard measures of educational achievement in basic skills, and they sustained positive attitudes as judged by measures addressing the traditional activities of schooling." Which is a "we continue to maintain the dream of selling more computers to schools" way of saying, "Nothing changed." In 2001, the BBC reported, "England's schools are beginning to use computers more in teaching - but teachers are making 'slow progress' in learning about them." Then in 2015 the results were "disappointing": "Even where computers are used in the classroom, their impact on student performance is mixed at best."

Informatique pour tous, France 1985: Pedagogy, Industry and Politics by Clémence Cardon-Quint noted the French attempt at computers in the classroom as being "an operation that can be considered both as a milestone and a failure." Computers in the Classrooms of an Authoritarian Country: The Case of Soviet Latvia (1980s–1991) by Iveta Kestere and Katrina Elizabete Purina-Bieza shows the introduction of computers to have drawn stark power and social divides, while pushing prescribed gender roles of computers being "for boys." Teachers Translating and Circumventing the Computer in Lower and Upper Secondary Swedish Schools in the 1970s and 1980s by Rosalía Guerrero Cantarell noted, "the role of teachers as agents of change was crucial. But teachers also acted as opponents, hindering the diffusion of computer use in schools."

Now, I should be clear that things were different in the higher education market, as with PLATO in the universities. But in the primary and secondary markets, Bank Street Writer's primary demographic, nobody really knew what to do with the machines once they had them.
The most straightforwardly damning assessment is from Oversold & Underused where Cuban says in the chapter "Are Computers in Schools Worth the Investment?", "Although promoters of new technologies often spout the rhetoric of fundamental change, few have pursued deep and comprehensive changes in the existing system of schooling." Throughout the book he notes how most teachers struggle to integrate computers into their lessons and teaching methodologies. The lack of guidance in developing new ways of teaching means computers will continue to be relegated to occasional auxiliary tools trotted out from time to time, not integral to the teaching process. "Should my conclusions and predictions be accurate, both champions and skeptics will be disappointed. They may conclude, as I have, that the investment of billions of dollars over the last decade has yet to produce worthy outcomes," he concludes. Thanks to my sweet four-drive virtual machine, I can summon both the dictionary and thesaurus immediately. Put the cursor at the start of a word and hit or to get an instant spot check of spelling or synonyms. Without the reality of actual floppy disk access speed, word searches are fast. Spelling can be performed on the full document, which does take noticeable time to finish. One thing I really love is how cancelling an action or moving forward on the next step of a process is responsive and immediate. If you're growing bored of an action taking too long, just cancel it with ; it will stop immediately . The program feels robust and unbreakable in that way. There is a word lookup, which accepts wildcards, for when you kinda-sorta know how to spell a word but need help. Attached to this function is an anagram checker which benefits greatly from a virtual CPU boost. But it can only do its trick on single words, not phrases. Earlier I mentioned how little the program offers a user who has gained confidence and skill. 
That's not entirely accurate, thanks to its most surprising superpower: macros. Yes, you read that right. This word processor designed for children includes macros. They are stored at the application level, not the document level, so do keep that in mind. Twenty can be defined, each consisting of up to 32 keystrokes. Running keystrokes in a macro is functionally identical to typing by hand. Because the program can be driven 100% by keyboard alone, macros can trigger menu selections and step through tedious parts of those commands. For example, to save our document periodically we need to do the following every time: That looks like a job for a macro to me.

(Video: Defining a macro to save, with overwrite, the current file.)

After it is defined, I execute it, which happens very quickly in the emulator. Watch carefully. If you can perform an action through a series of discrete keyboard commands, you can make a macro from it. This is freeing, but also works to highlight what you cannot do with the program. For example, there is no concept of an active selection, so a word is the smallest unit you can directly manipulate due to keyboard control limitations. It's not nothin' but it's not quite enough. I started setting up markdown macros, so I could wrap the current word in or for italic and bold. Doing the actions in the writing area and noting the minimal steps necessary to achieve the desired outcome translated into perfect macros. I was even able to make a kind of rudimentary "undo" for when I wrap something in italic but intended to use bold.

This reminded me that I haven't touched macro functionality in modern apps since my AppleScript days. Lemme check something real quick. I've popped open LibreOffice and feel immediately put off by its Macros function. It looks super powerful; a full dedicated code editor with watched variables for authoring in its scripting language. Or is it languages? Is it Macros or ScriptForge? What are "Gimmicks?" Just what is going on?
Google Docs is about the same, using JavaScript for its "Apps Script" functionality. Here's a Stack Overflow post where someone wants to select text and set it to "blue and bold" with a keystroke and is presented with 32 lines of JavaScript. Many programs seem to have taken a "make the simple things difficult, and the hard things possible" approach to macros. Microsoft Word reportedly has a "record" function for creating macros, which will watch what you do and let you play back those actions in sequence. (a la Adobe Photoshop's "actions") This sounds like a nice evolution of the BSW method. I say "reportedly" because it is not available in the online version and so I couldn't try it for myself without purchasing Microsoft 365.

I certainly don't doubt the sky's the limit with these modern macro systems. I'm sure amazing utilities can be created, with custom dialog boxes, internet data retrieval, and more. The flip-side is that a lot of power has been stripped from the writer and handed over to the programmer, which I think is unfortunate. Bank Street Writer allows an author to use the same keyboard commands for creating a macro as for writing a document. There is a forgotten lesson in that. Yes, BSW's macros are limited compared to modern tools, but they are immediately accessible and intuitive. They leverage skills the user is already known to possess. The learning curve is a straight, flat line.

Like any good word processor, it offers user-definable tab stops. Bringing up the editor for tabs displays a ruler showing tab stops and their type (normal vs. decimal-aligned). Using the same tools as for writing, the ruler is similarly editable. Just type a or a anywhere along the ruler. So, the lack of a ruler I noted at the beginning is now doubly-frustrating, because it exists! Perhaps it was determined to be too much visual clutter for younger users?
Again, this is where the Options screen could have allowed advanced users to toggle on features as they grow in comfort and ambition. From what I can tell in the product catalogs, the only major revision after this was for the Macintosh, which added a whole host of publishing features. If I think about my experience with BSW these past two weeks, and think about what my wish-list for a hypothetical update might be, "desktop publishing" has never crossed my mind.

Having said all of that, I've really enjoyed using it to write this post. It has been solid, snappy, and utterly crash-free. To be completely frank, when I switched over into LibreOffice, a predominantly native app for Windows, it felt laggy and sluggish. Bank Street Writer feels smooth and purpose-built, even in an emulator. Features are discoverable and the UI always makes it clear what action can be taken next. I never feel lost nor do I worry that an inadvertent action will have unknowable consequences. The impression of it being an assistant to my writing process is strong, probably more so than many modern word processors. This is cleanly illustrated by the prompt area which feels like a "good idea we forgot." (I also noted this in my ThinkTank examination)

I cannot lavish such praise upon the original Bank Street Writer, only on this Plus revision. The original is 40-columns only, spell-checking is a completely separate program, no thesaurus, no macros, a kind of bizarre modal switch between writing/editing/transfer modes, no arrow key support, and other quirks of its time and target system (the original Apple 2). Plus is an incredibly smart update to that original, increasing its utility 10-fold, without sacrificing ease of use. In fact, it's actually easier to use, in my opinion, than the original and comes just shy of being something I could use on a regular basis. Bank Street Writer is very good! But it's not quite great.
Ways to improve the experience, notable deficiencies, workarounds, and notes about incorporating the software into modern workflows (if possible).

- AppleWin 32bit 1.31.0.0 on Windows 11
- Emulating an Enhanced Apple //e
- Authentic machine speed (enhanced disk access speed)
- Monochrome (amber) for clean 80-column display
- Disk II controller in slot 5 (enables four floppies, total)
- Mouse interface in slot 4
- Bank Street Writer Plus

- At the classroom level there are one or more computers.
- At the school level there is a "computer lab" with one or more systems.
- There were no computers.

- Hit (open the File menu)
- Hit (select Save File)
- Hit three times (stepping through default confirmation dialogs)

I find that running at 300% CPU speed in AppleWin works great. No repeating key issues and the program is well-behaved. Spell check works quickly enough to not be annoying and I honestly enjoyed watching it work its way through the document. Sometimes there's something to be said about slowing the computer down to swift human-speed, to form a stronger sense of connection between your own work and the computer's work. I did mention that I used a 4-disk setup, but in truth I never really touched the thesaurus. A 3-disk setup is probably sufficient. The application never crashed; the emulator was rock-solid.

CiderPress2 works perfectly for opening the files on an Apple ][ disk image. Files are of file extension, which CiderPress2 tries to open as disassembly, not text. Switch "Conversion" to "Plain Text" and you'll be fine.

This is a program that would benefit greatly from one more revision. It's very close to being enough for a "minimalist" crowd. There are four key pieces missing for completeness:

- Much longer document handling
- Smarter, expanded dictionary, with definitions
- Customizable UI, display/hide: prompts, ruler, word count, etc.
- Extra formatting options, like line spacing, visual centering, and so on.
For a modern writer using hyperlinks, this can trip up the spell-checker quite ferociously. It doesn't understand, nor can it be taught, pattern-matching against URLs to skip them.

Corrode 1 week ago

Canonical

What does it take to rewrite the foundational components of one of the world’s most popular Linux distributions? Ubuntu serves over 12 million daily desktop users alone, and the systems that power it, from sudo to core utilities, have been running for decades with what Jon Seager, VP of Engineering for Ubuntu at Canonical, calls “shaky underpinnings.” In this episode, we talk to Jon about the bold decision to “oxidize” Ubuntu’s foundation. We explore why they’re rewriting critical components like sudo in Rust, how they’re managing the immense risk of changing software that millions depend on daily, and what it means to modernize a 20-year-old operating system without breaking the internet. CodeCrafters helps you become proficient in Rust by building real-world, production-grade projects. Learn hands-on by creating your own shell, HTTP server, Redis, Kafka, Git, SQLite, or DNS service from scratch. Start for free today and enjoy 40% off any paid plan by using this link . Canonical is the company behind Ubuntu, one of the most widely-used Linux distributions in the world. From personal desktops to cloud infrastructure, Ubuntu powers millions of systems globally. Canonical’s mission is to make open source software available to people everywhere, and they’re now pioneering the adoption of Rust in foundational system components to improve security and reliability for the next generation of computing. Jon Seager is VP Engineering for Ubuntu at Canonical, where he oversees the Ubuntu Desktop, Server, and Foundations teams. Appointed to this role in January 2025, Jon is driving Ubuntu’s modernization strategy with a focus on Communication, Automation, Process, and Modernisation. His vision includes adopting memory-safe languages like Rust for critical infrastructure components. Before this role, Jon spent three years as VP Engineering building Juju and Canonical’s catalog of charms. He’s passionate about making Ubuntu ready for the next 20 years of computing. 
- Juju - Jon's previous focus, a cloud orchestration tool
- GNU coreutils - The most widely used implementation of commands like ls, rm, cp, and more
- uutils coreutils - coreutils implementation in Rust
- sudo-rs - For your Rust-based sandwich needs
- LTS - Long Term Support, a release model popularized by Ubuntu
- coreutils-from-uutils - List of symbolic links used for coreutils on Ubuntu; some still point to the GNU implementation
- man: sudo -E - Example of a feature that sudo-rs does not support
- SIMD - Single instruction, multiple data
- rust-coreutils - The Ubuntu package with all its supported CPU platforms listed
- fastcat - Matthias' blog post about his faster version of cat
- systemd-run0 - Alternative approach to sudo from the systemd project
- AppArmor - The Linux Security Module used in Ubuntu
- PAM - The Pluggable Authentication Modules, which handle all system authentication in Linux
- SSSD - Enables LDAP user profiles on Linux machines
- ntpd-rs - Time synchronization daemon written in Rust which may land in Ubuntu 26.04
- Trifecta Tech Foundation - Foundation supporting sudo-rs development
- Sequoia PGP - OpenPGP tools written in Rust
- Mir - Canonical's Wayland compositor library, uses some Rust
- Anbox Cloud - Canonical's Android streaming platform, includes Rust components
- Simon Fels - Original creator of Anbox and Anbox Cloud team lead at Canonical
- LXD - Container and VM hypervisor
- dqlite - SQLite with a replication layer for distributed use cases, potentially being rewritten in Rust
- Rust for Linux - Project to add Rust support to the Linux kernel
- Nova GPU Driver - New Linux OSS driver for NVIDIA GPUs written in Rust
- Ubuntu Asahi - Community project for Ubuntu on Apple Silicon
- debian-devel: Hard Rust requirements from May onward - Parts of apt are being rewritten in Rust (announced a month after the recording of this episode)
- Go Standard Library - Providing things like network protocols, cryptographic algorithms, and even tools to handle image formats
- Python Standard Library - The origin of "batteries included"
- The Rust Standard Library - Basic types, collections, filesystem access, threads, processes, synchronisation, and not much more
- clap - Superstar library for CLI option parsing
- serde - Famous high-level serialization and deserialization interface crate
- Jon Seager's Website
- Jon's Blog: Engineering Ubuntu For The Next 20 Years
- Canonical Blog
- Ubuntu Blog
- Canonical Careers: Engineering - Apply your Rust skills in the Linux ecosystem

Anton Zhiyanov 2 weeks ago

Go proposal: Goroutine metrics

Part of the Accepted! series, explaining the upcoming Go changes in simple terms. Export goroutine-related metrics from the Go runtime. Ver. 1.26 • Stdlib • Medium impact

New metrics in the runtime/metrics package give better insight into goroutine scheduling. The runtime/metrics package already provides a lot of runtime stats, but it doesn't include metrics for goroutine states or thread counts. Per-state goroutine metrics can be linked to common production issues. An increasing waiting count can show a lock contention problem. A high not-in-go count means goroutines are stuck in syscalls or cgo. A growing runnable backlog suggests the CPUs can't keep up with demand. Observability systems can track these counters to spot regressions, find scheduler bottlenecks, and send alerts when goroutine behavior changes from the usual patterns. Developers can use them to catch problems early without needing full traces.

Add the following metrics to the package:

- Total number of goroutines since the program started.
- Number of goroutines in each state.
- Number of active threads.

The per-state numbers are not guaranteed to add up to the live goroutine count (available since Go 1.16). All metrics use uint64 counters.

Start some goroutines and print the metrics after 100 ms of activity: No surprises here: we read the new metric values the same way as before — using metrics.Read.

𝗣 15490 • 𝗖𝗟 690397, 690398, 690399

P.S. If you are into goroutines, check out my interactive book on concurrency
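As a sketch of the pattern, here's how reading a goroutine counter with metrics.Read looks today, using the long-available /sched/goroutines:goroutines metric (the exact names of the new Go 1.26 metrics aren't reproduced here, but they are read the same way):

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"sync"
	"time"
)

// liveGoroutines reads the current goroutine count via runtime/metrics.
// The /sched/goroutines:goroutines metric has existed since Go 1.16;
// the new per-state metrics are sampled with the same Read call.
func liveGoroutines() uint64 {
	samples := []metrics.Sample{{Name: "/sched/goroutines:goroutines"}}
	metrics.Read(samples)
	return samples[0].Value.Uint64()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(100 * time.Millisecond) // simulate some activity
		}()
	}
	// Sample while the workers are still alive: main + 8 workers.
	fmt.Println("live goroutines:", liveGoroutines())
	wg.Wait()
}
```

Adding a sample per metric name to the slice is all that changes once the new counters land.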

Anton Zhiyanov 2 weeks ago

Gist of Go: Concurrency testing

This is a chapter from my book on Go concurrency, which teaches the topic from the ground up through interactive examples. Testing concurrent programs is a lot like testing single-task programs. If the code is well-designed, you can test the state of a concurrent program with standard tools like channels, wait groups, and other abstractions built on top of them. But if you've made it this far, you know that concurrency is never that easy. In this chapter, we'll go over common testing problems and the solutions that Go offers.

Waiting for goroutines • Checking channels • Checking for leaks • Durable blocking • Instant waiting • Time inside the bubble • Thoughts on time 1 ✎ • Thoughts on time 2 ✎ • Checking for cleanup • Bubble rules • Keep it up

Let's say we want to test this function: Calculations run asynchronously in a separate goroutine. However, the function returns a result channel, so this isn't a problem: At point ⓧ, the test is guaranteed to wait for the inner goroutine to finish. The rest of the test code doesn't need to know anything about how concurrency works inside the function. Overall, the test isn't any more complicated than if the function were synchronous.

But we're lucky that the function returns a channel. What if it doesn't? Let's say the function looks like this: We write a simple test and run it: The assertion fails because at point ⓧ, we didn't wait for the inner goroutine to finish. In other words, we didn't synchronize the test and function goroutines. That's why the variable still has its initial value (0) when we do the check. We can add a short delay with time.Sleep: The test is now passing. But using time.Sleep to sync goroutines isn't a great idea, even in tests. We don't want to set a custom delay for every function we're testing. Also, the function's execution time may be different on the local machine compared to a CI server. If we use a longer delay just to be safe, the tests will end up taking too long to run.
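The two shapes being contrasted look roughly like this (Sum and Counter are illustrative stand-ins, not the book's actual examples):

```go
package main

import (
	"fmt"
	"time"
)

// Sum is the "lucky" case: the work runs in a goroutine, but the
// function returns a result channel, so the caller naturally waits
// by receiving from it.
func Sum(nums []int) <-chan int {
	out := make(chan int)
	go func() {
		total := 0
		for _, n := range nums {
			total += n
		}
		out <- total
	}()
	return out
}

// Counter is the "unlucky" case: it mutates state in a goroutine
// and gives the caller nothing to wait on.
type Counter struct{ n int }

func (c *Counter) Inc() {
	go func() { c.n++ }()
}

func main() {
	got := <-Sum([]int{1, 2, 3}) // ⓧ receiving waits for the goroutine
	fmt.Println(got)             // 6

	c := &Counter{}
	c.Inc()
	// Without synchronization, c.n may still be 0 here. A test would
	// have to fall back on time.Sleep, with all the flakiness that brings.
	time.Sleep(10 * time.Millisecond)
	fmt.Println(c.n)
}
```

The receive from Sum's channel is the synchronization point; Counter offers no such point, which is exactly the problem the rest of the chapter addresses.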
Sometimes you can't avoid using in tests, but since Go 1.25, the package has made these cases much less common. Let's see how it works. The package has a lot going on under the hood, but its public API is very simple: The function creates an isolated bubble where you can control time to some extent. Any new goroutines started inside this bubble become part of the bubble. So, if we wrap the test code with , everything will run inside the bubble — the test code, the function we're testing, and its goroutine. At point ⓧ, we want to wait for the goroutine to finish. The function comes to the rescue! It blocks the calling goroutine until all other goroutines in the bubble are finished. (It's actually a bit more complicated than that, but we'll talk about it later.) In our case, there's only one other goroutine (the inner goroutine), so will pause until it finishes, and then the test will move on. Now the test passes instantly. That's better! ✎ Exercise: Wait until done Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . If you are okay with just theory for now, let's continue. As we've seen, you can use to wait for the tested goroutine to finish, and then check the state of the data you are interested in. You can also use it to check the state of channels. Let's say there's a function that generates N numbers like 11, 22, 33, and so on: And a simple test: Set N=2, get the first number from the generator's output channel, then get the second number. The test passed, so the function works correctly. But does it really? Let's use in "production": Panic! We forgot to close the channel when exiting the inner goroutine, so the for-range loop waiting on that channel got stuck. Let's fix the code: And add a test for the channel state: The test is still failing, even though we're now closing the channel when the goroutine exits. 
This is a familiar problem: at point ⓧ, we didn't wait for the inner goroutine to finish. So when we check the channel, it hasn't closed yet. That's why the test fails. We can delay the check using : But it's better to use : At point ⓧ, blocks the test until the only other goroutine (the inner goroutine) finishes. Once the goroutine has exited, the channel is already closed. So, in the select statement, the case triggers with set to , allowing the test to pass. As you can see, the package helped us avoid delays in the test, and the test itself didn't get much more complicated. As we've seen, you can use to wait for the tested goroutine to finish, and then check the state of the data or channels. You can also use it to detect goroutine leaks. Let's say there's a function that runs the given functions concurrently and sends their results to an output channel: And a simple test: Send three functions to be executed, get the first result from the output channel, and check it. The test passed, so the function works correctly. But does it really? Let's run three times, passing three functions each time: After 50 ms — when all the functions should definitely have finished — there are still 9 running goroutines ( ). In other words, all the goroutines are stuck. The reason is that the channel is unbuffered. If the client doesn't read from it, or doesn't read all the results, the goroutines inside get blocked when they try to send the result of to . Let's fix this by adding a buffer of the right size to the channel: Then add a test to check the number of goroutines: The test is still failing, even though the channel is now buffered, and the goroutines shouldn't block on sending to it. This is a familiar problem: at point ⓧ, we didn't wait for the running goroutines to finish. So is greater than zero, which makes the test fail. We can delay the check using (not recommended), or use a third-party package like goleak (a better option): The test passes now. 
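The generator bug and its fix can be sketched like this (a runnable approximation; the name `Generate` and the exact values are assumptions based on the chapter's "11, 22, 33" description):

```go
package main

import "fmt"

// Generate emits n numbers 11, 22, 33, ... and closes the channel
// when done. The chapter's bug was a missing close, which made
// a for-range loop over the channel deadlock.
func Generate(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out) // the fix: always close on exit
		for i := 1; i <= n; i++ {
			out <- i * 11
		}
	}()
	return out
}

func main() {
	ch := Generate(3)
	for v := range ch {
		fmt.Println(v)
	}
	// After the goroutine exits, a receive on the closed channel
	// returns the zero value with ok == false.
	v, ok := <-ch
	fmt.Println(v, ok) // 0 false
}
```

In a synctest-based test, the closed-channel check goes after `synctest.Wait()`, so the inner goroutine is guaranteed to have run its deferred close first.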
By the way, goleak also uses internally, but it does so much more efficiently. It tries up to 20 times, with the wait time between checks increasing exponentially, starting at 1 microsecond and going up to 100 milliseconds. This way, the test runs almost instantly. Even better, we can check for leaks without any third-party packages by using : Earlier, I said that blocks the calling goroutine until all other goroutines finish. Actually, it's a bit more complicated. blocks until all other goroutines either finish or become durably blocked . We'll talk about "durably" later. For now, let's focus on "become blocked." Let's temporarily remove the buffer from the channel and check the test results: Here's what happens: Next, comes into play. It not only starts the bubble goroutine, but also tries to wait for all child goroutines to finish before it returns. If sees that some goroutines are stuck (in our case, all 9 are blocked trying to send to the channel), it panics: "main bubble goroutine has exited but blocked goroutines remain". So, we found the leak without using or goleak, thanks to the useful features of and : Now let's make the channel buffered and run the test again: As we've found, blocks until all goroutines in the bubble — except the one that called — have either finished or are durably blocked. Let's figure out what "durably blocked" means. For , a goroutine inside a bubble is considered durably blocked if it is blocked by any of the following operations: Other blocking operations are not considered durable, and ignores them. For example: The distinction between "durable" and other types of blocks is just an implementation detail of the package. It's not a fundamental property of the blocking operations themselves. In real-world applications, this distinction doesn't exist, and "durable" blocks are neither better nor worse than any others. Let's look at an example.
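The buffered-channel fix the chapter describes can be sketched as follows. The name `Gather` is my own stand-in; the essential point is the buffer sized to the number of workers, so senders never block even if the caller reads only some results:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Gather runs the given functions concurrently and sends their
// results to the returned channel. The buffer holds every result,
// so worker goroutines never get stuck on send.
func Gather(funcs ...func() int) <-chan int {
	out := make(chan int, len(funcs)) // buffered: no stuck senders
	for _, f := range funcs {
		f := f // capture (pre-Go 1.22 safety)
		go func() { out <- f() }()
	}
	return out
}

func main() {
	before := runtime.NumGoroutine()
	for i := 0; i < 3; i++ {
		out := Gather(
			func() int { return 1 },
			func() int { return 2 },
			func() int { return 3 },
		)
		<-out // read only the first result; the rest sit in the buffer
	}
	time.Sleep(50 * time.Millisecond) // crude wait; synctest does this better
	after := runtime.NumGoroutine()
	// should be 0 when no goroutines are stuck
	fmt.Println("leaked:", after-before)
}
```

With an unbuffered channel, the same `main` would leave 6 goroutines blocked on send, which is exactly the leak the chapter's test (and synctest's panic) catches.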
Let's say there's a type that performs some asynchronous computation: Our goal is to write a test that checks the result while the calculation is still running . Let's see how the test changes depending on how is implemented (except for the version — we'll cover that one a bit later). Let's say is implemented using a done channel: Naive test: The check fails because when is called, the goroutine in hasn't set yet. Let's use to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on reading from the channel. This channel is created inside the bubble, so the block is durable. The call in the test returns as soon as happens, and we get the current value of . Let's say is implemented using select: Let's use to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on a select statement. Both channels used in the select ( and ) are created inside the bubble, so the block is durable. The call in the test returns as soon as happens, and we get the current value of . Let's say is implemented using a wait group: Let's use to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the wait group's call. The group's method was called inside the bubble, so this is a durable block. The call in the test returns as soon as happens, and we get the current value of . Let's say is implemented using a condition variable: Let's use to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the condition variable's call. This is a durable block. The call returns as soon as happens, and we get the current value of . Let's say is implemented using a mutex: Let's try using to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the mutex's call. doesn't consider blocking on a mutex to be durable. The call ignores the block and never returns. The test hangs and only fails when the overall timeout is reached. 
You might be wondering why the authors didn't consider blocking on mutexes to be durable. There are a couple of reasons: ⌘ ⌘ ⌘ Let's go back to the original question: how does the test change depending on how is implemented? It doesn't change at all. We used the exact same test code every time: If your program uses durably blocking operations, always works the same way: Very convenient! ✎ Exercise: Blocking queue Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . If you are okay with just theory for now, let's continue. Inside the bubble, time works differently. Instead of using a regular wall clock, the bubble uses a fake clock that can jump forward to any point in the future. This can be quite handy when testing time-sensitive code. Let's say we want to test this function: The positive scenario is straightforward: send a value to the channel, call the function, and check the result: The negative scenario, where the function times out, is also pretty straightforward. But the test takes the full three seconds to complete: We're actually lucky the timeout is only three seconds. It could have been as long as sixty! To make the test run instantly, let's wrap it in : Note that there is no call here, and the only goroutine in the bubble (the root one) gets durably blocked on a select statement in . Here's what happens next: Thanks to the fake clock, the test runs instantly instead of taking three seconds like it would with the "naive" approach. You might have noticed that quite a few circumstances coincided here: We'll look at the alternatives soon, but first, here's a quick exercise. ✎ Exercise: Wait, repeat Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . 
If you are okay with just theory for now, let's continue. The fake clock in can be tricky. It moves forward only if: ➊ all goroutines in the bubble are durably blocked; ➋ there's a future moment when at least one goroutine will unblock; and ➌ isn't running. Let's look at the alternatives. I'll say right away, this isn't an easy topic. But when has time travel ever been easy? :) Here's the function we're testing: Let's run in a separate goroutine, so there will be two goroutines in the bubble: panicked because the root bubble goroutine finished while the goroutine was still blocked on a select. Reason: only advances the clock if all goroutines are blocked — including the root bubble goroutine. How to fix: Use to make sure the root goroutine is also durably blocked. Now all three conditions are met again (all goroutines are durably blocked; the moment of future unblocking is known; there is no call to ). The fake clock moves forward 3 seconds, which unblocks the goroutine. The goroutine finishes, leaving only the root one, which is still blocked on . The clock moves forward another 2 seconds, unblocking the root goroutine. The assertion passes, and the test completes successfully. But if we run the test with the race detector enabled (using the flag), it reports a data race on the variable: Logically, using in the root goroutine doesn't guarantee that the goroutine (which writes to the variable) will finish before the root goroutine reads from . That's why the race detector reports a problem. Technically, the test passes because of how is implemented, but the race still exists in the code. The right way to handle this is to call after : Calling ensures that the goroutine finishes before the root goroutine reads , so there's no data race anymore. Here's the function we're testing: Let's replace in the root goroutine with : panicked because the root bubble goroutine finished while the goroutine was still blocked on a select.
Reason: only advances the clock if there is no active running. If all bubble goroutines are durably blocked but a is running, won't advance the clock. Instead, it will simply finish the call and return control to the goroutine that called it (in this case, the root bubble goroutine). How to fix: don't use . Let's update to use context cancellation instead of a timer: We won't cancel the context in the test: panicked because all goroutines in the bubble are hopelessly blocked. Reason: only advances the clock if it knows how much to advance it. In this case, there is no future moment that would unblock the select in . How to fix: Manually unblock the goroutine and call to wait for it to finish. Now, cancels the context and unblocks the select in , while makes sure the goroutine finishes before the test checks and . Let's update to lock the mutex before doing any calculations: In the test, we'll lock the mutex before calling , so it will block: The test failed because it hit the overall timeout set in . Reason: only works with durable blocks. Blocking on a mutex lock isn't considered durable, so the bubble can't do anything about it — even though the sleeping inner goroutine would have unlocked the mutex in 10 ms if the bubble had used the wall clock. How to fix: Don't use . Now the mutex unlocks after 10 milliseconds (wall clock), finishes successfully, and the check passes. The clock inside the bubble won't move forward if: ✎ Exercise: Asynchronous repeater Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . If you are okay with just theory for now, let's continue. Let's practice understanding time in the bubble with some thinking exercises. Try to solve the problem in your head before using the playground. Here's a function that performs synchronous work: And a test for it: What is the test missing at point ⓧ?
✓ Thoughts on time 1 There's only one goroutine in the test, so when gets blocked by , the time in the bubble jumps forward by 3 seconds. Then sets to and finishes. Finally, the test checks and passes successfully. No need to add anything. Let's keep practicing our understanding of time in the bubble with some thinking exercises. Try to solve the problem in your head before using the playground. Here's a function that performs asynchronous work: And a test for it: What is the test missing at point ⓧ? ✓ Thoughts on time 2 Let's go over the options. ✘ synctest.Wait This won't help because returns as soon as inside is called. The check fails, and panics with the error: "main bubble goroutine has exited but blocked goroutines remain". ✘ time.Sleep Because of the call in the root goroutine, the wait inside in is already over by the time is checked. However, there's no guarantee that has run yet. That's why the test might pass or might fail. ✘ synctest.Wait, then time.Sleep This option is basically the same as just using , because returns before the in even starts. The test might pass or might fail. ✓ time.Sleep, then synctest.Wait This is the correct answer: Since the root goroutine isn't blocked, it checks while the goroutine is blocked by the call. The check fails, and panics with the message: "main bubble goroutine has exited but blocked goroutines remain". Sometimes you need to test objects that use resources and should be able to release them. For example, this could be a server that, when started, creates a pool of network connections, connects to a database, and writes file caches. When stopped, it should clean all this up. Let's see how we can make sure everything is properly stopped in the tests. We're going to test this server: Let's say we wrote a basic functional test: The test passes, but does that really mean the server stopped when we called ? Not necessarily. 
For example, here's a buggy implementation where our test would still pass: As you can see, the author simply forgot to stop the server here. To detect the problem, we can wrap the test in and see it panic: The server ignores the call and doesn't stop the goroutine running inside . Because of this, the goroutine gets blocked while writing to the channel. When finishes, it detects the blocked goroutine and panics. Let's fix the server code (to keep things simple, we won't support multiple or calls): Now the test passes. Here's how it works: Instead of using to stop something, it's common to use the method. It registers a function that will run when the test finishes: Functions registered with run in last-in, first-out (LIFO) order, after all deferred functions have executed. In the test above, there's not much difference between using and . But the difference becomes important if we move the server setup into a separate helper function, so we don't have to repeat the setup code in different tests: The approach doesn't work because it calls when returns — before the test assertions run: The approach works because it calls when has finished — after all the assertions have already run: Sometimes, a context ( ) is used to stop the server instead of a separate method. In that case, our server interface might look like this: Now we don't even need to use or to check whether the server stops when the context is canceled. Just pass as the context: returns a context that is automatically created when the test starts and is automatically canceled when the test finishes. Here's how it works: To check for stopping via a method or function, use or . To check for cancellation or stopping via context, use . Inside a bubble, returns a context whose channel is associated with the bubble. The context is automatically canceled when ends. Functions registered with inside the bubble run just before finishes. Let's go over the rules for living in the bubble. 
The following operations durably block a goroutine: The limitations are quite logical, and you probably won't run into them.

- Don't create channels or objects that contain channels (like tickers or timers) outside the bubble. Otherwise, the bubble won't be able to manage them, and the test will hang:
- Don't access synchronization primitives associated with a bubble from outside the bubble:
- Don't call , , or inside a bubble:
- Don't call inside the bubble:
- Don't call from outside the bubble:
- Don't call concurrently from multiple goroutines:

✎ Exercise: Testing a pipeline Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . If you are okay with just theory for now, let's continue.

The package is a complicated beast. But now that you've studied it, you can test concurrent programs no matter what synchronization tools they use—channels, selects, wait groups, timers or tickers, or even . In the next chapter, we'll talk about concurrency internals (coming soon). Pre-order for $10   or read online

- Three calls to start 9 goroutines.
- The call to blocks the root bubble goroutine ( ).
- One of the goroutines finishes its work, tries to write to , and gets blocked (because no one is reading from ).
- The same thing happens to the other 8 goroutines.
- sees that all the child goroutines in the bubble are blocked, so it unblocks the root goroutine.
- The root goroutine finishes.

- unblocks as soon as all other goroutines are durably blocked.
- panics when finished if there are still blocked goroutines left in the bubble.

- Sending to or receiving from a channel created within the bubble.
- A select statement where every case is a channel created within the bubble.
- Calling if all calls were made inside the bubble.

- Sending to or receiving from a channel created outside the bubble.
- Calling or .
- I/O operations (like reading a file from disk or waiting for a network response).
- System calls and cgo calls.

- Mutexes are usually used to protect shared state, not to coordinate goroutines (the example above is completely unrealistic). In tests, you usually don't need to pause before locking a mutex to check something.
- Mutex locks are usually held for a very short time, and mutexes themselves need to be as fast as possible. Adding extra logic to support could slow them down in normal (non-test) situations.

- It waits until all other goroutines in the bubble are blocked.
- Then, it unblocks the goroutine that called it.

- The bubble checks if the goroutine can be unblocked by waiting. In our case, it can — we just need to wait 3 seconds.
- The bubble's clock instantly jumps forward 3 seconds.
- The select in chooses the timeout case, and the function returns .
- The test assertions for and both pass successfully.

- There's no call.
- There's only one goroutine.
- The goroutine is durably blocked. It will be unblocked at a certain point in the future.

- There are any goroutines that aren't durably blocked.
- It's unclear how much time to advance.
- is running.

- Because of the call in the root goroutine, the wait inside in is already over by the time is checked.
- Because of the call, the goroutine is guaranteed to finish (and hence to call ) before is checked.

- The main test code runs.
- Before the test finishes, the deferred is called.
- In the server goroutine, the case in the select statement triggers, and the goroutine ends.
- sees that there are no blocked goroutines and finishes without panicking.

- The main test code runs.
- Before the test finishes, the context is automatically canceled.
- The server goroutine stops (as long as the server is implemented correctly and checks for context cancellation).
- sees that there are no blocked goroutines and finishes without panicking.

- A bubble is created by calling . Each call creates a separate bubble. Goroutines started inside the bubble become part of it.
- The bubble can only manage durable blocks. Other types of blocks are invisible to it.
- If all goroutines in the bubble are durably blocked with no way to unblock them (such as by advancing the clock or returning from a call), panics.
- When finishes, it tries to wait for all child goroutines to complete. However, if even a single goroutine is durably blocked, panics.
- Calling returns a context whose channel is associated with the bubble.
- Functions registered with run inside the bubble, immediately before returns.
- Calling in a bubble blocks the goroutine that called it. returns when all other goroutines in the bubble are durably blocked. returns when all other goroutines in the bubble have finished.
- The bubble uses a fake clock (starting at 2000-01-01 00:00:00 UTC).
- Time in the bubble only moves forward if all goroutines are durably blocked. Time advances by the smallest amount needed to unblock at least one goroutine.
- If the bubble has to choose between moving time forward or returning from a running , it returns from .

- A blocking send or receive on a channel created within the bubble.
- A blocking select statement where every case is a channel created within the bubble.
- Calling if all calls were made inside the bubble.

Filippo Valsorda 2 weeks ago

The 2025 Go Cryptography State of the Union

This past August, I delivered my traditional Go Cryptography State of the Union talk at GopherCon US 2025 in New York. It goes into everything that happened at the intersection of Go and cryptography over the last year. You can watch the video (with manually edited subtitles, for my fellow subtitles enjoyers) or read the transcript below (for my fellow videos not-enjoyers). The annotated transcript below was made with Simon Willison’s tool . All pictures were taken around Rome, the Italian countryside, and the skies of the Northeastern United States. Welcome to my annual performance review. We are going to talk about all of the stuff that we did in the Go cryptography world during the past year. When I say "we," it doesn't mean just me, it means me, Roland Shoemaker, Daniel McCarney, Nicola Morino, Damien Neil, and many, many others, both from the Go team and from the Go community that contribute to the cryptography libraries all the time. I used to do this work at Google, and I now do it as an independent as part of and leading Geomys , but we'll talk about that later. When we talk about the Go cryptography standard libraries, we talk about all of those packages that you use to build secure applications. That's what we make them for. We do it to provide you with encryption and hashes and protocols like TLS and SSH, to help you build secure applications . The main headlines of the past year: We shipped post quantum key exchanges, which is something that you will not have to think about and will just be solved for you. We have solved FIPS 140, which some of you will not care about at all and some of you will be very happy about. And the thing I'm most proud of: we did all of this while keeping an excellent security track record, year after year. This is an update to something you've seen last year: the Go Security Track Record. It's the list of vulnerabilities in the Go cryptography packages.
We don't assign a severity—because it's really hard, instead they're graded on the "Filippo's unhappiness score." It goes shrug, oof, and ouch. Time goes from bottom to top, and you can see how as time goes by things have been getting better. People report more things, but they're generally more often shrugs than oofs and there haven't been ouches. More specifically, we haven't had any oof since 2023. We didn't have any Go-specific oof since 2021. When I say Go-specific, I mean: well, sometimes the protocol is broken, and as much as we want to also be ahead of that by limiting complexity, you know, sometimes there's nothing you can do about that. And we haven't had ouches since 2019 . I'm very happy about that. But if this sounds a little informal, I'm also happy to report that we had the first security audit by a professional firm. Trail of Bits looked at all of the nuts and bolts of the Go cryptography standard library: primitives, ciphers, hashes, assembly implementations. They didn't look at the protocols, which is a lot more code on top of that, but they did look at all of the foundational stuff. And I'm happy to say that they found nothing . Two of a kind t-shirts, for me and Roland Shoemaker. It is easy though to maintain a good security track record if you never add anything, so let's talk about the code we did add instead. First of all, post-quantum key exchanges. We talked about post-quantum last year, but as a very quick refresher: Now, we focused on post-quantum key exchange because the key exchange defends against the most urgent risk, which is that somebody might be recording connections today, keeping them saved on some storage for the next 5-50 years and then use the future quantum computers to decrypt those sessions. I'm happy to report that we now have ML-KEM, which is the post-quantum key exchange algorithm selected by the NIST competition, an international competition run in the open. 
You can use it directly from the crypto/mlkem standard library package starting in Go 1.24, but you're probably not gonna do that. Instead, you're probably going to just use crypto/tls, which by default now uses a hybrid of X25519 and ML-KEM-768 for all connections with other systems that support it. Why hybrid? Because this is new cryptography. So we are still a little worried that somebody might break it. There was one that looked very good and had very small ciphertext, and we were all like, “yes, yes, that's good, that's good.” And then somebody broke it on a laptop. It was very annoying. We're fairly confident in lattices. We think this is the good one. But still, we are taking both the old stuff and the new stuff, hashing them together, and unless you have both a quantum computer to break the old stuff and a mathematician who broke the new stuff, you're not breaking the connection. crypto/tls can now negotiate that with Chrome and can negotiate that with other Go 1.24+ applications. Not only that, we also removed any choice you had in ordering of key exchanges because we think we know better than you and— that didn't come out right, uh. … because we assume that you actually want us to make those kind of decisions, so as long as you don't turn it off, we will default to post-quantum. You can still turn it off. But as long as you don't turn it off, we'll default to the post-quantum stuff to keep your connection safe from the future. Same stuff with x/crypto/ssh. Starting in v0.38.0. SSH does the same thing, they just put X25519 and ML-KEM-768 in a different order, which you would think doesn't matter—and indeed it doesn't matter—but there are rules where "no, no, no, you have to put that one first." And the other rule says "no, you have to put that one first." It's been a whole thing. I'm tired. OpenSSH supports it, so if you connect to a recent enough version of OpenSSH, that connection is post-quantum and you didn't have to do anything except update. 
Okay, but you said key exchanges and digital signatures are broken. What about the latter? Well, key exchanges are urgent because of the record-now-decrypt-later problem, but unless the physicists that are developing quantum computers also develop a time machine, they can't use the QC to go back in time and use a fake signature today. So if you're verifying a signature today, I promise you it's not forged by a quantum computer. We have a lot more time to figure out post-quantum digital signatures. But if we can, why should we not start now? Well, it's different. Key exchange, we knew what hit we had to take. You have to do a key exchange, you have to do it when you start the connection, and ML-KEM is the algorithm we have, so we're gonna use it. Signatures, we developed a lot of protocols like TLS, SSH, back when it was a lot cheaper to put signatures on the wire. When you connect to a website right now, you get five signatures. We can't send you five 2KB blobs every time you connect to a website. So we are waiting to give time to protocols to evolve, to redesign things with the new trade-offs in mind of signatures not being cheap. We are kind of slow rolling intentionally the digital signature side because it's both not as urgent and not as ready to deploy. We can't do the same “ta-da, it's solved for you” show because signatures are much harder to roll out. Let's talk about another thing that I had mentioned last year, which is FIPS 140. FIPS 140 is a US government regulation for how to do cryptography. It is a list of algorithms, but it's not just a list of algorithms. It's also a list of rules that the modules have to follow. What is a module? Well, a module used to be a thing you would rack. All the rules are based on the idea that it's a thing you can rack. Then the auditor can ask “what is the module’s boundary?” And you're like, “this shiny metal box over here." And, you know, that works. 
When people ask those questions of libraries, though, I do get a little mad every time. Like, what are the data input ports of your library? Ports. Okay. Anyway, it's an interesting thing to work with. To comply with FIPS 140 in Go, up to now, you had to use an unsupported GOEXPERIMENT, which would replace all of the Go cryptography standard library, all of the stuff I'm excited about, with the BoringCrypto module, which is a FIPS 140 module developed by the BoringSSL folks. We love the BoringSSL folks, but that means using cgo, and we do not love cgo. It has memory safety issues, it makes cross-compilation difficult, it’s not very fast. Moreover, the list of algorithms and platforms of BoringCrypto is tailored to the needs of BoringSSL and not to the needs of the Go community, and their development cycle doesn't match our development cycle: we don't decide when that module gets validated. Speaking of memory safety, I lied a little. Trail of Bits did find one vulnerability. They found it in Go+BoringCrypto, which was yet another reason to try to push away from it. Instead, we've got now the FIPS 140-3 Go Cryptographic Module. Not only is it native Go, it's actually just a different name for the internal Go packages that all the regular Go cryptography packages use for the FIPS 140 algorithms. We just moved them into their own little bubble so that when they ask us “what is the module boundary” we can point at those packages. Then there's a runtime mode which enables some of the self-tests and slow stuff that you need for compliance. It also tells crypto/tls not to negotiate stuff that's not FIPS, but aside from that, it doesn't change any observable behavior. We managed to keep everything working exactly the same: you don't import a different package, you don't do anything different, your applications just keep working the same way. We're very happy about that.
Finally, you can at compile time select a GOFIPS140 frozen module, which is just a zip file of the source of the module as it was back when we submitted it for validation, which is a compliance requirement sometimes. By the way, that means we have to be forward compatible with future versions of Go, even for internal packages, which was a little spicy. You can read more in the upstream FIPS 140-3 docs. You might be surprised to find out that using a FIPS 140 algorithm from a FIPS 140 module is not actually enough to be FIPS 140 compliant. The FIPS 140 module also has to be tested for that specific algorithm. What we did is we just tested them all, so you can use any FIPS 140 algorithm without worrying about whether it's tested in our module. When I say we tested them all, I mean that some of them we tested with four different names. NIST calls HKDF alternatively SP 800-56C two-step KDF, SP 800-133 Section 6.3 CKG, SP 800-108 Feedback KDF, and Implementation Guidance D.P OneStepNoCounter KDF (you don't wanna know). It has four different names for the same thing. We just tested it four times, it's on the certificate, you can use it whatever way you want and it will be compliant. But that's not enough. Even if you use a FIPS 140 algorithm from a FIPS 140 module that was tested for the algorithm, it's still not enough, because it has to run on a platform that was tested as part of the validation. So we tested on a lot of platforms. Some of them were paid for by various Fortune 100s that had an interest in them getting tested, but some of them had no sponsors. We really wanted to solve this problem for everyone, once and for all, so Geomys just paid for all the FreeBSD, macOS, even Windows testing so that we could say "run it on whatever and it's probably going to be compliant." (Don't quote me on that.) How did we test on that many machines? Well, you know, we have this sophisticated data center… Um, no. No, no. I got a bunch of stuff shipped to my place.
That's my NAS now. It's an Ampere Altra Q64-22, sixty-four arm64 cores, and yep, it's my NAS. Then I tested it on, you know, this sophisticated arm64 macOS testing platform. And then on the Windows one, which is my girlfriend's laptop. And then the arm one, which was my router. Apparently I own an EdgeRouter now? It's sitting in the data center which is totally not my kitchen. It was all a very serious and regimented thing, and all of it is actually recorded, in recorded sessions with the accredited laboratories, so all this is now on file with the US government. You might or might not be surprised to hear that the easiest way to meet the FIPS 140 requirements is not to exceed them. That's annoying and a problem of FIPS 140 in general: if you do what everybody else does, which is just clearing the bar, nobody will ask questions, so there's a strong temptation to lower security in FIPS 140 mode. We just refused to accept that. Instead, we figured out complex stratagems. For example, for randomness, the safest thing to do is to just take randomness from the kernel every time you need it. The kernel knows if a virtual machine was just cloned and we don't, so we risk generating the same random bytes twice. But NIST will not allow that. You need to follow a bunch of standards for how the randomness is generated, and the kernel doesn't. So what we do is we do everything that NIST asks, and then every time you ask for randomness, we squirrel off, go to the kernel, get a little piece of extra entropy, and stir it into the pot before giving back the result. It's still NIST compliant because it's as strong as both the NIST and the kernel solution, but it took some significant effort to show it is compliant. We did the same for ECDSA. ECDSA is a digital signature mechanism. We've talked about it a few other times. It's just a way to take a message and a private key and generate a signature, here (s, r).
To make a signature, you also need a random number, and that number must be used only once with the same private key. You cannot reuse it. That number is k here. Why can you not reuse it? Because if you reuse it, then you can do this fun algebra thing and, pop, the private key falls out just from smashing two signatures together. Bad, really, really bad. How do we generate this number that must never be the same? Well, one option is we make it random. But what if your random number generator breaks and generates the same random number twice? That would leak the private key, and that would be bad. So the community came up with deterministic ECDSA. Instead of generating the nonce at random, we are going to hash the message and the private key. This is still actually a little risky though, because if there's a fault in the CPU, for example, or a bug (say, you're taking the wrong inputs), you might still end up generating the same value but signing a slightly different message. How do we mitigate both of those? We do both. We take some randomness and the private key and the message, we hash them all together, and now it's really, really hard for the number to come out the same. That's called hedged ECDSA. The Go crypto library has been doing hedged ECDSA from way before it was called hedged and way before I was on the team. Except… random ECDSA has always been FIPS. Deterministic ECDSA has been FIPS since a couple years ago. Hedged ECDSA is technically not FIPS. We really didn't want to make our ECDSA package less secure, so we found a forgotten draft that specifies a hedged ECDSA scheme, and we proceeded to argue that actually if you read SP 800-90A Revision 1 very carefully you realize that if you claim that the private key is just the DRBG entropy plus two-thirds of the DRBG nonce, you are allowed to use it because of SP 800-57 Part 1, etc etc etc.
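That "fun algebra thing" fits in a few lines. Here's a toy sketch with made-up constants and no elliptic curve at all (only the signing equation s = k⁻¹(h + r·d) mod n, which is where the leak lives), showing how a reused nonce hands over the private key:

```python
# Toy demonstration of why an ECDSA nonce k must never repeat.
# No real curve here: r is a stand-in value that depends only on k.
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141  # a group order

d = 0x1CEB00DA   # the private key (secret)
k = 0x5EED       # the nonce, wrongly reused for two messages
r = pow(7, k, n) # stand-in for the x-coordinate of k*G

def sign(h: int) -> int:
    """s = k^-1 * (h + r*d) mod n"""
    return pow(k, -1, n) * (h + r * d) % n

h1, h2 = 0xAAAA, 0xBBBB     # hashes of two different messages
s1, s2 = sign(h1), sign(h2)  # same k, same r: the fatal mistake

# Attacker algebra: subtracting the equations cancels r*d, so
# k = (h1 - h2) / (s1 - s2), and then d = (s1*k - h1) / r.
k_rec = (h1 - h2) * pow(s1 - s2, -1, n) % n
d_rec = (s1 * k_rec - h1) * pow(r, -1, n) % n

assert d_rec == d  # the private key popped out
print(hex(d_rec))
```

Deterministic and hedged ECDSA both exist to make sure the two signing equations above can never share the same k.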
We basically just figured out a way to claim it was fine and the lab eventually said "okay, shut up." I'm very proud of that one. If you want to read more about this, check out the announcement blog post. If you know you need commercial services for FIPS 140, here's the Geomys FIPS 140 commercial services page. If you don't know if you need them, you actually probably don't. It's fine, the standard library will probably solve this for you now. Okay, but who cares about this FIPS 140 stuff? "Dude, we've been talking about FIPS 140 for 10 minutes and I don't care about that." Well, I care because I spent my last year on it, and that apparently made me the top committer for the cycle to the Go repo, and that's mostly FIPS 140 stuff. I don't know how to feel about that. There have actually been a lot of positive side effects from the FIPS 140 effort. We took care to make sure that everything we touched we would leave in a better state. For example, there are new packages that moved from x/crypto into the standard library: crypto/hkdf, crypto/pbkdf2, crypto/sha3. SHA-3 is faster and doesn't allocate anymore. HKDF has a new generic API which lets you pass in either a function that returns a concrete type that implements Hash or a function that returns a Hash interface, which otherwise was a little annoying. (You had to make a little closure.) I like it. We restructured crypto/aes and crypto/cipher and in the process merged a contribution from a community member that made AES-CTR, the counter mode, between 2 and 9 times faster. That was a pretty good result. The assembly interfaces are much more consistent now. Finally, we finished cleaning up crypto/rsa. If you remember from last year, we made the crypto/rsa sign and verify operations not use math/big and use constant time code. Now we also made key generation, validation, and pre-computation all not use math/big. That made loading keys that were serialized to JSON a lot faster, and made key generation much faster.
But how much faster? Benchmarking key generation is really hard because it's a random process: you take a random number and you check, is it prime? No. Toss. Is it prime? Nope. Toss. Is it prime? You keep doing this. If you're lucky, it's very fast. If you are unlucky, very slow. It's a geometric distribution, and if you want to average it out, you have to run for hours. Instead, I figured out a new way by mathematically deriving the average number of pulls you are supposed to do and preparing a synthetic run that gives exactly the expected mean number of checks, so that we get a representative sample to benchmark deterministically. That was a lot of fun. Moreover, we detect more broken keys, and we did a rare backwards compatibility break to stop supporting keys smaller than 1024 bits. 1024 is already pretty small, you should be using 2048 minimum, but if you're using less than 1024, it can be broken on the proverbial laptop. It's kind of silly that a production library lets you do something so insecure, and you can't tell the keys apart just by looking at the code. You have to know what the size of the key is. So we just took that out. I expected people to yell at me. Nobody yelled at me. Good job community. Aside from adding stuff, you know that we are very into testing and that testing is how we keep that security track record that we talked about. I have one bug in particular that is my white whale. (You might say, "Filippo, well-adjusted people don't have white whales." Well, we learned nothing new, haven't we?) My white whale is this assembly bug that we found at Cloudflare before I joined the Go team. I spent an afternoon figuring out an exploit for it with Sean Devlin in Paris, while the yellow jackets set fire to cop cars outside. That's a different story. It's an assembly bug where the carry—literally the carry like when you do a pen and paper multiplication—was just not accounted for correctly.
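This class of bug is easy to model outside assembly. Here's a hedged Python sketch of multi-limb addition, where a "mutant" version discards the carry the way a plain add would:

```python
# Toy model of the carry-bit bug class: 64-bit limb addition where a
# mutated version uses a plain add (carry discarded) instead of
# add-with-carry. Limb lists are little-endian.
MASK = (1 << 64) - 1

def add_limbs(a, b, drop_carry=False):
    """Add two equal-length lists of 64-bit limbs."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & MASK)
        carry = 0 if drop_carry else s >> 64
    return out

# Inputs that never produce a carry: the mutant agrees with the real
# code, so a test like this has NOT covered the carry path at all.
small = ([1, 2, 3], [4, 5, 6])
assert add_limbs(*small) == add_limbs(*small, drop_carry=True)

# An input that overflows the low limb: only now does the mutant diverge.
big = ([MASK, 0, 0], [1, 0, 0])
assert add_limbs(*big) != add_limbs(*big, drop_carry=True)
print(add_limbs(*big))  # [0, 1, 0] — the carry propagated
```

The mutation-testing framework described next automates exactly this: swap the add-with-carry for a plain add and fail the meta-test if the suite still passes.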
You can watch my talk Squeezing a Key through a Carry Bit if you are curious to learn more about it. The problem with this stuff is that it's so hard to get code coverage for it because all the code always runs. It's just that you don't know if it always runs with that carry at zero, and if the carry was one, it’d do the wrong math. I think we've cracked it, by using mutation testing. We have a framework that tells the assembler, "hey, anywhere you see an add-with-carry, replace it with a simple add that discards the carry." Then we run the tests. If the tests still pass, the test did not cover that carry. If that happens we fail a meta-test and tell whoever's sending the CL, “hey, no, no, no, you gotta test that.” Same for checking the case in which the carry is always set. We replace the add-with-carry with a simple add and then insert a +1. It's a little tricky. If you want to read more about it, it's in this blog post . I'm very hopeful that will help us with all this assembly stuff. Next, accumulated test vectors . This is a little trick that I'm very very fond of. Say you want to test a very large space. For example there are two inputs and they can both be 0 to 200 bytes long, and you want to test all the size combinations. That would be a lot of test vectors, right? If I checked in a megabyte of test vectors every time I wanted to do that, people eventually would yell at me. Instead what we do is run the algorithm with each size combination, and take the result and we put it inside a rolling hash. Then at the end we take the hash result and we check that it comes out right. We do this with two implementations. If it comes out to the same hash, great. If it comes out not to the same hash, it doesn't help you figure out what the bug is, but it tells you there's a bug. I'll take it. We really like reusing other people's tests. We're lazy. 
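The accumulated test vector trick sketched in Python (the primitive and function names here are stand-ins, not the real Go code):

```python
import hashlib

def reference(a: bytes, b: bytes) -> bytes:
    # Stand-in for the primitive under test (imagine a hash or AEAD);
    # here just SHA-256 over a length-prefixed concatenation.
    return hashlib.sha256(len(a).to_bytes(2, "big") + a + b).digest()

def optimized(a: bytes, b: bytes) -> bytes:
    # A second implementation of the same primitive (imagine: the
    # assembly fast path). Must agree with reference() on all inputs.
    h = hashlib.sha256()
    h.update(len(a).to_bytes(2, "big"))
    h.update(a)
    h.update(b)
    return h.digest()

def accumulated_vector(impl) -> str:
    # Run every (len_a, len_b) size combination from 0 to 200 bytes and
    # fold each output into one rolling hash: 40,401 cases collapse
    # into a single digest that is cheap to check in.
    acc = hashlib.sha256()
    for i in range(201):
        for j in range(201):
            acc.update(impl(b"\xaa" * i, b"\xbb" * j))
    return acc.hexdigest()

# If the digests differ, there's a bug somewhere in one implementation
# (though the digest alone won't tell you where).
assert accumulated_vector(reference) == accumulated_vector(optimized)
print("accumulated digests match")
```

One 64-character digest stands in for what would otherwise be a megabyte of checked-in test vectors.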
The BoringSSL people have a fantastic suite of tests for TLS called BoGo, and Daniel has been doing fantastic work integrating that and making crypto/tls stricter and stricter in the process. It's now much more spec compliant on the little things where it goes like, "no, no, no, you're not allowed to put a zero here" and so on. Then, the Let's Encrypt people have a test tool for the ACME protocol called Pebble. (Because it's a small version of their production system called Boulder! It took me a long time to figure it out and eventually I was like ooooohhh.) Finally, NIST has this X.509 interoperability test suite, which just doesn't have a good name. It's good though. More assembly cleanups. There used to be places in assembly where—as if assembly was not complicated enough—instructions were just written down as raw machine code. Sometimes even the comment was wrong! Can you tell the comment changed in that patch? This is a thing Roland and Joel found. Now there's a test that will just yell at you if you try to commit raw machine code like that. We also removed all the assembly that was specifically there for speeding up stuff on CPUs that don't have AVX2. AVX2 came out in 2015 and if you want to go fast, you're probably not using the CPU generation from back then. We still run on it, just not as fast. More landings! I'm going to speed through these ones. This is all stuff that we talked about last year and that we actually landed. Stuff like data independent timing to tell the CPU, "no, no, I actually did mean for you to do that in constant time, goddammit." And server-side TLS Encrypted Client Hello, which is a privacy improvement. We had client side, now we have server side. crypto/rand.Read never fails. We promised that, we did that. Now, do you know how hard it is to test the failure case of something that never fails? I had to re-implement the seccomp library to tell the kernel to break the getrandom syscall to check what happens when it doesn't work.
There are tests all pointing guns at each other to make sure the fallback both works and is never hit unexpectedly. It's also much faster now because Jason Donenfeld added the Linux getrandom VDSO. Sean Liao added rand.Text like we promised. Then more stuff like hash.Cloner , which I think makes a lot of things a little easier, and more and more and more and more. The Go 1.24 and Go 1.25 release notes are there for you. x/crypto/ssh is also under our maintenance and some excellent stuff happened there, too. Better tests, better error messages, better compatibility, and we're working on some v2 APIs . If you have opinions, it’s time to come to those issues to talk about them! It’s been an exciting year, and I'm going to give you just two samples of things we're planning to do for the next year. One is TLS profiles. Approximately no one wants to specifically configure the fifteen different knobs of a TLS library. Approximately no one—because I know there are some people who do and they yell at me regularly. But instead most people just want "hey, make it broadly compatible." "Hey, make it FIPS compliant." "Hey, make it modern." We're looking for a way to make it easy to just say what your goal is, and then we do all the configuration for you in a way that makes sense and that evolves with time. I'm excited about this one. And maybe something with passkeys? If you run websites that authenticate users a bunch with password hashes and maybe also with WebAuthN, find me, email us, we want feedback. We want to figure out what to build here, into the standard library. Alright, so it's been a year of cryptography, but it's also been a year of Geomys. Geomys launched a year ago here at GopherCon. If you want an update, we went on the Fallthrough podcast to talk about it , so check that out. We are now a real company and how you know is that we have totes: it's the equivalent of a Facebook-official relationship. 
The best FIPS 140 side effect has been that we have a new maintainer. Daniel McCarney joined us to help with the FIPS effort and then we were working very well together so Geomys decided to just take him on as a permanent maintainer on the Go crypto maintenance team. I’m very excited about that. This is all possible thanks to our clients, and if you have any questions, here are the links. You might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . My work is made possible by Geomys , an organization of professional Go maintainers, which is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs — We at Ava Labs , maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network ), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team. Post-quantum cryptography is about the future. We are worried about quantum computers that might exist… 5-50 (it's a hell of a range) years from now, and that might break all of asymmetrical encryption. 
(Digital signatures and key exchanges.) Post-quantum cryptography runs on classical computers. It's cryptography that we can do now that resists future quantum computers. Post-quantum cryptography is fast, actually. If you were convinced that for some reason it was slow, that's a common misconception. However, post-quantum cryptography is large. Which means that we have to send a lot more bytes on the wire to get the same results.

Max Woolf 3 weeks ago

Nano Banana can be prompt engineered for extremely nuanced AI image generation

You may not have heard about new AI image generation models as much lately, but that doesn't mean that innovation in the field has stagnated: it's quite the opposite. FLUX.1-dev immediately overshadowed the famous Stable Diffusion line of image generation models, while leading AI labs have released models such as Seedream, Ideogram, and Qwen-Image. Google also joined the action with Imagen 4. But all of those image models were vastly overshadowed when ChatGPT added free image generation support in March 2025. After going organically viral on social media, ChatGPT became the new benchmark for how most people perceive AI-generated images, for better or for worse. The model has its own image "style" for common use cases, which makes it easy to identify that ChatGPT made it. Two sample generations from ChatGPT. ChatGPT image generations often have a yellow hue in their images. Additionally, cartoons and text often have the same linework and typography. Of note, the underlying image generation model is autoregressive. While most image generation models are diffusion-based to reduce the amount of compute needed to train and generate from such models, ChatGPT's model works by generating tokens in the same way that ChatGPT generates the next text token, then decoding them into an image. It's extremely slow at about 30 seconds to generate each image at the highest quality (the default in ChatGPT), but it's hard for most people to argue with free. In August 2025, a mysterious new text-to-image model appeared on LMArena: a model code-named "nano-banana". This model was eventually publicly released by Google as Gemini 2.5 Flash Image, an image generation model that works natively with their Gemini 2.5 Flash model. Unlike Imagen 4, it is indeed autoregressive, generating 1,290 tokens per image.
After Nano Banana's popularity pushed the Gemini app to the top of the mobile App Stores, Google eventually made Nano Banana the colloquial name for the model, as it's definitely more catchy than "Gemini 2.5 Flash Image". The first screenshot on the iOS App Store for the Gemini app. Personally, I care little about which image generation AI the leaderboards say looks the best. What I do care about is how well the AI adheres to the prompt I provide: if the model can't follow the requirements I desire for the image (and my requirements are often specific), then the model is a nonstarter for my use cases. At the least, if the model does have strong prompt adherence, any "looking bad" aspect can be fixed with prompt engineering and/or traditional image editing pipelines. After running Nano Banana through its paces with my comically complex prompts, I can confirm that thanks to Nano Banana's robust text encoder, it has such extremely strong prompt adherence that Google has understated how well it works. Like ChatGPT, Google offers ways to generate images from Nano Banana for free. The most popular method is through Gemini itself, either on the web or in the mobile app, by selecting the "Create Image 🍌" tool. Alternatively, Google also offers free generation in Google AI Studio when Nano Banana is selected in the right sidebar, which also allows for setting generation parameters such as image aspect ratio and is therefore my recommendation. In both cases, the generated images have a visible watermark in the bottom right corner of the image. For developers who want to build apps that programmatically generate images from Nano Banana, Google offers an endpoint on the Gemini API. Each image generated costs roughly $0.04/image for a 1 megapixel image (e.g. 1024x1024 if a 1:1 square): on par with most modern popular diffusion models despite being autoregressive, and much cheaper than ChatGPT's $0.17/image.
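The $0.04 figure is easy to sanity-check from the token count. The per-token rate below is my assumption of the published Gemini 2.5 Flash Image output pricing, not something stated in this post:

```python
# Rough cost check, assuming a rate of $30 per 1M output tokens for
# Gemini 2.5 Flash Image; each generated image is 1,290 output tokens.
tokens_per_image = 1290
usd_per_token = 30 / 1_000_000

cost = tokens_per_image * usd_per_token
print(round(cost, 4))  # 0.0387 → roughly $0.04/image
```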
Working with the Gemini API is a pain and requires annoying image encoding/decoding boilerplate, so I wrote and open-sourced a Python package: gemimg, a lightweight wrapper around Gemini API's Nano Banana endpoint that lets you generate images with a simple prompt, in addition to handling cases such as image input along with text prompts. I chose to use the Gemini API directly despite protests from my wallet for three reasons: a) web UIs to LLMs often have system prompts that interfere with user inputs and can give inconsistent output, b) using the API will not show a visible watermark in the generated image, and c) I have some prompts in mind that are…inconvenient to put into a typical image generation UI. Let's test Nano Banana out, but since we want to test prompt adherence specifically, we'll start with more unusual prompts. My go-to test case is: I like this prompt because not only is it an absurd prompt that gives the image generation model room to be creative, but the AI model also has to handle the maple syrup and how it would logically drip down from the top of the skull pancake and adhere to the bony breakfast. The result: That is indeed in the shape of a skull and is indeed made out of pancake batter, blueberries are indeed present on top, and the maple syrup does indeed drip down from the top of the pancake while still adhering to its unusual shape, albeit some trails of syrup disappear/reappear. It's one of the best results I've seen for this particular test, and it's one that doesn't have obvious signs of "AI slop" aside from the ridiculous premise. Now, we can try another one of Nano Banana's touted features: editing. Image editing, where the prompt targets specific areas of the image while leaving everything else as unchanged as possible, has been difficult with diffusion-based models until very recently with Flux Kontext.
Autoregressive models in theory should have an easier time doing so, as they have a better understanding of which specific tokens correspond to which areas of the image. While most image editing approaches encourage using a single edit command, I want to challenge Nano Banana. Therefore, I gave Nano Banana the generated skull pancake, along with five edit commands simultaneously: All five of the edits are implemented correctly with only the necessary aspects changed, such as removing the blueberries on top to make room for the mint garnish, and the pooling of the maple syrup on the new cookie-plate is adjusted. I'm legit impressed. Now we can test more difficult instances of prompt engineering. One of the most compelling-but-underdiscussed use cases of modern image generation models is being able to put the subject of an input image into another scene. For open-weights image generation models, it's possible to "train" the models to learn a specific subject or person, even if they are not notable enough to be in the original training dataset, using a technique such as finetuning the model with a LoRA on only a few sample images of your desired subject. Training a LoRA is not only very computationally intensive/expensive, but it also requires care and precision and is not guaranteed to work—speaking from experience. Meanwhile, if Nano Banana can achieve the same subject consistency without requiring a LoRA, that opens up many fun opportunities. Way back in 2022, I tested a technique that predated LoRAs, known as textual inversion, on the original Stable Diffusion in order to add a very important concept to the model: Ugly Sonic, from the initial trailer for the Sonic the Hedgehog movie back in 2019. One of the things I really wanted Ugly Sonic to do is to shake hands with former U.S. President Barack Obama, but that didn't quite work out as expected. 2022 was a now-unrecognizable time when absurd errors in AI were celebrated.
Can the real Ugly Sonic finally shake Obama's hand? Of note, I chose this test case to assess image generation prompt adherence because image models may assume I'm prompting the original Sonic the Hedgehog and ignore the aspects of Ugly Sonic that are distinct to only him. Specifically, I'm looking for: I also confirmed that Ugly Sonic is not surfaced by Nano Banana, and prompting as such just makes a Sonic that is ugly, purchasing a back alley chili dog. I gave Gemini the two images of Ugly Sonic above (a close-up of his face and a full-body shot to establish relative proportions) and this prompt: That's definitely Obama shaking hands with Ugly Sonic! That said, there are still issues: the color grading/background blur is too "aesthetic" and less photorealistic, Ugly Sonic has gloves, and Ugly Sonic is insufficiently lanky. Back in the days of Stable Diffusion, the use of prompt engineering buzzwords to generate "better" images in light of weak prompt text encoders was very controversial, because it was difficult both subjectively and intuitively to determine if they actually generated better pictures. Obama shaking Ugly Sonic's hand would be a historic event. What would happen if it were covered by The New York Times? I added to the previous prompt: So there's a few notable things going on here: That said, I only wanted the image of Obama and Ugly Sonic and not the entire New York Times A1. Can I just append to the previous prompt and have that be enough to generate the image only while maintaining the compositional bonuses? I can! The gloves are gone and his chest is white, although Ugly Sonic looks out-of-place in the unintentional sense. As an experiment, instead of only feeding two images of Ugly Sonic, I fed Nano Banana all the images of Ugly Sonic I had (seventeen in total), along with the previous prompt. This is an improvement over the previous generated image: no eyebrows, white hands, and a genuinely uncanny vibe.
Again, there aren't many obvious signs of AI generation here: Ugly Sonic clearly has five fingers! That's enough Ugly Sonic for now, but let's recall what we've observed so far. There are two noteworthy things in the prior two examples: the use of a Markdown dashed list to indicate rules when editing, and the fact that a single buzzword did indeed improve the composition of the output image. Many don't know how image generation models actually encode text. In the case of the original Stable Diffusion, it used CLIP, whose text encoder, open-sourced by OpenAI in 2021, unexpectedly paved the way for modern AI image generation. It is extremely primitive relative to modern standards for transformer-based text encoding, and only has a context limit of 77 tokens: a couple of sentences, which is sufficient for the image captions it was trained on but not for nuanced input. Some modern image generators use T5, an even older experimental text encoder released by Google that supports 512 tokens. Although modern image models can compensate for the age of these text encoders through robust data annotation while training the underlying image models, the text encoders cannot compensate for highly nuanced text inputs that fall outside the domain of general image captions. A marquee feature of Gemini 2.5 Flash is its support for agentic coding pipelines; to accomplish this, the model must be trained on extensive amounts of Markdown (which defines code repository documentation and agentic behavior files) and JSON (which is used for structured output/function calling/MCP routing). Additionally, Gemini 2.5 Flash was also explicitly trained to understand objects within images, giving it the ability to create nuanced segmentation masks. Nano Banana's multimodal encoder, as an extension of Gemini 2.5 Flash, should in theory be able to leverage these properties to handle prompts beyond the typical image-caption-esque prompts.
That's not to mention the vast annotated image training datasets Google owns as a byproduct of Google Images and likely trained Nano Banana upon, which should allow it to semantically differentiate between an image that matches a given buzzword and one that doesn't. Let's give Nano Banana a relatively large and complex prompt, drawing from the learnings above, and see how well it adheres to the nuanced rules specified by the prompt: This prompt has everything: specific composition and descriptions of different entities, the use of hex colors instead of a natural language color, a heterochromia constraint which requires the model to deduce the colors of each corresponding kitten's eye from earlier in the prompt, and a typo of "San Francisco" that is definitely intentional. Each and every rule specified is followed. For comparison, I gave the same command to ChatGPT—which in theory has similar text encoding advantages as Nano Banana—and the results are worse both compositionally and aesthetically, with more tells of AI generation. 1 The yellow hue certainly makes the quality differential more noticeable. Additionally, no negative space is utilized, and only the middle cat has heterochromia, but with the incorrect colors. Another noteworthy thing about the text encoder is how the model generated unique relevant text in the image without being given that text within the prompt itself: we should test this further. If the base text encoder is indeed trained for agentic purposes, it should at minimum be able to generate an image of code. Let's say we want to generate an image of a minimal recursive Fibonacci sequence in Python, which would look something like: I gave Nano Banana this prompt: It tried to generate the correct corresponding code but the syntax highlighting/indentation didn't quite work, so I'll give it a pass. Nano Banana is definitely generating code, and was able to maintain the other compositional requirements.
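For reference, the minimal recursive Fibonacci implementation the prompt describes would be along these lines:

```python
def fib(n: int) -> int:
    # Minimal recursive Fibonacci: fib(0) = 0, fib(1) = 1.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print([fib(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```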
For posterity, I gave the same prompt to ChatGPT: It made a similar attempt at the code, which indicates that code generation is indeed a fun quirk of multimodal autoregressive models. I don't think I need to comment on the quality difference between the two images. An alternate explanation for text-in-image generation in Nano Banana would be the presence of prompt augmentation or a prompt rewriter, both of which are used to orient a prompt to generate more aligned images. Tampering with the user prompt is common with image generation APIs and isn't an issue unless done poorly (which caused a PR debacle for Gemini last year), but it can be very annoying for testing. One way to verify if it's present is to use adversarial prompt injection to get the model to output the prompt itself, e.g. if the prompt is being rewritten, asking it to generate the text "before" the prompt should get it to output the original prompt. That's, uh, not the original prompt. Did I just leak Nano Banana's system prompt completely by accident? The image is hard to read, but if it is the system prompt—the use of section headers implies it's formatted in Markdown—then I can surgically extract parts of it to see just how the model ticks: These seem to track, but I want to learn more about those buzzwords in point #3: Huh, there's a guard specifically against buzzwords? That seems unnecessary: my guess is that this rule is a hack intended to avoid the perception of model collapse by avoiding the generation of 2022-era AI images, which would be annotated with those buzzwords. As an aside, you may have noticed the ALL CAPS text in this section, along with a command. There is a reason I have been sporadically capitalizing in previous prompts: caps does indeed work to ensure better adherence to the prompt (both for text and image generation), 2 and threats do tend to improve adherence.
Some have called it sociopathic, but this generation is proof that this brand of sociopathy is approved by Google’s top AI engineers. Tangent aside, since “previous” text didn’t reveal the prompt, we should check the “current” text: That worked, with one peculiar problem: the text “image” is flat-out missing, which raises further questions. Is “image” parsed as a special token? Maybe prompting “generate an image” to a generative image AI is a mistake. I tried the last logical prompt in the sequence: …which always raises an error: not surprising if there is no text after the original prompt. This section turned out unexpectedly long, but it’s enough to conclude that Nano Banana shows clear signs of benefitting from being trained on more than just image captions. Some aspects of Nano Banana’s system prompt imply the presence of a prompt rewriter, but if there is indeed a rewriter, I am skeptical it is triggering in this scenario, which implies that Nano Banana’s text generation is indeed linked to its strong base text encoder. But just how large and complex can we make these prompts and have Nano Banana adhere to them? Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens. The intent of this large context window is for multiturn conversations in Gemini, where you can chat back-and-forth with the LLM on image edits. Given Nano Banana’s prompt adherence on small complex prompts, how well does the model handle larger-but-still-complex prompts? Can Nano Banana render a webpage accurately? I used an LLM to generate a bespoke single-page HTML file representing a Counter app, available here . The web page uses only vanilla HTML, CSS, and JavaScript, meaning that Nano Banana would need to figure out how they all relate in order to render the web page correctly. For example, the web page uses CSS Flexbox to set the ratio of the sidebar to the body in a 1/3 and 2/3 ratio respectively.
Feeding this prompt to Nano Banana: That’s honestly better than expected, and the prompt cost 916 tokens. It got the overall layout and colors correct: the issues are more in the text typography, leaked classes/styles/JavaScript variables, and the sidebar:body ratio. No, there’s no practical use for having a generative AI render a webpage, but it’s a fun demo. A similar approach that does have a practical use is providing structured, extremely granular descriptions of objects for Nano Banana to render. What if we provided Nano Banana a JSON description of a person with extremely specific details, such as hair volume, fingernail length, and calf size? As with prompt buzzwords, JSON prompting AI models is a very controversial topic since images are not typically captioned with JSON, but there’s only one way to find out. I wrote a prompt augmentation pipeline of my own that takes in a user-input description of a quirky human character, e.g. , and outputs a very long and detailed JSON object representing that character with a strong emphasis on unique character design. 3 But generating a Mage is boring, so I asked my script to generate a male character that is an equal combination of a Paladin, a Pirate, and a Starbucks Barista: the resulting JSON is here . The prompt I gave to Nano Banana to generate a photorealistic character was: Beforehand I admit I didn’t know what a Paladin/Pirate/Starbucks Barista would look like, but he is definitely a Paladin/Pirate/Starbucks Barista. Let’s compare against the input JSON, taking elements from all areas of the JSON object (about 2600 tokens total) to see how well Nano Banana parsed it: Checking the JSON field-by-field, the generation also fits most of the smaller details noted. However, he is not photorealistic, which is what I was going for. 
One curious behavior I found is that any approach of generating an image of a high fantasy character in this manner has a very high probability of resulting in a digital illustration, even after changing the target publication and adding “do not generate a digital illustration” to the prompt. The solution requires a more clever approach to prompt engineering: add phrases and compositional constraints that imply a heavy physicality to the image, such that a digital illustration would have more difficulty satisfying all of the specified conditions than a photorealistic generation: The image style is definitely closer to Vanity Fair (the photographer is reflected in his breastplate!), and most of the attributes in the previous illustration also apply—the hands/cutlass issue is also fixed. Several elements such as the shoulderplates are different, but not in a manner that contradicts the JSON field descriptions: perhaps that’s a sign that these JSON fields can be prompt engineered to be even more nuanced. Yes, prompting image generation models with HTML and JSON is silly, but “it’s not silly if it works” describes most of modern AI engineering. Nano Banana allows for very strong generation control, but there are several issues. Let’s go back to the original example that made ChatGPT’s image generation go viral: . I ran that exact prompt through Nano Banana on a mirror selfie of myself: …I’m not giving Nano Banana a pass this time. Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model. I suspect that the autoregressive properties that allow Nano Banana’s excellent text editing make it too resistant to changing styles. That said, creating a new image does in fact work as expected, and creating a new image using the character provided in the input image with the specified style (as opposed to a style transfer ) has occasional success. 
Speaking of that, Nano Banana has essentially no restrictions on intellectual property, as the examples throughout this blog post have made evident. Not only will it not refuse to generate images from popular IP like ChatGPT now does, you can have many different IPs in a single image. Normally, Optimus Prime is the designated driver. I am not a lawyer, so I cannot litigate the legalities of training/generating IP in this manner or whether intentionally specifying an IP in a prompt but also stating “do not include any watermarks” is a legal issue: my only goal is to demonstrate what is currently possible with Nano Banana. I suspect that if precedent is set from existing IP lawsuits against OpenAI and Midjourney , Google will be in line to be sued. Another note is moderation of generated images, particularly around NSFW content, which is always important to check if your application uses untrusted user input. As with most image generation APIs, moderation is done against both the text prompt and the raw generated image. That said, while running my standard test suite for new image generation models, I found that Nano Banana is surprisingly one of the more lenient AI APIs. With some deliberate prompts, I can confirm that it is possible to generate NSFW images through Nano Banana—obviously I cannot provide examples. I’ve spent a very large amount of time overall with Nano Banana, and although it has a lot of promise, some may ask why I am writing about how to use it to create highly-specific high-quality images during a time when generative AI has threatened creative jobs. The reason is that the information asymmetry between what generative image AI can and can’t do has only grown in recent months: many still think that ChatGPT is the only way to generate images and that all AI-generated images are wavy AI slop with a piss yellow filter. The only way to counter this perception is through evidence and reproducibility.
That is why not only am I releasing Jupyter Notebooks detailing the image generation pipeline for each image in this blog post, but why I also included the prompts in this blog post proper; I apologize that it padded the length of the post to 26 minutes, but it’s important to show that these image generations are as advertised and not the result of AI boosterism. You can copy these prompts and paste them into AI Studio and get similar results, or even hack and iterate on them to find new things. Most of the prompting techniques in this blog post are already well-known by AI engineers far more skilled than myself, and turning a blind eye won’t stop people from using generative image AI in this manner. I didn’t go into this blog post expecting it to be a journey, but sometimes the unexpected journeys are the best journeys. There are many cool tricks with Nano Banana I cut from this blog post due to length, such as providing an image to specify character positions and also investigations of styles such as pixel art that most image generation models struggle with, but Nano Banana now nails. These prompt engineering shenanigans are only the tip of the iceberg. Jupyter Notebooks for the generations used in this post are split between the gemimg repository and a second testing repository . I would have preferred to compare the generations directly from the endpoint for an apples-to-apples comparison, but OpenAI requires organization verification to access it, and I am not giving OpenAI my legal ID.  ↩︎ Note that ALL CAPS will not work with CLIP-based image generation models at a technical level, as CLIP’s text encoder is uncased.  ↩︎ Although normally I open-source every script I write for my blog posts, I cannot open-source the character generation script due to extensive testing showing it may lean too heavily into stereotypes. 
Although adding guardrails successfully reduces the presence of said stereotypes and makes the output more interesting, there may be unexpected negative externalities if open-sourced.  ↩︎ A lanky build, as opposed to the real Sonic’s chubby build. A white chest, as opposed to the real Sonic’s beige chest. Blue arms with white hands, as opposed to the real Sonic’s beige arms with white gloves. Small pasted-on-his-head eyes with no eyebrows, as opposed to the real Sonic’s large recessed eyes and eyebrows. That is the most cleanly-rendered New York Times logo I’ve ever seen. It’s safe to say that Nano Banana trained on the New York Times in some form. Nano Banana is still bad at rendering text perfectly/without typos, as most image generation models are. However, the expanded text is peculiar: it does follow from the prompt, although “Blue Blur” is a nickname for the normal Sonic the Hedgehog. How does an image-generating model generate logical text unprompted anyway? Ugly Sonic is even more like normal Sonic in this iteration: I suspect the “Blue Blur” may have anchored the autoregressive generation to be more Sonic-like. The image itself does appear to be more professional, and notably has the distinct composition of a photo from a professional news photographer: adherence to the “rule of thirds”, good use of negative space, and better color balance.

Anton Zhiyanov 3 weeks ago

Go proposal: Context-aware Dialer methods

Part of the Accepted! series, explaining the upcoming Go changes in simple terms. Add context-aware, network-specific methods to the Dialer type. Ver. 1.26 • Stdlib • Low impact The Dialer type connects to the address using a given network (protocol) — TCP, UDP, IP, or Unix sockets. The new context-aware methods (DialTCP, DialUDP, DialIP, and DialUnix) combine the efficiency of the existing network-specific functions (which skip address resolution and dispatch) with the cancellation capabilities of DialContext. The net package already has top-level functions for different networks (DialTCP, DialUDP, DialIP, and DialUnix), but these were made before context.Context was introduced, so they don't support cancellation: On the other hand, the Dialer type has a general-purpose DialContext method. It supports cancellation and can be used to connect to any of the known networks: However, if you already know the network type and address, using DialContext is a bit less efficient than the network-specific functions due to: Address resolution overhead: DialContext handles address resolution internally (like DNS lookups and converting string addresses to concrete address types) using the network and address strings you provide. Network-specific functions accept a pre-resolved address object, so they skip this step. Network type dispatch: DialContext must route the call to the protocol-specific dialer. Network-specific functions already know which protocol to use, so they skip this step. So, network-specific functions in the net package are more efficient, but they don't support cancellation. The DialContext method supports cancellation, but it's less efficient. This proposal aims to solve the mismatch by adding context-aware, network-specific methods to the Dialer type. Also, adding new methods to the Dialer lets you use the newer address types from the netip package (like netip.AddrPort instead of TCPAddr), which are preferred in modern Go code. Add four new methods to the Dialer type: The method signatures are similar to the existing top-level functions, but they also accept a context and use the newer address types from the netip package.
Use the DialTCP method to connect to a TCP server: Use the DialUnix method to connect to a Unix socket: In both cases, the dialing fails because I didn't bother to start the server in the playground :) 𝗣 49097 • 𝗖𝗟 657296

@bwplotka 1 month ago

The (lazy) Git UI You Didn't Know You Need

When my son was born last April, I had ambitious learning plans for the upcoming 5-week paternity leave. As you can imagine, with two kids, life quickly put this plan to the test 🙃. I did eventually start some projects. One of the goals (sounding rebellious in the current AI hype cycle) was to learn and use neovim for coding. As a GoLand aficionado, I (and my wrist) have always been tempted by no-mouse, OSS, gopls-based, highly configurable dev setups.

Chris Coyier 1 month ago

The Great (Refrigerator) Divide

I like a good hot sauce. It’s not, like, my personality , but I enjoy them. There are enough different hot sauces that having a bit of a collection of them is reasonable. Cholula is a mainstay, working equally well on Mexican and egg-based dishes. Although I’ll admit Tabasco is my general go-to. The green Tabasco works particularly well on Chipotle for whatever reason. Tapatio is right in there, working maybe slightly better on the rice-y-er Mexican stuff. Red Hot on my chili or wings, absolutely. Those are all big names. Hot sauce has quite a long tail. There are plenty of Tier-2 (in popularity) sauces. Think Tiger Sauce, which is quite sweet and tends to work well on dishes that evoke that anyway (I’m thinking sautéed peppers and onions, for instance). Yellow Bird is having their hot sauce moment lately — I quite like the literally yellow habanero style — which has a tang to it that works well with chicken, I think. Roasted veggies like carrots and broccoli? There I like the Portland all-timer Secret Aardvark . Much Asian food is born to pair with Sriracha, of course. I’m a big fan of Heatly lately. I’d call Tier-3 that whole genre of hot sauces people buy you when they go on vacation and stop into a store that only sells hot sauces (right next to the oil & vinegar shop!). These are the Johnny’s Burning Butthole sauces and Sally’s Simmering Sweetspot. They have cheezy cartoon graphics on them and there are hundreds and hundreds of them, and some of them are perfectly good, but you never quite know what you are going to get and it’s easy to forget even after you’ve tried it. Tier-4 is the bottle you got from the local restaurant in town with an ambitious chef trying to diversify income streams. I’ve taken too long to get to my point though. SOME of these hot sauces say “Refrigerate after opening.” on the bottle, a rule you probably shouldn’t break (unless you’re a Johnny’s Burning Butthole kinda guy). SOME of these hot sauces… don’t. 
And my theory is: the bigger and more successful the hot sauce brand, the less likely it requires refrigeration. I ain’t trying to knock fridge brands. Yellow Bird, Heatly, Secret Aardvark are all favorites and require it (along with all Srirachas, which makes more sense as it’s so ketchup-like). I will admit though that I don’t love it. I don’t really want a whole area in my fridge that’s loaded with hot sauces. That veers too closely into personality territory. Much easier to have some basic cabinet space for them. So anyway. If you wanna go huge with your hot sauce brand, you can’t require refrigeration. The next big-Tabasco needs to sit right out on those diner tables with the salt and pepper.

Taranis 1 month ago

Today is my last day at Google

It's been a while. I joined Google in 2016 at the mothership in Mountain View, California. I'd just left 10 years at NASA, and was taking a leap into the unknown. I'd never worked for a large American company before, and Google seemed more than a little Bay Area culty at the time. I started as a Level 6 staff engineer, a title I had absolutely no idea the definition of, and nobody really ever explained it. I stepped into a tech lead position in the Gin team – Google has lots of internal names for systems and products that are meaningless outside the company, so I should say that this has absolutely nothing to do with the Golang Gin web framework! It was actually an insider risk/counterintelligence auditing system intended to do things like ferret out foreign spies, employees doing things they shouldn't, etc. It worked pretty well. A few months in, my manager suggested that maybe Gin could do well to become an external product. This led me to doing quite a bit of travel, visiting other Google sites in Chicago, London and Zurich, and many meetings with legal and VPs persuading them that this could work. At the time, Cloud and (what became known later as) Workspace were blocked on some very large deals because customers were concerned that Google employees might be able to access their data. Of course, if they did, Gin would most likely spot that, but those customers weren't happy to take that on trust. So I was pushed to move to Chicago, build out a team, and to externalize Gin. This wasn't straightforward – the data rate was scarily huge, and though our internal analysts could handle the data, it would be incomprehensible to outsiders. The solution became known as Access Transparency , which my team delivered in an astonishingly short time scale. I heard that it unlocked a lot of business once it rolled out. 
(Full disclosure: the Access Approval system came later, and wasn't my project) At that time, Trump's first term was happening, and as a queer immigrant I was getting more and more scared. I was checking the news more than once a day to see whether I needed to put an escape plan into effect. I had several, made a little simpler because Chicago is very close to the Canadian border, but I ended up taking a less dramatic approach than bailing unprepared. My mother passed away just before I moved to Chicago, and my dad's health was failing. They lived on the Spanish Balearic island Mallorca at the time, which was an easy flight from Zurich. I asked my upper management, and was able to relocate to Google Zurich. I liked Zurich a lot. I still do, to be honest, even though I left for Ireland a couple of years ago. Whilst there, I initially stayed in Security & Privacy, working in the Data Governance team for about 18 months. A big reorg killed my project, so I moved over to Cloud Site Reliability Engineering, and became a manager. A team had grown a bit too big, and was being split into two, so I ended up picking up half of the new team, which became known as the Zurich side of the Open Source Analytics SRE team. We ran a few systems, most notably Google's managed Hadoop-on-Cloud product, Dataproc . A year later, another reorg rolled our team's responsibilities back into the development team, and after a bit of finagling and wheeler dealing the majority of the old team went with me to YouTube, becoming the Zurich Trust & Safety SRE team. Then COVID happened. Things got weird, for everyone. I was doing whatever I could to support my team, even organizing some basic cooking lessons via Google Meet when it became apparent that some of them had never cooked for themselves and this was hard for them as they were having to depend on delivered groceries. 
While I was in Zurich I was diagnosed with a couple of autoimmune things, and was strongly advised to avoid catching viruses (not just COVID). Of course, I did manage to catch the damned thing just as everything shut down, before testing was widely available. Medical services were swamped, so people were being told not to show up unless they really were dying, so I just basically holed up for about 3 weeks. It sucked, but I was OK, thankfully no long-COVID symptoms. My spouse lost their genetic father to the virus soon afterwards. But it did seem to me that Google would probably return to office in a way that might be safe(ish) for most people, but not me. I asked to stay working from home, and was told I could, but I had to give up my management job. I was quite sad about that, but it was what it was and there wasn't much I could do about it. I brought on one of my team members as my replacement, and went back to a tech lead position. This lasted a few months. I took on an uber-tech-lead (UTL) position in YouTube SRE, but I really didn't get any traction. UTL is a difficult job – you are essentially tech leading tech leads, which critically depends on support from upper management. I really liked YouTube and the people, but I'd describe it as a bit of a supertanker – once it's going it's going, and changing direction even slightly is very difficult. An opportunity came up to move over to another team that was doing what these days would probably be called observability research – basically data science on logs, in order to detect and diagnose problems in large scale systems. This was a lot of fun – I got to throw everything from traditional statistical analysis to digital signal processing to various kinds of machine learning (genetic algorithms as well as neural networks) at huge amounts of log data. About half way through that project I moved to Ireland. 
My dad had passed away the previous year, I was starting to think forwards to post-Google life, and I realised that staying in Switzerland probably wasn't going to work. Though I loved the country, financially it didn't add up. Cost of living, particularly for property, was out of reach, even on a fat salary. Moving to Ireland resulted in a large pay cut because of the way Google operates payroll relative to local norms, but it worked out to be a wash. With even an apartment where I was living being in the 1m+ CHF price bracket, I just couldn't see how I'd be able to retire once I hit retirement age. In Ireland, I was able to buy a house outright, free and clear, no mortgage, for what would have barely been a downpayment in Switzerland. I bought an old post office in a small village in the west of Ireland, still intact from the day it closed in 2009, with a 5 bedroom apartment above and several outbuildings and a fairly large garden. The research project came to an end a bit over a year ago (it had always been time-limited), so I moved over to the Turnups org (internally known as the Turnips, for probably obvious reasons). This is the bit of Cloud that 'turns up' clusters to cope with expansion, much of which has been around AI compute capability recently. About a year ago, I hit a point where I realised that there was some trans-related stuff that I'd been putting off for decades, and I really needed to deal with it. With the wind blowing in a scary direction for trans people everywhere, passing has become a significant safety issue. I'd already had facial surgery in 2001 with its pioneer Dr Doug Ousterhout in San Francisco, but there were some less than ideal outcomes from the original surgery, and time and gravity had taken its toll. So early this year I took extended medical leave and went through several surgeries to deal with that, with Facialteam in Spain. I also started working with a voice therapist. It's been a long road, but I'm happy with the results. 
Being on extended medical leave gave me a lot of time to think, and I came to the realisation that staying long term at Google probably wasn't the right thing to do. The company doesn't feel like the company I joined – its values are different, and it seems increasingly likely to acquiesce to pressure from the Trump administration. I can easily see an executive order being issued forcing all companies with US government contracts (which of course includes Google, particularly Google Cloud) to fire trans workers worldwide. Even in Ireland, in the EU, where something like that would be flat out illegal, it felt like a career death sentence, so I started to look more seriously at what should happen next. Anyone who knows me knows that I'm anything but passive, so I wasn't keen on just waiting for the axe to fall, however good my salary might have been. So I decided to leave the company on my own terms. Yes, I have plans. But I'm not announcing them until I'm no longer an employee. Watch this space!

Ahead of AI 1 month ago

Beyond Standard LLMs

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance. After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions with respect to what I think about alternative approaches. (I also recently gave a short talk about that at the PyTorch Conference 2025, where I promised attendees to follow up with a write-up of these alternative approaches). So here it is! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of the topics shown in the figure above would deserve at least a whole article of its own (and will hopefully get one in the future). So, to keep this article at a reasonable length, many sections are kept reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. 
If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1, Mistral Small 3.1, and many more. (The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.) Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I have talked and written about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. 
Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . The original attention mechanism, softmax(QKᵀ/√d)V, scales quadratically with the sequence length: it materializes the n×n matrix QKᵀ, so time and memory grow as O(n²). This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism as (ϕ(Q)ϕ(K)ᵀ)V, evaluated in the reassociated form ϕ(Q)(ϕ(K)ᵀV) with a matching row-wise normalization. Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QKᵀ. I don’t want to dwell too long on these older attempts. 
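To make the kernel trick concrete, here is a small NumPy sketch (my own illustration, not code from the article) of non-causal linear attention with ϕ(x) = elu(x) + 1. The key point is the reassociation ϕ(Q)(ϕ(K)ᵀV): the d×d summary ϕ(K)ᵀV is computed once, so the n×n matrix never materializes. (A causal version would instead keep running sums per position, which is the RNN view in the paper's title.)

```python
import numpy as np

def elu(x):
    # ELU activation; with phi(x) = elu(x) + 1, all features stay positive
    return np.where(x > 0, x, np.exp(x) - 1)

def linear_attention(Q, K, V):
    """Non-causal linear attention in the style of Katharopoulos et al. (2020).

    Reassociating phi(Q) @ (phi(K).T @ V) avoids the n x n matrix:
    phi(K).T @ V is a d x d summary whose size is independent of n.
    """
    phi_q = elu(Q) + 1            # (n, d)
    phi_k = elu(K) + 1            # (n, d)
    kv = phi_k.T @ V              # (d, d) summary
    z = phi_k.sum(axis=0)         # (d,) normalizer terms
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

Because the row normalization matches, the result is identical to explicitly building the n×n kernel matrix and normalizing its rows; only the cost differs.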
But the bottom line was that they reduced both time and memory complexity from O(n²) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction because they degraded model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. In the second half of this year, there has been a revival of linear attention variants, as well as a bit of a back-and-forth from some model developers, as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B-parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek team announced DeepSeek V3.2. (DeepSeek V3.2's sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B-parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are important not only for regular chat sessions but also for agentic applications. This could have been a turning point, suggesting that linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention.
For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I want to discuss in the next few sections as one example of a hybrid attention architecture. Let’s start with Qwen3-Next, which replaced the regular attention mechanism with a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k-token context length in terms of memory usage (the previous 235B-A22B model supported 32k natively, and 131k with YaRN scaling). Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio, as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is implemented as either gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between the two. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book) below for illustration purposes. As we can see, after computing attention as usual, the model uses a separate gating signal derived from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically.
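As a rough sketch of what such a sigmoid output gate looks like, here is a minimal single-head NumPy version (my own illustration with hypothetical weight names, not the actual book code): standard causal attention whose output is multiplied element-wise by a sigmoid gate computed from the same input:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention(x, W_q, W_k, W_v, W_gate):
    """Single-head causal attention followed by a sigmoid output gate."""
    n, d = x.shape
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d)
    causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[causal_mask] = -np.inf        # block attention to future tokens
    out = softmax(scores) @ V            # regular scaled-dot-product attention
    gate = sigmoid(x @ W_gate)           # gating signal from the same input
    return gate * out                    # scale each feature by a value in (0, 1)

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.standard_normal((n, d))
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
y = gated_attention(x, *Ws)
print(y.shape)  # (5, 8)
```

The gate adds one extra matrix multiplication and an element-wise product, so it barely changes the cost of the layer.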
The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper, as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers familiar with the classic machine learning literature can think of this as similar to Hebbian learning, inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of a logistic sigmoid activation, as illustrated below. (The SiLU choice is likely meant to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet.
However, as shown in the figure above, next to the output gate, the “gated” in Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, and β (update gate) controls how strongly new inputs modify the state. In code, a simplified version of the Gated DeltaNet depicted above (without the convolutional mixing) can be implemented as follows (the code is inspired by the official implementation by the Qwen3 team): (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use, to keep the code more readable and focused on the recurrent aspects.) So, as we can see above, there are lots of differences from standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends to, or looks at, every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot-product attention that scales quadratically with the context length. As a refresher, scaled-dot-product attention is computed as softmax(QKᵀ/√d)V, where Q and K are n-by-d matrices, n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an n-by-n attention matrix, which is multiplied by the n-by-d value matrix V. Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n. In Gated DeltaNet, there’s no n-by-n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is implemented as a recurrent state update, where S is the state that gets updated for each time step t. And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay), and β (beta) regulates how much the current token at time step t updates the memory.
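For illustration, here is a heavily simplified single-head NumPy version of such a gated delta-rule recurrence (my own sketch, assuming per-token scalar gates α and β; the actual Qwen3-Next implementation differs in details such as the convolutional mixing, normalization, and chunked parallel computation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_deltanet(q, k, v, alpha, beta):
    """Simplified gated delta-rule recurrence (single head, no conv mixing).

    q, k, v:      (n, d) per-token queries, keys, values
    alpha, beta:  (n,) decay and update gates in (0, 1)
    """
    n, d = q.shape
    S = np.zeros((d, d))                 # fixed-size memory state
    out = np.zeros((n, d))
    for t in range(n):                   # linear in sequence length
        k_t, v_t = k[t], v[t]
        pred = S.T @ k_t                 # what the memory predicts for this key
        delta = v_t - pred               # delta rule: correct the prediction error
        S = alpha[t] * S + beta[t] * np.outer(k_t, delta)  # decay, then update
        out[t] = S.T @ q[t]              # read from memory with the query
    return out

rng = np.random.default_rng(1)
n, d = 10, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
alpha = sigmoid(rng.standard_normal(n))
beta = sigmoid(rng.standard_normal(n))
out = gated_deltanet(q, k, v, alpha, beta)
print(out.shape)  # (10, 4)
```

Note that the state S has a fixed (d, d) shape no matter how long the sequence grows, which is exactly what distinguishes this from the growing KV cache of standard attention.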
(And the final output gate, not shown in the snippet above, is similar to gated attention; it controls how much of the output is kept.) So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with the context length. The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention. Gated DeltaNet can, to some extent, still capture context, but it has to go through the memory (S) bottleneck. That memory has a fixed size and is thus more efficient, but it compresses past context into a single hidden state, similar to RNNs. That’s why the Qwen3-Next and Kimi Linear architectures don’t replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier. In the previous section, we discussed the advantage of the DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don’t grow the KV cache. (For more information about KV caching, see my Understanding and Coding the KV Cache in LLMs from Scratch article.) Instead, as mentioned earlier, they keep a fixed-size recurrent state, so memory stays constant with context length. For a regular multi-head attention (MHA) layer, we can compute the KV cache size as 2 × n_layers × n_tokens × n_heads × head_dim × bytes_per_element. (The 2 multiplier is there because we have both keys and values that we store in the cache.) For the simplified DeltaNet version implemented above, we have n_layers × n_heads × head_dim × head_dim × bytes_per_element. Note that the memory size doesn’t have a context length (n_tokens) dependency. Also, we store only the memory state S instead of separate keys and values, hence the 2 multiplier disappears. However, note that we now have a quadratic head_dim term in here.
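The two memory formulas can be written down directly; the helper names below are my own (the 2× factor accounts for storing both keys and values, and the DeltaNet state is head_dim × head_dim per head and layer):

```python
def mha_kv_cache_bytes(n_tokens, n_layers, n_heads, head_dim, bytes_per_elem=2):
    # 2x because both keys and values are cached; grows linearly with n_tokens
    return 2 * n_layers * n_tokens * n_heads * head_dim * bytes_per_elem

def deltanet_state_bytes(n_layers, n_heads, head_dim, bytes_per_elem=2):
    # one (head_dim x head_dim) state per head and layer; no n_tokens term,
    # but quadratic in head_dim
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem

# settings from the Figure 10 caption: emb_dim=2048 -> 16 heads x head_dim 128, bf16
kv = mha_kv_cache_bytes(n_tokens=131_072, n_layers=48, n_heads=16, head_dim=128)
state = deltanet_state_bytes(n_layers=48, n_heads=16, head_dim=128)
print(kv / 2**30, state / 2**20)  # KV cache in GiB vs. fixed state in MiB
```

Under these simplified formulas, a 131k-token context costs the MHA cache 48 GiB, while the DeltaNet state stays at 24 MiB regardless of context length.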
That quadratic term comes from the (head_dim × head_dim) state S. But that’s usually nothing to worry about, as the head dimension is usually relatively small. For instance, it’s 128 in Qwen3-Next. The full version with the convolutional mixing is a bit more complex, including the kernel size and so on, but the formulas above should illustrate the main trend and motivation behind the Gated DeltaNet. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning that for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention, as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant that takes inspiration from recurrent neural networks, including a gating mechanism, from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section). The MLA in Kimi Linear, depicted in the upper-right box in Figure 11 above, does not use the sigmoid gate. This omission was intentional so that the authors could compare the architecture more directly to standard MLA; however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in the multi-head latent attention (MLA) layers (global attention).
As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long-context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. Kimi Linear replaces the linear attention mechanism of Qwen3-Next with the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with channel-wise gating, i.e., one gate value per feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as Gated DeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having higher benchmark performance.
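The difference between the two gating granularities is easy to see in isolation. Here is a toy sketch (my own, not the KDA implementation, which applies the gate inside the delta-rule update): a scalar gate decays every channel of the memory state at the same rate, while a channel-wise gate lets each feature dimension forget at its own rate:

```python
import numpy as np

d = 4
S = np.ones((d, d))  # memory state of one attention head

# Scalar decay gate (Gated DeltaNet / Qwen3-Next style): one value per head,
# so every feature channel of the memory decays at the same rate
alpha_scalar = 0.9
S_scalar = alpha_scalar * S

# Channel-wise decay gate (Kimi Delta Attention style): one value per feature
# dimension, so each key channel can forget at its own rate
alpha_channel = np.array([0.99, 0.9, 0.5, 0.1])
S_channel = alpha_channel[:, None] * S

print(S_scalar[0, 0], S_channel[0, 0], S_channel[3, 0])
```

The channel-wise variant has d gate values per head instead of one, a negligible parameter increase for the extra control it buys.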
Furthermore, according to the ablation studies in the DeepSeek-V2 paper, MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variants once again promising for larger state-of-the-art models. That being said, while Kimi Linear is fairly large at 48B parameters, it is still 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example, Kimi Linear, compared to regular full attention, achieves a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state of the art. A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) and were later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left).
With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I checked my paper bookmark list just now, there were 39 text diffusion papers on it! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs. Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer.
If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. 
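To make the iterative unmasking loop concrete, here is a toy decoding sketch (entirely my own construction; `toy_predict` stands in for a real bidirectional model, and real samplers use more sophisticated masking schedules and confidence measures):

```python
import numpy as np

def denoise(tokens, predict, n_steps, mask_id=0):
    """Toy masked-diffusion decoding: iteratively unmask the most confident positions.

    predict(tokens) must return (proposed_tokens, confidences) for all positions.
    """
    tokens = tokens.copy()
    for step in range(n_steps):
        masked = np.where(tokens == mask_id)[0]
        if masked.size == 0:
            break
        proposals, conf = predict(tokens)
        # unmask a fraction of the remaining masked positions each step
        n_unmask = max(1, masked.size // (n_steps - step))
        chosen = masked[np.argsort(conf[masked])[::-1][:n_unmask]]
        tokens[chosen] = proposals[chosen]
    return tokens

# stand-in "model": proposes token id (position + 1) with random confidence
rng = np.random.default_rng(0)
def toy_predict(tokens):
    return np.arange(1, len(tokens) + 1), rng.random(len(tokens))

out = denoise(np.zeros(8, dtype=int), toy_predict, n_steps=4)
print(out)  # all 8 positions unmasked after 4 denoising steps
```

Each step calls the model once over the whole sequence and commits several positions at a time, which is where the parallelism over autoregressive decoding comes from.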
This parallelism has the potential to make diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths. The problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y_t = ”New” | X). In the next iteration, it would then condition on the previously generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y_{t+1} = ”York” | X, y_t = ”New”) and p(y_{t+1} = ”Orleans” | X, y_t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y_t = “New” | X) and p(y_{t+1} = “City” | X), leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through attention in every step.
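The “New City” failure mode can be reproduced with a few made-up probabilities (my own toy numbers, not from the paper): sequential decoding conditions the second token on the first, while naive parallel decoding picks each token from its marginal distribution independently:

```python
# Toy joint distribution over two-token city answers (made-up probabilities)
p_first = {"New": 0.5, "Mexico": 0.3, "Panama": 0.2}
p_second_given = {
    "New":    {"York": 0.5, "Orleans": 0.4, "City": 0.1},
    "Mexico": {"City": 1.0},
    "Panama": {"City": 1.0},
}

# Marginal of the second token, ignoring what the first token turned out to be
p_second = {}
for w1, p1 in p_first.items():
    for w2, p2 in p_second_given[w1].items():
        p_second[w2] = p_second.get(w2, 0.0) + p1 * p2

# Sequential decoding: pick the first token, then condition on it
w1 = max(p_first, key=p_first.get)                              # "New"
w2_seq = max(p_second_given[w1], key=p_second_given[w1].get)    # "York"

# Naive parallel decoding: pick each position's marginally most likely token
w2_par = max(p_second, key=p_second.get)                        # "City"

print(w1, w2_seq)   # the coherent answer
print(w1, w2_par)   # the incoherent "New City" failure
```

Here "City" wins the marginal (0.55) because it is likely after "Mexico" and "Panama", even though it is a poor continuation of "New".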
So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. 
For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state: “Rapid response: Generates content significantly faster than even our fastest model so far.” And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. So far, we have discussed approaches that focus on improving efficiency and making models faster or more scalable. And these approaches usually come at the cost of slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper from September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models would deserve a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience.
For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world model, which serves as a mental simulation of the environment and allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name, World Models, which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence. It was essentially about mapping out an alternative path to AI instead of LLMs. That being said, world model papers were long focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, I found the Code World Model paper (Sep 30, 2025) to be the first world model paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only transformer with sliding-window attention.
Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). 
Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM), a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section covers small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper. And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM).
Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. TRM refines its answer through two alternating updates: first, it computes a latent reasoning state from the current question and answer; then, it updates the answer based on that latent state. The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights. It’s important to note that TRM is not a language model operating on text. However, because (a) it’s a transformer-based architecture, (b) reasoning is now a central focus in LLM research and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor), I decided to include it here. While TRM could be extended to textual question-answer tasks in the future, TRM currently works on grid-based inputs and outputs. In other words, both the “question” and the “answer” are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences. HRM consists of two small transformer modules (each with 4 blocks) that communicate across recursion levels, whereas TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that’s likely to make it easier to compare against HRM.) TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few. HRM includes an explicit halting mechanism to determine when to stop iterating; TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop iterating. Performance-wise, TRM performs really well compared to HRM, as shown in the figure below.
Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM).

The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me:

- Fewer layers lead to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%.
- Attention is not required. Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length.

While HRM and TRM achieve really good reasoning performance on these benchmarks, comparing them to large LLMs is not quite fair. HRM and TRM are specialized models for tasks like ARC, Sudoku, and Maze pathfinding, whereas LLMs are generalists. Sure, HRM and TRM can be adapted to other tasks as well, but they have to be specially trained on each task. So, in that sense, we can perhaps think of HRM and TRM as efficient pocket calculators, whereas LLMs are more like computers, which can do a lot of other things as well.

Still, these recursive architectures are exciting proofs of concept that highlight how small, efficient models can "reason" through iterative self-refinement. Perhaps, in the future, such models could act as reasoning or planning modules embedded within larger tool-using LLM systems. For now, LLMs remain ideal for broad tasks, but domain-specific recursive models like TRM can be developed to solve certain problems more efficiently once the target domain is well understood. Beyond the Sudoku, Maze-finding, and ARC proof-of-concept benchmarks, there are possibly lots of use cases in the physics and biology domains where such models could find use.

As an interesting tidbit, the author shared that it took less than $500 to train this model, using 4 H100s for around 2 days. I am delighted to see that it's still possible to do interesting work without a data center.
I originally planned to cover all model categories in the overview figure, but since the article ended up longer than I expected, I will have to save xLSTMs, Liquid Foundation Models, Transformer-RNN hybrids, and State Space Models for another time (although Gated DeltaNet already gave a taste of State Space Models and recurrent designs).

As a conclusion to this article, I want to repeat the earlier point: standard autoregressive transformer LLMs are proven and have stood the test of time so far. They are also, if efficiency is not the main factor, the best we have for now.

Traditional Decoder-Style, Autoregressive Transformers
+ Proven & mature tooling
+ "Well-understood"
+ Scaling laws
+ SOTA
- Expensive training
- Expensive inference (except for aforementioned tricks)

If I were to start a new LLM-based project today, autoregressive transformer-based LLMs would be my first choice. I definitely find the upcoming attention hybrids very promising; they are especially interesting when working with longer contexts where efficiency is a main concern.

Linear Attention Hybrids
+ Same as decoder-style transformers
+ Cuts FLOPs/KV memory in long-context tasks
- Added complexity
- Trades a bit of accuracy for efficiency

On the more extreme end, text diffusion models are an interesting development. I'm still somewhat skeptical about how well they perform in everyday use, as I've only tried a few quick demos. Hopefully, we'll soon see a large-scale production deployment of Google's Gemini Diffusion that we can test on daily and coding tasks, and then find out how people actually feel about these models.

Text Diffusion Models
+ Iterative denoising is a fresh idea for text
+ Better parallelism (no next-token dependence)
- Can't stream answers
- Doesn't benefit from CoT?
- Tricky tool-calling?
- Solid models but not SOTA

While the main selling point of text diffusion models is improved efficiency, code world models sit on the other end of the spectrum, where they aim to improve modeling performance. As of this writing, coding models based on standard LLMs are mostly improved through reasoning techniques, yet if you have tried them on trickier challenges, you have probably noticed that they still fall short on many of the harder coding problems. I find code world models particularly interesting and believe they could be an important next step toward developing more capable coding systems.

Code World Model
+ Promising approach to improve code understanding
+ Verifiable intermediate states
- Inclusion of executable code traces complicates training
- Code running adds latency

Lastly, we covered small recursive transformers such as hierarchical and tiny reasoning models. These are super interesting proof-of-concept models. However, as of today, they are primarily puzzle solvers, not general text or coding models. So, they are not in the same category as the other non-standard LLM alternatives covered in this article. Nonetheless, they are very interesting proofs of concept, and I am glad researchers are working on them.

Right now, LLMs like GPT-5, DeepSeek R1, Kimi K2, and so forth are developed as general-purpose models for free-form text, code, math problems, and much more. They feel like a brute-force, jack-of-all-trades approach that we use on a variety of tasks, from general knowledge questions to math and code. However, when we perform the same task repeatedly, such brute-force approaches become inefficient and may not even be ideal in terms of specialization. This is where tiny recursive transformers become interesting: they could serve as lightweight, task-specific models that are both efficient and purpose-built for repeated or structured reasoning tasks.
Also, I can see them as potential "tools" for other tool-calling LLMs; for instance, when LLMs use Python or calculator APIs to solve math problems, special tiny reasoning models could fill this niche for other types of puzzle- or reasoning-like problems.

Small Recursive Transformers
+ Very small architecture
+ Good generalization on puzzles
- Special-purpose models
- Limited to puzzles (so far)

This has been a long article, but I hope you discovered some of the fascinating approaches that often stay outside the spotlight of mainstream LLMs. And if you've been feeling a bit bored by the more or less conventional LLM releases, I hope this helped rekindle your excitement about AI, because there's a lot of interesting work happening right now!

This magazine is a personal passion project, and your support helps keep it alive. If you'd like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I'm confident you'll get a lot out of these; they explain how LLMs work in a depth you won't find elsewhere.) Thanks for reading, and for helping support independent research!

Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot! Your support means a great deal! Thank you!

Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my "The Big Architecture Comparison" article. Other non-framed architectures may be covered in future articles.

Note that ideally, each of the topics shown in the figure above would deserve at least a whole article of its own (and hopefully will get one in the future). So, to keep this article at a reasonable length, many sections are kept relatively short.
However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years.

PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here.)

1. Transformer-Based LLMs

Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 and Mistral Small 3.1.

Figure 2: An overview of the most notable decoder-style transformers released in the past year.

Since I have talked and written about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you'd like deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article.

(Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see those other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.)

Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article.

If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well.
Moreover, as discussed in The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn't work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years.

2. (Linear) Attention Hybrids

Before we discuss the "more different" approaches, let's first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There has recently been a revival in linear attention mechanisms to improve the efficiency of LLMs.

The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today's LLMs. Besides traditional multi-head attention, it's also used in the more efficient flavors like grouped-query attention, sliding-window attention, and multi-head latent attention, as discussed in my talk.

2.1 Traditional Attention and Quadratic Costs

The original attention mechanism scales quadratically with the sequence length: computing softmax(QKᵀ)V requires materializing the n-by-n score matrix QKᵀ. This is because the query (Q), key (K), and value (V) are n-by-d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article.)

Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n.

2.2 Linear Attention

Linear attention variants have been around for a long time, and I remember seeing tons of papers in the early 2020s.
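To make the quadratic cost concrete, here is a minimal NumPy sketch (names and dimensions are illustrative, not taken from any particular codebase) that materializes the n-by-n score matrix:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: the score matrix has shape (n, n),
    so compute and memory grow quadratically with sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (n, n) <- the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                   # (n, d)

n, d = 256, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
```

Doubling n doubles the rows of Q, but it also doubles the columns of the score matrix, so the (n, n) intermediate quadruples in size; that is the cost the linear variants below avoid.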
For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism as ϕ(Q)(ϕ(K)ᵀV), suitably normalized. Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QKᵀ.

I don't want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n²) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM.

2.3 Linear Attention Revival

In the second half of this year, there has been a revival of linear attention variants, as well as a bit of a back-and-forth from some model developers, as illustrated in the figure below.

Figure 5: An overview of the linear attention hybrid architectures.

The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B-parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek team announced DeepSeek V3.2. (DeepSeek V3.2's sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it's fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.)

All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B-parameter M2 model without linear attention, going back to regular attention.
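A minimal sketch of this kernelized linear attention with the ϕ(x) = elu(x)+1 feature map is shown below. For brevity this is the non-causal form; the paper's autoregressive version maintains the same quantities as running sums over time steps, and the function names here are mine:

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.expm1(x))

def phi(x):
    """Positive kernel feature map from the 2020 linear attention paper."""
    return elu(x) + 1.0

def linear_attention(Q, K, V):
    """Kernelized attention: O(n·d²) time and O(d²) extra memory.
    No n-by-n matrix is ever formed."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                    # (d, d) summary of all keys and values
    Z = Qp @ Kp.sum(axis=0)          # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]    # (n, d)

n, d = 256, 64
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
```

The trick is reassociating the matrix product: instead of (ϕ(Q)ϕ(K)ᵀ)V with its n-by-n intermediate, we compute ϕ(Q)(ϕ(K)ᵀV), whose intermediate is only d-by-d.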
The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are important not only for regular chat sessions but also for agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all.

However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For the linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I want to discuss over the next few sections as one example of a hybrid attention architecture.

2.4 Qwen3-Next

Let's start with Qwen3-Next, which replaced the regular attention mechanism with a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k-token context length in terms of memory usage (the previous 235B-A22B model supported 32k natively, and 131k with YaRN scaling). Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio, as shown in the figure below.

Figure 6: Qwen3-Next with gated attention and Gated DeltaNet.

As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between the two. Specifically, as mentioned earlier, they alternate in a 3:1 ratio: three consecutive Gated DeltaNet blocks, followed by one gated attention block, repeated throughout the model.

Otherwise, the architecture is pretty standard and similar to Qwen3:

Figure 7: A previous "regular" Qwen3 model (left) next to Qwen3-Next (right).

So, what are gated attention and Gated DeltaNet?

2.5 Gated Attention

Before we get to the Gated DeltaNet itself, let's briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses "gated attention". This is essentially regular full attention with an additional sigmoid gate.
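A minimal sketch of such output gating is shown below. This is my own toy NumPy code with made-up weight names, not the Qwen3-Next implementation, and it uses a single head for readability:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention(X, W_q, W_k, W_v, W_g):
    """Causal scaled-dot-product attention with a sigmoid output gate."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(causal_mask, -np.inf, scores)  # mask out future tokens
    attn_out = softmax(scores) @ V                   # standard attention output
    gate = sigmoid(X @ W_g)                          # gating signal from the same input X
    return gate * attn_out                           # elementwise output modulation

n, d_in, d = 8, 16, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_in))
W_q, W_k, W_v, W_g = (rng.normal(size=(d_in, d)) * 0.1 for _ in range(4))
out = gated_attention(X, W_q, W_k, W_v, W_g)
```

Because the gate values lie in (0, 1), the gate can only attenuate features of the attention output, never amplify them, which is what gives the model per-feature control over what passes through.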
This gating is a simple modification: after computing attention as usual, the model uses a separate gating signal derived from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically.

The Qwen3-Next developers state that this helps with training stability:

[...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model.

In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update.

2.6 Gated DeltaNet

Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network) is Qwen3-Next's linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper, as mentioned earlier.

Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later).

(Side note: Readers familiar with the classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: "Cells that fire together wire together." It's basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.)
Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of a logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.)

Figure 8: Gated attention compared to Gated DeltaNet.

However, as shown in the figure above, next to the output gate, the "gated" in Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, and β (update gate) controls how strongly new inputs modify the state. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use, to keep the code more readable and focus on the recurrent aspects.)

So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends to, or looks at, every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it's still the regular scaled-dot-product attention that scales quadratically with the context length.

As a refresher, scaled-dot-product attention is computed as softmax(QKᵀ)V, where Q and K are n-by-d matrices, n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an n-by-n attention matrix, which is multiplied by the n-by-d value matrix V.

Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n.

In Gated DeltaNet, there's no n-by-n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is implemented as a recurrent state update of the form S_t = α_t · S_{t−1} + β_t · (v_t − S_{t−1} k_t) k_tᵀ, where S is the state that gets updated recurrently for each time step t.
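In code, that recurrence can be sketched as follows. This is a toy single-head NumPy version with made-up names; the real layer adds the convolutional mixing, multi-head structure, and SiLU output gate discussed above:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One Gated DeltaNet-style memory update.

    S:     (d, d) recurrent memory state
    k, v:  (d,) key and value vectors of the current token
    alpha: decay gate in (0, 1), how much old memory to keep
    beta:  update gate in (0, 1), how strongly this token writes
    """
    S = alpha * S                               # decay/forget old memory
    pred = S @ k                                # memory's prediction for key k
    return S + beta * np.outer(v - pred, k)     # delta-rule correction

def deltanet_forward(K, V, alphas, betas):
    """Process tokens one by one; state is O(d²), independent of n."""
    n, d = K.shape
    S = np.zeros((d, d))
    out = np.empty((n, d))
    for t in range(n):
        S = gated_delta_step(S, K[t], V[t], alphas[t], betas[t])
        out[t] = S @ K[t]                       # read-out for token t
    return out
```

With alpha = 1 and beta = 1, a single step stores an association exactly: after writing (k, v) with a unit-norm key k, reading S @ k returns v. The delta rule only writes the part of v the memory did not already predict, which is what distinguishes it from a plain Hebbian outer-product update.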
And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay), and β (beta) regulates how much the current token at time step t updates the memory.

Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet.

2.8 Kimi Linear vs. Qwen3-Next

Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there's one block that uses full attention, as shown in the figure below.

Figure 11: Qwen3-Next and Kimi Linear side by side.

Gated DeltaNet is a linear attention variant that takes inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section).

The MLA in Kimi Linear, depicted in the upper-right box in Figure 11 above, does not use the sigmoid gate. This omission was intentional so that the authors could compare the architecture more directly to standard MLA; however, they stated that they plan to add it in the future.

Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in the multi-head latent attention (MLA) layers (global attention).
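As a back-of-the-envelope check of why keeping only every fourth layer as full attention shrinks the KV cache so much, here is a simplified calculation. It assumes full emb_dim-sized keys and values in bf16 and ignores grouped-query or latent-attention compression, so the absolute numbers are illustrative rather than exact:

```python
def kv_cache_bytes(n_tokens, n_layers, emb_dim, bytes_per_elem=2):
    """KV cache size: keys + values, per layer, per token (bf16 = 2 bytes)."""
    return 2 * n_layers * n_tokens * emb_dim * bytes_per_elem

n_tokens, n_layers, emb_dim = 32_000, 48, 2048

full = kv_cache_bytes(n_tokens, n_layers, emb_dim)
# 3:1 hybrid: only every fourth layer is full attention and keeps a KV cache;
# the Gated DeltaNet layers carry a fixed-size state instead.
hybrid = kv_cache_bytes(n_tokens, n_layers // 4, emb_dim)

print(f"full attention: {full / 1e9:.1f} GB")    # ~12.6 GB at 32k tokens
print(f"3:1 hybrid:     {hybrid / 1e9:.1f} GB")  # ~3.1 GB (4x smaller)
```

The key point is that the KV cache grows linearly with both the token count and the number of caching layers, so removing three out of four full-attention layers cuts the cache by the same 4× factor at any context length.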
As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long-context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article.

2.9 Kimi Delta Attention

Kimi Linear modifies the linear attention mechanism of Qwen3-Next with the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning.

In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next's gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but, as noted above, without the additional output gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.)

There's no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed.

Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as Gated DeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having higher benchmark performance.
Furthermore, according to the ablation studies in the DeepSeek-V2 paper, MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variants once again promising for larger state-of-the-art models. That being said, Kimi Linear is a 48B-parameter model, which is 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model.

2.10 The Future of Attention Hybrids

Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example, Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state of the art.

3. Text Diffusion Models

A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) and were later implemented, scaled, and popularized by Stable Diffusion and others.

Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model's task is to learn how to remove the noise (from right to left).

3.1 Why Work on Text Diffusion?
With the Diffusion-LM Improves Controllable Text Generation paper in 2022, we started to see the beginning of a trend of researchers adopting diffusion models for generating text. And I've seen a whole bunch of text diffusion papers in 2025. When I checked my paper bookmark list, there were 39 text diffusion models on it! Given the rising popularity of these models, I thought it was finally time to talk about them.

Figure 14: This section covers text diffusion models.

So, what's the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let's refer to them simply as autoregressive LLMs. Now, the main selling point of text diffusion-based LLMs (let's call them "diffusion LLMs") is that they can generate multiple tokens in parallel rather than sequentially.

Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response.

3.2 The Denoising Process

The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year.

Figure 15: Illustration of the denoising process using the 8B LLaDA model.

As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer.
If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call architectures without a causal mask "bidirectional", as they have access to all sequence elements at once. (Note that this is similar to the BERT architecture, which is called "encoder-style" for historical reasons.)

So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective.

In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can't directly add or remove "noise" in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively "denoises" (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier.

Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework.
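The reverse (unmasking) process can be sketched as a toy loop that repeatedly asks the model for per-position predictions and fills in the most confident masked positions first. This is my own illustrative scheme, not LLaDA's actual sampler, and `predict_fn` stands in for a trained bidirectional transformer:

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def denoise(tokens, predict_fn, n_steps=8):
    """Iteratively unmask a token sequence, most confident positions first."""
    tokens = tokens.copy()
    for _ in range(n_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break                                   # everything is unmasked
        probs = predict_fn(tokens)                  # (seq_len, vocab) per-position probs
        conf = probs[masked].max(axis=1)            # model confidence at masked slots
        k = max(1, masked.size // 2)                # unmask a fraction per step
        chosen = masked[np.argsort(-conf)[:k]]
        tokens[chosen] = probs[chosen].argmax(axis=1)
    return tokens
```

Each pass re-predicts every masked position conditioned on everything unmasked so far, which is how later unmasking decisions get to depend on earlier ones even though each single pass is parallel.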
3.3 Autoregressive vs. Diffusion LLMs

Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential to make diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths. The problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper.

Figure 16: Annotated figure from the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding.

For example, consider the following prompt:

> "Pick a random city for travel: New York, New Orleans, Mexico City, or Panama City?"

Suppose we ask the LLM to generate a two-token answer. It might first sample the token "New" according to the conditional probability p(y_t = "New" | X). In the next iteration, it would then condition on the previously generated token and likely choose "York" or "Orleans", since both conditional probabilities p(y_{t+1} = "York" | X, y_t = "New") and p(y_{t+1} = "Orleans" | X, y_t = "New") are relatively high (because "New" frequently co-occurs with these continuations in the training set).

But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y_t = "New" | X) and p(y_{t+1} = "City" | X), leading to awkward outputs like "New City". (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.)

In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true.
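The failure mode can be reproduced with toy numbers (the probability values below are made up purely for illustration, not measured from any model):

```python
# Made-up marginal and conditional distributions for the two answer slots.
p_first = {"New": 0.55, "Mexico": 0.25, "Panama": 0.20}
p_second = {"City": 0.45, "York": 0.30, "Orleans": 0.25}        # marginal over slot 2
p_second_given_new = {"York": 0.50, "Orleans": 0.45, "City": 0.05}

argmax = lambda dist: max(dist, key=dist.get)

# Parallel decoding: both slots filled independently from their marginals.
parallel = (argmax(p_first), argmax(p_second))
print(parallel)       # ('New', 'City')  <- incoherent "New City"

# Sequential decoding: the second slot conditions on the first token.
first = argmax(p_first)
second = argmax(p_second_given_new) if first == "New" else argmax(p_second)
print((first, second))  # ('New', 'York')
```

The marginal for "City" is high only because several *other* first tokens ("Mexico", "Panama") lead to it; once "New" is fixed, "City" becomes very unlikely, which independent per-slot sampling cannot see.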
A diffusion LLM predicts all tokens in parallel, as mentioned earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text, and tokens influence each other through self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2,000 inference steps in an autoregressive LLM when generating a 2,000-token answer.

3.4 Text Diffusion Today

It's an interesting trend that vision models adopt components from LLMs, like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text.

Personally, besides trying a few demos, I haven't used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper:

[...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality.

Another downside I see is that diffusion LLMs cannot use tools as part of their chain, because there is no chain.
Maybe it’s possible to interleave tool calls between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, diffusion LLMs appear to be an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives for smaller, on-device LLMs, or perhaps as replacements for smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state: “Rapid response: Generates content significantly faster than even our fastest model so far.” And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. 4. World Models So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. These approaches usually come at the cost of slightly degraded modeling performance. The topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Like the other topics of this article, world models really deserve a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. 
4.1 The Main Idea Behind World Models Originally, the idea behind world models was to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world model, which serves as a mental simulation of the environment and allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about charting an alternative path to AI instead of LLMs. 4.2 From Vision to Code That being said, world model papers were previously all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first world model paper to capture my full attention (no pun intended). 
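The loop in Figure 18 can be sketched in a few lines of toy code: a one-dimensional "environment", a deliberately imperfect learned model, and a planner that mentally simulates each action before acting. Everything here is illustrative, not any real system:

```python
# Minimal sketch of the world-model loop: the agent "imagines" each
# action's outcome with its internal model, picks the best one, and only
# then acts in the real environment.
def real_environment(state, action):
    return state + action            # true (unknown-to-the-agent) dynamics

def world_model(state, action):
    return state + action * 0.9      # learned, slightly imperfect predictor

def plan(state, actions, goal):
    # Mental simulation: evaluate candidate actions without executing them.
    return min(actions, key=lambda a: abs(goal - world_model(state, a)))

state, goal = 0.0, 5.0
action = plan(state, actions=[-1.0, 1.0, 4.0, 6.0], goal=goal)
state = real_environment(state, action)   # only now act for real
print(action, state)  # 6.0 6.0
```

The learned model's bias (the 0.9 factor) makes the agent overshoot slightly; that gap between imagined and real outcomes is exactly what training the world model on real transitions is meant to close.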
The Code World Model (CWM) is, to my knowledge, the first world model that maps from text to text (or, more precisely, from code to code). It is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. 4.3 Code World Models vs Regular LLMs for Code So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static, text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior. Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. 
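To make "predicting the resulting program state" more concrete, here is a rough sketch of what an execution trace looks like as data, using Python's tracing hook. This is my illustration of the general idea, not CWM's actual trace format:

```python
import sys

def trace_states(func, *args):
    """Run func and record, for each executed line, the relative line
    number and a snapshot of the local variables -- loosely the kind of
    step-by-step state a code world model is trained to predict."""
    states = []
    code = func.__code__
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is code:
            states.append((frame.f_lineno - code.co_firstlineno,
                           dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return states, result

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

states, result = trace_states(demo, 3)
print(result)         # 3
print(states[-1][1])  # final locals snapshot, with total == 3
```

A regular code LLM only ever sees the source text of `demo`; a model trained on traces like `states` additionally sees how `total` and `i` evolve line by line.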
Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. 5. Small Recursive Transformers You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM), a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section covers small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . 
And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. 5.1 What Does Recursion Mean Here? TRM refines its answer through two alternating updates: first, it computes a latent reasoning state from the current question and answer; second, it updates the answer based on that latent state. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers lead to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required. Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length.
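The two alternating updates can be sketched numerically. Here f and g are toy scalar stand-ins for the shared tiny network (chosen so the recursion provably converges), not the actual TRM layers:

```python
def f(x, y, z):
    # Update the latent "reasoning" state z from question x and answer y.
    return 0.5 * z + 0.5 * (x - y)

def g(y, z):
    # Update the answer from the latent state.
    return y + z

def trm_refine(x, y=0.0, z=0.0, steps=16):
    for _ in range(steps):
        z = f(x, y, z)  # step 1: think about the question and current answer
        y = g(y, z)     # step 2: revise the answer
    return y

# With these toy updates, the answer spirals in on the target x.
print(trm_refine(x=4.0))  # 3.984375, approaching 4.0
```

The point of the sketch is the control flow, not the arithmetic: the answer is never produced in one shot, but repeatedly revised via a latent state that is itself recomputed from the current answer.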

Filippo Valsorda 1 month ago

Claude Code Can Debug Low-level Cryptography

Over the past few days I wrote a new Go implementation of ML-DSA, a post-quantum signature algorithm specified by NIST last summer. I livecoded it all over four days, finishing it on Thursday evening. Except… Verify was always rejecting valid signatures. I was exhausted, so I tried debugging for half an hour and then gave up, with the intention of coming back to it the next day with a fresh mind. On a whim, I figured I would let Claude Code take a shot while I read emails and resurfaced from hyperfocus. I mostly expected it to flail in some maybe-interesting way, or rule out some issues. Instead, it rapidly figured out a fairly complex low-level bug in my implementation of a relatively novel cryptography algorithm. I am sharing this because it made me realize I still don’t have a good intuition for when to invoke AI tools, and because I think it’s a fantastic case study for anyone who’s still skeptical about their usefulness. Full disclosure: Anthropic gave me a few months of Claude Max for free. They reached out one day and told me they were giving it away to some open source maintainers. Maybe it’s a ploy to get me hooked so I’ll pay for it when the free coupon expires. Maybe they hoped I’d write something like this. Maybe they are just nice. Anyway, they made no request or suggestion to write anything public about Claude Code. Now you know. I started Claude Code v2.0.28 with Opus 4.1 and no system prompts, and gave it the following prompt (typos included): I implemented ML-DSA in the Go standard library, and it all works except that verification always rejects the signatures. I know the signatures are right because they match the test vector. YOu can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Look for potential reasons the signatures don’t verify. ultrathink I spot-checked and w1 is different from the signing one. To my surprise, it pinged me a few minutes later with a complete fix . 
Maybe I shouldn’t be surprised! Maybe it would have been clear to anyone more familiar with AI tools that this was a good AI task: a well-scoped issue with failing tests. On the other hand, this is a low-level issue in a fresh implementation of a complex, relatively novel algorithm. It figured out that I had merged and into a single function for using it from Sign, and then reused it from Verify where already produces the high bits, effectively taking the high bits of w1 twice in Verify. Looking at the log , it loaded the implementation into the context and then immediately figured it out, without any exploratory tool use! After that it wrote itself a cute little test that reimplemented half of verification to confirm the hypothesis, wrote a mediocre fix, and checked the tests pass. I threw the fix away and refactored to take high bits as input, and changed the type of the high bits, which is both clearer and saves a round-trip through Montgomery representation. Still, this 100% saved me a bunch of debugging time. On Monday, I had also finished implementing signing with failing tests. There were two bugs, which I fixed in the following couple evenings. The first one was due to somehow computing a couple hardcoded constants (1 and -1 in the Montgomery domain) wrong . It was very hard to find, requiring a lot of deep printfs and guesswork. Took me maybe an hour or two. The second one was easier: a value that ends up encoded in the signature was too short (32 bits instead of 32 bytes) . It was relatively easy to tell because only the first four bytes of the signature were the same, and then the signature lengths were different. I figured these would be an interesting way to validate Claude’s ability to help find bugs in low-level cryptography code, so I checked out the old version of the change with the bugs (yay Jujutsu!) 
and kicked off a fresh Claude Code session with this prompt: I am implementing ML-DSA in the Go standard library, and I just finished implementing signing, but running the tests against a known good test vector it looks like it goes into an infinite loop, probably because it always rejects in the Fiat-Shamir with Aborts loop. You can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Figure out why it loops forever, and get the tests to pass. ultrathink It spent some time doing printf debugging and chasing down incorrect values very similarly to how I did it, and then figured out and fixed the wrong constants . It definitely took Claude less time than it took me. Impressive. It gave up after fixing that bug even though the tests still failed, so I started a fresh session (on the assumption that the context on the wrong constants would do more harm than good investigating an independent bug), and gave it this prompt: I am implementing ML-DSA in the Go standard library, and I just finished implementing signing, but running the tests against a known good test vector they don’t match. You can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Figure out what is going on. ultrathink It took a couple wrong paths, thought for quite a bit longer, and then found this one too . I honestly expected it to fail initially. It’s interesting how Claude found the “easier” bug more difficult. My guess is that maybe the large random-looking outputs of the failing tests did not play well with its attention. The fix it proposed was updating only the allocation’s length and not its capacity, but whatever, the point is finding the bug, and I’ll usually want to throw away the fix and rewrite it myself anyway. Three out of three one-shot debugging hits with no help is extremely impressive . 
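For readers unfamiliar with the "Montgomery domain" mentioned above: such hardcoded constants are ordinary values multiplied by the Montgomery radix R modulo q. The sketch below uses the ML-DSA modulus; R = 2^32 is my assumption for illustration, not necessarily the radix the Go implementation uses:

```python
# Hypothetical sketch: deriving "1 and -1 in the Montgomery domain" for
# the ML-DSA modulus. R = 2^32 is an assumed radix, not necessarily the
# one used by the implementation discussed in the post.
q = 8380417        # ML-DSA prime: 2^23 - 2^13 + 1
R = 1 << 32        # assumed Montgomery radix

def to_montgomery(a):
    # A value a is stored as a*R mod q so products can be reduced cheaply.
    return (a * R) % q

one_m = to_montgomery(1)
minus_one_m = to_montgomery(q - 1)
print(one_m, minus_one_m)  # 4193792 4186625

# Sanity check: multiplying by R^-1 mod q converts back out of the domain.
R_inv = pow(R, -1, q)
assert (one_m * R_inv) % q == 1
assert (minus_one_m * R_inv) % q == q - 1
```

Getting either constant wrong silently corrupts every subsequent Montgomery multiplication, which is why the symptom (garbage signatures) shows up far from the cause and is so painful to printf-debug.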
Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it. As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete or “make me a PR.” For example, how nice would it be if every time tests fail, an LLM agent was kicked off with the task of figuring out why, and only notified us if it did before we fixed it? For more low-level cryptography bugs and implementations, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . I promise I almost never post about AI. Enjoy the silliest floof. Surely this will help redeem me in the eyes of folks who consider AI less of a tool and more of something to be hated or loved. My work is made possible by Geomys , an organization of professional Go maintainers, which is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs — We at Ava Labs , maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network ), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. 
We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

Lukáš Lalinský 1 month ago

How I turned Zig into my favorite language to write network programs in

I’ve been watching the Zig language for a while now, given that it was created for writing audio software (low-level, no allocations, real time). I never paid too much attention though; it seemed a little weird to me and I didn’t see the real need. Then I saw a post from Andrew Kelley (creator of the language) on Hacker News, about how he reimplemented my Chromaprint algorithm in Zig, and that got me really interested. I’ve been planning to rewrite AcoustID’s inverted index for a long time. I had a couple of prototypes, but none of the approaches felt right. I was going through some rough times and wanted to learn something new, so I decided to use the project as an opportunity to learn Zig. And it was great, writing Zig is a joy. The new version was faster and more scalable than the previous C++ one. I was happy, until I wanted to add a server interface. In the previous C++ version, I used Qt , which might seem very strange for server software, but I wanted a nice way of doing asynchronous I/O and Qt allowed me to do that. It was callback-based, but Qt has a lot of support for making callbacks usable. In the newer prototypes, I used Go, specifically for the ease of networking and concurrency. With Zig, I was stuck. There are some Zig HTTP servers, so I could use those. I wanted to implement my legacy TCP server as well, and that’s a lot harder, unless I wanted to spawn a lot of threads. Then I made a crazy decision, to use Zig also for implementing a clustered layer on top of my server, using NATS as a messaging system, so I wrote a Zig NATS client , and that gave me a lot of experience with Zig’s networking capabilities. Fast forward to today, I’m happy to introduce Zio, an asynchronous I/O and concurrency library for Zig . If you look at the examples, you will not really see where the asynchronous I/O is, but it’s there, in the background, and that’s the point. Writing asynchronous code with callbacks is a pain. 
Not only that, it requires a lot of allocations, because you need state to survive across callbacks. Zio is an implementation of Go-style concurrency, but limited to what’s possible in Zig. Zio tasks are stackful coroutines with fixed-size stacks. When you run , this will initiate the I/O operation in the background and then suspend the current task until the I/O operation is done. When it’s done, the task will be resumed, and the result will be returned. That gives you the illusion of synchronous code, allowing for much simpler state management. Zio supports fully asynchronous network and file I/O, has synchronization primitives (mutexes, condition variables, etc.) that work with the cooperative runtime, has Go-style channels, OS signal watchers, and more. Tasks can run in single-threaded mode, or multi-threaded, in which case they can migrate from thread to thread for lower latency and better load balancing. And it’s FAST. I don’t want to be posting benchmarks here, maybe later when I have more complex ones, but the single-threaded mode is beating any framework I’ve tried so far. It’s much faster than both Go and Rust’s Tokio. Context switching is virtually free, comparable to a function call. The multi-threaded mode, while still not being as robust as Go/Tokio, has comparable performance. It’s still a bit faster than either of them, but that performance might go down as I add more fairness features. Because it implements the standard interfaces for reader/writer, you can actually use external libraries that are unaware they are running within Zio. Here is an example of an HTTP server: When I started working with Zig, I really thought it was going to be a niche language to write the fast code in, and then I’d need a layer on top of that in a different language. With Zio, that changed. The next step for me is to update my NATS client to use Zio internally. And after that, I’m going to work on an HTTP client/server library based on Zio.

Filippo Valsorda 1 month ago

The Geomys Standard of Care

One of the most impactful effects of professionalizing open source maintenance is that as professionals we can invest in upholding a set of standards that make our projects safer and more reliable. The same commitments and overhead that are often objected to when required of volunteers should be table stakes for professional maintainers. I didn’t find a lot of prior art, so to compile the Geomys Standard of Care I started by surveying recent supply chain compromises to look for mitigable root causes. (By the way, you might have missed that email because it includes the name of a domain used for a phishing campaign, so it got flagged as phishing. Oops.) I also asked for feedback from experts in various areas such as CI security, and from other Geomys maintainers. The first draft is below, and we’ll maintain the latest version at geomys.org/standard-of-care . It covers general maintenance philosophy, ongoing stability and reliability, dependency management, account and CI security, vulnerability handling, licensing, and more. In the future, we want to look into adopting more binary transparency tools, and into doing periodic reviews of browser extensions and of authorized Gerrit and GitHub OAuth apps and tokens (GitHub alone has four places 1 to look in!). We also welcome feedback on things that would be valuable to add, for security or for reliability. We aim to maintain our projects sustainably and predictably. We are only able to do this thanks to our retainer contracts with our clients, but these commitments are offered to the whole community, not just to paying clients. Scope . We apply this standard to projects maintained or co-maintained by Geomys, including For projects where we are not the sole maintainers, we prioritize working well with the rest of the team. Geomys maintainers may also have personal projects that are not held to this standard (e.g. everything in mostly-harmless ). Code review . 
If the project accepts external contributions, we review all the code provided to us. This extends to any code generated with LLMs, as well. Complexity . A major part of the role of a maintainer is saying no. We consciously limit complexity, and keep the goals and non-goals of a project in mind when considering features. (See for example the Go Cryptography Principles .) Static analysis . We run staticcheck , by our very own @dominikh , in CI. Stability . Once a Go package reaches v1, we maintain strict backwards compatibility within a major version, similarly to the standard library’s compatibility promise . Ongoing maintenance . Not all projects are actively worked on at all times (e.g. some projects may be effectively finished, or we may work in batches). However, unless a project is explicitly archived or deprecated, we will address newly arising issues that make the project unsuitable for a previously working use case (e.g. compatibility with a new OS). Dependency management . We don’t use automatic dependency version bump tools, like Dependabot. For our purposes, they only cause churn and increase the risk of supply chain attacks by adopting new module versions before the ecosystem has had time to detect attacks. (Dependabot specifically also has worrying impersonation risks , which would make for trivial social engineering attacks.) Instead, we run govulncheck on a schedule, to get high signal-to-noise ratio notifications of vulnerable dependencies that actually affect our projects; and run isolated CI jobs with the latest versions of our dependencies (i.e. running before ) to ensure we’re alerted early of breakages, so we can easily update to future security releases and so we’re aware of potential compatibility issues for our dependents. Phishing-resistant authentication . Phishing is by far the greatest threat to our security and, transitively, to that of our users. 
We acknowledge there is no amount of human carefulness that can systematically withstand targeted attacks, so we use technically phishing-resistant authentication for all services that allow impacting our projects’ users. Phishing-resistant authentication means passkeys or WebAuthn 2FA, with credentials stored in platform authenticators (e.g. iCloud Keychain), password managers (e.g. 1Password or Chrome), or hardware tokens (e.g. YubiKeys). Critical accounts that allow escalating to user impact include: If a strict mode such as Google’s Advanced Protection Program or Apple’s Advanced Data Protection is available, we enable it. If a phishable fallback authentication or account recovery method is instead required, we configure one that is secret-based (e.g. TOTP or recovery codes) and either delete the secret or commit to never using it without asking a fellow Geomys maintainer to review the circumstances that necessitated it. TOTP can’t hurt us if we don’t use it. We never enable SMS as an authentication mechanism or as an account recovery mechanism, because SIM jacking is possible even without action on our part. Long-lived credentials . We avoid where possible long-lived persistent credentials, or make them non-extractable if possible. For example, we use git-credential-oauth instead of Gerrit cookies, and hardware-bound SSH keys with yubikey-agent or Secretive instead of personal access tokens for git pushes to GitHub. Unlike phishing-resistant authentication, we found it impractical to roll out short-lived credentials universally. Notably, we have not found a way to use the GitHub CLI without extractable long-lived credentials. CI security . We run zizmor on our GitHub Actions workflows, and we don’t use dangerous GitHub Actions triggers that run privileged workflows with attacker-controlled contexts, such as . We run GitHub Actions workflows with read-only permissions and no secrets by default. 
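Concretely, a scheduled govulncheck run with the read-only-by-default posture described above might look like this hypothetical GitHub Actions workflow (names, schedule, and action versions are illustrative, not Geomys’s actual configuration):

```yaml
# Hypothetical example only: a daily govulncheck run with read-only
# permissions and no secrets, in the spirit of the practices above.
name: govulncheck
on:
  schedule:
    - cron: "0 6 * * *"   # once a day
permissions:
  contents: read           # read-only by default
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: stable
      - run: go install golang.org/x/vuln/cmd/govulncheck@latest
      - run: govulncheck ./...
```

Because govulncheck reports only vulnerabilities in code paths the module actually reaches, a scheduled job like this stays quiet until something genuinely affects the project.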
Workflows that have write permissions or access to secrets disable all use of caches (including indirectly through actions like ), to mitigate cache poisoning attacks . (Note that, incredibly, read-only workflows can write arbitrary cache entries, which is why this must be mitigated at cache use time.) Third-party access . For projects maintained solely by Geomys, we avoid providing user-impacting (i.e. push or release) access to external people, and publicly disclose any exceptions. If abandoning a project, we prefer archiving it and letting a fork spawn to handing over control to external people. This way dependents can make their own assessment of whether to trust the new maintainers. Any exceptions will be widely communicated well in advance. Under no circumstances will we release to public registration a domain, GitHub user/org, or package name that was previously assigned to a Geomys project. Availability monitoring . We have automated uptime monitoring for critical user-facing endpoints, such as the Go import path meta pages. This also provides monitoring for critical domain expiration, preventing accidental takeovers. Transparency logging . We subscribe to new version notifications via GopherWatch , to be alerted of unauthorized module versions published to the Go Checksum Database. We monitor Certificate Transparency logs for critical domains (e.g. the roots of our Go import paths) using tools such as Cert Spotter or Silent CT . We also set CAA records on those domains limiting issuance to the minimal set of CAs required for operation. Vulnerability handling . We document the official vulnerability reporting mechanism of each project, we encourage coordinated vulnerability reporting, and we appreciate the work of security researchers. We honor embargoes of up to 90 days, and we do not share vulnerability details with people not involved in fixing it until they are public. (Paying clients do not get access to private vulnerability details. 
This is to honor our responsibility to the various stakeholders of an open source project, and to acknowledge that often these details are not ours to share.) Once a vulnerability is made public, we ensure it is included in the Go vulnerability database with accurate credit and metadata, including a CVE number. If the documented vulnerability reporting mechanism is unresponsive, an escalation path is available by emailing security at geomys.org. Licenses . We use permissive, well-known licenses: BSD-3-Clause, BSD-2-Clause, BSD-1-Clause, 0BSD, ISC, MIT, or (less preferably) Apache-2.0. Disclaimer . This is not a legally binding agreement. Your use of the projects continues to be controlled by their respective licenses, and/or by your contract with Geomys, which does not include this document unless explicitly specified. I am getting a cat (if I successfully defeat my allergies through a combination of LiveClear , SLIT , antihistamines, and HEPA filters), so obviously you are going to get a lot of cat pictures going forward. For more, you can follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . This is the work of Geomys , an organization of professional Go maintainers, which is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .)

Projects in scope (referenced under Scope above):
- the and packages in the Go standard library and the FIPS 140-3 Go Cryptographic Module (co-maintained with the rest of the Go team)
- Staticcheck
- filippo.io/edwards25519
- filippo.io/csrf
- filippo.io/keygen
- filippo.io/intermediates (externalized from the standard library)
- age and typage
- Sunlight and filippo.io/torchwood
- yubikey-agent

Critical accounts (referenced under Phishing-resistant authentication above):
- All Google accounts linked to a Gerrit account
- Password manager
- Passkey sync (e.g. Apple iCloud)
- Website host
- Domain registrar
- Package registry (if applicable, although Go’s decentralized package management largely removes this attack surface)

1. https://github.com/settings/tokens and https://github.com/settings/personal-access-tokens and https://github.com/settings/apps/authorizations and https://github.com/settings/applications ↩
