Latest Posts (20 found)

Build Your Own Key-Value Storage Engine—Week 7

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda:
- Week 0: Introduction
- Week 1: In-Memory Store
- Week 2: LSM Tree Foundations
- Week 3: Durability with Write-Ahead Logging
- Week 4: Deletes, Tombstones, and Compaction
- Week 5: Leveling and Key-Range Partitioning
- Week 6: Block-Based SSTables and Indexing
- Week 7: Bloom Filters and Trie Memtable

Over the last few weeks, you refined your LSM tree to introduce leveling. In case of a key miss, a lookup requires the following steps:
- Lookup in the memtable.
- Lookup in all the L0 SSTables.
- Lookup in one L1 SSTable.
- Lookup in one L2 SSTable.

Last week, you optimized lookups by introducing block-based SSTables and indexing, but a lookup is still not a “free” operation. In the worst case, it requires fetching two pages (one for the index block and one for the data block) just to find out that a key is missing from an SSTable.

This week, you will optimize searches by introducing a “tiny” level of caching per SSTable. If you’re an avid reader of The Coder Cafe, we already discussed a great candidate for such a cache:
- One that doesn’t consume too much memory, to make sure we don’t drastically increase space amplification.
- One that is fast enough that a lookup doesn’t introduce too much overhead, especially since we have to check the cache before every SSTable lookup.

You will implement this cache using Bloom filters: a space-efficient, probabilistic data structure for checking set membership. A Bloom filter can return two possible answers:
- The element is definitely not in the set (no false negatives).
- The element may be in the set (false positives are possible).

In addition to optimizing SSTable lookups, you will also optimize your memtable. In week 2, you implemented a memtable using a hashtable.
Let’s get some perspective on the problems with using a hashtable:
- A memtable buffers writes. As it’s the main entry point for writes, a write has to be fast. → OK: a hashtable has O(1) average inserts, plus O(k) (k: the length of the key) for hashing.
- For reads, a key lookup has to be fast. → OK: O(1) average lookups, plus O(k) to hash.
- Range scanning operations (week 5, optional work), such as “give me the list of keys between bar and foo”. → A hashtable, because it’s not an ordered data structure, is terrible here: you end up touching everything, so O(n) with n the number of elements in the hashtable.
- Flush to L0. → A hashtable isn’t ordered, so it requires sorting all the keys (O(n log n), with n the number of elements) to produce the SSTables.

Because of these negative points, could we find a better data structure? Yes! This week, you will switch the memtable to a radix trie (see Further Notes for a discussion on alternative data structures).

A trie is a tree-shaped data structure usually used to store strings efficiently. The common example to illustrate a trie is storing a dictionary. For example, suppose you want to store two words: a four-letter word and a five-letter word that share the same first four letters. Despite the shared prefix, you need to store a total of 4 + 5 = 9 letters. Tries optimize the storage required by sharing prefixes. Each node stores one letter. Here’s an example of a trie storing these two words in addition to the word foo (the marked nodes represent the end of a word). As you can see, we didn’t duplicate the first four letters of the shorter word to store the longer one. In this very example, instead of storing 9 letters for the two words, we stored only five letters.

Yet, you’re not going to implement a “basic” trie for your memtable; instead, you will implement a compressed trie called a radix trie (also known as a Patricia trie). Back to the previous example, storing one node (one square) has an overhead: it usually means at least one extra field to store the next element, usually a pointer.
In the previous example, we needed 11 nodes in total, but what if we could compress the number of nodes required? The idea is to combine nodes with a single child. This new trie stores the exact same information, except it requires 6 nodes instead of 11. That’s what radix tries are about.

To summarize the benefits of switching a memtable from a hashtable to a radix trie:
- Ordered by design: tries keep keys in order and make prefix/range lookups natural, which helps for range scans and for streaming a sorted flush.
- No rebalancing/rehashing pauses: the shape doesn’t depend on insertion order, and operations don’t need rebalancing; you avoid periodic rehash work.
- Prefix compression: a radix trie can cut duplicated key bytes in the memtable, reducing in-memory space.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server: Join the Discord

Let’s size the Bloom filter. You will target:
- p (false-positive rate) = 1%
- n (max elements per SSTable) = 1,953
- k (hash functions) = 5

Using the formula from the Bloom Filters post, we get m ≈ 19,230 bits, i.e., 2,404 B. We will round up to 2,496 B (39 × 64 B), so the bitset is a whole number of cache lines.

NOTE: Using k=7 would shave only ~2–3% space for ~40% more hash work, so k=5 is a good trade-off.

To distribute elements across the bitvector, you will use xxHash64 with two different constant seeds to get two base hashes, then derive k indices by double hashing.

The required changes to introduce Bloom filters:

Startup: For each SSTable in the MANIFEST, cache its related Bloom filter in memory. Since each Bloom filter requires only a small amount of space, this optimization has a minimal memory footprint. For example, caching 1,000 Bloom filters of the type you designed requires less than 2.5 MB of memory.

SSTable creation: For each new SSTable you write, initialize an empty bitvector of 2,496 B.
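Before continuing with the creation steps, the sizing and double-hashing scheme described above can be sketched as follows. This is a hedged sketch, not the series’ reference implementation: the post assumes xxHash64, but hashlib’s blake2b stands in here so the example stays dependency-free, and the seed constants are made up for illustration.

```python
import hashlib

M_BITS = 2_496 * 8  # 19,968-bit bitvector (39 cache lines of 64 B)
K = 5               # number of hash functions

def _hash64(key: bytes, seed: int) -> int:
    # Stand-in for xxHash64(key, seed): any keyed 64-bit hash works here.
    digest = hashlib.blake2b(key, digest_size=8, key=seed.to_bytes(8, "little"))
    return int.from_bytes(digest.digest(), "little")

def bit_positions(key: bytes) -> list[int]:
    # Double hashing: derive k indices from two base hashes,
    # index_i = (h1 + i*h2) mod m.
    h1 = _hash64(key, seed=1)
    h2 = _hash64(key, seed=2)
    return [(h1 + i * h2) % M_BITS for i in range(K)]

bitset = bytearray(M_BITS // 8)

def bloom_add(key: bytes) -> None:
    for p in bit_positions(key):
        bitset[p // 8] |= 1 << (p % 8)

def bloom_maybe_contains(key: bytes) -> bool:
    # False means "definitely absent"; True means "maybe present".
    return all(bitset[p // 8] >> (p % 8) & 1 for p in bit_positions(key))

bloom_add(b"foo")
print(bloom_maybe_contains(b"foo"))  # True
```

A real build would keep one bitset per SSTable under construction rather than a module-level one; the shape of the add/check functions stays the same.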
Build the Bloom filter in memory as you emit the keys (including tombstones): compute the two base hashes from the key, and for each i in [0, k), set the bit at position (h1 + i·h2) mod m. When the SSTable is done, persist the filter in a sidecar file next to the SSTable and record it in the MANIFEST file. Update the cache containing the Bloom filters.

Compaction: Delete from memory the Bloom filters corresponding to deleted SSTables.

Lookup: Before reading an SSTable, compute the k bit positions from the key. If all the corresponding bits are set, the key may be present; therefore, proceed with your normal lookup in the SSTable. Otherwise, skip this SSTable.

Now, let’s replace your hashtable with a trie. Each node holds (the field names below are suggestions):
- prefix: compressed edge fragment.
- children: a map keyed by the next character after prefix to a node.
- kind: an enum with the different possible values:
  - Prefix: the node is just a prefix, no full key ends here.
  - Value: a full key exists at this node.
  - Tombstone: this key was explicitly deleted.
- value: if kind is Value, the corresponding value.

The root is a sentinel node with an empty prefix.

Put: Walk from the root, matching the longest common prefix against the key. If there is a partial match in the middle of an edge, split once: create a parent with the common part and two children, the old suffix and the new suffix. Descend via the next child (next unmatched character). At the terminal node: set kind to Value and store the value.

Get: Walk edges by longest-prefix match. If an edge doesn’t match, return not found. At the terminal node: if kind is Value, return the value; if kind is Prefix or Tombstone, return not found.

Delete: Walk as in Put. If the path doesn’t fully exist, create the missing suffix nodes so that a terminal node exists. At the terminal node: set kind to Tombstone (you may have to clear the value).

Flush process: in-order traversal:
- Value: emit the key/value pair.
- Tombstone: emit a tombstone.
- Prefix: emit nothing.

There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct.

Use per-SSTable random seeds for the Bloom hash functions. Persist them in the Bloom filter files.

In Bloom Filters, you introduced blocked Bloom filters, a variant that optimizes spatial locality by:
- Dividing the Bloom filter into contiguous blocks, each the size of a cache line.
- Restricting each query to a single block to ensure all bit lookups stay within the same cache line.

Switch to blocked Bloom filters and see the impact on latency and throughput. If you implemented the scan operation from week 5 (optional work), wire it to your memtable radix trie.

That’s it for this week! You optimized lookups with per-SSTable Bloom filters and switched the memtable to a radix trie, an ordered data structure. Since the beginning of the series, everything you built has been single-threaded, and flush/compaction remains stop-the-world. In two weeks, you will finally tackle the final boss of LSM trees: concurrency.

If you want to dive more into tries, Trie Memtables in Cassandra is a paper that explains why Cassandra moved from a skip list + B-tree memtable to a trie, and what it changed for topics such as GC and CPU locality. A popular variant of the radix trie is the Adaptive Radix Tree (ART): it dynamically resizes node types based on the number of children to stay compact and cache-friendly, while supporting fast in-memory lookups, inserts, and deletes. This paper (or this summary) explores the topic in depth. You should also be aware that tries aren’t the only option for memtables, as other data structures exist. For example, RocksDB relies on a skip list. See this resource for more information.

About Bloom filters, some engines keep a Bloom filter not only per SSTable but per data-block range as well. This was the case for RocksDB’s older block-based filter format (source). RocksDB later shifted toward partitioned indexes/filters, which partition the index and full-file filter into smaller blocks with a top-level directory for on-demand loading. The official doc delves into the new approach.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.
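As an appendix to this week’s memtable work, here is a minimal sketch of the radix-trie operations described earlier: insertion with a single edge split, and lookup. The field names (prefix, children, kind, value) are illustrative choices, not prescribed by the series, and delete is reduced to a put that writes a Tombstone marker.

```python
# Node kinds: a node can be a pure prefix, hold a value, or mark a deletion.
PREFIX, VALUE, TOMBSTONE = "prefix", "value", "tombstone"

class Node:
    def __init__(self, prefix=""):
        self.prefix = prefix   # compressed edge fragment
        self.children = {}     # next character after prefix -> child Node
        self.kind = PREFIX
        self.value = None

def _common_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def put(node, key, value, kind=VALUE):
    # A delete is put(root, key, None, kind=TOMBSTONE).
    while True:
        n = _common_len(node.prefix, key)
        if n < len(node.prefix):
            # Partial match mid-edge: split once. The old node keeps its
            # suffix as a child; the shared part becomes the parent.
            child = Node(node.prefix[n:])
            child.children, child.kind, child.value = (
                node.children, node.kind, node.value)
            node.prefix = node.prefix[:n]
            node.children = {child.prefix[0]: child}
            node.kind, node.value = PREFIX, None
        key = key[n:]
        if not key:
            node.kind, node.value = kind, value
            return
        if key[0] not in node.children:
            leaf = Node(key)
            leaf.kind, leaf.value = kind, value
            node.children[key[0]] = leaf
            return
        node = node.children[key[0]]

def get(node, key):
    while True:
        if not key.startswith(node.prefix):
            return None  # edge mismatch: not found
        key = key[len(node.prefix):]
        if not key:
            # Prefix-only and tombstone nodes both read as "not found".
            return node.value if node.kind == VALUE else None
        if key[0] not in node.children:
            return None
        node = node.children[key[0]]

root = Node("")  # sentinel root with an empty prefix
put(root, "foobar", 1)
put(root, "foofoo", 2)
print(get(root, "foobar"), get(root, "foo"))  # 1 None
```

Inserting "foofoo" after "foobar" triggers exactly one split: the shared "foo" becomes a parent with children "bar" and "foo", mirroring the 11-nodes-to-6-nodes compression discussed above.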
❤️ If you enjoyed this post, please hit the like button.


An Interview with Bill Gurley About Runnin’ Down a Dream

An interview with long-time (retired) VC Bill Gurley about his new book on building a career you love, Uber, and the modern state of VC.


curl up 2026

The annual curl users and developers meeting, curl up, takes place May 23-24 2026 in Prague, Czechia. We are in fact returning to the same city and the exact same venue as in 2025. We liked it so much! This is a cozy and friendly event that normally attracts around 20-30 attendees. We gather in a room through a weekend and we talk curl. The agenda is usually set up with a number of talks through the two days, and each talk ends with a follow-up Q&A and discussion session. So no big conference thing, just a bunch of friends around a really large table. Over a weekend. Anyone is welcome to attend – for free – and everyone is encouraged to submit a talk proposal – anything that is curl and Internet transfer related goes. We make an effort to attract and lure the core curl developers and the most active contributors of recent years into the room. We do this by reimbursing their travel and hotel expenses. The agenda is a collaborative effort and we are going to work on putting it together from now all the way until the event, in order to make sure we make the best of the weekend and we get to cover all the curl related topics we can think of! Help us improve the agenda in the curl-up wiki: https://github.com/curl/curl-up/wiki/2026 Meeting up in the real world as opposed to doing video meetings helps us get to know each other better, allows us to socialize in ways we otherwise never can do and in the end it helps us work better together – which subsequently helps us write better code and produce better outcomes! It also helps us meet and welcome newcomers and casual contributors. Showing up at curl up is an awesome way to dive into the curl world wholeheartedly and in the deep end. Needless to say this event costs money to run. We pay our top people to come, we pay for the venue and pay for food. We would love to have your company mentioned as top sponsor of the event or perhaps of a social dinner on the Saturday? Get in touch and let’s get it done!
Everyone is welcome and encouraged to attend – at no cost. We only ask that you register in advance (the registration is not open yet). We always record all sessions on video and make them available after the fact. You can catch up on previous years’ curl up sessions on the curl website’s video section. We also live-stream all the sessions of curl up during both days, to be found on my twitch channel: curlhacker. Our events are friendly to everyone. We abide by the code of conduct, and we have never had anyone come even close to violating it.


Notes on Linear Algebra for Polynomials

We’ll be working with the set P_n(\mathbb{R}), real polynomials of degree \leq n . Such polynomials can be expressed using n+1 scalar coefficients a_i as follows:

p(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n = \sum_{i=0}^{n} a_i x^i

The set P_n(\mathbb{R}), along with addition of polynomials and scalar multiplication, forms a vector space. As a proof, let’s review how the vector space axioms are satisfied. We’ll use p(x), q(x) and r(x) as arbitrary polynomials from the set P_n(\mathbb{R}) for the demonstration. Similarly, a and b are arbitrary scalars in \mathbb{R}.

Associativity of vector addition: This is trivial because addition of polynomials is associative [1]:

(p(x) + q(x)) + r(x) = p(x) + (q(x) + r(x))

Commutativity of vector addition: Commutativity is similarly trivial, for the same reason:

p(x) + q(x) = q(x) + p(x)

Identity element of vector addition: The zero polynomial 0 serves as an identity element. \forall p(x)\in P_n(\mathbb{R}), we have 0 + p(x) = p(x).

Inverse element of vector addition: For each p(x), we can use q(x)=-p(x) as the additive inverse, because p(x)+q(x)=0.

Identity element of scalar multiplication: The scalar 1 serves as an identity element for scalar multiplication. For each p(x), it’s true that 1\cdot p(x)=p(x).

Associativity of scalar multiplication: For any two scalars a and b:

a(b\, p(x)) = (ab)\, p(x)

Distributivity of scalar multiplication over vector addition: For any p(x), q(x) and scalar a:

a(p(x) + q(x)) = a\, p(x) + a\, q(x)

Distributivity of scalar multiplication over scalar addition: For any scalars a and b and polynomial p(x):

(a + b)\, p(x) = a\, p(x) + b\, p(x)

Since we’ve shown that polynomials in P_n(\mathbb{R}) form a vector space, we can now build additional linear algebraic definitions on top of that. A set of k polynomials p_i(x)\in P_n(\mathbb{R}) is said to be linearly independent if

a_1 p_1(x) + a_2 p_2(x) + \dots + a_k p_k(x) = 0

implies a_i=0 \quad \forall i . In words, the only linear combination resulting in the zero vector is the one where all coefficients are 0. As an example, let’s discuss the fundamental building blocks of polynomials in P_n(\mathbb{R}) : the set \{1, x, x^2, \dots, x^n\}.
These are linearly independent because

a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n = 0

is true only for the zero polynomial, in which all the coefficients a_i=0 . This comes from the very definition of polynomials. Moreover, this set spans the entire P_n(\mathbb{R}) because every polynomial can be (by definition) expressed as a linear combination of \{1, x, x^2, \dots, x^n\} . Since we’ve shown these basic polynomials are linearly independent and span the entire vector space, they are a basis for the space. In fact, this set has a special name: the monomial basis (because a monomial is a polynomial with a single term).

Suppose we have some set of polynomials, and we want to know if these form a basis for P_n(\mathbb{R}) . How do we go about it? The idea is to use linear algebra the same way we do for any other vector space. Let’s use a concrete example to demonstrate: is the set Q a basis for P_n(\mathbb{R}) ?

We’ll start by checking whether the members of Q are linearly independent. Write a linear combination of the members of Q, set it equal to the zero polynomial, and regroup to collect the coefficient of each monomial. For this to hold, the coefficient of each monomial has to be zero. In matrix form, this is a homogeneous linear system, which we know how to solve by reducing the matrix into row-echelon form. It’s easy to see that the reduced row-echelon form of this specific matrix is I , the identity matrix. Therefore, this system of equations has a single solution: a_i=0 \quad \forall i [2]. We’ve shown that the set Q is linearly independent.

Now let’s show that it spans the space P_n(\mathbb{R}) . We want to set a linear combination of the members of Q equal to an arbitrary polynomial, and find the coefficients a_i that satisfy this for any arbitrary \alpha, \beta and \gamma . We proceed just as before, by regrouping on the left side and equating the coefficient of each power of x separately. If we turn this into matrix form, the matrix of coefficients is exactly the same as before. So we know there’s a single solution, and by reducing the matrix into I , the solution will appear on the right-hand side. It doesn’t matter for the moment what the actual solution is, as long as it exists and is unique.
We’ve shown that Q spans the space! Since the set Q is linearly independent and spans P_n(\mathbb{R}) , it is a basis for the space.

I’ve discussed inner products for functions in the post about Hilbert space. Well, polynomials are functions, so we can define an inner product using integrals as follows [3]:

\langle p, q \rangle = \int_a^b p(x)\, q(x)\, w(x)\, dx

Where the bounds a and b are arbitrary, and could be infinite. Whenever we deal with integrals we worry about convergence; in my post on Hilbert spaces, we only talked about L^2 - the square integrable functions. Most polynomials are not square integrable, however. Therefore, we can restrict this using either:

- A special weight function w(x) to make sure the inner product integral converges
- Finite bounds on the integral, in which case we can just set w(x)=1

Let’s use the latter, and restrict the bounds to the range [-1,1] , setting w(x)=1 . We have the following inner product:

\langle p, q \rangle = \int_{-1}^{1} p(x)\, q(x)\, dx

Let’s check that this satisfies the inner product space conditions.

Conjugate symmetry: Since real multiplication is commutative, we can write:

\langle p, q \rangle = \int_{-1}^{1} p(x)\, q(x)\, dx = \int_{-1}^{1} q(x)\, p(x)\, dx = \langle q, p \rangle

We deal in the reals here, so we can safely ignore complex conjugation.

Linearity in the first argument: Let p_1,p_2,q\in P_n(\mathbb{R}) and a,b\in \mathbb{R} . We want to show that

\langle a p_1 + b p_2, q \rangle = a\langle p_1, q \rangle + b\langle p_2, q \rangle

Expand the left-hand side using our definition of inner product:

\int_{-1}^{1} (a p_1(x) + b p_2(x))\, q(x)\, dx = a\int_{-1}^{1} p_1(x)\, q(x)\, dx + b\int_{-1}^{1} p_2(x)\, q(x)\, dx

The result is equivalent to a\langle p_1,q\rangle + b\langle p_2,q\rangle .

Positive-definiteness: We want to show that for nonzero p\in P_n(\mathbb{R}) , we have \langle p, p\rangle > 0 . First of all, since p(x)^2\geq0 for all x , it’s true that:

\langle p, p \rangle = \int_{-1}^{1} p(x)^2\, dx \geq 0

What about the result 0 though? Well, let’s say that \langle p, p \rangle = 0 . Since p(x)^2 is a non-negative function, this means that the integral of a non-negative function ends up being 0. But p(x) is a polynomial, so it’s continuous, and so is p(x)^2 . If the integral of a continuous non-negative function is 0, it means the function itself is 0. Had it been non-zero in any place, the integral would necessarily have been positive as well. We’ve proven that \langle p, p\rangle=0 only when p is the zero polynomial. The positive-definiteness condition is satisfied.
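This inner product is easy to compute exactly for polynomials given by their coefficients, since \int_{-1}^{1} x^k\, dx equals 2/(k+1) for even k and 0 for odd k. A small sketch (representing a polynomial as its coefficient list [a_0, a_1, \dots] is my own convention, not from the post):

```python
from fractions import Fraction

def inner(p, q):
    """<p, q> = integral of p(x) q(x) over [-1, 1], exactly.

    p and q are coefficient lists [a0, a1, ...] of real polynomials.
    """
    # Multiply the polynomials: coefficient of x^(i+j) accumulates p[i]*q[j].
    prod = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            prod[i + j] += a * b
    # Integrate term by term: odd powers vanish on the symmetric interval.
    return sum(Fraction(2, k + 1) * c for k, c in enumerate(prod) if k % 2 == 0)

print(inner([1], [0, 0, 1]))     # <1, x^2> = 2/3: not orthogonal
print(inner([0, 1], [0, 0, 1]))  # <x, x^2> = 0: orthogonal
```

Using Fraction keeps the arithmetic exact, so the orthogonality checks are free of floating-point noise.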
In conclusion, P_n(\mathbb{R}) along with the inner product we’ve defined forms an inner product space. Now that we have an inner product, we can define orthogonality on polynomials: two polynomials p,q are orthogonal (w.r.t. our inner product) iff \langle p, q \rangle = 0 . Contrary to expectation [4], the monomial basis polynomials are not orthogonal under our definition of inner product. For example, calculating the inner product of 1 and x^2 :

\langle 1, x^2 \rangle = \int_{-1}^{1} x^2\, dx = \frac{2}{3} \neq 0

There are other sets of polynomials that are orthogonal under our inner product, for example the Legendre polynomials; but this is a topic for another post.

Chris Coyier Yesterday

Tucci Pan Review

Stanley Tucci has a set of cookware named after him that GreenPan sells. I’ve got these two pans: I forget where they came from exactly, some silent auction or something, but I unboxed and started using them about 8 months ago. I was so hyped the first few months! It’s my daily-driver pan. I’d say it’s used once a day, on average. Then it loses its luster after a while. I could scrub the bottom, but I just don’t care about that. The inside was more concerning. I hit up their customer support, as it’s not just the aesthetics that were dimming here: the pan really seems maybe half as nicely non-stick as it was 8 months ago, and cleaning it with non-abrasive techniques takes much longer. Fill the pan halfway with water and bring it to a simmer for about 2 minutes. Pour out the water and place the pan on a safe sturdy surface. Carefully use a Melamine sponge (Mr. Clean Magic Eraser, our Restoring Sponge or any melamine sponge) and a little plain water on the warm surface to wipe away the food or stuck on oil. This should do the trick. Fair enough: that technique worked well to remove what they called “a layer of carbonized oil”. I got it entirely clean with a bit of elbow grease. I’d say the pan performs 10% better after that. But it ain’t back to its former glory. I highly suspect at the one-year mark the pan is basically gonna be toast. So my review is: it’s an incredible pan for 6 months and a so-so pan for 6 months, then you’re done. There is some kind of coating, and it’s way better than average, but it’s just not a forever thing. If you can stomach a few hundred bucks a year to replace it, go for it. Me, I’ve got some research to do on what to replace it with because I think I want a little longer longevity. And yes, I’ve got a well-seasoned cast-iron I’ve used most of my life. That’s fine, but I wanna try other things. Specifically, less-honkin’ pans that are easier to handle.
- Ultra extremely non-stick.
- Washing them with a soft sponge is nearly effortless because of how non-stick they are. Feels good, like I’m taking care of it correctly.
- The edges of the pan, with the steep angles, are perfect for that cool chef move where you toss/flip stuff in the pan with a wrist movement.


I vibe coded my dream macOS presentation app

I gave a talk this weekend at Social Science FOO Camp in Mountain View. The event was a classic unconference format where anyone could present a talk without needing to propose it in advance. I grabbed a slot for a talk I titled "The State of LLMs, February 2026 edition", subtitle "It's all changed since November!". I vibe coded a custom macOS app for the presentation the night before. I've written about the last twelve months of development in LLMs in December 2023 , December 2024 and December 2025 . I also presented The last six months in LLMs, illustrated by pelicans on bicycles at the AI Engineer World’s Fair in June 2025. This was my first time dropping the time covered to just three months, which neatly illustrates how much the space keeps accelerating and felt appropriate given the November 2025 inflection point . (I further illustrated this acceleration by wearing a Gemini 3 sweater to the talk, which I was given a couple of weeks ago and is already out-of-date thanks to Gemini 3.1 .) I always like to have at least one gimmick in any talk I give, based on the STAR moment principle I learned at Stanford - include Something They'll Always Remember to try and help your talk stand out. For this talk I had two gimmicks. I built the first part of the talk around coding agent assisted data analysis of Kākāpō breeding season (which meant I got to show off my mug ), then did a quick tour of some new pelicans riding bicycles before ending with the reveal that the entire presentation had been presented using a new macOS app I had vibe coded in ~45 minutes the night before the talk. The app is called Present - literally the first name I thought of. It's built using Swift and SwiftUI and weighs in at 355KB, or 76KB compressed . Swift apps are tiny! It may have been quick to build but the combined set of features is something I've wanted for years . I usually use Keynote for presentations, but sometimes I like to mix things up by presenting using a sequence of web pages. 
I do this by loading up a browser window with a tab for each page, then clicking through those tabs in turn while I talk. This works great, but comes with a very scary disadvantage: if the browser crashes I've just lost my entire deck! I always have the URLs in a notes file, so I can click back to that and launch them all manually if I need to, but it's not something I'd like to deal with in the middle of a talk. This was my starting prompt: Build a SwiftUI app for giving presentations where every slide is a URL. The app starts as a window with a webview on the right and a UI on the left for adding, removing and reordering the sequence of URLs. Then you click Play in a menu and the app goes full screen and the left and right keys switch between URLs That produced a plan. You can see the transcript that implemented that plan here. In Present a talk is an ordered sequence of URLs, with a sidebar UI for adding, removing and reordering those URLs. That's the entirety of the editing experience. When you select the "Play" option in the menu (or hit Cmd+Shift+P) the app switches to full screen mode. Left and right arrow keys navigate back and forth, and you can bump the font size up and down or scroll the page if you need to. Hit Escape when you're done. Crucially, Present saves your URLs automatically any time you make a change. If the app crashes you can start it back up again and restore your presentation state. You can also save presentations as a file (literally a newline-delimited sequence of URLs) and load them back up again later. Getting the initial app working took so little time that I decided to get more ambitious. It's neat having a remote control for a presentation...
So I prompted: Add a web server which listens on 0.0.0.0:9123 - the web server serves a single mobile-friendly page with prominent left and right buttons - clicking those buttons switches the slide left and right - there is also a button to start presentation mode or stop depending on the mode it is in. I have Tailscale on my laptop and my phone, which means I don't have to worry about Wi-Fi networks blocking access between the two devices. My phone can access it directly from anywhere in the world and control the presentation running on my laptop. It took a few more iterative prompts to get to the final interface, which looked like this: There's a slide indicator at the top, prev and next buttons, a nice big "Start" button and buttons for adjusting the font size. The most complex feature is that thin bar next to the start button. That's a touch-enabled scroll bar - you can slide your finger up and down on it to scroll the currently visible web page up and down on the screen. It's very clunky but it works just well enough to solve the problem of a page loading with most interesting content below the fold. I'd already pushed the code to GitHub (with a big "This app was vibe coded [...] I make no promises other than it worked on my machine!" disclaimer) when I realized I should probably take a look at the code. I used this as an opportunity to document a recent pattern I've been using: asking the model to present a linear walkthrough of the entire codebase. Here's the resulting Linear walkthroughs pattern in my ongoing Agentic Engineering Patterns guide, including the prompt I used. The resulting walkthrough document is genuinely useful. It turns out Claude Code decided to implement the web server for the remote control feature using socket programming without a library! Using GET requests for state changes like that opens up some fun CSRF vulnerabilities. For this particular application I don't really care.
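The app's actual parser isn't reproduced above, but the core idea of no-library HTTP routing is tiny: read the request line, split out the method and path, and dispatch on them. Here's a hedged, illustrative sketch (in Python rather than Swift, and not the app's real code; the /next and /prev endpoint names are made up):

```python
def parse_request(raw: bytes) -> tuple[str, str]:
    """Extract (method, path) from a raw HTTP/1.x request.

    A real server would also read headers and handle partial reads;
    routing a tiny remote control only needs the request line.
    """
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    method, path, _version = request_line.split(" ", 2)
    return method, path

def route(raw: bytes) -> str:
    method, path = parse_request(raw)
    # Dispatching state changes on GET is exactly what opens the CSRF
    # hole mentioned above.
    if method == "GET" and path == "/next":
        return "next slide"
    if method == "GET" and path == "/prev":
        return "previous slide"
    return "404"

print(route(b"GET /next HTTP/1.1\r\nHost: laptop:9123\r\n\r\n"))  # next slide
```

For a single-user remote control on a Tailscale-only interface, this level of parsing is arguably all you need; anything internet-facing would want a real HTTP library.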
Vibe coding stories like this are ten a penny these days. I think this one is worth sharing for a few reasons: Swift, a language I don't know, was absolutely the right choice here. I wanted a full screen app that embedded web content and could be controlled over the network. Swift had everything I needed. When I finally did look at the code it was simple, straightforward and did exactly what I needed and not an inch more. This solved a real problem for me. I've always wanted a good way to serve a presentation as a sequence of pages, and now I have exactly that. This doesn't mean native Mac developers are obsolete. I still used a whole bunch of my own accumulated technical knowledge (and the fact that I'd already installed Xcode and the like) to get this result, and someone who knew what they were doing could have built a far better solution in the same amount of time. It's a neat illustration of how those of us with software engineering experience can expand our horizons in fun and interesting directions. I'm no longer afraid of Swift! Next time I need a small, personal macOS app I know that it's achievable with our existing set of tools.

Kev Quirk Yesterday

Introducing Pure Comments (and Pure Commons)

A few weeks ago I introduced Pure Blog, a simple PHP-based blogging platform that I've since moved to, and I'm very happy with it. Once Pure Blog was done, I shifted my focus to improving my commenting system. I ended that post by saying: At this point it's battle tested and working great. However, there's still some rough edges in the code, and security could definitely be improved. So over the next few weeks I'll be doing that, at which point I'll probably release it to the public so you too can have comments on your blog, if you want them. I've now finished that work and I'm ready to release Pure Comments to the world. 🎉 I'm really happy with how Pure Comments has turned out; it slots in perfectly with Pure Blog, which got me thinking about creating a broader suite of apps under the Pure umbrella. I've had Simple.css since 2022, and now I've added Pure Blog and Pure Comments to the fold. So I decided I needed an umbrella to house these disparate projects. That's where Pure Commons comes in. My vision for Pure Commons is to build it into a suite of simple, privacy-focussed tools that are easy to self-host, and have just what you need and no more. Concurrently with working on Pure Comments, I've also started building a fully managed version that people will be able to use for a small monthly fee. That's about 60% done at this point, so I should be releasing it over the next few weeks. In the future I plan to add a managed version of Pure Blog too, but that will be far more complex than a managed version of Pure Comments, so I think that will take some time. I'm also looking at creating Pure Guestbook, which will obviously be a simple, self-hosted guestbook in the same vein as the other Pure apps. This should be relatively simple to build, as a guestbook is basically a simplified commenting system, so most of the code already exists in Pure Comments. Looking beyond Pure Guestbook I have some other ideas, but you will have to wait and see...
In the meantime, please take a look at Pure Comments - download the source code, take it for a spin, and provide any feedback/bugs you find. If you have any ideas for apps I could add to the Pure Commons family, please get in touch. Thanks for reading this post via RSS. RSS is ace, and so are you. ❤️ You can reply to this post by email, or leave a comment.

Martin Fowler Yesterday

Fragments: February 25

I don’t tend to post links to videos here, as I can’t stand watching videos to learn about things. But some talks are worth a watch, and I do suggest this overview on how organizations are currently using AI by Laura Tacho. There are various nuggets of data from her work with DX:

- 92.6% of devs are using AI assistants
- devs reckon it’s saving them 4 hours per week
- 27% of code is written by AI without significant human intervention
- AI cuts onboarding time by half

These are interesting numbers, but most of them are averages, and those who know me know I teach people to be suspicious of averages. Laura knows this too: average doesn’t mean typical… there is no typical experience with AI. Different companies (and teams within companies) are having very different experiences. Often AI is an amplifier of an organization’s practices, for good or ill. Organizational performance is multidimensional, and these organizations are just going off into different extremes based on what they were doing before. AI is an accelerator, it’s a multiplier, and it is moving organizations off in different directions. (08:52) Some organizations are facing twice as many customer incidents, but others are facing half.

❄ ❄ ❄ ❄ ❄

Rachel Laycock (Thoughtworks CTO) shares her reflections on our recent Future of Software Engineering retreat in Utah, which covered topics such as:

- We need to address cognitive load
- The staff engineer role is changing
- What happens to code reviews?
- Agent Topologies
- What exactly does AI mean for programming languages?
- Self-healing systems

On the latter: One of the most interesting and perhaps immediately applicable ideas was the concept of an ‘agent subconscious’, in which agents are informed by a comprehensive knowledge graph of post mortems and incident data. This particularly excites me because I’ve seen many production issues solved by the latent knowledge of those in leadership positions. The constant challenge comes from what happens when those people aren’t available or involved.

❄ ❄ ❄ ❄ ❄

Simon Willison (one of my most reliable sources for information about LLMs and programming) is starting a series of Agentic Engineering Patterns: I think of vibe coding using its original definition of coding where you pay no attention to the code at all, which today is often associated with non-programmers using LLMs to write code. Agentic Engineering represents the other end of the scale: professional software engineers using coding agents to improve and accelerate their work by amplifying their existing expertise. He’s intending this to be closer to evergreen material, as opposed to the day-to-day writing he does (extremely well) on his blog. One of the first patterns is Red/Green TDD: This turns out to be a fantastic fit for coding agents. A significant risk with coding agents is that they might write code that doesn’t work, or build code that is unnecessary and never gets used, or both. Test-first development helps protect against both of these common mistakes, and also ensures a robust automated test suite that protects against future regressions.

❄ ❄ ❄ ❄ ❄

Aaron Erickson is one of those technologists with good judgment who I listen to a lot: As much fun as people are having with OpenClaw, I think the days of “here is my agent with access to all my stuff” are numbered. Fine scoped agents who can read email and cleanse it before it reaches the agentic OODA loop that acts on it, policy agents (a claw with a job called “VP of NO” to money being spent) You structure your agents like you would a company. Insert friction where you want decisions to be slow and the cost of being wrong is high, reduce friction where you want decisions to be fast and the cost of being wrong is trivial or zero. I’ve posted here a lot about security concerns with agents. Right now I think this notion of fine-scoped agents is the most promising direction.

Last year Korny Sietsma wrote about how to mitigate agentic AI security risks. His advice included splitting the tasks, so that no agent has access to all parts of the Lethal Trifecta: This approach is an application of a more general security habit: follow the Principle of Least Privilege. Splitting the work, and giving each sub-task a minimum of privilege, reduces the scope for a rogue LLM to cause problems, just as we would do when working with corruptible humans. This is not only more secure, it is also increasingly a way people are encouraged to work. It’s too big a topic to cover here, but it’s a good idea to split LLM work into small stages, as the LLM works much better when its context isn’t too big. Dividing your tasks into “Think, Research, Plan, Act” keeps context down, especially if “Act” can be chunked into a number of small independent and testable chunks.

❄ ❄ ❄ ❄ ❄

Doonesbury outlines the opportunity for aging writers like myself. (Currently I’m still writing my words the old fashioned way.)

❄ ❄ ❄ ❄ ❄

An interesting story someone told me. They were at a swimming pool with their child, who looked at a photo on a poster advertising an event there and said “that’s AI”. Initially the parents didn’t think it was, but looking carefully spotted a tell-tale six fingers. They concluded that fresher biological neural networks are being trained to quickly recognize AI.

❄ ❄ ❄ ❄ ❄

I carefully curate my social media streams, following only feeds where I can control whose posts are picked up. In times gone by, editors of newspapers and magazines would do a similar job. But many users of social media are faced with a tsunami of stuff, much of it ugly, and don’t have the tools to control it.

A few days ago I saw an Instagram reel of a young woman talking about how she had been raped six years ago, struggled with thoughts of suicide afterwards, but managed to rebuild her life again. Among the comments – the majority of which were from men – were things like “Well at least you had some”, “No way, she’s unrapeable”, “Hope you didn’t talk this much when it happened”, “Bro could have picked a better option.” Reading those comments, which had thousands of likes and many boys agreeing with them, made me feel sick. My tendencies are towards free speech, and I try not to be a Free Speech Poseur, but the deluge of ugly material on the internet isn’t getting any better. The people running these platforms seem to be “tackling” this problem by putting their heads in the sand and hoping it won’t hurt them. It is hurting their users.

Ahead of AI Yesterday

A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026

If you have struggled a bit to keep up with open-weight model releases this month, this article should catch you up on the main themes. I will walk you through the ten main releases in chronological order, with a focus on the architecture similarities and differences:

1. Arcee AI’s Trinity Large (Jan 27, 2026)
2. Moonshot AI’s Kimi K2.5 (Jan 27, 2026)
3. StepFun Step 3.5 Flash (Feb 1, 2026)
4. Qwen3-Coder-Next (Feb 3, 2026)
5. z.AI’s GLM-5 (Feb 12, 2026)
6. MiniMax M2.5 (Feb 12, 2026)
7. Nanbeige 4.1 3B (Feb 13, 2026)
8. Qwen3.5 (Feb 15, 2026)
9. Ant Group’s Ling 2.5 1T & Ring 2.5 1T (Feb 16, 2026)
10. Cohere’s Tiny Aya (Feb 17, 2026)

(PS: DeepSeek V4 will be added once released.) Since there’s a lot of ground to cover, I will be referencing my previous The Big LLM Architecture Comparison article for certain technical topics (like Mixture-of-Experts, QK-Norm, Multi-head Latent Attention, etc.) throughout, to provide background information without repeating it here. On January 27, Arcee AI (a company I hadn’t had on my radar up to then) began releasing versions of their open-weight 400B Trinity Large LLM on the model hub, along with two smaller variants. Their flagship model is a 400B-parameter Mixture-of-Experts (MoE) with 13B active parameters. The two smaller variants are Trinity Mini (26B with 3B active parameters) and Trinity Nano (6B with 1B active parameters). Figure 1: Overview of the Trinity Large architecture (based on the model hub config file). Along with the model weights, Arcee AI also released a nice technical report on GitHub (as of Feb 18 also on arXiv) with lots of details. So, let’s take a closer look at the 400B flagship model. Figure 2 below compares it to z.AI’s GLM-4.5, which, at 355B parameters, is perhaps the most similar model in terms of size. Figure 2: Arcee AI Trinity Large next to GLM-4.5 of a relatively similar size (400B vs 355B).
As we can see in the Trinity and GLM-4.5 comparison, there are several interesting architectural components added to the Trinity model. First, there are the alternating local (sliding window) and global attention layers, as in Gemma 3, Olmo 3, Xiaomi MiMo, etc. In short, sliding window attention (SWA) is a type of sparse (local) attention pattern where each token attends only to a fixed-size window of the t most recent tokens (for example, 4096) instead of attending to the entire input (which could be up to n = 256,000 tokens). This reduces the per-layer attention cost from O(n²) to roughly O(n · t) for sequence length n, which is why it is attractive for long-context models. Figure 3: A comparison between regular attention (global attention) and sliding window attention (local attention). But instead of using the common 5:1 local:global ratio that Gemma 3 and Xiaomi used, the Arcee team opted for a 3:1 ratio similar to Olmo 3, and a relatively large sliding window size of 4096 (also similar to Olmo 3). The architecture also uses QK-Norm, a technique that applies RMSNorm to the queries and keys to stabilize training (as shown in Figure 4 below), as well as no positional embeddings (NoPE) in the global attention layers, similar to SmolLM3. Trinity also has a form of gated attention. It’s not a full-blown Gated DeltaNet, but it uses a similar gating to the attention mechanism in Qwen3-Next. That is, the Trinity team modified the standard attention by adding elementwise gating to the scaled dot-product output before the output linear projection (as shown in the figure below), which reduces attention sinks and improves long-sequence generalization. Additionally, it also helped with training stability. Figure 4: Illustration of the gating mechanism that Trinity Large uses in the attention mechanism.
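To make these mechanisms concrete, here is a deliberately tiny pure-Python sketch that combines a causal sliding-window mask, QK-Norm, and an elementwise sigmoid output gate in a single attention step. This is my own toy illustration of the ideas described above (one head, no learned projections or RMSNorm gains), not the actual Trinity implementation:

```python
import math

def rms_norm(v, eps=1e-6):
    # RMSNorm without a learned gain: v / sqrt(mean(v_i^2))
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_swa_attention(q, k, v, gate, window):
    """Causal sliding-window attention with QK-Norm and an elementwise
    sigmoid output gate applied before the (omitted) output projection.
    q, k, v, gate: lists of n per-token vectors of equal dimension d."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        qi = rms_norm(q[i])                                 # QK-Norm on the query
        lo = max(0, i - window + 1)                         # local causal window
        keys = [rms_norm(k[j]) for j in range(lo, i + 1)]   # QK-Norm on the keys
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in keys]
        w = softmax(scores)
        ctx = [sum(w[t] * v[lo + t][c] for t in range(len(w))) for c in range(d)]
        # elementwise gating of the attention result (Qwen3-Next-style gate)
        out.append([sigmoid(g) * c for g, c in zip(gate[i], ctx)])
    return out
```

With window=1 each token can only attend to itself, and with a large positive gate the output reduces to the token’s own value vector, which is an easy way to sanity-check the masking and gating logic.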
Also, the Trinity technical report showed that the modeling performance of the Trinity Large and GLM-4.5 base models is practically identical. (I assume they didn’t compare it to more recent base models because many companies only share their fine-tuned models these days.) You may have noticed the use of four (instead of two) RMSNorm layers in the previous Trinity Large architecture figure, which looks similar to Gemma 3 at first glance. Figure 5: Arcee Trinity and Gemma 3 RMSNorm placement side by side. Overall, the RMSNorm placement looks Gemma 3-like, but the twist here is that the gain of the second RMSNorm (in each block) is depth-scaled, meaning it’s initialized to about 1 / sqrt(L) (with L the total number of layers). So, early in training, the residual update starts small and grows as the model learns the right scale. Figure 6: Arcee Trinity and DeepSeek V3/R1 MoE side by side. The MoE is a DeepSeek-like MoE with lots of small experts, but the Arcee team made it coarser, as that helps with inference throughput (something we have also seen in Mistral 3 Large when it adopted the DeepSeek V3 architecture). Lastly, there are some interesting details on the training improvements (a new MoE load-balancing strategy and the use of the MuOpt optimizer), but since this is mainly an architecture article (and there are many more open-weight LLMs to cover), these details are out of scope. While Arcee Trinity essentially matched the modeling performance of the older GLM-4.5 model, Kimi K2.5 is an open-weight model that set a new open-weight performance ceiling at the time of its release on Jan 27. Impressively, according to their own benchmarks in their detailed technical report, it was on par with the leading proprietary models at the time of its release. Figure 7: Kimi K2.5 performance benchmark from the official K2.5 technical report.
The good modeling performance is no surprise when compared to, e.g., the Arcee Trinity or GLM-4.5 models covered earlier, since (like its K2 predecessor) Kimi K2.5 is a 1-trillion-parameter model, and thus 2.5x larger than Trinity and 2.8x larger than GLM-4.5. Overall, the Kimi K2.5 architecture is similar to Kimi K2, which, in turn, is a scaled-up version of the DeepSeek V3 architecture. Figure 8: Kimi K2 is a larger version of the DeepSeek V3 architecture. However, K2 was a pure text model, whereas Kimi K2.5 is a multimodal model with vision support. To quote from the technical report: “Kimi K2.5 is a native multimodal model built upon Kimi K2 through large-scale joint pre-training on approximately 15 trillion mixed visual and text tokens.” During training, they adopted an early fusion approach and passed in the vision tokens early on alongside the text tokens, as I discussed in my older Understanding Multimodal LLMs article. Figure 9: Like most other contemporary multimodal LLMs, Kimi K2.5 uses method A, passing the vision tokens alongside the text tokens during training. Side note: In multimodal papers, “early fusion” is unfortunately overloaded. It can mean either:

1. When the model sees vision tokens during pre-training, i.e., vision tokens are mixed in from the start (or very early stages) of pre-training as opposed to later stages.
2. How the image tokens are combined in the model, i.e., they are fed as embedded tokens alongside the text tokens.

In this case, while the term “early fusion” in the report specifically refers to point 1 (when the vision tokens are provided during pre-training), point 2 is also true here. Furthermore, regarding point 1, the researchers included an interesting ablation study showing that the model benefits from seeing vision tokens early in pre-training, as shown in the annotated table below.
Figure 10: Given a fixed number of vision tokens during training, the model performance benefits if the model is shown a smaller number of vision tokens early on during pre-training (as opposed to adding a higher number of vision tokens later on). Annotated table from the Kimi K2.5 technical report. I have to admit that I hadn’t had the Step models on my radar before. This one caught my attention due to its interesting size, detailed technical report, and fast tokens/sec performance. Step 3.5 Flash is a 196B parameter model that is more than 3x smaller than the recent DeepSeek V3.2 model (671B) while being slightly ahead on modeling performance benchmarks. According to the Step team, Step 3.5 Flash achieves a throughput of 100 tokens/sec at a 128k context length, whereas DeepSeek V3.2 reaches only 33 tokens/sec on Hopper GPUs, according to the data on the Step model hub page. Figure 11: Step 3.5 Flash benchmark from the Step technical report. One reason for the higher throughput is the model’s smaller size (a 196B-parameter MoE with 11B parameters active per token versus a 671B-parameter MoE with 37B active), as shown in the figure below. Figure 12: Step 3.5 Flash and DeepSeek V3.2 side by side. The other reason, along with gated attention (which we previously discussed in the context of Trinity), is Multi-Token Prediction (MTP). DeepSeek has been an early adopter of multi-token prediction, a technique that trains the LLM to predict multiple future tokens at each step rather than a single one. Here, at each position t, small extra heads (linear layers) output logits for tokens t+1...t+k, and we sum the cross-entropy losses for these offsets (in the MTP paper, the researchers recommended k=4). This additional signal speeds up training, while inference can still generate one token at a time, as illustrated in the figure below. Figure 13: Multi-Token Prediction versus regular next-token prediction. (Left subfigure inspired by the MTP paper.)
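As a toy illustration of the MTP training signal described above (my own simplification; real implementations share a trunk and use richer prediction heads), each of the k extra heads contributes one cross-entropy term for its offset, and the terms are summed:

```python
import math

def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed stably via log-sum-exp
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mtp_loss(head_logits, tokens, t, k):
    """Sum the cross-entropy losses of k extra prediction heads at position t.
    head_logits[j] holds the logits of the head predicting token t+1+j.
    (A toy sketch of the multi-token-prediction objective, not the
    DeepSeek/Step training code.)"""
    return sum(cross_entropy(head_logits[j], tokens[t + 1 + j]) for j in range(k))
```

With k=1 this collapses to ordinary next-token prediction plus one extra loss term; at inference time the extra heads can simply be dropped (or reused for speculative decoding).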
Originally, MTP was only used during training, not inference; hence, the inference time steps (bottom) show a single next-token prediction. DeepSeek V3 reported using MTP-1, that is, MTP with 1 extra token (instead of 3) during training, while making MTP optional during inference. Step 3.5 Flash, in contrast, uses MTP with 3 additional tokens (MTP-3) during both training and inference, which makes it an exception to the training-only norm. Note that the previously discussed Arcee Trinity and Kimi K2.5 do not use MTP, but other architectures already use an MTP-3 setup similar to Step 3.5 Flash, for example, GLM-4.7 and MiniMax M2.1. In early February 2026, the Qwen3 team shared the 80B Qwen3-Coder-Next model (3B parameters active), which made big headlines for outperforming much larger models like DeepSeek V3.2 (37B active), Kimi K2.5, and GLM-4.7 (both 32B active) on coding tasks. Figure 14: Qwen3-Coder-Next performance on a coding benchmark next to other popular coding models; this figure appeared in the official technical report. Moreover, as shown in the benchmark figure above, the Qwen3-Coder-Next SWE-Bench Pro performance is roughly on par with Claude Sonnet 4.5 (and only slightly below Claude Opus 4.5), which is impressive for a relatively small open-weight model! Using the ollama version of Qwen3-Coder-Next locally, the model takes about 48.2 GB of storage space and 51 GB of RAM. Figure 15: Running Qwen3-Coder-Next locally. Note that the architecture behind Qwen3-Coder-Next is exactly the same as Qwen3-Next 80B (in fact, the pre-trained Qwen3-Next 80B is used as the base model for further mid- and post-training). Figure 16 below shows the Qwen3-Next architecture next to a regular Qwen3 235B model for reference. Figure 16: Qwen3-Coder-Next 80B (3B parameters active per token) and the 3x larger Qwen3 235B-A22B architecture.
The Qwen3-Next architecture stands out because, despite being 3x smaller than the previous 235B-A22B model, it introduces four times as many experts and even adds a shared expert. Both of these design choices (a high expert count and the inclusion of a shared expert) are in the spirit of the DeepSeek-style MoE designs discussed earlier. The other highlight is that they replace the regular attention mechanism with a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the 235B-A22B model supported 32k natively and 131k with YaRN scaling). So how does this new attention hybrid work? Grouped-query attention (GQA) is still standard scaled dot-product attention: it shares K/V across query-head groups to cut KV-cache size and memory bandwidth, but its decode cost and cache still grow with sequence length. The hybrid mechanism instead mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio, as shown in Figure 17. Figure 17: The Qwen3-Coder-Next attention hybrid setup. We can think of the gated attention block as the standard scaled dot-product attention used in GQA, with a few tweaks on top. The main differences between a gated attention block and a plain GQA block are: an output gate (sigmoid-controlled, usually per-channel) that scales the attention result before it is added back to the residual; zero-centered RMSNorm for QK-Norm, rather than a standard RMSNorm; and partial RoPE (applied to only a subset of dimensions). Note that these are essentially just stability changes to GQA. The Gated DeltaNet is a more significant change. In the DeltaNet block, q, k, v, and two gates (α, β) are produced by linear and lightweight convolutional layers with normalization, and the layer replaces attention with a fast-weight delta rule update. However, the tradeoff is that DeltaNet offers less precise content-based retrieval than full attention, which is why one gated attention layer remains for every three DeltaNet layers.
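To make the “fast-weight delta rule update” concrete, here is a minimal sketch (my own toy version; the production Gated DeltaNet adds gating, convolutions, normalization, and chunked parallel kernels). The memory S is a small matrix updated as S ← S + β(v − Sk)kᵀ, which is algebraically the same as S(I − βkkᵀ) + βvkᵀ; reading is just y = Sq:

```python
def delta_step(S, k, v, beta):
    """One fast-weight delta-rule update: S <- S + beta * (v - S k) k^T.
    S is a d_v x d_k matrix (the tiny fast-weight memory)."""
    d_v, d_k = len(S), len(S[0])
    # current recall for key k: (S k), the value the memory would return
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [[S[i][j] + beta * (v[i] - Sk[i]) * k[j] for j in range(d_k)]
            for i in range(d_v)]

def read(S, q):
    # read the memory with a query: y = S q
    return [sum(S[i][j] * q[j] for j in range(len(q))) for i in range(len(S))]

# Demo: write [3, 4] under key [1, 0], then overwrite it with [7, 8].
S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_step(S, k=[1.0, 0.0], v=[3.0, 4.0], beta=1.0)
S = delta_step(S, k=[1.0, 0.0], v=[7.0, 8.0], beta=1.0)
recalled = read(S, [1.0, 0.0])  # [7.0, 8.0]: the old association was overwritten
```

The delta rule is error-correcting: writing a new value under an existing key replaces the old association, whereas a purely additive Hebbian update (S += β·v·kᵀ) would have accumulated both writes and returned [10.0, 12.0] here.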
Given that attention cost grows quadratically with sequence length, the DeltaNet component was added to help with memory efficiency. In the “linear-time, cache-free” family, the DeltaNet block is essentially an alternative to Mamba. Mamba keeps a state with a learned state-space filter (essentially a dynamic convolution over time). DeltaNet keeps a tiny, fast-weight memory updated with α and β, and reads it with q, using small convolutions only to help form q, k, v, α, and β. For more details on the attention hybrid and the Qwen3-Next architecture, please see my previous article Beyond Standard LLMs. Since this article is primarily focused on LLM architectures, the training details are outside its scope. However, interested readers can find more information in their detailed technical report on GitHub. The GLM-5 release on February 12 was a big deal because, at the time of its release, it appeared to be on par with the major flagship LLM offerings, including GPT-5.2 extra-high, Gemini Pro 3, and Claude 4.6 Opus. (That said, benchmark performance does not necessarily translate to real-world performance.) Figure 18: GLM-5 architecture next to its GLM-4.7 predecessor. Benchmarks at the bottom taken from the official GLM-5 technical report. Not too long ago, GLM-4.7 (December 2025) was one of the strongest open-weight models. GLM-5 shows a major modeling performance improvement based on the benchmark shown in Figure 18 above. That jump is likely partly due to improvements in the training pipeline, but probably owes more to the 2x larger parameter count, which grew from 355B parameters in GLM-4.7 to 744B parameters in GLM-5. This size increase places GLM-5 between DeepSeek V3.2 (671B) and Kimi K2.5 (1T) in terms of scale. Compared to the previously discussed Kimi K2.5 (1T), the smaller GLM-5 (744B) model seems slightly ahead on the benchmark numbers, as shown in the table below. Figure 19: GLM-5 (744B) and Kimi K2.5 (1T) benchmark performance side by side (larger is better).
Like GLM-4.7 and all the other models discussed so far, GLM-5 is a Mixture-of-Experts model. The number of active parameters per token increases only slightly, from 32B in GLM-4.7 to 40B in GLM-5. As shown in Figure 20 below, GLM-5 now adopts DeepSeek’s multi-head latent attention as well as DeepSeek Sparse Attention. (I described DeepSeek Sparse Attention in more detail in From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates.) These modifications are likely intended to reduce inference costs when working with long contexts. Otherwise, the overall architecture remains relatively similar. Figure 20: GLM-5 and DeepSeek V3.2 side by side (two similar architectures at a similar size). The increase in total size over GLM-4.7 mainly comes from expanding the number of experts, from 160 (GLM-4.7) to 256 (GLM-5), and slightly increasing the layer dimensions (while keeping the number of active experts per token the same, at 8 regular + 1 shared expert). For example, the embedding dimension and expert size increase from 5,120 to 6,144, and the intermediate projection size rises from 1,536 to 2,048. Interestingly, the number of transformer layers is reduced from 92 in GLM-4.7 to 78 in GLM-5. I assume this change is also intended to reduce inference costs and improve latency, since layer depth cannot be parallelized in the same way as width. Additionally, I checked an independent benchmark (here, the hallucination leaderboard), and it indeed looks like GLM-5 is on par with Opus 4.5 and GPT-5.2 (while using fewer tokens). Figure 21: Next to the overall benchmark performance, this table adds hallucination rates from the hallucination leaderboard. Furthermore, looking at the most recent Artificial Intelligence Index, which aggregates various benchmarks, GLM-5 is indeed slightly ahead of Kimi K2.5 and only one point behind GPT-5.2 (xhigh) and the recent Claude Sonnet 4.6. Figure 22: Artificial Intelligence Index snapshot from Feb 21, 2026.
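As a generic sketch of the routing arithmetic described above (my own simplified top-k router, not GLM-5’s actual implementation, which also involves load-balancing terms):

```python
import math

def route(router_logits, top_k):
    """Pick the top_k routed experts for one token and renormalize their
    gate weights with a softmax over the selected logits. (A shared
    expert, when present, is applied to every token on top of these.)"""
    idx = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])[:top_k]
    m = max(router_logits[i] for i in idx)
    exps = [math.exp(router_logits[i] - m) for i in idx]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(idx, exps)]

# Example: 4 experts, pick the top 2; their weights are renormalized to sum to 1.
picked = route([0.1, 2.0, -1.0, 3.0], top_k=2)
```

With 256 routed experts and 8 routed + 1 shared expert active, each token only runs about 9 of 257 expert MLPs, which is how the total parameter count can grow substantially while active parameters stay near 40B.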
The aforementioned GLM-5 and Kimi K2.5 are popular open-weight models, but according to OpenRouter statistics, they pale in comparison to MiniMax M2.5, which was also released on February 12. Figure 23: OpenRouter usage snapshot from Feb 21, 2026. OpenRouter is a platform and API that lets developers access and route requests across many different LLMs from various providers. Note that while its usage statistics are a good indicator of open-weight model popularity, they are heavily biased towards open-weight models (versus proprietary models), since most users access proprietary models through the official platforms directly. There is also usage bias across open-weight models, since many people use open-weight models through the official developers’ APIs. Still, it can be an interesting place to guesstimate the relative popularity of open-weight models that are too large for most users to run locally. Now, back to MiniMax M2.5. Pulling together the GLM-5 numbers from the SWE-Bench Verified coding benchmark and combining them with the reported MiniMax M2.5 numbers, the latter appears to be a slightly stronger model (at least when it comes to coding). Figure 24: MiniMax M2.5 coding performance on SWE-Bench Verified. Side note: It’s interesting to see Opus 4.5 and Opus 4.6 scoring practically identically on SWE-Bench Verified. This could be an indicator that LLM progress has stalled. I don’t think that’s true, though, given that users of Opus 4.6 can confirm that this model does seem to perform better in real-world usage. So, the more likely issue here is that the SWE-Bench Verified benchmark has saturated, and it may no longer be a meaningful benchmark to report going forward (in favor of other benchmarks like SWE-Bench Pro, for example). By saturated, I mean that it potentially contains unsolvable problems due to design issues (as discussed in a recent Reddit thread and the new “Why SWE-bench Verified no longer measures frontier coding capabilities” article by OpenAI).
Anyways, back to the topic of MiniMax M2.5 performance. Looking across a broader selection of benchmarks, according to the Artificial Intelligence Index aggregation, GLM-5 remains ahead. This is perhaps no surprise, because GLM-5 is still a much larger model than M2.5 (744B vs 230B total parameters), even though the tokens/sec throughput is quite similar. Figure 25: GLM-5 vs MiniMax M2.5 comparison based on the Artificial Intelligence Index (Feb 21, 2026). I think MiniMax M2.5’s popularity is partly owed to the fact that it is a smaller, cheaper model with roughly similar modeling performance (i.e., good bang for the buck). Architecture-wise, MiniMax M2.5 is a 230B model with a fairly classic design: just plain Grouped Query Attention, with no sliding window attention or other efficiency modifications. Figure 26: MiniMax M2.5 next to GLM-5. So far, this is also the first model covered here that doesn’t come with a detailed technical report, but you can find additional information on the model hub page. In this section, we are switching gears and finally covering a smaller model that can run locally on a laptop. But first, let’s establish some context before we get to Nanbeige 4.1 3B. Qwen models have always been very popular. I often tell the story that when I was an advisor during the NeurIPS LLM efficiency challenge a few years back, most of the winning solutions were based on a Qwen model. Today, Qwen3 is likely among the most widely used open-weight model suites, since it covers such a wide range of sizes and use cases (from 0.6B to 235B). The smaller models (80B and below, like the previously covered Qwen3-Next) are especially great for local use on consumer hardware. Figure 27: Relative adoption popularity of open-weight models. Note that this shows the number of models on the Hugging Face model hub that are finetuned using one of those models as a base model. (This is not the number of people who use the models on their computer locally, which would be a number impossible to know.)
Source: Atom Project. The reason I am mentioning all this is that Nanbeige 4.1 3B seems to target the “small” on-device LLM use case that Qwen3 is so popular for. According to the Nanbeige 4.1 3B benchmarks, their model is way ahead of Qwen3 (perhaps no surprise, given that Qwen3 is almost a year old). Figure 28: Nanbeige 4.1 3B benchmark comparison with Qwen3 (Source: Nanbeige 4.1 3B model hub page). Architecture-wise, Nanbeige 4.1 3B is similar to Qwen3 4B, which is, in turn, very similar to Llama 3.2 3B. I am showing Nanbeige 4.1 3B next to Llama 3.2 3B below because it is the most similar in size. Figure 29: Nanbeige 4.1 3B next to Llama 3.2 3B. Nanbeige 4.1 3B uses the same architectural components as Llama 3.2 3B, with some minor scaling differences (slightly smaller embedding dimensions, larger intermediate projections, and so on). The one difference not shown in the figure above is that Nanbeige does not tie the input embedding weights to the output layer weights, whereas Llama 3.2 3B does. (In my experience, weight tying is a nice way to reduce the total number of parameters, but it almost always results in worse training performance, as evidenced by higher training and validation losses.) As mentioned before, this article focuses primarily on architecture comparisons. In this case, most of the performance gains (compared to the Nanbeige 4 3B predecessor) come from additional post-training with supervised fine-tuning and reinforcement learning; interested readers can find more information in the detailed technical report. While the previous section briefly covered Qwen3 as perhaps the most popular open-weight model family, it is getting a bit long in the tooth, as its release was almost a year ago (if we don’t count the Qwen3-Next variants geared towards efficiency). However, the Qwen team just released a new Qwen3.5 model variant on February 15.
Qwen3.5 397B-A17B, a Mixture-of-Experts (MoE) with 397B parameters (17B active per token), is a step up from the largest Qwen3 model, which is 235B parameters in size. (There is also the 1 trillion-parameter Qwen3-Max model, but it was never released as an open-weight model.) The obligatory benchmark overview shows that Qwen3.5 exceeds the previous Qwen3-Max model across the board, with a much stronger focus on agentic terminal coding applications (the main theme this year). Qwen3.5 appears to be roughly on par with GLM-5 and MiniMax M2.5 in terms of pure agentic coding performance (e.g., SWE-Bench Verified).​ Figure 30: Qwen3.5 benchmark overview from the official model hub page . Since the Qwen team likes to release a separate coding model (e.g., see Qwen3-Coder-Next, which we discussed previously), this makes me curious to see how a potential Qwen3.5-Coder will perform. Architecture-wise, Qwen3.5 adopts the hybrid attention model (featuring Gated DeltaNet) that Qwen3-Next and Qwen3-Coder-Next (section 4) used. This is interesting because Qwen3-Next models were initially an alternative to the full-attention Qwen3 models, but this suggests that the Qwen team has now adopted the hybrid attention mechanism into its main line of models. Figure 31: Comparison between Qwen3.5 and the Qwen3(-Coder)-Next architectures.​ Besides scaling up the model size, as shown in the figure above, Qwen3.5 now also includes multimodal support (previously, it was only available in separate Qwen3-VL models). Anyways, Qwen3.5 is a nice refresh of the Qwen series, and I hope that we will see smaller Qwen3.5 variants in the future, too! Edit: Just as I finalized this article, the Qwen team launched said smaller model variants: Qwen3.5-27B Qwen3.5-35B-A3B Qwen3.5-122B-A10B Ling 2.5 (and the reasoning variant Ring 2.5 ) are 1-trillion-parameter LLMs with a hybrid attention architecture in a similar spirit to Qwen3.5 and Qwen3-Next. 
However, instead of Gated DeltaNet, they use a slightly simpler recurrent linear attention variant called Lightning Attention. In addition, Ling 2.5 adopts the Multi-Head Latent Attention (MLA) mechanism from DeepSeek. Figure 32: Ling 2.5 compared to Qwen3.5; both architectures are linear attention hybrids. Ling 2.5 is not the strongest model in terms of absolute benchmark performance, but its selling point is very good efficiency in long contexts (due to the hybrid attention). Unfortunately, there are no direct comparisons to Qwen3.5, but compared to Kimi K2 (1T parameters; the same size as Ling 2.5), Ling 2.5 achieves a 3.5x higher throughput at a sequence length of 32k tokens. Figure 33: Relative throughput of Ling 2.5 compared to Kimi K2 (same 1 trillion parameter size); note that the throughput is normalized so that Kimi K2 is shown at 1x (Kimi’s throughput is not linear even though it appears linear in this plot). Source: Ling 2.5 model hub page . Released on February 17, Tiny Aya is a new, “small” LLM by Cohere that is said to be the “most capable multilingual open-weight model” at the 3B parameter size class. (Tiny Aya outperforms Qwen3-4B, Gemma 3 4B, and Ministral 3 3B according to the announcement post ). This is a great model to run and experiment with locally. The only caveat is that while it’s an open-weight model, its licensing terms are relatively restricted and only allow non-commercial use. That aside, Aya is a 3.35B parameter model that comes in several flavors that are useful for personal and (non-commercial) research use: tiny-aya-base (base model) tiny-aya-global (best balance across languages and regions) tiny-aya-fire (optimized for South Asian languages) tiny-aya-water (optimized for European and Asia Pacific languages) tiny-aya-earth (optimized for West Asian and African languages) More specifically, below is a list of languages the models are optimized for. Figure 34: Languages supported by the various Aya models. 
Architecture-wise, Tiny Aya is a classic decoder-style transformer with a few noteworthy modifications (besides the obvious ones like SwiGLU and Grouped Query Attention), as illustrated in the figure below. Figure 35: Tiny Aya (featuring a parallel transformer block) and Qwen3 4B side by side. Overall, the most noteworthy highlight in this architecture is the parallel transformer blocks. Here, the parallel transformer block computes attention and an MLP from the same normalized input, then adds both to the residual in a single step. I assume this is to reduce serial dependencies inside a layer to improve computational throughput. For those readers familiar with Cohere’s Command-A architecture, Tiny Aya seems to be a smaller version of it. Also, an interesting detail is that the Tiny Aya team dropped QK-Norm (an RMSNorm applied to keys and queries inside the attention mechanism); QK-Norm has become quite standard for improving training stability in terms of reducing loss spikes. According to a developer on the Cohere team, QK-Norm was dropped “since it can interact with long context performance.” ​As you may know, I occasionally code architectures from scratch. Since I found the parallel transformer block quite intriguing and the model runs fine on low-end hardware, I implemented it from scratch (for educational purposes), which you can find here on GitHub . Figure 36: Tiny Aya from-scratch implementation . This article was quite the whirlwind tour covering the main open-weight LLM releases around February 2026. If there is a takeaway from this, it’s that there are various model architectures (all derived from the original GPT model) that work well. Modeling performance is likely not attributed to the architecture design itself but rather the dataset quality and training recipes (a good topic for a separate article). 
That said, architectural design remains an essential part of building a successful LLM, and many developers seem to be steering towards adding more and more computational performance tweaks. For example, this includes adapting MLA (Kimi K2.5, GLM-5, Ling 2.5) and DeepSeek Sparse Attention (GLM-5) to continue the Gated DeltaNet (Qwen3.5) or similar forms of linear attention (Ling 2.5). Figure 37: Attention types used by the various architectures mentioned in this article. Also, more classic efficiency tweaks like grouped query attention and sliding window attention (Arcee Trinity, Step 3.5 Flash, Tiny Aya) remain popular. Among the new releases, only MiniMax M2.5 and Nanbeige 4.1 stayed very classic here, using only Grouped Query Attention without any other efficiency tweak. DeepSeek V4 is the model everyone is waiting for. Unfortunately, as of this writing, it hasn’t been released yet. However, I plan to add it to this article once it’s released, which is likely on or before the first week of March. Another interesting model is Sarvam (30B & 100B) from India. The model was recently announced, but it hasn’t been released yet. Stay tuned for an update here as well. This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider a subscription or purchasing a copy of my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! 
Arcee AI’s Trinity Large (Jan 27, 2026)
Moonshot AI’s Kimi K2.5 (Jan 27, 2026)
StepFun’s Step 3.5 Flash (Feb 1, 2026)
Qwen3-Coder-Next (Feb 3, 2026)
z.AI’s GLM-5 (Feb 12, 2026)
MiniMax M2.5 (Feb 12, 2026)
Nanbeige 4.1 3B (Feb 13, 2026)
Qwen3.5 (Feb 15, 2026)
Ant Group’s Ling 2.5 1T & Ring 2.5 1T (Feb 16, 2026)
Cohere’s Tiny Aya (Feb 17, 2026)

1. Arcee AI’s Trinity Large

Arcee AI’s flagship model is a 400B-parameter Mixture-of-Experts (MoE) with 13B active parameters. The two smaller variants are Trinity Mini (26B with 3B active parameters) and Trinity Nano (6B with 1B active parameters).

Figure 1: Overview of the Trinity Large architecture (based on the model hub config file).

Along with the model weights, Arcee AI also released a nice technical report on GitHub (as of Feb 18 also on arXiv) with lots of details. So, let’s take a closer look at the 400B flagship model. Figure 2 below compares it to z.AI’s GLM-4.5, which is perhaps the most similar model due to its size of 355B parameters.

Figure 2: Arcee AI Trinity Large next to GLM-4.5 of a relatively similar size (400B vs 355B).

As we can see in the Trinity and GLM-4.5 comparison, there are several interesting architectural components added to the Trinity model. First, there are the alternating local:global (sliding window) attention layers (SWA), as in Gemma 3, Olmo 3, Xiaomi MiMo, etc. In short, SWA is a type of sparse (local) attention pattern where each token attends only to a fixed-size window of the t most recent tokens (for example, 4096) instead of attending to the entire input (which could be up to n = 256,000 tokens). This reduces the per-layer attention cost from O(n²) to roughly O(n·t) for sequence length n, which is why it is attractive for long-context models.

Figure 3: A comparison between regular attention (global attention) and sliding window attention (local attention).
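To make the cost argument concrete, here is a tiny Python sketch (illustrative only; the function names and the toy window size are my own, not from the Trinity code) that counts how many query–key pairs get scored under causal global attention versus a causal sliding window:

```python
# Sketch: which positions each query may attend to under causal
# global attention vs. a causal sliding window of size w.
# (Window size is 4096 in Trinity; 4 here for readability.)

def global_mask(n):
    # token i attends to all tokens 0..i
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, w):
    # token i attends only to the last w tokens: i-w+1..i
    return [[i - w < j <= i for j in range(n)] for i in range(n)]

def attended(mask):
    # total number of (query, key) pairs that get scored
    return sum(sum(row) for row in mask)

n, w = 8, 4
full = attended(global_mask(n))              # grows as n*(n+1)/2, i.e. O(n^2)
local = attended(sliding_window_mask(n, w))  # capped at w pairs per row, i.e. O(n*w)
```

With n = 8 and w = 4, global attention scores 36 pairs while the windowed variant scores 26, and the gap widens quadratically as n grows while the windowed cost stays linear in n.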
But instead of using the common 5:1 local:global ratio that Gemma 3 and Xiaomi MiMo used, the Arcee team opted for a 3:1 ratio similar to Olmo 3, and a relatively large sliding window size of 4096 (also similar to Olmo 3).

The architecture also uses QK-Norm, a technique that applies RMSNorm to the keys and queries to stabilize training (as shown in Figure 4 below), as well as no positional embeddings (NoPE) in the global attention layers, similar to SmolLM3.

Trinity also has a form of gated attention. It’s not a full-blown Gated DeltaNet, but it uses a gating similar to the attention mechanism in Qwen3-Next. That is, the Trinity team modified standard attention by adding elementwise gating to the scaled dot-product output before the final linear projection (as shown in the figure below), which reduces attention sinks and improves long-sequence generalization. Additionally, it also helped with training stability.

Figure 4: Illustration of the gating mechanism that Trinity Large uses in the attention mechanism.

Also, the Trinity technical report showed that the modeling performance of the Trinity Large and GLM-4.5 base models is practically identical. (I assume they didn’t compare it to more recent base models because many companies only share their fine-tuned models these days.)

You may have noticed the use of four (instead of two) RMSNorm layers in the previous Trinity Large architecture figure, which looks similar to Gemma 3 at first glance.

Figure 5: Arcee Trinity and Gemma 3 RMSNorm placement side by side.

Overall, the RMSNorm placement looks Gemma 3-like, but the twist here is that the gain of the second RMSNorm (in each block) is depth-scaled, meaning it’s initialized to about 1/sqrt(L) (with L the total number of layers). So, early in training, each residual update starts small and grows as the model learns the right scale.

Figure 6: Arcee Trinity and DeepSeek V3/R1 MoE side by side.
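The depth-scaled initialization idea can be sketched in a few lines of Python (a toy illustration under my own naming and dimensions, not Arcee’s actual implementation):

```python
import math

# Sketch of an RMSNorm whose gain is initialized depth-scaled (~1/sqrt(L)),
# so each block's residual update starts small early in training.
# Names, dims, and values are made up for illustration.

class RMSNorm:
    def __init__(self, dim, init_gain=1.0, eps=1e-6):
        self.gain = [init_gain] * dim  # learnable parameters in a real model
        self.eps = eps

    def __call__(self, x):
        rms = math.sqrt(sum(v * v for v in x) / len(x) + self.eps)
        return [g * v / rms for g, v in zip(self.gain, x)]

L = 64  # total number of transformer blocks
post_norm = RMSNorm(dim=4, init_gain=1.0 / math.sqrt(L))

# rms of this input is 2, so outputs start near +/- 1/8 instead of +/- 1
out = post_norm([2.0, -2.0, 2.0, -2.0])
```

Since the gain is learnable, the model is free to grow each block’s contribution back up during training; only the starting point is scaled down with depth.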
The MoE is a DeepSeek-like MoE with lots of small experts, but the Trinity team made it coarser, as that helps with inference throughput (something we have also seen in Mistral 3 Large when it adopted the DeepSeek V3 architecture). Lastly, there are some interesting details on the training improvements (a new MoE load-balancing strategy and the use of the MuOpt optimizer), but since this is mainly an architecture article (and there are many more open-weight LLMs to cover), these details are out of scope.

2. Moonshot AI’s Kimi K2.5: A DeepSeek-Like Model at a 1-Trillion-Parameter Scale

While Arcee Trinity essentially matched the modeling performance of the older GLM-4.5 model, Kimi K2.5 is an open-weight model that set a new open-weight performance ceiling at the time of its release on Jan 27. Impressively, according to the benchmarks in its detailed technical report, it was on par with the leading proprietary models at the time of its release.

Figure 7: Kimi K2.5 performance benchmark from the official K2.5 technical report.

The good modeling performance is no surprise when compared to, e.g., Arcee Trinity or GLM-4.5 covered earlier, since (similar to its K2 predecessor) Kimi K2.5 is a 1-trillion-parameter model and thus 2.5x larger than Trinity and 2.8x larger than GLM-4.5. Overall, the Kimi K2.5 architecture is similar to Kimi K2, which, in turn, is a scaled-up version of the DeepSeek V3 architecture.

Figure 8: Kimi K2 is a larger version of the DeepSeek V3 architecture.

However, K2 was a pure text model, whereas Kimi K2.5 is now a multimodal model with vision support. To quote from the technical report:

> Kimi K2.5 is a native multimodal model built upon Kimi K2 through large-scale joint pre-training on approximately 15 trillion mixed visual and text tokens.

During training, they adopted an early fusion approach and passed in the vision tokens early on alongside the text tokens, as I discussed in my older Understanding Multimodal LLMs article.
Figure 9: Like most other contemporary multimodal LLMs, Kimi K2.5 uses method A, passing the vision tokens alongside the text tokens during training.

Side note: In multimodal papers, “early fusion” is unfortunately overloaded. It can mean either:

1. When the model sees vision tokens during pre-training, i.e., vision tokens are mixed in from the start (or very early) of pre-training as opposed to later stages.
2. How the image tokens are combined in the model, i.e., they are fed as embedded tokens alongside the text tokens.

In this case, while the term “early fusion” in the report specifically refers to point 1 (when the vision tokens are provided during pre-training), point 2 is also true here.

Furthermore, regarding point 1, the researchers included an interesting ablation study showing that the model benefits from seeing vision tokens early in pre-training, as shown in the annotated table below.

Figure 10: Given a fixed number of vision tokens during training, model performance benefits if the model is shown a smaller number of vision tokens early on during pre-training (as opposed to adding a higher number of vision tokens later on). Annotated table from the Kimi K2.5 technical report.

3. StepFun’s Step 3.5 Flash: Good Performance at Great Tokens/Sec Throughput

I have to admit that I hadn’t had the Step models on my radar until now. This one caught my attention due to its interesting size, detailed technical report, and fast tokens/sec performance. Step 3.5 Flash is a 196B-parameter model that is more than 3x smaller than the recent DeepSeek V3.2 model (671B) while being slightly ahead in modeling performance benchmarks. According to the data on the Step model hub page, Step 3.5 Flash has a 100 tokens/sec throughput at a 128k context length, whereas DeepSeek V3.2 reaches only 33 tokens/sec on Hopper GPUs.

Figure 11: Step 3.5 Flash benchmark from the Step technical report.
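Fusion in sense 2 can be illustrated with a toy sketch (the dimensions, projection, and function names below are invented for this example, not taken from the Kimi K2.5 report): image patch features are projected to the text embedding width and simply concatenated with the text token embeddings before entering the transformer.

```python
# Toy sketch of "early fusion" in sense 2: vision patch embeddings are
# projected into the text embedding space and placed alongside the text
# token embeddings in one input sequence. All values are illustrative.

def embed_text(token_ids, table):
    # look up each text token's embedding vector
    return [table[t] for t in token_ids]

def project_patches(patch_feats, proj_cols):
    # map each vision-encoder feature vector to the LLM's embedding width
    return [[sum(f * w for f, w in zip(feat, col)) for col in proj_cols]
            for feat in patch_feats]

emb_table = {0: [0.1, 0.0, 0.0], 1: [0.0, 0.2, 0.0]}   # 2 text tokens, width 3
proj_cols = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]       # 2 -> 3 projection (columns)

text = embed_text([0, 1], emb_table)
vision = project_patches([[1.0, 2.0]], proj_cols)       # one image patch

# one mixed sequence: the transformer treats both kinds of tokens uniformly
sequence = vision + text
```

After this point the transformer makes no distinction between the two modalities; every position is just an embedding vector in the same space.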
One reason for this higher throughput is the model’s smaller size (a 196B-parameter MoE with 11B parameters active per token versus a 671B-parameter MoE with 37B active), as shown in the figure below.

Figure 12: Step 3.5 Flash and DeepSeek V3.2 side by side.

The other reason, along with gated attention (which we previously discussed in the context of Trinity), is Multi-Token Prediction (MTP). DeepSeek was an early adopter of multi-token prediction, a technique that trains the LLM to predict multiple future tokens at each step rather than a single one. Here, at each position t, small extra heads (linear layers) output logits for t+1...t+k, and we sum the cross-entropy losses for these offsets (in the MTP paper, the researchers recommended k=4). This additional signal speeds up training, while inference can remain one-token-at-a-time generation, as illustrated in the figure below.

Figure 13: Multi-Token Prediction versus regular next-token prediction. (Left subfigure inspired by the MTP paper.) Originally, MTP was only used during training, not inference; hence, the inference time steps (bottom) show a single next-token prediction.

DeepSeek V3 reported using MTP-1, that is, MTP with 1 extra token (instead of 3) during training, and then making MTP optional during inference. Step 3.5 Flash uses MTP with 3 additional tokens (MTP-3) during both training and inference (note that MTP is usually not used during inference, and this is an exception). Note that the previously discussed Arcee Trinity and Kimi K2.5 do not use MTP, but other architectures already use an MTP-3 setup similar to Step 3.5 Flash, for example, GLM-4.7 and MiniMax M2.1.

4. Qwen3-Coder-Next: An Attention-Hybrid for Coding

In early February 2026, the Qwen team shared the 80B Qwen3-Coder-Next model (3B parameters active), which made big headlines for outperforming much larger models like DeepSeek V3.2 (37B active) as well as Kimi K2.5 and GLM-4.7 (both 32B active) on coding tasks.
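The summed-loss structure of MTP can be sketched as follows (a toy example: probabilities of the correct tokens are given directly instead of computing logits, and the function names are mine, not from any MTP implementation):

```python
import math

# Sketch of the multi-token-prediction training loss: at each position t,
# k small heads predict tokens t+1..t+k, and their per-offset
# cross-entropy losses are summed into one training signal.

def cross_entropy(prob_of_target):
    # standard CE for a single prediction, given the prob of the true token
    return -math.log(prob_of_target)

def mtp_loss(head_probs, seq_len, k):
    """head_probs[offset][t] = predicted prob of the true token at t+offset+1."""
    total = 0.0
    for offset in range(k):                       # heads for t+1 ... t+k
        for t in range(seq_len - (offset + 1)):   # positions with a valid target
            total += cross_entropy(head_probs[offset][t])
    return total

# toy predicted probabilities of the correct tokens, per head,
# for a sequence of 5 tokens
head_probs = [
    [0.9, 0.8, 0.9, 0.7],  # t+1 head: 4 valid predictions
    [0.5, 0.6, 0.5],       # t+2 head: 3 valid predictions
]
loss = mtp_loss(head_probs, seq_len=5, k=2)
```

Note how later offsets naturally have fewer valid positions; the extra heads only add loss terms, so with k=1 this reduces to the ordinary next-token objective.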
Figure 14: Qwen3-Coder-Next performance on a coding benchmark next to other popular coding models; this figure appeared in the official technical report.

Moreover, as shown in the benchmark figure above, the Qwen3-Coder-Next SWE-Bench Pro performance is roughly on par with Claude Sonnet 4.5 (and only slightly below Claude Opus 4.5), which is impressive for a relatively small open-weight model!

Using the ollama version of Qwen3-Coder-Next locally, the model takes about 48.2 GB of storage space and 51 GB of RAM.

Figure 15: Running Qwen3-Coder-Next locally.

Note that the architecture behind Qwen3-Coder-Next is exactly the same as Qwen3-Next 80B (in fact, the pre-trained Qwen3-Next 80B is used as the base model for further mid- and post-training). Figure 16 below shows the Qwen3-Next architecture next to a regular Qwen3 235B model for reference.

Figure 16: Qwen3-Coder-Next 80B (3B parameters active per token) and the 3x larger Qwen3 235B-A22B architecture.

The Qwen3-Next architecture stands out because, despite being 3x smaller than the previous 235B-A22B model, it introduces four times as many experts and even adds a shared expert. Both of these design choices (a high expert count and the inclusion of a shared expert) echo the DeepSeek-style MoE design mentioned earlier.

The other highlight is that they replace the regular attention mechanism with a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k-token context length in terms of memory usage (the 235B-A22B model supported 32k natively and 131k with YaRN scaling).

So how does this new attention hybrid work? Compared to grouped-query attention (GQA), which is still standard scaled dot-product attention (sharing K/V across query-head groups to cut KV-cache size and memory bandwidth, as discussed earlier, but whose decode cost and cache still grow with sequence length), their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio, as shown in Figure 17.

Figure 17: The Qwen3-Coder-Next attention hybrid setup.
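The elementwise output gating used in these gated attention blocks can be sketched in a few lines (a toy, hypothetical example; the weights, shapes, and names are made up): a sigmoid gate computed from the block input scales the attention result per channel before it is added back to the residual.

```python
import math

# Sketch of the output gate in gated attention: a per-channel sigmoid
# gate, derived from the block input, scales the attention result
# before the residual addition. All values are illustrative.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_output(attn_out, x, gate_cols):
    # gate = sigmoid(x @ W_gate), applied elementwise over channels
    gate = [sigmoid(sum(xi * w for xi, w in zip(x, col))) for col in gate_cols]
    return [g * a for g, a in zip(gate, attn_out)]

x = [1.0, -1.0]          # block input for one token (2 channels)
attn_out = [0.5, 2.0]    # attention result for that token
gate_cols = [[10.0, 0.0], [0.0, 10.0]]  # columns of a toy W_gate

gated = gated_output(attn_out, x, gate_cols)
# channel 0: sigmoid(10) ~ 1, passes through; channel 1: sigmoid(-10) ~ 0, suppressed
```

The gate lets the model dampen an attention head’s contribution for a given token, which is the mechanism credited with reducing attention sinks.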
We can think of the gated attention block as the standard scaled-dot-product attention used in GQA, with a few tweaks on top. The main differences between a gated attention block and a plain GQA block are:

an output gate (sigmoid-controlled, usually per-channel) that scales the attention result before it is added back to the residual;
a zero-centered RMSNorm for QK-Norm, rather than a standard RMSNorm;
partial RoPE (on a subset of dimensions).

5. z.AI’s GLM-5

Figure 18: GLM-5 architecture next to its GLM-4.7 predecessor. Benchmarks at the bottom taken from the official GLM-5 technical report.

Not too long ago, GLM-4.7 (December 2025) was one of the strongest open-weight models. GLM-5 shows a major modeling performance improvement based on the benchmark shown in Figure 18 above. That jump is likely partly due to improvements to the training pipeline, but probably largely attributable to its 2x larger parameter count, from 355B parameters in GLM-4.7 to 744B parameters in GLM-5. This size increase now places GLM-5 between DeepSeek V3.2 (671B) and Kimi K2.5 (1T) in terms of scale.

Comparing the benchmark numbers with the previously discussed Kimi K2.5 (1T), the smaller GLM-5 (744B) model seems slightly ahead, as shown in the table below.

Figure 19: GLM-5 (744B) and Kimi K2.5 (1T) benchmark performance side by side (larger is better).

Like GLM-4.7 and all the other models discussed so far, GLM-5 is a Mixture-of-Experts model. The number of active parameters per token increases only slightly, from 32B in GLM-4.7 to 40B in GLM-5. As shown in Figure 20 below, GLM-5 now adopts DeepSeek’s Multi-Head Latent Attention as well as DeepSeek Sparse Attention. (I described DeepSeek Sparse Attention in more detail in From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates.) These modifications are likely intended to reduce inference costs when working with long contexts. Otherwise, the overall architecture remains relatively similar.
Figure 20: GLM-5 and DeepSeek V3.2 side by side (two similar architectures at a similar size).

The increase in total size over GLM-4.7 mainly comes from expanding the number of experts, from 160 (GLM-4.7) to 256 (GLM-5), and slightly increasing layer dimensions (while keeping the number of experts activated per token the same, at 8 regular + 1 shared expert). For example, the embedding dimension and expert size increase from 5,120 to 6,144, and the intermediate projection size rises from 1,536 to 2,048. Interestingly, the number of transformer layers is reduced from 92 in GLM-4.7 to 78 in GLM-5. I assume this change is also intended to reduce inference costs and improve latency, since layer depth cannot be parallelized in the same way as width.

Additionally, I also checked an independent benchmark (here, the hallucination leaderboard), and it indeed looks like GLM-5 is on par with Opus 4.5 and GPT-5.2 (while using fewer tokens).

Figure 21: Next to the overall benchmark performance, this table adds hallucination rates from the hallucination leaderboard.

Furthermore, looking at the most recent Artificial Intelligence Index, which aggregates various benchmarks, GLM-5 is indeed slightly ahead of Kimi K2.5 and only one point behind GPT-5.2 (xhigh) and the recent Claude Sonnet 4.6.

Figure 22: Artificial Intelligence Index snapshot from Feb 21, 2026.

6. MiniMax M2.5: A Strong Coder with “Only” 230B Parameters

The aforementioned GLM-5 and Kimi K2.5 are popular open-weight models, but according to OpenRouter statistics, they pale in comparison to MiniMax M2.5, which was released on February 12 as well.

Figure 23: OpenRouter usage snapshot from Feb 21, 2026.

OpenRouter is a platform and API that lets developers access and route requests across many different LLMs from various providers.
Note that while its usage statistics are a good indicator of open-weight model popularity, OpenRouter is heavily biased towards open-weight models (versus proprietary models), since most users access proprietary models through the official platforms directly. There is also usage bias across open-weight models, since many people use open-weight models through the official developers’ APIs. Anyways, it can still be an interesting place to guesstimate the relative popularity of open-weight models that are too large for most users to run locally.

Now, back to MiniMax M2.5. Pulling the GLM-5 numbers from the SWE-Bench Verified coding benchmark and combining them with the reported MiniMax M2.5 results, the latter appears to be a slightly stronger model (at least when it comes to coding).

Figure 24: MiniMax M2.5 coding performance on SWE-Bench Verified.

Side note: It’s interesting to see Opus 4.5 and Opus 4.6 scoring practically identically on SWE-Bench Verified. This could be an indicator that LLM progress has stalled. I don’t think that’s true, though, given that users of Opus 4.6 confirm that the model does seem to perform better in real-world usage. So, the more likely issue here is that the SWE-Bench Verified benchmark has saturated, and it may no longer be a meaningful benchmark to report going forward (in favor of other benchmarks like SWE-Bench Pro, for example). By saturated, I mean that it potentially contains unsolvable problems due to design issues (as discussed in a recent Reddit thread and the new “Why SWE-bench Verified no longer measures frontier coding capabilities” article by OpenAI).

Anyways, back to the topic of MiniMax M2.5 performance. Looking across a broader selection of benchmarks, according to the Artificial Intelligence Index aggregation, GLM-5 remains ahead. This is perhaps no surprise, because GLM-5 is still a much larger model than M2.5, even though the tokens/sec throughput is quite similar.
Figure 25: GLM-5 vs MiniMax M2.5 comparison based on the Artificial Intelligence Index (Feb 21, 2026).

I think MiniMax M2.5’s popularity is partly owed to the fact that it is a smaller, cheaper model with roughly similar modeling performance (i.e., a good bang for the buck). Architecture-wise, MiniMax M2.5 is a 230B model with a fairly classic design: just plain Grouped Query Attention, no sliding window attention or other efficiency improvements.

Figure 26: MiniMax M2.5 next to GLM-5.

So far, this is also the first architecture in this article that doesn’t come with a detailed technical report, but you can find additional information on the model hub page.

7. Nanbeige 4.1 3B: A Strong Llama 3 Successor

In this section, we are switching gears and finally covering a smaller model that can run locally on a laptop. But first, let’s start with some context before we get to Nanbeige 4.1 3B.

Qwen models have always been very popular. I often tell the story that when I was an advisor during the NeurIPS LLM efficiency challenge a few years back, most of the winning solutions were based on a Qwen model. Now, Qwen3 is likely among the most widely used open-weight model suites, since they cover such a wide range of sizes and use cases (from 0.6B to 235B). Especially the smaller models (80B and less, like Qwen3-Next, covered previously) are great for local use on consumer hardware.

Figure 27: Relative adoption popularity of open-weight models. Note that this shows the number of models on the Hugging Face model hub that are finetuned using one of those models as a base model. (This is not the number of people who use the models on their computer locally, which would be a number impossible to know.)
According to the Nanbeige 4.1 3B benchmarks, their model is way ahead of Qwen3 (perhaps no surprise, given that Qwen3 is almost a year old). Figure 28: Nanbeige 4.1 3B benchmark comparison with Qwen3 (Source: Nanbeige 4.1 3B model hub page ). Architecture-wise, Nanbeige 4.1 3B is similar to Qwen3 4B, which is, in turn, very similar to Llama 3.2 3B. I am showing Nanbeige 4.1 3B next to Llama 3.2 3B below because it is the most similar in size. Figure 29: Nanbeige 4.1 3B next to Llama 3.2 3B. Nanbeige 4.1 3B uses the same architectural components as Llama 3.2 3B, with some minor scaling differences (slightly smaller embedding dimensions and larger intermediate projections, and so on). The one difference not shown in the figure above is that Nanbeige does not tie the input embedding weights to the output layer weights, whereas Llama 3.2 3B does. (In my experience, weight tying is a nice way to reduce the total number of parameters, but it almost always results in worse training performance as evidenced by higher training and validation losses.) ​As mentioned before, this article focuses primarily on the architecture comparisons. And in this case, most of the performance gains (compared to the Nanbeige 4 3B predecessor) come from additional post-training with supervised fine-tuning and reinforcement learning, but interested readers can find more information in the detailed technical report . 8. Qwen3.5 and the Continutation of Hybrid Attention While the previous section briefly covered Qwen3 as the most open-weight model family, it is getting a bit long in the tooth as its release is almost a year ago (if we don’t count the Qwen3-Next variants geared towards efficiency). However, the Qwen team just released a new Qwen3.5 model variant on February 15. Qwen3.5 397B-A17B, a Mixture-of-Experts (MoE) with 397B parameters (17B active per token), is a step up from the largest Qwen3 model, which is 235B parameters in size. 
(There is also the 1-trillion-parameter Qwen3-Max model, but it was never released as an open-weight model.)

The obligatory benchmark overview shows that Qwen3.5 exceeds the previous Qwen3-Max model across the board, with a much stronger focus on agentic terminal coding applications (the main theme this year). Qwen3.5 appears to be roughly on par with GLM-5 and MiniMax M2.5 in terms of pure agentic coding performance (e.g., SWE-Bench Verified).

Figure 30: Qwen3.5 benchmark overview from the official model hub page.

Since the Qwen team likes to release a separate coding model (e.g., see Qwen3-Coder-Next, which we discussed previously), this makes me curious to see how a potential Qwen3.5-Coder will perform.

Architecture-wise, Qwen3.5 adopts the hybrid attention design (featuring Gated DeltaNet) that Qwen3-Next and Qwen3-Coder-Next (section 4) used. This is interesting because the Qwen3-Next models were initially an alternative to the full-attention Qwen3 models, but it suggests that the Qwen team has now adopted the hybrid attention mechanism into its main line of models.

Figure 31: Comparison between Qwen3.5 and the Qwen3(-Coder)-Next architectures.

Besides scaling up the model size, as shown in the figure above, Qwen3.5 now also includes multimodal support (previously, this was only available in the separate Qwen3-VL models). Anyways, Qwen3.5 is a nice refresh of the Qwen series, and I hope that we will see smaller Qwen3.5 variants in the future, too!

Edit: Just as I finalized this article, the Qwen team launched said smaller model variants:

Qwen3.5-27B
Qwen3.5-35B-A3B
Qwen3.5-122B-A10B

9. Ant Group’s Ling 2.5 and Ring 2.5

Ling 2.5 (and the reasoning variant Ring 2.5) are 1-trillion-parameter LLMs with a hybrid attention architecture in a similar spirit to Qwen3.5 and Qwen3-Next. However, instead of Gated DeltaNet, they use a slightly simpler recurrent linear attention variant called Lightning Attention. In addition, Ling 2.5 adopts the Multi-Head Latent Attention (MLA) mechanism from DeepSeek.

Figure 32: Ling 2.5 compared to Qwen3.5; both architectures are linear attention hybrids.

Ling 2.5 is not the strongest model in terms of absolute benchmark performance, but its selling point is very good efficiency in long contexts (due to the hybrid attention).
Unfortunately, there are no direct comparisons to Qwen3.5, but compared to Kimi K2 (1T parameters; the same size as Ling 2.5), Ling 2.5 achieves a 3.5x higher throughput at a sequence length of 32k tokens.

Figure 33: Relative throughput of Ling 2.5 compared to Kimi K2 (same 1-trillion-parameter size); note that the throughput is normalized so that Kimi K2 is shown at 1x (Kimi’s throughput is not linear even though it appears linear in this plot). Source: Ling 2.5 model hub page.

10. Tiny Aya: A 3.35B Model with Strong Multilingual Support

Released on February 17, Tiny Aya is a new, “small” LLM by Cohere that is said to be the “most capable multilingual open-weight model” in the 3B parameter size class. (Tiny Aya outperforms Qwen3-4B, Gemma 3 4B, and Ministral 3 3B, according to the announcement post.) This is a great model to run and experiment with locally. The only caveat is that while it’s an open-weight model, its licensing terms are relatively restrictive and only allow non-commercial use.

That aside, Tiny Aya is a 3.35B-parameter model that comes in several flavors that are useful for personal and (non-commercial) research use:

tiny-aya-base (base model)
tiny-aya-global (best balance across languages and regions)
tiny-aya-fire (optimized for South Asian languages)
tiny-aya-water (optimized for European and Asia Pacific languages)
tiny-aya-earth (optimized for West Asian and African languages)

More specifically, below is a list of languages the models are optimized for.

Figure 34: Languages supported by the various Aya models.

Architecture-wise, Tiny Aya is a classic decoder-style transformer with a few noteworthy modifications (besides the obvious ones like SwiGLU and Grouped Query Attention), as illustrated in the figure below.

Figure 35: Tiny Aya (featuring a parallel transformer block) and Qwen3 4B side by side.

Overall, the most noteworthy highlight in this architecture is the parallel transformer block, which computes attention and an MLP from the same normalized input and then adds both to the residual in a single step. I assume this is meant to reduce serial dependencies inside a layer and improve computational throughput. For those readers familiar with Cohere’s Command-A architecture, Tiny Aya seems to be a smaller version of it.

Also, an interesting detail is that the Tiny Aya team dropped QK-Norm (an RMSNorm applied to keys and queries inside the attention mechanism); QK-Norm has become quite standard for improving training stability in terms of reducing loss spikes. According to a developer on the Cohere team, QK-Norm was dropped “since it can interact with long context performance.”

As you may know, I occasionally code architectures from scratch. Since I found the parallel transformer block quite intriguing and the model runs fine on low-end hardware, I implemented it from scratch (for educational purposes), which you can find here on GitHub.

Figure 36: Tiny Aya from-scratch implementation.

This article was quite the whirlwind tour covering the main open-weight LLM releases around February 2026. If there is a takeaway from this, it’s that there are various model architectures (all derived from the original GPT model) that work well. Modeling performance is likely not attributable to the architecture design itself but rather to dataset quality and training recipes (a good topic for a separate article).

That said, architectural design remains an essential part of building a successful LLM, and many developers seem to be steering towards adding more and more computational performance tweaks. For example, this includes adopting MLA (Kimi K2.5, GLM-5, Ling 2.5) and DeepSeek Sparse Attention (GLM-5), or continuing with Gated DeltaNet (Qwen3.5) or similar forms of linear attention (Ling 2.5).

Figure 37: Attention types used by the various architectures mentioned in this article.

Also, more classic efficiency tweaks like grouped query attention and sliding window attention (Arcee Trinity, Step 3.5 Flash, Tiny Aya) remain popular. Among the new releases, only MiniMax M2.5 and Nanbeige 4.1 stayed very classic here, using only Grouped Query Attention without any other efficiency tweak.

DeepSeek V4 is the model everyone is waiting for. Unfortunately, as of this writing, it hasn’t been released yet. However, I plan to add it to this article once it’s released, which is likely on or before the first week of March. Another interesting model is Sarvam (30B & 100B) from India. The model was recently announced, but it hasn’t been released yet. Stay tuned for an update here as well.

This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider a subscription or purchasing a copy of my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I’m confident you’ll get a lot out of these; they explain how LLMs work in a depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research!

Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I’d really appreciate a brief review. It helps us authors a lot! Your support means a great deal! Thank you!

James Stanley Yesterday

Bot Forensics

Most threat intelligence bots are easy to fingerprint. And trying to be stealthy often makes it worse, because imperfect anti-detection methods have extra fingerprint surface area of their own. We run an instrumented honeypot site that collects data on what these bots do, and we've just released an Instant Bot Test so you can see whether we flag your bot without even having to talk to us first.

You may want to see my previous post on this topic for more context on what we're doing. Since that post we've sold a handful of reports, including to a couple of big names. And we now have a website at botforensics.com to advertise our services.

Anti-detection detection

One of the most interesting things we've learnt is that anti-detection techniques are very rarely successful in preventing your bot from being detected. Our collector site sees only a tiny minority (<0.1%) of sessions that could plausibly be real human users. Far from preventing a bot from being detected, anti-detection measures more often provide specific fingerprints about which bot it is, based on which measures are in use. Some of these measures take us from "we think this is probably a bot" to "this is bot XYZ operated by Foocorp", which is kind of an own goal.

If you're going to run a bot with anti-detection measures in place (and you should, otherwise you'll trivially look like Headless Chrome), then you should definitely get a Bot Audit to make sure you aren't leaking any extra signals.

The Puppeteer stealth evasions are a great example of this. Lots of bots are browsing with these evasions applied (we even see bots using them outside Puppeteer), but we can detect the evasions themselves, which often leak more signal than we would expect to see absent the evasions.

We do take a canvas fingerprint because why not, but it turns out to be quite hard to definitively say that a given canvas is a bot unless you have enough data on real user sessions to rule out the possibility that it is a real user.
While some people are very worried about canvas fingerprinting, a much stronger bot signal than the canvas fingerprint itself is if we read the pixel data out and it has random pixels in the wrong colour where it should be the same colour all over. And, worse, if we do the same thing twice in a row and get a different answer each time!

We noticed a bot operated by Microsoft that had some very specific identifying features, including references to some of their developers' real names. Microsoft have a fairly reputable bug bounty programme, so I tested the waters by reporting it on MSRC. But after sitting on it for 2 weeks they classified it as "not important" and declined to pay a bounty, so I won't make this mistake again. To Microsoft's credit, they have still not fixed it, which is consistent with considering it not important.

We are in some cases able to detect when bots are running on Kubernetes (thanks Feroz for the idea), and this also reveals some fingerprints that are unique to each Kubernetes cluster. This is a great signal because a) hardly any real human users are browsing from inside Kubernetes, and b) if 2 bots are running on the same Kubernetes cluster then it's a fair bet that they're operated by the same company. So far we have seen bots from 3 distinct Kubernetes clusters.

We've been surprised by how few threat intelligence vendors are running their own fetching. There are 94 vendors listed on VirusTotal, but fewer than 50 genuinely distinct bots fetch our collector pages, so at most only a bit over half of those vendors are actually fetching the sites themselves. The others may outsource their fetching to a common third party, or else they are simply consulting other threat intelligence vendors and not even doing classification themselves. If you looked at enough VirusTotal results pages you could probably work out which ones always share the same classification; maybe we should do that.
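The two canvas checks described above (stray wrong-colour pixels in a region that should be one solid colour, and differing answers across two consecutive reads) can be sketched roughly as follows. This is an illustrative toy, not Bot Forensics' actual detector, and the "reader" callables are hypothetical stand-ins for reading canvas pixel data out of a browser session:

```python
import itertools

def uniform_region_ok(pixels: bytes, channels: int = 3) -> bool:
    """A region drawn in a single solid colour should contain only that
    colour; stray off-colour pixels suggest injected noise."""
    first = pixels[:channels]
    return all(pixels[i:i + channels] == first
               for i in range(0, len(pixels), channels))

def reads_are_stable(read_pixels) -> bool:
    """Read the same canvas twice; per-read noise injection makes the
    two buffers differ, which is itself a strong bot signal."""
    return read_pixels() == read_pixels()

# Simulated readers: a clean canvas vs. one with per-read noise.
_calls = itertools.count()

def clean_reader() -> bytes:
    return b"\xff\x00\x00" * 16          # a solid red region

def noisy_reader() -> bytes:
    buf = bytearray(b"\xff\x00\x00" * 16)
    buf[next(_calls) % len(buf)] ^= 1    # flip one low bit per read
    return bytes(buf)

# The clean canvas is uniform and stable; the noisy one is neither.
```

The point of the second check is exactly what the post describes: anti-fingerprinting noise that changes on every read is far more distinctive than any particular fingerprint value.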
One of our domains is now blocked on VirusTotal by 7 different vendors. This is kind of a poor show. You can't classify a site as phishing just because it has "bank" in the domain and the page has a login form. The litmus test for whether a site is phishing is whether you can name the site it is impersonating, and our collector site doesn't impersonate any real site.

Vexatious takedowns

We received our first takedown notices last week. To be honest, I expected this to happen sooner. The whole project is running on "disposable" infrastructure so that if it gets taken down it won't impact any of our other projects. But it would still be very inconvenient to have it taken down.

The takedown notices were sent to our hosting provider, who forwarded them to us. It's possible they were also sent to our domain registrar, who did not forward them to us but also did not act on them. Here's the text from the first one:

Hello, We have discovered a Phishing attack on your network. URL: hxxps[:]// REDACTED / IP's: REDACTED Threat Type: Phishing Threat Description: Banking credential harvesting page detected at REDACTED . The page presents a fake bank login form with a header that references BotForensics Collector Page and botforensics .com, which indicates branding inconsistent with any legitimate bank . The site is hosted on REDACTED infrastructure (IP REDACTED ) and registered recently on 2026-02-17 via REDACTED , with privacy-protected WHOIS data . The HTML shows a typical login card for username and password, a Sign In” [sic] button, and scripted UI enhancements, including external scripts and images, plus a dynamic header bar . This combination is characteristic of a phishing attempt intended to harvest user credentials . The domain age is only about 0 .01 years, and the presence of a login form on a brand-tampering page hosted on a known hosting provider strongly suggests malicious intent .
Registrar abuse contact is abuse[@] REDACTED and hosting provider abuse contact is abuse[@] REDACTED . Because high confidence phishing has been detected, the page should be reported to abuse contacts and blocked; while there can be legitimate educational use of such content, the page as presented is designed to harvest credentials rather than serve legitimate banking functionality . Domain Registrar: REDACTED ASN: REDACTED This email was sent automatically by QuariShield Automated Analysis. Reports are sometimes verified using AI, while this means reports are mostly valid, there may be some false positives. For more info: REDACTED We are well aware that you may not be able to take abuse reports sent to this email address, therefore if you could forward this email to the correct team who can handle abuse reports, it would be much appreciated. Please note, replies to this email are logged, but aren't always seen, we don't usually monitor this email for replies. To contact us if you have any questions or concerns, please email [email protected] stating your Issue ID REDACTED Kind regards, QuariShield Cyber Security.

(Redactions mine, but yes the text is all run into one like that with no linebreaks.)

A few highlights stand out:

"The page presents a fake bank login form with a header that references BotForensics Collector Page and botforensics .com, which indicates branding inconsistent with any legitimate bank ."

One would think that having branding "inconsistent with any legitimate bank" is evidence that you're not phishing? A phishing site would copy the bank's branding.

"The HTML shows a typical login card for username and password, a Sign In” button, and scripted UI enhancements, including external scripts and images, plus a dynamic header bar . This combination is characteristic of a phishing attempt intended to harvest user credentials"

Is it really?

"hosted on a known hosting provider"

What are the chances?
"This email was sent automatically by QuariShield Automated Analysis. Reports are sometimes verified using AI"

Very interesting.

The takedown notices were sent by QuariShield. I emailed the QuariShield contact address and got a reply from the person operating it. He seems friendly, and has whitelisted my collector page, which is helpful but in my opinion only part of the solution. How many other false positive takedown notices is he going to send for other websites?

From what I have been able to gather, QuariShield grabs URLs from public sources and uses an LLM agent to classify them and automatically send takedowns. On the one hand, yeah, it's not working very well yet and has a lot of false positives. On the other hand, just look at how far we've come. If you're running a traditional takedown provider: this is what's coming for you. People are spinning up (presumed) vibe-coded projects that now do fully-automated takedowns for sites that aren't even paying customers.

Your anti-detection techniques may not be as effective as you think. Try our Instant Bot Test to see if we flag your bot (and please let us know how we did). And the lesson from QuariShield is: AI is coming for you.

iDiallo Yesterday

When access to knowledge is no longer the limitation

Let's do this thought experiment together. I have a little box. I'll place the box on the table. Now I'll open the little box and put all the arguments against large language models in it. I'll put all the arguments, including my own. Now, I'll close the box and leave it on the table.

Now that that is out of the way, we are left with all the positives. All the good things that come from having the world's information at our fingertips. I can ask any question and get an answer almost instantly. Well, not all questions. The East has its sensitivities around a certain square, and the West about a certain island, but I digress.

I can learn any subject I want to learn. I can take the work of any philosopher and ELI5 it. I can finally understand "The World as Will and Representation" by Schopenhauer. A friend gifted me a copy when I was still in my twenties; it's been steadily collecting dust ever since. But now I can turn to the book and ask questions until I thoroughly understand it. No need to read it cover to cover.

In fact, last year I decided I wanted to learn about batteries. I first went to the Battery University website and started to read lesson by lesson. But I had questions. How was I going to get them answered? The StackExchange network is not what it used to be, so I turned to ChatGPT. It had all the answers. I learned and read so much about batteries that I am tempted to start a battery company.

My twin boys are at that age where they suffer from the infinite WHYs. Why does it rain? Why does the earth spin? Why does California still use the Highway Gothic font on some freeway signs? I do not have answers to these questions off the top of my head, but I have access to the infinite knowledge machine, so of course my kids know the answers now.

Just the other day, I had a shower thought about cars. "Are cars just a slab of metal on wheels?" And now I learned that the answer is "essentially yes."
But then I kept reading on the subject and learned about all those little devices and pieces of mechanical technology that I had never heard of. For example, the sway bar link. Did you know about it? Did you know that it reduces body roll and maintains stability during turns? Fascinating.

Ever since LLMs made their public debut in 2022, we've been gifted with this knowledge base that we can interact with on demand, day and night, at work or at home. The possibilities seem endless. I can learn or understand any codebase without being familiar with the programming language. And yet it feels like something is missing. The more I access this knowledge, the more I feel the little box on my table is starting to open. Now this is just my opinion, but I'm starting to believe that the sum of all parts is still just one. Let me explain.

In 2022, the former Japanese Prime Minister Shinzo Abe was shot and killed. It came as a shock to me; Japan is not a country known for gun violence. So in December of that year, I decided to learn more about him, about Japan, and about their stance on guns. With the holiday season and the rolling code freeze at work, I spent a good amount of time just reading through Wikipedia, some translated Japanese forums, and some official documents. A whole lot of material. Long story short, I still don't have a definitive answer as to why exactly he was killed, but I came away with a richer understanding of the story and the perspectives of the people around him. Reading more material is not going to give me a definitive answer, but it helps paint a richer picture of the event. I spent enough time with the subject to appreciate the knowledge I gathered over those weeks.

When you ask ChatGPT why Shinzo Abe was shot, it will give you a satisfying answer. It will be correct, it will include some of the nuance, and it will probably ask you if you want to learn more. The answer satisfies your curiosity and you move on... to your next question.
It could be the chat interface. Even though the words on the page clearly ask you "if you want to know more," somehow you are more keen on starting a new subject. And rare are the times we go back and re-read the material we have been provided with. With the books I've "read" through an LLM by asking multiple questions, I can hardly tell you that I understand them. Yes, I know the gist of them, but it doesn't replace the knowledge you build by reading a book at a steady pace. You save a whole bunch of time by using an LLM, but the knowledge is fleeting. Reading original sources is slow, but you get to better immerse yourself in the subject. It seems like reading through an LLM removes the friction of learning, but in doing so it makes knowledge shallow and disposable.

The problem is the way we process information as humans. We don't become experts by learning from summaries. The effort of learning is part of the process. Those endless questions my children have: there is a snack-like quality to the answers I give them. Because the answers are so easy to get, we treat them like a social media feed. I scroll through, and one post is about batteries, the next is about sway bars, and somehow I land on California highways.

Having the world's information at your fingertips is a gift, but knowing the gist of everything is not the same as understanding something deeply. We do not form character by reading the gist of it. Instead, character comes from the hunt for information. The limitation of a manual process forces us to focus, to dwell on a subject, until we truly internalize it. You can hardly spot a hallucination unless it concerns material you already have knowledge in.

Wait a minute. What's happening here? Ah! I see. The box has crept back open.

Evan Schwartz Yesterday

Great RSS Feeds That Are Too Noisy to Read Manually

Some RSS feeds are fantastic but far too noisy to add to most RSS readers directly. Without serious filtering, you'd get swamped with more posts than you could possibly read, while missing the hidden gems. I built Scour specifically because I wanted to find the great articles I was missing in noisy feeds like these, without feeling like I was drowning in unread posts. If you want to try it, you can add all of these sources in one click. But these feeds are worth knowing about regardless of what reader you use.

Feed: https://hnrss.org/newest

Thousands of posts are submitted to Hacker News each week. While the front page gives a sense of what matches the tech zeitgeist, there are plenty of interesting posts that get buried simply because of the randomness of who happens to be reading the Newest page and voting in the ~20 minutes after posts are submitted. (You can try searching posts that were submitted but never made the front page in this demo I built into the Scour docs.)

Feed: https://feeds.pinboard.in/rss/recent/

Pinboard describes itself as "Social Bookmarking for Introverts". The recent page is a delightfully random collection of everything one of the 30,000+ users has bookmarked. Human-curated, without curation actually being the goal.

Feed: https://bearblog.dev/discover/feed/?newest=True

Bear is "A privacy-first, no-nonsense, super-fast blogging platform". This post is published on it, and I'm a big fan. The Discovery feed gives a snapshot of blogs that users have upvoted on the platform. But, even better than that, the Most Recent feed gives you every post published on it. There are lots of great articles, and plenty of blogs that are just getting started.

Feed: https://feedle.world/rss

Feedle is a search engine for blogs and podcasts. You can search for words or phrases among their curated collection of blogs, and every search can become an RSS feed. An empty search will give you a feed of every post published by any one of their blogs.
Feed: https://kagi.com/api/v1/smallweb/feed/

Kagi, the search engine, maintains an open-source list of around 30,000 "small web" websites: personal, non-commercial sites. Their Small Web browser lets you browse random posts one at a time. The RSS feed gives you every post published by any one of those websites.

Feed: https://threadreaderapp.com/rss.xml

Thread Reader is a Twitter/X bot that lets users "unroll" threads into an easier-to-read format. While getting RSS feeds out of Twitter/X content is notoriously difficult, Thread Reader provides an RSS feed of all the threads users have unrolled with it. Like the content on that platform, the threads are very hit-or-miss, but there are some gems in there.

Not an RSS feed: https://minifeed.net/global

Minifeed is a nice "curated blog reader and search engine". They have a Global page that shows every post published by one of the blogs they've indexed. While this isn't technically an RSS feed, I thought it deserved a mention. Note that Scour can add some websites that don't have RSS feeds. It treats pages with repeated structures that look like blogs (e.g. they have links, titles, and publish dates) as if they were RSS feeds. Minifeed's Global view is one such page, so you can also get every post published from any one of their collected blogs.

Feeds galore: https://info.arxiv.org/help/rss.html

arXiv has preprint academic articles for technical fields ranging from Computer Science and Mathematics to Physics and Quantitative Biology. Like many of the feeds listed above, most of the categories are very noisy. But, if you're into reading academic articles, there is also plenty of great new research hidden in the noise. Every field and sub-field has its own RSS feed. (You can browse them and subscribe on Scour here.)
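To give a sense of the filtering a high-volume feed like these needs, here is a toy keyword filter over already-parsed feed entries. This is a deliberately simple sketch, not how Scour actually ranks posts (Scour does interest-based scoring); the `filter_entries` helper and the dict fields are just stand-ins for what a typical RSS parser returns:

```python
def filter_entries(entries, keywords):
    """Keep entries whose title or summary mentions any keyword
    (case-insensitive substring match)."""
    keywords = [k.lower() for k in keywords]
    kept = []
    for e in entries:
        text = (e.get("title", "") + " " + e.get("summary", "")).lower()
        if any(k in text for k in keywords):
            kept.append(e)
    return kept

# Hypothetical entries, shaped like parsed RSS items:
entries = [
    {"title": "Show HN: A tiny Bloom filter library", "summary": "Rust, no_std"},
    {"title": "My sourdough journey", "summary": "week 3 of baking"},
]

filter_entries(entries, ["bloom filter", "lsm"])
# keeps only the first entry
```

Even this crude approach makes a firehose feed readable; the trade-off is that naive keyword matching misses the serendipitous finds that smarter filtering (or skimming) can surface.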
While reading my Scour feed, I'll often check which feeds an article I liked came from (see what this looks like here), and I'm especially delighted when it comes from some source I had no idea existed. These types of noisy feeds are great ways of discovering new content and new blogs, but you definitely need some good filters to make use of them. I hope you'll give Scour a try!

P.S. Scour makes all of the feeds it creates consumable as RSS/Atom/JSON feeds, so you can add your personalized feed or each of your interest-specific feeds to your favorite feed reader. Read more in this guide for RSS users.

Stratechery Yesterday

Xbox Replaces Head of Gaming, Xbox History, Whither Xbox

Xbox has a new head, who isn't a gamer; I suspect Microsoft is doing what it should have done a decade ago: get out of the console business.

Andy Bell Yesterday

I’m unsubscribing from the AI discourse

I'm bored of hearing about it, bored of seeing people I respect(ed) fawn over it, and it's really not doing my mental health any good. I'm going to wait for the inevitable bubble burst and then be ready to help people burned by this harmful technology via Set.studio and Piccalilli. We've recently made our viewpoint very clear over there anyway.

Filters and mutes in Bluesky and Feedbin will do the job nicely for me, I think, because I don't really listen to podcasts, I don't really watch YouTube, and I don't really bother with Mastodon or LinkedIn. I just wish I could mute words in Discord! Anyway, let's see if this helps my brain!


curl security moves again

tldr: curl goes back to Hackerone.

When we announced the end of the curl bug-bounty at the end of January 2026, we simultaneously moved over and started accepting curl security reports on GitHub instead of its previous platform. This move turns out to have been a mistake and we are now undoing that part of the decision.

The reward money is still gone. There is no bug-bounty, no money for vulnerability reports, but we return to accepting and handling curl vulnerability and security reports on Hackerone. Starting March 1st 2026, this is now (again) the official place to report security problems to the curl project.

This zig-zagging is unfortunate but we do it with the best of intentions. In the curl security team we were naively thinking that since so many projects are already using this setup, it should be good enough for us too, since we don't have any particular special requirements. We thought wrong. Now I instead question how other Open Source projects can use this. It feels like an area and use case for Open Source projects that is under-focused: proper, secure, and efficient vulnerability reporting without a bug-bounty.

To illustrate what we are looking for, I made a little list that should show that we're not looking for overly crazy things. Here is also a list of nits and missing features we fell over on GitHub that, had we figured them out ahead of time, possibly would have made us go about this a different way. This list might interest fellow maintainers having the same thoughts and ideas we had. I have provided this feedback to GitHub as well, to make sure they know.

Sure, we could switch to handling them all over email, but that also has its own set of challenges.

Since we dropped the bounty, the inflow tsunami has dried out substantially. Perhaps partly because of our switch over to GitHub?
Perhaps it just takes a while for all the sloptimists to figure out where to send the reports now, and perhaps by going back to Hackerone we again open the gates for them? We just have to see what happens.

We will keep iterating and tweaking the program, the settings, and the hosting providers going forward to improve: to make sure we ship a robust and secure set of products, and that the team doing so can do that.

If you suspect a security problem in curl or libcurl, report it here: https://hackerone.com/curl

Gitlab, Codeberg, and others are GitHub alternatives and competitors, but few of them offer this kind of security reporting feature. That makes them bad alternatives or replacements for us for this particular service.

Here is what we are looking for:

- Incoming submissions are reports that identify security problems.
- The reporter needs an account on the system.
- Submissions start private; only accessible to the reporter and the curl security team.
- All submissions must be disclosed and made public once dealt with. Both correct and incorrect ones. This is important. We are Open Source. Maximum transparency is key.
- There should be a way to discuss the problem amongst security team members, the reporter, and per-report invited guests.
- It should be possible to post security-team-only messages that the reporter and invited guests cannot see.
- For confirmed vulnerabilities, an advisory will be produced that the system could help facilitate.
- If there's a field for CVE, make it possible to provide our own. We are, after all, our own CNA.
- Closed and disclosed reports should be clearly marked as invalid/valid etc.
- Reports should have a tagging system so that they can be marked as "AI slop" or other terms for statistical and metric reasons.
- It should be possible to ban/block abusive users from the program.
- Additional (customizable) requirements for the privilege of submitting reports are appreciated (rate limit, time since account creation, etc.).

And the nits and missing features we fell over on GitHub:

- GitHub sends the whole report over email/notification with no way to disable this. SMTP and email are known to be insecure and cannot assure end-to-end protection. This risks leaking secrets early to the entire email chain.
- We can't disclose invalid reports (and make them clearly marked as such).
- Per-repository default collaborators on GitHub Security Advisories are annoying to manage, as we now have to manually add the security team for each advisory or have a rather quirky workflow scripting it. https://github.com/orgs/community/discussions/63041
- We can't edit the CVE number field! We are a CNA; we mint our own CVE records, so this is frustrating. This adds confusion.
- We want to (optionally) get rid of the CVSS score + calculator in the form, as we actively discourage using those in curl CVE records.
- No CI jobs working in private forks is going to make us effectively not use such forks, but this is not a big obstacle for us because of our vulnerability working process. https://github.com/orgs/community/discussions/35165
- No "quote" in the discussions? That looks... like an omission.
- We want to use GitHub's security advisories as the report to the project, not the final advisory (as we write that ourselves), which might get confusing: even for the confirmed ones, the project advisories (hosted elsewhere) are the official ones, not the ones on GitHub.
- No advisory count is displayed next to "security" up in the tabs, like for issues and pull requests. This makes it hard to see progress/updates.
- When looking at an individual advisory, there is no direct button/link to go back to the list of current advisories.
- In an advisory, you can only "report content"; there is no direct "block user" option like for issues.
- There is no way to add private comments for the team only, for discussing abuse or details not intended for the reporter or other invited persons in the issue.
- There is no short (internal) identifier or name per issue, which makes it annoying and hard to refer to specific reports when discussing them in the security team. The existing identifiers are long and hard to differentiate from each other.
- You quite weirdly cannot get completion help in comments to address people that were added to the advisory by being in a team you added to the issue.
- There are no labels, like for issues and pull requests, which makes it impossible for us to, for example, mark the AI slop ones or other things, for statistics, metrics, and future research.
- It is hard to keep track of the state of each current issue when a number of them are managed in parallel. Even just to see how many cases are still currently open or in need of attention.
- It is hard to publish and disclose the invalid ones, as they never cause an advisory to get written, and we rather want the initial report and the full follow-up discussion published.
- It is hard to adapt to or use a reputation system beyond just the boolean "these people are banned". I suspect that over time we need to use more crowdsourced knowledge or reputation based on how reporters have behaved previously or in relation to other projects.

Phil Eaton Yesterday

I started a software research company

I quit my job at EnterpriseDB hacking on PostgreSQL products last month to start a company researching and writing about software infrastructure. I believe there is space for analysis that is more focused on code than TechCrunch or The Register, more open to covering corporate software development than LWN.net, and (as much as I love some of these folks) less biased than VCs writing about their own investments. I believe that more than ever there is a need for authentic and trustworthy analysis and coverage of the software we depend on.

This company, The Consensus, will talk about databases and programming languages and web servers and everything else that is important for experienced developers to understand and think about. It is independent of any software vendor and independent of any particular technology. Some people were surprised (in a positive way) to see me cover MySQL already, for example. But that is exactly the point.

I don't want The Consensus to be just "Phil's thoughts". I have already started working with a number of experienced developers who will be writing, and paid to write, for The Consensus. I also hope that this is another way, beyond the many communities I already run, to give back to the community, such as by highlighting the work of open-source developers (the first interview, with a DataFusion developer, is coming soon) and highlighting compelling events and jobs in the software infrastructure world.

The Consensus is entirely bootstrapped and will depend on the support of subscribers and, potentially, sponsors. The first few subscribers signed up just this past week. You can read more about the background and goals here, you can read about how contributors will work with The Consensus here, and you can get a sense for where this is going by browsing the homepage of The Consensus already.

Thank you for your support in advance! Thank you to the folks who have subscribed already despite very little fanfare. Feedback is very welcome.
I'm very excited and having quite a bit of fun already. We're all going to learn a lot.

Hugo Yesterday

The B2BigB Syndrome: How Large Corporations Quietly Kill Startups

In the late 2000s, I worked at a software publisher and one of my colleagues started a company. It was a kind of corporate Second Life, where an avatar could move around and trigger discussions with other people. I don't remember the details anymore, but with hindsight and probably lots of exaggeration, I'd say it was like Gather but 15 years ahead of its time.

The application seemed to work well and the company was lining up meetings with major corporations that seemed super interested in rolling it out across their enterprise. We're talking about big banks, major energy suppliers, really serious companies. Except it dragged on. A month. A quarter. A year. Then two. And eventually the company died waiting for an actual signature and, incidentally, some cash.

My friend unfortunately ran into the infamous B2BigB syndrome, a curse (a French one?) that tends to kill a lot of companies every year. So if you're starting a company today or thinking about it, I invite you to think twice before prioritizing this segment, and that's what we're going to talk about today.

First, I need to define this acronym. In the business world, we tend to segment companies based on the customers they target: B2C companies sell to consumers, B2B companies sell to other businesses. For example, Netflix is B2C and Jira is B2B. Among all this you have plenty of nuances. Microsoft sells in both B2C and B2B, for example. You have C2C platforms (exchanges between individuals). But let's keep it simple and just talk about B2C and B2B.

Except "B" is broad. Between a 5-person company and a 40,000-person conglomerate, the way you sell to the two is very different. And within this category, there's a deadly sub-category: large corporations. It's hard to say exactly where a large corporation begins, but you recognize them easily. A large corporation starts when a decision requires a ton of meetings, a quarter of waiting, a steering committee, and board approval or a purchasing-department sign-off.
In practice, you can even have 500-person companies that behave this way, even if it's more common starting at 1,000. But in any case, it gets worse with size. A quarter can become a year, or even 2, or even 5 (and I swear I've seen sales cycles that long). Anyway, that's what I call the BigB (the big B's).

The big advantage of BigB's is, in theory, their ability to pay a premium, because we're talking about deployment across an entire large corporation, so volumes that make most startups' eyes light up. Except that it's often a mirage, the moment you start looking at costs and margins, not to mention all the associated risks.

Working with a large corporation is often synonymous with complexity, and that complexity is financed by specialists. You have to respond to costly processes (a 200-page security questionnaire, legal questionnaires, framework contracts, ISO certification this and that) that often require a lot of specialists (lawyers, security experts, finance people, etc.). And that's just to get through the first step of the sales cycle. To sell to a large corporation, you need to be prepared to spend a fortune. By the way, it's worth noting that this doesn't prevent these large corporations from regularly appearing on the monthly data breach list. Because no, churning out Excel questionnaires is not synonymous with security quality.

After that, you're quickly going to fall into the spiral of quarterly meetings with a bunch of people you'll only see once in your life, some of whom will take advantage of their temporary power to take out their frustrations and pet peeves on you. And since you'll be in a weak position, well...

This time is time not spent on the product. Of course it's normal to spend time on sales, but we're talking about quarterly meetings to prepare, with McKinsey-style PowerPoints (you sometimes even see scale-ups calling in consulting firms to fill out these documents) that require weeks of preparation.
Again, to sell to a large corporation, you need to be prepared to spend a fortune and wait ages. But let's imagine you've finally got the green light to deploy in a large corporation. The contract is signed. Now it's up to you to figure out adoption. Actually, this is the beginning of a second nightmare.

A year has passed since the beginning of the sales cycle. All your previous contacts are gone. They might have been contractors who left the company, or executives who got transferred to other branches of the group. And now you have to find the people capable of helping you deploy your software, because your revenue almost certainly depends on how much the software is actually used. No deployment, no money. So you're going to need a dedicated team of salespeople capable of navigating complex bureaucracy to find the right contacts, and maybe even a dedicated implementation team. Your costs are going to explode and you still won't have earned anything at this stage.

With a bit of luck, and because you were smart enough to get a payment at signature, you'll eventually issue your first invoice. It will be paid 8 months later, end of month, the first 3 months having caused countless incidents because a purchase order needed to be signed and you had to go through 3 different departments for that. Bad luck: your cash flow is starting to choke. You reach the end of the first year, and then the purchasing department comes to see you to renegotiate the contract, knowing full well that they're, in theory, your biggest client, so it would be natural to do them a favor.

In short, 2 years later, you've spent a fortune, your cash flow is negative, and your margin has melted like snow during a World Cup ski race in Saudi Arabia. OK, let's say I'm exaggerating and that despite everything, this contract allowed you to cross a threshold, to have an impressive signature to put forward, and life continues for your startup/scaleup.
Actually, you don't know it yet, but you've invited a Trojan horse into your company. Working with a large corporation means accepting the complexity inherent to that business. If it took you 2 years to sign a contract with them, expect everything else to take just as long.

Your product has to evolve to fit their way of working. You'll be asked for 12-level approval workflows, software integrations with ERPs, broken enterprise SSO, integrations with legacy systems from the 90s. And every company has its own internal jargon that you'll be asked to force into your software. You'll invoice in units of work, have a "purchasing" role in your RBAC schemas (authorization systems). In short, in reality, you're going to develop an extension of your first client's IT infrastructure, with all its constraints, its complexity, its slow onboarding, and its costs.

And when you have a client representing 80% of your revenue (and even from 20% onwards it really starts to matter), you can hardly say no. So your roadmap is regularly hijacked by salespeople dedicated to this client, and overall the product drifts away from the mass market. And that's normal; I'm not throwing stones at that team. If you've dedicated people to a client, it's normal that they try to influence how you build the product, even when the requests are absurd, because that team doesn't have the perspective needed to judge. And when the roadmap is regularly sidetracked, you also pile up a huge amount of customization debt that will end up slowing the entire product.

This big client may have allowed you to double your headcount. But 3/4 of the company will end up working for them, and will develop their own software culture: less UX sense, less sensitivity to product performance (no point working on acquisition or conversion, for example).
All enterprise software has terrible UX, because first, that's not what drives sales, and second, after burning money in the sales process, certification and onboarding, you have to make savings somewhere, often on the product, which is no longer really central to the relationship with this client. They'll try to reassure you by saying no, it's important, but actually, by that point the product has become a cost center that needs to be optimized to not lose more margin. Margin eaten by the consulting firm that helped you determine your deployment strategy and pricing...

But even when you "improve" your product for this client, you're going to continuously degrade it for all the others you thought you'd attract next by showcasing this win on your beautiful landing page. Because again, you're going to impose this client's complexity on all the other companies that could have been interested in your services.

I'm obviously painting a dark picture. And there are companies that specialized FROM DAY 1 in large corporations, that tailored their commercial offering taking into account all the associated costs. Deployments are priced at 100k, contracts impose minimum usage, everything was framed from the start because the strategy was always to expand exclusively there. But for all the companies that plan to "just" land one BigB to get a validation badge, while actually targeting the entire SMB market and looking for volume: it's rarely a good plan.

At the beginning I said: "this curse (a French one?)". Why do I say it's a great French curse? It's probably a magnifying-glass effect, and I'd certainly see the same thing in every country. But every year, I see companies die after quarters of waiting for that famous contract with a large corporation (just yesterday I was talking to someone who told me the exact same story). So I think there's something a bit different about us. We like to be different.
Partly, I get the sense it's related to the size of our SMB market, which is smaller than Germany's (the German Mittelstand seems bigger). We go faster from SMBs to large corporations. And obviously, in terms of credibility, it's easier to sell a product once you have the logo of a large corporation than a bunch of logos of unknown companies.

What's certain is that culturally, there's the CAC 40 and everything else. The CAC 40 has been basically the same companies for 30 or 40 years. By contrast, look at the S&P 500: in 1990 it was Exxon, GE, Philip Morris, IBM. They've all given way to Apple, Nvidia, Amazon, Google. In France, the large corporations in the CAC are structurally stable and dominant, which makes them all the more attractive as clients for startups. They have budgets, longevity, legitimacy. But these same large corporations aren't springboards to a global market; they're markets closed in on themselves.

Conversely, the SMB market can work. If I look at Pennylane, Qonto, Indy, Payfit, Spendesk, Livestorm, it's precisely by targeting this market that they've managed to go far. By contrast, I have real questions about the strategy of a company like Mistral, which seems to position itself only on large corporations (on-premise deployment, Azure partnerships, etc.) and to be neglecting the mass market. I hope it won't be the next DailyMotion, which favored big media and telecom operators while missing the opportunity to become the B2C media platform that YouTube managed to become.

You'll have gathered that if you're starting a company today, I'd tend to advise you not to see "B2B" as a single big playground. I'd tend to tell you to avoid B2BigB, which is often destructive for startups and often leads to a dead end. It's still possible, but you need to be armed for it. And if that's your choice, I'll only say one thing: good luck :) Targeting large corporations (and the public sector) obviously gives you access to larger markets.
But I'd tend to recommend tackling that step later, when the company is already solid. When DJI (Chinese drones) attacked the professional market, they already had a huge foothold in the B2C market. They came with expertise and know-how that allowed them to stay sovereign over their decisions.

Now, if you're tempted anyway, the recipe for having a chance is above all a question of leadership maturity: you need to know how to say no firmly, and to stop chasing every rabbit that passes by when you see a so-called "low hanging fruit" (the expression that has replaced "quick win" as one of my most hated expressions). There's no such thing as effortless gain. Everything has a cost, even when it's hidden. And you need a good financial and reputational foundation to impose these conditions, hence the advice to already have a good base in the other segments. It's easier to say no when a client represents 2% of your revenue than when they represent 20%.

One strategy I've seen work several times is to create software with great UX, get adopted by the teams, then go see the purchasing departments of the companies in question and put the usage figures under their nose: "See, you already have 300 people using it, wouldn't you like to set up a framework contract and better understand usage at your company?" That's interesting because adoption of your product came from the teams, you didn't modify your roadmap, and you're in a strong position with procurement to improve your presence without being pressured on everything else. In short: make a good product, track usage, wait until you have enough footprint, and then go negotiate.

Anthropic (Claude Code), by first targeting individual developers (indie hackers, side projects) and small teams, was pushed to constantly improve its product, which became number 1 in its category (at the time of writing; this passage might age poorly :)). Today, they're selling enterprise licenses.
Good companies are able to do volume and then move up the chain: small companies, then large companies. I've rarely (never?) seen the reverse. Once you've built for large corporations, you no longer know how to come back to the other segments.

B2C (Business to Consumer): selling to the general public. B2B (Business to Business): selling to companies.

Chris Coyier Yesterday

You Get Good At What You Do (Or Do You?)

I used to feel really strongly about this. You get good at what you do. Like, if you build websites all the time, you get good at building websites. If you make burritos all the time, you get good at making burritos. It could extend to almost anything. Healthy places that fit into the logical narrative you already know, like if you lift weights to the point of exhausting your limits a lot, you'll get stronger. But also silly and unhealthy situations. Like, if you sit on your ass and watch TV all day, you get good at sitting on your ass and watching TV all day. Your body and mind will tolerate it well. You'll know how to operate the remote well. You'll know what you want to watch and when.

I have some doubts, though. In the ~9 years I've lived in Bend, Oregon, I've gone skiing ~100 times. I do not think I'm any better at skiing on my 100th time than I was when I moved here. Maybe like, a little? But I'm not entirely sure. Could be worse. I do it, and I don't get better at it. I want to get better like I want to like seafood. It's aspirational, it's just not happening. I'm sure most people get very good after skiing 100 times. I'm just a weirdo. Yes, I'm getting older. Yes, I could be healthier. I'm not sure that's the entire math here.

I think I'm uniquely bad at skiing because I do not like going fast. I don't like going fast in cars. I don't like going fast on a bike. I don't like going fast… ever. I get this extreme discomfort really quickly. So I'm constantly fighting to slow down, which just isn't very enjoyable and doesn't lead to the breezy flow state I see most people in. There's like a speed threshold: if you're comfortable there, that's a super normal speed to travel down a hill and get into that breezy flow state where it's fun, and you feel safe. If you've got this higher-speed tolerance, a much wider zone of fun opens up. Whereas I have this narrow sliver I can enjoy, and precious few runs that offer that kind of experience.
I’m gonna keep doing it, but just because I want my daughter to be super comfortable skiing, because it’s quite a cool lifelong hobby.

<antirez> 2 days ago

Implementing a clean room Z80 / ZX Spectrum emulator with Claude Code

Anthropic recently released a blog post describing an experiment in which the latest version of Opus, 4.6, was instructed to write a C compiler in Rust, in a "clean room" setup. The experiment methodology left me dubious about the kind of point they wanted to make. Why not provide the agent with the ISA documentation? Why Rust? Writing a C compiler is exactly a giant graph manipulation exercise: the kind of program that is harder to write in Rust. Also, in a clean room experiment, the agent should have access to all the information about well-established computer science progress related to optimizing compilers: there are a number of papers that could easily be synthesized into a number of markdown files. SSA, register allocation, instruction selection and scheduling. Those things needed to be researched *first*, as a prerequisite, and the implementation would still be "clean room".

Not allowing the agent to access the Internet, nor any other compiler source code, was certainly the right call. Less understandable is the almost-zero steering principle, but this is coherent with a certain kind of experiment, if the goal was showcasing the completely autonomous writing of a large project. Yet, we all know this is not how coding agents are used in practice, most of the time. Anyone who uses coding agents extensively knows very well how, even without ever touching the code, a few hints here and there completely change the quality of the result.

# The Z80 experiment

I thought it was time to try a similar experiment myself, one that would take one or two hours at most, and that was compatible with my Claude Code Max plan: I decided to write a Z80 emulator, and then a ZX Spectrum emulator (and even more, a CP/M emulator, see later) under conditions that I believe make more sense as a "clean room" setup. The result can be found here: https://github.com/antirez/ZOT.

# The process I used

1. I wrote a markdown file with the specification of what I wanted to do.
Just English, high-level ideas about the scope of the Z80 emulator to implement. I said things like: it should execute a whole instruction at a time, not a single clock step, since this emulator must be runnable on things like an RP2350 or similarly limited hardware. The emulator should correctly track the clock cycles elapsed (and I specified we could use this feature later to implement the ZX Spectrum contention with the ULA during memory accesses), provide memory access callbacks, and should emulate all the known official and unofficial instructions of the Z80.

For the Spectrum implementation, performed as a successive step, I provided much more information in the markdown file: the kind of rendering I wanted in the RGB buffer, and how it needed to be optional so that embedded devices could render the scanlines directly as they transferred them to the ST77xx display (or similar); how it should be possible to interact with the I/O port to set the EAR bit to simulate cassette loading in a very authentic way; and many other desiderata I had about the emulator.

This file also included the rules that the agent needed to follow, like:

* Accessing the internet is prohibited, but you can use the specification and test vectors files I added inside ./z80-specs.
* Code should be simple and clean, never over-complicate things.
* Each solid progress should be committed in the git repository.
* Before committing, you should test that what you produced is high quality and that it works.
* Write a detailed test suite as you add more features. The tests must be re-executed at every major change.
* Code should be very well commented: things must be explained in terms that even people not well versed in certain Z80 or Spectrum internals can understand.
* Never stop for prompting, the user is away from the keyboard.
* At the end of this file, create a work-in-progress log, where you note what you already did and what is missing. Always update this log.
* Read this file again after each context compaction.

2. Then, I started a Claude Code session, and asked it to fetch all the useful documentation on the internet about the Z80 (later I did this for the Spectrum as well), and to extract only the useful factual information into markdown files. I also provided the binary files for the most ambitious test vectors for the Z80, the ZX Spectrum ROM, and a few other binaries that could be used to test whether the emulator actually executed the code correctly. Once all this information was collected (it is part of the repository, so you can inspect what was produced) I completely removed the Claude Code session, to make sure that no contamination with source code seen during the search was possible.

3. I started a new session, and asked it to check the specification markdown file and all the documentation available, and start implementing the Z80 emulator. The rules were to never access the Internet for any reason (I supervised the agent while it was implementing the code, to make sure this didn't happen), and to never search the disk for similar source code, as this was a "clean room" implementation.

4. For the Z80 implementation, I did zero steering. For the Spectrum implementation I used extensive steering for implementing the TAP loading. More about my feedback to the agent later in this post.

5. As a final step, I copied the repository in /tmp, removed the ".git" repository files completely, started a new Claude Code (and Codex) session and claimed that the implementation was likely stolen or too strongly inspired by somebody else's work. The task was to check against all the major Z80 implementations whether there was evidence of theft. The agents (both Codex and Claude Code), after extensive search, were not able to find any evidence of copyright issues.
The only similar parts were about well-established emulation patterns and things that are Z80 specific and can't be done differently; the implementation looked distinct from all the other implementations in a significant way.

# Results

Claude Code worked for 20 or 30 minutes in total, and produced a Z80 emulator that was able to pass ZEXDOC and ZEXALL, in 1200 lines of very readable and well-commented C code (1800 lines including comments and blank lines). The agent was prompted zero times during the implementation; it acted absolutely alone. It never accessed the internet, and the process it used to implement the emulator was one of continuous testing, interacting with the CP/M binaries implementing ZEXDOC and ZEXALL, writing just the CP/M syscalls needed to produce the output on the screen. Multiple times it also used the Spectrum ROM and other binaries that were available, or binaries it created from scratch, to see if the emulator was working correctly.

In short: the implementation was performed in a very similar way to how a human programmer would do it, not by outputting a complete implementation from scratch, "uncompressing" it from the weights. Instead, different classes of instructions were implemented incrementally, and there were bugs that were fixed via integration tests, debugging sessions, dumps, printf calls, and so forth.

# Next step: the ZX Spectrum

I repeated the process again. I instructed the documentation-gathering session very accurately about the kind of details I wanted it to search on the internet: especially the ULA interactions with RAM access, the keyboard mapping, the I/O port, how the cassette tape worked and the kind of PWM encoding used, and how it was encoded into TAP or TZX files.
As I said, this time the design notes were extensive, since I wanted this emulator to be specifically designed for embedded systems: only 48k emulation, optional framebuffer rendering, very little additional memory used (no big lookup tables for ULA/Z80 access contention), ROM not copied into RAM to avoid using an additional 16k of memory but just referenced during initialization (so we have just the copy in the executable), and so forth. The agent was able to create very detailed documentation about the ZX Spectrum internals. I provided a few .z80 images of games, so that it could test the emulator in a real setup with real software.

Again, I removed the session and started fresh. The agent started working and was done 10 minutes later, following a process that really fascinates me, and that you probably know very well: you see the agent working using a number of diverse skills. It is expert in everything programming related, so as it was implementing the emulator, it could immediately write detailed instrumentation code to "look" at what the Z80 was doing step by step, and how this changed the Spectrum emulation state. In this respect, I believe automatic programming to be already super-human: not in the sense that it is currently capable of producing code that humans can't produce, but in the concurrent usage of different programming languages, system programming techniques, DSP stuff, operating system tricks, math, and everything needed to reach the result in the most immediate way.

When it was done, I asked it to write a simple SDL based integration example. The emulator was immediately able to run the Jetpac game without issues, with working sound, and very little CPU usage even on my slow Dell Linux machine (8% usage of a single core, including SDL rendering). Once the basic stuff was working, I wanted to load TAP files directly, simulating cassette loading.
This was the first time the agent missed a few things, specifically about the timing the Spectrum loading routines expected, and here we are in the territory where LLMs start to perform less efficiently: they can't easily run the SDL emulator and see the border changing as data is received and so forth. I asked Claude Code to do a refactoring so that zx_tick() could be called directly and was not part of zx_frame(), and to make zx_frame() a trivial wrapper. This way it was much simpler to sync EAR with what it expected, without callbacks or the wrong abstractions that it had implemented. After that change, a few minutes later the emulator could load a TAP file, emulating the cassette, without problems. This is how it works now:

    do {
        zx_set_ear(zx, tzx_update(&tape, zx->cpu.clocks));
    } while (!zx_tick(zx, 0));

I continued prompting Claude Code to make the key bindings more useful, and a few other things.

# CP/M

One thing that I found really interesting was the ability of the LLM to inspect the COM files for the ZEXALL / ZEXDOC tests for the Z80, easily spot the CP/M syscalls that were used (a total of three), and implement them for the extended Z80 test (executed by make fulltest). So, at this point, why not implement a full CP/M environment? Same process again, same good result in a matter of minutes. This time I interacted with it a bit more for the VT100 / ADM3 terminal escape conversions, reported things not working in WordStar initially, and in a few minutes everything I tested was working well enough (but there are fixes to do, like simulating a 2 MHz clock; right now it runs at full speed, making CP/M games impossible to use).

# What is the lesson here?

The obvious lesson is: always provide your agents with design hints and extensive documentation about what they are going to do. Such documentation can be obtained by the agent itself.
And also: make sure the agent has a markdown file with the rules for how to perform the coding tasks, and a trace of what it is doing, that is updated and re-read quite often. But those tricks, I believe, are quite clear to everybody who has worked extensively with automatic programming in recent months. Thinking in terms of "what a human would need" is often the best bet, plus a few LLM-specific things, like the forgetting issue after context compaction, the continuous ability to verify it is on the right track, and so forth.

Returning to the Anthropic compiler attempt: one of the steps the agent failed was the one most strongly related to the idea of memorization of what is in the pretraining set: the assembler. With extensive documentation, I can't see any way Claude Code (and, even more, GPT5.3-codex, which is in my experience more capable for complex stuff) could fail at producing a working assembler, since it is quite a mechanical process. This is, I think, in contradiction with the idea that LLMs memorize the whole training set and uncompress what they have seen. LLMs can memorize certain over-represented documents and code, but while they can extract such verbatim parts of the code if prompted to do so, they don't have a copy of everything they saw during training, nor do they spontaneously emit copies of already-seen code in their normal operation. We mostly ask LLMs to create work that requires assembling different knowledge they possess, and the result is normally something that uses known techniques and patterns, but that is new code, not a copy of some pre-existing code.
It is worth noting, too, that humans often follow a less rigorous process than the clean room rules detailed in this blog post: humans often download the code of different implementations related to what they are trying to accomplish, read it carefully, then try to avoid copying stuff verbatim, but oftentimes they take strong inspiration. This is a process that I find perfectly acceptable, but it is important to keep in mind what happens in the reality of code written by humans. After all, information technology evolved so fast partly thanks to this massive cross-pollination effect. For all the above reasons, when I implement code using automatic programming, I don't have problems releasing it MIT licensed, like I did with this Z80 project. In turn, this code base will constitute quality input for the next LLMs' training, including open-weights ones.

# Next steps

To make my experiment more compelling, one should try to implement a Z80 and ZX Spectrum emulator without providing any documentation to the agent, and then compare the results of the two implementations. I didn't find the time to do it, but it could be quite informative.


Does ChatGPT know what a question is?

I was explaining to a friend recently that ChatGPT, at its core, is "just" a model that predicts the next word, the one coming after a bunch of other words. So when you ask it "What is the capital of France?", it does not (really) answer your question: it completes a sequence of words, something it has been trained to do deeply and efficiently. Considering that, ChatGPT might seem to be in a situation akin to yours if someone told you a bunch of words you don't understand (in a foreign language, say) and someone else then handed you a card with some words to pronounce as a reply (in a language that you don't understand but can read).
