Latest Posts (20 found)

Web whetstones

How do you stay sharp as a web developer and/or designer? I’ll share my advice below. I’m also looking for front-end folk to advise me. What are your whetstones? That is to say: sources of news and knowledge to level up professionally. Does that metaphor work? We’re sharpening our minds, and I suppose the web too with our minds… are minds the whetstone here? Moving swiftly on, in rough order of preference:

People love to declare “RSS is dead” because they’ve chosen the likes of Google to gate-keep their web access. Interesting choice, but RSS remains alive and well. When I discover a new blog and like what I read, I’ll subscribe. There’s a good chance that person will write something useful again one day! Funny how that works. I don’t flood my reader with big sites that exist to generate content. I collect personal blogs that may only post once a year. That’s still plenty of unique insights as the list grows. I won’t share my list because I feel for RSS to work you have to curate it yourself.

Shop Talk Show has been number one forever. Syntax remains a decent source if you’re deft with the fast-forward button (it’s a little ‘sloppy’ these days). Igalia Chats is packed with wisdom. For a Better Web is Bruce Lawson in your ears. Wonders of Web Weaving from James is new and hopefully a regular listen. I’ve unsubscribed from too many podcasts that pivoted to AI servitude, which is disheartening. I’m not averse to such discussion but the level of mindless platitudes and gigglefests about what their wacky chat boxes said ain’t my cup of tea.

Mastodon and Bluesky are where I follow folk in the web industry. Socials can be a great whetstone if you manage your follow list carefully. Everyone uses these platforms for different reasons, which can be difficult to balance. Personally I stick to shop talk and mute politics, for example. I follow individuals and rarely organisations to avoid “brand engagement”.

Email newsletters are useful to catch stuff I’ve missed.
Many exist in RSS form too. My favourites are typically link dumps with a side of commentary. Current favourites: Frontend Focus and Design Systems News. Sidebar still has the odd gem if I care to sift through the “AI” links. Newsletters are a declining category for me. Perhaps because I keep getting unsubscribed by those with failed tracking pixels. Email costs money to send so I’ll accept my loss.

Lobste.rs, Hacker News, Reddit (e.g. web dev, experienced devs, frontend etc). Does dev.to have any humans left? These forums are a good source of links, if you can filter the bot spam and avoid the cesspit of comments. Toxicity spreads and it’s all too easy to get dragged in. Sometimes you just have to let people be wrong on the internet.

I’ve heard conferences still happen! I only leave my house now to scavenge for essentials so I don’t have much to say. Clearleft events are guaranteed value if you’re in the UK. Some conferences have online tickets but I find the in-person socialising to be the main benefit.

Everything listed above is (or has) a website. I’m poor at organising and utilising bookmarks. I’ll manually visit bigger blogs like CSS-Tricks and Smashing Magazine once a month to see if anything interests me.

I bookmark a handful of YouTube channels like Kevin Powell because I have no Google account to “smash that subscribe button”. YouTube isn’t my thing though. I have an allergic reaction to algorithm-driven content.

I don’t use Discord but I hear it get promoted often. Are these communities lively or are they a ghost town? That’s my problem with Discord. It’s a black hole for information; antithetical to an open web! Am I missing out? Not sure I care.

For no particular reason I’ll end with this quote from Seth Rogen.

“I don’t understand what it’s supposed to do. Every time I see a video on Instagram that’s like, ‘Hollywood is cooked,’ what follows is, like, the most stupid dog shit I’ve ever seen in my life,” he said. “And if your instinct is to use AI and not go through that process, you shouldn’t be a writer, because then you’re not writing.”

Seth Rogen Says If “Your Instinct Is to Use AI” to Write Scripts, “You Shouldn’t Be a Writer” - The Hollywood Reporter

P.S. no more blog posts until June. I’m due a holiday! Thanks for reading!


A Markdown-based test suite

This article is not about AI and it is not written with AI, but the work that I’m about to present was definitely motivated by AI. And because I generally like telling stories, I have to give you that background. Do with that whatever you want, but… it’d be a pity if you left just because the AI word showed up in the first paragraph! I think the technical explanation that follows is at the very least entertaining and also interesting independently of AI. Back in December, I started toying with coding agents. One thing I tried, and for which I didn’t expect a lot of success, was to point an AI agent to the EndBASIC public documentation and ask it to write games like Space Invaders or Mario from scratch. And even though the results weren’t perfect and they didn’t work on the first try, they did work with a few tiny tweaks. Combining that with a bunch of hand-written rules, I had an agent producing EndBASIC demos with ease. This experiment was impressive because I did not expect an agent to be able to write EndBASIC code… and because it worked, it fueled my interest to pick EndBASIC’s own development back up. Three thoughts came to mind: Increase EndBASIC’s “self-documenting” aspects so that an AI agent can learn about its idiosyncrasies unsupervised. Speed up EndBASIC so that it can run more elaborate games. Extend EndBASIC with long-desired primitives like sprites and sound, to finally realize the vision behind the project. These thoughts combined sparked the rewrite of EndBASIC’s core that I’ve been pursuing since January and which should see the light of day in the upcoming release. But before that happens, I want to talk to you about just one of the cool pieces behind the new core: namely, its approach to testing. I’ve stopped writing unit tests for the compiler and VM in Rust and I’ve switched to writing them in Markdown. And I believe this has turned out to be a pretty nice approach. 
One of the things I had to do to convince an AI agent to write proper EndBASIC code was to hand-craft a bunch of rules to tell it how EndBASIC differs from other, more traditional BASIC dialects. That worked OK, but writing these rules by hand was error-prone and difficult to make exhaustive. So I wanted to let LLMs extract that information directly from EndBASIC. The idea was simple: if I wrote the integration tests for the new core in Markdown, the lingua franca of AI, the tests would serve as the canonical and correct documentation demonstrating language behaviors. LLMs are great at summarizing information, so if I unleashed them over a large set of these hands-on “examples”, they would probably figure stuff out, right? And they actually do! I gave the following prompt to GPT 5.4: Based on your pre-existing knowledge of BASIC dialects, I want you to read all of the files, analyze how the EndBASIC dialect differs from your knowledge, and come up with a bunch of rules for yourself to know how to write EndBASIC code later on. You can ignore the Disassembly sections. Beware that all functions and commands in these integration tests are test-only: the real functions and commands that you can use are documented in , so read those too to learn what functionality is available. Write your findings to a file. And this produced a very comprehensive file with spot-on rules: here, take a look. But leaving that aside, let’s peek into the internals of this new Markdown-based test suite. It’s a collection of Markdown files: Where each file acts as a container of one or more test cases: Every test case has a section title describing what the test is about and various subsections to define the test scenario: A Source code block that is the input to the compiler. If compilation fails, a Compilation errors section with the error messages and nothing else afterwards. 
If compilation succeeds: A Disassembly section that contains the compiled bytecode. An optional Exit code section showing the program’s exit code, if different from zero. An Output section that contains any messages printed to the console by the executed program. A Runtime errors section that contains any errors from the executed program. Here is a simple example validating the command: There is no section to validate the lexer or parser internals right now, but I’m considering extending the format further to dump the AST too in order to simplify the tests for these components. The driver for this test suite enumerates all Markdown files in the tests directory and processes them one at a time. For each file, the driver extracts all test case titles and their Source subsections to compute all the test cases to execute. Once the driver has this subset of information from the Markdown files, it feeds each individual test case to the compiler and, if compilation succeeds, to the VM. All side-effects are captured and the driver emits a new Markdown file from scratch with the results of the test. Once the driver has finished producing a new version of the Markdown file for a test, it compares the produced file (actual) against the pre-recorded, checked-in version (golden). If they differ, the test fails and the driver uses the tool to print the differences. And that’s it. Easy peasy, right? This keeps the driver super-simple as the only thing it has to do is parse a minimal subset of Markdown, and the diffs it produces are trivial for a human to understand. There are currently 448 test cases and 13k lines of Markdown in this test suite so maintaining them “by hand” is not an option. You wouldn’t want to implement an optimization to the compiler and then have to rewrite hundreds of disassembly chunks in the golden files to reflect the changes, would you? 
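The extraction step described above, pulling out only titles and Source blocks and ignoring every other section, could look roughly like the sketch below. It is an illustration and not the actual EndBASIC driver: the `# Test: <title>` heading and `## Source` subsection spellings, the `TestCase` type, and the `parse_test_cases` function are all assumptions made up for this example.

```rust
/// One test case pulled out of a Markdown file (hypothetical type).
#[derive(Debug, PartialEq)]
struct TestCase {
    title: String,
    source: String,
}

/// Extracts titles and Source blocks, ignoring all other sections.
/// Assumed layout: `# Test: <title>` headings, each with a `## Source`
/// subsection that holds one fenced code block.
fn parse_test_cases(markdown: &str) -> Vec<TestCase> {
    let mut cases = Vec::new();
    let mut title: Option<String> = None;
    let mut in_source_section = false;
    let mut capturing = false; // Inside the fence of a Source section.
    let mut skipping = false; // Inside a fence of any other section.
    let mut source = String::new();

    for line in markdown.lines() {
        if line.starts_with("```") {
            if capturing {
                // Closing fence of the Source block: the test case is complete.
                capturing = false;
                in_source_section = false;
                if let Some(t) = title.take() {
                    cases.push(TestCase { title: t, source: std::mem::take(&mut source) });
                }
            } else if skipping {
                skipping = false;
            } else if in_source_section && title.is_some() {
                capturing = true;
                source.clear();
            } else {
                skipping = true;
            }
        } else if capturing {
            source.push_str(line);
            source.push('\n');
        } else if skipping {
            // Recorded results (disassembly, output, errors) are not inputs.
        } else if let Some(t) = line.strip_prefix("# Test: ") {
            title = Some(t.trim().to_string());
            in_source_section = false;
        } else if line.starts_with("## ") {
            in_source_section = line.trim_end() == "## Source";
        }
    }
    cases
}

fn main() {
    let doc = "# Test: Print a number\n\n## Source\n\n```basic\nPRINT 1\n```\n\n## Output\n\n```plain\n1\n```\n";
    for case in parse_test_cases(doc) {
        println!("{}: {:?}", case.title, case.source);
    }
}
```

Because the recorded sections are skipped on input, they never influence what gets executed: they exist only to be regenerated and then compared against the checked-in file.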
The thing is that, due to the design described earlier, regenerating the golden files after a core change is easy: the driver is already doing exactly that to execute the tests! The trick is, simply put, to ask the driver to rewrite the golden file instead of producing an actual file by setting the environment variable. And voila: all golden files are regenerated in place. I can then use Git to validate the changes and commit them along with the actual code change. Let’s start with the pros of this Markdown-based test suite framework: It is much easier to work with than what I had before. I used to dread touching the compiler and VM of the previous EndBASIC core implementation because tweaking tens of tests was painful. Changes required me to fiddle with positions and deeply nested types, and now the tests are trivial to tweak and diff against previous state. Pretty much any decent text editor has Markdown support, including formatting fenced code blocks. This makes it easy to skim through the test suite and modify the files and is actually the primary reason I used Markdown instead of a bespoke textual format. LLMs can “learn” with ease. OK, fair, this is just a guess: I did not try the same prompt at the beginning of this article against the old core with its Rust-based tests, and maybe the LLMs would have done a good job at reverse-engineering the rules. But because the Markdown tests are so much easier to read by humans, I have to assume that they also are for LLMs. And now, of course, some cons: Regenerating the output of a test, or all tests, is way too easy . With the older Rust-based tests, I was forced to manually punch in things like line numbers and nested AST trees. This process forced me to think through the changes in detail. With the new approach… regenerating the golden files is trivial, so it’s easy to miss little mistakes in source positions or disassembled code. 
Differences in disassembly are usually noisy and hard to review because every line carries an address and thus any new or deleted instruction will introduce offsets into all other addresses. I could of course choose to not include the instruction addresses in the dump, but they come in handy when manually validating jump targets, so it felt better to keep them around. Rust cannot generate first-class test cases on the fly which means that the various test cases within a Markdown file are “invisible” to the driver: I can run them all or none, but regular test filtering via doesn’t apply. I was able to “expose” the different Markdown files as different Rust-native test cases, but this involves a hardcoded list of test files, which must be kept in sync with the files on disk, and so I mitigated the chances of divergence by adding a test that cross-references the two. This idea does not generalize well. The Markdown-based test suite presented here works well for components where end-to-end testing is favorable and, more importantly, cheap, but I wouldn’t recommend it for other scenarios. Keeping tests fast is a must for quick iteration. And I think that’s about it. If the above feels too abstract, I encourage you to take a look at the driver, its helper code, and the directory with test suites. Now that you have this new trick up your sleeve, what do you think?
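The compare-or-regenerate switch at the heart of this workflow fits in a few lines. The following is a hedged sketch under stated assumptions, not the real driver: `REGEN_GOLDEN` is a placeholder name because the actual environment variable is not given in the text, and `check_golden` and `Outcome` are hypothetical names.

```rust
use std::{env, fs, io, path::Path};

/// Result of checking one freshly generated file (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    Matches,
    Differs,
    Regenerated,
}

/// Compares `actual` against the golden file, or overwrites the golden
/// file with `actual` when `regenerate` is set.
fn check_golden(golden: &Path, actual: &str, regenerate: bool) -> io::Result<Outcome> {
    if regenerate {
        fs::write(golden, actual)?;
        return Ok(Outcome::Regenerated);
    }
    let expected = fs::read_to_string(golden)?;
    if expected == actual {
        Ok(Outcome::Matches)
    } else {
        Ok(Outcome::Differs)
    }
}

fn main() -> io::Result<()> {
    // Placeholder variable name: an assumption made for this example.
    let regenerate = env::var_os("REGEN_GOLDEN").is_some();

    // Use a scratch file so that the example does not touch real goldens.
    let golden = env::temp_dir().join("golden_sketch_example.md");
    fs::write(&golden, "# Test: Print a number\n")?;

    let outcome = check_golden(&golden, "# Test: Print a number\n", regenerate)?;
    println!("{:?}", outcome);

    fs::remove_file(&golden)?;
    Ok(())
}
```

The sketch also makes the main drawback visible: the regeneration path performs no checks at all, so reviewing the resulting Git diff is the only safeguard against recording a wrong result.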


Another rant about web browsing

Yes, I’m writing again about my ongoing experiment with blocking JavaScript on a per-site basis. This time, I’m not here to explain how I operate in detail, but to complain about the work needed to maintain this web browsing hygiene. In short, the web is a mess, and while messy things can be fun, I’ve recently grown very frustrated with the need to dance around my extensions every time I visit a new site where displaying simple text apparently requires JavaScript, or where scrolling requires dismissing a cookie modal that is only visible if content blockers are turned off first.

I’ve come to the conclusion that blocking JavaScript by default on all websites, as I’ve been doing lately, is a source of frustration. Yes, the web is light as a feather and my browser feels very fast when it doesn’t have to deal with all the JavaScript; I do love that. But this “strategy” breaks too many websites, pushing me to take detours so often that they can barely be called detours any more. You see, I can’t be bothered to manage an efficient “allow” list in the long run, so web browsing often feels like a series of new obstacles, as if every day is the first day of this setup. *1

This strategy is therefore a bad one. Just as bad as the other strategy I tried before, the one where I only block JavaScript after visiting the site, if it feels necessary. My discipline with that second strategy tends to fade away as days go by, and I end up barely ever blocking anything, even forgetting that this is something I can do. This strategy often encourages me to download a proper content blocker or use a filtering DNS.

Not only are these strategies inefficient in their initial goal of making my web browsing experience better, but they also only work on the Mac.
On iOS, due to the way Safari extensions work (which is a bit shitty), neither of the two strategies for blocking JavaScript on a per-site basis is practical to use, pushing me to adopt another strategy just for my phone (which, in turn, makes everything feel so much more complex than it needs to be). On the iPhone, accessing the settings for each Safari extension is already complicated, but there seems to be no way to manage a per-site setting if the extension is not recognised as a content blocker and if it is set to “allow on all websites”. With StopTheScript for instance, I can only manage the per-site setting if I set the extension to “ask”. Also, per-site settings only seem to sync between the phone and the Mac if the extension is a content blocker. *2

So, if I were to rate my JavaScript-off web browsing strategies, taking into account the browsing experience itself (the way the websites look and behave), the impact on my computer’s CPU (if the fan turns on or not, if it lags), and the amount of maintenance required (having to manage exception lists):

1. JavaScript-off by default, allowing a few selected sites permanently, visiting others temporarily in a private tab (where the extension is inactive): 8/20
2. JavaScript-on by default, managing the JS-off list extensively but facing the terribleness of raw webpages: 6/20

Both are bad strategies, but the truth is that none of the alternatives I’m thinking of are better. For example, using a full content blocker like Wipr is a frustrating experience in itself. Having to manage another list of sites and constantly refreshing pages with or without content blockers is a pain. That, and the fact that it seems to be a heavier workload for my old Mac, as are third-party browsers.

3. Content-blocker-enabled Safari, managing the exception list and dealing with a laggy computer: 7/20
4. Third-party browser, like Helium, Quiche Browser, or Orion, combining content blocking and a neat JavaScript toggle (uBlock Origin is pretty great at both): 7/20
5. Naked browser, meaning no content blockers, no JavaScript limitation, no list to manage, nothing to do, just the natural web: 1/20

I think the best setup is the following, even if I’ll stick with strategy 1 for a while:

6. Strategy 2, with a DNS resolver like Mullvad or NextDNS, effectively blocking most crap without making my laptop choke: 9/20

The main issue with strategy 6 is that I’ve had issues with these DNS resolvers, like not being able to access common websites for hours, even my own website, resulting in a quick investigation only to realise that everything was working fine and that the issue was with the DNS resolver.

This is the state of web browsing in 2026, terrible at best. Allowing JavaScript, blocking JavaScript, whatever; either way the experience is bad. Most websites are stuffed with invasive ads, surveillance tracking, dickpanels, noise, and junk. Nothing we can do really works, unless one spends hours fine-tuning everything and therefore adds extra layers of complexity. The more effort I put into filtering the filth, the more ready I am to give up at the first little hiccup. It doesn’t feel right to reload a webpage three times to view it properly and to take the time to ensure it’s properly set up for future visits. While content blockers and JS toggle tricks are improving things drastically, the added amount of work required is a pain in itself.

The browser on one side, the extensions on the other. The more we consume websites as the filling in a sort of software sandwich, the more they resist. The thicker our bread, the more sauce they add. The more bread we bring to absorb it, the more junk they add to the filling. At what point does it become too disgusting to eat?

Meanwhile, reading articles outside the web browser, via email newsletters or within my RSS reader, is pure bliss; a delightful, gourmet, delicious cuisine that stimulates my appetite instead of making me want to throw up. It’s so good that I don’t even need extra bread. *3 It just works. Just like it’s increasingly better to search for an answer using an A.I. chatbot rather than a traditional search engine, it’s now better to read articles from websites by using apps that are not traditional web browsers. It feels wrong, like driving on the smooth cycle lane rather than a pothole-filled road.

How long can this situation last? Between difficult business models (the source of most problems, driving us to use content blockers in the first place) and new A.I. chatbot intermediaries, I don’t know what will happen to the web in the next three or four years. Some web browsers are already in a weird spot.

In the meantime, I will keep overthinking this, as I want my next laptop to inherit a “final” and well-thought-out setup, developed on this early 2020 MacBook Air. Its lack of a powerful chip and its limited memory force me to face the inefficiency of overloaded webpages and third-party browsers. Maybe I’m obsessing a little too much about this. Or maybe I need to sleep more.

*1 “Allow” or “deny” list, depending on whether we talk about JavaScript being on or off, or the extension blocking it. The vocabulary around content blockers and extensions like StopTheScript confuses me in terms of negation.
*2 Some extensions like StopTheMadness can be configured without relying on Safari per-site settings, but decentralising and maintaining two competing lists is pretty much the opposite of what I want.
*3 I do disable JavaScript in NetNewsWire though, just to be safe.


i suspect there is a new uber data leak

For the last 2 weeks, I have received 2FA codes for my UberEats account that I did not request, roughly every couple of days. The account was only created a year ago and used twice, then never again. The email address and password were created specifically for it and are not used elsewhere. No other accounts and services I use are affected. A quick search shows others have been dealing with the same very recently (~3 weeks). I logged in and tried to start the account deletion process. It ends in a white screen with no confirmation, and you get no “Sad to see you go” email or anything else. If you’re lucky, you’re forcibly logged out and take it as a sign that it worked. That’s a shitty process. The info says they deactivate for 30 days and then fully delete, and I had hoped that would stop the 2FA requests, but alas, it did not. They still let you attempt to log in and send a code, and I have no idea if that stops the deletion process or resets the 30 days. As I did not want a random person thwarting my account deletion, I once again logged in, changed all personal information and the password, and started the account deletion again (same bullshit). I have contacted their privacy team to let them know. We’ll see what they say. I also asked them to delete my account in case the deletion process failed. I haven’t yet seen any notice about this on their website or in the media. 🤷🏻‍♀️ If you have an Uber/UberEats account, consider changing your password. Published 17 May, 2026


How to actually handle database transactions (and why your ORM fails at it)

Wrapping database calls in a simple try-catch block is just asking for trouble. Here is a practical look at how to handle transactions properly, avoid the dreaded ORM lost update, and keep your data actually safe.

iDiallo Today

In the Empire's Defense

I didn't watch Star Wars when it was released. I wasn't even born. By the time we popped the cassette tape in the VCR, it was at least 15 years old. But I liked the movie all the same. It was not my favorite film by any means, but it was memorable. The first time you see Darth Vader appear on screen, you know this villain is not going to be easy to defeat. "Villain" because no one needs to tell you who the good guys and bad guys are in this movie. The visuals, the voices, the music, everything tells you that Darth Vader and the Empire are up to no good. Now I get to watch the movie with my kids. I quickly pointed out to my sons that this is the bad guy. Stating the obvious. One of them asked, "Why is he the bad guy?" I had to pause for a second to come up with an explanation. I didn't have an answer. Instead I said, "Because he is mean as hell!" They fell asleep before the movie ended. I think they enjoyed it. But I really couldn't tell you exactly why Darth Vader was the bad guy. This is just a thought experiment, don't go telling the world that I am pro-Galactic Empire, OK? I'm not digging into the lore of Star Wars. I did that already with the Galactic Timezones piece and it was exhausting. What I want to do is draw some parallels with real life. First, I think if a real-world government behaved like the Galactic Empire, they would clearly be the bad guys. But in real life, we don't have good guys or bad guys. I want to focus on just one aspect. The Empire's goal is to maintain order, or at least to try to. And the rebels are clearly creating chaos, with their freedom and what not (bear with me). Imagine what it takes to develop a system that keeps several star systems all in sync. The political process to elect senators, not just from different races, but different species. And then some religious zealots want you to throw everything you've built aside and just "feel" the force. You want to expel them as far from the system as possible. 
“Can one ever be too aggressive in preserving order?” — Syril Karn The rebels sabotage missions, attack army bases, and create chaos. On the surface, these rebels are clearly disruptive. I can already hear politicians calling them names and requesting additional funding for their "ally" to eradicate the threat. If the rebel attacks were broadcast on TV, even citizens of the many worlds would agree that the rebels need to be dealt with. Writers would write poems on the supposed virtues of keeping order as Kipling did in " The White Man's Burden ". All they are doing is bringing railways, law, and civilization to chaotic planets. Just think about rebels carelessly destroying a base on a remote planet whose only purpose was to track and sync time across a multi-star time zone system. Madness! But then I watched Andor. If you watch Star Wars as an adult and don't suspend your disbelief for a second (contrasting it with real life) then yes, the rebels are the bad guys. Which is exactly why Andor was a fantastic addition to the Star Wars universe. A more grounded show that I watched without my kids, and thoroughly enjoyed for how it depicted the inner workings of the Empire. Rather than focusing on the Empire as a whole, Andor zooms in on a small faction, the ISB, and shows how ordinary people end up joining the rebellion. The rebels are no longer just David fighting Goliath. Instead, you see the individual faces of people suffering at the hands of the Empire. You see the surveillance, the strong-arming, the unfair treatment, the killings. You see innocent people caught in the crossfire, labeled terrorists at the first sign of dissent. One man's rebel is another man's freedom fighter, and the Empire controls the broadcast. And the rebellion is not a single organization with a single leader. Anyone oppressed and frustrated with the Empire is a rebel in their own way. It's not good guys versus bad guys anymore. It is power exerting a crushing weight on its subjects. 
To hell with keeping time in sync, fight back! To hell with keeping order when all it means is blind obedience or else. Bring back those Jedis, the so-called religious zealots. But alas, it's just fiction. Real life is not the same. In our world, the Empire wins every time. Ask the Indians. Ask the so-called independent nations of West Africa. My sons, when I try to speak French with them, tell me that they are not French and neither am I. They are right, because the Empire won.

Unsung Today

“We accepted this gradual bloat, but that’s not progress.”

I like the Fits on a Floppy manifesto by Matt Sephton: Software should be as small as it can be. Not as a gimmick, but as a discipline. The floppy disk is the measuring stick: 1.44 MB. If the software that ran entire businesses could fit in that space, then a modern, focused, single-purpose tool certainly can. In my own work, I have mostly focused on the web side of this equation, as this is where the situation feels the most dire: tens of megabytes dedicated to heavy frameworks, unnecessary tracking scripts, and video ads have a real negative effect on experiencing websites. Progressive loading challenges also make it harder to offer a great experience. But space considerations are starting to feel more pertinent to local software too, in an era where SSD and hard drive prices are going up, and where local LLMs start taking up more room. Also, this passage feels very Unsung, and is exactly why the tag #history exists on this blog: I don’t miss floppy disks. I miss the mindset they demanded—that every byte matters, that constraints breed creativity, and that software should be light on its footprint. If you reduce tech history to just nostalgia, it won’t be that useful. But if you look at it as inspiration, you might find some truly wonderful and meaningful stuff in there. On that note: Bonus for a nice classic domain, and a nod toward Mac’s most famous screensaver. #history #performance

Unsung Today

Safari and system design, pt. 1

To me, “tap anywhere at the top to scroll to the beginning” is an amazing and underappreciated mobile gesture: It not only provides an alternative to desktop’s Home and ⌘↑ keys, but the student laps the teacher here; it’s actually better than every way to scroll to the top on desktop (do you like pressing ⌘↑? do you even have a Home key?), and it’s an icing on a cake of a regular flick to throw the page to the top already being pretty nice. Tap to return to top is also distinctively mobile in that it allows you to tap just anywhere near the top edge that’s not already a tap target; as far as I can observe, traditional GUIs detest being imprecise in this way, always asking you to click on something specific (although window moving on macOS in the post-title-bar era is also starting to feel similar). The iPhone gesture seemed to work so well that, over the years, more patterns started borrowing from it. In Bluesky and tons of other apps, you can tap on any tab with scrollable content a second time to scroll all the way to the top. (Again, something that’s hard to imagine on desktop, where you pretty much almost never think of clicking on an already-selected item.) It’s not just the top, either. In Podcasts, tapping Home goes back to the left: And in Photos, to the bottom: To me, the whole “tap to return to the beginning” gesture universe feels like it has ascended to be a core property of the interface. In that way, it is similar to scrolling, undo, copy/paste, arrow keys moving the text cursor, and so on, all inducted to the National Register Of Historic Gestures. Why? Because these gestures can only blossom if they work consistently, everywhere. You need to start trusting them so much they slide into your subconsciousness. Breaking the gesture in one place will make it less trustworthy in other places, too, ejecting it from motor memory back to the level of deliberate effort, and therefore making it a lot less usable. 
“Does this thing work here or not?” is a death knell of flow. The fact that tapping on tabs is idempotent means there’s also no penalty; if you’re already at the beginning but are not sure, tapping it mindlessly won’t hurt or send you back somewhere else. This is all great. And this is why I’m unhappy Safari started mucking with it. Safari has tabs at the bottom – starting with two (regular set and “private” set), although you can add more. Above is a long list of site cards, with newest at the bottom. It’s exactly the same situation as in Photos, and yet tapping on either tab doesn’t restore the scroll position. Instead, it opens the settings dialog: And, tapping around the buttons does nothing. I would imagine Safari is a pretty important app used by many people, and so this feels like a bad place to introduce an inconsistency that could have the more serious consequence of un-teaching people about tap to scroll to top in the long run. The funny thing is that the solution is already there: you can tap ··· in the upper left corner to get to the same functionality. The long press on the tab also opens the same menu. Messing with a “tap to go back to the beginning” system gesture like this means to me the design team doesn’t fully share the understanding of the value of their own creation, or maybe that stewards of the gesture system are not vigilant… or perhaps the awareness is there, but the caretakers aren’t recognized, rewarded, or empowered enough. It’s similar to the “no, thanks” example I shared before, a possible worrisome tragedy of the UX commons in the making if the respective teams do not change course. Because, wedging that sort of an exception in – even if you have a great set of reasons in the moment – creates a precedent. Inevitably, from my experience, the next team that will want to override scroll to top, or misuse “No, thanks,” will now require less of a justification. #definitions #details #flow #interface design #touch

Sean Goedecke Yesterday

How I use LLMs as a staff engineer in 2026

A bit over a year ago I wrote How I use LLMs as a staff engineer. Here’s a brief summary of what I used AI for last year:

- Smart autocomplete with Copilot
- Short tactical changes in areas I don’t know well (always reviewed by a SME)
- Writing lots of use-once-and-throwaway research code
- Asking lots of questions to learn about new topics (e.g. the Unity game engine)
- Last-resort bugfixes, just in case it can figure it out immediately
- Big-picture proofreading for long-form English communication

Here are some tasks I explicitly didn’t use AI for last year:

- Writing whole PRs for me in areas I’m familiar with
- Writing ADRs or other technical communications
- Research in large codebases and finding out how things are done

February 2025 was a long time ago. Back then the best model was the first reasoning model, OpenAI’s o1. Agents sort of worked, but would often get stuck or thrown off by compaction. What’s changed since then?

The biggest change is that I now use LLMs to produce entire PRs in areas I’m familiar with. A year ago I would very occasionally ask an agent to make changes to a single file if it was a simple change I couldn’t be bothered typing out. Sometimes I would copy a function I wrote into a LLM chat window for feedback. But now I start every single change by asking an agent to solve the problem, and usually push the PR after a single editing pass.

In late 2025 I used a lot of open VSCode windows. In early 2026, that changed to terminal tabs with the Copilot CLI, particularly when I needed to make changes across multiple repos at the same time. Now I use the GitHub Copilot app a lot (tens of sessions per day). This reflects a shift from having to line-edit the agent basically as it went to only doing an editing pass right at the end. Early agents would go wrong a lot and not be able to recover, so it was valuable to keep an eye on their thought processes and step in to pause them and set them right. In my experience, current agents move too fast to do this, and recover from their own mistakes most of the time anyway. Sometimes I don’t even need to make edits and I can just push the change as-is, though this is rare: if nothing else, I typically go through and remove some of the over-commenting and other LLM-isms.

I do a lot of skimming through and evaluating agent changes. Most of the time I reject them entirely, just based on “eh, that’s not what I was thinking”. On average it takes me about thirty seconds to make this initial assessment. If the change looks alright after that, I’ll dig in and do a proper review to make sure I understand it and it’s doing the right thing. For difficult tasks, I’ll often reject five or six (or more!) agent attempts before accepting one as good enough to work with, or giving up and making the change by hand.

I rely on LLMs even more for bug-hunting than I do for making changes. In 2025, I used to throw the occasional bug at a LLM, just in case it was able to rapidly come up with an explanation. Now I throw every bug at a LLM (typically by opening a new agent session and pasting in the bug report), because it’s able to correctly diagnose 80% of issues on its own. Current agents are really good at chasing down bugs, particularly when you give them a vantage point across multiple repositories. I’m still better at it. Just last week I had a tricky bug that took about fourteen agent sessions before one finally figured it out. What was I doing in between and around those sessions?

- Digging up extra context on the bug (from logs, Slack, etc) and reporting it to the agents
- Building my own mental model of the problem, of course
- Setting up my own reproduction of the bug (in parallel with the agents’ efforts)
- Responding to agent sessions with “no, your theory can’t be right because of X” (or just killing and restarting the session with that extra hint)

Ultimately an agent was the one to catch the bug. But I still count it as my find, because by that point I had narrowed the search space tightly enough that agent session #14 had a significantly easier problem to solve than agent session #1. In other words, human expertise still matters a lot for investigating bugs.

I almost always write my own PR descriptions, since LLMs over-communicate and are bad at expressing the “core idea” behind a change. Writing the PR description by hand also signals to reviewers that I’ve reviewed the change myself, and I’m not asking them to be the first human to read the diff. The only time when I don’t write the PR description is when the change is trivial and the agent-generated description is one sentence. At that point I just leave it alone.

I still don’t use LLMs to write Slack messages, ADRs, issues and so forth. I believe I have a better sense of what’s important to communicate, and I want to signal that there’s a human being thinking about the content. I still never use LLMs to write blog posts, though I do run each draft post through a LLM for feedback. OpenAI models used to be terrible at this and have only very recently gotten acceptable with GPT-5.5. Both OpenAI and Anthropic models still try to water down my arguments, but I’ve accepted that as part of the LLM “house style” and just ignore that part of the feedback.

Another thing I do now is try and push as much testing and setup work as possible onto the agents. In 2025, I used to sometimes ask a LLM to produce a test script of curl commands that I could run against my dev server. In 2026, I just ask an agent to go and test my change, then read the log of what it did. I don’t test UI work like this, partly because it’s more fiddly and partly because I don’t trust agents to be sensitive to the subtle look-and-feel aspects of a change. Agents will write expansive unit tests without having to be told, but I do sometimes ask them to put together broader integration tests for a change. In general I now consider test code to be cheap: if I’m wondering whether a test would be useful, I just add it (so long as I know it won’t be flaky). Of course LLMs sometimes produce strange and unsatisfying test code - I do read it to catch obvious blunders - but I review it with a more generous eye than my actual production code.

I’ll also task an agent with annoying local setup tasks that involve config wrangling on my machine. For instance, if my nvm installation is not switching my Node version correctly, I will often open a Copilot CLI agent and ask it to figure it out. This is a more-or-less direct replacement for Googling the problem, and is much quicker since the agent can run the trivial bash commands to diagnose and fix the problem itself.

The main thing that’s changed in the last fifteen months is that agents are really good now. They’ve gone from something I used occasionally and suspiciously to something I use constantly and with light supervision. The core of my job is still the same: shipping projects, exercising my judgement, influencing tech company politics. But I now have a much wider net for small pieces of work that I’m willing to take on, which includes basically anything I can hand off to an agent and expect it to get more or less right. I used to spend a lot of time putting work off, either by delegating it or just saying “sorry, I don’t have time to do that now”. Now I get to say “yes” a lot more (at least when it comes to minor low-risk tweaks) [1].

Overall, here’s what I now use AI for:

- Writing (or drafting, depending on complexity) every code change I make
- Investigating and fixing bugs, either autonomously for most bugs or with my close involvement for trickier ones
- Research in large codebases, since current agents are now good enough to give the right answer almost all the time (and when they’re wrong, it’s clear from reading the explanation that they’ve missed something)
- Manual testing and local-machine setup or troubleshooting
- I still use AI for asking lots of questions to learn about topics, and for proofreading

Here’s what I still don’t use AI for:

- Writing any kind of public communication for me (PR descriptions, ADRs, messages) with the exception of trivial two-line PRs
- Writing code that I don’t carefully review
- Testing any kind of UI

In my view, the current core AI skill is shifting as much work onto AI agents as possible, without going too far. Many people are under-utilizing agents: not allowing them to investigate bugs or test their changes, or not throwing enough simple tasks at them. Other people are over-utilizing them: using them to write messages that ought to be hand-written, or trusting them to make sweeping changes that need careful human review. Since my last post, the balance has tilted more towards the agents, but finding the balance remains as tricky as ever.

[1] For once I can actually give an example, since it’s in a public repository. Someone internal wanted to be able to use the actions/ai-inference GitHub Action with Copilot-backed inference (for various reasons), and instead of saying “sorry, I don’t have time to get to it”, I was able to throw it at an agent. If a human had to do this, the output would likely have been better, but it wouldn’t have gotten done for weeks (if at all).


The Applicability of Spaced Repetition

Spaced repetition has a natural domain of applicability: information that is systematically organized as an unambiguous key-value mapping with short keys and values. The “Hello, world!” of flashcards is the NATO phonetic alphabet : A → alpha, B → bravo, etc. Similarly, the periodic table can be thought of as defining a collection of mappings: element name ↔ symbol, element name ↔ atomic number, etc. You can just drill these cards and memorize the facts without a prior step of understanding, or building a conceptual model. Applying spaced repetition is trivial for this kind of information. That’s why most people who use spaced repetition are either language learners or medical students. In biology the main intuition you need is for “3D shapes bumping around in Brownian motion”, which comes free with your human brain, and afterwards it’s mostly just a lot of facts you have to memorize. Analogously with language: you already have a language center , you just need to drill vocabulary and grammar. And the further you go from this domain, the harder it is to apply spaced repetition. Highly conceptual knowledge, like math, is hard to encode. You have to spend a lot of time just understanding the information, and building a conceptual model in your head, and then you start writing flashcards to solidify that model, like taking tomographic cuts of some complex object. And coming up with questions that make good flashcards (short, unambiguous, etc.) out of this highly abstract knowledge is very hard. Often you have some deceptively simple fact, a simple assertion, but there’s no good way to encode it as a flashcard, so you have to encode “around it” by asking questions that assume or require that knowledge (e.g. asking why X is true), and hoping that in drilling those, your brain will remember the actual target. In general, relational facts are easier to encode, since a binary predicate like $\text{Property}(\text{Object}, \text{Value})$ readily becomes a question. 
“Caffeine is metabolized by cytochrome 1A2 ”, in Prolog , is $\text{Metabolism}(\text{Caffeine}, \text{CYP1A2})$, and becomes “Q: What is the cytochrome that metabolizes caffeine? A: 1A2”. But how do you encode stand-alone assertions like “all unitary matrices are invertible ”? You could encode that as a yes-or-no question, but that’s useless, because rationally you can expect such questions to be biased towards yes. Both “what is a property of unitary matrices?” and “what kinds of matrices are invertible?” are useless because they have hundreds of possible equally-valid answers, so they’re ambiguous. You have to be creative and find all kinds of tricks and stratagems to encode around the knowledge. Tangentially: this, I think, is why using AI to write flashcards is often misguided. In highly systematized domains, you don’t need AI in the first place, because there’s nothing for the AI to do except import a CSV into Anki. In domains that are highly conceptual and abstract, you’re not memorizing a set of objectively-knowable facts, you’re trying to solidify a private, internal mental model that you build by reading and thinking and solving problems. You can give the AI all kinds of general rules on how to write good flashcards, but the AI can’t look into your mind and know which facts are salient for you , what you already know, which micro-volumes of knowledge can be encoded lightly with just a few flashcards, and which things need more shoring up and consequently more coverage. Can this situation be improved, or is this just an intrinsic limitation of spaced repetition? I don’t know. But it seems reasonable to think some limited gains are possible. I think not a lot of people are using spaced repetition on these more “conceptual” domains, and (by the rule that most people in a community are lurkers ) even fewer of those people are writing, in detail, to share their knowledge. 
Plenty of people have written about how to write good flashcards in general; what I want to read is closer to case studies where someone sits down with a text (or, even better, a textbook) and describes the process by which they turned that text into flashcards, like this from Michael Nielsen. From a corpus of similar case studies we might derive general rules for, not how to write effective flashcards, but how to encode complex, conceptual knowledge into question-answer form.
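To make the "relational facts are easier to encode" point concrete, here is a minimal Python sketch of turning a binary predicate into a question-answer card. The `Fact` class, the `TEMPLATES` table and the `to_card` function are hypothetical names invented for this illustration, and the "Symbol" template is an assumed example of the periodic-table mapping. Real flashcard tooling will look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A relational fact of the form Predicate(Object, Value)."""
    predicate: str
    obj: str
    value: str


# One question template per predicate (hypothetical wording).
TEMPLATES = {
    "Metabolism": "What is the cytochrome that metabolizes {obj}?",
    "Symbol": "What is the chemical symbol for {obj}?",
}


def to_card(fact):
    """Turn a fact into a (question, answer) pair by hiding the value."""
    question = TEMPLATES[fact.predicate].format(obj=fact.obj)
    return question, fact.value


cards = [
    to_card(Fact("Metabolism", "caffeine", "1A2")),
    to_card(Fact("Symbol", "sodium", "Na")),
]
for question, answer in cards:
    print(f"Q: {question} A: {answer}")
```

A stand-alone assertion such as "all unitary matrices are invertible" has no slot to hide, which is why it does not fit this scheme and has to be encoded "around".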


named globs with curl

One of the established power features of the curl command line tool is its support for “globbing”. It is a built-in way to specify ranges and sets in different ways and have curl iterate over them to simplify repeated transfers. For example, you can easily download three images from the same host without having to repeat almost the same URL three times: Or if you have them in a numbered range, you can get a thousand images in a single tiny command line: And they can be combined in crazy ways: curl allows globs used in a single URL to create up to 2^63 permutations – which, if you can do one million transfers per second, would take 292 thousand years to complete. (As an added bonus you can of course also add --parallel to the command line to make curl transfer all those images in parallel rather than serially.) To help users save files when using globbing, curl provides a way to reference the globbed components using # followed by a number when setting the target filename. The number then references the specific glob, where the first is 1, the second 2 etc. Saving the one thousand images using different filenames locally than they use remotely: This allows a compact command line to also offer flexibility. All functionality mentioned above has existed in curl for years; decades even. It just so happened that one day when working with curl I fell over a use case that I could not solve with the existing command line functionality. I wanted to do a globbed upload to an HTTP server and then save all the separate responses into their own dedicated files, preferably with names based on the glob. I will admit that I at first had a hard time accepting the fact that we actually could not do this already, but that was then rather quickly instead turned into: how should I add support for this in the smoothest and most convenient way? Using what syntax? The road to fixing it for uploads took a little detour. 
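As a rough illustration of the classic, index-based globbing described above, here is a small Python sketch that expands a URL containing `{a,b,c}` sets and `[N-M]` numeric ranges, and fills `#1`, `#2`, ... references in an output filename template. It is only a model of the idea, not curl's implementation. The function name `expand` is made up, the example.com URLs are placeholders, and the sketch ignores things real curl supports, such as zero-padded ranges, step values and letter ranges. The iteration order is also an assumption.

```python
import itertools
import re

# Matches either a {set,of,alternatives} or a numeric [from-to] range.
_GLOB = re.compile(r"\{([^{}]*)\}|\[(\d+)-(\d+)\]")


def expand(url, output=None):
    """Return a list of (url, filename) pairs for every glob combination.

    `output` may contain #1, #2, ... which are replaced by the value of
    the first, second, ... glob. Assumes every #N refers to an existing
    glob. The filename is None when no output template is given.
    """
    literals, globs = [], []
    pos = 0
    for match in _GLOB.finditer(url):
        literals.append(url[pos:match.start()])
        if match.group(1) is not None:
            globs.append(match.group(1).split(","))
        else:
            first, last = int(match.group(2)), int(match.group(3))
            globs.append([str(n) for n in range(first, last + 1)])
        pos = match.end()
    literals.append(url[pos:])

    results = []
    for values in itertools.product(*globs):
        # Interleave the literal text with the chosen glob values.
        full = literals[0] + "".join(
            value + literal for value, literal in zip(values, literals[1:])
        )
        name = None
        if output is not None:
            name = re.sub(
                r"#(\d+)", lambda ref: values[int(ref.group(1)) - 1], output
            )
        results.append((full, name))
    return results


for url, name in expand("https://example.com/img[1-3].png", "local_#1.png"):
    print(url, "->", name)
```

The same function handles combined globs: a URL with one set and one range produces the full product of the two, which is where the very large permutation counts come from.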
Starting in 8.21.0, curl can assign a name to each glob and then reference that glob by name instead of using just a glob index number. This allows command lines to get ever so slightly more readable, I think. The image range example from above, but instead using named globs: Or a version with three separate globs where they all are used in the output file name: Slick, right? Back to the globbed upload challenge: … but with the responses saved in separate files instead of sent to stdout. Use named globs: The only way to refer to an upload glob is to set a name and refer to that name. There are no indexed references for uploads, only for URL globs. It is in fact possible to also use a mix of upload globs and URL globs in the same command line if you want to upload multiple files to multiple destinations. They set the names in the same namespace and you refer to the names the same way, independently of source. This feels more like a thing to show off in a blog post like this rather than something people will actually find good use for: Upload three files to three sites, save all nine responses in separate files:


Photo Journal - Day 7

I really enjoyed doing macro shots last time, so I did it again! To switch things up though, I swapped my Sony aIV frame with an old Nikon D5100 (my first DSLR). It was kind of a beast to work with. I used the same lens as last time, but the D5100 doesn't have focus peaking. It was an additional challenge going back to a crop sensor. The shots are from the same park as day 6, but during a rainstorm this time. There are very few things as wonderful as hiking through a forest at the end of a rain. The smells, the sound of birds coming out of hiding...it's magical. I ended up walking just over 3 miles and it was the most relaxed I've been in a while. Field macro photography is a fun challenge; it forces you to focus on the small, easily missed details around you. You have to balance aperture and light a lot more than usual. Capturing anything more than a tiny slice of detail requires more light, which is hard in the woods. Slowing shutter speed to compensate makes it near impossible to capture something like a spider web swaying in the breeze. When you do get the camera dialed in, the viewfinder reveals a miniature world ready for you to capture.


SQLAlchemy 2 In Practice - Chapter 8: SQLAlchemy and the Web

This is the eighth and final chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon. Thank you! Whether you are building a traditional web application, or a web API that works alongside a web front end or smartphone app, SQLAlchemy is one of the best choices to add database support to a Python web server. This chapter demonstrates two example integrations, one with Flask and one with FastAPI. These are two of the most popular Python web frameworks and should serve as examples even if you use another web framework.

Kev Quirk Yesterday

Is Bitwarden preparing for a sale?

by Jan-Lukas Else. Jan-Lukas writes about the warning signs that Bitwarden might be heading for a private equity sale. The irony is that the founder built Bitwarden because he didn't trust what happened when LastPass got acquired. I saw this on the fedi this morning and it made me let out a big sigh. I was an early adopter of Bitwarden, having used it for nearly 10 years at this point, after LastPass were acquired by LogMeIn. If this does come to fruition (I really hope it doesn't) I'm not sure what I'd do. My wife and I have a family account and share many credentials, so whatever I potentially flip to would need to be super simple to use, like Bitwarden. The fact that Bitwarden is so simple to use, yet so secure, is a testament to how good of a product it really is. So I'd rather not jump ship. In the Fast Company post that Jan-Lukas links to, there's a quote following an email from Bitwarden's "chief customer officer", Gary Orenstein, saying: Orenstein says via email that Bitwarden is not seeking a buyer, and that Sullivan’s [new CEO] appointment “reflects a continued focus at Bitwarden on scaling the business and serving customers globally.” That gives me some hope, but it could also be corporate bullshit - let's be honest, it wouldn't be the first time. I'm not going to make any rash decisions though. I get a tonne of use from Bitwarden, so I don't want to move unless I have to. Even if they are sold, I'd have to consider my options once I know who they've potentially been sold to. For now it's business as usual for me and my password manager.

Ahead of AI Yesterday

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

After a short family break, I am excited to be back and catching up on a busy few weeks of open-weight LLM releases. The thing that stood out to me is how much newer architectures are focused on long-context efficiency. As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints, and LLM developers are adding a growing number of architecture tricks to reduce those costs. The main examples I want to look at are KV sharing and per-layer embeddings in Gemma 4, layer-wise attention budgeting in Laguna XS.2, compressed convolutional attention in ZAYA1-8B, and mHC plus compressed attention in DeepSeek V4. Most of these changes look like small tweaks in my architecture diagrams, but some of them are quite intricate design changes that are worth a more detailed discussion. Figure 1. LLM architecture drawings of recent, major open-weight releases (April to May). You can find the images, and more details, in my LLM architecture gallery . Not all model sizes are shown; Qwen3.6 includes the 27B and 35B-A3B variants, and ZAYA1 is represented by the 8B model (omitting ZAYA1-base and ZAYA1-reasoning-base). The architectures in the dotted boxes are covered in more detail in this article. Note that this article is about architecture designs, so I will mostly skip dataset mixtures, training schedules, post-training details, RL recipes, benchmark tables, and product comparisons. Even with that narrower scope, there is a lot to cover. And, like always, the article turned out longer than I expected, so I will keep the focus on what changes inside the transformer block, residual stream, KV cache, or attention computation. Please also note that I am only covering those topics that are interesting (new) design choices and that I haven’t covered elsewhere, yet. 
This list includes:

- KV sharing and per-layer embeddings in Gemma 4
- Compressed convolutional attention in ZAYA1
- Attention budgeting in Laguna XS.2
- mHC and compressed attention in DeepSeek V4

Before getting into the new parts, here are the two previous articles I will refer back to. The first one gives a broader architecture background on recent MoE models, routed experts, active parameters, and model-size comparisons. The second one covers the attention background that comes up repeatedly below, including MHA, MQA, GQA, MLA, sliding-window attention, sparse attention, and hybrid attention designs. I also turned several of these explanations into short, standalone tutorial pages in the LLM Architecture Gallery. For example, readers can find compact explainers for GQA, MLA, sliding-window attention, DeepSeek Sparse Attention, MoE routing, and other concepts linked from the corresponding model cards and concept labels.

For this tour of architecture advances and tweaks, we will go back to the beginning of April when Google released their new open-weight Gemma 4 suite of models. They come in 3 broad categories:

- the Gemma 4 E2B and E4B models for mobile and small, local (embedded) devices (aka IoT),
- the Gemma 4 26B mixture-of-experts (MoE) model, optimized for efficient local inference, and
- the Gemma 4 31B dense model, for maximum quality and more convenient post-training (since MoEs are trickier to work with)

Figure 2: Gemma 4 architecture drawings.

The first small architecture tweak in the E2B and E4B variants is that they adopt a shared KV cache scheme, where later layers reuse key-value states from earlier layers to reduce long-context memory and compute. This KV-sharing was not invented by Gemma 4. For instance, see Brandon et al., “Reducing Transformer Key-Value Cache Size with Cross-Layer Attention” (NeurIPS 2024). But it’s the first popular architecture where I saw this concept applied. (Cross-layer attention is not to be confused with cross-attention.)
Before explaining KV-sharing further, let’s briefly talk about the motivation. As I wrote and talked about in recent months, one of the main recent themes in LLM architecture design is KV cache size reduction. In turn, the motivation behind KV cache size reduction is to reduce the required memory, which allows us to work with longer contexts, which is especially relevant in the age of reasoning models and agents. For more background on KV caching, see my “Understanding and Coding the KV Cache in LLMs from Scratch” article: Practically all of the popular attention variants I described in my previous A Visual Guide to Attention Variants in Modern LLMs article are designed to reduce the KV cache size: To pick a classic example (that Gemma 4 still uses): Grouped Query Attention (GQA) already shares key-value (KV) heads across different query heads to reduce the KV cache size, as illustrated in the figure below. Figure 3: Grouped Query Attention (GQA) shares the same key (K) and value (V) heads among multiple query (Q) heads. As mentioned before, Gemma 4 uses GQA. However, in addition to the KV sharing among queries as part of GQA, Gemma 4 also shares KV projections across different layers instead of computing it as part of the attention module in each layer. This KV-sharing scheme, also called cross-layer attention, is illustrated in the figure below. Figure 4: Regular transformer blocks compute separate Q, K, and V projections in each attention module (left). Cross-layer attention designs (right) share the same K and V projections across multiple layers. As briefly hinted at in the architecture overview in Figure 2, Gemma 4 E2B uses regular GQA and sliding window attention in a 4:1 pattern. (More precisely, Gemma 4 E2B uses MQA, which is the one-KV-head special case of GQA). In the case of GQA (or MQA), the KV-sharing works like this. 
Later layers no longer compute their own key and value projections but reuse the KV tensors from the most recent earlier non-shared layer of the same attention type. In other words, sliding-window layers share KV with a previous sliding-window layer. Full-attention layers share KV with a previous full-attention layer. The layers still compute their own query projections, so each layer can form its own attention pattern, but the expensive and memory-heavy KV cache is reused across several layers. For example, Gemma 4 E2B has 35 transformer layers, but only the first 15 compute their own KV projections; the final 20 layers reuse KV tensors from the most recent earlier non-shared layer of the same attention type. Similarly, Gemma 4 E4B has 42 layers, with 24 layers computing their own KV and the final 18 layers sharing them. How much does this actually save? Since we share roughly half of the KVs across layers, we save approximately half of the KV cache size. For the smallest E2B model, this results in a 2.7 GB saving (at bfloat16 precision) in long 128K contexts, as shown below. (For the E4B variant, this saves about 6 GB at 128K.) Figure 5: KV cache memory savings from GQA and cross-layer KV sharing in a Gemma 4 E2B-like setup. For simplicity, additional savings from sliding window attention are not shown. The downside of KV-sharing is, of course, that it’s an “approximation” of the real thing. Or, more precisely, it reduces model capacity. However, according to the cross-layer attention paper, the impact can be minimal (for small models that were tested). The Gemma 4 E2B and E4B variants include a second efficiency-oriented design choice called per-layer embeddings (PLE). This is separate from the KV-sharing scheme above. KV sharing reduces the KV cache. 
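As a brief aside on the KV-sharing rule and the savings estimate above, here is a minimal sketch of both. The layer counts (35 layers, the first 15 computing their own KV), the single KV head, and bfloat16 storage come from the text. The head dimension of 256, the exact position of the full-attention layers within the 4:1 pattern, and the function names are assumptions made for illustration and are not taken from the Gemma 4 config. Like Figure 5, the estimate ignores the extra savings from sliding-window attention.

```python
def kv_source_layers(layer_types, num_own_kv):
    """Map each layer index to the layer whose K/V tensors it attends over."""
    last_own = {}   # attention type -> most recent layer that computed its own K/V
    sources = []
    for idx, attn_type in enumerate(layer_types):
        if idx < num_own_kv:
            last_own[attn_type] = idx            # non-shared layer: computes its own K/V
            sources.append(idx)
        else:
            sources.append(last_own[attn_type])  # shared layer: reuse same-type K/V
    return sources


def kv_cache_bytes(num_kv_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """K and V cache size for the layers that store their own K/V (bfloat16 = 2 bytes)."""
    return 2 * num_kv_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed layout: four sliding-window layers followed by one full-attention layer.
layer_types = ["full" if (i + 1) % 5 == 0 else "sliding" for i in range(35)]
sources = kv_source_layers(layer_types, num_own_kv=15)

seq_len = 128 * 1024   # "128K" context
head_dim = 256         # assumed value, not stated in the text
full_cache = kv_cache_bytes(35, 1, head_dim, seq_len)
shared_cache = kv_cache_bytes(15, 1, head_dim, seq_len)
saved_gb = (full_cache - shared_cache) / 1e9

print(sources[15], sources[19])      # source layers for one sliding and one full layer
print(f"saved: {saved_gb:.2f} GB")
```

Under these assumptions, the 20 shared layers account for roughly 2.68 GB at 128K tokens, which is in line with the 2.7 GB figure quoted above.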
PLE is instead about parameter efficiency, where it lets the small Gemma 4 models use more token-specific information without making the main transformer stack as expensive as a dense model with the same total parameter count. For instance, the “E” in Gemma 4 E2B and E4B stands for “effective”. Concretely, Gemma 4 E2B is listed as 2.3B effective parameters, or 5.1B parameters when the embeddings are counted. (Similarly, Gemma 4 E4B is listed as 4.5B effective parameters, or 8B parameters with embeddings). In short, in the “E” models, the main transformer-stack compute is closer to the smaller number, while the larger number includes the additional embedding-table layers. (For an illustration of how embedding layers work, see my “ Understanding the Difference Between Embedding Layers and Linear Layers ” code notebook.) Conceptually, the new PLE path looks like this: Figure 6: Simplified Gemma 4 block with the PLE residual path. The normal block first computes the attention and feed-forward residual updates. The resulting hidden state gates the layer-specific PLE vector, and the projected PLE update is added as an extra residual update at the end of the block. The PLE vectors themselves are prepared outside the repeated transformer blocks. In simplified form, there are two inputs to the PLE construction. First, the token IDs go through a per-layer embedding lookup. Second, the normal token embeddings go through a linear projection into the same packed PLE space. These two pieces are added, scaled, and reshaped into a tensor with one slice per layer. Note that each block then receives its own slice. Figure 7: Simplified PLE construction. The token IDs provide a per-layer embedding lookup, while the normal token embeddings are projected into the same space. The two contributions are combined and reshaped so that each transformer block receives its own layer-specific PLE slice. 
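The PLE construction described above can be sketched in a few lines of plain Python. This is only an illustration of the data flow (lookup, projection, add, scale, reshape), not Gemma 4's implementation: the toy sizes, the random weights, the `1 / sqrt(2)` scale factor, and the names `build_ple_slices` and `matvec` are all assumptions.

```python
import math
import random


def matvec(weight, vec):
    """Multiply a (rows x cols) nested-list matrix with a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def build_ple_slices(token_ids, ple_table, tok_emb_table, proj_weight,
                     num_layers, ple_dim, scale):
    """Return slices[t][l], the PLE vector of token t for layer l."""
    slices = []
    for tok in token_ids:
        looked_up = ple_table[tok]                           # per-layer embedding lookup
        projected = matvec(proj_weight, tok_emb_table[tok])  # projected token embedding
        packed = [scale * (a + b) for a, b in zip(looked_up, projected)]
        # "reshape / select layer l": cut the packed vector into one slice per layer
        slices.append([packed[l * ple_dim:(l + 1) * ple_dim]
                       for l in range(num_layers)])
    return slices


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
vocab_size, emb_dim, num_layers, ple_dim = 10, 8, 3, 4   # toy sizes (assumed)
packed_dim = num_layers * ple_dim

ple_table = rand_matrix(vocab_size, packed_dim)      # one packed row per token ID
tok_emb_table = rand_matrix(vocab_size, emb_dim)     # normal token embeddings
proj_weight = rand_matrix(packed_dim, emb_dim)       # projection into the packed PLE space

token_ids = [1, 5, 1]
ple = build_ple_slices(token_ids, ple_table, tok_emb_table, proj_weight,
                       num_layers, ple_dim, scale=1 / math.sqrt(2))
print(len(ple), len(ple[0]), len(ple[0][0]))   # tokens, layers, PLE width
```

Note that the lookup and the projection happen once per token, outside the transformer blocks; each block would then only read its own `ple[t][l]` slice.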
The important detail is that PLE does not give each transformer block a full independent copy of the normal token embedding layer. Instead, the per-layer embedding lookup is computed once. Then, as mentioned before, it gives each layer a small token-specific embedding slice (via “reshape / select layer l”). So, for each input token, Gemma 4 prepares a packed PLE tensor that contains one small vector per decoder layer. Then, during the forward pass, layer l receives only its own slice (ple_l in the Gemma4WithPLEBlock in Figure 6). Inside the transformer block, the regular attention and feed-forward branches run as usual. First, the block computes the attention residual update. Then it computes the feed-forward residual update. After that second residual add, the resulting hidden state, which I denoted as z in the pseudocode in Figure 6, is used to gate the layer-specific PLE vector. The gated PLE vector is projected back to the model hidden size, normalized, and added as one extra residual update. So the useful mental model is that the transformer block still has the same main attention and feed-forward path, but Gemma 4 adds a small layer-specific token vector after the feed-forward branch. This increases representational capacity through embedding parameters and small projections. It adds some computational overhead but avoids the cost of scaling the entire transformer stack to the larger parameter count.
But why PLEs? The simpler alternative would be to make the dense model smaller, using fewer layers, narrower hidden states, or smaller feed-forward networks. That would reduce memory and latency, but it also removes capacity from the parts of the model that do the main computation. The PLE design keeps the expensive transformer blocks closer to the smaller “effective” size, while storing additional capacity in per-layer embedding tables.
These are much cheaper to use than adding more attention or FFN weights, since they are mainly lookup-style parameters that can be cached. Also, we have to take Google’s word here that this is an effective and worthwhile design choice. It would be interesting to see some comparison studies to see how this E2B design compares to a regular Gemma 4 2.3B model and a regular Gemma 4 5.1B model. Also, in principle, PLE is not inherently limited to small models. We could attach per-layer embedding slices to larger models, too. However, larger models already have sufficient capacity where these extra embeddings may not help that much. Also, for larger models, we already use MoE designs as a trick to increase capacity while keeping the compute footprint smaller. By the way, if you are interested in a relatively simple and readable code implementation, I implemented the Gemma 4 E2B and E4B models from scratch here . Figure 8: Snapshot of my Gemma 4 from-scratch implementation . Laguna is the first open-weight model by Poolside , a Europe-based company focused on training LLMs for coding applications. Several of my former colleagues joined Poolside in recent years, and they have a great team with lots of talent. It’s just nice to see more companies also releasing some of their models as open-weight variants. Anyways, the Laguna XS.2 architecture depicted below looks very standard at first glance. However, one detail that I didn’t show (/try to cram into there) is a concept we can refer to as “Layer-wise attention budgeting”. Figure 9: Poolside’s Laguna XS.2 architecture. Part of the idea behind the attention budgeting here is that instead of giving every transformer layer the same full attention budget, Laguna XS.2 varies the attention cost by layer. It has 40 layers total, with 30 sliding-window attention layers and 10 global/full attention layers. 
As usual, the sliding-window layers only attend over a local window (here: 512 tokens), which keeps the KV cache and attention computation cheaper. The global layers are more expensive but preserve the ability to access all information in the context window. This mixed sliding-window + global/full attention pattern is not unique to Laguna XS.2 and is used by many other architectures (including Gemma 4). But what’s new is the use of per-layer query-head counts. For instance, the Hugging Face model hub config.json includes a setting, so layers can have different numbers of query heads while keeping the KV cache shape compatible. Figure 10: Per-layer query-head budgeting in Laguna, where full attention layers use 6 query heads per KV head, and sliding window attention layers use 8 query heads per KV head. So Laguna XS.2 gives more query heads to sliding-window layers and fewer query heads to global layers, while keeping the KV heads fixed at 8. That is the actual layer-wise head budgeting in the config. Laguna XS.2 is one of the most prominent recent examples of this per-layer query-head budgeting in a production-style open model. But the broader idea of varying model capacity by layer goes back to (at least) Apple’s 2024 OpenELM . And again, what’s the point of such a design? Similar to KV-sharing, the point is to spend attention capacity where it is most useful, instead of giving every layer the same budget. Specifically, full-attention layers are expensive because they look across the whole context, so Laguna gives them fewer query heads compared to sliding window attention modules. (Besides, another smaller implementation detail is that Laguna also applies per-head attention-output gating; this is somewhat similar to Qwen3-Next and others, which I also omit here since I covered it in earlier articles.) Similar to Laguna, ZAYA1-8B is another new player on the open-weight market. 
It is developed by Zyphra , and one of the interesting details around the release is that the model was trained on AMD GPUs rather than the more common NVIDIA GPU (or Google TPU) setup. The main architecture detail, though, is Compressed Convolutional Attention (CCA), used together with grouped-query attention. Unlike MLA-style designs that mainly use a latent representation as a compact KV cache format, CCA performs the attention operation directly in the compressed latent space, but more on that later. (Sidenote: the ZAYA1-8B config.json lists 80 alternating layer entries rather than 40 conventional transformer blocks. These entries alternate between CCA/GQA attention and MoE feed-forward layers. But for the architecture figure, it is more convenient to visualize this as 40 repeated attention + MoE pairs, which is conceptually equivalent.) Figure 11: Zaya1 (8B) with transformer blocks featuring compressed convolutional attention. As hinted at in the figure above, ZAYA1-8B uses Compressed Convolutional Attention (CCA) together with a 4:1 GQA layout. The key point is that its attention block is built around CCA rather than a standard sliding-window attention block. What is Compressed Convolutional Attention? I would say CCA is related in spirit to Multi-head Latent Attention (MLA) in DeepSeek’s models, since both introduce a compressed latent representation into the attention block. However, they use that latent space differently. MLA mainly uses the latent representation to reduce the KV cache. In MLA, the KV tensors are stored compactly and then projected into the attention-head space for the actual attention computation. Figure 12: Regular Multi-head Attention (MHA) and Multi-head Latent (MLA) attention side by side. CCA compresses Q, K, and V and performs the attention operation directly in the compressed latent space. This is why CCA can reduce not only KV cache size, but also attention FLOPs during prefill and training. 
Figure 13: Multi-head Latent Attention (MLA) and Compressed Convolutional Attention (CCA) side by side.
As Figure 13 above illustrates, in CCA, the compressed, latent representations enter the attention mechanism directly, and the resulting compressed attention vector is then up-projected. Note that this is called Compressed Convolutional Attention, not just Compressed Attention, since there is an additional convolutional mixing happening on the latent K and Q representations. The convolutional mixing part is not shown in Figure 13, because it would have been too crammed, but it’s relatively straightforward. As hinted at in Figure 13, the convolutional mixing happens directly on the compressed Q and K tensors. The point is that compression makes Q, K, and V narrower, which saves compute and cache, but it can also make attention less expressive. The convolutions are a cheap way to give the compressed Q and K vectors more local context before they are used to compute attention scores. (The convolutional mixing is only applied to Q and K, not V, because Q and K determine the attention scores, while V represents the content that gets averaged via these scores.)
Figure 14: Conceptual overview of the sequence-mixing convolution.
Next to the sequence mixing shown in Figure 14, there is also a channel mixing component. It’s in principle similar though, so I am omitting the illustration. CCA appears to be a Zyphra-introduced attention mechanism that predates the ZAYA1-8B technical report. The standalone CCA paper, Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space, was first posted in October 2025 and explicitly introduces CCA. ZAYA1-8B then uses this mechanism as one of the core pieces. But the question is, “Is it better than MLA?” According to the CCA paper’s own experiments, yes: the authors report CCA outperforming MLA under comparable compression settings.
Figure 15: Annotated figures from the CCA paper, https://arxiv.org/abs/2510.04476 .
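To make the CCA data flow described in this section more tangible, below is a minimal single-head sketch in plain Python: Q, K, and V are down-projected, Q and K get a causal sequence-mixing convolution, attention runs in the compressed space, and the result is up-projected at the end. This is a simplified illustration and not Zyphra's implementation. The kernel taps, the toy dimensions, the shared-across-channels kernel, and all function names are assumptions, and the channel-mixing component and multi-head layout are left out.

```python
import math
import random


def matvec(weight, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def causal_conv(seq, kernel):
    """Mix each position with its predecessors; kernel[0] weights the current token."""
    dim = len(seq[0])
    out = []
    for t in range(len(seq)):
        mixed = [0.0] * dim
        for j, w in enumerate(kernel):
            if t - j >= 0:
                for c in range(dim):
                    mixed[c] += w * seq[t - j][c]
        out.append(mixed)
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compressed_conv_attention(x, w_q, w_k, w_v, w_up, kernel):
    """Single-head causal attention computed entirely in the compressed space."""
    q = causal_conv([matvec(w_q, tok) for tok in x], kernel)   # compress, then conv-mix
    k = causal_conv([matvec(w_k, tok) for tok in x], kernel)
    v = [matvec(w_v, tok) for tok in x]                        # compress only, no conv
    d = len(q[0])
    outputs = []
    for t in range(len(x)):
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        probs = softmax(scores)
        ctx = [sum(p * v[s][c] for s, p in enumerate(probs)) for c in range(d)]
        outputs.append(matvec(w_up, ctx))                      # up-project at the end
    return outputs


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
hidden, latent, seq_len = 8, 2, 5            # toy sizes (assumed)
x = rand_matrix(seq_len, hidden)
w_q, w_k, w_v = (rand_matrix(latent, hidden) for _ in range(3))
w_up = rand_matrix(hidden, latent)
kernel = [0.6, 0.3, 0.1]                     # hypothetical taps over 3 positions

out = compressed_conv_attention(x, w_q, w_k, w_v, w_up, kernel)
print(len(out), len(out[0]))                 # sequence length, hidden size
```

The attention scores and the weighted sum both operate on `latent`-sized vectors, which is where the FLOP savings relative to caching-only compression would come from in this toy setup.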
Overall, the interesting part here is really the new attention mechanism. The model also uses a pretty extreme (= very sparse) MoE setup, with only one routed expert active per token, but that part is more familiar. CCA is more unusual because it performs the attention operation directly in a compressed latent space, and then uses convolutional mixing on the compressed Q and K representations to make this compressed attention less limiting. So, in short, ZAYA1-8B is not only trying to save compute in the feed-forward layers, but also in the attention mechanism itself.
DeepSeek V4 was the biggest release of the year so far, both in terms of hype and model size. Interestingly, DeepSeek V4-Pro is also the most parameter-sparse MoE among the models in the plot below, measured by active-parameter share.
Figure 16: Percent active parameter plot for MoE models. You can also find an HTML version at https://sebastianraschka.com/llm-architecture-gallery/active-parameter-ratio/ .
Caveat: active parameter share is only one lens. It does not capture KV cache size, attention pattern, context length, routing overhead, hardware efficiency, or training quality. But it is a helpful, quick check when comparing sparse models. There’s a lot to say about DeepSeek V4, but since it’s been all over the news already, and to stay on topic regarding architecture tweaks, I will focus on the two most relevant parts that are new compared to previous architectures: mHC for a wider residual pathway, and CSA/HCA for long-context attention compression and sparsity.
Looking at the DeepSeek V4 architecture drawing below, there seems to be a lot going on. The useful way to read it is to separate the residual-path change, mHC, from the attention-path changes, CSA/HCA, and compressed attention caches.
Figure 17: DeepSeek V4-Pro architecture overview.
Let’s start with the mHC component of DeepSeek V4.
This goes back to a research paper that the DeepSeek team shared last year (31 Dec 2025, mHC: Manifold-Constrained Hyper-Connections ). However, in this paper, the technique was only tested on an experimental 27B scale model. Now, we see it in their flagship release, which is a good sign that this idea actually works well in production. The main idea behind mHC here is to modernize the design of the residual connections inside the transformer block, which is refreshing, because architecture tweaks are usually focused on the attention mechanism, normalization layer placement, and MoE parts. Now, mHC is based on previous work on hyper-connections (see Hyper-connections by Zhu et al., 2024), which we should briefly discuss first. Hyper-connections essentially modify the single residual stream inside the transformer block by replacing it with several parallel residual streams and learned mappings between them. (For those new to residual connections, I made a video on residual neural networks many years ago, where I explained the general mechanism.) The idea behind hyper-connections is to widen the residual stream. We can think of this as keeping several parallel residual streams, with an additional Res Mapping linear transformation that mixes them across layers. Since the Attention or MoE layer itself still operates on the normal hidden size, hyper-connections also add a Pre Mapping that combines the parallel residual streams into one normal hidden vector for the layer, and a Post Mapping that distributes the layer output back across the parallel residual streams. This is visually summarized in the figure below. Figure 18: Regular transformer block (top) vs transformer block with hyper-connections (bottom) using annotated figures from the mHC paper, https://arxiv.org/abs/2512.24880 . The figure below focuses on the attention-layer portion of the transformer block, but the same concept applies to the second residual branch around the MoE layer. 
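The Pre Mapping, Res Mapping, and Post Mapping described above can be written down as a small function. This is a minimal sketch assuming static (input-independent) mappings and a toy stand-in for the attention or MoE layer; the function names and the example weights are hypothetical and only meant to show how the three mappings interact.

```python
def hyper_connection_step(streams, h_pre, h_res, h_post, layer_fn):
    """One residual update over n parallel streams, each a vector of size d."""
    n, d = len(streams), len(streams[0])
    # Pre Mapping: combine the n streams into one normal hidden vector
    layer_in = [sum(h_pre[i] * streams[i][c] for i in range(n)) for c in range(d)]
    layer_out = layer_fn(layer_in)            # attention or MoE, at the normal width
    # Res Mapping: mix the streams with each other
    mixed = [[sum(h_res[i][j] * streams[j][c] for j in range(n)) for c in range(d)]
             for i in range(n)]
    # Post Mapping: distribute the layer output back across the streams
    return [[mixed[i][c] + h_post[i] * layer_out[c] for c in range(d)]
            for i in range(n)]


def toy_layer(vec):
    return [2.0 * v for v in vec]             # stand-in for attention / MoE


# n = 1 with identity mappings reduces to the usual residual update x + f(x)
single = hyper_connection_step([[1.0, 2.0]], [1.0], [[1.0]], [1.0], toy_layer)
print(single)  # [[3.0, 6.0]]

# n = 4 streams: the layer still sees a single vector of size d
streams = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
h_pre = [0.25, 0.25, 0.25, 0.25]
h_res = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
h_post = [1.0, 0.5, 0.5, 0.0]
wide = hyper_connection_step(streams, h_pre, h_res, h_post, toy_layer)
print(len(wide), len(wide[0]))
```

The extra work scales with the number of streams `n`, not with the hidden size squared, which is consistent with the small FLOP overhead discussed in this section.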
The purpose of hyper-connections is to make the residual pathway more expressive without making the actual Attention or MoE layer wider. This is only mildly more expensive in FLOPs because the extra mappings operate over the small residual-stream axis, for example, n = 4 in DeepSeek V4, not over a huge hidden dimension. In the original hyper-connections paper, the 7B OLMo MoE experiment goes from 13.36G to 13.38G FLOPs per token, which is basically unchanged. In terms of reported gains, there were modest (but consistent) improvements, as shown in the figure below. (That said, only looking at FLOPs is a bit simplistic. The widened residual state still has to be stored, moved through memory, mixed, etc. So the practical overhead can come more from memory traffic and implementation complexity than from arithmetic, and that overhead is not explicitly measured. Still, given that DeepSeek V4 is all about efficiency, it seems to be a worthwhile addition.)
Figure 19: Hyper-connections performance versus baseline, using an annotated figure from the hyper-connections paper, https://arxiv.org/abs/2409.19606 .
Also, as shown in the figure above, the hyper-connection models reached the baseline’s performance using roughly half the training tokens.
The main change from regular hyper-connections (HC) to manifold-constrained hyper-connections (mHC) is that the mappings are no longer left unconstrained. In regular HC, the Res Mapping is a learned matrix that mixes the parallel residual streams, but stacking many such matrices can amplify or shrink signals unpredictably. In mHC, this residual mapping is projected onto the manifold of doubly stochastic matrices, meaning all entries are non-negative and each row and column sums to 1. This makes the residual mixing behave more like a stable redistribution of information across streams. The Pre Mapping and Post Mapping are also constrained to be non-negative and bounded, which avoids cancellation when reading from and writing back into the widened residual state.
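To illustrate what “doubly stochastic” buys us, here is a small sketch that turns an unconstrained matrix into an (approximately) doubly stochastic one and then stacks two of them. It uses Sinkhorn-style alternating row and column normalization, which is one standard way to do this; the text above does not spell out the exact projection used in DeepSeek's implementation, so treat the procedure, the iteration count, and the example logits as assumptions.

```python
import math


def sinkhorn(logits, num_iters=50):
    """Alternately normalize rows and columns of exp(logits)."""
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]    # non-negative entries
    for _ in range(num_iters):
        m = [[v / sum(row) for v in row] for row in m]    # rows sum to 1
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]  # columns sum to 1
    return m


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical unconstrained Res Mapping parameters for n = 4 streams
logits_a = [[0.0, 1.0, -1.0, 0.5], [0.3, -0.2, 0.8, 0.0],
            [1.0, 0.0, 0.0, -0.5], [-0.7, 0.4, 0.2, 0.9]]
logits_b = [[0.5, -0.5, 0.0, 0.2], [0.0, 0.9, -0.3, 0.1],
            [-1.0, 0.6, 0.4, 0.0], [0.2, 0.0, 1.0, -0.8]]

res_a, res_b = sinkhorn(logits_a), sinkhorn(logits_b)
stacked = matmul(res_a, res_b)    # mixing applied across two consecutive layers

row_sums = [sum(row) for row in stacked]
col_sums = [sum(stacked[i][j] for i in range(4)) for j in range(4)]
print([round(s, 6) for s in row_sums])
print([round(s, 6) for s in col_sums])
```

The product of two doubly stochastic matrices is again doubly stochastic, so the row and column sums of `stacked` stay at 1 no matter how many layers are chained. That is the sense in which the constrained mixing redistributes information across streams instead of amplifying or shrinking it.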
In short, mHC keeps the richer residual mixing of HC, but adds constraints so it scales more safely, which becomes more relevant for larger (deeper) models. Otherwise, the main idea of using parallel residual streams remains, as shown in the figure below. Figure 20: Transformer block with hyper-connections (HC) and manifold-constrained hyper-connections (mHC) using annotated figures from the mHC paper, https://arxiv.org/abs/2512.24880 . In the mHC paper, using a 27B parameter model for the experiments, the DeepSeek team’s optimized implementation (with fusion, recomputation, and pipeline scheduling) adds only 6.7% additional training time overhead for 4 residual streams (n = 4) throughout all transformer blocks compared to the single-stream baseline. To sum up this section, HC/mHC changes how information is carried around these layers by replacing the single residual stream with several interacting residual streams, with the additional stability constraints added in mHC, while adding minimal compute overhead. Also, it pairs well with the CSA/HCA attention changes, which modify other parts of the transformer block, which I will discuss below. The other major DeepSeek V4 architecture change is on the attention side. Again, the motivation is that at very long context lengths, attention becomes expensive not only because of the attention score computation, but also because the KV cache grows with the sequence length. DeepSeek V4 addresses this issue with a hybrid of two compressed-attention mechanisms, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). For a refresher, I recommend checking out my previous “ A Visual Guide to Attention Variants in Modern LLMs ” article, which covers Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA), among others. The first thing to note is that CSA/HCA in DeepSeek V4 is a different kind of compression than the MLA-style compression used in DeepSeek V2/V3. 
Where MLA mainly compresses the per-token KV representation, CSA and HCA compress along the sequence dimension. So, instead of keeping one full (or compressed) KV entry for every previous token, they summarize groups of tokens into fewer compressed KV entries. Consequently, the cache gets shorter. DeepSeek V4 also uses compact compressed entries and shared-KV attention, but the main distinction from MLA is the sequence-length compression. This is illustrated in the figure below. Figure 21: Conceptual comparison of MLA-style per-token latent caching, CSA, and HCA. MLA compresses the stored KV representation but keeps one latent entry per token. CSA shortens the sequence more mildly with m=4 and sparse top-k selection, while HCA uses much heavier sequence compression with m’=128 and dense attention over the shorter cache. The quality tradeoff for CSA/HCA is also different from MLA. As shown in the figure above, MLA compresses the representation stored for each token, but it still keeps one latent KV entry per token. CSA and especially HCA go further by reducing the number of sequence entries themselves, so the model gives up some token-level info in exchange for much lower long-context cost. Again, it’s all about reducing long-context cost, but this trade-off can hurt modeling quality if the compression is too strong, which is why DeepSeek V4 does not rely on one compression scheme alone but alternates between CSA and HCA. CSA uses a milder compression rate and a DeepSeek Sparse Attention (DSA)-style selector, HCA uses much heavier compression for cheaper global coverage, and both keep a local sliding-window branch for recent uncompressed tokens. This sparse selection in CSA builds on DeepSeek Sparse Attention (DSA), which I discussed in more detail in my earlier DeepSeek V3.2 write-up . HCA is the more aggressive variant of the two. It compresses every 128 tokens into one compressed KV entry, but then uses dense attention over those heavily compressed entries. 
In other words, CSA keeps more details but uses sparse selection, while HCA keeps far fewer entries and can afford dense attention over them, as illustrated in the figure below. This makes the two mechanisms somewhat complementary, which is why DeepSeek V4 interleaves CSA and HCA layers rather than using only one of them.
Figure 22: CSA selects a sparse set of compressed history blocks, while HCA attends densely over more heavily compressed blocks. Both paths also include recent uncompressed KV entries through a 128-token sliding-window branch.
The DeepSeek V4 paper reports that, at a 1M-token context length, DeepSeek V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache size compared with DeepSeek V3.2, which uses MLA and DeepSeek Sparse Attention (DSA). DeepSeek V4-Flash goes even lower, at 10% of the FLOPs and 7% of the KV cache size relative to DeepSeek V3.2.
Figure 23: Reported 1M-context efficiency numbers from the DeepSeek V4 paper, relative to DeepSeek V3.2.
By the way, I would not describe CSA/HCA as “better” than MLA in a general sense. CSA/HCA is a more aggressive long-context design. And it’s also more complicated for sure. Unfortunately, there is no ablation study in the paper. That said, the paper reports strong overall modeling results, including DeepSeek V4-Flash-Base outperforming DeepSeek V3.2-Base on a majority of base-model benchmarks and strong 1M-token retrieval results. However, these results are for the full DeepSeek V4 recipe, which also includes better data, Muon-based optimization, mHC, precision/storage optimizations, and training/inference-system changes. Personally, for now, I would treat CSA/HCA as an efficiency-focused long-context design that appears to preserve modeling quality well in their large flagship model(s) but is not necessarily universally better than MLA.
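As a rough back-of-the-envelope check on the sequence-length compression discussed in this section, the sketch below counts cache entries (not bytes) for a per-token cache, CSA with m = 4, and HCA with m’ = 128, each with the 128-token sliding-window branch. The compression rates and window size come from the text; the `TOP_K` selector budget, the use of 2^20 tokens as a stand-in for “1M”, and the function names are assumptions for illustration only.

```python
def stored_entries(seq_len, compression, window=128):
    """Compressed history entries plus the recent uncompressed sliding window."""
    return seq_len // compression + min(window, seq_len)


def attended_entries(seq_len, compression, window=128, top_k=None):
    """Entries one query attends over; top_k=None means dense attention."""
    compressed = seq_len // compression
    if top_k is not None:
        compressed = min(compressed, top_k)   # sparse selection (CSA-style)
    return compressed + min(window, seq_len)


seq_len = 2 ** 20     # stand-in for the 1M-token setting
TOP_K = 1024          # hypothetical selector budget, not taken from the text

per_token = stored_entries(seq_len, 1, window=0)      # MLA-style: one entry per token
csa_stored = stored_entries(seq_len, 4)               # m = 4
hca_stored = stored_entries(seq_len, 128)             # m' = 128
csa_attended = attended_entries(seq_len, 4, top_k=TOP_K)
hca_attended = attended_entries(seq_len, 128)

print(per_token, csa_stored, hca_stored)   # 1048576 262272 8320
print(csa_attended, hca_attended)          # 1152 8320
```

The numbers show the complementary trade-off: CSA stores many more entries but each query only touches a small selected subset, while HCA stores so few entries that dense attention over all of them stays cheap.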
Overall, the interesting pattern this year is that most new open-weight models try to make long-context inference cheaper without just shrinking the model in terms of total parameters. For instance, Gemma 4 reduces KV-cache memory with cross-layer KV sharing and adds capacity via per-layer embeddings. Laguna XS.2 tweaks how much attention capacity each layer gets. ZAYA1-8B moves attention into a compressed latent space. DeepSeek V4 adds constrained residual-stream mixing and compressed long-context attention. All of these tweaks add more complexity, which seems to be where LLM architecture is going right now.
My main takeaway is that the transformer block is still changing, but in fairly targeted ways. The basic recipe is still based on the original GPT decoder-only transformer architecture, but many parts are upgraded or replaced, and they get more specialized for longer contexts and more efficient inference, whereas the qualitative modeling performance seems largely driven by data quality (and quantity) and training recipes. A question many of you have asked me in the past is when (or if) transformers will be replaced with something else. Of course, there are other designs like diffusion models, but transformers remain the status quo for state-of-the-art architecture releases. However, with each passing quarter, we get more and more tweaks. While it was possible to implement a basic transformer block in perhaps 50-100 lines of PyTorch code, these tweaks (esp. around the attention variants) probably 10x the code complexity. This is not an inherently bad thing as these tweaks reduce (not increase) runtime costs. However, it’s becoming increasingly difficult to gain a clear understanding of the individual components and their interactions.
Figure 24: The evolution from GPT-2 (2019) to DeepSeek V4-Pro (2026).
For instance, I am fairly certain that someone who is diving into LLM architectures for the first time will be totally overwhelmed when seeing the DeepSeek V4 source code. However, by starting with the original decoder-style LLM (GPT/GPT-2) and then gradually adding / learning about these new components one at a time, we can keep the learning effort manageable. The moral of the story, I guess, is to keep learning, one architecture at a time :).
By the way, I am very excited to share that I finished writing Build a Reasoning Model (From Scratch) and all chapters are in early access now. The publisher and I worked hard on the final layouts in the past month, and it’s going to be sent to the printer this week. (Good news: the print version will be in color this time!) This is probably my most ambitious book so far. I spent about 1.5 years writing it, and a large number of experiments went into it. It is also probably the book I worked hardest on in terms of time, effort, and polish, and I hope you’ll enjoy it. Build a Reasoning Model (From Scratch) on Manning and Amazon. The main topics are evaluating reasoning models, inference-time scaling, self-refinement, reinforcement learning, and distillation. There is a lot of discussion around “reasoning” in LLMs, and I think the best way to understand what it really means in the context of LLMs is to implement one from scratch! Amazon (pre-order of Kindle ebook and print paperback); Manning (complete book in early access, pre-final layout, 528 pages).
Figure 1: LLM architecture drawings of recent, major open-weight releases (April to May). You can find the images, and more details, in my LLM architecture gallery. Not all model sizes are shown; Qwen3.6 includes the 27B and 35B-A3B variants, and ZAYA1 is represented by the 8B model (omitting ZAYA1-base and ZAYA1-reasoning-base). The architectures in the dotted boxes are covered in more detail in this article.
Note that this article is about architecture designs, so I will mostly skip dataset mixtures, training schedules, post-training details, RL recipes, benchmark tables, and product comparisons. Even with that narrower scope, there is a lot to cover. And, like always, the article turned out longer than I expected, so I will keep the focus on what changes inside the transformer block, residual stream, KV cache, or attention computation. Please also note that I am only covering those topics that are interesting (new) design choices and that I haven’t covered elsewhere, yet. This list includes: KV sharing and per-layer embeddings in Gemma 4 Compressed convolutional attention in ZAYA1 Attention budgeting in Laguna XS.2 mHC and compressed attention in DeepSeek V4 the Gemma 4 E2B and E4B models for mobile and small, local (embedded) devices (aka IoT), the Gemma 4 26B mixture-of-experts (MoE) model, optimized for efficient local inference, and the Gemma 4 31B dense model, for maximum quality and more convenient post-training (since MoEs are trickier to work with) Figure 2: Gemma 4 architecture drawings. The first small architecture tweak in the E2B and E4B variants is that they adopt a shared KV cache scheme, where later layers reuse key-value states from earlier layers to reduce long-context memory and compute. This KV-sharing was not invented by Gemma 4. For instance, see Brandon et al. , “ Reducing Transformer Key-Value Cache Size with Cross-Layer Attention ” (NeurIPS 2024). But it’s the first popular architecture where I saw this concept applied. (Cross-layer attention is not to be confused with cross-attention .) Before explaining KV-sharing further, let’s briefly talk about the motivation. As I wrote and talked about in recent months, one of the main recent themes in LLM architecture design is KV cache size reduction. 
In turn, the motivation behind KV cache size reduction is to reduce the required memory, which allows us to work with longer contexts, which is especially relevant in the age of reasoning models and agents. For more background on KV caching, see my “Understanding and Coding the KV Cache in LLMs from Scratch” article: Practically all of the popular attention variants I described in my previous A Visual Guide to Attention Variants in Modern LLMs article are designed to reduce the KV cache size: To pick a classic example (that Gemma 4 still uses): Grouped Query Attention (GQA) already shares key-value (KV) heads across different query heads to reduce the KV cache size, as illustrated in the figure below. Figure 3: Grouped Query Attention (GQA) shares the same key (K) and value (V) heads among multiple query (Q) heads. As mentioned before, Gemma 4 uses GQA. However, in addition to the KV sharing among queries as part of GQA, Gemma 4 also shares KV projections across different layers instead of computing it as part of the attention module in each layer. This KV-sharing scheme, also called cross-layer attention, is illustrated in the figure below. Figure 4: Regular transformer blocks compute separate Q, K, and V projections in each attention module (left). Cross-layer attention designs (right) share the same K and V projections across multiple layers. As briefly hinted at in the architecture overview in Figure 2, Gemma 4 E2B uses regular GQA and sliding window attention in a 4:1 pattern. (More precisely, Gemma 4 E2B uses MQA, which is the one-KV-head special case of GQA). In the case of GQA (or MQA), the KV-sharing works like this. Later layers no longer compute their own key and value projections but reuse the KV tensors from the most recent earlier non-shared layer of the same attention type. In other words, sliding-window layers share KV with a previous sliding-window layer. Full-attention layers share KV with a previous full-attention layer. 
The layers still compute their own query projections, so each layer can form its own attention pattern, but the expensive and memory-heavy KV cache is reused across several layers. For example, Gemma 4 E2B has 35 transformer layers, but only the first 15 compute their own KV projections; the final 20 layers reuse KV tensors from the most recent earlier non-shared layer of the same attention type. Similarly, Gemma 4 E4B has 42 layers, with 24 layers computing their own KV and the final 18 layers sharing them. How much does this actually save? Since we share roughly half of the KVs across layers, we save approximately half of the KV cache size. For the smallest E2B model, this results in a 2.7 GB saving (at bfloat16 precision) in long 128K contexts, as shown below. (For the E4B variant, this saves about 6 GB at 128K.) Figure 5: KV cache memory savings from GQA and cross-layer KV sharing in a Gemma 4 E2B-like setup. For simplicity, additional savings from sliding window attention are not shown. The downside of KV-sharing is, of course, that it’s an “approximation” of the real thing. Or, more precisely, it reduces model capacity. However, according to the cross-layer attention paper, the impact can be minimal (for small models that were tested). 2. Per-Layer Embeddings and “Effective” Size (Gemma 4 E2B/E4B) The Gemma 4 E2B and E4B variants include a second efficiency-oriented design choice called per-layer embeddings (PLE). This is separate from the KV-sharing scheme above. KV sharing reduces the KV cache. PLE is instead about parameter efficiency, where it lets the small Gemma 4 models use more token-specific information without making the main transformer stack as expensive as a dense model with the same total parameter count. For instance, the “E” in Gemma 4 E2B and E4B stands for “effective”. Concretely, Gemma 4 E2B is listed as 2.3B effective parameters, or 5.1B parameters when the embeddings are counted. 
(Similarly, Gemma 4 E4B is listed as 4.5B effective parameters, or 8B parameters with embeddings.) In short, in the “E” models, the main transformer-stack compute is closer to the smaller number, while the larger number includes the additional embedding-table layers. (For an illustration of how embedding layers work, see my “Understanding the Difference Between Embedding Layers and Linear Layers” code notebook.) Conceptually, the new PLE path looks like this:

Figure 6: Simplified Gemma 4 block with the PLE residual path. The normal block first computes the attention and feed-forward residual updates. The resulting hidden state gates the layer-specific PLE vector, and the projected PLE update is added as an extra residual update at the end of the block.

The PLE vectors themselves are prepared outside the repeated transformer blocks. In simplified form, there are two inputs to the PLE construction. First, the token IDs go through a per-layer embedding lookup. Second, the normal token embeddings go through a linear projection into the same packed PLE space. These two pieces are added, scaled, and reshaped into a tensor with one slice per layer. Note that each block then receives its own slice.

Figure 7: Simplified PLE construction. The token IDs provide a per-layer embedding lookup, while the normal token embeddings are projected into the same space. The two contributions are combined and reshaped so that each transformer block receives its own layer-specific PLE slice.

The important detail is that PLE does not give each transformer block a full independent copy of the normal token embedding layer. Instead, the per-layer embedding lookup is computed once. Then, as mentioned before, it gives each layer a small token-specific embedding slice (via “reshape / select layer l”). So, for each input token, Gemma 4 prepares a packed PLE tensor that contains one small vector per decoder layer.
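To make the two steps concrete, here is a minimal NumPy sketch of the PLE construction and the extra gated residual update. It is an illustration under stated assumptions, not the Gemma 4 code: all sizes are toy values, the weights are random, and the `1/sqrt(2)` scale, the sigmoid gate, and the helper names (`build_ple`, `ple_residual`, `rms_norm`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, not the real Gemma 4 configuration.
vocab_size, d_model, n_layers, d_ple = 50, 16, 3, 4
seq_len = 5

# Parameters (randomly initialized for illustration).
tok_emb = rng.normal(size=(vocab_size, d_model))            # normal token embeddings
ple_emb = rng.normal(size=(vocab_size, n_layers * d_ple))   # per-layer embedding table
ple_proj = rng.normal(size=(d_model, n_layers * d_ple))     # projection into packed PLE space


def build_ple(token_ids):
    """Return the packed PLE tensor of shape (seq_len, n_layers, d_ple)."""
    lookup = ple_emb[token_ids]                    # (seq_len, n_layers * d_ple)
    projected = tok_emb[token_ids] @ ple_proj      # (seq_len, n_layers * d_ple)
    packed = (lookup + projected) / np.sqrt(2.0)   # add and scale (scale is an assumption)
    return packed.reshape(len(token_ids), n_layers, d_ple)


def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)


def ple_residual(z, ple_l, w_gate, w_out):
    """Extra residual update applied after the attention and feed-forward updates.

    z:     hidden state after both residual adds, (seq_len, d_model)
    ple_l: this layer's PLE slice, (seq_len, d_ple)
    """
    gate = 1.0 / (1.0 + np.exp(-(z @ w_gate)))   # sigmoid gate (an assumption)
    update = (gate * ple_l) @ w_out              # project back to d_model
    return z + rms_norm(update)                  # normalize, then add as residual


token_ids = np.array([3, 7, 7, 1, 42])
ple = build_ple(token_ids)                       # computed once, outside the blocks
print(ple.shape)  # (5, 3, 4)

z = rng.normal(size=(seq_len, d_model))          # stand-in for a block's hidden state
w_gate = rng.normal(size=(d_model, d_ple))
w_out = rng.normal(size=(d_ple, d_model))
layer = 1
out = ple_residual(z, ple[:, layer, :], w_gate, w_out)  # layer 1 sees only its own slice
print(out.shape)  # (5, 16)
```

Note that the packed tensor depends only on the token IDs, so repeated tokens (the two 7s above) get identical PLE vectors, which is what makes these parameters lookup-like.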
Then, during the forward pass, layer l receives only its own slice (ple_l in the Gemma4WithPLEBlock in figure 6). Inside the transformer block, the regular attention and feed-forward branches run as usual. First, the block computes the attention residual update. Then it computes the feed-forward residual update. After that second residual add, the resulting hidden state, which I denoted as z in the pseudocode in figure 6, is used to gate the layer-specific PLE vector. The gated PLE vector is projected back to the model hidden size, normalized, and added as one extra residual update. So the useful mental model is that the transformer block still has the same main attention and feed-forward path, but Gemma 4 adds a small layer-specific token vector after the feed-forward branch. This increases representational capacity through embedding parameters and small projections. This adds computational overhead but avoids the cost of scaling the entire transformer stack to the larger parameter count. But why PLEs? The simpler alternative would be to make the dense model smaller, using fewer layers, narrower hidden states, or smaller feed-forward networks. That would reduce memory and latency, but it also removes capacity from the parts of the model that do the main computation. The PLE design keeps the expensive transformer blocks closer to the smaller “effective” size, while storing additional capacity in per-layer embedding tables. These are much cheaper to use than adding more attention or FFN weights, since they are mainly lookup-style parameters that can be cached. Also, we have to take Google’s word here that this is an effective and worthwhile design choice. It would be interesting to see some comparison studies to see how this E2B design compares to a regular Gemma 4 2.3B model and a regular Gemma 4 5.1B model. Also, in principle, PLE is not inherently limited to small models. We could attach per-layer embedding slices to larger models, too. 
However, larger models already have sufficient capacity where these extra embeddings may not help that much. Also, for larger models, we already use MoE designs as a trick to increase capacity while keeping the compute footprint smaller. By the way, if you are interested in a relatively simple and readable code implementation, I implemented the Gemma 4 E2B and E4B models from scratch here . Figure 8: Snapshot of my Gemma 4 from-scratch implementation . 3. Layer-Wise Attention Budgeting (Laguna XS.2) Laguna is the first open-weight model by Poolside , a Europe-based company focused on training LLMs for coding applications. Several of my former colleagues joined Poolside in recent years, and they have a great team with lots of talent. It’s just nice to see more companies also releasing some of their models as open-weight variants. Anyways, the Laguna XS.2 architecture depicted below looks very standard at first glance. However, one detail that I didn’t show (/try to cram into there) is a concept we can refer to as “Layer-wise attention budgeting”. Figure 9: Poolside’s Laguna XS.2 architecture. Part of the idea behind the attention budgeting here is that instead of giving every transformer layer the same full attention budget, Laguna XS.2 varies the attention cost by layer. It has 40 layers total, with 30 sliding-window attention layers and 10 global/full attention layers. As usual, the sliding-window layers only attend over a local window (here: 512 tokens), which keeps the KV cache and attention computation cheaper. The global layers are more expensive but preserve the ability to access all information in the context window. This mixed sliding-window + global/full attention pattern is not unique to Laguna XS.2 and is used by many other architectures (including Gemma 4). But what’s new is the use of per-layer query-head counts. 
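A per-layer query-head schedule of this kind can be sketched in a few lines. The layer count, the 30/10 split, and the 8 KV heads are taken from the text, and the 6 and 8 query heads per KV head come from the Figure 10 caption. The position of the global layers (every fourth layer) is an assumption for illustration only.

```python
N_LAYERS = 40
N_KV_HEADS = 8                          # fixed across all layers
Q_PER_KV = {"sliding": 8, "full": 6}    # query heads per KV head

# Assumption: every fourth layer is a global/full attention layer (30 + 10).
layer_types = ["full" if (i + 1) % 4 == 0 else "sliding" for i in range(N_LAYERS)]

# Query-head count per layer; the KV head count stays the same everywhere.
q_heads_per_layer = [N_KV_HEADS * Q_PER_KV[t] for t in layer_types]

print(layer_types.count("sliding"), layer_types.count("full"))  # 30 10
print(sorted(set(q_heads_per_layer)))                            # [48, 64]

total_q_heads = sum(q_heads_per_layer)
uniform_q_heads = N_LAYERS * N_KV_HEADS * Q_PER_KV["sliding"]
print(total_q_heads, uniform_q_heads)                            # 2400 2560
```

Because only the query-head count changes, the K and V tensors keep the same shape in every layer.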
For instance, the Hugging Face model hub config.json includes a setting, so layers can have different numbers of query heads while keeping the KV cache shape compatible. Figure 10: Per-layer query-head budgeting in Laguna, where full attention layers use 6 query heads per KV head, and sliding window attention layers use 8 query heads per KV head. So Laguna XS.2 gives more query heads to sliding-window layers and fewer query heads to global layers, while keeping the KV heads fixed at 8. That is the actual layer-wise head budgeting in the config. Laguna XS.2 is one of the most prominent recent examples of this per-layer query-head budgeting in a production-style open model. But the broader idea of varying model capacity by layer goes back to (at least) Apple’s 2024 OpenELM . And again, what’s the point of such a design? Similar to KV-sharing, the point is to spend attention capacity where it is most useful, instead of giving every layer the same budget. Specifically, full-attention layers are expensive because they look across the whole context, so Laguna gives them fewer query heads compared to sliding window attention modules. (Besides, another smaller implementation detail is that Laguna also applies per-head attention-output gating; this is somewhat similar to Qwen3-Next and others, which I also omit here since I covered it in earlier articles.) 4. Compressed Convolutional Attention (ZAYA1-8B) Similar to Laguna, ZAYA1-8B is another new player on the open-weight market. It is developed by Zyphra , and one of the interesting details around the release is that the model was trained on AMD GPUs rather than the more common NVIDIA GPU (or Google TPU) setup. The main architecture detail, though, is Compressed Convolutional Attention (CCA), used together with grouped-query attention. 
Unlike MLA-style designs that mainly use a latent representation as a compact KV cache format, CCA performs the attention operation directly in the compressed latent space, but more on that later. (Sidenote: the ZAYA1-8B config.json lists 80 alternating layer entries rather than 40 conventional transformer blocks. These entries alternate between CCA/GQA attention and MoE feed-forward layers. But for the architecture figure, it is more convenient to visualize this as 40 repeated attention + MoE pairs, which is conceptually equivalent.) Figure 11: Zaya1 (8B) with transformer blocks featuring compressed convolutional attention. As hinted at in the figure above, ZAYA1-8B uses Compressed Convolutional Attention (CCA) together with a 4:1 GQA layout. The key point is that its attention block is built around CCA rather than a standard sliding-window attention block. What is Compressed Convolutional Attention? I would say CCA is related in spirit to Multi-head Latent Attention (MLA) in DeepSeek’s models, since both introduce a compressed latent representation into the attention block. However, they use that latent space differently. MLA mainly uses the latent representation to reduce the KV cache. In MLA, the KV tensors are stored compactly and then projected into the attention-head space for the actual attention computation. Figure 12: Regular Multi-head Attention (MHA) and Multi-head Latent (MLA) attention side by side. CCA compresses Q, K, and V and performs the attention operation directly in the compressed latent space. This is why CCA can reduce not only KV cache size, but also attention FLOPs during prefill and training. Figure 13: Multi-head Latent Attention (MLA) and Compressed Convolutional Attention (CCA) side by side. As Figure 13 above illustrates, in CCA, the compressed, latent representations enter the attention mechanism directly, and the resulting compressed attention vector is then up-projected. 
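The core idea, attention computed entirely in the narrow latent space followed by an up-projection, can be sketched as follows. This is a simplified NumPy illustration with hypothetical toy sizes and random weights. It deliberately leaves out the convolutional mixing, and it is not Zyphra's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes.
seq_len, d_model, d_latent, n_heads = 6, 32, 8, 2
head_dim = d_latent // n_heads

x = rng.normal(size=(seq_len, d_model))

# Down-projections into the compressed latent space for Q, K, and V.
w_q = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
w_k = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
w_v = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
w_up = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)


def split_heads(t):
    # (seq_len, d_latent) -> (n_heads, seq_len, head_dim)
    return t.reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)


q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))

# Attention scores are computed directly on the compressed tensors.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)        # (n_heads, seq_len, seq_len)
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(causal_mask, -np.inf, scores)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The compressed attention output is up-projected only at the end.
ctx_latent = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_latent)
out = ctx_latent @ w_up

print(k.shape, out.shape)                      # (2, 6, 4) (6, 32)
print(f"cached width per token vs. full width: {d_latent / d_model:.2f}")
```

Since K and V are stored at the latent width, and the score and weighted-sum matmuls also run at that width, both the cache and the attention FLOPs shrink in this sketch.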
Note that this is called Compressed Convolutional Attention, not just Compressed Attention, since there is an additional convolutional mixing happening on the latent K and Q representations. The convolutional mixing part is not shown in Figure 13, because it would have been too crammed, but it’s relatively straightforward. As hinted at in Figure 13, the convolutional mixing happens directly on the compressed Q and K tensors. The point is that compression makes Q, K, and V narrower, which saves compute and cache, but it can also make attention less expressive. The convolutions are a cheap way to give the compressed Q and K vectors more local context before they are used to compute attention scores. (The convolutional mixing is only applied to Q and K, not V, because Q and K determine the attention scores, while V represents the content that gets averaged via these scores.)

Figure 14: Conceptual overview of the sequence-mixing convolution.

Next to the sequence mixing shown in Figure 14, there is also a channel mixing component. It’s in principle similar though, so I am omitting the illustration. CCA appears to be a Zyphra-introduced attention mechanism that predates the ZAYA1-8B technical report. The standalone CCA paper, Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space, was first posted in October 2025 and explicitly introduces CCA. ZAYA1-8B then uses this mechanism as one of the core pieces. But the question is: is it better than MLA? According to the CCA paper’s own experiments, yes: the authors report CCA outperforming MLA under comparable compression settings.

Figure 15: Annotated figures from the CCA paper, https://arxiv.org/abs/2510.04476.

Overall, the interesting part here is really the new attention mechanism. The model also uses a pretty extreme (= very sparse) MoE setup, with only one routed expert active per token, but that part is more familiar.
CCA is more unusual because it performs the attention operation directly in a compressed latent space, and uses convolutional mixing on the compressed Q and K representations to make this compressed attention less limiting. So, in short, ZAYA1-8B is not only trying to save compute in the feed-forward layers, but also in the attention mechanism itself.

5. CSA/HCA, mHC, and Compressed Attention Caches (DeepSeek V4)

DeepSeek V4 was the biggest release of the year so far, both in terms of hype and model size. Interestingly, DeepSeek V4-Pro is also the most parameter-sparse MoE among the models in the figure below, measured by active-parameter share.

Figure 16: Percent active parameter plot for MoE models. You can also find an HTML version at https://sebastianraschka.com/llm-architecture-gallery/active-parameter-ratio/.

Caveat: active parameter share is only one lens. It does not capture KV cache size, attention pattern, context length, routing overhead, hardware efficiency, or training quality. But it is a helpful, quick check when comparing sparse models. There’s a lot to say about DeepSeek V4, but since it’s been all over the news already, and to stay on topic regarding architecture tweaks, I will focus on the two most relevant parts that are new compared to previous architectures:

mHC for a wider residual pathway,
CSA/HCA for long-context attention compression and sparsity

Figure 17: DeepSeek V4-Pro architecture overview.

5.1 Manifold-Constrained Hyper-Connections (mHC)

Let’s start with the mHC component of DeepSeek V4. This goes back to a research paper that the DeepSeek team shared last year (31 Dec 2025, mHC: Manifold-Constrained Hyper-Connections). However, in this paper, the technique was only tested on an experimental 27B scale model. Now, we see it in their flagship release, which is a good sign that this idea actually works well in production.
The main idea behind mHC here is to modernize the design of the residual connections inside the transformer block, which is refreshing, because architecture tweaks are usually focused on the attention mechanism, normalization layer placement, and MoE parts. Now, mHC is based on previous work on hyper-connections (see Hyper-connections by Zhu et al., 2024), which we should briefly discuss first. Hyper-connections essentially modify the single residual stream inside the transformer block by replacing it with several parallel residual streams and learned mappings between them. (For those new to residual connections, I made a video on residual neural networks many years ago, where I explained the general mechanism.) The idea behind hyper-connections is to widen the residual stream. We can think of this as keeping several parallel residual streams, with an additional Res Mapping linear transformation that mixes them across layers. Since the Attention or MoE layer itself still operates on the normal hidden size, hyper-connections also add a Pre Mapping that combines the parallel residual streams into one normal hidden vector for the layer, and a Post Mapping that distributes the layer output back across the parallel residual streams. This is visually summarized in the figure below. Figure 18: Regular transformer block (top) vs transformer block with hyper-connections (bottom) using annotated figures from the mHC paper, https://arxiv.org/abs/2512.24880 . The figure below focuses on the attention-layer portion of the transformer block, but the same concept applies to the second residual branch around the MoE layer. The purpose of hyper-connections is to make the residual pathway more expressive without making the actual Attention or MoE layer wider. This is only mildly more expensive in FLOPs because the extra mappings operate over the small residual-stream axis, for example, n = 4 in DeepSeek V4, not over a huge hidden dimension. 
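The three mappings can be sketched with a few einsum calls. The following NumPy example is a simplified illustration, not the paper's implementation: the mappings are static random vectors and matrices here, `layer_fn` is a hypothetical stand-in for the attention or MoE layer, and all sizes except n = 4 are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_streams, seq_len, d_model = 4, 5, 16   # n = 4 as in the text; the rest are toy sizes

# Widened residual state: n parallel streams of hidden vectors.
streams = rng.normal(size=(n_streams, seq_len, d_model))

# Mappings (random here); all of them act on the small stream axis only.
pre_map = rng.normal(size=(n_streams,))             # streams -> one layer input
post_map = rng.normal(size=(n_streams,))            # layer output -> streams
res_map = rng.normal(size=(n_streams, n_streams))   # stream-to-stream mixing

w_layer = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)


def layer_fn(h):
    """Stand-in for the attention or MoE layer; works on the normal hidden size."""
    return np.tanh(h @ w_layer)


def hyper_connection_step(streams, pre_map, post_map, res_map, layer_fn):
    layer_in = np.einsum("n,ntd->td", pre_map, streams)         # Pre Mapping
    layer_out = layer_fn(layer_in)                              # (seq_len, d_model)
    mixed = np.einsum("mn,ntd->mtd", res_map, streams)          # Res Mapping
    written = post_map[:, None, None] * layer_out[None, :, :]   # Post Mapping
    return mixed + written


new_streams = hyper_connection_step(streams, pre_map, post_map, res_map, layer_fn)
print(new_streams.shape)  # (4, 5, 16)

extra_params = pre_map.size + post_map.size + res_map.size
print(extra_params)       # 24 mapping entries for n = 4
```

With a single stream and all mappings set to 1, the step reduces to the usual residual update `x + layer_fn(x)`, so the regular residual connection is the n = 1 special case of this sketch.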
In the original hyper-connections paper, the 7B OLMo MoE experiment goes from 13.36G to 13.38G FLOPs per token, which is basically unchanged. In terms of reported gains, there were modest (but consistent) improvements, as shown in the figure below. (However, only looking at FLOPs is a bit simplistic. The widened residual state still has to be stored, moved through memory, mixed, etc. So the practical overhead can come more from memory traffic and implementation complexity than from arithmetic, which is not explicitly measured. However, given that DeepSeek V4 is all about efficiency, it seems to be a worthwhile addition.) Figure 19: Hyper-connections performance versus baseline, using an annotated figure from the hyper-connections paper, https://arxiv.org/abs/2409.19606 . Also, as shown in the figure above, metrics reached the baseline’s performance using roughly half the training tokens. The main change from regular hyper-connections (HC) to manifold-constrained hyper-connections (mHC) is that the mappings are no longer left unconstrained. In regular HC, the Res Mapping is a learned matrix that mixes the parallel residual streams, but stacking many such matrices can amplify or shrink signals unpredictably. In mHC, this residual mapping is projected onto the manifold of doubly stochastic matrices, meaning all entries are non-negative and each row and column sums to 1. This makes the residual mixing behave more like a stable redistribution of information across streams. The Pre Mapping and Post Mapping are also constrained to be non-negative and bounded, which avoids cancellation when reading from and writing back into the widened residual state. In short, mHC keeps the richer residual mixing of HC, but adds constraints so it scales more safely, which becomes more relevant for larger (deeper) models. Otherwise, the main idea of using parallel residual streams remains, as shown in the figure below. 
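The effect of the doubly stochastic constraint can be illustrated numerically. The sketch below uses Sinkhorn-style alternating row and column normalization as one standard way to reach an (approximately) doubly stochastic matrix. Treat the `sinkhorn` helper, the random Gaussian matrices, and the depth of 64 as assumptions for illustration, not as DeepSeek's exact procedure or learned weights.

```python
import numpy as np


def sinkhorn(logits, n_iters=200):
    """Approximately map a square matrix onto the doubly stochastic matrices."""
    m = np.exp(logits - logits.max())           # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)    # make rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)    # make columns sum to 1
    return m


rng = np.random.default_rng(0)
n, depth = 4, 64

ds = sinkhorn(rng.normal(size=(n, n)))
print(ds.sum(axis=0), ds.sum(axis=1))           # all close to 1

# Push a signal through a deep stack of stream-mixing matrices.
signal_hc = np.ones(n)     # unconstrained mixing (HC-style)
signal_mhc = np.ones(n)    # doubly stochastic mixing (mHC-style)
for _ in range(depth):
    signal_hc = rng.normal(size=(n, n)) @ signal_hc
    signal_mhc = sinkhorn(rng.normal(size=(n, n))) @ signal_mhc

print(np.linalg.norm(signal_hc))    # drifts by many orders of magnitude
print(signal_mhc)                   # stays close to all ones
```

Because every row sums to 1, the constrained stack maps a constant signal to itself, and because every column sums to 1, the total signal across streams is preserved. That is the "stable redistribution" behavior described above.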
Figure 20: Transformer block with hyper-connections (HC) and manifold-constrained hyper-connections (mHC) using annotated figures from the mHC paper, https://arxiv.org/abs/2512.24880 . In the mHC paper, using a 27B parameter model for the experiments, the DeepSeek team’s optimized implementation (with fusion, recomputation, and pipeline scheduling) adds only 6.7% additional training time overhead for 4 residual streams (n = 4) throughout all transformer blocks compared to the single-stream baseline. To sum up this section, HC/mHC changes how information is carried around these layers by replacing the single residual stream with several interacting residual streams, with the additional stability constraints added in mHC, while adding minimal compute overhead. Also, it pairs well with the CSA/HCA attention changes, which modify other parts of the transformer block, which I will discuss below. 5.2 Compressed Attention via CSA and HCA The other major DeepSeek V4 architecture change is on the attention side. Again, the motivation is that at very long context lengths, attention becomes expensive not only because of the attention score computation, but also because the KV cache grows with the sequence length. DeepSeek V4 addresses this issue with a hybrid of two compressed-attention mechanisms, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). For a refresher, I recommend checking out my previous “ A Visual Guide to Attention Variants in Modern LLMs ” article, which covers Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA), among others. The first thing to note is that CSA/HCA in DeepSeek V4 is a different kind of compression than the MLA-style compression used in DeepSeek V2/V3. Where MLA mainly compresses the per-token KV representation, CSA and HCA compress along the sequence dimension. 
So, instead of keeping one full (or compressed) KV entry for every previous token, they summarize groups of tokens into fewer compressed KV entries. Consequently, the cache gets shorter. DeepSeek V4 also uses compact compressed entries and shared-KV attention, but the main distinction from MLA is the sequence-length compression. This is illustrated in the figure below. Figure 21: Conceptual comparison of MLA-style per-token latent caching, CSA, and HCA. MLA compresses the stored KV representation but keeps one latent entry per token. CSA shortens the sequence more mildly with m=4 and sparse top-k selection, while HCA uses much heavier sequence compression with m’=128 and dense attention over the shorter cache. The quality tradeoff for CSA/HCA is also different from MLA. As shown in the figure above, MLA compresses the representation stored for each token, but it still keeps one latent KV entry per token. CSA and especially HCA go further by reducing the number of sequence entries themselves, so the model gives up some token-level info in exchange for much lower long-context cost. Again, it’s all about reducing long-context cost, but this trade-off can hurt modeling quality if the compression is too strong, which is why DeepSeek V4 does not rely on one compression scheme alone but alternates between CSA and HCA. CSA uses a milder compression rate and a DeepSeek Sparse Attention (DSA)-style selector, HCA uses much heavier compression for cheaper global coverage, and both keep a local sliding-window branch for recent uncompressed tokens. This sparse selection in CSA builds on DeepSeek Sparse Attention (DSA), which I discussed in more detail in my earlier DeepSeek V3.2 write-up . HCA is the more aggressive variant of the two. It compresses every 128 tokens into one compressed KV entry, but then uses dense attention over those heavily compressed entries. 
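To get a feeling for the sequence-length compression, here is a small pure-Python sketch. Mean pooling is used only as a stand-in for the compressor (an assumption; the real mechanism is learned), and the function name `compress_sequence` and the 1,000,000-token context are hypothetical choices for the arithmetic. Only the group sizes m = 4 and m’ = 128 come from the text.

```python
def compress_sequence(kv_entries, group_size):
    """Summarize each full group of `group_size` token entries into one entry.

    Mean pooling is a placeholder for the learned compression.
    """
    compressed = []
    for start in range(0, len(kv_entries) - group_size + 1, group_size):
        group = kv_entries[start:start + group_size]
        dim = len(group[0])
        compressed.append(
            [sum(vec[d] for vec in group) / group_size for d in range(dim)]
        )
    return compressed


# Toy cache: 8 tokens with 2-dimensional entries.
toy_cache = [[float(t), float(2 * t)] for t in range(8)]
print(compress_sequence(toy_cache, group_size=4))  # [[1.5, 3.0], [5.5, 11.0]]

# Number of compressed cache entries at a long context length.
context_len = 1_000_000
for name, m in [("one entry per token", 1), ("CSA, m=4", 4), ("HCA, m'=128", 128)]:
    print(f"{name}: {context_len // m:,} entries")
# one entry per token: 1,000,000 entries
# CSA, m=4: 250,000 entries
# HCA, m'=128: 7,812 entries
```

The sketch leaves out the sparse top-k selection in CSA and the uncompressed sliding-window branch. It only shows why dense attention is affordable over the HCA cache, while the longer CSA cache is paired with sparse selection.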
In other words, CSA keeps more details but uses sparse selection, while HCA keeps far fewer entries and can afford dense attention over them, as illustrated in the figure below. This makes the two mechanisms somewhat complementary, which is why DeepSeek V4 interleaves CSA and HCA layers rather than using only one of them. Figure 22: CSA selects a sparse set of compressed history blocks, while HCA attends densely over more heavily compressed blocks. Both paths also include recent uncompressed KV entries through a 128-token sliding-window branch. The DeepSeek V4 paper reports that, at a 1M-token context length, DeepSeek V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache size compared with DeepSeek V3.2, which uses MLA and DeepSeek Sparse Attention (DSA). DeepSeek V4-Flash is even smaller, at 10% of the FLOPs and 7% of the KV cache size relative to DeepSeek V3.2. Figure 23. Reported 1M-context efficiency numbers from the DeepSeek V4 paper, relative to DeepSeek V3.2. By the way, I would not describe CSA/HCA as “better” than MLA in a general sense. CSA/HCA is a more aggressive long-context design. And it’s also more complicated for sure. Unfortunately, there is no ablation study in the paper. But overall, the paper reports strong overall modeling results, including DeepSeek V4-Flash-Base outperforming DeepSeek V3.2-Base on a majority of base-model benchmarks and strong 1M-token retrieval results, but these results are for the full DeepSeek V4 recipe, which also includes better data, Muon-based optimization, mHC, precision/storage optimizations, and training/inference-system changes. Personally, for now, I would treat CSA/HCA as an efficiency-focused long-context design that appears to preserve modeling quality well in their large flagship model(s) but not necessarily universally better than MLA. 6. 
Conclusion

Overall, the interesting pattern this year is that most new open-weight models try to make long-context inference cheaper without just shrinking the model in terms of total parameters. For instance, Gemma 4 reduces KV-cache memory with cross-layer KV sharing and adds capacity via per-layer embeddings. Laguna XS.2 tweaks how much attention capacity each layer gets. ZAYA1-8B moves attention into a compressed latent space. DeepSeek V4 adds constrained residual-stream mixing and compressed long-context attention.

Figure 24: The evolution from GPT-2 (2019) to DeepSeek V4-Pro (2026)

For instance, I am fairly certain that someone who is diving into LLM architectures for the first time will be totally overwhelmed when seeing the DeepSeek V4 source code. However, by starting with the original decoder-style LLM (GPT/GPT-2) and then gradually adding / learning about these new components one at a time, we can keep the learning effort manageable. The moral of the story, I guess, is to keep learning, one architecture at a time :).

By the way, I am very excited to share that I finished writing Build a Reasoning Model (From Scratch) and all chapters are in early access now. The publisher and I worked hard on the final layouts in the past month, and it’s going to be sent to the printer this week. (Good news: the print version will be in color this time!) This is probably my most ambitious book so far. I spent about 1.5 years writing it, and a large number of experiments went into it. It is also probably the book I worked hardest on in terms of time, effort, and polish, and I hope you’ll enjoy it.

Build a Reasoning Model (From Scratch) on Manning and Amazon. The main topics are

evaluating reasoning models
inference-time scaling
self-refinement
reinforcement learning
distillation

Amazon (pre-order of Kindle ebook and print paperback)
Manning (complete book in early access, pre-final layout, 528 pages)

Ankur Sethi Yesterday

Land and expand

Link: https://lobste.rs/s/oznirn/redis_cost_ambition#c_dzrja0 Mitchell Hashimoto (founder of HashiCorp, creator of Vagrant and Ghostty) commenting on why software products often lose their core identity and grow irrelevant features: The cost (cognitive, time, risk, money, etc.) of adopting a new thing is significantly higher than expanding an old thing. You see this even without any commercial interests. For example, one I've spoken publicly on is how many programming languages became a least-common-denominator of everything features rather than hold strong to a core identity. And many/most of these have no commercial motive, its just laziness. Commercial interests of course definitely push this though. At a certain points its all about horizontal expansion. Or, in more businessy terms: "land and expand." You have the P&P (pricing/packaging) for land deals that explicitly aim to get someone to use your software, usually lead by a flagship functionality that your product is truly probably best in class or nearly at. Then once the deal is landed, you have a cadre of add-on functionality that you're probably just average at at best, but its easier for procurement (the department that handles software purchasing in a business) to upgrade an existing closed deal than to engage in a new one. So you can sell mediocre stuff. I recently heard a different term for the "land and expand" idea in The Positioning Manual for Indie Consultants : "creating a beachhead". I find it interesting (and off-putting) that much of business vocabulary borrows from military operations. But that's a post for another day. The "land and expand" strategy doesn't always result in bad products. But when it's done badly, you end up with Zoom Mail, Microsoft Teams, and JIRA.

Ankur Sethi Yesterday

Selling to practitioners vs. selling to technical decision makers

Link: https://lobste.rs/s/oznirn/redis_cost_ambition#c_dzrja0 Mitchell Hashimoto (founder of HashiCorp, creator of Vagrant and Ghostty) commenting on Lobste.rs about how software products are sold: For software solutions, there are two main groups: practitioners and technical decision makers (TDMs). Practitioners are the main users of a piece of software (and in the case of OSS, adopters, though not the case always). TDMs are the higher level management with budgetary discretion that are making broad stroke technical decisions. The Redis landing page to me looks like a TDM-oriented site. And the "real-time context engine for AI" and AI focus feels correct for that target user. You know the phrase "no one ever got fired for choosing IBM?" The thing about 90% of TDMs is that they're motivated primarily by NOT GETTING FIRED. These aren't people who browser Lobsters or push to GH on the weekend. These are people that work 9 to 5, get paid, go home, and NEVER THINK ABOUT WORK AGAIN. So to achieve all that, they follow secular trends supported by analysts and broad public sentiment. Oh, Gartner said that "AI strategy" is most important? McKinsey said "context" needs to be managed? Well, "Context Engine for AI Apps" is going to be defensible. Buy it. On the surface, this might sound like a dismissal of TDMs as people who don't care about the job, but I don't think Mitchell meant it that way. TDMs are doing their best with the information they have. They're paying attention to signals that are high quality in their estimation, but not necessarily high quality in the estimation of their technical co-workers. I personally would never use a Gartner report to make technical decisions, but in the same way the CFO at your company would never use a Hacker News comment to make financial decisions. And you know what? It's okay if your CFO doesn't care about what Hacker News thinks about Redis. That's not their job. That's your job. 
Their job is to make sure the business doesn't go bankrupt. If I want my company to pick Valkey over Redis, the onus for communicating that to management is entirely on me. It's my job to explain why it's valuable not just from a technical point of view, but also from a business point of view. Will it help the company ship faster? Save money on AWS bills? Build new features we couldn't build before? Will it help reduce liability, create better audit trails, onboard new engineers faster? TDMs can't make good decisions based on information they can't parse, so it's my job to make sure they can parse the differences between two relatively similar products. If I refuse to do this job properly, the marketing department at Redis Ltd. will do it in a way that serves their business needs rather than mine. There are economic, social, legal, and political dimensions to picking technology. It's never just about the quality of the product in isolation.

ava's blog Yesterday

privacy is becoming even more of a privilege

I've been thinking more about the future we might be heading towards if things continue the way they do, relatively unstopped, especially in regards to data harvesting and leaks, and how digitalized our society continues to become. I wonder if we are simply headed for a society in which there is bleak acceptance and normalization of most pieces of information being out there already. Everything you put out there voluntarily/openly (like a blog, or social media) and the things passively collected about you (via your devices) being trained on, analyzed, in some database that cannot withstand the latest AI release or whatever, together with vibecoded insecure software. Your cloud, your social media posts, your DMs, your purchase history on different platforms, health data in your eFile, the journal entries you did in that aesthetic journaling app, the poop pictures you gave to an AI app to analyze, the recordings of your Alexa and smart TV, etc. that all may or may not be combined. We have lost so many of the previous barriers. Compared to previous times in history, many things aren't automatically private in your own home, or just saved in just people's brains anymore. Less and less things are exclusively physically in some cabinet you have to locate and get several keys for or lie your way in (social engineering) for. Digital things are written down and stored in a more accessible way, and while there is a metaphorical door, it can be broken down from anywhere in the world, and you no longer need to rely on pressuring things out of people or enduring any of the prep and risk of a physical break in. Your home can be broken into from half the planet away. All of this is making secrecy and privacy hard; it is all a technology arms race. Data protection and privacy is only seen as a hindrance, an annoyance in the eyes of many. Unnecessary when things are going fine until they aren't. 
It's annoying when a website asks you to consent, but it's suddenly important when you need to know what data a company still has from you, or when there's a breach. I see privacy laws overall being weakened, employees in those teams, authorities and organizations terminated, all because data is the new gold, or an even better oil. I see the EU trying to use our rights and data as a bargaining chip for US travel and exports. As usual, human rights stand in the way of big money. Historically, we are used to seeing the privacy of the rich as something rather physical; they move to gated communities, or land in bumfuck nowhere, to have no neighbors and peace from paparazzi and weird stalkers. They get to have certain media pulled from the shelves when it is not favorable to them. Increasingly, we have seen them remove digital content: blog posts, Reddit threads, specific images and videos, stats tracking their whereabouts, meetings and flights. Unfortunately, the richer you are, the more protection of your data and privacy you can buy. You can see it even now: we need to give up so much information just to travel and pass airport checks, down to social media checks or the EU bartering over sharing biometric data with the US for EU travellers. Meanwhile, Taylor Swift and Elon Musk can restrict the activity of their private jets. They can obscure or limit their real-time location exposure, acquire surrounding properties to create buffer zones, forbid aerial photography and maritime tracking around their properties, tighten security around family information and their children’s identities, afford security teams and compartmentalized travel arrangements, subject others to NDAs, and influence powerful government officials. Can you do the same?
As you are told you need these devices with all these data mining features, all these privacy-disrespecting apps and LLMs, all these social media accounts to be successful, or happy, or organized, or be seen and loved, or get a chance at an additional income stream or fame, they are already rich and known enough. They get to be private, not overshare on socials, and leave posting and taking calls and messages to their assistants. It's okay for them not to be overly online and active. They probably get to be exempt from their own companies' tracking for "security reasons", despite using the same products. They know the data their services mine is harmful if you have a stalker or abuser; they only care if it affects them, though. And think of the legal repertoire they have when they have their likeness stolen, deepfakes of their voice and visual characteristics made in a way that harms them. You don't have the same options. When data leaks that makes you uninteresting to employers, you have to potentially live with that; they are the employers. Continuing on, having any privacy will be even more of a privilege. It is maddening, because very rich and powerful techbros like Musk, Altman, Zuckerberg, etc. get rich off of our data that we can no longer afford to protect against them, eventually always funding their dominance over us, and enabling their own exemption status in this data mining society. They benefit from collecting and analyzing information at industrial scale while attempting to selectively limit information flowing the other direction. In their ideal little world, they don't invest it back into us; they use it to further fund AI replacement workers, weapons, and their doomsday bunkers away from us all. 
It makes me wonder if we will end up in a society where people will deliver as much information up front as they deem necessary to be in control of the narrative and tell themselves they have not been spied on and instead have shared it voluntarily in an act of bravery. Published 16 May, 2026

Unsung Yesterday

“193 hours of attempts (and practice)”

More unexpected Mario content: a 7-minute video speedrunning composite by FlibidyDibidy: “This video combines my first 5,162 attempts to speedrun Super Mario Bros. I recorded 193 hours of attempts (and practice) on an original 1985 Nintendo Entertainment System, then I wrote a custom computer program to process those videos and combine them via machine learning and conventional image processing techniques.” This is not just fun to look at, and – presumably – study as you’re speedrunning yourself. A sign of a good visualization is that it makes you see stuff that you haven’t before, and here, at some point (after 1:42), you start noticing strange comb-like patterns in Mario running. Turns out this is actually a thing called a “frame rule,” a quirk of the game’s timing code where it only checks for completion of the level every 21 frames, or about one third of a second. That means that for every level after the first one, your start will be rounded up to the nearest 21st frame. The analogy often given is to think of a bus that leaves every 21 frames, and levels can only end by getting on that bus, and so other than in the last level (which has no new level to load at the end of it), improvements in Super Mario Bros. can only happen in 21-frame increments. If you save a frame or two in a level, but it’s not enough to make the previous frame rule (that is, to catch the previous bus), you’ll just end up waiting for the next one anyway. Stay tuned to the end of the video for some fun stats, and click through in the description below to see the same tech applied live during an in-person speedrunning event. #speedrunning #super mario bros #youtube
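The bus analogy is easy to check with a little arithmetic. Here is a minimal sketch, assuming a simplified model where the level clock starts exactly on a frame-rule boundary; the 21-frame interval comes from the post, while the function names are hypothetical and not taken from the game or the video's tooling.

```python
FRAME_RULE = 21  # frames between level-completion checks, as described in the post


def effective_finish(frames: int, rule: int = FRAME_RULE) -> int:
    """Round a raw finish frame up to the next multiple of `rule` (the next 'bus')."""
    # Ceiling division without floats: -(-a // b) == ceil(a / b) for positive b.
    return -(-frames // rule) * rule


def frames_actually_saved(old: int, new: int, rule: int = FRAME_RULE) -> int:
    """How much faster the level ends once both runs are snapped to the frame rule."""
    return effective_finish(old, rule) - effective_finish(new, rule)


# Saving 2 frames inside the same window changes nothing:
# both 1000 and 998 wait for the bus at frame 1008.
print(frames_actually_saved(1000, 998))  # 0

# Saving 13 frames is enough to catch the earlier bus at frame 987,
# so the level ends a full 21 frames sooner.
print(frames_actually_saved(1000, 987))  # 21
```

Under this model the result is always a multiple of 21, which is one way to read those comb-like patterns: many slightly different runs collapse onto the same few departure times.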

Unsung Yesterday

Not a radio pharma ad

I like sharing, thinking about, and revisiting basic rules and principles because they really do ladder up to help you with more complex things down the road. I wrote before how a simple rule to give some breathing room to your length-limited edit fields can be upleveled to a more general “let me color outside the lines when I’m editing” principle. This is another example of a similar situation. I am in Buttondown, which is a mailing list software. I created a quick test draft just to check something out in the editor, I didn’t do anything else, and then I proceeded to delete it. Then, I was greeted with a warning dialog. This is nothing more than a larger version of the “You have 1 email(s)” problem. There might be a situation when I’m deleting something that has been published and linked to. In that case, it’d be good to know about how any links to that thing will cease working. But this is not that kind of a situation, and the software has all the info to know that. In this moment, it could show me a simpler, much less alarming message more appropriate to my situation. This is not the end of a radio pharma ad where you have to rattle out all the legal disclaimers just in case something could happen. One tiny counterexample from my neck of the woods: in Figma, when you start writing a comment and then exit without posting it, you get a little warning. But you don’t get that warning when you type something that’s <= 8 characters. In this case, the assumption is that retyping a few characters elsewhere (assuming you haven’t just changed your mind altogether) is much easier and faster than cognitively processing and dismissing the warning.
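The Figma rule boils down to a single comparison. This is a rough sketch of the idea only, not Figma's actual code: the 8-character cutoff is from the text above, while the function name and the choice to count raw characters (whitespace included) are assumptions.

```python
DISCARD_WARNING_THRESHOLD = 8  # characters; the cutoff mentioned for Figma comments


def should_warn_on_discard(draft: str, threshold: int = DISCARD_WARNING_THRESHOLD) -> bool:
    """Warn only when the unsent text is long enough that retyping it would hurt.

    Assumption: length is measured in raw characters, whitespace included.
    """
    return len(draft) > threshold


print(should_warn_on_discard("Nice!"))                      # False: cheap to retype
print(should_warn_on_discard("12345678"))                   # False: exactly 8 characters
print(should_warn_on_discard("Can we revisit the margin?"))  # True: worth a warning
```

The threshold trades a small risk of lost typing against the cost of reading and dismissing a dialog, which is the whole point of the rule.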
The challenge with Buttondown’s dialog is that this is more than just extra cognitive processing and “cheapness.” Here, the stakes are higher, as we’re talking about something adjacent to data loss; the dialog really does feel a bit scary and makes me think I can do some real damage in a situation where no real damage is possible. #interface design #principles #writing
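To make the "software has all the info" point concrete, here is a minimal sketch of a delete confirmation that picks its wording from what it already knows about the item. Everything in it is hypothetical: the `Draft` fields, the function name, and the message text are illustrations, not Buttondown's data model or copy.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    """Hypothetical stand-in for an email draft; not Buttondown's real schema."""
    title: str
    was_published: bool = False


def delete_confirmation(draft: Draft) -> str:
    """Choose the least alarming message that is still accurate for this draft."""
    if draft.was_published:
        # Only here is there something to lose, so only here do we warn about links.
        return (
            f'Delete "{draft.title}"? It has been published, '
            "and any links to it will stop working."
        )
    # A never-published test draft cannot break anything elsewhere.
    return f'Delete "{draft.title}"?'


print(delete_confirmation(Draft("Quick test")))
print(delete_confirmation(Draft("May newsletter", was_published=True)))
```

The design choice is simply to branch on state before writing the warning, rather than reciting every disclaimer that could ever apply.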
