Jim Nielsen · 24 days ago

You Might Debate It — If You Could See It

Imagine I’m the design leader at your org and I present the following guidelines I want us to adopt as a team for doing design work:

Typography: Use expressive, purposeful fonts and avoid default stacks (Inter, Roboto, Arial, system).
Motion: Use a few meaningful animations (page-load, staggered reveals) instead of generic micro-motions.
Background: Don't rely on flat, single-color backgrounds; use gradients, shapes, or subtle patterns to build atmosphere.
Overall: Avoid boilerplate layouts and interchangeable UI patterns. Vary themes, type families, and visual languages.

How do you think that conversation would go? I can easily imagine a spirited debate where some folks disagree with any or all of my points, arguing that they should be struck as guidelines from our collective ethos of craft. Perhaps some are boring, or too opinionated, or too reliant on trends. There are lots of valid, defensible reasons. I can easily see this discussion being an exercise in frustration, where we debate for hours and get nowhere — “I suppose we can all agree to disagree”.

And yet — thanks to a link to Codex’s front-end tool guidelines in Simon Willison’s article about how coding agents work — I see that these are exactly the kind of guidelines that are tucked away inside an LLM that’s generating output for many teams. It’s like a Trojan Horse of craft: guidelines you might never agree to explicitly are guiding LLM outputs, which means you are agreeing to them implicitly.

It’s a good reminder about the opacity of the instructions baked into generative tools. We would debate an open set of guidelines for hours, but if they’re opaquely baked into a tool without our knowledge, does anybody even care? When you offload your thinking, you might be on-loading someone else’s that you’d never agree to — personally or collectively.


Dissecting and Modeling the Architecture of Modern GPU Cores

Dissecting and Modeling the Architecture of Modern GPU Cores. Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, and Antonio Gonzalez. MICRO'25.

The purpose of this paper is to understand the microarchitecture of recent NVIDIA GPUs, in order to update the architectural simulators that are used for research purposes. The authors uncovered lots of interesting tidbits. Take this information with a grain of salt; it is derived from careful experimentation rather than NVIDIA documentation.

The paper uses the term sub-core for the hardware module which can execute warp-wide instructions. Each SM comprises four sub-cores. Fig. 3 illustrates the components within a sub-core and shows how four sub-cores share instruction and data caches. Source: https://dl.acm.org/doi/10.1145/3725843.3756041

Instruction Issue

The responsibility of resolving inter-instruction hazards (within a given warp) is split between the compiler and the hardware. There are two mechanisms the compiler can use to inform the hardware how it should avoid hazards:

The instruction encoding allows any instruction to set the value of a per-warp stall counter. When the hardware issues such an instruction, it sets the stall counter to the specified value. On each clock cycle thereafter, the counter is decremented by one. The hardware will not issue more instructions for the warp until the counter reaches zero. This is useful for handling hazards with a fixed latency.

Variable-latency hazards are resolved with dependence counters. The hardware tracks the value of six dependence counters per warp. The instruction encoding allows the compiler to specify up to two counters which should be incremented when an instruction is issued. One of these counters is decremented when the instruction writes to the register file, and the other is decremented when the instruction reads from the register file (to resolve WAR hazards).
Additionally, the compiler can specify that a given instruction cannot issue until specific dependence counters reach zero. In fig. 2 above, the values of these counters are checked in one block, and the counters are incremented in another.

The warp scheduler prefers to pick a warp and stick with it (i.e., it is not a round-robin scheduler). If the current warp cannot be scheduled (e.g., the stall counter is greater than zero, or there was a cache miss), then the scheduler switches to another warp. The warp scheduler issues instructions in program order (within a warp); there is no out-of-order execution support.

The register file has a limited number of ports, and instructions must be controlled to avoid attempting too many reads or writes in parallel. Register file port contention is not handled by the warp scheduler; instead it is handled further down the pipe. For example, a stage in fig. 2 will stall fixed-latency instructions until register file read ports are available.

The register file cache (RFC) is a hardware component that reduces contention on the register file read ports. The RFC has storage for six vectors (and tags). The compiler can mark a source operand of an instruction so that the hardware will store the source operand in the cache for a subsequent operation to use. Note that the RFC does not store per-warp values and is only useful for caching data within one warp. This plays nicely with the “pick a warp and stick with it” scheduling policy. Listing 4 has some example code sequences demonstrating how the compiler can direct the operation of the RFC. Source: https://dl.acm.org/doi/10.1145/3725843.3756041

Memory Access

Most of the resources that are shared between sub-cores are shared for efficiency reasons. A single sub-core will not generate memory requests at a high throughput, and there is locality of reference between the memory accesses in multiple sub-cores. The block in fig.
3 is shared in order to properly support thread-group shared memory (as a thread group is spread across all sub-cores in an SM). The shared memory access modules can handle one request every two cycles. That means if all four sub-cores are contending on memory, each one can make a request every eight cycles. There is a FIFO of depth ~4 between each sub-core and the shared memory structures. Typical read-after-write latency in shared memory is between 20 and 40 cycles.

The authors built a simulation model based on their experiments. Mean absolute percentage error (MAPE), the average of |predicted − actual| / actual across benchmarks, is one metric for measuring how accurate a simulation model is compared to real hardware. Table 4 shows that the model derived from the findings in this paper is a better performance model for recent NVIDIA GPUs than the baseline. Source: https://dl.acm.org/doi/10.1145/3725843.3756041
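As a rough sketch of the two hazard mechanisms described above, here is a toy per-warp scoreboard in Python. It illustrates the behavior the paper reports (a stall counter for fixed-latency hazards, six dependence counters for variable-latency ones); it is not NVIDIA's actual hardware or encoding, and all names and the API are invented for illustration.

```python
class WarpScoreboard:
    """Toy model of the per-warp hazard tracking the paper describes:
    one stall counter plus six dependence counters. Illustrative only."""

    def __init__(self):
        self.stall = 0        # per-warp stall counter
        self.dep = [0] * 6    # six dependence counters per warp

    def tick(self):
        # The hardware decrements the stall counter once per cycle.
        if self.stall > 0:
            self.stall -= 1

    def can_issue(self, wait_on=()):
        # The compiler may require specific dependence counters to be
        # zero before the next instruction can issue.
        return self.stall == 0 and all(self.dep[i] == 0 for i in wait_on)

    def issue(self, stall=0, inc=()):
        # The encoding lets an instruction set the stall counter and
        # increment up to two dependence counters.
        assert len(inc) <= 2
        self.stall = stall
        for i in inc:
            self.dep[i] += 1

    def complete(self, i):
        # A dependence counter is decremented when the in-flight
        # instruction finally reads or writes the register file.
        self.dep[i] -= 1
```

For example, a fixed-latency instruction might issue with stall=2, blocking the warp for two cycles, while a variable-latency load increments dependence counter 0 that a later consumer waits on.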


March 2026 blend of links

I promise you I try to avoid linking to more than two articles on the same topic in each edition — and I really don’t want my readers to feel too depressed reading this blog — but everything seems to be about A.I. or some sort of automation these days, either directly or indirectly. I also notice that most of the topics revolve around the how and rarely around the why, as if accelerating tasks to the max, regardless of their purpose, is unquestionably a good thing. Emily Tucker’s Open Letter to Georgetown Students, In Response to Recent Announcements by the University about “Generative A.I.” – “It’s a big win for them, in their quest to persuade you of your powerlessness, that they have gotten your university to [adopt] their marketing language for its official statements, to shape its academic programming around the presumption of their indefinite economic primacy, and to pay for you to have free access to technologies that will make it harder — the more you use them — to know yourself to be a free intellectual, creative and moral agent.” (via Dan Gillmor) Overthinking: A.I. wasn't the first to break my heart – This article from Ana Rodrigues read a little too close to home for my own comfort; the feelings described and words chosen are very accurate and indeed increasingly familiar to a growing number of people. We’re Training Students To Write Worse To Prove They’re Not Robots, And It’s Pushing Them To Use More A.I. – “[…] the AI detection tool flagged the essay as “18% A.I. written.” The culprit? Using the word “devoid.” When the word was swapped out for “without,” the score magically dropped to 0%.” The Future Smells Like Paper – “The technology should remove bureaucratic friction while preserving ceremonial weight. Make the process transparent without making it trivial. You can't automate meaning. You can only create conditions where it might emerge.” (via iA Writer) What I mean when I say that I hate Gen A.I.
– “I hate that I do it, and I am angry that I am forced - but I am an adult and I do what I must. I couldn't care less if I write the code I "make", but I am disenchanted with humanity. As a young boy I was full of optimism, I thought we can strive to be better. I was wrong. Money is all that matters.” (via Brain Baking) Backseat Software – So many quotable parts in this beauty of an article by Mike Swanson. Before writing this very sentence, I successively pasted 3 to 4 quotes, each better than the previous one. What a great read; actually very hard to get through, as you'll want to stop every other paragraph to take notes. (via The Talk Show) TextEdit and the Relief of Simple Software – An interesting perspective from someone deeply involved in the activity of writing on a computer, but seemingly not as passionate about software as one would assume. I’ll keep an eye on Kyle Chayka’s future columns, as I wouldn’t be surprised if this one is just a first step into the inevitable quest of finding a better writing app on the Mac. I’ve been there, both as a TextEdit-only user and as a text-editing software snob. I even play with Vim in the Terminal from time to time, just so I can feel like Dana Scully typing a report. (via Michael Tsai) SubEthaEdit – Perfect transition to a really excellent text editor, for people who love “real” Mac apps, with a neat collaboration feature. The Shape of Paris – At first, I just wanted to watch the first couple of seconds of this to see if it was worth saving for later or not, and I ended up watching it in full. Beautiful scenery that somehow made me nostalgic for the eight years of my life I lived in Paris. Also, has any other sport or hobby ever beaten skateboarding in terms of style and looks? I don’t think so; it’s the epitome of cool. (via Kottke) Shady Characters – Not as cool as a skateboard video in Paris, but this whole website looks incredible thanks to its exquisite typography.
I subscribed to the RSS feed, and there is also a book, which I’ve just ordered. Previous blend of links editions


‘CanisterWorm’ Springs Wiper Attack Targeting Iran

A financially motivated data theft and extortion group is attempting to inject itself into the Iran war, unleashing a worm that spreads through poorly secured cloud services and wipes data on infected systems that use Iran’s time zone or have Farsi set as the default language. Experts say the wiper campaign against Iran materialized this past weekend and came from a relatively new cybercrime group known as TeamPCP. In December 2025, the group began compromising corporate cloud environments using a self-propagating worm that went after exposed Docker APIs, Kubernetes clusters, Redis servers, and the React2Shell vulnerability. TeamPCP then attempted to move laterally through victim networks, siphoning authentication credentials and extorting victims over Telegram. A snippet of the malicious CanisterWorm that seeks out and destroys data on systems that match Iran’s time zone or have Farsi as the default language. Image: Aikido.dev. In a profile of TeamPCP published in January, the security firm Flare said the group weaponizes exposed control planes rather than exploiting endpoints, predominantly targeting cloud infrastructure over end-user devices, with Azure (61%) and AWS (36%) accounting for 97% of compromised servers. “TeamPCP’s strength does not come from novel exploits or original malware, but from the large-scale automation and integration of well-known attack techniques,” Flare’s Assaf Morag wrote. “The group industrializes existing vulnerabilities, misconfigurations, and recycled tooling into a cloud-native exploitation platform that turns exposed infrastructure into a self-propagating criminal ecosystem.” On March 19, TeamPCP executed a supply chain attack against the vulnerability scanner Trivy from Aqua Security, injecting credential-stealing malware into official releases via GitHub Actions.
Aqua Security said it has since removed the harmful files, but the security firm Wiz notes the attackers were able to publish malicious versions that snarfed SSH keys, cloud credentials, Kubernetes tokens and cryptocurrency wallets from users. Over the weekend, the same technical infrastructure TeamPCP used in the Trivy attack was leveraged to deploy a new malicious payload which executes a wiper attack if the user’s timezone and locale are determined to correspond to Iran, said Charlie Eriksen, a security researcher at Aikido. In a blog post published on Sunday, Eriksen said if the wiper component detects that the victim is in Iran and has access to a Kubernetes cluster, it will destroy data on every node in that cluster. “If it doesn’t, it will just wipe the local machine,” Eriksen told KrebsOnSecurity. Image: Aikido.dev. Aikido refers to TeamPCP’s infrastructure as “CanisterWorm” because the group orchestrates their campaigns using an Internet Computer Protocol (ICP) canister — a system of tamperproof, blockchain-based “smart contracts” that combine both code and data. ICP canisters can serve Web content directly to visitors, and their distributed architecture makes them resistant to takedown attempts. These canisters will remain reachable so long as their operators continue to pay virtual currency fees to keep them online. Eriksen said the people behind TeamPCP are bragging about their exploits in a group on Telegram and claim to have used the worm to steal vast amounts of sensitive data from major companies, including a large multinational pharmaceutical firm. “When they compromised Aqua a second time, they took a lot of GitHub accounts and started spamming these with junk messages,” Eriksen said. “It was almost like they were just showing off how much access they had.
Clearly, they have an entire stash of these credentials, and what we’ve seen so far is probably a small sample of what they have.” Security experts say the spammed GitHub messages could be a way for TeamPCP to ensure that any code packages tainted with their malware will remain prominent in GitHub searches. In a newsletter published today titled GitHub is Starting to Have a Real Malware Problem, Risky Business reporter Catalin Cimpanu writes that attackers often are seen pushing meaningless commits to their repos or using online services that sell GitHub stars and “likes” to keep malicious packages at the top of the GitHub search page. This weekend’s outbreak is the second major supply chain attack involving Trivy in as many months. At the end of February, Trivy was hit as part of an automated threat called HackerBot-Claw, which mass-exploited misconfigured workflows in GitHub Actions to steal authentication tokens. Eriksen said it appears TeamPCP used access gained in the first attack on Aqua Security to perpetrate this weekend’s mischief. But he said there is no reliable way to tell whether TeamPCP’s wiper actually succeeded in trashing any data from victim systems, and that the malicious payload was only active for a short time over the weekend. “They’ve been taking [the malicious code] up and down, rapidly changing it and adding new features,” Eriksen said, noting that when the malicious canister wasn’t serving up malware downloads it was pointing visitors to a Rick Roll video on YouTube. “It’s a little all over the place, and there’s a chance this whole Iran thing is just their way of getting attention,” Eriksen said. “I feel like these people are really playing this Chaotic Evil role here.” Cimpanu observed that supply chain attacks have increased in frequency of late as threat actors begin to grasp just how efficient they can be, and his post documents an alarming number of these incidents since 2024.
“While security firms appear to be doing a good job spotting this, we’re also gonna need GitHub’s security team to step up,” Cimpanu wrote. “Unfortunately, on a platform designed to copy (fork) a project and create new versions of it (clones), spotting malicious additions to clones of legitimate repos might be quite the engineering problem to fix.” Update, 2:40 p.m. ET: Wiz is reporting that TeamPCP also pushed credential-stealing malware to the KICS vulnerability scanner from Checkmarx, and that the scanner’s GitHub Action was compromised between 12:58 and 16:50 UTC today (March 23rd).

David Bushell Yesterday

Top ten Figma betrayals

Figma is the industry standard for painting pretty pictures of websites. It’s where designers spend my designated dev time pushing pixels around one too many artboards. Figma promises to remove the proverbial fence between design and development. In reality it provides the comfort of an ideal viewport that doesn’t exist. I don’t mind Figma (the software), although I prefer Penpot myself. I still dabble in the deceptive arts of web design. Don’t be thinking I’m out here hating on designers. I like to stick my nose inside a Figma file and point out issues before they escalate. Below I cover classic Figma betrayals that I bet you’ve experienced. Betrayals happen when software promises more than it can deliver. Take a gander at this amazing website design I whipped up in Figma to illustrate the most common betrayals. I told you I was a designer! I’ll evolve this design throughout the post. Figma has deemed 1440×1024 to be “Desktop” resolution so I’ve started there. In this mockup I’ve added a full-width banner of our hero Johnny Business. I’ve built this website more times than I care to remember. I’ll repeat the same question here I ask every time I build it: what happens at other viewport sizes? Do I scale the banner proportionally? On wider viewports this is likely to push content out of sight. It might even require scrolling to see the entire image on Johnny’s ultra-wide 8K. The phrase “above the fold” will be spoken in a Teams call, can we avoid that? Do I also set a maximum height on the banner? This is going to decapitate poor Johnny! He paid a lot for that haircut. What are we doing below the “Desktop” viewport, by the way? Let’s design for the 402×874 resolution Figma calls “iPhone 17” because it was first on the list. Note the absolute perfect crop of Johnny’s sockless businessing. Okay, next question: how do we move between “mobile” and “desktop”? That’s a very specific focal point. We can’t just change it willy-nilly! Code has rules; logic.
A website must be responsive between all breakpoints. Are we going to use multiple images? At what breakpoint do they swap? Because that perfectly cropped mobile image doesn’t scale up very far. Hold the phone! A shadow stakeholder has asked for a redesign to “make it pop!” The ultra-wide problem has been solved with a centred fixed-width style. Or is that the intention? Does either the banner or header stretch to the edge of the viewport? More importantly, that image and text have no room to move. I’ve only reduced the viewport by 200 pixels and it’s already crashing into Johnny’s face. Are we expecting breakpoints every 100 pixels? — No, wait! Please don’t spend more time designing more breakpoints! Okay, I’ll hold until more breakpoints are designed. Are we extending my development deadline? No. Okay. As development continues I’ve got more bad news to share. Figma is very happy allowing us to enter arbitrary line breaks for the perfect text fit. That’s not how the web works. One of these options is probably what we’ll see if text is left to naturally break. Yes, we can technically allow for a manual line break. That’s a pain in the content management system, but sure. Text is still forced to wrap on a smaller viewport, then what? Oh that? Now you want the manual line break to magically disappear? (╯°□°)╯︵ ┻━┻ I lied when I said “top ten” Figma betrayals. The issues above can appear in hundreds of guises across any component. If you’re betrayed once you’ll be hit again and again. Figma is not exactly conducive to responsive web design. Designing more breakpoints often leads to more questions, not fewer. Another betrayal I pull my hair out over is the three card pattern packed with content. This leads to an immediate breakpoint where one card drops awkwardly below. I dread this because the word “carousel” will be uttered and my sobbing is heard far and wide. Carousels are not a content strategy.
I was once inspecting a Figma file only to witness the enemy cursor drive by and drop several dots underneath an image. The audacity! Figma betrayals are classic waterfall mistakes that are solved by human conversation. Developers need to be part of the design process to ask these questions. Content authors should be involved before and not after a design is complete. You’ll note I never answered the questions above because what might work for my fictional design isn’t universal. On a tangential topic Matthias Ott notes: Think about what actually happens when a designer and an engineer disagree about an interaction pattern. There’s a moment of tension – maybe even frustration. The engineer says it’ll be fragile. The designer says it’s essential for the experience. Neither is wrong, necessarily. But the conversation – if your process allows for it to happen – that back-and-forth where both sides have to articulate why they believe what they believe, is where the design becomes robust and both people gain experience. Not in the Figma file. Not in the pull request. In the friction between two people who care about different things and are forced to find a shared answer. The Shape of Friction - Matthias Ott Figma is not friction-free and that’s fine. We can’t expect any software in the hands of a single person to solve problems alone. Software doesn’t know what questions to ask. Not then with Clippy, not now with Copilot. Humans should talk to one another, not the software. Together we can solve things early the easy way, or later the hard way. One thing that has kept me employed is the ability to identify questions early and not allow Fireworks, Photoshop, Sketch, XD, and now Figma to lead a project astray. Thanks for reading! Follow me on Mastodon and Bluesky . Subscribe to my Blog and Notes or Combined feeds.

matduggan.com Yesterday

Markdown Ate The World

I have always enjoyed the act of typing words and seeing them come up on screen. While my favorite word processor of all time might be WordPerfect (here), I've used almost all of them. These programs were what sold me on the entire value proposition of computers. They were like typewriters, which I had used in school, except easier in every single way. You could delete things. You could move paragraphs around. It felt like cheating, and I loved it. As time has gone on, what makes up a "document" in word processing has increased in complexity. This grew as word processors moved on from being proxies for typewriters and into something closer to a publishing suite. In the beginning, programs like WordPerfect, WordStar, MultiMate, etc. had flat binary files with proprietary formatting codes embedded in them. When word processors were just proxies for typewriters, this made a lot of sense. But as Microsoft Word took off in popularity and quickly established itself as the dominant word processor, we saw the rise of the .doc file format. This was an exponential increase in complexity from what came before, which made sense because suddenly word processors were becoming "everything tools" — not just typing, but layout, images, revision tracking, embedded objects, and whatever else Microsoft could cram in there. At its base, the .doc is a Compound File Binary Format, which is effectively just a FAT file system with the file broken into sectors that are chained together with a File Allocation Table. It's an interesting design. A normal file system would end up with a mess of separate files to contain everything that a .doc holds, but if you store all of that inside a simplified file system contained within one file, you can optimize for performance and reduce the overhead that comes with storing separate objects in a flat file.
It also optimizes writes, because you don't need to rewrite the entire file when you add an object, and it keeps revision history simple. But from a user perspective, they're "just" dealing with a single file. (Reference) The .doc exploded and quickly became the default file format for humanity's written output. School papers, office memos, résumés, the Great American Novel your uncle was definitely going to finish — all .doc files. But there was a problem with these files. They would become corrupted all of the goddamn time. Remember, these were critical documents traveling from spinning rust drives on machines that crashed constantly compared to modern computers, often copied to floppy disks or later to cheap thumb drives you got from random vendor giveaways at conferences, and then carried to other computers in backpacks and coat pockets. The entire workflow had the structural integrity of a sandwich bag full of soup. So when Word was saving your critical file, it was actually performing a bunch of different operations on those internal structures, and they weren't atomic. In an era when computers constantly crashed or had problems, it was super easy to end up in a situation where some structures were updated and others weren't. Compare that to a plain text file, where you would either get the old version or a truncated new version: you might lose content, but you almost never ended up with an unreadable file. As someone doing helpdesk IT, I constantly ended up with people who had just corrupted, unreadable files. And here's the part that really twisted the knife: the longer you worked on the same file, the more important that file likely was. But Word didn't clean up after itself. As a .doc accumulated images, tracked changes, and revision history, the internal structure grew more complex and the file got larger. But even when you deleted content from the document, the data wasn't actually removed from the file.
It was marked as free space internally but left sitting there, like furniture you moved to the curb that nobody ever picked up. The file bloated. The internal fragmentation worsened. And the probability of corruption increased in direct proportion to how much you cared about the contents. Users had to be trained both to save the file often (as AutoRecover wasn't reliable enough) and to periodically "Save As" a new file to force Word to write a clean version from scratch. This was the digital equivalent of being told that your car works fine, you just need to rebuild the engine every 500 miles as routine maintenance. The end result was that Microsoft Word quickly developed a reputation among technical people as horrible to work with. Not because it was a bad word processor — it was actually quite good at the word processing part — but because when a user showed up at the Help Desk with tears in their eyes, the tools I had to help them were mostly useless. I could scan the raw file for text patterns, which often pulled out the content, but without formatting it wasn't really a recovered file — it was more like finding your belongings scattered across a field after a tornado. Technically your stuff, but not in any useful arrangement. Sometimes you could rebuild the FAT or try alternative directory entries to recover slightly older versions. But in general, if the .doc encountered a structural error, the thing was toast and your work was gone forever. This led to a never-ending series of helpdesk sessions where I had to explain to people that yes, I understood they had worked on this file for months, but it was gone and nobody could help them. I became a grief counselor who happened to know about filesystems. 
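To make the sector-chain design concrete, here is a minimal Python sketch of following a FAT-style chain, the structure that recovery tools had to rebuild. It is deliberately simplified (real CFB uses 32-bit sector IDs, fixed sector sizes, and special markers like 0xFFFFFFFE for end-of-chain); the names here are illustrative, not from any spec.

```python
END_OF_CHAIN = -2  # stand-in for CFB's 0xFFFFFFFE end-of-chain marker

def read_stream(fat, sectors, start):
    """Follow a File-Allocation-Table chain: fat[sid] gives the next
    sector in the stream, and sectors[sid] holds that sector's bytes."""
    out, sid = [], start
    while sid != END_OF_CHAIN:
        out.append(sectors[sid])  # collect this sector's contents
        sid = fat[sid]            # hop to the next sector in the chain
    return b"".join(out)
```

A corrupted entry anywhere in the table breaks every sector after it, which is one reason a single bad write could render a whole document unreadable.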
Thankfully, people quickly learned to obsessively copy their files to multiple locations with different names — thesis_final.doc, thesis_final_v2.doc, thesis_FINAL_FINAL_REAL.doc — but this required getting burned at least once, which is sort of like saying you learned your car's brakes didn't work by driving into a bus. So around 2007 we see the shift from .doc to .docx, which bakes in a lot of hard lessons from the problems of .doc. First, it's just a bundle, specifically a ZIP archive. Now in theory, this is great. Your content is human-readable XML. Your images are just image files. If something goes wrong, you can rename the file to .zip, extract it, and at least recover your text by opening document.xml in Notepad. The days of staring at an opaque binary blob and praying were supposed to be over. However, in practice, something terrible happened. Microsoft somehow managed to produce the worst XML to ever exist in human history. Let me lay down the scope of this complexity, because I have never seen anything like it in my life. Here is the standards website for ECMA-376. You know you are in trouble when you see a four-part download. If you download Part 1 and open the PDF, get ready for it: it's a 5,039-page PDF. I have never conceived of something this complicated. It's also functionally unreadable, and I say this as someone who has, on multiple occasions in his life, read a car repair manual cover to cover because I didn't have anything else to do. I once read the Haynes manual for a 1994 Honda Civic like it was a beach novel. This is not that. This is what happens when a standards committee gets a catering budget and no deadline.
There was an accusation at the time that Microsoft was making OOXML deliberately more complicated than it needed to be — that the goal was to claim it was an "open standard" while making the standard so incomprehensibly vast that it would take a heroic effort for anyone else to implement it. I think this is unquestionably true. LibreOffice has a great blog post on it that includes a striking comparison: for the same document, the ODF format produces an exponentially less complicated XML file than the OOXML format. Either you could do the incredible amount of work to become compatible with this nightmarish specification, or you could effectively find yourself cut out of the entire word processing ecosystem. Without question this was done by Microsoft in order to have their cake and eat it too. They would be able to tell regulators and customers that this wasn't a proprietary format and that nobody was locked into the Microsoft Office ecosystem for the production of documents, which had started to become a concern among non-US countries worried that all of their government documents and records were effectively locked into Microsoft. The somewhat ironic thing is that it ended up not mattering much, because soon the only desktop application that would matter is the browser. The file formats of word processors were their own problems, but more fundamentally the nature of how people consumed content was changing. Desktop-based applications became less and less important post-2010, and users got increasingly frustrated with the incredibly clunky workflow around Microsoft Word and traditional files: emailing them back and forth endlessly or working with file shares. So while .docx was a superior format from the perspective of "opening the file without it becoming corrupted", it was also fundamentally incompatible with the smartphone era.
Even though you could open these files, soon the expectation was that whatever content you wanted people to consume should be viewable through a browser. As "working for a software company" went from being a niche profession to being something that seemingly everyone you met did, the de facto platform for issues, tracking progress, discussions, etc. moved to GitHub. This was where I (and many others) first encountered Markdown and started using it on a regular basis. John Gruber, co-creator of Markdown, has a great breakdown of "standard" Markdown, and then there are specific flavors that have branched off over time; you can see those here. The important part though is: it lets you very quickly generate webpages that work on every browser on the planet with almost no memorization, and (for the most part) the same thing works in GitHub, on Slack, in Confluence, etc. You no longer had to ponder whether the person you were sending it to had the right license to see the thing you were writing in the correct format. This, combined with the rise of Google Workspace with Google Docs, Slides, etc., meant your technical staff were having conversations through Markdown pages and your less technical staff were operating entirely in the cloud. Google was better than Microsoft at the sort of stuff Word had always been used for: tracking revisions, handling feedback, sharing securely, etc. It had a small subset of the total features, but as we all learned, nobody knew about the more advanced features of Word anyway. By 2015 the writing was on the wall. Companies stopped giving me an Office license by default, switching instead to "you can request a license". This, to anyone who has ever worked for a large company, is the kiss of death. If I cannot be certain that you can successfully open the file I'm working on, there is absolutely no point in writing it inside of that platform.
Combine that with the corporate death of email, replaced by Slack/Teams, and the entire workflow died without a lot of fanfare. Then with the rise of LLMs and their use (perhaps overuse) of Markdown, we've reached peak Markdown. Markdown is the format of our help docs, and many of our websites are generated exclusively from it. It's now the most common format that I write anything in. This was originally written in Markdown inside of Vim. There are a lot of reasons why I think Markdown ended up winning, in no small part because it solved a real problem in an easy-to-understand way. Writing HTML is miserable and overkill for most tasks; Markdown removed the need to do that, and your output was consumable in a universal and highly performant way that required nothing of your users except access to a web browser. But I also think it demonstrates an interesting lesson about formats. .doc and .docx, along with ODF, are highly specialized formats designed to handle the complexity of what modern word processing can do. LibreOffice lets you do some pretty incredible things that cover a huge range of possible needs. Markdown doesn't do most of what those formats do. You can't set margins. You can't do columns. You can't embed a pivot table or track changes or add a watermark that says DRAFT across every page in 45-degree gray Calibri. Markdown doesn't even have a native way to change the font color. And none of that mattered, because it turns out most writing isn't about any of those things. Most writing is about getting words down in a structure that makes sense, and then getting those words in front of other people. Markdown does that with less friction than anything else ever created. You can learn it in ten minutes, write it in any text editor on any device, read the source file without rendering it, diff it in version control, and convert it to virtually any output format. The files are plain text. They will outlive every application that currently renders them.
They don't belong to any company. They can't become corrupted in any meaningful way — the worst thing that can happen to a Markdown file is you lose some characters, and even then the rest of the file is fine. After decades of nursing .doc files like they were delicate flowers that you had to transport home strapped to your car roof, the idea of a format that simply cannot structurally fail is not just convenient. It's a kind of liberation. I think about this sometimes when I'm writing in Vim at midnight, just me and a blinking cursor and a plain text file that will still be readable when I'm dead. No filesystem-within-a-filesystem. No sector allocation tables. No 5,039-page specification. Just words, a few hash marks, and never having to think about it again.
Update the document stream (your text)
Update the formatting tables
Update the sector allocation tables
Update the directory entries
Update summary information
Flush everything to disk
Part 1 “Fundamentals And Markup Language Reference”, 5th edition, December 2016
Part 2 “Open Packaging Conventions”, 5th edition, December 2021
Part 3 “Markup Compatibility and Extensibility”, 5th edition, December 2015
Part 4 “Transitional Migration Features”, 5th edition, December 2016

0 views
fLaMEd fury Yesterday

MF DOOM: Long Island to Leeds

What’s going on, Internet? A quick post to share some thoughts on a great little podcast I just listened to, MF DOOM: Long Island to Leeds . Even if you’re not a fan of underground hip-hop, or even hip-hop in general, you may have heard of MF Doom. MF Doom is your favourite rapper’s favourite rapper. The podcast, hosted by AFRODEUTSCHE and Adam Batty, takes us through the story of how the reclusive underground MC from Long Island, New York, came up in the underground hip-hop scene and wound up in Leeds, England, where he passed away in 2020. It’s crazy to me how many America-based hip-hop artists were born outside of the USA but are still able to make it big in American hip-hop. His particular circumstances when it came to leaving and coming back to the States after touring were quite unfortunate, but they highlight the stance the United States takes on immigration. I’d hate to think how this would have come down if he was going through this today with the current immigration climate over there. I’ve always been aware of MF Doom and listened to his music, but I’m not a mega fan. What this podcast has done for me is bump some of his albums up to the top of my vinyl wish list. The podcast is made up of five 30-minute episodes. Even if you’re not a big hip-hop head, give it a listen 🤙 Hey, thanks for reading this post in your feed reader! Want to chat? Reply by email or add me on XMPP , or send a webmention . Check out the posts archive on the website.

0 views
Andy Bell Yesterday

Wait it out

You know what I’m talking about.
Play along but protect yourself by doing everything to ensure decision makers are held accountable when the burst comes
Keep your cognition levels high and don’t outsource your brain
Touch grass and embrace art
Preserve your mental health and do things that make you happy
Do work, go home

0 views
ava's blog Yesterday

vegan with a soy sensitivity

As a kid, I got diagnosed with a soy allergy; it caused me to itch everywhere and scratch until it bled, all over the body, and worse. I went through a desensitization process of weekly shots until my symptoms improved and went away. Until last year, I could eat soy with no issue; very convenient when you’re vegan. Then it seemingly came back and caused some nasty rashes. It took me a while to identify the culprit. Unfortunately, another round of desensitization is contraindicated for me and likely won’t work again, so I’m just having to roll with it. I really love tofu, edamame, natto, miso, soy sauce, tempeh, lao gan ma and more, so that sucks, but avoiding it has been easier than I thought. I’m not really that fond of eating many replacement products; I like veggie pans with just seasoned vegetables and some beans or other protein the most, and I prefer oat milk to soy milk. The only thing I consciously had to switch was going from sugarfree soy skyr to a sugarfree pea-based yoghurt. Other than that, whole foods have been my friend, and there are a surprising number of replacement products made from bean or pea protein, even chickpeas. I like the chickpea tofu I found, Beyond’s stuff is made with pea protein as well, seitan still works, and we replaced the TVP soy chunks with ones made from field beans, whose powder is also great for egg replacement in baking and for scrambled egg. Kidney bean patties are awesome, too, and red lentil stews are a comfort food to me. I can just use coconut cream instead of soy creams. So aside from losing some of my comfort foods, this has been a rather painless switch. Reply via email Published 23 Mar, 2026

0 views
HeyDingus Yesterday

7 Things This Week [#183]

A weekly list of interesting things I found on the internet, posted on Sundays. Sometimes themed, often not. 1️⃣ That screamy sound you hear when peeling tape? It’s a ‘sonic whisper’ from tearing at twice the speed of sound! [ 🔗 sciencealert.com ] 2️⃣ Craig Mod built the accounting software of his dreams, fitting his exact international needs, and which can be adapted with Claude Code as needed. Sounds amazing. [ 🔗 craigmod.com ] 3️⃣ Chris Coyier argues that web forms should always automatically email you a copy of your submission. I agree, though I wouldn’t be opposed to it being optional, as long as the default is for the copy to be sent. [ 🔗 email-is-good.com ] 4️⃣ Terry Godier’s essay about how all the objects in our lives have steadily stolen more of our attention, and made us feel guilty about it, is excellent. As is its web design. You gotta read this one in its original form. [ 🔗 terrygodier.com ] 5️⃣ Stephen Hackett (via James Thomson) shared some incredible 5K wallpapers featuring Lil Finder Guy. I love how the Lil Guy’s taken the Mac community by storm. [ 🔗 512pixels.net ] 6️⃣ I thought a tweet from Caleb Sexton was a joke about Kagi having ‘LinkedIn Speak’ as a language that you could translate into. It’s not a joke. It’s real . [ 🦣 mastodon.social ] 7️⃣ D. Griffin Jones did the thing and put an episode of the Connected podcast onto a floppy disk. Incredible commitment to the bit! [ 🦣 tech.lgbt ] Thanks for reading 7 Things . If you enjoyed these links or have something neat to share, please let me know . And remember that you can get more links to internet nuggets that I’m finding every day by following me @jarrod on the social web. HeyDingus is a blog by Jarrod Blundy about technology, the great outdoors, and other musings. If you like what you see — the blog posts , shortcuts , wallpapers , scripts , or anything — please consider leaving a tip , checking out my store , or just sharing my work. Your support is much appreciated!
I’m always happy to hear from you on social , or by good ol' email .

1 views

Code Proven to Work - The Math Way

This post is aimed at a general programmer and no prior knowledge of math or CS is assumed. I got nerd-sniped to write this after reading Simon’s excellent post Code proven to work . As someone who works in a software industry that is adapting to AI tools, I found the post very timely and very well put. I agree with everything said, but I just can’t resist being that guy - bringing up the “you keep using that word” meme.

0 views
Justin Duke Yesterday

Mistress America

I'm sorry, I know you liked Brooke. He told me that she worships you, she kept talking about how smart you are, how interesting... Last year I watched Liberal Arts , which may have been the single worst quote-unquote college movie that I've seen. Lazy, boring, and incoherent. In contrast, Mistress America nails not only being a college movie, but being a New York movie and a farce with specificity, flair, and warmth, and manages to do all of these things within the confines of a 97-minute runtime. No mean feat. I do feel like, for better and for worse, my analysis of the veracity of any of these films boils down to me coming out of the metaphorical theater thinking and then nodding my head and being like, "Yep, that's what it was like." And in Mistress America, that's what it was like. I did not have the same experience that Lola Kirke's character did. But the details were so hyper-specific and accurate, I could see so many people I knew like her from my time at William & Mary. What's more, the Greta Gerwig character serves as an equally hyper and honest depiction of that kind of late-twenties driftless coquette without ever being unnecessarily cruel or mean. Much of this is, I think, delivered by Gerwig's performance and screenwriting. Baumbach, I think, is a director who needs Gerwig more than the other way around. The surrounding cast is all pitch-perfect, too — including the second-act Connecticut set, who once again are drawn with broad comedic brushes without feeling particularly flat or cardboard (another problem with most films in this genre).

0 views
Loren Stewart Yesterday

ChatGPT, Claude, and Gemini Render Markdown in the Browser. I Do the Opposite

The big AI chat apps ship heavy rendering libraries to every device. Cheddy Chat renders markdown server-side and streams finished HTML, eliminating 160-440KB of client JavaScript while keeping the main thread free.

0 views

Experimenting with Starlette 1.0 with Claude skills

Starlette 1.0 is out! This is a really big deal. I think Starlette may be the Python framework with the most usage compared to its relatively low brand recognition, because Starlette is the foundation of FastAPI, which has attracted a huge amount of buzz that seems to have overshadowed Starlette itself. Kim Christie started working on Starlette in 2018 and it quickly became my favorite out of the new breed of Python ASGI frameworks. The only reason I didn't use it as the basis for my own Datasette project was that it didn't yet promise stability, and I was determined to provide a stable API for Datasette's own plugins... albeit I still haven't been brave enough to ship my own 1.0 release (after 26 alphas and counting)! Then in September 2025 Marcelo Trylesinski announced that Starlette and Uvicorn were transferring to their GitHub account, in recognition of their many years of contributions and to make it easier for them to receive sponsorship against those projects. The 1.0 version has a few breaking changes compared to the 0.x series, described in the release notes for 1.0.0rc1 that came out in February. The most notable of these is a change to how code runs on startup and shutdown. Previously that was handled by on_startup and on_shutdown parameters, but the new system uses a neat lifespan mechanism instead, based around an async context manager. If you haven't tried Starlette before, it feels to me like an asyncio-native cross between Flask and Django, unsurprising since creator Kim Christie is also responsible for Django REST Framework. Crucially, this means you can write most apps as a single Python file, Flask style. This makes it really easy for LLMs to spit out a working Starlette app from a single prompt. There's just one problem there: if 1.0 breaks compatibility with the Starlette code that the models have been trained on, how can we have them generate code that works with 1.0? I decided to see if I could get this working with a Skill.
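The lifespan mechanism mentioned above can be sketched with the standard library's async context manager decorator. This is a minimal sketch, not code from the release notes; the Starlette wiring in the final comment assumes the lifespan= keyword on the application constructor, and the app.state attributes here are illustrative:

```python
import contextlib

# A lifespan handler: code before `yield` runs at startup,
# code after `yield` runs at shutdown.
@contextlib.asynccontextmanager
async def lifespan(app):
    app.state.db = "connected"   # e.g. open a database pool here (illustrative)
    yield
    app.state.db = "closed"      # release resources once the server stops

# Hooked up roughly as: app = Starlette(routes=[...], lifespan=lifespan)
```

The appeal over separate startup/shutdown callbacks is that setup and teardown live in one function, so shared state (like that database pool) stays in scope across both phases.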
Regular Claude Chat on claude.ai has skills, and one of those default skills is the skill-creator skill. This means Claude knows how to build its own skills. So I started a chat session and told it: Clone Starlette from GitHub - it just had its 1.0 release. Build a skill markdown document for this release which includes code examples of every feature. I didn't even tell it where to find the repo; Starlette is widely enough known that I expected it could find it on its own. It cloned using the old repository name, but GitHub handles redirects automatically so this worked just fine. The resulting skill document looked very thorough to me... and then I noticed a new button at the top I hadn't seen before labelled "Copy to your skills". So I clicked it: And now my regular Claude chat has access to that skill! I started a new conversation and prompted: Build a task management app with Starlette, it should have projects and tasks and comments and labels And Claude did exactly that, producing a simple GitHub Issues clone using Starlette 1.0, a SQLite database (via aiosqlite) and a Jinja2 template. Claude even tested the app manually like this: For all of the buzz about Claude Code, it's easy to overlook that Claude itself counts as a coding agent now, fully able to both write and then test the code that it is writing. Here's what the resulting app looked like. The code is here in my research repository. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.

0 views

You can make a compelling narrative for almost anything untrue (Rule 4)

Pick something you believe to be untrue, for example a common conspiracy theory or policy narrative. Then go to your favorite chatbot and ask, “ While I believe <insert claim here> to be not true, some people nevertheless argue it is true, and I want to better understand their most compelling narratives. So, please list out the most compelling narrative(s) you can in favor of this position, as if you were trying to convince me. Make it compelling! Next to each, list out the best counter-narrative. ” This rule, that you can make a compelling narrative for almost anything untrue, has profound negative consequences for democratic societies. I believe that’s because: What makes a narrative compelling isn’t its truth, but that it sounds true. For example, simpler things often sound truer (making them more compelling), while actual truth is often nuanced , and increasingly so in our ever-more-complicated world. People are generally either already committed to a political tribe, or are deciding on a more case-by-case basis based on narrative. If they’re in the first camp, the tribe can concoct a compelling narrative to keep them satisfied no matter the issue. If they’re in the second camp, what they end up believing won’t necessarily be correlated to the best policy outcomes (truth) per the first proposition, as they are more likely to just choose the narrative that sounds most compelling to them. Our current state of social media amplification seems to have only made this dynamic worse. More news and political argument consumption has moved from longer-form content with robust journalistic standards to shorter-form video clips with lesser standards. There were always narratives in either form, but the former tries harder to correlate with the truth, and so, on net, as a society, we have less exposure to truth-correlated narratives. Building on this, we primarily communicate with each other, and certainly in the political sphere, via narrative. 
Politicians are elected on sound bites and video clips, and are therefore incentivized to sell the public crisp policy stories that sound compelling, regardless of whether those policies are our best probabilistic shot at achieving the intended policy outcomes. The politicians may even believe their stories, and that perhaps makes them even more persuasive, but that doesn’t make the stories any more true. It’s not that the stories need to be actually against the policy outcomes; they may just not be pushing the best possible policies. In aggregate, this means we’re always enacting sub-optimal policies, and over the long-run are less prosperous than we would otherwise be. The antidote for this Rule, and more generally the best way we know how to align with desired real world outcomes, is through the scientific method, namely tight, iterative loops of hypothesis → experiment → analysis (and back to hypothesis). Narratives are of course central to this process as well! People usually create stories to come up with hypotheses, and that’s fine as long as eventually the experimental results catch up with the stories and to the extent the stories don’t fit the results, they get replaced by new ones that do. The tighter the loops, the faster this happens. The same is of course true in policy land on a long enough time horizon. Ultimately it becomes so obvious a policy isn’t working that it gets thrown out (or the government collapses). So what’s the difference? I think it is just a matter of how tight the feedback loops really are, of how quickly these loops are happening and how much analysis is really informing the next set of hypotheses. In government, at least in the U.S., it can take literally decades to revise a policy, whether that is throwing out a bad one or just tweaking a mediocre one based on real-world outcomes to make it better. That’s obviously too slow to achieve optimal results. 
The two-party system makes this worse, but it isn’t even good within one party, so it seems more like a structural problem: compelling narratives insulate sub-optimal policies from needed scrutiny, which, among other things, slows the feedback loops down too much. That takes us all the way back to Rule 4: You can make a compelling narrative for almost anything untrue. Thanks for reading! Subscribe for free to receive new posts or get the audio version . Thank You For Smoking (2005)

0 views
David Bushell 2 days ago

I should build a game

I should build a game! I feel like that’s a common dream, right? Game development is what got me interested in design and programming to begin with. I learnt ECMAScript via Flash ActionScript many moons ago. Some time later “Thoughts on Flash” brought a swift demise and a ruined legacy to Flash. History is written by the winners, they say. Although Flash was largely proprietary software, and Adobe would have ruined it themselves, Flash was a wonderfully creative tool in its prime. I studied art and went into print/web design before transitioning almost entirely to front-end dev. I’ve been trapped here ever since! In that time, open web standards have become way more powerful than Flash ever was. Today HTML is the new Flash. Over my winter break I created a new playground where I relearned old tricks by building fun little canvas prototypes. Just basic stuff. No libraries or game engines. This is my retreat of solace until the “AI” fallout blows over. I’ll be sharing my slop-free explorations into game dev. The purpose here is understanding and creativity. No amount of prompt-fondling can achieve that! Work got busy, which is a good thing I guess, and I haven’t had time to build more. If the web industry does fall apart, at least I have a fallback plan to keep me busy! I’m going to build the games I always wanted to. Or at least try. I’ve been playing Slay the Spire 2 recently and I thought, “I could build that!” — I mean, I could technically build a shallow shitty clone. Nevertheless, it inspired me once again to consider if I really could design and build a game. I’ve set myself a personal goal of spending a few hours every week to create something game related. Maybe that’s sketching concept art, or plotting puzzles, or writing code, or researching, or just daydreaming ideas. Not with the grand plan of creating “the game”. I don’t know where it will lead but I know I’ll enjoy the process. Whether I share anything is unknown. Thanks for reading! 
Follow me on Mastodon and Bluesky . Subscribe to my Blog and Notes or Combined feeds.

0 views
ava's blog 2 days ago

what's in my todo app

I use a gamified todo app that I log into daily, and have been using it for almost a year now. The interaction with six of my friends kinda drew me in; we can have goals together, send each other encouraging messages, visit each other in our rooms and gift each other items. Each day I check off enough on my list, I send a little bird off to an adventure and then it discovers something. I also get little micropets. What I also enjoy is that it's not strictly a productivity-focused app, it's more about selfcare. It offers soundscapes, meditations, a mood tracker, breathing exercises, physical exercises, mental health quizzes, journaling prompts and more. Initially, I used it like any other todo app, meaning I wanted to get everything on the list done in a day and wanted to build a streak. That didn't work out, as it never does, and I chose to embrace the format of the app more. Now, I use it as a list of suggestions to do, from optional and kind things to gentle reminders of what needs doing. I used to struggle a lot with sitting around wanting to do things or knowing I needed to do stuff, but not exactly being sure what, or feeling like I'm missing something. For years, I made lists for everything. Nowadays, it's all combined in that app and not spread between different notes. I have set all goals to just continue being there until they're checked off, and they can be skipped and snoozed as well, all neatly sorted into categories. Let me show you. The hygiene category reminds me to This holds all the stuff I consider productive. Daily stuff is: The less frequent stuff: This category is usually intended for daily reminders to reach out to people, suggestions to make plans, to remember everyone that loves you, and all that. For me, it has This checks my daily drinking and goals for when to eat. 
This takes into account that I am mostly hungry in the evening and that eating early, especially sweet or carb-y stuff, seems to spike me a lot and makes me very hungry the rest of the day. So I try to eat breakfast and lunch later, and currently working on delaying it until even later. All of this is daily. I don't always feel good enough physically to fully commit to a routine for weeks or months, so this is basically a platter to pick and choose from each day. Some days, I do all. Some, I only do one or none. This is for stuff that gets me into the flow, or meditative stuff. Also daily! This is also a daily goal, but only holds one at the moment: "Do one thing makes me happy". It's very vague on purpose, and I count a lot of things based on the day. It gets me to go through my day and see what good things happened, practice gratitude. I check if I have treated myself well, and see if there's maybe something I'd like to do for myself. Reminders for myself. Very helpful for my chronic illness stuff! It can be hard to see rest as something productive and needed, instead of just something that holds me back. It also helps me see small good things and wins I had that day that otherwise, I would have just forgotten or downplayed again. So I get these three daily tasks: Still working on perfecting my sleep schedule and quality. Daily goal: Reminders to take some stuff. Only my injection is scheduled for every two weeks. Haven't had this category for long yet! But my hair is longer now and I take great care regrowing it, together with other things I want to focus more on. I don't put my usual skin care in there, because it's so embedded into my routine and easy to think of that I don't need it to be in there. I love that I don't have to just do the very productive or exhausting stuff; I can just do enough . Sometimes, selfcare is all you can manage, or you procrastinate on hard stuff but do lots of other things. 
That should still be rewarded, and you're still making progress. I feel like this setup finally acknowledges that for me. It's not a stressor anymore, just a wide selection of things I get to do , and even self-kindness and rest count. Most days, I don't do all of these, and it's not even an expectation. I'm just happy to see that I did stuff at all, and have an easy list of things that I can go through and see "Oh yes, that fits my mood and energy right now." and feeling like I make progress even by resting or affirming or acknowledging small wins. Reply via email Published 22 Mar, 2026
change the bed sheets every Sunday
do laundry on Saturday
clean the bathrooms on Tuesday
take out the trash on Saturday (or as needed)
vacuum on Monday and Friday
dust and wipe surfaces on Wednesday
spend 5 minutes tidying my home (I usually do this automatically, because I tidy up a bit first thing in the morning and before going to bed, and I always try to take stuff with me whenever I go through the apartment)
read a book or magazine
water plants (Thursday)
do a case for Noyb (Friday-Sunday)
do favors for my wife
take a stretch break (this is under connection because this is my wife and I's shared goal we do together)
drink water (3 bottles)
breakfast after 10 am
lunch past 1 pm
go for a walk
20+ mins indoor cycling
read a simple affirmation for myself (tapping this launches the affirmation part of the app, where I can skip through ones and find one I need for the day)
give myself permission to rest (this one changed a lot of how I see breaks in my fitness plans!)
name one small success from today
avoid caffeine after lunch (usually, I treat this as noon, because I usually have lunch later)
go to bed at 22:00
Supplements daily (a general one, my extra iron stuff, Vit D during the winter)
Endovelle daily in the evening
Minoxidil twice daily
Injection every two weeks on Friday
hair oiling on Sunday
monthly teeth bleaching

0 views
Ahead of AI 2 days ago

A Visual Guide to Attention Variants in Modern LLMs

I had originally planned to write about DeepSeek V4. Since it still hasn’t been released, I used the time to work on something that had been on my list for a while, namely, collecting, organizing, and refining the different LLM architectures I have covered over the past few years. So, over the last two weeks, I turned that effort into an LLM architecture gallery (with 45 entries at the time of this writing), which combines material from earlier articles with several important architectures I had not documented yet. Each entry comes with a visual model card, and I plan to keep the gallery updated regularly. You can find the gallery here: https://sebastianraschka.com/llm-architecture-gallery/ Figure 1: Overview of the LLM architecture gallery and its visual model cards. After I shared the initial version, a few readers also asked whether there would be a poster version. So, there is now a poster version via Redbubble . I ordered the Medium size (26.9 x 23.4 in) to check how it looks in print, and the result is sharp and clear. That said, some of the smallest text elements are already quite small at that size, so I would not recommend the smaller versions if you intend to have everything readable. Figure 2: Poster version of the architecture gallery with some random objects for scale. Alongside the gallery, I was/am also working on short explainers for a few core LLM concepts. So, in this article, I thought it would be interesting to recap all the recent attention variants that have been developed and used in prominent open-weight architectures in recent years. My goal is to make the collection useful both as a reference and as a lightweight learning resource. I hope you find it useful and educational! Self-attention lets each token look at the other visible tokens in the sequence, assign them weights, and use those weights to build a new context-aware representation of the input. Multi-head attention (MHA) is the standard transformer version of that idea. 
It runs several self-attention heads in parallel with different learned projections, then combines their outputs into one richer representation. Figure 3: Olmo 2 as an example architecture using MHA. The sections below start with a whirlwind tour of self-attention as the basis for explaining MHA. It’s meant more as a quick overview to set the stage for related attention concepts like grouped-query attention, sliding window attention, and so on. If you are interested in longer, more detailed self-attention coverage, you might like my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article. EXAMPLE ARCHITECTURES GPT-2, OLMo 2 7B, and OLMo 3 7B

1.2 Historical Tidbits And Why Attention Was Invented

Attention predates transformers and MHA. Its immediate background is encoder-decoder RNNs for translation. In those older systems, an encoder RNN would read the source sentence token by token and compress it into a sequence of hidden states, or in the simplest version into one final state. Then the decoder RNN had to generate the target sentence from that limited summary. This worked for short and simple cases, but it created an obvious bottleneck once the relevant information for the next output word lived somewhere else in the input sentence.
Of course, this could still be translated fine with an RNN, but it would struggle with longer sequences or knowledge retrieval tasks because the hidden state can only store so much information, as mentioned earlier. Figure 4: Translation can fail even when many individual word choices look reasonable because sentence-level structure still matters (Original source LLMs-from-scratch ). The next figure shows that change more directly. When the decoder is producing an output token, it should not be limited to one compressed memory path. It should be able to reach back to the more relevant input tokens directly. Figure 5: Attention breaks the RNN bottleneck by letting the current output position revisit the full input sequence instead of relying on one compressed state alone (Original source LLMs-from-scratch ). Transformers keep that core idea from the aforementioned attention-modified RNN but remove the recurrence. In the classic Attention Is All You Need paper, attention becomes the main sequence-processing mechanism itself (instead of being just part of an RNN encoder-decoder). In transformers, that mechanism is called self-attention, where each token in the sequence computes weights over all other tokens and uses them to mix information from those tokens into a new representation. Multi-head attention is the same mechanism run several times in parallel.

1.3 The Masked Attention Matrix

For a sequence of n tokens, attention needs one row of weights per token, so overall we get an n × n matrix. Each row answers a simple question. When updating this token, how much should each visible token matter? In a decoder-only LLM, future positions are masked out, which is why the upper-right part of the matrix is grayed out in the figure below. Self-attention is fundamentally about learning these token-to-token weight patterns, under a causal mask, and then using them to build context-aware token representations.
Figure 6: A concrete masked attention matrix where each row belongs to one token, each entry is an attention weight, and future-token entries are removed by the causal mask (Original source Understanding and Coding Self-Attention ).

1.4 Self-Attention Internals

The next figure shows how the transformer computes the attention matrix (A) from the input embeddings X, which is then used to produce the transformed inputs (Z). Here Q, K, and V stand for queries, keys, and values. The query for a token represents what that token is looking for, the key represents what each token makes available for matching, and the value represents the information that gets mixed into the output once the attention weights have been computed. The steps are as follows: W_q, W_k, and W_v are weight matrices that project the input embeddings X into Q, K, and V; Q K^T produces the raw token-to-token relevance scores; softmax converts those scores into the normalized attention matrix A that we discussed in the previous section; A is applied to V to produce the output matrix Z. Note that the attention matrix A is not a separate hand-written object. It emerges from Q, K, and softmax. Figure 7: The full single-head pipeline, from input embeddings X to the normalized attention matrix A and output representations Z (Original source Understanding and Coding Self-Attention ). The next figure shows the same concept as the previous figure, but the attention matrix computation is hidden inside the “scaled-dot-product attention” box, and we perform the computation only for one input token instead of all input tokens. This is to show a compact form of self-attention with a single head before extending this to multi-head attention in the next section. Figure 8: One attention head is already a complete mechanism. One set of learned projections produces one attention matrix and one context-aware output stream (Original source Understanding and Coding Self-Attention ).
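The single-head pipeline described above can be sketched in a few lines of NumPy. The dimensions are arbitrary toy values, and the random "learned" matrices simply stand in for trained weights:

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention with a causal mask.

    X: (seq_len, d_in) token embeddings; W_q/W_k/W_v: (d_in, d_head).
    Returns the attention matrix A and the context-aware outputs Z.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # raw token-to-token relevance
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)       # hide future tokens
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    Z = A @ V                                      # mix values by attention weight
    return A, Z

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 tokens, 8-dim embeddings
W_q, W_k, W_v = [rng.normal(size=(8, 5)) for _ in range(3)]
A, Z = causal_self_attention(X, W_q, W_k, W_v)
```

Each row of A sums to 1, and everything above the diagonal is zero because of the causal mask, matching the grayed-out upper-right region in the figure above.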
1.5 From One Head To Multi-Head Attention

One set of W_q, W_k, and W_v matrices gives us one attention head, which means one attention matrix A and one output matrix Z. (This concept was illustrated in the previous section.) Multi-head attention simply runs several of these heads in parallel with different learned projection matrices. This is useful because different heads can specialize in different token relationships. One head might focus on short local dependencies, another on broader semantic links, and another on positional or syntactic structure. Figure 9: Multi-head attention keeps the same basic attention recipe, but repeats it across several heads in parallel so the model can learn several token-to-token patterns at once (Original source Understanding and Coding Self-Attention ).

2. Grouped-Query Attention (GQA)

Grouped-query attention is an attention variant derived from standard MHA. It was introduced in the 2023 paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints by Joshua Ainslie and colleagues. Instead of giving every query head its own keys and values, it lets several query heads share the same key-value projections, which makes KV caching much cheaper (primarily as a memory reduction) without changing the overall decoder recipe very much. Figure 10: GQA keeps the same overall attention pattern as MHA, but collapses the number of key-value heads by sharing them across multiple query heads (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Dense: Llama 3 8B, Qwen3 4B, Gemma 3 27B, Mistral Small 3.1 24B, SmolLM3 3B, and Tiny Aya 3.35B. Sparse (Mixture-of-Experts): Llama 4 Maverick, Qwen3 235B-A22B, Step 3.5 Flash 196B, and Sarvam 30B. In my architecture comparison article, I framed GQA as the new standard replacement for classic multi-head attention (MHA).
The reason is that standard MHA gives every head its own keys and values, which is better from a modeling perspective but expensive once we have to keep all of that state in the KV cache during inference. In GQA, we keep a larger set of query heads, but we reduce the number of key-value heads and let multiple queries share them. That lowers both parameter count and KV-cache traffic without making drastic implementation changes like multi-head latent attention (MLA), which will be discussed later. In practice, that made it, and keeps it, a very popular choice for labs that want something cheaper than MHA but simpler to implement than newer compression-heavy alternatives like MLA. GQA results in big savings in KV storage, since the fewer key-value heads we keep per layer, the less cached state we need per token. That is why GQA becomes more useful as sequence length grows. GQA is also a spectrum. If we reduce all the way down to one shared K/V group, we are effectively in multi-query attention territory, which is even cheaper but can hurt modeling quality more noticeably. The sweet spot is usually somewhere in between multi-query attention (1 shared group) and MHA (where K/V groups are equal to the number of queries), where the cache savings are large but the modeling degradation relative to MHA stays modest. Figure 11: Lower is better. Once the context window grows, KV-cache savings become more pronounced. (Original source: LLMs-from-scratch GQA materials )

2.3 Why GQA Still Matters In 2026

More advanced variants such as MLA are becoming popular because they can offer better modeling performance at the same KV efficiency levels (e.g., as discussed in the ablation studies of the DeepSeek-V2 paper ), but they also involve a more complicated implementation and a more complicated attention stack. GQA remains appealing because it is robust, easier to implement, and also easier to train (since less hyperparameter tuning is necessary, based on my experience).
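The head-sharing trick is small enough to sketch directly. In this toy NumPy version (causal mask omitted for brevity, head counts and dimensions arbitrary), the shared K/V heads are simply repeated so that each group of query heads reads the same cached state:

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """GQA sketch: Q is (n_q_heads, seq, d); K and V are (n_kv_heads, seq, d).

    Each group of n_q_heads // n_kv_heads query heads shares one K/V head.
    (Causal mask omitted for brevity.)
    """
    group = Q.shape[0] // K.shape[0]
    K = np.repeat(K, group, axis=0)       # broadcast shared K/V heads to the groups
    V = np.repeat(V, group, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(1)
seq, d = 6, 4
Q = rng.normal(size=(8, seq, d))          # 8 query heads
K = rng.normal(size=(2, seq, d))          # only 2 K/V heads end up in the cache
V = rng.normal(size=(2, seq, d))
Z = grouped_query_attention(Q, K, V)

# Per token, the cache holds 2 * n_kv_heads * d floats instead of 2 * n_q_heads * d
cache_floats_mha = 2 * 8 * d
cache_floats_gqa = 2 * 2 * d
```

The simplicity of this head-sharing step, a single repeat before an otherwise unchanged attention computation, is a big part of GQA’s appeal.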
That is why some of the newer releases still stay deliberately classic here. E.g., in my Spring Architectures article, I mentioned MiniMax M2.5 and Nanbeige 4.1 as models that remained very classic, using only grouped-query attention without piling on other efficiency tricks. Sarvam is a particularly useful comparison point as well: the 30B model keeps classic GQA, while the 105B version switches to MLA. Figure 12: Total KV cache sizes for 105B Sarvam (using MLA) versus 30B Sarvam (using GQA), versus using plain MHA.

3. Multi-Head Latent Attention (MLA)

The motivation behind Multi-head Latent Attention (MLA) is similar to Grouped-Query Attention (GQA). Both are solutions for reducing KV-cache memory requirements. The difference between GQA and MLA is that MLA shrinks the cache by compressing what gets stored rather than by reducing how many K/Vs are stored by sharing heads. Figure 13: Unlike GQA, MLA does not reduce KV cost by grouping heads. It reduces it by caching a compressed latent representation. Note that it is also applied to the query, which is not shown for simplicity (Original source: The Big LLM Architecture Comparison ). MLA, originally proposed in the DeepSeek-V2 paper, became a defining DeepSeek-era idea (especially after DeepSeek-V3 and R1). It is more complicated to implement than GQA and more complicated to serve, but nowadays also often more compelling once model size and context length get large enough that cache traffic starts to dominate, because at the same rate of memory reduction, it can maintain better modeling performance (more on that later). EXAMPLE ARCHITECTURES DeepSeek V3, Kimi K2, GLM-5, Ling 2.5, Mistral Large 3, and Sarvam 105B Instead of caching full-resolution key and value tensors as in MHA and GQA, MLA stores a latent representation and reconstructs the usable state when needed. Essentially, it is a cache compression strategy embedded inside attention, as illustrated in the previous figure. The figure below shows the savings compared to regular MHA.
Figure 14: Once context length grows, the savings from caching a latent representation instead of full K/V tensors become very visible (Original source: LLMs-from-scratch MLA section).

3.2 MLA Ablation Studies

The DeepSeek-V2 paper provided some ablations where GQA looked worse than MHA in terms of modeling performance, while MLA held up much better and could even outperform MHA when tuned carefully. That is a much stronger justification than “it (also) saves memory.” In other words, MLA is the preferable attention mechanism for DeepSeek not just because it is efficient, but because it looked like a quality-preserving efficiency move at large scale. (But colleagues also told me that MLA only works well at a certain size. For smaller models, let’s say <100B, GQA seems to work better, or is at least easier to tune and get right.) Figure 15: GQA drops below MHA here, while MLA remains competitive and can even slightly outperform it. Underlying paper: DeepSeek-V2 . Below is again the comparison between GQA in 30B Sarvam versus MLA in 105B Sarvam. Figure 16: GQA and MLA are solving the same bottleneck from different directions. The tradeoff is simplicity versus better modeling performance for larger models.

3.3 How MLA Spread After DeepSeek

Once DeepSeek V3/R1, V3.1, etc. normalized the design after its introduction in V2, MLA started showing up in a second wave of architectures. Kimi K2 kept the DeepSeek recipe and scaled it up. GLM-5 adopted MLA together with DeepSeek Sparse Attention (from DeepSeek V3.2). Ling 2.5 paired MLA with a linear-attention hybrid. Sarvam released two models where the 30B model stayed with classic GQA and the 105B model switched to MLA. That last pair is particularly useful as it puts the technical-complexity discussion aside. I.e., the Sarvam team implemented both variants and deliberately chose to use GQA for one model and MLA for the other.
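To make the compress-then-reconstruct idea concrete, here is a minimal MLA-style latent-caching sketch. The dimensions are arbitrary toy values, and the real mechanism also compresses the queries and handles RoPE through a separate decoupled path, both of which are omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)
seq, d_model, d_latent, d_head = 16, 64, 8, 64

X = rng.normal(size=(seq, d_model))            # token representations
W_down = rng.normal(size=(d_model, d_latent))  # shared down-projection
W_uk = rng.normal(size=(d_latent, d_head))     # up-projection back to keys
W_uv = rng.normal(size=(d_latent, d_head))     # up-projection back to values

kv_cache = X @ W_down     # only this small latent tensor is cached: (seq, d_latent)

K = kv_cache @ W_uk       # full-resolution K and V are reconstructed on the fly
V = kv_cache @ W_uv

cache_floats_mha = seq * 2 * d_head   # per head: cache K and V at full resolution
cache_floats_mla = seq * d_latent     # one shared latent vector per token
```

With these toy numbers, the cached state shrinks by 16x per head, which is the effect the KV-cache comparison figures in this section are illustrating.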
So, in a sense, that makes MLA feel less like a theoretical alternative and more like a concrete architectural upgrade path once a family scales up.

4. Sliding Window Attention (SWA)

Sliding window attention reduces the memory and compute cost of long-context inference by limiting how many previous tokens each position can attend to. Instead of attending to the entire prefix, each token only attends to a fixed window of recent tokens around its position. Because attention is restricted to a local token neighborhood, this mechanism is often referred to as local attention. Some architectures combine these local layers with occasional global attention layers so that information can still propagate across the entire sequence. Figure 17: The conceptual shift is simple. Regular attention is global attention, while sliding-window attention is local attention. Global attention lets every token see the full prefix; SWA turns many of those layers into local attention layers (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Gemma 3 27B, OLMo 3 32B, Xiaomi MiMo-V2-Flash, Arcee Trinity, Step 3.5 Flash, and Tiny Aya Gemma 3 is still one of the clearest recent SWA examples because it is easy to compare against Gemma 2. Gemma 2 already used a hybrid attention setup with a 1:1 ratio between local and global layers and a 4096-token window. Gemma 3 pushed this further to a 5:1 ratio and reduced the window size to 1024. The key finding was not that local attention is cheaper, because that was already known. Here, the more interesting takeaway from the Gemma 3 ablation study was that using this more aggressively seemed to hurt modeling performance only slightly. The Gemma ablation study suggests that the smaller window and more aggressive local:global ratio have little effect on perplexity. Underlying paper: Gemma 3 article (Original source: The Big LLM Architecture Comparison ).
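As a minimal sketch, the sliding-window restriction amounts to intersecting the usual causal mask with a band of fixed width. The sequence length and window size below are arbitrary toy values:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True = attention allowed: each token sees itself and the previous window-1 tokens."""
    i = np.arange(seq_len)[:, None]   # query positions (rows)
    j = np.arange(seq_len)[None, :]   # key positions (columns)
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(seq_len=8, window=3)
```

Each row allows at most `window` tokens, so the per-layer cache and score computation stop growing with the full prefix length.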
4.2 The Ratio And Window Size

In practice, saying that a model “uses SWA” does not mean it relies on SWA alone. What usually matters are the local-to-global layer pattern and the attention window size. For example: Gemma 3 and Xiaomi use a 5:1 local-to-global pattern. OLMo 3 and Arcee Trinity use a 3:1 pattern. Xiaomi also uses a window size of 128, which is much smaller, and therefore more aggressive, than Gemma’s 1024. SWA is essentially a knob that can be tuned more or less aggressively. Figure 18: The long-context savings come from turning many full-attention layers into local ones, which reduces how much cached context those layers need to consider (Original source: LLMs-from-scratch SWA materials ).

4.3 Combining SWA with GQA

SWA often appears together with GQA because the two ideas address different parts of the same inference problem. SWA reduces how much context a local layer has to consider. GQA reduces how much key-value state each token contributes to the cache. That is why many recent dense models use both rather than treating them as alternatives. Gemma 3 is again a good reference point here, since it combines sliding window attention with grouped-query attention in the same architecture.

5. DeepSeek Sparse Attention

DeepSeek Sparse Attention is one of the architectural changes that appeared in the DeepSeek V3.2 line and later showed up again in GLM-5. Specifically, DeepSeek V3.2 combines it with Multi-head Latent Attention (MLA), and GLM-5 adopts the same pair for the same general reason, namely, reducing inference cost when context lengths get large. EXAMPLE ARCHITECTURES DeepSeek V3.2 and GLM-5 In sliding-window attention, the current token does not attend to the full prefix but only to a fixed local window. This is the same broad idea behind DeepSeek Sparse Attention, where each token also only attends to a subset of previous tokens. However, the selected tokens are not determined by a fixed-width local window.
Instead, DeepSeek Sparse Attention uses a learned sparse pattern. In short, it uses an indexer-plus-selector setup, where a lightning indexer computes relevance scores, and a token selector keeps only a smaller set of high-scoring past positions. The way the subset of tokens is selected is the main difference from sliding-window attention. Sliding-window attention hard-codes locality. DeepSeek Sparse Attention still limits attention to a subset, but it lets the model decide which prior tokens are worth revisiting. Figure 19: Similar to sliding-window attention, DeepSeek Sparse Attention also restricts each token to a subset of prior tokens, but does not do so with a fixed local window (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ).

5.2 DeepSeek Sparse Attention and MLA

DeepSeek V3.2 uses both Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention. MLA reduces KV-cache cost by compressing what gets stored. DeepSeek Sparse Attention reduces how much of the prior context the model has to revisit. Put differently, one optimizes the cache representation, the other optimizes the attention pattern on top of it. Figure 20: DeepSeek V3.2 is the obvious reference point, because this is the model family most closely associated with the sparse-attention idea. The sparse pattern is not random. The first stage is a lightning indexer that scores previous tokens for each new query token. It uses MLA’s compressed token representations and computes a learned similarity score over the prior context, so the model can rank which earlier positions are worth revisiting. The second stage is a token selector. It keeps only a smaller high-scoring subset, for example, a top-k set of past positions, and turns that subset into the sparse attention mask. So the main point is that DeepSeek Sparse Attention does not hard-code the sparsity pattern. It learns which past tokens to keep.
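The two-stage idea can be sketched in a toy form. This is not DeepSeek's actual indexer (which uses its own lightweight attention-like scoring); the `W_index` bilinear score below is an illustrative stand-in for any learned relevance function:

```python
import numpy as np

def select_topk_tokens(q, past_latents, W_index, k):
    """Toy indexer-plus-selector: score past tokens, keep the k highest-scoring ones."""
    scores = past_latents @ (W_index @ q)   # indexer: learned relevance per past token
    keep = np.argsort(scores)[-k:]          # selector: indices of the top-k scores
    mask = np.zeros(past_latents.shape[0], dtype=bool)
    mask[keep] = True                       # sparse attention mask for this query token
    return mask

rng = np.random.default_rng(3)
past = rng.normal(size=(32, 16))   # compressed representations of 32 past tokens
W = rng.normal(size=(16, 16))      # hypothetical indexer weights
q = rng.normal(size=16)            # current query token representation
mask = select_topk_tokens(q, past, W, k=8)
```

The resulting mask plays the same role as the sliding-window band, except its True positions are chosen per query token rather than fixed by position.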
Figure 21: The mechanism consists of a lightning indexer that scores prior tokens and a selector that keeps only a smaller subset for attention (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). DeepSeek Sparse Attention is relatively new and relatively complicated to implement, which is why it has not been as widely adopted as Grouped-Query Attention (GQA) yet.

6. Gated Attention

Gated attention is best understood as a modified full-attention block rather than as a separate attention family. It usually appears inside hybrid stacks that still keep an occasional full-attention layer for exact content retrieval, but add a few stability-oriented changes on top of an otherwise familiar scaled dot-product attention block. Figure 22: Trinity Large is a useful comparison because gated attention is not only a Qwen idea (more on that later). Here the gate appears after the scaled dot-product attention output and before the output projection in a different long-context architecture (Original source: A Dream of Spring for Open-Weight LLMs ).

6.1 Where Gated Attention Appears

The Qwen3-Next and Qwen3.5 architectures show that recent hybrids (covered in the next section) do not replace attention everywhere. Instead, they replace most attention layers with a cheaper alternative and keep a smaller number of full-attention layers in the stack. Those remaining full-attention layers are where gated attention typically appears. Qwen3-Next and Qwen3.5 use it together with Gated DeltaNet in a 3:1 pattern. But hybrid architectures aside, Trinity uses a related gating idea in a more conventional attention stack, as shown in the previous figure. The gated attention block in Qwen-style hybrids or Trinity (not a hybrid) is essentially standard scaled-dot-product attention with a few changes on top.
In the original Gated Attention paper , those changes are presented as a way to make the retained full-attention layers behave more predictably inside a hybrid stack. The block still looks like standard (full) attention, but it adds: an output gate that scales the attention result before it is added back to the residual, a zero-centered QK-Norm variant instead of standard RMSNorm for q and k, and partial RoPE. These are not changes on the scale of MLA or linear attention but merely stability and control changes applied to an otherwise familiar attention block. Figure 23: In Qwen3-Next and Qwen3.5, gated attention appears as the full-attention layer that periodically breaks up runs of Gated DeltaNet blocks. Note that the figure above also includes Gated DeltaNet, which we will cover in the next section.

7. Hybrid Attention

Hybrid attention is a broader design pattern rather than a specific, single mechanism. The overall idea is to keep a transformer-like stack, but replace most of the expensive full-attention layers with cheaper linear or state-space sequence modules. The motivation is long-context efficiency. Full attention grows quadratically with sequence length, so once models move to contexts like 128k, 256k, or 1M tokens, attention memory and compute become expensive enough that using cheaper sequence modules in most layers while keeping only a smaller number of heavier retrieval layers starts making more sense. (Note that this comes with a bit of a modeling performance trade-off, though.) In Qwen3-Next, this pattern appears as a 3:1 mix of Gated DeltaNet and Gated Attention blocks. Gated DeltaNet is also closely related to Mamba-2 (see the Gated Delta Networks: Improving Mamba2 with Delta Rule paper, for instance), and the mechanism can be read as a DeltaNet-style fast-weight update combined with Mamba-style gating.
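Before moving on, the output-gate part of the gated attention block is easy to sketch. This is a rough illustration, not the exact Qwen3-Next or Trinity implementation; the sigmoid gate computed from the block input and the square projection shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
seq_len, d = 6, 8
X = rng.normal(size=(seq_len, d))          # block input (residual stream)
attn_out = rng.normal(size=(seq_len, d))   # stand-in for the SDPA output
W_gate = rng.normal(size=(d, d))           # hypothetical learned gate projection
W_out = rng.normal(size=(d, d))            # output projection

gate = sigmoid(X @ W_gate)                 # input-dependent gate in (0, 1)
Y = (gate * attn_out) @ W_out              # gate the attention result, then project
```

The gate lets the model attenuate an attention layer's contribution per token and per channel, which is the "stability and control" role described above.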
Later architectures keep the same overall idea but swap in other lightweight sequence mixers, such as Kimi Delta Attention, Lightning Attention, or standard Mamba-2. Figure 24: The basic hybrid pattern, where most blocks are cheaper sequence mixers and every fourth block restores a heavier attention layer (Original source The Big LLM Architecture Comparison ). To my knowledge, the first prominent example of a close-to-flagship LLM with hybrid attention was Qwen3-Next in 2025, which does not remove attention completely but mixes three Gated DeltaNet blocks with one Gated Attention block. Here, lightweight Gated DeltaNet blocks do most of the long-context work and keep memory growth much flatter than full attention. The heavier gated-attention layer remains because DeltaNet is less exact at content-based retrieval. Inside a Gated DeltaNet block, the model computes query, key, and value vectors together with two learned gates (α, β). Rather than forming the usual token-to-token attention matrix, it writes to a small fast-weight memory using a delta-rule update. In rough terms, the memory stores a compressed running summary of past information, while the gates control how much new information is added and how much previous state is retained. That makes Gated DeltaNet a linear-attention or recurrent-style mechanism rather than just another tweak to MHA. Relative to Mamba-2, the close connection is that both belong to the linear-time gated sequence-model family, but Gated DeltaNet uses a DeltaNet-style fast-weight memory update instead of the Mamba state-space update. Figure 25: The practical motivation behind the hybrids is shown here in the memory curve. Hybrid stacks with Gated DeltaNet grow much more slowly with context length than ordinary full attention (Original source LLMs-from-scratch DeltaNet materials ). Qwen3.5 moves the former Qwen3-Next hybrid into Qwen’s main flagship series, which is an interesting move. 
This basically signals that the hybrid strategy is a success and that we may see more models with this architecture in the future. Figure 26: Qwen3.5 shows the Qwen team promoting the former Qwen3-Next side-branch into the main model line rather than leaving it as a one-off efficiency variant (Original source A Dream of Spring for Open-Weight LLMs ). 7.2 Kimi Linear And Modified Delta Attention Kimi Linear keeps the same broad transformer skeleton and the same 3:1 pattern, but it changes both halves of the recipe. On the lightweight side, Kimi Delta Attention is a refinement of Gated DeltaNet. Where Qwen3-Next uses a scalar gate per head to control memory decay, Kimi uses channel-wise gating, which gives finer control over the memory update. On the heavier side, Kimi replaces Qwen3-Next’s gated-attention layers with gated MLA layers. So, it’s still the same broader pattern as in Qwen3-Next and Qwen3.5, but both ingredients (slightly) change. I.e., most layers are still handled by a cheaper linear-style mechanism, and periodic heavier layers still remain for stronger retrieval. Figure 27: Kimi Linear keeps the same overall hybrid pattern while changing both the lightweight side and the heavier attention side of the stack (Original source The Big LLM Architecture Comparison ). 7.3 Ling 2.5 And Lightning Attention Ling 2.5 shows another swap on the lightweight side. Instead of Gated DeltaNet, Ling uses a slightly simpler recurrent linear attention variant called Lightning Attention. On the heavier side, it keeps MLA from DeepSeek. Most sequence mixing happens in the cheaper linear-attention blocks, while a smaller number of heavier layers remain to preserve stronger retrieval. The difference is that the specific lightweight mechanism is now Lightning Attention rather than DeltaNet or Kimi Delta Attention. 
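A toy sketch of the gated delta-rule memory update that these DeltaNet-style mixers build on may help here. The fixed gate values and tiny dimensions are illustrative assumptions; real implementations learn per-token gates and use chunked parallel forms rather than a plain Python loop:

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One recurrent step of a gated delta-rule fast-weight memory (toy sketch).

    S: (d_v, d_k) memory; k, q: (d_k,); v: (d_v,).
    alpha in [0, 1] decays (forgets) the old state; beta scales the write.
    """
    S = alpha * S                          # gate: decay the previous memory
    pred = S @ k                           # what the memory currently recalls for key k
    S = S + beta * np.outer(v - pred, k)   # delta rule: write only the prediction error
    out = S @ q                            # read out with the query
    return S, out

rng = np.random.default_rng(5)
d_k, d_v, seq_len = 4, 4, 10
S = np.zeros((d_v, d_k))   # memory size is fixed, independent of sequence length
outputs = []
for _ in range(seq_len):
    k, q = rng.normal(size=d_k), rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    S, o = gated_delta_step(S, k, v, q, alpha=0.9, beta=0.5)
    outputs.append(o)
```

The key property is that `S` stays the same size no matter how long the sequence gets, which is exactly why memory grows so much more slowly than with full attention.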
Figure 28: Ling 2.5 and Qwen3.5 are both linear-attention hybrids, even though Ling swaps in Lightning Attention and MLA instead of the Qwen recipe (Original source A Dream of Spring for Open-Weight LLMs ). Ling 2.5 is aimed more at long-context efficiency than at absolute benchmark leadership. According to the Ling team, it is substantially faster than Kimi K2 at 32k tokens, which is the practical payoff these hybrids are aiming for. Figure 29: Ling 2.5 was presented as a strong efficiency upgrade, with much higher 32k-token throughput than Kimi K2 at the same 1-trillion-parameter scale (Original source Ling 2.5 model hub page ).

7.4 Nemotron And Mamba-2

Nemotron pushes the pattern further away from the transformer baseline. Nemotron 3 Nano is a Mamba-Transformer hybrid that interleaves Mamba-2 sequence-modeling blocks with sparse MoE layers and uses self-attention only in a small subset of layers. This is a more extreme version of the same basic tradeoff discussed above. Here, the lightweight sequence module is a Mamba-2 state-space block rather than a DeltaNet-style fast-weight update. Figure 30: Nemotron 3 Nano uses Mamba-2 for most of the sequence modeling work, with self-attention only appearing in a small subset of layers (Original source The Big LLM Architecture Comparison ). The larger Nemotron 3 Super keeps the Mamba-2 hybrid attention approach and adds other efficiency-oriented changes such as latent MoE and shared-weight multi-token prediction (MTP) for speculative decoding. Figure 31: Nemotron 3 Super keeps the Mamba-2 hybrid attention pattern while adding latent MoE and shared-weight MTP on top (Original source The Big LLM Architecture Comparison ).

Conclusion

Of course, there are many more (mostly niche) attention variants throughout the literature that I haven’t covered here. The focus of this article was on those that are currently used in state-of-the-art (open-weight) models.
In particular, I am looking forward to (1) seeing the brand new Mamba-3 layers getting integrated into the aforementioned hybrid architectures (replacing Gated DeltaNet) and (2) attention residuals being used in general. In practice, you may also wonder what the “best” architecture is at the moment. This is hard to answer, as there are no public experiments that train different architectures on the same training data etc. Hence, we can currently only answer what the best (trained) model choice is for a given problem. In my opinion, hybrid architectures are still a novelty, and the main selling point is mainly (long-context) efficiency versus just modeling performance. Hence, I think they are a great candidate for agent contexts (like OpenClaw). Personally, I think the problem with hybrid architectures is also that the inference stacks are not quite as optimized, yet, and I find that I get better tok/sec throughput when running LLMs locally using more classic setups like GPT-OSS with grouped-query attention. Anyways, I am curious to see what DeepSeek V4 has in store, since DeepSeek has been quite the reliable trend-setter in the recent 2 years.
So, in this article, I thought it would be interesting to recap all the recent attention variants that have been developed and used in prominent open-weight architectures in recent years. My goal is to make the collection useful both as a reference and as a lightweight learning resource. I hope you find it useful and educational! 1. Multi-Head Attention (MHA) Self-attention lets each token look at the other visible tokens in the sequence, assign them weights, and use those weights to build a new context-aware representation of the input. Multi-head attention (MHA) is the standard transformer version of that idea. It runs several self-attention heads in parallel with different learned projections, then combines their outputs into one richer representation. Figure 3: Olmo 2 as an example architecture using MHA. The sections below start with a whirlwind tour of explaining self-attention to explain MHA. It’s more meant as a quick overview to set the stage for related attention concepts like grouped-query attention, sliding window attention, and so on. If you are interested in a longer, more detailed self-attention coverage, you might like my longer Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article. EXAMPLE ARCHITECTURES GPT-2 , OLMo 2 7B , and OLMo 3 7B 1.2 Historical Tidbits And Why Attention Was Invented Attention predates transformers and MHA. Its immediate background is encoder-decoder RNNs for translation. In those older systems, an encoder RNN would read the source sentence token by token and compress it into a sequence of hidden states, or in the simplest version into one final state. Then the decoder RNN had to generate the target sentence from that limited summary. This worked for short and simple cases, but it created an obvious bottleneck once the relevant information for the next output word lived somewhere else in the input sentence. 
In short, the limitation is that the hidden state cannot store an unlimited amount of information or context, and sometimes it would be useful to just refer back to the full input sequence. The translation example below shows one of the limitations of this idea. For instance, a translation can contain many locally reasonable word choices and still fail as a sentence when the model treats the problem too much like a word-by-word mapping. (The top panel shows an exaggerated example where we translate the sentence word by word; obviously, the grammar in the resulting sentence is wrong.) In reality, the correct next word depends on sentence-level structure and on which earlier source words matter at that step. Of course, this could still be translated fine with an RNN, but it would struggle with longer sequences or knowledge retrieval tasks because the hidden state can only store so much information, as mentioned earlier. Figure 4: Translation can fail even when many individual word choices look reasonable because sentence-level structure still matters (Original source LLMs-from-scratch ). The next figure shows that change more directly. When the decoder is producing an output token, it should not be limited to one compressed memory path. It should be able to reach back to the more relevant input tokens directly. Figure 5: Attention breaks the RNN bottleneck by letting the current output position revisit the full input sequence instead of relying on one compressed state alone (Original source LLMs-from-scratch ). Transformers keep that core idea from the aforementioned attention-modified RNN but remove the recurrence. In the classic Attention Is All You Need paper, attention becomes the main sequence-processing mechanism itself (instead of being just part of an RNN encoder-decoder).
In transformers, that mechanism is called self-attention, where each token in the sequence computes weights over all other tokens and uses them to mix information from those tokens into a new representation. Multi-head attention is the same mechanism run several times in parallel. 1.3 The Masked Attention Matrix For a sequence of tokens, attention needs one row of weights per token, so overall we get a matrix. Each row answers a simple question. When updating this token, how much should each visible token matter? In a decoder-only LLM, future positions are masked out, which is why the upper-right part of the matrix is grayed out in the figure below. Self-attention is fundamentally about learning these token-to-token weight patterns, under a causal mask, and then using them to build context-aware token representations. Figure 6: A concrete masked attention matrix where each row belongs to one token, each entry is an attention weight, and future-token entries are removed by the causal mask (Original source Understanding and Coding Self-Attention ). 1.4 Self-Attention Internals The next figure shows how the transformer computes the attention matrix (A) from the input embeddings X, which is then used to produce the transformed inputs (Z). Here Q, K, and V stand for queries, keys, and values. The query for a token represents what that token is looking for, the key represents what each token makes available for matching, and the value represents the information that gets mixed into the output once the attention weights have been computed.
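To make these roles concrete before walking through the steps, here is a minimal single-head causal self-attention sketch in plain Python. The toy dimensions and identity weight matrices are made up purely for readability; a real implementation would use a tensor library and learned weights.

```python
import math

def matmul(a, b):
    # naive matrix multiply, fine for tiny toy examples
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(X, Wq, Wk, Wv):
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Wq[0])  # head dimension, used for the 1/sqrt(d) scaling
    A = []
    for i, q in enumerate(Q):
        # raw relevance scores: dot product of this query with every key
        scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in K]
        # causal mask: positions after i are not visible
        scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        A.append(softmax(scores))
    return matmul(A, V)  # Z: context-aware token representations

# three tokens, embedding dim 2, head dim 2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projections keep the numbers readable
Z = causal_attention(X, W, W, W)
```

Because of the causal mask, the first token can only attend to itself, so its output row simply equals its own value vector.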
The steps are as follows: W_q, W_k, and W_v are weight matrices that project the input embeddings X into Q, K, and V. QK^T produces the raw token-to-token relevance scores. softmax converts those scores into the normalized attention matrix A that we discussed in the previous section. A is applied to V to produce the output matrix Z. Figure 7: The full single-head pipeline, from input embeddings X to the normalized attention matrix A and output representations Z (Original source Understanding and Coding Self-Attention ). The next figure shows the same concept as the previous figure, but the attention matrix computation is hidden inside the “scaled-dot-product attention” box, and we perform the computation only for one input token instead of all input tokens. This is to show a compact form of self-attention with a single head before extending this to multi-head attention in the next section. Figure 8: One attention head is already a complete mechanism. One set of learned projections produces one attention matrix and one context-aware output stream (Original source Understanding and Coding Self-Attention ). 1.5 From One Head To Multi-Head Attention One set of W_q, W_k, and W_v matrices gives us one attention head, which means one attention matrix A and one output matrix Z. (This concept was illustrated in the previous section.) Multi-head attention simply runs several of these heads in parallel with different learned projection matrices. This is useful because different heads can specialize in different token relationships. One head might focus on short local dependencies, another on broader semantic links, and another on positional or syntactic structure. Figure 9: Multi-head attention keeps the same basic attention recipe, but repeats it across several heads in parallel so the model can learn several token-to-token patterns at once (Original source Understanding and Coding Self-Attention ). 2. Grouped-Query Attention (GQA) Grouped-query attention is an attention variant derived from standard MHA.
It was introduced in the 2023 paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints by Joshua Ainslie and colleagues. Instead of giving every query head its own keys and values, it lets several query heads share the same key-value projections, which makes KV caching much cheaper (primarily as a memory reduction) without changing the overall decoder recipe very much. Figure 10: GQA keeps the same overall attention pattern as MHA, but collapses the number of key-value heads by sharing them across multiple query heads (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Dense: Llama 3 8B , Qwen3 4B , Gemma 3 27B , Mistral Small 3.1 24B , SmolLM3 3B , and Tiny Aya 3.35B . Sparse (Mixture-of-Experts): Llama 4 Maverick , Qwen3 235B-A22B , Step 3.5 Flash 196B , and Sarvam 30B . 2.1 Why GQA Became Popular In my architecture comparison article , I framed GQA as the new standard replacement for classic multi-head attention (MHA). The reason is that standard MHA gives every head its own keys and values, which is better from a modeling perspective but expensive once we have to keep all of that state in the KV cache during inference. In GQA, we keep a larger set of query heads, but we reduce the number of key-value heads and let multiple queries share them. That lowers both parameter count and KV-cache traffic without making drastic implementation changes like multi-head latent attention (MLA), which will be discussed later. In practice, that made it, and keeps it, a very popular choice for labs that want something cheaper than MHA but simpler to implement than newer compression-heavy alternatives like MLA. 2.2 GQA Memory Savings GQA results in big savings in KV storage, since the fewer key-value heads we keep per layer, the less cached state we need per token. That is why GQA becomes more useful as sequence length grows. GQA is also a spectrum.
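How wide that spectrum is in memory terms can be seen with a quick back-of-the-envelope calculation. The model dimensions below are hypothetical, not taken from any specific release:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # factor of 2 for keys and values; one cached vector per
    # key-value head, per layer, per token; fp16 by default
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# hypothetical model: 32 layers, 32 query heads, head_dim 128, 32k context
seq_len = 32_000
mha = kv_cache_bytes(32, 32, 128, seq_len)  # every query head has its own K/V
gqa = kv_cache_bytes(32, 8, 128, seq_len)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(32, 1, 128, seq_len)   # single shared K/V head

print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB, MQA: {mqa / 1e9:.2f} GB")
```

The cache shrinks linearly with the number of key-value heads, which is why the savings compound as the context window grows.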
If we reduce all the way down to one shared K/V group, we are effectively in multi-query attention territory, which is even cheaper but can hurt modeling quality more noticeably. The sweet spot is usually somewhere in between multi-query attention (1 shared group) and MHA (where the number of K/V groups equals the number of query heads), where the cache savings are large but the modeling degradation relative to MHA stays modest. Figure 11: Lower is better. Once the context window grows, KV-cache savings become more pronounced. (Original source: LLMs-from-scratch GQA materials ) 2.3 Why GQA Still Matters In 2026 More advanced variants such as MLA are becoming popular because they can offer better modeling performance at the same KV efficiency levels (e.g., as discussed in the ablation studies of the DeepSeek-V2 paper ), but they also involve a more complicated implementation and a more complicated attention stack. GQA remains appealing because it is robust, easier to implement, and also easier to train (since fewer hyperparameters need tuning, in my experience). That is why some of the newer releases still stay deliberately classic here. E.g., in my Spring Architectures article, I mentioned MiniMax M2.5 and Nanbeige 4.1 as models that remained very classic, using only grouped-query attention without piling on other efficiency tricks. Sarvam is a particularly useful comparison point as well: the 30B model keeps classic GQA, while the 105B version switches to MLA. Figure 12: Total KV cache sizes for 105B Sarvam (using MLA) versus 30B Sarvam (using GQA), versus using plain MHA. 3. Multi-Head Latent Attention (MLA) The motivation behind Multi-head Latent Attention (MLA) is similar to Grouped-Query Attention (GQA). Both are solutions for reducing KV-cache memory requirements. The difference between GQA and MLA is that MLA shrinks the cache by compressing what gets stored rather than by reducing how many key-value heads are stored through sharing.
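In code, the compress-then-reconstruct idea can be sketched roughly like this. This is a heavy simplification that ignores the query-side compression and the RoPE handling; the dimensions and random weights are made up for illustration only:

```python
import random

random.seed(0)

d_model, d_latent = 8, 2  # the cached latent is much smaller than d_model

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(vec, W):
    # vector-matrix product: len(vec) inputs -> one output per column of W
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*W)]

W_down = rand_matrix(d_model, d_latent)   # compression applied before caching
W_up_k = rand_matrix(d_latent, d_model)   # reconstructs keys on the fly
W_up_v = rand_matrix(d_latent, d_model)   # reconstructs values on the fly

latent_cache = []
for token_emb in [rand_matrix(1, d_model)[0] for _ in range(5)]:
    # only the small d_latent-sized vector is stored, not full K and V
    latent_cache.append(project(token_emb, W_down))

# at attention time, K and V are recovered from the cached latents
keys = [project(c, W_up_k) for c in latent_cache]
values = [project(c, W_up_v) for c in latent_cache]
```

In this toy setup the cache holds 5 × 2 floats instead of the 5 × 2 × 8 floats that full key and value tensors would require, which is the whole point of the scheme.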
Figure 13: Unlike GQA, MLA does not reduce KV cost by grouping heads. It reduces it by caching a compressed latent representation. Note that it is also applied to the query, which is not shown for simplicity (Original source: The Big LLM Architecture Comparison ). MLA, originally proposed in the DeepSeek-V2 paper, became a defining DeepSeek-era idea (especially after DeepSeek-V3 and R1). It is more complicated than GQA to implement and to serve, but nowadays often more compelling once model size and context length get large enough that cache traffic starts to dominate, because at the same level of memory reduction it can maintain better modeling performance (more on that later). EXAMPLE ARCHITECTURES DeepSeek V3 , Kimi K2 , GLM-5 , Ling 2.5 , Mistral Large 3 , and Sarvam 105B 3.1 Compression, Not Sharing Instead of caching full-resolution key and value tensors as in MHA and GQA, MLA stores a latent representation and reconstructs the usable state when needed. Essentially, it is a cache compression strategy embedded inside attention, as illustrated in the previous figure. The figure below shows the savings compared to regular MHA. Figure 14: Once context length grows, the savings from caching a latent representation instead of full K/V tensors become very visible (Original source: LLMs-from-scratch MLA section). 3.2 MLA Ablation Studies The DeepSeek-V2 paper provided some ablations where GQA looked worse than MHA in terms of modeling performance, while MLA held up much better and could even outperform MHA when tuned carefully. That is a much stronger justification than “it (also) saves memory.” In other words, MLA is the preferred attention mechanism for DeepSeek not just because it is efficient, but because it looked like a quality-preserving efficiency move at large scale. (But colleagues also told me that MLA only works well at a certain size.
For smaller models, let’s say <100B, GQA seems to work better, or is at least easier to tune and get right.) Figure 15: GQA drops below MHA here, while MLA remains competitive and can even slightly outperform it. Underlying paper: DeepSeek-V2 . Below is again the comparison between GQA in 30B Sarvam versus MLA in 105B Sarvam. Figure 16: GQA and MLA are solving the same bottleneck from different directions. The tradeoff is simplicity versus better modeling performance for larger models. 3.3 How MLA Spread After DeepSeek Once DeepSeek V3, R1, V3.1, and so on normalized the design after its introduction in V2, it started showing up in a second wave of architectures. Kimi K2 kept the DeepSeek recipe and scaled it up. GLM-5 adopted MLA together with DeepSeek Sparse Attention (from DeepSeek V3.2). Ling 2.5 paired MLA with a linear-attention hybrid. Sarvam released two models where the 30B model stayed with classic GQA and the 105B model switched to MLA. That last pair is particularly useful because it sets the technical-complexity discussion aside: the Sarvam team implemented both variants and deliberately chose GQA for one and MLA for the other. So, in a sense, that makes MLA feel less like a theoretical alternative and more like a concrete architectural upgrade path once a family scales up. 4. Sliding Window Attention (SWA) Sliding window attention reduces the memory and compute cost of long-context inference by limiting how many previous tokens each position can attend to. Instead of attending to the entire prefix, each token only attends to a fixed window of recent tokens around its position. Because attention is restricted to a local token neighborhood, this mechanism is often referred to as local attention. Some architectures combine these local layers with occasional global attention layers so that information can still propagate across the entire sequence. Figure 17: The conceptual shift is simple.
Regular attention is global attention, while sliding-window attention is local attention. Global attention lets every token see the full prefix; SWA turns many of those layers into local attention layers (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Gemma 3 27B , OLMo 3 32B , Xiaomi MiMo-V2-Flash , Arcee Trinity , Step 3.5 Flash , and Tiny Aya . 4.1 Gemma 3 As A Reference Point Gemma 3 is still one of the clearest recent SWA examples because it is easy to compare against Gemma 2. Gemma 2 already used a hybrid attention setup with a 1:1 ratio between local and global layers and a 4096-token window. Gemma 3 pushed this further to a 5:1 ratio and reduced the window size to 1024. The key finding was not that local attention is cheaper, because that was already known. The more interesting takeaway from the Gemma 3 ablation study was that using local attention this aggressively seemed to hurt modeling performance only slightly: the smaller window and the more aggressive local:global ratio have little effect on perplexity. Underlying paper: Gemma 3 article (Original source: The Big LLM Architecture Comparison ). 4.2 The Ratio And Window Size In practice, saying that a model “uses SWA” does not mean it relies on SWA alone. What usually matters are the local-to-global layer pattern and the attention window size. For example: Gemma 3 and Xiaomi use a 5:1 local-to-global pattern. OLMo 3 and Arcee Trinity use a 3:1 pattern. Xiaomi also uses a window size of 128, which is much smaller, and therefore more aggressive, than Gemma’s 1024. Figure 18: The long-context savings come from turning many full-attention layers into local ones, which reduces how much cached context those layers need to consider (Original source: LLMs-from-scratch SWA materials ). 4.3 Combining SWA with GQA SWA often appears together with GQA because the two ideas address different parts of the same inference problem.
SWA reduces how much context a local layer has to consider. GQA reduces how much key-value state each token contributes to the cache. That is why many recent dense models use both rather than treating them as alternatives. Gemma 3 is again a good reference point here, since it combines sliding window attention with grouped-query attention in the same architecture. 5. DeepSeek Sparse Attention (DSA) DeepSeek Sparse Attention is one of the architectural changes that appeared in the DeepSeek V3.2 line and later showed up again in GLM-5. Specifically, DeepSeek V3.2 combines it with Multi-head Latent Attention (MLA) , and GLM-5 adopts the same pair for the same general reason, namely, reducing inference cost when context lengths get large. EXAMPLE ARCHITECTURES DeepSeek V3.2 and GLM-5 5.1 Changes Relative To Sliding-Window Attention In sliding-window attention, the current token does not attend to the full prefix but only to a fixed local window. This is the same broad idea behind DeepSeek Sparse Attention, where each token also only attends to a subset of previous tokens. However, the selected tokens are not determined by a fixed-width local window. Instead, DeepSeek Sparse Attention uses a learned sparse pattern. In short, it uses an indexer-plus-selector setup, where a lightning indexer computes relevance scores, and a token selector keeps only a smaller set of high-scoring past positions. The way the subset of tokens is selected is the main difference from sliding-window attention. Sliding-window attention hard-codes locality. DeepSeek Sparse Attention still limits attention to a subset, but it lets the model decide which prior tokens are worth revisiting. Figure 19: Similar to sliding-window attention, DeepSeek Sparse Attention also restricts each token to a subset of prior tokens, but does not do so with a fixed local window (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). 
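The difference between the hard-coded sliding-window rule and the learned selection rule can be sketched side by side. The per-token scores below are a hypothetical stand-in for the lightning indexer's output, not DeepSeek's actual scoring network:

```python
import heapq

def sliding_window_visible(i, window):
    # fixed rule: the window most recent positions, up to and including i
    return set(range(max(0, i - window + 1), i + 1))

def sparse_visible(i, scores, k):
    # learned rule: keep the k highest-scoring past positions
    # (plus the current token itself)
    past = list(range(i))
    top = heapq.nlargest(k, past, key=lambda j: scores[j])
    return set(top) | {i}

# token at position 6, with a made-up relevance score per past position
scores = [0.9, 0.1, 0.2, 0.8, 0.1, 0.3]
local = sliding_window_visible(6, window=3)   # locality is hard-coded
learned = sparse_visible(6, scores, k=3)      # can reach far back in the prefix
```

With these toy scores, the sliding window keeps only positions {4, 5, 6}, while the learned selector keeps {0, 3, 5, 6}: it reaches all the way back to position 0 because that token scored highest, which is exactly the flexibility a fixed window cannot offer.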
5.2 DeepSeek Sparse Attention and MLA DeepSeek V3.2 uses both Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention. MLA reduces KV-cache cost by compressing what gets stored. DeepSeek Sparse Attention reduces how much of the prior context the model has to revisit. Put differently, one optimizes the cache representation, the other optimizes the attention pattern on top of it. Figure 20: DeepSeek V3.2 is the obvious reference point, because this is the model family most closely associated with the sparse-attention idea. The sparse pattern is not random. The first stage is a lightning indexer that scores previous tokens for each new query token. It uses MLA’s compressed token representations and computes a learned similarity score over the prior context, so the model can rank which earlier positions are worth revisiting. The second stage is a token selector. It keeps only a smaller high-scoring subset, for example, a top-k set of past positions, and turns that subset into the sparse attention mask. So the main point is that DeepSeek Sparse Attention does not hard-code the sparsity pattern. It learns which past tokens to keep. Figure 21: The mechanism consists of a lightning indexer that scores prior tokens and a selector that keeps only a smaller subset for attention (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). DeepSeek Sparse Attention is relatively new and relatively complicated to implement, which is why it has not been as widely adopted as Grouped-Query Attention (GQA) yet. 6. Gated Attention Gated attention is best understood as a modified full-attention block rather than as a separate attention family. It usually appears inside hybrid stacks that still keep an occasional full-attention layer for exact content retrieval, but add a few stability-oriented changes on top of an otherwise familiar scaled dot-product attention block.
Figure 22: Trinity Large is a useful comparison because gated attention is not only a Qwen idea (more on that later). Here the gate appears after the scaled dot-product attention output and before the output projection in a different long-context architecture (Original source: A Dream of Spring for Open-Weight LLMs ). 6.1 Where Gated Attention Appears The Qwen3-Next and Qwen3.5 architectures show that recent hybrids (covered in the next section) do not replace attention everywhere. Instead, they replace most attention layers with a cheaper alternative and keep a smaller number of full-attention layers in the stack. Those remaining full-attention layers are where gated attention typically appears. Qwen3-Next and Qwen3.5 use it together with Gated DeltaNet in a 3:1 pattern. But hybrid architectures aside, Trinity uses a related gating idea in a more conventional attention stack, as shown in the figure above. 6.2 Gated Attention Relative To Standard Attention The gated attention block in Qwen-style hybrids or Trinity (not a hybrid) is essentially standard scaled-dot-product attention with a few changes on top. In the original Gated Attention paper , those changes are presented as a way to make the retained full-attention layers behave more predictably inside a hybrid stack. The block still looks like standard (full) attention, but it adds: an output gate that scales the attention result before it is added back to the residual, a zero-centered QK-Norm variant instead of standard RMSNorm for q and k, and partial RoPE.
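Of these changes, the output gate is the simplest to sketch. In simplified per-token form it is a sigmoid gate computed from the layer's input hidden state, applied elementwise to the attention output before the output projection. The dimensions and weights below are made up for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_output(attn_out, hidden, W_gate):
    # gate values in (0, 1), computed from the layer input via a
    # learned projection W_gate followed by a sigmoid
    gate = [sigmoid(sum(h * w for h, w in zip(hidden, col))) for col in zip(*W_gate)]
    # elementwise scaling of the attention result before the output projection
    return [a * g for a, g in zip(attn_out, gate)]

hidden = [0.5, -1.0, 2.0]
attn_out = [1.0, 1.0, 1.0]
W_gate = [[10.0, 0.0, 0.0],
          [0.0, 10.0, 0.0],
          [0.0, 0.0, -10.0]]
gated = gated_attention_output(attn_out, hidden, W_gate)
# the first component passes mostly through; the last is strongly suppressed
```

The point of the gate is exactly this selective suppression: the model can learn to dampen an attention output channel before it enters the residual stream, which is the stability-oriented behavior described above.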

daniel.haxx.se 2 days ago

NTLM and SMB go opt-in

The NTLM authentication method was always a beast. It is a proprietary protocol designed by Microsoft that was reverse engineered a long time ago. That effort resulted in the online documentation that I based the curl implementation on back in 2003. I then also wrote the NTLM code for wget while at it. NTLM broke with the HTTP paradigm: it is made to authenticate the connection instead of the request , which is what HTTP authentication is supposed to do and what all the other methods do. This might sound like a tiny and insignificant detail, but it has a major impact on all HTTP implementations everywhere. Indirectly, it is also the cause of quite a few security related issues in HTTP code, because NTLM needs many special exceptions and extra unique treatments. curl has recorded no fewer than seven past security vulnerabilities in NTLM related code! While that may not be only NTLM’s fault, it certainly does not help. The connection-based concept also makes the method incompatible with HTTP/2 and HTTP/3. NTLM requires services to stick to HTTP/1. NTLM (v1) uses super weak cryptographic algorithms (DES and MD5), which makes it a bad choice even when disregarding the other reasons. We are slowly deprecating NTLM in curl, starting by making it opt-in. Starting in curl 8.20.0, NTLM is disabled by default in the build unless specifically enabled. Microsoft themselves have deprecated NTLM already. The wget project looks like it is about to make their NTLM support opt-in. curl only supports SMB version 1. This protocol uses NTLM for authentication, and NTLM is equally bad there. Without NTLM enabled in the build, SMB support will also get disabled. But also: SMBv1 is in itself a weak protocol that is barely used by curl users, so this protocol is also opt-in starting in curl 8.20.0. You need to explicitly enable it in the build to get it added.
I want to emphasize that we have not removed support for these ancient protocols; we just strongly discourage using them. I believe this is a first step down a ladder that will eventually see them removed completely.
