Latest Posts (20 found)
Jim Nielsen 25 days ago

You Might Debate It — If You Could See It

Imagine I’m the design leader at your org and I present the following guidelines I want us to adopt as a team for doing design work:

Typography: Use expressive, purposeful fonts and avoid default stacks (Inter, Roboto, Arial, system).
Motion: Use a few meaningful animations (page-load, staggered reveals) instead of generic micro-motions.
Background: Don't rely on flat, single-color backgrounds; use gradients, shapes, or subtle patterns to build atmosphere.
Overall: Avoid boilerplate layouts and interchangeable UI patterns. Vary themes, type families, and visual languages.

How do you think that conversation would go? I can easily imagine a spirited debate where some folks disagree with any or all of my points, arguing that they should be struck as guidelines from our collective ethos of craft. Perhaps some are boring, or too opinionated, or too reliant on trends. There are lots of valid, defensible reasons. I can easily see this discussion being an exercise in frustration, where we debate for hours and get nowhere — “I suppose we can all agree to disagree”.

And yet — thanks to a link to Codex’s front-end tool guidelines in Simon Willison’s article about how coding agents work — I see that these are exactly the kind of guidelines that are tucked away inside an LLM that’s generating output for many teams. It’s like a Trojan Horse of craft: guidelines you might never agree to explicitly are guiding LLM outputs, which means you are agreeing to them implicitly. It’s a good reminder about the opacity of the instructions baked into generative tools. We would debate an open set of guidelines for hours, but if they’re opaquely baked into a tool without our knowledge, does anybody even care? When you offload your thinking, you might be on-loading someone else’s you’d never agree to — personally or collectively.

Reply via: Email · Mastodon · Bluesky

1 view

Top ten Figma betrayals

Figma is the industry standard for painting pretty pictures of websites. It’s where designers spend my designated dev time pushing pixels around one too many artboards. Figma promises to remove the proverbial fence between design and development. In reality it provides the comfort of an ideal viewport that doesn’t exist. I don’t mind Figma (the software), although I prefer Penpot myself. I still dabble in the deceptive arts of web design. Don’t be thinking I’m out here hating on designers. I like to stick my nose inside a Figma file and point out issues before they escalate. Below I cover classic Figma betrayals that I bet you’ve experienced. Betrayals happen when software promises more than it can deliver.

Take a gander at this amazing website design I whipped up in Figma to illustrate the most common betrayals. I told you I was a designer! I’ll evolve this design throughout the post. Figma has deemed 1440×1024 to be “Desktop” resolution so I’ve started there. In this mockup I’ve added a full-width banner of our hero Johnny Business. I’ve built this website more times than I care to remember. I’ll repeat the same question here I ask every time I build it: what happens at other viewport sizes? Do I scale the banner proportionally? On wider viewports this is likely to push content out of sight. It might even require scrolling to see the entire image on Johnny’s ultra-wide 8K. The phrase “above the fold” will be spoken in a Teams call, can we avoid that? Do I also set a maximum height on the banner? This is going to decapitate poor Johnny! He paid a lot for that haircut.

What are we doing below the “Desktop” viewport, by the way? Let’s design for the 402×874 resolution Figma calls “iPhone 17” because it was first on the list. Note the absolute perfect crop of Johnny’s sockless businessing. Okay, next question: how do we move between “mobile” and “desktop”? That’s a very specific focal point. We can’t just change it willy-nilly! Code has rules; logic. A website must be responsive between all breakpoints. Are we going to use multiple images? At what breakpoint do they swap? Because that perfectly cropped mobile image doesn’t scale up very far.

Hold the phone! A shadow stakeholder has asked for a redesign to “make it pop!” The ultra-wide problem has been solved with a centred fixed-width style, if that is the intention? Does either the banner or header stretch to the edge of the viewport? More importantly, that image and text have no room to move. I’ve only reduced the viewport by 200 pixels and it’s already crashing into Johnny’s face. Are we expecting breakpoints every 100 pixels? — No, wait! Please don’t spend more time designing more breakpoints! Okay, I’ll hold until more breakpoints are designed. Are we extending my development deadline? No. Okay.

As development continues I’ve got more bad news to share. Figma is very happy allowing us to enter arbitrary line breaks for the perfect text fit. That’s not how the web works. One of these options is probably what we’ll see if text is left to naturally break. Yes, we can technically allow for a manual line break. That’s a pain in the content management system, but sure. Text is still forced to wrap on a smaller viewport, then what? Oh that? Now you want the manual line break to magically disappear? (╯°□°)╯︵ ┻━┻

I lied when I said “top ten” Figma betrayals. The issues above can appear in hundreds of guises across any component. If you’re betrayed once you’ll be hit again and again. Figma is not exactly conducive to responsive web design. 
Designing more breakpoints often leads to more questions, not fewer. Another betrayal I pull my hair out over is the three card pattern packed with content. This leads to an immediate breakpoint where one card drops awkwardly below. I dread this because the word “carousel” will be uttered and my sobbing is heard far and wide. Carousels are not a content strategy.

I was once inspecting a Figma file only to witness the enemy cursor drive by and drop several dots underneath an image. The audacity!

Figma betrayals are classic waterfall mistakes that are solved by human conversation. Developers need to be part of the design process to ask these questions. Content authors should be involved before and not after a design is complete. You’ll note I never answered the questions above because what might work for my fictional design isn’t universal. On a tangential topic Matthias Ott notes:

Think about what actually happens when a designer and an engineer disagree about an interaction pattern. There’s a moment of tension – maybe even frustration. The engineer says it’ll be fragile. The designer says it’s essential for the experience. Neither is wrong, necessarily. But the conversation – if your process allows for it to happen – that back-and-forth where both sides have to articulate why they believe what they believe, is where the design becomes robust and both people gain experience. Not in the Figma file. Not in the pull request. In the friction between two people who care about different things and are forced to find a shared answer.
The Shape of Friction - Matthias Ott

Figma is not friction-free and that’s fine. We can’t expect any software in the hands of a single person to solve problems alone. Software doesn’t know what questions to ask. Not then with Clippy, not now with Copilot. Humans should talk to one another, not the software. Together we can solve things early the easy way, or later the hard way. One thing that has kept me employed is the ability to identify questions early and not allow Fireworks, Photoshop, Sketch, XD, and now Figma to lead a project astray.

Thanks for reading! Follow me on Mastodon and Bluesky. Subscribe to my Blog and Notes or Combined feeds.

0 views

vegan with a soy sensitivity

As a kid, I got diagnosed with a soy allergy; it caused me to itch everywhere and scratch until it bled, all over the body, and worse. I went through a desensitization process of weekly shots until my symptoms improved and went away. Until last year, I could eat soy with no issue; very convenient when you’re vegan. Then it seemingly came back and caused some nasty rashes. Took me a while to identify the culprit. Unfortunately, another round of desensitization is contraindicated for me and likely won’t work again, so I’m just having to roll with it. I really love tofu, edamame, natto, miso, soy sauce, tempeh, lao gan ma and more, so that sucks, but avoiding it has been easier than I thought. I’m not really that fond of eating many replacement products; I like veggie pans with just seasoned vegetables and some beans or other protein the most, and I prefer oat milk to soy milk. The only thing I consciously had to switch was going from sugarfree soy skyr to a sugarfree pea-based yoghurt. Other than that, whole foods have been my friend, and there is a surprising number of replacement products made from bean or pea protein, even chickpeas. I like the chickpea tofu I found, Beyond’s stuff is made with pea protein as well, Seitan still works, and we replaced the TVP soy chunks with ones made from field beans, whose powder is also great for egg replacement in baking and for scrambled egg. Kidney bean patties are awesome, too, and red lentil stews are a comfort food to me. I can just use coconut cream instead of soy creams. So aside from losing some of my comfort foods, this has been a rather painless switch. Reply via email Published 23 Mar, 2026

0 views

7 Things This Week [#183]

A weekly list of interesting things I found on the internet, posted on Sundays. Sometimes themed, often not.
1️⃣ That screamy sound you hear when peeling tape? It’s a ‘sonic whisper’ from tearing at twice the speed of sound! [ 🔗 sciencealert.com ]
2️⃣ Craig Mod built the accounting software of his dreams, fitting his exact international needs, which can be adapted with Claude Code as needed. Sounds amazing. [ 🔗 craigmod.com ]
3️⃣ Chris Coyier argues that web forms should always automatically email you a copy of your submission. I agree, though I wouldn’t be opposed to it being optional, as long as the default is for the copy to be sent. [ 🔗 email-is-good.com ]
4️⃣ Terry Godier’s essay about how all the objects in our lives have steadily stolen more of our attention, and made us feel guilty about it, is excellent. As is its web design. You gotta read this one in its original form. [ 🔗 terrygodier.com ]
5️⃣ Stephen Hackett (via James Thomson) shared some incredible 5K wallpapers featuring Lil Finder Guy. I love how the Lil Guy’s taken the Mac community by storm. [ 🔗 512pixels.net ]
6️⃣ I thought this tweet from Caleb Sexton was a joke about Kagi having ‘LinkedIn Speak’ as a language that you could translate into. It’s not a joke. It’s real. [ 🦣 mastodon.social ]
7️⃣ D. Griffin Jones did the thing and put an episode of the Connected podcast onto a floppy disk. Incredible commitment to the bit! [ 🦣 tech.lgbt ]
Thanks for reading 7 Things . If you enjoyed these links or have something neat to share, please let me know . And remember that you can get more links to internet nuggets that I’m finding every day by following me @jarrod on the social web. HeyDingus is a blog by Jarrod Blundy about technology, the great outdoors, and other musings. If you like what you see — the blog posts , shortcuts , wallpapers , scripts , or anything — please consider leaving a tip , checking out my store , or just sharing my work. Your support is much appreciated! I’m always happy to hear from you on social , or by good ol' email .

0 views

ChatGPT, Claude, and Gemini Render Markdown in the Browser. I Do the Opposite

The big AI chat apps ship heavy rendering libraries to every device. Cheddy Chat renders markdown server-side and streams finished HTML, eliminating 160-440KB of client JavaScript while keeping the main thread free.
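For context, the general approach described here could look something like the following. This is only a sketch of the idea, not Cheddy Chat's actual code, and it assumes the Python-Markdown package: convert markdown to HTML on the server, then stream the finished HTML so the browser never needs a client-side markdown renderer.

```python
# Sketch of the idea only (not Cheddy Chat's actual implementation):
# render markdown to HTML on the server, then stream the finished HTML in chunks.
import markdown  # Python-Markdown package (assumed dependency)

def render_and_stream(md_text, chunk_size=1024):
    html = markdown.markdown(md_text)            # server-side markdown -> HTML
    for i in range(0, len(html), chunk_size):    # stream the already-rendered HTML
        yield html[i:i + chunk_size]

for chunk in render_and_stream("**Hello** from the *server*"):
    print(chunk, end="")
```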

0 views

Experimenting with Starlette 1.0 with Claude skills

Starlette 1.0 is out! This is a really big deal. I think Starlette may be the Python framework with the most usage compared to its relatively low brand recognition because Starlette is the foundation of FastAPI , which has attracted a huge amount of buzz that seems to have overshadowed Starlette itself. Tom Christie started working on Starlette in 2018 and it quickly became my favorite out of the new breed of Python ASGI frameworks. The only reason I didn't use it as the basis for my own Datasette project was that it didn't yet promise stability, and I was determined to provide a stable API for Datasette's own plugins... albeit I still haven't been brave enough to ship my own 1.0 release (after 26 alphas and counting)!

Then in September 2025 Marcelo Trylesinski announced that Starlette and Uvicorn were transferring to his GitHub account, in recognition of his many years of contributions and to make it easier for him to receive sponsorship against those projects. The 1.0 version has a few breaking changes compared to the 0.x series, described in the release notes for 1.0.0rc1 that came out in February. The most notable of these is a change to how code runs on startup and shutdown. Previously that was handled by on_startup and on_shutdown parameters, but the new system uses a neat lifespan mechanism instead, based around an async context manager (see the sketch at the end of this entry).

If you haven't tried Starlette before it feels to me like an asyncio-native cross between Flask and Django, unsurprising since creator Tom Christie is also responsible for Django REST Framework. Crucially, this means you can write most apps as a single Python file, Flask style. This makes it really easy for LLMs to spit out a working Starlette app from a single prompt. There's just one problem there: if 1.0 breaks compatibility with the Starlette code that the models have been trained on, how can we have them generate code that works with 1.0?

I decided to see if I could get this working with a Skill . Regular Claude Chat on claude.ai has skills, and one of those default skills is the skill-creator skill. This means Claude knows how to build its own skills. So I started a chat session and told it: Clone Starlette from GitHub - it just had its 1.0 release. Build a skill markdown document for this release which includes code examples of every feature. I didn't even tell it where to find the repo, Starlette is widely enough known that I expected it could find it on its own. It ran a git clone against the old repository name, but GitHub handles redirects automatically so this worked just fine.

The resulting skill document looked very thorough to me... and then I noticed a new button at the top I hadn't seen before labelled "Copy to your skills". So I clicked it, and now my regular Claude chat has access to that skill! I started a new conversation and prompted: Build a task management app with Starlette, it should have projects and tasks and comments and labels And Claude did exactly that, producing a simple GitHub Issues clone using Starlette 1.0, a SQLite database (via aiosqlite ) and a Jinja2 template. Claude even tested the app manually.

For all of the buzz about Claude Code, it's easy to overlook that Claude itself counts as a coding agent now, fully able to both write and then test the code that it is writing. Here's what the resulting app looked like. The code is here in my research repository .

You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .
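For reference, the lifespan mechanism mentioned above looks roughly like this. It is a minimal sketch based on the documented Starlette lifespan API; the startup/shutdown contents are illustrative, not taken from the generated task app.

```python
import contextlib

from starlette.applications import Starlette
from starlette.responses import PlainTextResponse
from starlette.routing import Route

@contextlib.asynccontextmanager
async def lifespan(app):
    # startup: open database connections, warm caches, etc. (illustrative)
    app.state.ready = True
    yield
    # shutdown: close connections, flush state, etc.
    app.state.ready = False

async def homepage(request):
    return PlainTextResponse("Hello from Starlette 1.0")

# The lifespan context manager replaces the old on_startup/on_shutdown parameters.
app = Starlette(routes=[Route("/", homepage)], lifespan=lifespan)
```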

0 views
David Bushell Yesterday

I should build a game

I should build a game! I feel like that’s a common dream, right? Game development is what got me interested in design and programming to begin with. I learnt ECMAScript via Flash ActionScript many moons ago. Some time later “Thoughts on Flash” brought Flash a swift demise and a ruined legacy. History is written by the winners, they say. Although Flash was largely proprietary software, and Adobe would have ruined it themselves, Flash was a wonderfully creative tool in its prime. I studied art and went into print/web design before transitioning almost entirely to front-end dev. I’ve been trapped here ever since! In that time, open web standards have become way more powerful than Flash ever was. Today HTML is the new Flash.

Over my winter break I created a new playground where I relearned old tricks by building fun little canvas prototypes. Just basic stuff. No libraries or game engines. This is my retreat of solace until the “AI” fallout blows over. I’ll be sharing my slop-free explorations into game dev. The purpose here is understanding and creativity. No amount of prompt-fondling can achieve that! Work got busy, which is a good thing I guess, and I haven’t had time to build more. If the web industry does fall apart, at least I have a fallback plan to keep me busy!

I’m going to build the games I always wanted to. Or at least try. I’ve been playing Slay the Spire 2 recently and I thought, “I could build that!” — I mean, I could technically build a shallow shitty clone. Nevertheless, it inspired me once again to consider if I really could design and build a game. I’ve set myself a personal goal of spending a few hours every week to create something game related. Maybe that’s sketching concept art, or plotting puzzles, or writing code, or researching, or just daydreaming ideas. Not with the grand plan of creating “the game”. I don’t know where it will lead but I know I’ll enjoy the process. Whether I share anything is unknown.

Thanks for reading! Follow me on Mastodon and Bluesky . Subscribe to my Blog and Notes or Combined feeds.

0 views
ava's blog Yesterday

what's in my todo app

I use a gamified todo app that I log into daily, and have been using it for almost a year now. The interaction with six of my friends kinda drew me in; we can have goals together, send each other encouraging messages, visit each other in our rooms and gift each other items. Each day I check off enough on my list, I send a little bird off to an adventure and then it discovers something. I also get little micropets. What I also enjoy is that it's not strictly a productivity-focused app, it's more about selfcare. It offers soundscapes, meditations, a mood tracker, breathing exercises, physical exercises, mental health quizzes, journaling prompts and more.

Initially, I used it like any other todo app, meaning I wanted to get everything on the list done in a day and wanted to build a streak. That didn't work out, like it always does, and I chose to embrace the format of the app more. Now, I use it as a list of suggestions to do, from optional and kind things to gentle reminders of what needs doing. I used to struggle a lot with sitting around wanting to do things or knowing I needed to do stuff, but not exactly being sure what, or feeling like I'm missing something. For years, I made lists for everything. Nowadays, it's all combined in that app and not spread between different notes. I have set all goals to just continue being there until they're checked off, and they can be skipped and snoozed as well, all neatly sorted into categories. Let me show you.

The hygiene category reminds me to
- change the bed sheets on every Sunday
- do laundry on Saturday
- clean the bathrooms on Tuesday
- take out the trash on Saturday (or as needed)
- vacuum on Monday and Friday
- dust and wipe surfaces on Wednesday.

This holds all the stuff I consider productive. Daily stuff is:
- spend 5 minutes tidying my home (I usually do this automatically, because I tidy up a bit first thing in the morning and before going to bed, and I always try to take stuff with me whenever I go through the apartment)
- read a book or magazine

The less frequent stuff:
- water plants (Thursday)
- do a case for Noyb (Friday-Sunday)

This category is usually intended for daily reminders to reach out to people, suggestions to make plans, to remember everyone that loves you, and all that. For me, it has
- do favors for my wife
- take a stretch break (this is under connection because this is my wife and I's shared goal we do together)

This checks my daily drinking and goals for when to eat. This takes into account that I am mostly hungry in the evening and that eating early, especially sweet or carb-y stuff, seems to spike me a lot and makes me very hungry the rest of the day. So I try to eat breakfast and lunch later, and I'm currently working on delaying it until even later. All of this is daily.
- drink water (3 bottles)
- breakfast after 10 am
- lunch past 1 pm

I don't always feel good enough physically to fully commit to a routine for weeks or months, so this is basically a platter to pick and choose from each day. Some days, I do all. Some, I only do one or none.
- go for a walk 20+ mins
- indoor cycling

This is for stuff that gets me into the flow, or meditative stuff. Also daily!

This is also a daily goal, but only holds one at the moment: "Do one thing that makes me happy". It's very vague on purpose, and I count a lot of things based on the day. It gets me to go through my day and see what good things happened, practice gratitude. I check if I have treated myself well, and see if there's maybe something I'd like to do for myself.

Reminders for myself. Very helpful for my chronic illness stuff! It can be hard to see rest as something productive and needed, instead of just something that holds me back. It also helps me see small good things and wins I had that day that otherwise, I would have just forgotten or downplayed again. So I get these three daily tasks:
- read a simple affirmation for myself (tapping this launches the affirmation part of the app, where I can skip through ones and find one I need for the day)
- give myself permission to rest (this one changed a lot of how I see breaks in my fitness plans!)
- name one small success from today

Still working on perfecting my sleep schedule and quality. Daily goal:
- avoid caffeine after lunch (usually, I treat this as noon, because I usually have lunch later)
- go to bed at 22:00

Reminders to take some stuff. Only my injection is scheduled for every two weeks.
- Supplements daily (a general one, my extra iron stuff, Vit D during the winter)
- Endovelle daily in the evening
- Minoxidil twice daily
- Injection every two weeks on Friday

Haven't had this category for long yet! But my hair is longer now and I take great care regrowing it, together with other things I want to focus more on.
- hair oiling on Sunday
- monthly teeth bleaching

I don't put my usual skin care in there, because it's so embedded into my routine and easy to think of that I don't need it to be in there. I love that I don't have to just do the very productive or exhausting stuff; I can just do enough. Sometimes, selfcare is all you can manage, or you procrastinate on hard stuff but do lots of other things. That should still be rewarded, and you're still making progress. I feel like this setup finally acknowledges that for me. It's not a stressor anymore, just a wide selection of things I get to do, and even self-kindness and rest count. Most days, I don't do all of these, and it's not even an expectation. I'm just happy to see that I did stuff at all, and have an easy list of things that I can go through and see "Oh yes, that fits my mood and energy right now." and feel like I'm making progress even by resting or affirming or acknowledging small wins.

Reply via email Published 22 Mar, 2026

0 views
Ahead of AI Yesterday

A Visual Guide to Attention Variants in Modern LLMs

I had originally planned to write about DeepSeek V4. Since it still hasn’t been released, I used the time to work on something that had been on my list for a while, namely, collecting, organizing, and refining the different LLM architectures I have covered over the past few years. So, over the last two weeks, I turned that effort into an LLM architecture gallery (with 45 entries at the time of this writing), which combines material from earlier articles with several important architectures I had not documented yet. Each entry comes with a visual model card, and I plan to keep the gallery updated regularly. You can find the gallery here: https://sebastianraschka.com/llm-architecture-gallery/ Figure 1: Overview of the LLM architecture gallery and its visual model cards.

After I shared the initial version, a few readers also asked whether there would be a poster version. So, there is now a poster version via Redbubble . I ordered the Medium size (26.9 x 23.4 in) to check how it looks in print, and the result is sharp and clear. That said, some of the smallest text elements are already quite small at that size, so I would not recommend the smaller versions if you intend to have everything readable. Figure 2: Poster version of the architecture gallery with some random objects for scale.

Alongside the gallery, I was/am also working on short explainers for a few core LLM concepts. So, in this article, I thought it would be interesting to recap all the recent attention variants that have been developed and used in prominent open-weight architectures in recent years. My goal is to make the collection useful both as a reference and as a lightweight learning resource. I hope you find it useful and educational!

1. Multi-Head Attention (MHA)

Self-attention lets each token look at the other visible tokens in the sequence, assign them weights, and use those weights to build a new context-aware representation of the input. Multi-head attention (MHA) is the standard transformer version of that idea. It runs several self-attention heads in parallel with different learned projections, then combines their outputs into one richer representation. Figure 3: Olmo 2 as an example architecture using MHA. The sections below start with a whirlwind tour of self-attention as a way of explaining MHA. It’s meant more as a quick overview to set the stage for related attention concepts like grouped-query attention, sliding window attention, and so on. If you are interested in longer, more detailed coverage of self-attention, you might like my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article.

EXAMPLE ARCHITECTURES GPT-2 , OLMo 2 7B , and OLMo 3 7B

1.2 Historical Tidbits And Why Attention Was Invented

Attention predates transformers and MHA. Its immediate background is encoder-decoder RNNs for translation. In those older systems, an encoder RNN would read the source sentence token by token and compress it into a sequence of hidden states, or in the simplest version into one final state. Then the decoder RNN had to generate the target sentence from that limited summary. This worked for short and simple cases, but it created an obvious bottleneck once the relevant information for the next output word lived somewhere else in the input sentence. In short, the limitation is that the hidden state can’t store infinitely much information or context, and sometimes it would be useful to just refer back to the full input sequence. The translation example below shows one of the limitations of this idea. 
For instance, a sentence can preserve many locally reasonable word choices and still fail as a translation when the model treats the problem too much like a word-by-word mapping. (The top panel shows an exaggerated example where we translate the sentence word by word; obviously, the grammar in the resulting sentence is wrong.) In reality, the correct next word depends on sentence-level structure and on which earlier source words matter at that step. Of course, this could still be translated fine with an RNN, but it would struggle with longer sequences or knowledge retrieval tasks because the hidden state can only store so much information as mentioned earlier. Figure 4: Translation can fail even when many individual word choices look reasonable because sentence-level structure still matters (Original source LLMs-from-scratch ).

The next figure shows that change more directly. When the decoder is producing an output token, it should not be limited to one compressed memory path. It should be able to reach back to the more relevant input tokens directly. Figure 5: Attention breaks the RNN bottleneck by letting the current output position revisit the full input sequence instead of relying on one compressed state alone (Original source LLMs-from-scratch ).

Transformers keep that core idea from the aforementioned attention-modified RNN but remove the recurrence. In the classic Attention Is All You Need paper, attention becomes the main sequence-processing mechanism itself (instead of being just part of an RNN encoder-decoder.) In transformers, that mechanism is called self-attention, where each token in the sequence computes weights over all other tokens and uses them to mix information from those tokens into a new representation. Multi-head attention is the same mechanism run several times in parallel.

1.3 The Masked Attention Matrix

For a sequence of tokens, attention needs one row of weights per token, so overall we get a matrix. Each row answers a simple question. When updating this token, how much should each visible token matter? In a decoder-only LLM, future positions are masked out, which is why the upper-right part of the matrix is grayed out in the figure below. Self-attention is fundamentally about learning these token-to-token weight patterns, under a causal mask, and then using them to build context-aware token representations. Figure 6: A concrete masked attention matrix where each row belongs to one token, each entry is an attention weight, and future-token entries are removed by the causal mask (Original source Understanding and Coding Self-Attention ).

1.4 Self-Attention Internals

The next figure shows how the transformer computes the attention matrix (A) from the input embeddings X, which is then used to produce the transformed inputs (Z). Here Q, K, and V stand for queries, keys, and values. The query for a token represents what that token is looking for, the key represents what each token makes available for matching, and the value represents the information that gets mixed into the output once the attention weights have been computed. The steps are as follows:
- W_q, W_k, and W_v are weight matrices that project the input embeddings X into Q, K, and V
- QK^T produces the raw token-to-token relevance scores
- softmax converts those scores into the normalized attention matrix A that we discussed in the previous section
- A is applied to V to produce the output matrix Z

Note that the attention matrix A is not a separate hand-written object. It emerges from Q, K, and softmax. 
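To make these steps concrete, here is a minimal PyTorch sketch of the single-head pipeline. The dimensions and random weights are placeholders for illustration, not taken from any particular model.

```python
import torch

def causal_self_attention(X, W_q, W_k, W_v):
    # Single-head self-attention with a causal mask, following the steps above.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                       # project embeddings into Q, K, V
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5     # raw token-to-token relevance (scaled)
    T = scores.shape[-1]
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))        # causal mask: hide future tokens
    A = torch.softmax(scores, dim=-1)                         # normalized attention matrix A
    return A @ V                                              # context-aware outputs Z

torch.manual_seed(0)
X = torch.randn(6, 8)                                         # 6 tokens, embedding dim 8
W_q, W_k, W_v = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
Z = causal_self_attention(X, W_q, W_k, W_v)                   # (6, 8)
```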
Figure 7: The full single-head pipeline, from input embeddings X to the normalized attention matrix A and output representations Z (Original source Understanding and Coding Self-Attention ). The next figure shows the same concept as the previous figure but the attention matrix computation is hidden inside the “scaled-dot-product attention” box, and we perform the computation only for one input token instead of all input tokens. This is to show a compact form of self-attention with a single head before extending this to multi-head attention in the next section. Figure 8: One attention head is already a complete mechanism. One set of learned projections produces one attention matrix and one context-aware output stream (Original source Understanding and Coding Self-Attention ).

1.5 From One Head To Multi-Head Attention

One set of W_q, W_k, and W_v matrices gives us one attention head, which means one attention matrix A and one output matrix Z. (This concept was illustrated in the previous section.) Multi-head attention simply runs several of these heads in parallel with different learned projection matrices. This is useful because different heads can specialize in different token relationships. One head might focus on short local dependencies, another on broader semantic links, and another on positional or syntactic structure. Figure 9: Multi-head attention keeps the same basic attention recipe, but repeats it across several heads in parallel so the model can learn several token-to-token patterns at once (Original source Understanding and Coding Self-Attention ).

2. Grouped-Query Attention (GQA)

Grouped-query attention is an attention variant derived from standard MHA. It was introduced in the 2023 paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints by Joshua Ainslie and colleagues. Instead of giving every query head its own keys and values, it lets several query heads share the same key-value projections, which makes KV caching much cheaper (primarily as a memory reduction) without changing the overall decoder recipe very much. Figure 10: GQA keeps the same overall attention pattern as MHA, but collapses the number of key-value heads by sharing them across multiple query heads (Original source: The Big LLM Architecture Comparison ).

EXAMPLE ARCHITECTURES Dense: Llama 3 8B , Qwen3 4B , Gemma 3 27B , Mistral Small 3.1 24B , SmolLM3 3B , and Tiny Aya 3.35B . Sparse (Mixture-of-Experts): Llama 4 Maverick , Qwen3 235B-A22B , Step 3.5 Flash 196B , and Sarvam 30B .

2.1 Why GQA Became Popular

In my architecture comparison article , I framed GQA as the new standard replacement for classic multi-head attention (MHA). The reason is that standard MHA gives every head its own keys and values, which is more optimal from a modeling perspective but expensive once we have to keep all of that state in the KV cache during inference. In GQA, we keep a larger set of query heads, but we reduce the number of key-value heads and let multiple queries share them. That lowers both parameter count and KV-cache traffic without making drastic implementation changes like multi-head latent attention (MLA), which will be discussed later. In practice, that made GQA, and keeps it, a very popular choice for labs that wanted something cheaper than MHA but simpler to implement than newer compression-heavy alternatives like MLA.

2.2 GQA Memory Savings

GQA results in big savings in KV storage, since the fewer key-value heads we keep per layer, the less cached state we need per token. That is why GQA becomes more useful as sequence length grows. GQA is also a spectrum. 
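As a minimal sketch of the sharing idea (illustrative shapes only, with no causal mask or KV cache), the number of key/value heads becomes a knob that is separate from the number of query heads:

```python
import torch

def grouped_query_attention(X, n_heads=8, n_kv_heads=2, d_head=16):
    # Illustrative sketch: n_kv_heads == n_heads recovers MHA,
    # n_kv_heads == 1 is multi-query attention.
    T, d_model = X.shape
    W_q = torch.randn(d_model, n_heads * d_head)
    W_k = torch.randn(d_model, n_kv_heads * d_head)   # fewer key/value projections to cache
    W_v = torch.randn(d_model, n_kv_heads * d_head)

    Q = (X @ W_q).view(T, n_heads, d_head).transpose(0, 1)      # (n_heads, T, d_head)
    K = (X @ W_k).view(T, n_kv_heads, d_head).transpose(0, 1)   # (n_kv_heads, T, d_head)
    V = (X @ W_v).view(T, n_kv_heads, d_head).transpose(0, 1)

    group = n_heads // n_kv_heads
    K = K.repeat_interleave(group, dim=0)    # each K/V head serves `group` query heads
    V = V.repeat_interleave(group, dim=0)

    A = torch.softmax(Q @ K.transpose(-2, -1) / d_head ** 0.5, dim=-1)
    return (A @ V).transpose(0, 1).reshape(T, n_heads * d_head)

Z = grouped_query_attention(torch.randn(10, 64))   # (10, 128)
```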
If we reduce all the way down to one shared K/V group, we are effectively in multi-query attention territory, which is even cheaper but can hurt modeling quality more noticeably. The sweet spot is usually somewhere in between multi-query attention (1 shared group) and MHA (where K/V groups are equal to the number of queries), where the cache savings are large but the modeling degradation relative to MHA stays modest. Figure 11: Lower is better. Once the context window grows, KV-cache savings become more pronounced. (Original source: LLMs-from-scratch GQA materials )

2.3 Why GQA Still Matters In 2026

More advanced variants such as MLA are becoming popular because they can offer better modeling performance at the same KV efficiency levels (e.g., as discussed in the ablation studies of the DeepSeek-V2 paper ), but they also involve a more complicated implementation and a more complicated attention stack. GQA remains appealing because it is robust, easier to implement, and also easier to train (since there are fewer hyperparameter tunings necessary, based on my experience). That is why some of the newer releases still stay deliberately classic here. E.g., in my Spring Architectures article, I mentioned MiniMax M2.5 and Nanbeige 4.1 as models that remained very classic, using only grouped-query attention without piling on other efficiency tricks. Sarvam is a particularly useful comparison point as well: the 30B model keeps classic GQA, while the 105B version switches to MLA. Figure 12: Total KV cache sizes for 105B Sarvam (using MLA) versus 30B Sarvam (using GQA), versus using plain MHA.

The motivation behind Multi-head Latent Attention (MLA) is similar to Grouped-Query Attention (GQA). Both are solutions for reducing KV-cache memory requirements. The difference between GQA and MLA is that MLA shrinks the cache by compressing what gets stored rather than by reducing how many K/Vs are stored by sharing heads. Figure 13: Unlike GQA, MLA does not reduce KV cost by grouping heads. It reduces it by caching a compressed latent representation. Note that it is also applied to the query, which is not shown for simplicity (Original source: The Big LLM Architecture Comparison ). MLA, originally proposed in the DeepSeek-V2 paper, became a defining DeepSeek-era idea (especially after DeepSeek-V3 and R1). It is more complicated to implement than GQA, more complicated to serve, but nowadays also often more compelling once model size and context length get large enough that cache traffic starts to dominate, because at the same rate of memory reduction, it could maintain better modeling performance (more on that later).

EXAMPLE ARCHITECTURES DeepSeek V3 , Kimi K2 , GLM-5 , Ling 2.5 , Mistral Large 3 , and Sarvam 105B

Instead of caching full-resolution key and value tensors as in MHA and GQA, MLA stores a latent representation and reconstructs the usable state when needed. Essentially, it is a cache compression strategy embedded inside attention, as illustrated in the previous figure. The figure below shows the savings compared to regular MHA. Figure 14: Once context length grows, the savings from caching a latent representation instead of full K/V tensors become very visible (Original source: LLMs-from-scratch MLA section).

3.2 MLA Ablation Studies

The DeepSeek-V2 paper provided some ablations where GQA looked worse than MHA in terms of modeling performance, while MLA held up much better and could even outperform MHA when tuned carefully. 
That is a much stronger justification than “it (also) saves memory.” In other words, MLA is a preferable attention mechanism for DeepSeek not just because it was efficient, but because it looked like a quality-preserving efficiency move at large scale. (But colleagues also told me that MLA only works well at a certain size. For smaller models, let’s say <100B, GQA seems to work better, or, is at least easier to tune and get right.) Figure 15: GQA drops below MHA here, while MLA remains competitive and can even slightly outperform it. Underlying paper: DeepSeek-V2 . Below is again the comparison between GQA in 30B Sarvam versus MLA in 105B Sarvam. Figure 16: GQA and MLA are solving the same bottleneck from different directions. The tradeoff is simplicity versus better modeling performance for larger models. 3.3 How MLA Spread After DeepSeek Once DeepSeek V3/R1, V3.1 etc. normalized the design after its introduction in V2, it started showing up in a second wave of architectures. Kimi K2 kept the DeepSeek recipe and scaled it up. GLM-5 adopted MLA together with DeepSeek Sparse Attention (from DeepSeek V3.2). Ling 2.5 paired MLA with a linear-attention hybrid. Sarvam released two models where the 30B model stayed with classic GQA and the 105B model switched to MLA. That last pair is particularly useful as it puts the technical-complexity discussion aside. I.e., the Sarvam team implemented both variants and deliberately chose to then use GQA for one variant and MLA for the other. So, in a sense, that makes MLA feel less like a theoretical alternative and more like a concrete architectural upgrade path once a family scales up. Sliding window attention reduces the memory and compute cost of long-context inference by limiting how many previous tokens each position can attend to. Instead of attending to the entire prefix, each token only attends to a fixed window of recent tokens around its position. Because attention is restricted to a local token neighborhood, this mechanism is often referred to as local attention. Some architectures combine these local layers with occasional global attention layers so that information can still propagate across the entire sequence. Figure 17: The conceptual shift is simple. Regular attention is global attention, while sliding-window attention is local attention. Global attention lets every token see the full prefix; SWA turns many of those layers into local attention layers (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Gemma 3 27B , OLMo 3 32B , Xiaomi MiMo-V2-Flash , Arcee Trinity , Step 3.5 Flash , and Tiny Aya Gemma 3 is still one of the clearest recent SWA examples because it is easy to compare against Gemma 2. Gemma 2 already used a hybrid attention setup with a 1:1 ratio between local and global layers and a 4096-token window. Gemma 3 pushed this further to a 5:1 ratio and reduced the window size to 1024. The key finding was not that local attention is cheaper, because that was already known. Here, the more interesting takeaway from the Gemma 3 ablation study was that using this more aggressively seemed to hurt modeling performance only slightly. The Gemma ablation study suggests that the smaller window and more aggressive local:global ratio have little effect on perplexity. Underlying paper: Gemma 3 article (Original source: The Big LLM Architecture Comparison ). 4.2 The Ratio And Window Size In practice, saying that a model “uses SWA” does not mean it relies on SWA alone. 
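Before looking at the local-to-global ratios and window sizes used in practice, here is a minimal sketch of what the local mask itself looks like. This is a plain boolean causal-plus-window mask for illustration only, not any specific model's kernel.

```python
import torch

def sliding_window_mask(seq_len, window_size):
    # True marks positions a query token may attend to: causal (no future tokens)
    # and local (only the last `window_size` tokens, itself included).
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i
    local = (i - j) < window_size
    return causal & local

print(sliding_window_mask(6, 3).int())
# Each row allows only the window_size most recent tokens instead of the full
# prefix that a global attention layer would allow.
```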
What usually matters are the local-to-global layer pattern and the attention window size. For example: Gemma 3 and Xiaomi use a 5:1 local-to-global pattern. OLMo 3 and Arcee Trinity use a 3:1 pattern. Xiaomi also uses a window size of 128, which is much smaller, and therefore more aggressive, than Gemma’s 1024. SWA is essentially a knob that can be tuned more or less aggressively. Figure 18: The long-context savings come from turning many full-attention layers into local ones, which reduces how much cached context those layers need to consider (Original source: LLMs-from-scratch SWA materials ). 4.3 Combining SWA with GQA SWA often appears together with GQA because the two ideas address different parts of the same inference problem. SWA reduces how much context a local layer has to consider. GQA reduces how much key-value state each token contributes to the cache. That is why many recent dense models use both rather than treating them as alternatives. Gemma 3 is again a good reference point here, since it combines sliding window attention with grouped-query attention in the same architecture. DeepSeek Sparse Attention is one of the architectural changes that appeared in the DeepSeek V3.2 line and later showed up again in GLM-5. Specifically, DeepSeek V3.2 combines it with Multi-head Latent Attention (MLA) , and GLM-5 adopts the same pair for the same general reason, namely, reducing inference cost when context lengths get large. EXAMPLE ARCHITECTURES DeepSeek V3.2 and GLM-5 In sliding-window attention, the current token does not attend to the full prefix but only to a fixed local window. This is the same broad idea behind DeepSeek Sparse Attention, where each token also only attends to a subset of previous tokens. However, the selected tokens are not determined by a fixed-width local window. Instead, DeepSeek Sparse Attention uses a learned sparse pattern. In short, it uses an indexer-plus-selector setup, where a lightning indexer computes relevance scores, and a token selector keeps only a smaller set of high-scoring past positions. The way the subset of tokens is selected is the main difference from sliding-window attention. Sliding-window attention hard-codes locality. DeepSeek Sparse Attention still limits attention to a subset, but it lets the model decide which prior tokens are worth revisiting. Figure 19: Similar to sliding-window attention, DeepSeek Sparse Attention also restricts each token to a subset of prior tokens, but does not do so with a fixed local window (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). 5.2 DeepSeek Sparse Attention and MLA DeepSeek V3.2 uses both Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention. MLA reduces KV-cache cost by compressing what gets stored. DeepSeek Sparse Attention reduces how much of the prior context the model has to revisit. Put differently, one optimizes the cache representation, the other optimizes the attention pattern on top of it. Figure 20: DeepSeek V3.2 is the obvious reference point, because this is the model family most closely associated with the sparse-attention idea. The sparse pattern is not random. The first stage is a lightning indexer that scores previous tokens for each new query token. It uses MLA’s compressed token representations and computes a learned similarity score over the prior context, so the model can rank which earlier positions are worth revisiting. The second stage is a token selector. 
It keeps only a smaller high-scoring subset, for example, a top-k set of past positions, and turns that subset into the sparse attention mask. So the main point is that DeepSeek Sparse Attention does not hard-code the sparsity pattern. It learns which past tokens to keep. Figure 21: The mechanism consists of a lightning indexer that scores prior tokens and a selector that keeps only a smaller subset for attention (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). DeepSeek Sparse Attention is relatively new and relatively complicated to implement, which is why it has not been as widely adopted as Grouped-Query Attention (GQA) yet.

Gated attention is best understood as a modified full-attention block rather than as a separate attention family. It usually appears inside hybrid stacks that still keep an occasional full-attention layer for exact content retrieval, but add a few stability-oriented changes on top of an otherwise familiar scaled dot-product attention block. Figure 22: Trinity Large is a useful comparison because gated attention is not only a Qwen idea (more on that later). Here the gate appears after the scaled dot-product attention output and before the output projection in a different long-context architecture (Original source: A Dream of Spring for Open-Weight LLMs ).

6.1 Where Gated Attention Appears

The Qwen3-Next and Qwen3.5 architectures show that recent hybrids (covered in the next section) do not replace attention everywhere. Instead, they replace most attention layers with a cheaper alternative and keep a smaller number of full-attention layers in the stack. Those remaining full-attention layers are where gated attention typically appears. Qwen3-Next and Qwen3.5 use it together with Gated DeltaNet in a 3:1 pattern. But hybrid architectures aside, Trinity uses a related gating idea in a more conventional attention stack, as shown in the previous figure above. The gated attention block in Qwen-style hybrids or Trinity (not a hybrid) is essentially standard scaled-dot-product attention with a few changes on top. In the original Gated Attention paper , those changes are presented as a way to make the retained full-attention layers behave more predictably inside a hybrid stack. The block still looks like standard (full) attention, but it adds:
- an output gate that scales the attention result before it is added back to the residual,
- a zero-centered QK-Norm variant instead of standard RMSNorm for q and k,
- partial RoPE.
These are not changes on the scale of MLA or linear attention but merely stability and control changes applied to an otherwise familiar attention block. Figure 23: In Qwen3-Next and Qwen3.5, gated attention appears as the full-attention layer that periodically breaks up runs of Gated DeltaNet blocks. Note that the figure above also includes Gated DeltaNet, which we will cover in the next section below.

Hybrid attention is a broader design pattern rather than a specific, single mechanism. The overall idea is to keep a transformer-like stack, but replace most of the expensive full-attention layers with cheaper linear or state-space sequence modules. The motivation is long-context efficiency. Full attention grows quadratically with sequence length, so once models move to contexts like 128k, 256k, or 1M tokens, attention memory and compute become expensive enough that using cheaper sequence modules in most layers while keeping only a smaller number of heavier retrieval layers starts making more sense. 
(Note that this comes with a bit of a modeling performance trade-off, though.) In Qwen3-Next, this pattern appears as a 3:1 mix of Gated DeltaNet and Gated Attention blocks. Gated DeltaNet is also closely related to Mamba-2 (see the Gated Delta Networks: Improving Mamba2 with Delta Rule paper, for instance), and the mechanism can be read as a DeltaNet-style fast-weight update combined with Mamba-style gating. Later architectures keep the same overall idea but swap in other lightweight sequence mixers, such as Kimi Delta Attention, Lightning Attention, or standard Mamba-2. Figure 24: The basic hybrid pattern, where most blocks are cheaper sequence mixers and every fourth block restores a heavier attention layer (Original source The Big LLM Architecture Comparison ). To my knowledge, the first prominent example of a close-to-flagship LLM with hybrid attention was Qwen3-Next in 2025, which does not remove attention completely but mixes three Gated DeltaNet blocks with one Gated Attention block. Here, lightweight Gated DeltaNet blocks do most of the long-context work and keep memory growth much flatter than full attention. The heavier gated-attention layer remains because DeltaNet is less exact at content-based retrieval. Inside a Gated DeltaNet block, the model computes query, key, and value vectors together with two learned gates (α, β). Rather than forming the usual token-to-token attention matrix, it writes to a small fast-weight memory using a delta-rule update. In rough terms, the memory stores a compressed running summary of past information, while the gates control how much new information is added and how much previous state is retained. That makes Gated DeltaNet a linear-attention or recurrent-style mechanism rather than just another tweak to MHA. Relative to Mamba-2, the close connection is that both belong to the linear-time gated sequence-model family, but Gated DeltaNet uses a DeltaNet-style fast-weight memory update instead of the Mamba state-space update. Figure 25: The practical motivation behind the hybrids is shown here in the memory curve. Hybrid stacks with Gated DeltaNet grow much more slowly with context length than ordinary full attention (Original source LLMs-from-scratch DeltaNet materials ). Qwen3.5 moves the former Qwen3-Next hybrid into Qwen’s main flagship series, which is an interesting move. This basically signals that the hybrid strategy is a success and that we may see more models with this architecture in the future. Figure 26: Qwen3.5 shows the Qwen team promoting the former Qwen3-Next side-branch into the main model line rather than leaving it as a one-off efficiency variant (Original source A Dream of Spring for Open-Weight LLMs ). 7.2 Kimi Linear And Modified Delta Attention Kimi Linear keeps the same broad transformer skeleton and the same 3:1 pattern, but it changes both halves of the recipe. On the lightweight side, Kimi Delta Attention is a refinement of Gated DeltaNet. Where Qwen3-Next uses a scalar gate per head to control memory decay, Kimi uses channel-wise gating, which gives finer control over the memory update. On the heavier side, Kimi replaces Qwen3-Next’s gated-attention layers with gated MLA layers. So, it’s still the same broader pattern as in Qwen3-Next and Qwen3.5, but both ingredients (slightly) change. I.e., most layers are still handled by a cheaper linear-style mechanism, and periodic heavier layers still remain for stronger retrieval. 
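As a rough illustration of the delta-rule fast-weight idea described above, here is a heavily simplified, sequential sketch. The real Gated DeltaNet uses chunk-parallel kernels, per-head dimensions, learned gate parameterizations, and additional normalization; this only shows the core recurrence of decaying a small memory and writing a gated prediction error into it.

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    # q, k: (T, d_k), v: (T, d_v), alpha/beta: (T,) gates in [0, 1].
    # S is the fast-weight memory: a running, compressed summary of past tokens.
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(q.shape[0]):
        S = alpha[t] * S                                         # gate: how much old state to retain
        prediction = k[t] @ S                                    # what the memory currently returns for this key
        S = S + beta[t] * torch.outer(k[t], v[t] - prediction)   # delta-rule write of the prediction error
        outputs.append(q[t] @ S)                                 # read the memory with the query
    return torch.stack(outputs)                                  # (T, d_v)

T, d_k, d_v = 16, 8, 8
out = gated_delta_rule(torch.randn(T, d_k), torch.randn(T, d_k),
                       torch.randn(T, d_v), torch.rand(T), torch.rand(T))
```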
Figure 27: Kimi Linear keeps the same overall hybrid pattern while changing both the lightweight side and the heavier attention side of the stack (Original source The Big LLM Architecture Comparison ). 7.3 Ling 2.5 And Lightning Attention Ling 2.5 shows another swap on the lightweight side. Instead of Gated DeltaNet, Ling uses a slightly simpler recurrent linear attention variant called Lightning Attention. On the heavier side, it keeps MLA from DeepSeek. Most sequence mixing happens in the cheaper linear-attention blocks, while a smaller number of heavier layers remain to preserve stronger retrieval. The difference is that the specific lightweight mechanism is now Lightning Attention rather than DeltaNet or Kimi Delta Attention. Figure 28: Ling 2.5 and Qwen3.5 are both linear-attention hybrids, even though Ling swaps in Lightning Attention and MLA instead of the Qwen recipe (Original source A Dream of Spring for Open-Weight LLMs ). Ling 2.5 is aimed more at long-context efficiency than at absolute benchmark leadership. According to the Ling team, it was reported as substantially faster than Kimi K2 at 32k tokens, which is the practical payoff these hybrids are aiming for. Figure 29: Ling 2.5 was presented as a strong efficiency upgrade, with much higher 32k-token throughput than Kimi K2 at the same 1-trillion-parameter scale (Original source Ling 2.5 model hub page ). Nemotron And Mamba-2 Nemotron pushes the pattern further away from the transformer baseline. Nemotron 3 Nano is a Mamba-Transformer hybrid that interleaves Mamba-2 sequence-modeling blocks with sparse MoE layers and uses self-attention only in a small subset of layers. This is a more extreme version of the same basic tradeoff discussed above. Here, the lightweight sequence module is a Mamba-2 state-space block rather than a DeltaNet-style fast-weight update, but the basic tradeoff is similar. Figure 30: Nemotron 3 Nano uses Mamba-2 for most of the sequence modeling work, with self-attention only appearing in a small subset of layers (Original source The Big LLM Architecture Comparison ). The larger Nemotron 3 Super keeps the Mamba-2 hybrid attention approach and adds other efficiency-oriented changes such as latent MoE and shared-weight multi-token prediction (MTP) for speculative decoding. Figure 31: Nemotron 3 Super keeps the Mamba-2 hybrid attention pattern while adding latent MoE and shared-weight MTP on top (Original source The Big LLM Architecture Comparison ). Conclusion Of course, there are many more (mostly niche) attention variants throughout the literature that I haven’t covered here. The focus of this article was on those that are currently used in state-of-the-art (open-weight) models. In particular, I am looking forward to (1) seeing the brand new Mamba-3 layers getting integrated into the aforementioned hybrid architectures (replacing Gated DeltaNet) and (2) attention residuals being used in general. In practice, you may also wonder what the “best” architecture is at the moment. This is hard to answer, as there are no public experiments that train different architectures on the same training data etc. Hence, we can currently only answer what the best (trained) model choice is for a given problem. In my opinion, hybrid architectures are still a novelty, and the main selling point is mainly (long-context) efficiency versus just modeling performance. Hence, I think they are a great candidate for agent contexts (like OpenClaw). 
Personally, I think the problem with hybrid architectures is also that the inference stacks are not quite as optimized yet, and I find that I get better tok/sec throughput when running LLMs locally using more classic setups like GPT-OSS with grouped-query attention. Anyways, I am curious to see what DeepSeek V4 has in store, since DeepSeek has been quite the reliable trend-setter over the last two years.
(The top panel shows an exaggerated example where we translate the sentence word by word; obviously, the grammar in the resulting sentence is wrong.) In reality, the correct next word depends on sentence-level structure and on which earlier source words matter at that step. Of course, this could still be translated fine with an RNN, but it would struggle with longer sequences or knowledge retrieval tasks because the hidden state can only store so much information, as mentioned earlier. Figure 4: Translation can fail even when many individual word choices look reasonable because sentence-level structure still matters (Original source LLMs-from-scratch ). The next figure shows that change more directly. When the decoder is producing an output token, it should not be limited to one compressed memory path. It should be able to reach back to the more relevant input tokens directly. Figure 5: Attention breaks the RNN bottleneck by letting the current output position revisit the full input sequence instead of relying on one compressed state alone (Original source LLMs-from-scratch ). Transformers keep that core idea from the aforementioned attention-modified RNN but remove the recurrence. In the classic Attention Is All You Need paper, attention becomes the main sequence-processing mechanism itself (instead of being just part of an RNN encoder-decoder). In transformers, that mechanism is called self-attention, where each token in the sequence computes weights over all other tokens and uses them to mix information from those tokens into a new representation. Multi-head attention is the same mechanism run several times in parallel. 1.3 The Masked Attention Matrix For a sequence of tokens, attention needs one row of weights per token, so overall we get a matrix. Each row answers a simple question: when updating this token, how much should each visible token matter? In a decoder-only LLM, future positions are masked out, which is why the upper-right part of the matrix is grayed out in the figure below. Self-attention is fundamentally about learning these token-to-token weight patterns, under a causal mask, and then using them to build context-aware token representations. Figure 6: A concrete masked attention matrix where each row belongs to one token, each entry is an attention weight, and future-token entries are removed by the causal mask (Original source Understanding and Coding Self-Attention ). 1.4 Self-Attention Internals The next figure shows how the transformer computes the attention matrix (A) from the input embeddings X, which is then used to produce the transformed inputs (Z). Here Q, K, and V stand for queries, keys, and values. The query for a token represents what that token is looking for, the key represents what each token makes available for matching, and the value represents the information that gets mixed into the output once the attention weights have been computed. The steps are as follows: W_q, W_k, and W_v are weight matrices that project the input embeddings X into Q, K, and V; QK^T produces the raw token-to-token relevance scores; softmax converts those scores into the normalized attention matrix A that we discussed in the previous section; and A is applied to V to produce the output matrix Z. Figure 7: The full single-head pipeline, from input embeddings X to the normalized attention matrix A and output representations Z (Original source Understanding and Coding Self-Attention ).
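To make the Figure 7 pipeline concrete, here is a minimal, self-contained sketch of one causal self-attention head in PyTorch. It is only meant to illustrate the X -> Q, K, V -> A -> Z flow described above, not to reproduce the exact code from the linked articles; the tensor sizes are made up for the example.

```python
import torch
import torch.nn as nn

class CausalSelfAttentionHead(nn.Module):
    """Minimal single-head self-attention with a causal mask (illustrative dimensions)."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)  # projects X -> Q
        self.W_k = nn.Linear(d_in, d_out, bias=False)  # projects X -> K
        self.W_v = nn.Linear(d_in, d_out, bias=False)  # projects X -> V

    def forward(self, x):
        # x: (batch, num_tokens, d_in)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)

        # Raw token-to-token relevance scores: QK^T, scaled by sqrt(head dim)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5

        # Causal mask: each token may only attend to itself and earlier tokens
        num_tokens = x.shape[1]
        mask = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

        # Normalized attention matrix A, then the context-aware outputs Z = A V
        attn = torch.softmax(scores, dim=-1)
        return attn @ v

x = torch.randn(1, 6, 32)           # 6 tokens with 32-dim embeddings (made-up sizes)
head = CausalSelfAttentionHead(32, 16)
print(head(x).shape)                # torch.Size([1, 6, 16])
```

Running several of these heads in parallel with separate projection matrices, then concatenating and mixing their outputs, gives the multi-head version discussed next.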
The next figure shows the same concept as the previous figure, but the attention matrix computation is hidden inside the "scaled-dot-product attention" box, and we perform the computation only for one input token instead of all input tokens. This is to show a compact form of self-attention with a single head before extending this to multi-head attention in the next section. Figure 8: One attention head is already a complete mechanism. One set of learned projections produces one attention matrix and one context-aware output stream (Original source Understanding and Coding Self-Attention ). 1.5 From One Head To Multi-Head Attention One set of W_q, W_k, and W_v matrices gives us one attention head, which means one attention matrix A and one output matrix Z. (This concept was illustrated in the previous section.) Multi-head attention simply runs several of these heads in parallel with different learned projection matrices. This is useful because different heads can specialize in different token relationships. One head might focus on short local dependencies, another on broader semantic links, and another on positional or syntactic structure. Figure 9: Multi-head attention keeps the same basic attention recipe, but repeats it across several heads in parallel so the model can learn several token-to-token patterns at once (Original source Understanding and Coding Self-Attention ). 2. Grouped-Query Attention (GQA) Grouped-query attention is an attention variant derived from standard MHA. It was introduced in the 2023 paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints by Joshua Ainslie and colleagues. Instead of giving every query head its own keys and values, it lets several query heads share the same key-value projections, which makes KV caching much cheaper (primarily as a memory reduction) without changing the overall decoder recipe very much. Figure 10: GQA keeps the same overall attention pattern as MHA, but collapses the number of key-value heads by sharing them across multiple query heads (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Dense: Llama 3 8B , Qwen3 4B , Gemma 3 27B , Mistral Small 3.1 24B , SmolLM3 3B , and Tiny Aya 3.35B . Sparse (Mixture-of-Experts): Llama 4 Maverick , Qwen3 235B-A22B , Step 3.5 Flash 196B , and Sarvam 30B . 2.1 Why GQA Became Popular In my architecture comparison article , I framed GQA as the new standard replacement for classic multi-head attention (MHA). The reason is that standard MHA gives every head its own keys and values, which is better from a modeling perspective but expensive once we have to keep all of that state in the KV cache during inference. In GQA, we keep a larger set of query heads, but we reduce the number of key-value heads and let multiple queries share them. That lowers both parameter count and KV-cache traffic without making drastic implementation changes like multi-head latent attention (MLA), which will be discussed later. In practice, that made it, and keeps it, a very popular choice for labs that want something cheaper than MHA but simpler to implement than newer compression-heavy alternatives like MLA. 2.2 GQA Memory Savings GQA results in big savings in KV storage, since the fewer key-value heads we keep per layer, the less cached state we need per token. That is why GQA becomes more useful as sequence length grows. GQA is also a spectrum.
If we reduce all the way down to one shared K/V group, we are effectively in multi-query attention territory, which is even cheaper but can hurt modeling quality more noticeably. The sweet spot is usually somewhere between multi-query attention (one shared group) and MHA (where the number of K/V groups equals the number of query heads), where the cache savings are large but the modeling degradation relative to MHA stays modest. Figure 11: Lower is better. Once the context window grows, KV-cache savings become more pronounced. (Original source: LLMs-from-scratch GQA materials ) 2.3 Why GQA Still Matters In 2026 More advanced variants such as MLA are becoming popular because they can offer better modeling performance at the same KV efficiency levels (e.g., as discussed in the ablation studies of the DeepSeek-V2 paper ), but they also involve a more complicated implementation and a more complicated attention stack. GQA remains appealing because it is robust, easier to implement, and also easier to train (since there are fewer hyperparameters to tune, based on my experience). That is why some of the newer releases still stay deliberately classic here. For example, in my Spring Architectures article, I mentioned MiniMax M2.5 and Nanbeige 4.1 as models that remained very classic, using only grouped-query attention without piling on other efficiency tricks. Sarvam is a particularly useful comparison point as well: the 30B model keeps classic GQA, while the 105B version switches to MLA. Figure 12: Total KV cache sizes for 105B Sarvam (using MLA) versus 30B Sarvam (using GQA), versus using plain MHA. 3. Multi-Head Latent Attention (MLA) The motivation behind Multi-head Latent Attention (MLA) is similar to Grouped-Query Attention (GQA). Both are solutions for reducing KV-cache memory requirements. The difference is that MLA shrinks the cache by compressing what gets stored rather than by sharing heads to reduce how many K/V tensors are stored. Figure 13: Unlike GQA, MLA does not reduce KV cost by grouping heads. It reduces it by caching a compressed latent representation. Note that the compression is also applied to the query, which is not shown for simplicity (Original source: The Big LLM Architecture Comparison ). MLA, originally proposed in the DeepSeek-V2 paper, became a defining DeepSeek-era idea (especially after DeepSeek-V3 and R1). It is more complicated to implement and serve than GQA, but nowadays it is often more compelling once model size and context length get large enough that cache traffic starts to dominate, because at the same rate of memory reduction it can maintain better modeling performance (more on that later). EXAMPLE ARCHITECTURES DeepSeek V3 , Kimi K2 , GLM-5 , Ling 2.5 , Mistral Large 3 , and Sarvam 105B 3.1 Compression, Not Sharing Instead of caching full-resolution key and value tensors as in MHA and GQA, MLA stores a latent representation and reconstructs the usable state when needed. Essentially, it is a cache compression strategy embedded inside attention, as illustrated in the previous figure. The figure below shows the savings compared to regular MHA. Figure 14: Once context length grows, the savings from caching a latent representation instead of full K/V tensors become very visible (Original source: LLMs-from-scratch MLA section).
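To make the cache comparisons in Figures 11, 12, and 14 a bit more tangible, here is a rough back-of-the-envelope calculator for per-token KV-cache size under MHA, GQA, and MLA. Every number below (layer count, head counts, head dimension, latent size) is an illustrative assumption, not the configuration of any particular model, and the MLA estimate ignores details such as a separate RoPE component.

```python
# Rough per-token KV-cache size (bytes) across a layer stack, assuming fp16 (2 bytes/value).
# All hyperparameters here are made up for illustration.

BYTES = 2           # fp16
layers = 32
heads = 32          # query heads
head_dim = 128

def mha_kv_bytes():
    # Every query head caches its own K and V vectors per token.
    return layers * 2 * heads * head_dim * BYTES

def gqa_kv_bytes(kv_groups=8):
    # Only kv_groups K/V heads are cached; query heads share them.
    return layers * 2 * kv_groups * head_dim * BYTES

def mla_kv_bytes(latent_dim=512):
    # A single compressed latent vector is cached per token instead of full K/V tensors.
    return layers * latent_dim * BYTES

for name, fn in [("MHA", mha_kv_bytes),
                 ("GQA (8 groups)", gqa_kv_bytes),
                 ("MLA (512-dim latent)", mla_kv_bytes)]:
    kb_per_token = fn() / 1024
    gib_at_128k = kb_per_token * 128_000 / 1024 / 1024
    print(f"{name:>22}: {kb_per_token:8.1f} KiB per token, {gib_at_128k:6.1f} GiB at 128k context")
```

With these made-up settings, the gap between the three grows linearly with context length, which is exactly why the savings in the figures above only become dramatic once the context window gets long.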
3.2 MLA Ablation Studies The DeepSeek-V2 paper provided some ablations where GQA looked worse than MHA in terms of modeling performance, while MLA held up much better and could even outperform MHA when tuned carefully. That is a much stronger justification than "it (also) saves memory." In other words, MLA was preferable for DeepSeek not just because it is efficient, but because it looked like a quality-preserving efficiency move at large scale. (But colleagues also told me that MLA only works well at a certain size. For smaller models, let's say <100B, GQA seems to work better, or is at least easier to tune and get right.) Figure 15: GQA drops below MHA here, while MLA remains competitive and can even slightly outperform it. Underlying paper: DeepSeek-V2 . Below is again the comparison between GQA in 30B Sarvam versus MLA in 105B Sarvam. Figure 16: GQA and MLA are solving the same bottleneck from different directions. The tradeoff is simplicity versus better modeling performance for larger models. 3.3 How MLA Spread After DeepSeek Once DeepSeek V3/R1, V3.1, and so on normalized the design after its introduction in V2, it started showing up in a second wave of architectures. Kimi K2 kept the DeepSeek recipe and scaled it up. GLM-5 adopted MLA together with DeepSeek Sparse Attention (from DeepSeek V3.2). Ling 2.5 paired MLA with a linear-attention hybrid. Sarvam released two models where the 30B model stayed with classic GQA and the 105B model switched to MLA. That last pair is particularly useful because it puts the technical-complexity discussion aside: the Sarvam team implemented both variants and deliberately chose GQA for one model and MLA for the other. So, in a sense, that makes MLA feel less like a theoretical alternative and more like a concrete architectural upgrade path once a model family scales up. 4. Sliding Window Attention (SWA) Sliding window attention reduces the memory and compute cost of long-context inference by limiting how many previous tokens each position can attend to. Instead of attending to the entire prefix, each token only attends to a fixed window of recent tokens around its position. Because attention is restricted to a local token neighborhood, this mechanism is often referred to as local attention. Some architectures combine these local layers with occasional global attention layers so that information can still propagate across the entire sequence. Figure 17: The conceptual shift is simple. Regular attention is global attention, while sliding-window attention is local attention. Global attention lets every token see the full prefix; SWA turns many of those layers into local attention layers (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Gemma 3 27B , OLMo 3 32B , Xiaomi MiMo-V2-Flash , Arcee Trinity , Step 3.5 Flash , and Tiny Aya 4.1 Gemma 3 As A Reference Point Gemma 3 is still one of the clearest recent SWA examples because it is easy to compare against Gemma 2. Gemma 2 already used a hybrid attention setup with a 1:1 ratio between local and global layers and a 4096-token window. Gemma 3 pushed this further to a 5:1 ratio and reduced the window size to 1024. The key finding was not that local attention is cheaper, because that was already known. The more interesting takeaway from the Gemma 3 ablation study was that this more aggressive setup seemed to hurt modeling performance only slightly.
The Gemma ablation study suggests that the smaller window and more aggressive local:global ratio have little effect on perplexity. Underlying paper: Gemma 3 article (Original source: The Big LLM Architecture Comparison ). 4.2 The Ratio And Window Size In practice, saying that a model “uses SWA” does not mean it relies on SWA alone. What usually matters are the local-to-global layer pattern and the attention window size. For example: Gemma 3 and Xiaomi use a 5:1 local-to-global pattern. OLMo 3 and Arcee Trinity use a 3:1 pattern. Xiaomi also uses a window size of 128, which is much smaller, and therefore more aggressive, than Gemma’s 1024. Figure 18: The long-context savings come from turning many full-attention layers into local ones, which reduces how much cached context those layers need to consider (Original source: LLMs-from-scratch SWA materials ). 4.3 Combining SWA with GQA SWA often appears together with GQA because the two ideas address different parts of the same inference problem. SWA reduces how much context a local layer has to consider. GQA reduces how much key-value state each token contributes to the cache. That is why many recent dense models use both rather than treating them as alternatives. Gemma 3 is again a good reference point here, since it combines sliding window attention with grouped-query attention in the same architecture. 5. DeepSeek Sparse Attention (DSA) DeepSeek Sparse Attention is one of the architectural changes that appeared in the DeepSeek V3.2 line and later showed up again in GLM-5. Specifically, DeepSeek V3.2 combines it with Multi-head Latent Attention (MLA) , and GLM-5 adopts the same pair for the same general reason, namely, reducing inference cost when context lengths get large. EXAMPLE ARCHITECTURES DeepSeek V3.2 and GLM-5 5.1 Changes Relative To Sliding-Window Attention In sliding-window attention, the current token does not attend to the full prefix but only to a fixed local window. This is the same broad idea behind DeepSeek Sparse Attention, where each token also only attends to a subset of previous tokens. However, the selected tokens are not determined by a fixed-width local window. Instead, DeepSeek Sparse Attention uses a learned sparse pattern. In short, it uses an indexer-plus-selector setup, where a lightning indexer computes relevance scores, and a token selector keeps only a smaller set of high-scoring past positions. The way the subset of tokens is selected is the main difference from sliding-window attention. Sliding-window attention hard-codes locality. DeepSeek Sparse Attention still limits attention to a subset, but it lets the model decide which prior tokens are worth revisiting. Figure 19: Similar to sliding-window attention, DeepSeek Sparse Attention also restricts each token to a subset of prior tokens, but does not do so with a fixed local window (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). 5.2 DeepSeek Sparse Attention and MLA DeepSeek V3.2 uses both Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention. MLA reduces KV-cache cost by compressing what gets stored. DeepSeek Sparse Attention reduces how much of the prior context the model has to revisit. Put differently, one optimizes the cache representation, the other optimizes the attention pattern on top of it. Figure 20: DeepSeek V3.2 is the obvious reference point, because this is the model family most closely associated with the sparse-attention idea. The sparse pattern is not random. 
The first stage is a lightning indexer that scores previous tokens for each new query token. It uses MLA's compressed token representations and computes a learned similarity score over the prior context, so the model can rank which earlier positions are worth revisiting. The second stage is a token selector. It keeps only a smaller high-scoring subset, for example, a top-k set of past positions, and turns that subset into the sparse attention mask. So the main point is that DeepSeek Sparse Attention does not hard-code the sparsity pattern. It learns which past tokens to keep. Figure 21: The mechanism consists of a lightning indexer that scores prior tokens and a selector that keeps only a smaller subset for attention (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). DeepSeek Sparse Attention is relatively new and relatively complicated to implement, which is why it has not yet been adopted as widely as Grouped-Query Attention (GQA). 6. Gated Attention Gated attention is best understood as a modified full-attention block rather than as a separate attention family. It usually appears inside hybrid stacks that still keep an occasional full-attention layer for exact content retrieval, but add a few stability-oriented changes on top of an otherwise familiar scaled dot-product attention block. Figure 22: Trinity Large is a useful comparison because gated attention is not only a Qwen idea (more on that later). Here the gate appears after the scaled dot-product attention output and before the output projection in a different long-context architecture (Original source: A Dream of Spring for Open-Weight LLMs ). 6.1 Where Gated Attention Appears The Qwen3-Next and Qwen3.5 architectures show that recent hybrids (covered in the next section) do not replace attention everywhere. Instead, they replace most attention layers with a cheaper alternative and keep a smaller number of full-attention layers in the stack. Those remaining full-attention layers are where gated attention typically appears. Qwen3-Next and Qwen3.5 use it together with Gated DeltaNet in a 3:1 pattern. But hybrid architectures aside, Trinity uses a related gating idea in a more conventional attention stack, as shown in the previous figure. 6.2 Gated Attention Relative To Standard Attention The gated attention block in Qwen-style hybrids or Trinity (not a hybrid) is essentially standard scaled-dot-product attention with a few changes on top. In the original Gated Attention paper , those changes are presented as a way to make the retained full-attention layers behave more predictably inside a hybrid stack. The block still looks like standard (full) attention, but it adds: an output gate that scales the attention result before it is added back to the residual, a zero-centered QK-Norm variant instead of standard RMSNorm for the queries and keys, and partial RoPE.
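As a rough illustration of the output-gate idea from 6.2, here is a minimal PyTorch sketch. It only shows the gating step (a sigmoid gate computed from the block input and applied elementwise to the attention output before the output projection and residual add); the zero-centered QK-Norm and partial-RoPE details are omitted, and the exact placement and parameterization in Qwen3-Next or Trinity may differ from this simplified version.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Sketch of an output gate on top of a standard attention block (details simplified)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.gate_proj = nn.Linear(d_model, d_model)   # produces the per-feature gate from x
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        # Standard scaled dot-product attention (a causal mask would be passed by the caller)
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)

        # Output gate: scale the attention result before the output projection / residual add
        gate = torch.sigmoid(self.gate_proj(x))
        return x + self.out_proj(gate * attn_out)

x = torch.randn(1, 6, 64)                       # made-up sizes: 6 tokens, d_model = 64
block = GatedAttentionOutput(d_model=64, num_heads=4)
print(block(x).shape)                           # torch.Size([1, 6, 64])
```

The point of the gate is simply that the block can learn to damp its own output, which is one of the stability-oriented tweaks applied to the few full-attention layers that remain in these hybrid stacks.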

0 views

NTLM and SMB go opt-in

The NTLM authentication method was always a beast. It is a proprietary protocol designed by Microsoft which was reverse engineered a long time ago. That effort resulted in the online documentation that I based the curl implementation on back in 2003. I then also wrote the NTLM code for wget while at it. NTLM broke with the HTTP paradigm: it is made to authenticate the connection instead of the request , which is what HTTP authentication is supposed to do and what all the other methods do. This might sound like a tiny and insignificant detail, but it has a major impact in all HTTP implementations everywhere. Indirectly it is also the cause for quite a few security related issues in HTTP code, because NTLM needs many special exceptions and extra unique treatments. curl has recorded no less than seven past security vulnerabilities in NTLM related code! While that may not be only NTLM’s fault, it certainly does not help. The connection-based concept also makes the method incompatible with HTTP/2 and HTTP/3. NTLM requires services to stick to HTTP/1. NTLM (v1) uses super weak cryptographic algorithms (DES and MD5), which makes it a bad choice even when disregarding the other reasons. We are slowly deprecating NTLM in curl, but we are starting out by making it opt-in. Starting in curl 8.20.0, NTLM is disabled by default in the build unless specifically enabled. Microsoft themselves have deprecated NTLM already. The wget project looks like it is about to make their NTLM support opt-in. curl only supports SMB version 1. This protocol uses NTLM for the authentication and it is equally bad in this protocol. Without NTLM enabled in the build, SMB support will also get disabled. But also: SMBv1 is in itself a weak protocol that is barely used by curl users, so this protocol is also opt-in starting in curl 8.20.0. You need to explicitly enable it in the build to get it added. I want to emphasize that we have not removed support for these ancient protocols, we just strongly discourage using them and I believe this is a first step down the ladder that in a future will make them get removed completely.

0 views
David Bushell Yesterday

RSS Club #006: Burnout

This is an RSS-only post, which I like to do sporadically! Thank you for subscribing :) Am I burning out? Let me know what you think, internet doctors. I work a four day week and I have done so for many years. Fridays are mine to have fun. By fun I mean making my own websites without the pressure of clients. That helps me wind down. When the weekend arrives my mind is already stress free. At least it was! I’ve been struggling more than usual lately. My watch monitors heart rate, steps, sleep etc. It has started to report a lower than average “body battery” — that’s what Garmin has trademarked to say: “sir, you look like shit.” A major factor here is definitely a hamstring tear that has kept me from running. Not long ago I was doing half-marathons every other week. Now I can only manage a light 5k or risk prolonged injury. Being stuck inside isn’t helping my mental or physical health. Hopefully before summer I’ll have recovered. But there is more I reckon. I’m fed up. Everything makes me grouchy. Is it too simple to say that the web industry, and tech at large, has lost its collective marbles? Not a week goes by where I don’t mute a word on social media, or unsubscribe from a blog. Everyone is talking nonsense. Everyone is grifting. It never used to be this way. What depresses me most though is how negative my own blog can be on occassions. Part of me wants to defend my career. To call out the ludicrous stuff that is said and done these days. I’m not worried about upsetting people. The clients that hire me don’t care that I dared mock an industry influencer or challenged one of the old boys’ club. I try to do that in a joking way but my tone has always been blunt. That has gotten me into a wee bit of trouble before. Lately though, I can’t help but feel I’ve been looking for trouble. Is it even possible to ‘fight back’ in a positive way? I’m not just talking about “AI” bollocks, I mean the general enshittifcation of the web industry and tech at large. The hot drama and spicy takes are great for clickbait and like-farming. I’ve been too guilty of that. Even though I know for a fact that my most popular posts, over the long run, are topics like: Multiple Accounts and SSH Keys . That got zero attention the day I published it but I’ve received random “thank you” emails every year since. Thing is though, I actually do get “thank you” emails for my stance against AI. There are a lot of developers who aren’t in a position to speak their mind. I don’t blame anyone for staying quiet when their job is on the line. I’m lucky I am my own boss. I’ve always blogged primarily for myself. That’s the secret to blogging I think. Regardless, after so many years I have the power to reach a significant audience. I feel somewhat obliged to do something with that. I’m just not sure I’m venting my frustrations in the right way. Maybe I am burning out and it’s affecting my judgement? I’m genuinely curious. Send me an email: [email protected] Are you burning out? Am I burning out? Or is the industry burning down around us? Feedback is always welcome. I can take criticism. I’ve received some absolute scorchers from anonymous cowards recently. I wish I could share those but I do respect my privacy policy . (That’s not an invitation for hate!) Thanks for reading! Follow me on Mastodon and Bluesky . Subscribe to my Blog and Notes or Combined feeds.

0 views
ava's blog Yesterday

how to properly ask for help

I’ve been noticing more disregard for a more respectful way to ask for help recently, both in private, at work, and between strangers online. It seems like a growing group of people is comfortable with just barking words at other people to receive answers. No please, no thank you, no further explanations and no attempt to first solve it on their own. I don’t know if this is some sort of effect search engines and LLMs have, but either way: Here’s how you can do better. You message your friend, a coworker or a stranger “ My printer won’t print. ” Now you have to wait until they see it and have time to respond. That could be hours or days. Then when they get back to you, they have to establish some context first. “ Okay. Have you tried turning it off and back on again? Are your drivers up to date? ” Now they have to wait for you to answer again. What usually happens now is further slowing down the actual resolution. “ Yes I already tried all that. ” This can go back and forth for ages , just dragging on about what you did or didn’t do, and wastes both your time and the other person’s time. It’s disrespectful to make the other person do all the work of getting the right info out of you, and put together a detailed guide, just to be shot down with “Already did that.” So before you reach out to another person, use the tools available to you, depending on the problem. If you have exhausted all options and tried all the suggestions, then reach out. You might not even need to do that, and solving it on your own this way was faster than just involving someone else from the getgo! A respectful message would be: “ Hey, I’m having issues with my printer, can you help me? It’s a [model number] and I have consulted [resources] and tried [everything you actually tried], but still no luck. Do you have any other ideas? Thank you. ” This is polite, not commanding or imposing, and it gives the other person all relevant information that makes helping you easier and faster. Instead of dragging every piece of information out of you and each having to wait for a response, they can immediately research the model, and focus on the things you haven’t tried yet, and find other resources. This is respectful of the other person’s time and efforts, and this way, they are more inclined to help you in the future. It’s not only about tech support or a defective device; apply it to other situations as well. It shouldn’t need to be said, but of course, it’s okay to ask “ What’s dirtbiking? ” when someone brought up they like to do dirtbiking in conversation, even if you could research it yourself. That’s normal bonding and socializing, and you wanna hear it from them and find out more about how they do it or why they like it. It’s also okay to ask someone what their opinion or stance is on something, or whether they have recommendations for something. Of course you could also find opinions and recommendations online, but this is obviously about valuing this exact person’s opinion and insight, which you will not find online. I’m sure the other person is delighted to be asked and get to tell you something about that topic. I hope this is a worthwhile reminder; send it to people who do this, hang up a version of this at your workplace, whatever. It’s okay to need help, it’s okay to not know something, but you need to go about this the right way and remember some etiquette. Otherwise, people will think you are just too lazy, difficult to work with, and weaponize your incompetence just so someone else does it for you. 
Check the manufacturer website, check the manual, or check if the manual is available online; check FAQs and similar informational pages. Use a search engine. Check a wiki, search the problem + 'reddit' to find a relevant Reddit thread, check if YouTube has a video on how to solve the problem. Ask an LLM. Know what the problem is or what topic you wanna know more about. Make sure you use the correct words and names, and you are specific. For example: don't just ask your coworker to help with "that one database" when you all use multiple. Exhaust your options first. Give the other person as much information as possible.
Reply via email Published 22 Mar, 2026

0 views
baby steps Yesterday

Maximally minimal view types, a follow-up

A short post to catalog two interesting suggestions that came in from my previous post, and some other related musings. It was suggested to me via email that we could use to eliminate the syntax ambiguity: Conceivably we could do this for the type, like: and in position: I have to sit with it but…I kinda like it? I'll use it in the next example to try it on for size. In my post I said that if you have a public method whose type references private fields, you would not be able to call it from another scope: The error arises from desugaring to a call that references private fields: I proposed we could lint to avoid this situation. But an alternative was proposed where we would say that, when we introduce an auto-ref, if the callee references local variables not visible from this point in the program, we just borrow the entire struct rather than borrowing specific fields. So then we would desugar to: If we then say that is coercible to a , then the call would be legal. Interestingly, the autoderef loop already considers visibility: if you do , we will deref until we see a field visible to you at the current point. This raises an interesting question I did not discuss. What happens when you write a value of a type like ? For example, what if I do this: What I expect is that this would just swap the selected fields ( , in this case) and leave the other fields untouched. The basic idea is that a type indicates that the messages field is initialized and accessible and the other fields must be completely ignored. This represents another possible future extension. Today if you move out of a field in a struct, then you can no longer work with the value as a whole: But with selective borrowing, we could allow this, and you could even return "partially initialized" values: That'd be neat.

0 views
Ratfactor Yesterday

My home network observes bedtime with OpenBSD and pf

Another OpenBSD article. There's at least two more coming, so brace yourselves. This one is about setting up OpenBSD as a router and controlling traffic with pf (Packet Filter)...

0 views

Profiling Hacker News users based on their comments

Here's a mildly dystopian prompt I've been experimenting with recently: "Profile this user", accompanied by a copy of their last 1,000 comments on Hacker News. Obtaining those comments is easy. The Algolia Hacker News API supports listing comments sorted by date that have a specific tag, and the author of a comment is tagged there as . Here's a JSON feed of my ( ) most recent comments, for example: https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_simonw&hitsPerPage=1000 The Algolia API is served with open CORS headers, which means you can access the API from JavaScript running on any web page. Last August I had ChatGPT build me a simple tool for hitting that API for any user which fetches their comments and gives me a mobile-friendly "copy to clipboard" button. I've since tweaked it a few times with Claude. I can then paste the whole lot into any LLM - these days I mostly use Claude Opus 4.6 for this - and prompt "profile this user" to see what it can figure out. It's startlingly effective! It feels invasive to quote the profile of another user here, so I'll show you what it produces for me: This is Simon Willison — a prolific, independent software developer, blogger, and one of the most visible voices in the AI-assisted coding space. Here's a profile drawn from his comments: Professional identity: Independent developer and writer. Co-creator of Django, creator of Datasette and many other open source tools. On the board of the Python Software Foundation. Runs a heavily-trafficked personal blog at simonwillison.net with multiple content types (entries, link blog, notes, quotations). Monetizes through GitHub sponsors, ethical ads, and occasional consulting — not through any AI company payroll, a point he defends frequently. Core thesis on AI coding: He's the leading evangelist for what he calls "agentic engineering" — using coding agents (Claude Code, Codex CLI) as productivity multipliers for skilled developers. His central argument: LLMs don't replace programmers, they amplify existing expertise. He's writing a guide/book on this at simonwillison.net/guides. He insists these tools are deceptively difficult to use well, and that most people who report bad results simply haven't learned the craft yet. Working style: He does most of his programming from his iPhone via Claude Code for web, often while riding BART, walking the dog, or sitting in the garden. He embraces "YOLO mode" (auto-approving agent actions) and frequently runs 2-3 agent sessions in parallel. He starts every session with "run uv run pytest" to anchor the agent in TDD. He's on the $200/month Claude Max plan. Key technical interests: Sandboxing and security (WebAssembly, Pyodide, sandbox-exec, Firecracker), SQLite (deeply), Python packaging and tooling (uv, PyPI distribution tricks), browser-in-a-browser experiments (v86, WASM Linux), and local LLM inference. He's been learning Go "by osmosis" through coding agents. Security consciousness: He coined the term "prompt injection" and the "lethal trifecta" (access to private data + exposure to untrusted input + ability to take actions). He's genuinely alarmed about OpenClaw/Clawdbot security risks and has been predicting a headline-grabbing prompt injection attack "in the next six months" for over two years — acknowledging it keeps not happening while insisting the risk is real. He references the "normalization of deviance" pattern frequently. Personality and debate style: Energetic, combative when challenged, but generally good-natured. 
He engages heavily in HN comment threads — sometimes dozens of replies in a single discussion. He's transparent about his biases and maintains a public disclosures page. He pushes back hard against what he sees as motivated reasoning from AI skeptics, but also against AGI hype ("I refer to that as the science fiction version of AI"). He gets visibly frustrated by sock-puppet accounts and bad-faith criticism. He has a self-deprecating streak ("my stupid pelican benchmark"). Recurring themes across hundreds of comments: "Two things can be true at the same time" — he holds nuanced positions. Tests are for productivity, not just quality. The November 2025 model releases (Opus 4.5, GPT-5.2) were a genuine inflection point. Code review is the biggest bottleneck in agent-assisted workflows. "Cognitive debt" is a real and unsolved problem. The best engineering practices (tests, docs, CI/CD, clean code) also make agents work better. He's deliberately trying to "teach people good software engineering while tricking them into thinking the book is about AI". Personal interests mentioned: Niche museums (runs niche-museums.com), New Zealand kākāpō parrots, cooking food from countries he's never visited, chickens in his garden. Lives in Half Moon Bay, CA near a major pelican roost. The pelican-on-a-bicycle SVG test is his signature LLM benchmark. In a sentence: A deeply experienced, independently-minded developer who's genuinely excited about AI coding tools, fights hard against both uncritical hype and reflexive dismissal, and is trying to professionalize how the industry uses these tools — while also worrying loudly about the security implications almost nobody else takes seriously enough. This all checks out! I ran this in Claude incognito mode to hopefully prevent Claude from guessing that I was evaluating myself and sycophantically glazing me - the tone of the response it gave here is similar to the tone I've seen against other accounts. I expect it guessed my real name due to my habit of linking to my own writing from some of my comments, which provides plenty of simonwillison.net URLs for it to associate with my public persona. I haven't seen it take a guess at a real name for any of the other profiles I've generated. It's a little creepy to be able to derive this much information about someone so easily, even when they've shared that freely in a public (and API-available) place. I mainly use this to check that I'm not getting embroiled in an extensive argument with someone who has a history of arguing in bad faith. Thankfully that's rarely the case - Hacker News continues to be a responsibly moderated online space.
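As a quick illustration of the comment-fetching step described at the top of this post, here is a minimal Python sketch against the Algolia Hacker News endpoint quoted above. The use of the requests library and the output formatting are choices made for this example, not a description of the author's actual tool.

```python
import requests

def fetch_hn_comments(username: str, limit: int = 1000) -> list[str]:
    """Fetch a user's most recent HN comments via the Algolia search API."""
    resp = requests.get(
        "https://hn.algolia.com/api/v1/search_by_date",
        params={"tags": f"comment,author_{username}", "hitsPerPage": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # Each hit carries the comment text (as HTML) plus metadata such as timestamps.
    return [hit.get("comment_text", "") for hit in resp.json()["hits"]]

comments = fetch_hn_comments("simonw")
print(f"Fetched {len(comments)} comments")
# The concatenated text can then be pasted into an LLM with a prompt like "profile this user".
```

Because the API is served with open CORS headers, the same request also works from client-side JavaScript, which is what makes the copy-to-clipboard browser tool described above possible.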

0 views
flowtwo.io Yesterday

Fundamentals of Software Architecture

A handshake should be firm, but not overpowering. Look the person in the eye; looking away while shaking someone’s hand is a sign of disrespect, and most people will notice that. Also, don’t keep the handshake going too long. Two or three seconds are all you need. — Richards & Ford, Fundamentals of Software Architecture , Ch. 32, para. 87 I swear, I find a lot of value in reading books about software. But I take issue with the length of some of them. When I'm 600 pages into an 800 page technical book, and I'm reading something barely tangential to the book's topic, like detailed instructions on how to shake hands...I get a bit annoyed. I think it's because every author wants to make their book "the definitive reference on X", whatever X is, so they feel the need to include stuff about leadership, soft skills, etc. Technical books like this could be more approachable if they kept to a more concise topic. My two cents. Anyways, Fundamentals of Software Architecture was written by Mark Richards and Neal Ford. It's a thorough cataloguing of every popular architectural style and their pros/cons. It introduces a lot of terminology, with the goal of defining how to evaluate and explain the architectural qualities of a system—qualities like availability, coupling, fault tolerance etc. This post is mostly a summary the architectural topics covered by the book; I've added some personal commentary on system coupling and AI near the end. According to Richards and Ford, the 3 laws of software architecture are: Everything in software architecture is a trade-off Why  is more important than  how Most architecture decisions aren’t binary but rather exist on a spectrum between extremes. They added the 3rd law in the book's 2nd edition. It sorta just feels like a different way of phrasing the 1st law, but I think they're trying to highlight that any architectural decision is never "absolute", i.e. most systems don't perfectly align to any one architectural style. A system might lean towards microservices architecture but have elements of other patterns too, for example. "As I have evolved, so has my understanding of the Three Laws. You cannot be trusted with your own system architecture." — Claude For mostly my own sake, I've briefly summarized each of the architecture styles covered by the book. Just 1 or 2 sentences explaining what it is and when you should use it—I'm aiming for brevity here, like a crib sheet. Pictured: Enterprise Service Java Beans from the Neolithic era. Thought to be a tribute to Sun Microsystems It's important to understand how to define a system's boundaries. In the book, the authors define the concept of an architectural quantum which is the "smallest part of the system that runs independently". The system might be your entire microservice architecture, but if one part of it can function independently of other parts of the system, it forms its own architectural quantum. So how does an architectural quantum run independently if it has to communicate with other parts of the system? The critical part is how the communication happens—whether it's synchronous or asynchronous: The dependency turns them into a single architectural quantum. Asynchronous communication can help detangle architectural quanta because it removes that dynamic dependency — Richards & Ford, Ch. 21, para. 48 If the operation of System A requires information from System B, then it's coupled to System B and they form a single architectural quantum. 
This means that System A's characteristics are impacted by System B's characteristics. If System A needs to be fast, we must ensure System B is fast, and consistently fast. At my current company, every service is associated with a reliability tier. The service's tier determines many of its operational requirements. For instance, a tier 0 system (the highest tier) needs to be deployed in multiple regions for redundancy. It needs an on-call engineer, clearly defined SLAs, etc. But if a tier-0 system needs to retrieve data from a lower tier system as part of its operation, all of a sudden the lower tier system needs to be a tier-0 system. They become coupled. In practice, there's some nuance here. Just because you call another service via HTTP and block the current process waiting for a response doesn't mean the two services are fully coupled. As long as there's fallback functionality that doesn't constitute an error state, they needn't be considered coupled. If your service needs to be fast and the other service isn't reliably fast, you may implement a strict timeout and then fall back to some degraded functionality in the event the request times out (there's a small sketch of this pattern at the end of this post). As an example, consider a new user recommendation system being built by your company's ML team. Your tier-0 homepage rendering service can still attempt to retrieve user recommendations from this new system, but as long as you can fall back to some other functionality (like just choosing the user's recently viewed content) we don't need to group that recommendation system in with our service and its strict functional requirements. The 2nd edition of this book was published in April 2025. So of course, AI was brought up a lot. In general, the authors' stance was that AI is not an effective replacement for human architects—and they didn't seem optimistic that it could ever be. Why? Because, as we've demonstrated in this book, everything in software architecture is a trade-off. LLMs are great for understanding knowledge, but to this day, they still lack the wisdom necessary to make appropriate decisions. That wisdom includes so much context that it's much faster for the architect to solve a business problem by themselves than to teach an LLM all about the problem and its extended environment and context. The fact that we've included eight other intersections to be concerned about should be evidence enough that this is a daunting task. — Richards & Ford, Ch. 33, para. 80 While I agree that the amount of context necessary to properly make architectural decisions is hard to shove into an LLM's context window right now, I don't believe that'll be the case for long. I have a feeling the opinions in this book will become outdated quite soon. Also, despite the authors' insistence that "architecture is the stuff you can't Google or ask an LLM about", I fully believe that AI tools are indispensable for researching architectural decisions. They can explore the problem domain more completely and much faster than any human could. They can also illuminate trade-offs and nuances you might have missed. The fact that the authors never mentioned this in their statements on AI utility is a major oversight. Every job function in software development, from junior dev to CTO, should be leveraging AI tooling at this point. Like I mentioned at the start, I found FoSA to be a bit bloated. Also, the book didn't really cover what I was looking for.
I wanted a book that described more specific architectural patterns for solving common technical challenges like cache invalidation, database replication etc. Instead, it focuses exclusively on the overall system layout—how the domain boundaries are divided and what the physical topology looks like. And how to shake someone's hand properly. I also think the book tried too hard to quantify complex system characteristics. I don't find much use in assigning a 1 to 5 star rating for the "maintainability" of a "microkernel" architecture style (which is 3/5 according to the book)—simply because both the characteristic and the style itself are too vaguely defined to warrant a rating. I'm certain you could build your microkernel system to have poor maintainability OR incredible maintainability. There's too much ambiguity to extract any conclusions from these assessments. Still, in general, FoSA is an interesting book that tackles one of the more complex and less formally researched areas of software development. Architectural decisions are the hardest to make due to their consequences and trade-offs, so knowing the patterns that have worked for others is a great starting point. Finally, here's the crib sheet of architecture styles I mentioned earlier.
Layered architecture. What is it: Technically partitioned: presentation, business, persistence, and database layers for example. Typically a monolithic application with a monolithic database. Very common, especially in legacy systems. When to use it: Small, low-budget applications. But it can scale surprisingly well.
Modular monolith. What is it: Another monolithic style, i.e. a singularly deployed application. The system is divided by business domain instead of technical functionality. Domains are called "modules". Goal is to minimize communication between modules as much as possible. When to use it: If teams are domain-focused and using domain-driven development, it's a good starting architecture. Can later migrate to a distributed architecture more easily.
Pipeline architecture. What is it: Topology consists of pipes and filters. Filters perform business logic; pipes coordinate and transfer data. Systems have a unidirectional data flow; it can be monolithic or distributed. When to use it: Suitable for systems with one-way, ordered processing steps. ETL pipelines, etc.
Microkernel architecture. What is it: Topology consists of a core system (the "microkernel") and plug-ins. Plug-ins are optional and provide extensible functionality to the system. Traditionally monolithic with a single database. Plug-ins shouldn't access the database directly. When to use it: Installable desktop applications, or domains that address a wide market and require many custom rules and functionalities for each customer.
Service-based architecture. What is it: Distributed architecture with a separately deployed user interface, coarse-grained domain-centric remote services, and a monolithic database. Basically microservices but with coarser service boundaries and a single shared database, or just a few. When to use it: When the system is of significant complexity and serves a wide enough user base that the benefits of a distributed architecture outweigh the costs. Can be a stepping stone towards other distributed architectures.
Event-driven architecture. What is it: Distributed system using mostly asynchronous communication. Consists of event publishers, brokers, and processors (the services). Central communication unit is an event, as opposed to a request. When to use it: Systems that require flexible, dynamic processing that need to scale to lots of concurrent users. Applications where eventual consistency is tolerable and immediate acknowledgement isn't needed.
Space-based architecture. What is it: A complicated distributed infrastructure of scalable processing units that are supported by replicated and/or distributed caches. There is a shared "data grid" that handles data syncing between units and reading/writing from the database. This removes the database bottleneck from the system—database access isn't needed for processing requests. When to use it: Applications with very high concurrent user volume and high traffic variability, AND a low need for data consistency between users. Race conditions and data conflicts will be unavoidable in this system.
Orchestration-driven service-oriented architecture. What is it: A legacy architectural style that uses abstract service layers and operations orchestrated by a shared "enterprise service bus" which knows which services to call to complete operations. Uses generic components to increase code re-use. When to use it: If you've taken a time machine back to the 90s and you have to write enterprise software.
Microservices architecture. What is it: Domain-driven architecture that enforces strict API boundaries and minimizes coupling between domains. Duplication is favoured over re-use where possible. Each service should "do one thing" and have its own database ideally. When to use it: Systems that are highly modular and have high enough load to justify the scalability and performance benefits compared to the development and operational costs.
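To make the earlier point about timeouts and degraded fallbacks concrete (the homepage-recommendations example above), here is a minimal Python sketch. The service URL, timeout value, and fallback behaviour are all made up for illustration; the idea is simply that a strict timeout plus a non-error fallback keeps the tier-0 service and the lower-tier service out of the same architectural quantum.

```python
import requests

RECS_TIMEOUT_SECONDS = 0.2  # illustrative strict latency budget for the lower-tier service

def get_homepage_items(user_id: str) -> list[str]:
    """Try the (hypothetical) recommendation service, but never let it block the homepage."""
    try:
        resp = requests.get(
            f"https://recs.internal.example/users/{user_id}/recommendations",
            timeout=RECS_TIMEOUT_SECONDS,
        )
        resp.raise_for_status()
        return resp.json()["items"]
    except (requests.RequestException, KeyError):
        # Degraded but valid behaviour: fall back to recently viewed content instead of
        # propagating the failure, so the homepage keeps its own reliability characteristics.
        return get_recently_viewed(user_id)

def get_recently_viewed(user_id: str) -> list[str]:
    # Stand-in for a local or cached lookup that the homepage service owns itself.
    return ["recently-viewed-1", "recently-viewed-2"]

print(get_homepage_items("user-123"))
```

Whether this counts as "decoupled" in practice depends on the fallback genuinely being acceptable to the business; if an empty or stale recommendation list is an error state, the two services are still one quantum no matter how the call is wrapped.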

0 views
Rik Huijzer 2 days ago

Placeholder

This is a placeholder post that will be filled later

0 views
ava's blog 2 days ago

a love letter

I love that I am so passionate about a topic that makes me research and learn so much, that I go to conferences for, that I get newsletters and magazines about. I especially love that I feel so intensely about it that nothing could stop me from it. I’ll find ways to engage with it anyway, somehow. Nothing can ruin this for me. I don’t force myself to read or write about it, it pulls me in. I’m never too sick or too tired. I’m never satisfied about how much I know, I always want more. This drive helps me so much in having the audacity needed to actually make it. I don’t see my aspired career paths as a possible dream that could be made true under the right circumstances; I just can’t view it that way, not even if I tried. Internally, it feels like an inevitability, a fact, that I will progress and go far in this field. Ironically, that reassurance makes doing the work for it easier. I can’t know whether that prediction will become true, but even just feeling that way makes me act differently, which is increasing my surface area for opportunities and cool coincidences to happen. Instead of waiting for a sign, for permission or for things to fall into my lap, I get going. It’s the typical effect of “Just act like you belong here”, I think. I don’t hesitate or think twice before I message people in the field that I could learn from. I sign up for volunteering or apply to jobs without worrying if I’m good enough. I am not ashamed or afraid of being annoying when I contribute more, ask questions or share news that could be interesting in that space. I don’t feel impostor syndrome when I write about the topic. In my mind, I absolutely deserve to be here and be heard and considered. It just clicks, it makes sense, there is no other outcome in my mind; because either I contribute well, or I learn. There’s no other option. Something in me feels like it is all being taken care of somehow, that things will happen the way they should, and I can fully focus on the work and letting my passion carry me. I also have delusional goals on purpose: Be asked to speak at a panel, and get my own Wikipedia page one day (only once I deserve to be there for something great!). These keep me aiming higher and higher, ans have more standards for myself. I wasn’t always that way, and I’m not like this in every area of my life either. I’ve actually been insecure for most of my life, with a crippling fear of failure and preferring not to even try, and dropping everything I wasn’t immediately good at. I’d prefer not to ask than receive a no. I thought I was very annoying to others, and that everyone was so much further ahead in anything. But times change, and if you’re lucky, the right interest/hobby builds up your confidence and ability to showcase your skills with ease. With this, I feel things are just perfectly falling into place, and I’m ready and grateful for whatever happens. Everything feels like a reward, like one step closer to something. I finally, for the first time in my life, feel like I truly and thoroughly enjoy the way there, the process itself, instead of just craving the finish line. To me there is no one to compete with in a negative way, no one to measure against and feel insecure - I just see amazing people to learn from and future mentors. I see people I’d love to work with and for. I see them as proof I can do it too. I look at some things and go: This is great, but I think I can do better; and then I try exactly that, and use it as an opportunity to grow and to prove myself. 
Mostly to myself, but an audience is also nice. And it makes sense, doesn’t it? If you don’t believe in yourself, it can be very hard for others to do so and it impacts your ability to put your best foot forward. Nothing of this felt like a deliberate choice or a process I put myself through to “become better”. It just happened to me, and now I’m gladly riding the wave this special interest has given me. And I’m so proud of myself. Thank you to everyone sending encouraging emails, will respond soon! Reply via email Published 21 Mar, 2026

1 views
ava's blog 2 days ago

bad parents know / being the villain

This is for the ones with abusive parents. Bad, abusive parents know a lot more about it than you’d think. We all know the saying “ The axe forgets what the tree remembers. ” And I’m sure for some stuff, that is true. But I’ve seen when they act clueless while knowing what what happened. Sometimes, the mask slips. Things they claim never happened and that they can’t remember are suddenly mentally present. I remember a time when all the yelling and abuse over math homework allegedly never happened. That our neighbors informing my teacher and my teacher ringing our doorbell to come in after another screaming match, and finding me crying, never happened. But at a restaurant dinner in adulthood, suddenly the she says “ I ruined math for you back then. I messed that up .” Funny how that happens. The amnesia is selectively lifting sometimes, I guess. Then you cut them off and decide to end the relationship for good. They notice. Messages like “ I hope you are safe. Did something happen? ” “ Please reach out, it’s urgent. ” It’s supposed to make you respond in case something bad happened to them or another one. Then they try and catch you in front of your home. That’s when they reveal they know exactly what’s going on. They don’t really think something happened to you. If they did, or at least of they cared to put on an act, they’d say: “ Oh hey!! Thank god you are alright! I wasn’t able to reach you, I wanted to make sure you’re good and if maybe yours or my phone are broken? ” But instead they act like everything is fine, as if this is just randomly happening, like they were just randomly outside your place, running into you. “ Hey, how are you? ” With a demeanor and face as if nothing is wrong. You then say you don’t wanna talk. The mood shifts. “ Yeah I have noticed! But why? ” There’s not even any visible concern for your feelings, no discernible feeling of guilt, apology or shock that they apparently did something that was the final straw. Others would be aghast, apologetic, shocked. But here, there’s only offense and an attempt to regain control. How dare you cut me off, how dare you enforce boundaries, how dare you not tell me why or give me an option to argue - that’s what’s being communicated. There’s no genuine attempt to be sorry, to understand, to hear you out. No “ I respect your decision, but understand I will always love you and if you decide you want contact again, just reach out. You’re always welcome and you will always be my daughter. ” Just attempts to rope you into conversation, stall for time, get in your head, argue and try to invalidate your feelings about certain events. Then suddenly it’s all your fault. The relationship is bad because you aren’t giving anything, you don’t put enough effort in, you don’t want to be close. And you know what? That’s partially right. This is what happens when your child doesn’t feel comfortable around you, can’t feel like letting their guard down, feels harshly judged and shamed by you, and is scared of you. I used to be a very cuddly child. I loved my mother. Then she turned into a monster. Without the words to describe what happened and without knowing anyone else going through the same, even as a kid as young as 6, I likened it to something or someone “possessing” my mother. It felt like over night, someone else replaced my mum that looked like her, and it never got better. Later on when I was older, my dad revealed he noticed it too and begged her to get psychological help, but she refused. Even she remarked on my change in behavior. 
I remember her being mad about me no longer wanting to cuddle with her when I was a kid. I remember her angrily asking “Why are you so scared of me?! I never did anything to you!!” every now and then growing up, and I either lacked the words to say why, or I was too scared to say it, or the reasons I gave weren’t believed or respected. I was just gaslit. This never happened, this is wrong, this is just normal, you are overreacting, you’re too sensitive, this isn’t fair… I heard it all. So why explain to someone why you’re scared when that happens? All you have left is greyrocking them. They always love to make it seem like it’s all in your head, you chose this, this is your fault. As if a child would choose to be scared, choose to cry, choose to dream of being adopted into another family, choose to dream of running away, hope it was switched up in the hospital and would find its true family one day. As if the same child, but as an adult, would choose to be diagnosed with cPTSD for it, change their first name because the original one is too traumatic, and still be scared when they hear keys turn in a lock and someone arriving home. Yeah sure, I was just born defective, born to hate my parents somehow! Never mind that I wanted to reconcile so bad, gave endless chances, ignored my own needs and wants, tried to just “accept who they are”, and preferred to endlessly question if I am the problem. I went to years of therapy! Everyone I talked to, mental health professional or not, was shocked by how I grew up and said it’s not normal! And why would I lie? Because it’s so cool to pretend your parents are trash? Sure. I just live for the pity, apparently?

Yes, at some point, I changed and created distance. But I still said yes to every request to see me. I always responded to messages. I gave gifts. I reached out to ask how things were going. I agreed to spend some festivities together. I stayed in contact and agreed to meet more, just to make her more comfortable and to do my part as a daughter. I don’t wanna be a bad daughter. I didn’t wanna give up yet. It could get better, right? Maybe she ages out of it, maybe we can find ways to make it work. I do want a family, and it’s hard to cut off the only family you still have.

Further into the final conversation, the diversions start. Was this only because (insert very harmless interaction that would make you look insane if you were truly mad about that)? No, of course not. They know the relationship was bad for most of your life. They know you’re not seriously cutting them off because they cancelled on you when they were sick. It’s all a search for reasons that aren’t their fault and that make you look bad. They also remember those years, and on rare occasions they can reference some of the bullshit they did (see above), or they can at least suddenly lament all these years where the relationship was bad because of you. They just choose to switch multiple times in the same convo between pretending they believe everything was actually fine and this is out of nowhere, and knowing everything’s bad but it’s yours to fix instead of moving on. Anything but taking accountability or accepting it. Sometimes they can’t even look you in the eye during all of this. It’s like even they are afraid it’ll all just spill out, all the rot they try to ignore. I think deep down they know they fucked up, but thought you forgot or would just continue to bear it. Them acknowledging it out loud first would make it real, would be unbearable, I assume.
We’re all just expected to dance around it. Your giving up and choosing to no longer keep contact brings it all into the open, in a way. You don’t need to actually recount specific situations or times, or summarize all those years. Simply choosing to abandon them, doing the unthinkable for a child, is enough. Even they know, realistically, it can’t all be you. They have to have done something bad bad. That means it’s all real, it all happened, and it has an effect. It can, for a brief moment, no longer be denied. They lose control over it, and are forced to reckon with it, at least in that moment.

Of course they’ll run to other people for validation. Leave out info, make you look insane, moody, unreliable, and “always having been difficult”. Maybe they’re proudly telling people that if you’d ever reach out again, they’d just ignore you. You went too far, now you can never go back, etc. As if you’d ever reach back out. It’s funny that their fantasy isn’t about you reaching out again and making up and being a happy family, but about hurting you back, holding it against you, and it being their turn to refuse contact now. It’s never about love, it’s just about revenge. It’s about who gets to leave first. It’s less about having a good relationship, and more about not being seen as a bad parent by others. She has always hated losing control over the narrative: wanting to sway me into cancelling therapy, screaming at me that I make her look bad, that everything I talk about is just bullshit, and wanting to know exactly what I said in the sessions.

I planned to cut her off when I moved out. That was 8 years ago. All this time, I wanted to make it work. We had better times that gave me hope. I was scared of having no family anymore, and I felt guilty and sad imagining my mother no longer having her child. I was scared of the harassment and abuse it could cause. I couldn’t go through with it, I always delayed it. I empathized more with her than with myself, and put her needs over mine. I tried to mold myself into something she could accept, and I could always feel her disappointment. I had to keep my own wedding from her so she wouldn’t show up or guilt me into inviting her. Each meeting felt like we were two strangers on a theatre stage, acting out our roles, with zero chemistry or acting skills. It all left me drained and shamed. Ashamed, too, when she told me really bad things, like the fact that she was yet again having an affair with a married man.

But it’s over now. I can finally move on. I know I tried. I gave it enough time and chances. Now I have to be comfortable being the villain, the bad daughter, being badmouthed, and being shamed by people who have a great relationship with their parents because “You can’t do that to your parents!” Too bad they were never there to step in and say “You can’t do that to your child!”

Reply via email Published 21 Mar, 2026

0 views