Posts in Open-source (20 found)
Evan Hahn Yesterday

Notes from May 2026

My blog turned 16 this month! I did nothing to celebrate, but made some little tools and clicked some links about tech ethics. I published four little tools this month: I also did some work on Helmet, my open source project: And like every month, I wrote a few articles at Zelda Dungeon . I don’t feel I wrote anything special this month, but my colleagues put together a feature about Zelda and mental health which was very affecting! “The vast majority of tech workers, at least those who I have encountered in my many years of reporting, are not vampiric Silicon Valley tech bro caricatures [… They] both like working with tech and ultimately want to see it serve the public good.” From “They just formed the biggest tech worker union in the US. They plan to rein in AI and curb layoffs” . This “love letter to Gnutella” is both an introduction to a P2P protocol and a celebration of the culture around it. From “Affordances for me, but not for thee” : “One of the oddest parts of the AI shift is that people are much more willing to do things for LLMs that they should have been doing for human beings all along.” Accessibility, specifications, documentation, and policies are better codified now. The author calls this “dystopian”, and I agree: our motivation to do this stuff is AI or productivity, not helping our fellow human. “More importantly, whereas accessibility affordances provide new abilities for vulnerable people, an AI affordance provides new abilities for people with power. And that’s probably the heart of it.” Looking forward to being surveilled because I’m an “anti-tech extremist” . I can’t tell you how exciting it was to watch Jira add 2 + 3 . “What can I do to resist AI?” asks the AI Resist List . “Tech companies like Google, Facebook and Microsoft are ignoring data controls mandated under California law, researchers say.” “Your AI Slop Bores Me” presents an interface that looks like an LLM chatbot, but it’s entirely powered by humans. A very cute idea. I’m a very bad “image generator”, at least according to the ratings I received. I continue to be amazed by “Lest We Forget the Horrors: An Unending Catalog of Trump’s Cruelties, Collusions, Corruptions, and Crimes” . It’s so thorough. RIP to a real one: Wikinews is shutting down after 21 years . Hope you had a good May. ZIP Shrinker , a web app that shrinks ZIP files with higher compression ratios A command line tool to do (completely offline) translation Open Link in Unloaded Tab , a Firefox extension to open links without loading them png-cmp , a command line tool to compare PNG pixel data After over a year of quiet maintenance, I released version 8.2.0 with some small new features and documentation updates. In a step toward dropping GitHub, I moved the docs from a GitHub URL to helmet.js.org . “The vast majority of tech workers, at least those who I have encountered in my many years of reporting, are not vampiric Silicon Valley tech bro caricatures [… They] both like working with tech and ultimately want to see it serve the public good.” From “They just formed the biggest tech worker union in the US. They plan to rein in AI and curb layoffs” . This “love letter to Gnutella” is both an introduction to a P2P protocol and a celebration of the culture around it. From “Affordances for me, but not for thee” : “One of the oddest parts of the AI shift is that people are much more willing to do things for LLMs that they should have been doing for human beings all along.” Accessibility, specifications, documentation, and policies are better codified now. The author calls this “dystopian”, and I agree: our motivation to do this stuff is AI or productivity, not helping our fellow human. “More importantly, whereas accessibility affordances provide new abilities for vulnerable people, an AI affordance provides new abilities for people with power. And that’s probably the heart of it.” Looking forward to being surveilled because I’m an “anti-tech extremist” . I can’t tell you how exciting it was to watch Jira add 2 + 3 . “What can I do to resist AI?” asks the AI Resist List . “Tech companies like Google, Facebook and Microsoft are ignoring data controls mandated under California law, researchers say.” “Your AI Slop Bores Me” presents an interface that looks like an LLM chatbot, but it’s entirely powered by humans. A very cute idea. I’m a very bad “image generator”, at least according to the ratings I received. I continue to be amazed by “Lest We Forget the Horrors: An Unending Catalog of Trump’s Cruelties, Collusions, Corruptions, and Crimes” . It’s so thorough. RIP to a real one: Wikinews is shutting down after 21 years .

0 views
Langur Monkey 2 days ago

Langur Agent

Langur Agent is a simple, open, hackable CLI AI agent for Linux and macOS. It connects to any service providing an OpenAI-compatible endpoint. It features: The source is available in this repository . Langur Agent has been tested on Linux and macOS only. Install the agent with: Run the agent with the default session: If you need an API key to access the endpoint, put it in the file. Langur Agent looks for the file in the following locations, in order: Create the file with the API key: The agent uses to load at startup. The package reads from the environment automatically. You can also set in your shell profile. On first run, the configuration is created in . You can configure the agent interactively with the slash command. The agent works with any OpenAI-compatible endpoint, so LM Studio, Ollama, OpenWebUI, or any other service you configure. Here are the default values: Run the agent, and then you can enter your prompt. You can use the following key bindings during input: During inference, you can cancel the turn and return to the input prompt with Ctrl + c . Use to print information about the available commands, and to configure the agent interactively. Internally, Langur Agent uses sessions to separate different memory histories. Sessions are named by the user. By default, the agent uses the session. You can start in a different session (either create a new one, or restore it if it exists) with the argument: The default session’s name is , so the following two commands are equivalent: You can also list the existing sessions with : Sessions contain: For now, the configuration file is the same for all sessions. Sessions are matched by the directory name in the sessions location ( ). You can rename a session by just renaming the directory! You can enable mode for the current session with the command , or permanently in the configuration . External editor —In mode, exit INSERT mode ( Esc ), then press v to edit your prompt in an external editor (uses your or variable). There are a few commands available to use in the agent loop. You can list them with . Also, use (e.g. ) to show additional help for a command. Persistent memory follows XDG Base Directory spec in : In addition to persistent memory, the agent maintains a chat history of recent user input and assistant output pairs. This provides context that survives beyond the LLM’s context window. Here is how it works: Persistence: Configuration: Langur Agent can be easily customized and extended by adding new tools, commands, and skills. If you create a cool new tool, skill, or slash command, consider contributing it via a pull request! Create a file in or use one of the existing ones. To create a tool, create a method and decorate it with : Tools are auto-discovered on startup. The process is very similar to tools. You need to create your method, preferably in , and decorate it with . A slash command must return, in that order, , , , : Decorated commands are automatically registered, and auto-completed in the input prompt. Add a file in with YAML front matter, following the agentskills.io standard: The front matter and are parsed and shown in the skills list. The body is injected into the system prompt. session management memory management visual candy autocompletion interactive configuration Python 3.13+ for dependency management Current directory, Home directory, Alt + Enter : add a new line Enter : submit the prompt Ctrl + q : quit The input history Chat memory (see chat memory ) Notes (see session memory ) User profile (see session memory ) — user information — persistent notes (added via tool) Memory is loaded into the system prompt each turn tool adds notes during a session tool explicitly persists memory to disk Memory is auto-saved when the agent exits (interactive mode) Each user message and assistant response is stored in memory Reasoning is omitted from chat memory Automatically compacted when exceeding the configured character limit The user can trigger the compaction any time with Chat memory is attached to the system prompt on each turn The agent displays the last 10 exchanges, with long messages truncated Chat history is persisted to Automatically loaded on startup Saved after every exchange (user input or assistant response) Compacted history is also persisted to disk : a indicating if the command succeeded or failed. : an optional short status message. It is printed with or . : an optional with the Python Rich-formatted content, it is printed to the output. : an optional formatted in Markdown, it is printed to the output.

0 views
daniel.haxx.se 2 days ago

curl up 2026 summary

Getting curl developers and related enthusiasts into a single room to hang out in the real world for a whole weekend once a year is awesome. We find inspiration, we share experiences, we learn from each other and we dream and plan of future endeavors and things to work on. Seeing faces, hearing voices and watching body language help us communicate better virtually and on video calls during the rest of the year. We have gathered curl people like this annually since 2017, even if some years during Covid were “different”. To me, this is one of the best events of the year. I get to hang out and talk curl with good friends a whole weekend! The 2026 edition was held in Prague in late May and kept the general style of past events. About 25 people got into the room. We had five curl maintainers present and quite a lot of local curious minds. The curl up format is easy, casual and friendly. We do topical presentations, followed up with Q&A and discussions around the topics brought up – of course usually with reflections about curl’s role, both past and future. We live-stream and record the presentations to allow our friends who could not attend to keep up both in real-time but also after the fact. Unfortunately the tech is not always on our side so the quality sometimes is a little lacking. This year I brought an HDMI-splitter and an HDMI-to-USB device to allow us to get better recordings, but they were not working as smoothly as intended so we had to use inferior backup solutions for most of the meetup. This presentation above was the “keynote”, the introduction talk to the event. We then also recorded another nine session that are all available in the curl up 2026 playlist on YouTube. To give you all a little glimpse of what curl up is about, here’s a gallery showing some of the speakers and some scenery. Daniel Stenberg Alexandr Nedvedicky Daniel Stenberg Jim Fuller Jim Fuller Carlos Henrique Lima Melara Jim Klimov Moritz Buhl Stanislav Fort Daniel Stenberg Igor Chubin Igor Chubin Daniel Stenberg Daniel Stenberg and Frank Gevaerts All photos taken by and donated to us by an anonymous curl fan present in the room.

0 views
Andre Garzia 3 days ago

We need to own our computing experience

Originally when I talked about owning our own platform is this blog, I meant owning the stack that powers and serves the blog. Moving to your own VPS or servers or static pages in which you didn't depend on some *Blog As A Service* company such as Wordpress.com. Eventually, I [started talking about owning the workflow that empowered your blog experience](https://andregarzia.com/2026/02/building-your-own-blogging-tools-is-a-fun-journey.html) not only your posting experience but your reading experience. To that effect, I showed how I created my own blog reader and integrated that into Firefox and also my own blog editor. Recently, I think that we need to move further into owning more and more of our computing experience. The avalanche of LLM/AI based slop solutions being force fed into our lives is radicalising me towards a very specific path in which owning my own platform now needs to mean controlling my own computing experience. I been an Apple user for a very long time and have [spoken previously about my recent desire to leave the platform](https://andregarzia.com/2026/03/apple-just-lost-me.html) because of a recent decrease in quality of macOS, change in priority for Apple in regards to being an independent developer in their ecosystem, and a general feeling that I must move away from big tech. In that post, I outlined my desire to move to an [MNT Pocket Reform](https://mntre.com), [Fairphone Gen 6 with potentially Murena /e/OS](https://fairphone.com) and maybe a NAS. I already purchased the Pocket Reform and am waiting for assembly and shipment, but I changed my approach for the next two items in that list. Instead of buying a NAS, I decided first to experiment with self-hosting and homelabbing by converting an old x86 MacBook Pro into a server using [Yunohost](https://yunohost.org). That server is going surprisingly well for me and I am moving more and more of my computing to inside the house. I will eventually get a proper NAS or build one, but at the moment that server is all I need. I am even hosting my [fediverse account](https://social.soapdog.org/@soapdog) in it using [GoToSocial](https://gotosocial.org). I reckon that I will spend close to 500 pounds to get the Fairphone with /e/OS. I don't have that budget right now and am afraid of doing it blind cause I been checking the forums and it seems like WhatsApp stopped working in the last update and not all features of Halifax UK bank app are working. I don't want a switch to a deGoogled OS to prevent me from talking to my friends or using my bank. I know that sucks, but those are not easily solvable problems. Like my original plan with the NAS, I think I might be able to test the waters of e/OS/ by buying an old second-hand smartphone and installing it and seeing for myself how well it works. That will cost me much less and then if I like it enough, I can make the move to a Fairphone. So now the issue is figuring out what phone to buy on a budget of 150 pounds or less. Moving back to Linux on open hardware and to Android but deGoogled is my slow journey towards computing autonomy. Google was never worth trust, but the recent move to prevent side-loading on Android and stop showing links on their search result page, becoming a de facto slop as service engine, is something I can't really abide. Apple hypermaniacal need to control the experience of their users and milk both developers and users as much as possible reached a tiping point for me. My Macbook Air doesn't feel like mine since there are piling frictions when trying to run software that is not coming from the App Store. I'm done with that. What is left then? We need to return to a human-focused FOSS community. Not the fast turnaround LLM/AI commits into every single repo cause whoever is sponsoring this project needs it to move FAST. The best thing about the free and open source community has never been the code, but the ethos. Made by humans, to be understandable by humans, to be modifiable by humans. This crazy trend towards LLM assisted coding is removing the understandable part. Lots of commits are being generated by machine and reviewed by machines without a single person actually having read the whole thing. That will erode skills and also lead to code that is impossible to maintain cause no one has ever fully understood it. Hence why I am starting to also build my own tools. There are of course tools I depend on that are too large for me to build from scratch, goddess forbid trying to build a web browser, in those cases it is okay to use a FOSS solution like Firefox. But things that are dear to me like blogging, well I can build my own tools for that. Or epub manipulation tools, or small decentralisation apps. The more I build, the more I can be sure I can maintain it in the long run. I don't want a Web where all we do as creators is feed training models so that gigantic greedy corporations can get it all wrong and regurgitate shit to users. FAANG erected a wall inside the internet and creators are now on the outside. Fighting back is not done by creating local models, or ethical AI companies, fighting back is done by walking away and playing a different game. We can't win over Google and Apple at their own game. It is rigged. But we can play a different game in which they don't matter. For me that game is building offline-first, local-first, decentralised tools and apps for my friends and whoever else can benefit from them. Create for those around you, for those that matter. Forget web scale, think in terms of a village. Get back to Linux, deGoogle yourself if you're able to. Create FOSS and also use the tools you create. Use repairable tech if you can afford it and make sure to step out of this consumption and slop cycle the digital world has become.

0 views
Jeff Geerling 4 days ago

I patched iozone for better disk benchmarks on modern macOS

A decade ago, I settled on for disk benchmarking on all my systems. Tools like ('Flexible IO' tester) are a little more capable for raw disk performance testing, and other tools test network-scale filesystems better, but gives me an easy overview of real-world disk performance across hard drives and SSDs, and runs on Mac, Windows, and Linux (and a smattering of other OSes). It's been around since 1991 , and is still updated today—in fact, the two latest updates (version 509 and 510) contain patches I sent in to get iozone to compile on Apple Silicon Macs running newer releases of macOS.

0 views
Max Woolf 4 days ago

The mysterious Hy3 LLM is topping OpenRouter Model Rankings by a large margin

OpenRouter is a service that provides access to most LLMs with a singular API, which has become exceedingly useful as of late given the rapid cadence of new LLM releases. Due to the company’s role as an intermediary between users and the LLM APIs, OpenRouter has robust, representative data on how users interact with LLMs and it publishes this data on the AI Model Rankings page: a welcome deviation from the labs themselves which generally keep this data secret for competitive reasons. Recently, I checked the OpenRouter rankings and noticed something peculiar. Retrieved May 25, 2026. Two new models are now beating LLM darling Claude in terms of token usage and by more than 50%? I’ve heard of DeepSeek Flash V4 : it’s an open-source release from DeepSeek that is not only fast/cheap, but also performs closer to the leading LLM models at a very low cost so it’s no surprise that it’s incredibly popular. But what the heck is Hy3 preview? I’ve never heard of Hy3 or anyone talking about it. Googling it returns an announcement from Chinese megacorp Tencent about Hy3’s open-source release: the model page itself on Hugging Face is sparse and includes oddly honest benchmark results that are not favorable for the model compared to other Chinese open-source models. Coding-oriented benchmark results for Hy3 from Tencent’s Hugging Face repo. A Hacker News search for Hy3 only returned a single submission that isn’t about Hy3 , and Reddit discussion is more about the open-weights release . One Reddit thread also noted the rise of Hy3 but from May 6, when Hy3 was offered by OpenRouter for free; that free endpoint is no longer available, and therefore Hy3’s usage in the weekly rankings above is from paying users. Hy3 preview is apparently popular in domains outside of agentic coding as well. Retrieved May 25, 2026. Did I miss something? After some nonscientific testing, the model quality is indeed on par with the other Chinese models indicated and not close to models such as Claude Opus 4.7 and GPT 5.5. It’s not a magic overlooked diamond-in-the-rough, so there has to be something else at play. Fortunately, OpenRouter has the data to narrow down possible explanations, but after checking the data I became more confused. Hy3 preview is available from the OpenRouter API at a stated price of $0.066/1M tokens input which is indeed cheaper than the current top-ranked model DeepSeek V4 Flash with a stated price of $0.10/1M tokens input. Given the drastically rising cost of LLMs and coding agents, it makes sense that a cheaper model would prevail, but only if it offered similar quality and that doesn’t appear to be the case. Here’s the chart of Hy3 preview model usage over time on OpenRouter from the model page: Hy3 preview has no usage data before May 8, which implies that is the time the model switched from the free SKU to the paid SKU. Usage is also steady over time since then with the initial rankings shown in this post being several weeks after launch, showing that the usage is at least organic (or very expensive to fake) and not a one-off outlier. Of note, if you do the math on the numbers presented here, the input-token-to-output-token breakdown on LLM API calls is now 98% input , 2% output in aggregate. For the OpenRouter AI Model Rankings, there have historically been spikes by specific apps switching their default to a particular LLM, such as when Kilo Code offered Grok Code Fast 1 for free in September 2025, which rocketed it up in popularity . That does not appear to be the case here because apps only constitute a very small part of Hy3 preview’s activity. The top 5 apps accout for <1% of all activity to Hy3 preview. OpenRouter’s value proposition is the ability to automatically route a given API request to different providers: for open-weight models such as DeepSeek V4 Flash, OpenRouter lists 13 providers, but Hy3 preview only has one provider despite its open weights 1 : the Singapore-based SiliconFlow . Their usage page on OpenRouter shows that SiliconFlow had relatively little usage…until Hy3. The green area corresponds to free Hy3 usage while the blue area corresponds to paid Hy3 usage: OpenRouter does not differentiate them on mouseover which I suspect is a bug. Coincidentially that data visualization shows that usage didn’t drop drastically when Hy3 preview moved from free to paid, which in itself is interesting: if users were not getting value from the free model, they likely would have stopped using it once the costs hit their wallet. What am I missing? Am I overthinking it and the answer is really because “it’s the cheapest” and it received sufficient loss leader traction from the free period? …but is Hy3 preview actually the cheapest LLM backed by a major company on OpenRouter? While I was double-checking some assumptions, I found that OpenRouter has data that shows Hy3 preview is not the cheapest well-performing LLM available: it’s actually DeepSeek V4 Flash, but with interesting caveats. So here are a few more notes about how LLM APIs work that aren’t often discussed. LLM calls are still stateless, which means that after every turn (including user messages to the LLM asking questions), all of the tokens in the current conversation thread are reprocessed, meaning that in the case of agents, the count of input tokens increases cumulatively with each successive message and is one reason why starting new threads frequently as context fills up is encouraged for effective agent use. Reverse-chronological OpenRouter logs from one minute of Zed Agent use with DeepSeek V4 Flash selected. But even before agentic workflows, large inputs such as full PDFs bloated context similarly. As a result, most LLM providers implemented prompt caching , which reuses input tokens processed earlier in the conversation: this is a win-win that saves time/compute for the LLM provider and the savings are passed to the customer. Most LLM providers cache inputs automatically, including when accessed through OpenRouter: the disk-lightning-bolt symbol next to the cost indicates tokens were cached and the cache may not always be hit, especially if OpenRouter switches providers mid-thread. The odd API provider out is the Anthropic (Claude) API which requires paying for a cache write first for some reason. Typically, cache read costs are 10% of the input costs: this is the case for the latest models from OpenAI API , Anthropic API , and Google Gemini API . For the 13 providers that serve DeepSeek V4 Flash, cache read costs are between 20% and 50% of input cost, which makes sense as they may not have the same economies of scale. There’s one DeepSeek V4 Flash provider that’s an exception, though: That’s a 2% cache read cost! (multiply by 2, move decimal left 2 places) How are DeepSeek’s cache read prices so low? DeepSeek has implemented a new approach to KV caching starting with V4 and as the model’s creator it is positioned to best leverage its own innovations, which as mentioned the benefits are passed to the customer. The DeepSeek V4 Pro variant model, when served by DeepSeek, has a cache read cost of 0.83% ! (use a calculator for that one) Remember how I showed that 98% of LLM API costs are now input tokens, which are aggressively cached? That means the “stated” prices of LLMs are now misleading, but unusually in a pro-customer way because the effective price will be much cheaper! To counter this ambiguity, OpenRouter now has a table for effective prices on the model page, which accounts for the cost savings from cache hits. Here’s the effective pricing for DeepSeek V4 Flash via OpenRouter by provider, which is different for each provider as they have different cache read costs and cache hit rates: Retrieved May 25, 2026; these values update every hour. The prices are all over the place, but notice the second row where DeepSeek itself is the provider, which is priced at a whopping $0.018/1M input tokens! That 2% cache read really pays off. Comparing apples to apples with Hy3 preview, the effective pricing for Hy3 preview as noted on its model page from SiliconFlow (a whopping 44% cache read cost) is $0.034/1M: nearly double DeepSeek V4 Flash from DeepSeek! Of course, this is only applicable if DeepSeek is explicitly used as the provider, which some downstream OpenRouter clients/agents may not support: the OpenRouter prices match the prices directly from DeepSeek, so using a direct DeepSeek API key will work the same. There is also an elephant in the room: DeepSeek is a China-based company and some may not want—or may not legally be able—to give their payment processing information or LLM input data to a Chinese company who has set prompt training = on their OpenRouter data policy information, which is a legitimate concern. Yes, subscription-based LLM services such as Claude Code and Codex are still the best bang for your buck if you’re able to consistently exhaust the usage limits. But the super-cheap DeepSeek V4 Flash via the API doesn’t lock you into a subscription, and if you need a bit more agentic compute to finish a project, it’s cheaper than paying for extra usage from the subscription services. 2 At the least, it’s a microeconomic check against additional pricing shenanigans that will likely continue through 2026 as competition in agentic AI heats up. Overall, I still don’t understand the popularity of Hy3 preview on OpenRouter. Given the available data and analysis above, my guess is that a single large app not affiliated with Tencent is indeed using Hy3 as its data-processing backbone, and this app isn’t solely an agentic coding app. But one of the advantages of OpenRouter is that it’s low-lift to switch models and providers: it wouldn’t surprise me if DeepSeek V4 Flash gets a spike in a few weeks once people catch on to its pricing. The license for Hy3 is very restrictive in a way that could potentially prevent providers from adopting the model.  ↩︎ DeepSeek has also just announced its own coding agent platform with V4 Flash that claims to leverage their strong caching, however it’s at 50% input cost but at a significantly more expensive 20% cache read cost so its unclear if the economics are actually cheaper than just using an DeepSeek API key with another agent.  ↩︎ The license for Hy3 is very restrictive in a way that could potentially prevent providers from adopting the model.  ↩︎ DeepSeek has also just announced its own coding agent platform with V4 Flash that claims to leverage their strong caching, however it’s at 50% input cost but at a significantly more expensive 20% cache read cost so its unclear if the economics are actually cheaper than just using an DeepSeek API key with another agent.  ↩︎

0 views
daniel.haxx.se 4 days ago

The pressure

I’m doing Open Source primarily because I love it. The social aspects, the for-the-good angle and for the challenge of engineering this to work for everyone. I also do it because it is my full-time job and getting food on the table and provide for my family is not unimportant. It may come as a shock, but I am not in this game for the money or the extravagant life style. I have been working full-time on curl since 2019. For me, this typically means doing 50 hour work weeks, as I spend all days on it and then I top them off with a few more hours every late night – all days of the week, I spend all this time on curl because it is a work of love and it is both my job and my spare time hobby and no one counts my hours anyway. (And no, I do not recommend anyone else to do the same. I’m not suggesting this for others.) I consider my primary work-related mission in life to be to make curl the best transfer library and tool possible and make it qualify as a top project in Open Source, quality, performance and not the least, security. I believe we generally meet these lofty goals. I founded the curl project, I am still a lead developer in the project almost thirty years later. While I always clearly state that curl is not a one-man shop and that curl would absolutely not be what it is without my awesome curl team mates, a large part of the world still thinks of curl as my project and sometimes more or less equals curl with my person. I cannot help to take curl issues personally. When someone critiques curl, it is by extension a complaint on decisions and choices I stand by and behind – and many cases I made the calls. curl is personal to me. curl has formed my life forever. I have two kids. They were both born many years after I started working on curl and they are both adults and independent individuals now. I love them dearly. Life passes by but curl remains. We’ve had slow times and busy times. The decades pass. Later this year the curl project celebrates thirty years. We typically repeat that the number of curl installations in the world is perhaps thirty billion. Over the last years I have done numerous blog posts on the state of security reports submitted to curl. They have gradually switched over from complaints on stupid LLMs , to stupid AI slop reports , closing the bug bounty over to the current high quality chaos which for us started maybe at some point in March 2026. We have seen many spectacular security failures through the years, in Internet products, in software infrastructure and in Open Source. Every time we read about those events, we get reminded about how curl is everywhere and how we really really really do not want anything such to happen to us or our users. And we take another lap around the project, tighten every bolt a little more, add a few more checks, tests and guidelines to ideally make the curl ship ever so slightly less likely to ever leak or sink. Recently, after I pointed out that Mythos only found a single low severity problem in curl in its first scan, countless people have repeated the claim that curl is one of the most scrutinized, most reviewed, most fuzzed and most verified source codes you can imagine. Perhaps that’s true, but I just want to mention this: that’s not by mistake. That’s not an accident or a happy circumstance. That’s the result of relentless work and attention to details through decades. Software engineering done right . Iterative improvements over time that simply never ends is an effective method. This does not however mean that we don’t have bugs or that we don’t have security problems left, because we do. We have hundreds of thousands of lines of source code that is doing highly parallel networking for many protocols on all imaginable operating systems and CPU architectures – in C. So we fix the problems, patch them up and ship new releases. Over and over. Thirty billion installations world-wide means that everyone reading this blog post has curl installed multiple times in stuff they own. In phones, tablets, cars, TVs, printers, game consoles, kitchen equipment and more. Not to mention all the online digital services we use and those devices communicate with. I cannot stress the importance of curl security and I would guess that most of you agree with me. I am jealous of those projects that shipped a horrible bug at some point in the past that made the world burn for a while. They got attention and some of them then got funding and financial muscles to get them staff and hire multiple full time engineers. I sometimes think we would be better off if we also had one of those. A thirty years old project could make you think you’ve seen most things already, but we have not been in this situation before. The rate of incoming security reports is 4-5 times higher than it was in 2024 and double the speed of 2025 – meaning that on average we now get more than one report per day . The quality is way higher than ever before. The reports are typically very detailed and long. In order to manage this incoming flood of submissions, we need to make sure to handle them as soon as possible as we know there are more coming. If we don’t take care of them roughly at the same speed they arrive, the backlog just grows and having that list of potential security problems in a list that you don’t have control over takes a mental toll. I spend almost all my days right now working through the list of reported security issues that we have on Hackerone. Verify the claim, assess the importance, write a patch, figure out when the bug was introduced, understand the vulnerability, write a detailed advisory explaining the problem to the world and communicate all this with the security researcher and the rest of the curl security team. For the first time in my life, my wife voiced concerns about my work hours and my imbalanced work/life situation. I work more than I’ve done before, but the flood keeps coming. People in my surrounding, I guess reading between the lines, have asked me how I and we cope with this deluge and want to make sure we don’t burn in the process. I am concerned for my team mates. I might soon have to reduce my work hours to allow myself more breathing time. This is a never-before seen or experienced pressure on the curl project and its security team members. An avalanche of high priority work that trumps all other things in the project that is primarily mental because we certainly could ignore them all if we wanted, but we feel a responsibility, we have a conscience and we are proud about our work. We feel obliged to fix security problems in the software we have helped shipped to every device on the globe. This is personal to us. With about half the release cycle left until the pending release ships, we already have twelve confirmed vulnerabilities meaning twelve pending CVE announcements. That’s a new project record and it also means we will reach thirty published CVEs in 2026 even before half the calendar year has passed. The projected total amount of curl CVEs published through the whole year is therefore at least double this number! What help would we like? Short term it is a little late. We already have work up to our ears. I wish more companies that use and depend upon curl or libcurl in commercial software and services would chime in their part to fund us. We could then pay more developers to distribute the work load across. That would be great. Feel free to contact me to discuss how you can contribute to this. Get your employer to pay for a support contract! Fortunately we have customers who already do this, so some of us can work on curl full time. I am a pragmatic (and a bit of a cynic) and I have danced this dance for a long time already. I have no illusions that anything significant is going to change in this area even if we are in an unparalleled situation and in a tighter spot than ever before. I totally expect us to ride out this storm by ourselves. Like we are used to. We will survive. We will endure. It might just be a bit of a shaky period in the project and in the world at large as we try to maneuver our way through this. There’s a tsunami coming over us and all we can do is swim, there are no life boats for us. The curl project is not owned by a company. We are not part of any umbrella organization. This makes us a little under-powered at times, but it also gives us maximum freedom and flexibility. We act solely in the interest of making curl as good as possible for the world and curl users. Fixing bugs and problems is good. Every reported problem implies a fixed issue. curl becomes a better product. What is also a good trend: almost no one finds terrible vulnerabilities. All vulnerabilities found the last few years in curl have all been deemed severity LOW or MEDIUM. I’m not saying there won’t be any more HIGH ever, but at least they are rare. The most recent severity high curl CVE was published in October 2023. Right now we are under a little pressure. Forgive us if we are a little slow to respond sometimes. Image by Brian Merrill from Pixabay

0 views

How my minimal, memory-safe Go rsync steers clear of vulnerabilities

Back in January 2025, multiple different security researchers published a total of 6 security vulnerabilities in rsync , some of which allow arbitrary code execution and file leaks, so naturally I was wondering whether/how my gokrazy/rsync implementation was affected. Did implementing my own (compatible, but minimal) rsync in Go, a modern and memory-safe programming language, really rule out entire classes of security vulnerabilities? This deep dive article was in the making since January 2025, but was delayed because we uncovered more unpublished vulnerabilities in the process! The “Security Vulnerabilities” section now covers all 12 vulnerabilities from the January 2025 batch and the May 2026 batch. If you are running (upstream, samba) rsync in production, upgrade to version 3.4.3 or newer. If you are running gokrazy/rsync in production, upgrade to version v0.3.3 or newer. Feel free to skip over the nitty-gritty security issue details and jump directly to: For context, I blogged about rsync, how I use it, and how it works back in June 2022. See also all posts tagged “rsync” . The original motivation for writing my own rsync (back then only a server, today all directions are supported) was to provide the software packages of distri, my Linux distribution research project for fast package management , which I wanted to host on router7 , my small home Linux+Go internet router, which in turn is built on gokrazy , my Go appliance platform. I am still running multiple gokrazy/rsync servers for this original purpose, and also many others! Having rsync available as a primitive (that you can link into your Go programs!) is really nice. This article covers the following security vulnerabilities: The first batch of the vulnerabilities above was announced on the oss-security mailing list , but note that the original report has more detail compared to the oss-security summaries! The later vulnerabilities were announced via GitHub Security Advisories on the rsync project . When the checksums are read by the daemon, two different checksums are read: Most importantly, note that field is filled with bytes. always has a size of 16: rsync.h is an attacker-controlled value and can have a value up to bytes, as the next snipper shows: The problem here is that can be larger than 16 bytes, depending on the digest support the binary was compiled with: md-defines.h support is common and sets the value to 64. As a result, an attacker can write up to 48 bytes past the buffer limit. Upstream fix: The upstream fix for CVE-2024-12084 changes the field to a dynamically-allocated field, which is allocated with length, and fixes the bounds check to check against the (checksum length for this transfer’s algorithm). Can Go help prevent this? Yes: Missing or incorrect bounds checks will not result in a heap buffer overflow in Go! Instead, attempting to write out of bounds will result in a panic because the Go runtime performs bounds checks. How does gokrazy/rsync fare? gokrazy/rsync also had insufficient validation! Our issue was different, though: It wasn’t size confusion, we just were not doing any validation of the sum header at all — oops! We can confirm that the Go runtime’s bounds check triggers on an attempt to write out of bounds by changing the code like so and running the tests: As expected, the Go runtime panics with the following message: Of course, crashing the entire server is not the best failure mode, so I added the missing bounds checking to turn the panic into an error . Because of the same lack of validation as in the previous CVE-2024-12084 vulnerability, an attacker could select a checksum algorithm with short checksums (e.g. with 8 byte checksums), but then claim they were sending longer checksums (e.g. 9 bytes), making the victim leak one byte of uninitialized stack content in the response. Leaking one byte of stack content may seem benign, but as the Google Security report puts it: The first pair of vulnerabilities are a Heap Buffer Overflow and an Info Leak. When combined, they allow a client to execute arbitrary code on the machine a Rsync server is running on. The client only requires anonymous read-access to the server. The daemon matches checksums of chunks the client sent to the server against the local file contents in . Part of the function prologue is to allocate a buffer on the stack of bytes: The daemon then iterates over the checksums the client sent and generates a digest for each of the chunks and compares them to the remote digest: Notably, the number of bytes that are compared again are bytes. In this case, the comparison does not go out of bounds since can be a maximum of . However, the local buffer, not to be confused with the attacker-controlled , is a buffer on the stack that is not cleared and thus contains uninitialized stack contents. A malicious client can send a (known) checksum for a given chunk of a file, which leads to the daemon writing 8 bytes to the stack buffer . The attacker can then set to 9 bytes. The result of such a setup would be that the first 8 bytes match and an attacker-controlled 9th byte is compared with an unknown value of uninitialized stack data. An attacker can divide a file into 255 chunks and as a result leak one byte per file download. An attacker can incrementally repeat the process, either in the same connection or by resetting the connection. As a result, they can leak bytes of uninitialized stack data, which can contain pointers to Heap objects, Stack cookies, local variables and pointers to global variables and return pointers. With those pointers they can defeat ASLR. Upstream fix: There are two relevant upstream fixes: Can Go help prevent this? Yes: By design, Go initializes all variables to the zero value. Go programmers do not need to remember to explicitly initialize variables. How does gokrazy/rsync fare? gokrazy/rsync is not affected by this vulnerability: Variables are always initialized in Go. Additionally, selecting checksums other than MD4 was only introduced in protocol version 30 (gokrazy/rsync implements protocol version 27). Description: (quoting the Google Security report ) When the syncing of symbolic links is enabled, either through the or ( ) flags, a malicious server can make the client write arbitrary files outside of the destination directory. A malicious server can send the client a file list such as: Symbolic links, by default, can be absolute or contain characters such as . In practice, the client validates the file list and when it sees the entry, it will look for a directory called , otherwise it will error out. If the server sends as [both, a directory and a symbolic link], [the client] will only keep the directory entry, thus the attack requires some more details to work. In mode, which the server can enable for the client, the server sends the client multiple file lists. The deduplication of the entries happens on a per-file-list basis. As a result, a malicious server can send a client multiple file lists, where: As a result, the directory is created first and is considered a valid entry in the file list. Then, the attacker changes the type of to a symbolic link. When the server then instructs the client to create the file, it will follow the symbolic link and thus files can be created outside of the destination directory. Can Go help prevent this? No. This vulnerability is caused by a logic error: when multiple file lists are used, the merged file list needs to be re-verified. But see Defense in depth: Go’s Upstream fix: The upstream fix for CVE-2024-12087 adds the missing validation. How does gokrazy/rsync fare? gokrazy/rsync is not affected by this vulnerability: gokrazy/rsync does not implement the incremental recursion mode ( ). The trade-off here is implementation complexity vs. resource usage: the incremental recursion mode allows working with the file set in a “windowed” way, as opposed to having to scan the entire file set before any transfer can begin. See also my How does rsync work? blog post. Description: (quoting the Google Security report ) The CLI flag makes the client validate any symbolic links it receives from the server. The desired behavior is that symbolic links target can only be 1) relative to the destination directory and 2) never point outside of the destination directory. The function is responsible for validating these symbolic links. The function calculates the traversal depth of a symbolic link target, relative to its position within the destination directory. As an example, the following symbolic link is considered unsafe: As it points outside the destination directory. On the other hand, the following symbolic link is considered safe as it still points within the destination directory: This function can be bypassed as it does not consider if the destination of a symbolic link contains other symbolic links in the path. For example, take the following two symbolic links: In this case, foo would actually point outside the destination directory. However, the function assumes that is a directory and that the symbolic link is safe. Upstream fix: The upstream fix for CVE-2024-12088 makes stricter by not allowing anywhere within the path, except at the very beginning. Can Go help prevent this? No. This vulnerability is caused by a logic error: the validation function was incorrect. We could have implemented that same bug. But see Defense in depth: Go’s How does gokrazy/rsync fare? gokrazy/rsync is not vulnerable: The feature is not yet implemented in gokrazy/rsync. The rsync receiver (in client mode) did not sanitize file names provided by the rsync sender, or otherwise prevent opening files outside the destination tree. A malicious sender could instruct a receiver to compare checksums of arbitrary files outside the destination tree. By observing the receiver’s reaction to a provided one-byte checksum, a malicious sender can leak arbitrary files. When a client connects to a malicious server the server is able to leak the contents of an arbitrary file on the client’s machine. In the client will read type as well as the from the server if the server sets the appropriate flags. The flag will not be set for the client. The caller ( ) then uses the server provided values to determine a file to compare the incoming data with. In the contents of the file specified by are copied into the destination file. This can be achieved by the server sending a negative token. The server sends a checksum to compare. If they don’t match, a 0 is returned. When the return value is 0 the receiver will then send a to the generator. The generator will then write a message to the server. The server can use this as a signal to determine if the checksum they sent was correct. By starting off with a of 1 a malicious server is able to determine the contents of the target file byte by byte. Upstream fix: The upstream fix for CVE-2024-12086 prevents opening files outside the destination tree by verifying the sender-provided path. Can Go help prevent this? Yes, Go offers an API to prevent this, see Defense in depth: Go’s . How does gokrazy/rsync fare? gokrazy/rsync is not vulnerable: the fuzzy matching feature was introduced with rsync protocol version 29, but gokrazy/rsync implements protocol version 27. Description: (quoting the Red Hat Security Advisory ) A flaw was found in rsync. This vulnerability arises from a race condition during rsync’s handling of symbolic links. Rsync’s default behavior when encountering symbolic links is to skip them. If an attacker replaced a regular file with a symbolic link at the right time, it was possible to bypass the default behavior and traverse symbolic links. Depending on the privileges of the rsync process, an attacker could leak sensitive information, potentially leading to privilege escalation. Upstream fix: The upstream fix for CVE-2024-12747 changes calls in the rsync sender to use the option. The paths are not expected to be symlinks at that point in the algorithm (symlinks would be handled with ). Can Go help prevent this? Yes, Go offers an API to prevent this, see Defense in depth: Go’s . How does gokrazy/rsync fare? gokrazy/rsync was vulnerable before commit , which introduces the same mitigation that upstream rsync uses. To reproduce the issue, use the following steps: Check out gokrazy/rsync v0.2.7: Patch the code as follows to undo the fix and execute the attack: Running the test now shows that the server traversed the symlink: A surprising discovery When I shared a draft of this article with Damien Neil, member of the Go Security Team and the author of the traversal-resistant API , he pointed out: I believe the gokrazy fix for CVE-2024-12747 is insufficient. You’re calling with , but only prevents symlink traversal in the last path component. This is probably still vulnerable to replacing an earlier path component so can be redirected by symlinking to . We reported this to the rsync security contact address in April 2025. In December 2025 I learned that someone else had also independently discovered and reported this issue. Ultimately, this resulted in CVE-2026-29518, published on 2026-05-20. Description: (quoting the rsync 3.4.3 NEWS entry ) TOCTOU symlink race condition allowing local privilege escalation in daemon mode without chroot. An rsync daemon configured with is exposed to a time-of-check / time-of-use race on parent path components. A local attacker with write access to a module can replace a parent directory component with a symlink between the receiver’s check and its open(), redirecting reads (basis-file disclosure) and writes (file overwrite) outside the module. Under elevated daemon privilege this allows privilege escalation. Default is not exposed. Reach: local attacker on the daemon host, write access to a module path, daemon configured with . Upstream fix: The upstream fix for CVE-2026-29518 uses , which is similar to Go’s API. Can Go help prevent this? Yes, Go offers an API to prevent this, see Defense in depth: Go’s . How does gokrazy/rsync fare? gokrazy/rsync was vulnerable until I switched the sender and the receiver to the traversal-resistant API . Description: (quoting the GitHub Security Advisory ) Description: The receiver’s compressed-token decoder accumulated a 32-bit signed counter without overflow checking. A malicious sender can trigger an overflow that, with careful manipulation, leaks process memory contents to the attacker – environment variables, passwords, heap and library pointers – significantly weakening ASLR and facilitating further exploitation. Reach: authenticated daemon connection with compression enabled (the default for protocols >= 30 when both peers advertise it). Disabling compression on the daemon (“refuse options = compress” in rsyncd.conf) is the available workaround. Upstream fix: The upstream fix for CVE-2026-43618 introduces the missing checks. How does gokrazy/rsync fare? gokrazy/rsync is not vulnerable because it does not implement compression. See gokrazy/rsync issue #35 for details on why compression support sounds simple, but is non-trivial. Description: (quoting the GitHub Security Advisory ) The 2025 fix that added a guard in was not applied to the visually-identical block in . A malicious rsync server can drive any connecting client into a deterministic by setting in the compatibility flags, sending a flist whose first sorted entry is not a leading “.” directory (which causes to set ), then sending a transfer record with and a non- iflag word. The receiver reads and dereferences the result. On glibc x86-64 the dereferenced pointer is mmap chunk metadata that lands at an unmapped address, hence a clean ; non-glibc allocators have not been audited. Reach: any rsync client doing a normal pull from an attacker-controlled URL. Works for both rsync:// URLs and remote-shell pulls. is the protocol-30+ default; no special options are required on the victim. Workaround: on the client. Upstream fix: The upstream fix for CVE-2026-43620 adds the guard to as well. How does gokrazy/rsync fare? Just like for CVE-2024-12087 , gokrazy/rsync is not affected by this vulnerability: gokrazy/rsync does not implement the incremental recursion mode ( ). Description: (quoting the GitHub Security Advisory ) Description: Earlier fixes for symlink races on the receiver’s open() call (CVE-2026-29518) missed the same race class on every other path-based system call: chmod, lchown, utimes, rename, unlink, mkdir, symlink, mknod, link, rmdir, lstat. On rsync daemons with “use chroot = no” a local attacker with filesystem access on the daemon host can swap a symlink into a parent directory component between the receiver’s check and one of these syscalls, redirecting it outside the exported module. The fix routes each affected path-based syscall through a parent dirfd opened under RESOLVE_BENEATH-equivalent kernel-enforced confinement (openat2 on Linux 5.6+, O_RESOLVE_BENEATH on FreeBSD 13+ and macOS 15+, per-component O_NOFOLLOW walk elsewhere). Default “use chroot = yes” is not exposed. Reach: local attacker on the daemon host, write access to a module path, daemon configured with use chroot = no. Upstream fix: The upstream fix for CVE-2026-43619 uses the family of syscalls, just like Go’s . Can Go help prevent this? Yes, Go offers an API to prevent this, see Defense in depth: Go’s . How does gokrazy/rsync fare? gokrazy/rsync is not affected, because it uses Go’s API throughout. Description: (quoting the GitHub Security Advisory ) On an rsync daemon configured with the global rsyncd.conf setting, the reverse-DNS lookup of the connecting client was performed after the daemon had chrooted into . If did not contain the files glibc needs for resolution ( , , , NSS service modules), the lookup failed and the connecting hostname was set to “UNKNOWN”. Hostname-based deny rules (“hosts deny = *.evil.example”) therefore could not match, and an attacker controlling their PTR record could connect from a hostname the administrator had intended to deny. IP-based ACLs are unaffected. The per-module setting is unrelated to this issue. Reach: rsync daemon configured with AND hostname-based ACLs AND does not include the libc resolver fixtures. Upstream fix: The upstream fix for CVE-2026-43617 moves the DNS lookup to an earlier point in the protocol. How does gokrazy/rsync fare? gokrazy/rsync is not vulnerable because we only implement IP-based allow/deny lists, not hostname-based allow/deny lists. Description: (quoting the GitHub Security Advisory ) The rsync client’s HTTP proxy support contains an off-by-one out-of-bounds stack write in ( ). After issuing the request, rsync reads the proxy’s first response line one byte at a time into a 1024-byte stack buffer with the bound , so the loop only ever writes . If the proxy (or a man-in-the-middle in front of it) returns 1023+ bytes on the first response line without a terminator, the loop exits with — a slot the loop never wrote, so holds stale stack bytes left there by the earlier that formatted the outgoing request. The post-loop code then does: The lands one byte past the end of the on-stack , corrupting whatever lives in the adjacent stack slot. AddressSanitizer reports at in the frame. Upstream fix: The upstream fix for CVE-2026-45232 validates the attacker-supplied data. How does gokrazy/rsync fare? gokrazy/rsync does not implement such proxy support, so it is not vulnerable. Let’s summarize how Go fares: Aside from being written in Go, another key difference between gokrazy/rsync and the official upstream rsync is that the gokrazy implementation is minimal : Let’s have a look at whether gokrazy/rsync was affected by each CVE at the time of publishing: To be clear: all known vulnerabilities are fixed in gokrazy/rsync! The table above documents what the state was at the time when each CVE was published. In other words: When the January 2025 vulnerabilities were published, gokrazy/rsync panicked (CVE-2024-12084) and was vulnerable to a TOCTOU race (CVE-2024-12747). In the process of fixing the TOCTOU issue, we discovered CVE-2026-29518, which was fixed in gokrazy/rsync before the CVE was published. CVE-2026-43619 was discovered even later, but was also already fixed in gokrazy/rsync with the same fix: using Go’s everywhere. As I was reading the vulnerability reports, I noticed that the reports were slightly misleading by their choice of words: most reports just spoke of “server” and “client”. However, in an rsync transfer, both sides, the rsync client and the rsync server can assume either role: sender (upload files) or receiver (download files)! Some setups come with further restrictions that make certain attacks harder or impossible to pull off. For example, when running in daemon mode, file system access can be restricted to the pre-configured module paths (but not in command mode!). Here is a diagram to give you an overview of the 4 different setups and role/protocol layering: In the context of our vulnerability reports, I would say that the Arbitrary File Leak vulnerability (CVE-2024-12086)’s original title “Server leaks arbitrary client files” can easily be misunderstood. Instead, I would say: The rsync receiver will leak arbitrary files to a malicious sender . I have verified that a malicious client sender can make an unpatched remote rsync open files outside the destination tree (e.g. the system password database) when running in command mode, for example over SSH. (But, when running in daemon mode, the server enables additional path sanitization, which prevents this attack.) Similarly, the Symlink Path Traversal vulnerability (CVE-2024-12087) speaks about a “malicious server”, but again, it should be “malicious sender”, which can be either the client or the server. The OpenBSD project is known for its security focus, so how does openrsync compare? openrsync is not affected by the Heap Buffer Overflow (CVE-2024-12084) and Stack Info Leak (CVE-2024-12085) vulnerabilities because it validates the checksum length and only supports one checksum size/algorithm (MD4). openrsync is not affected by CVE-2024-12086, CVE-2024-12087 and CVE-2024-12088 because it does not implement the relevant features (like gokrazy/rsync). Even if it was vulnerable, openrsync’s defense-in-depth measures like using OpenBSD’s and to restrict file system access would have prevented successful exploitation — at least when running on OpenBSD. openrsync is not affected by CVE-2024-12747 because it used from the very moment they implemented symlink support . But, because is not a sufficient fix for this issue, openrsync is affected by CVE-2026-29518! The above covers the January 2025 batch of vulnerabilities; the May 2026 batch is similar in that most features just are not implemented. Overall, I say: Well done, Kristaps and contributors! By diligently implementing validation, restricting the attack surface and employing defense-in-depth measures, openrsync manages to not be affected by almost all of the reported vulnerabilities. Which APIs and environments can we use on Linux for defense-in-depth measures? I’ll go through the ones supports, ordered by traditional to modern. Within a few weeks after starting the project, I added support for dropping privileges and using mount/pid namespaces on Linux to restrict the file system objects that my rsync server could work with. This approach works very well to mitigate path traversal attacks, but requires privileges, meaning we need to run as or in a Linux user namespace (if enabled on your distribution / system). That limitation makes mount namespaces well-suited for server setups, but usually unavailable for interactive one-off transfers that are typically running under a human’s user account. In the same commit that introduced Linux mount/pid namespace support, I also included a systemd service file that restricted file system access to home directories and encouraged folks in the README to further restrict file system access, depending on what their use-case allows. These file system restrictions, if set up correctly, mitigate the File Leak (CVE-2024-12086) and Path Traversal (CVE-2024-12087) vulnerabilities. The Symlink Race Condition (CVE-2024-12747) relies on privilege escalation through the rsync process, but thanks to the DynamicUser feature, our process has fewer privileges than other users. Similarly to mount namespaces, these measures are great for server setups, but too cumbersome to set up for interactive one-off usages. I stumbled upon Justine’s blog post Porting OpenBSD pledge() to Linux (2022) and was reminded that Linux offers the Landlock API for unprivileged, per-process access control, similar to OpenBSD’s system call, which openrsync uses. The basic idea is that once your program knows the directory it works with, it makes a call like and no longer has access to other file system locations. I had previously heard of Landlock at a Go Meetup, so I knew there was Go support for Landlock. Back in 2022, I enabled Landlock support in the gokrazy kernel images. So I gave it a shot in March 2025 and implemented Landlock support to restrict file system access . It took me a few hours, which seems a little longer than one might expect at first. Making Landlock work (and/or skipping it) in our test environment ran into a couple of road blocks: Our tests had defined many functions that get run in the same process, but when repeatedly adding rulesets, we would exceed the limit of 16 (!) policy layers per process. Once I had it set up just right, it is a beautiful solution. Now we can restrict rsync transfers to their sources (read-only) or destination directories (read-write), even for unprivileged invocations of ! 🎉 The downside to Landlock is that Landlock operates at the process level. This means that Landlock policies must include the files that your program needs, e.g. needs to be able to read for user id lookup, so if the attacker is after the file, Landlock does not help. In February 2025, the Go 1.24 release introduced the API, which is resistant against path traversal, see The Go Blog: Traversal-resistant file APIs (by Damien Neil, March 2025). This API allows more fine-grained control (per file system operation) compared to Landlock. Go 1.25 (released in August 2025) added more methods to , making it a convenient choice for most file system usage. I have converted all of ’s file system usage to use , which is a great fit: users configure input/output directories, but the filenames received over the network are untrusted. That’s exactly what was designed for! When I first looked into using , I thought that some system calls could inherently not be made with this API, like for example to create device node files. Damien explained: It won’t support mknod, though. However, you should be able to use it to enable a safe mknod: If you’re curious how that looks in practice, check out ’s usage in , line 15-29 . Another stumbling block was when I realized that unlike with , Linux only implements , but no (as of Linux 7.0)! Luckily, Lennart Poettering pointed out that there’s a trick to skip path resolution without : you can probably bind to in the meantime… And indeed, this works! Path resolution is skipped because we only specify a basename (last component of a path) after the known-safe , not a path (see line 49-56 ). With these two tips, v0.3.1 and newer are fully using , meaning all file system access is traversal-safe! 🥳 Lacking validation causes vulnerabilities It is interesting to note that aside from the TOCTOU vulnerabilities (CVE-2024-12747, CVE-2026-29518 and CVE-2026-43619), all other vulnerabilities were caused by missing or incorrect input validation. In three cases, there was just no validation to begin with. In another case (CVE-2024-12088), the subject matter of file system path resolution is tricky enough that the existing validation did not cover all edge cases. As the Go verdict section explains in more detail, the most valuable structural fixes are to provide bounds checking (= always-on validation) and safe-by-default APIs like Go’s . Too much complexity A few of the vulnerabilities came from evolution of the rsync protocol: The code used to correctly perform sufficient validation, but then new features were added. For example, when checksum algorithm negotiation was added (protocol version 30), the validation was not correctly updated. When incremental recursion was added (also protocol version 30), the validation that made sense for individual file lists was not updated for the new processing approach of merging incremental file lists. Avoiding complexity avoids vulnerabilities! Both gokrazy/rsync and also openrsync were not vulnerable to 8 out of the 12 security vulnerabilities simply because they do not implement the feature with the vulnerability. Of course, these features were added to rsync because they were valuable to someone at some point, and of course I am not saying that we should just… not develop software any further, ever. But, I consider it ideal to use an implementation whose complexity is appropriate for and proportional to the complexity of the use-case . In other words: for simple use-cases, reach for a simple implementation. Only reach for the fully-featured implementation where needed. The verdict on whether using Go has helped . The verdict on whether a minimal re-implementation like gokrazy/rsync helps . My comparison with OpenBSD’s (written in C). Defense in depth mechanisms one can use on Linux. The conclusion . CVE-2024-12084 to 12088 (original report) CVE-2024-12747 (discovered separately by Aleksei Gorban “loqpa”) CVE-2026-29518 (discovered by Damien Neil and myself! and independently by Nullx3D ) CVE-2026-43617 to 43620 CVE-2026-45232 rsync performed insufficient validation: It read the (attacker-controlled) checksum length from the network and compared the length against . However, rsync’s data structures always declared a 16 byte buffer: is always 16 (bytes), which is sufficient to hold an MD4 or MD5 checksum. used to be 16 (bytes), but can be larger when rsync is compiled with SHA256 or SHA512 checksum support. Hence, the bounds check was ineffective! An attacker could write out of bounds. This issue was introduced with commit in September 2022 , which added SHA256/SHA512 checksum support. A 32-bit Adler-CRC32 Checksum A digest of the file chunk. The digest algorithm is determined at the beginning of the protocol negotiation. The corresponding code can be seen below: sender.c : The “Some checksum buffer fixes” commit prevents this attack because the attacker-controlled can no longer be larger than the transfer’s checksum length. The “prevent information leak off the stack” commit initializes the memory to zero, thereby making any stack leak through impossible. Check out gokrazy/rsync v0.2.7: Patch the code as follows to undo the fix and execute the attack: The Go runtime’s bounds checks turn more serious security issues into a panic. A panic is still a denial-of-service risk, but that’s much preferable. Go initializes memory to zero, making info leaks like CVE-2024-12085 impossible. Go’s API prevents most of the remaining vulnerabilities. Only one out of twelve vulnerabilities (CVE-2026-43617) is a proper bug in the application logic that using Go could not have prevented. gokrazy/rsync is unaffected by many vulnerabilities because it does not implement the feature in question, for example . Like all other wire protocol-compatible rsync implementations, gokrazy/rsync targets protocol version 27, because later protocol versions introduce significant complexity. In some cases, features that would be good to implement come with significant blockers, e.g. compression is tricky, see gokrazy/rsync issue #35 for details. os.Root.OpenFile the parent directory of the target, File.Fd to get the file descriptor for that directory, https://pkg.go.dev/golang.org/x/sys/unix#Mknodat to create the file.

0 views
Armin Ronacher 1 weeks ago

Building Pi With Pi

Pi is now part of Earendil, but in the important sense it is still Mario’s project. He has been living with its issue tracker longer than I have, and he has been exposed to the weirdness of the new form of agent traffic in Open Source projects for longer too. This post is mostly a reflection of my own experience after spending more time in the tracker, using Pi to work on Pi, and watching what I have learned about it so far. Unsurprisingly, we are using Pi to build Pi. That sounds like a cute dogfooding thing but it really helps understand what we do. An interesting effect of building with agents is that it changes the role of the issue tracker a tiny bit. The issue descriptions are not just messages from a user to a maintainer because we also use them as inputs for prompts in Pi sessions. It is something I might hand to my clanker 1 and say: “understand this, reproduce it, inspect the code, and propose a fix.” That means the shape of the issue matters in a new way. A bad issue was always annoying, but at least a lot of issues were vague. Now we are also dealing with a class of issues that are 5% human and 95% clanker-generated and largely inaccurate shit. A bad issue that contains a plausible but wrong diagnosis creates extra work. The most frustrating failure mode right now is that people submit issues that are not in their own voice. They contain an observed problem somewhere, but it has been thrown into a clanker and the clanker reworded it and made a huge mess of it. Typically, it was prompted so badly that the conclusions produced are more often than not inaccurate but always full of confidence. The result is complete guesswork on root causes, fake-minimal repros, suggested implementation strategies, analogies to adjacent but often the wrong code, and long lists of error classes that might or might not matter. That is worse than no diagnosis. I don’t want to point to specific issues because I really do not want to bad mouth anyone, but it is frustrating. It is also frustrating because when I give that issue to Pi, Pi sees the wrong diagnosis too. It does not treat the issue body as a rumor. It treats it as evidence. It will happily go down the path that the issue already prepared for it, because the prose is confident and the code references look plausible. We use a custom slash command called , which specifically has this instruction in it: Do not trust analysis written in the issue. Independently verify behavior and derive your own analysis from the code and execution path. Unfortunately, it does not fully work, because when humans first throw their issue through the clanker wringer, their clanker expands scope almost immediately. What was once a very narrow and fact based bug observation, turns into a much expanded surface area full of hypotheses. So at least personally, I increasingly want issue reports to be condensed to what the human actually observed: That is enough. If you used an LLM to understand the problem, great, maybe leave it as a follow-up comment. But the issue and the issue text should be something you own. If you do not know the root cause, say that. I too can operate a clanker, and I would rather do this myself than use your slop. If your repro is a guess, say that. If the only hard fact is one stack trace, give me the stack trace and stop there. That we’re seeing issues full of slop is just a result of the present day quality of these machines. Sadly, their failures in creating good issues extend to a lot of code that is generated. Not all of it, but a lot of code. Over and over I keep running into them over-engineering the hell out of issues and implementations. If you tell them that “this malformed session log crashes the reader,” the clanker will often add a tolerant reader. Then it will add a fallback, then maybe a migration, then more debug output, then a test for all of this. None of this is necessarily wrong in isolation, but it can be the wrong move for the system. At Pi’s core is a rather well-designed session log with invariants that must be upheld. The clanker’s present-day behavior is to just assume that no such invariants exist, and instead to make the system work with all kinds of malformedness, blowing up the complexity in the process. Almost always, the correct fix is not to handle the bad state, but to make the bad state impossible. This matters a lot for persisted data such as Pi session logs. They are opened, branched, compacted, exported, shared, and analyzed. The goal here is to never write bad session data. Yet if you just let the clanker roam freely, it will attempt to handle every case of bad data in the session log with a more permissive reader. I have complained about this plenty, but working on Pi’s code base continues to reinforce the point. This is one of the ways LLM authored code grows so much needless complexity. All these models see a local failure and try to locally defend against it. As maintainers we have to keep pulling the conversation back to the global invariant, which is harder than it should be, and it’s laborious. Then there is the issue of volume. The tracker is receiving a lot of issues and PRs, and a significant fraction of them are clearly LLM-assisted. Some are good, none are excellent, and most are just bad. The total throughput is a maintenance problem by itself. As you might know, Pi’s issue tracker is automated to close all issues and pull requests from new contributors, and there is a manual process by which we might reopen some of them or approve individuals. So auto-close -> reopen -> close again is an interesting statistic for us to look at. I pulled the public GitHub tracker data while writing this over the last 90 days. Excluding Earendil members, that leaves 3,145 external issues and pull requests. Of those, 2,504 were auto-closed because they were from non-approved individuals. 17% were reopened. For pull requests the number is worse: less than 10% were merged. Many of the issues and PRs are complete slop and in some cases the humans did not even realize that they created them. Sources of low-quality spam include OpenClaw instances, as well as some skills that people put into their context that seemingly encourage issue creation. GitHub clearly is not built to deal with this new form of Open Source, but I’m increasingly feeling the need to put the blame less on GitHub than on all the people involved who make that experience painful. If your clanker shits on someone else’s issue tracker then it’s not the fault of GitHub, it’s yours alone. Pi might be built with Pi, but we’re quite far off today from where Bun and OpenClaw already are: fully detached, automated software engineering. Maybe we will reach that point, I don’t know. Today it does not seem like we know how to pull off a dark factory and we also don’t yet have the desire. That said, there is quite a bit of parallelism going on, and it is mostly for reproducing issues. The small setup we use for this is three tiny pieces in Pi’s own committed folder. (for analyze is sue) is a prompt for analyzing GitHub issues: it labels and assigns the issue, reads the full thread and links, then explicitly tells the agent not to trust the analysis in the issue and to derive its own diagnosis from the code. Then an extension adds a which watches the prompt before the agent starts, recognizes the GitHub issue or PR URL that (or the PR equivalent) put into the prompt, fetches the title and author with , renders that in a little UI widget, and renames the session. It also rebuilds that state on session start or session switch, so if we reopen an older investigation the window still tells the developer which issue it belongs to. In practice this means it’s possible to have several Pi windows open, each running against a different issue, and the UI keeps the investigations visually distinct while the agents do their independent reproduction and code reading. Once the investigations are done, one can work through them sequentially. To finish off everything, ( wr ap it up) is the matching wrap-up prompt: it infers the GitHub context from the session, updates the changelog, drafts or posts the final issue comment with a disclaimer, commits only the files changed in that session, adds the appropriate when there is exactly one issue, and pushes from . You will have noticed this already but Open Source in a post-AI world is under a strange new pressure. We are getting more code, more projects, and more issues. Projects appear with no real users, or a temporary audience of one, and even projects with thousands of stars can have a shelf life of weeks. For us, Pi’s harness layer is worth maintaining carefully because it solves hard coordination problems and creates a platform we and others can build on. We also know that coordination and cooperation lifts us all up. Many times the right answer is not to work around a problem locally, but to make the upstream behavior correct. Mario has been very good at refusing to make Pi paper over every misconfigured gateway, and we’re trying to preserve that discipline. When a gateway behaves correctly, everybody benefits. Sadly that type of thinking is quickly disappearing because these machines make local workarounds cheap, so code accumulates local defenses against every misbehavior. Instead of humans talking to humans about where a fix belongs, one human and one machine work around the problem in isolation. Keep in mind that AI has not increased the number of people who need software, or the number of maintainers who can review it. It has mostly increased the amount of code and the number of projects competing for attention. Some of that is healthy, but a lot of it fragments effort that should be shared. We need stronger foundations, not weaker ones. Open Source needs more collaboration, not more isolated work with a machine. Human communication is hard, and it is tempting to avoid it when you can sit alone with your clanker. But isolation is not where Open Source derives its value. The value is in the community and the structure that lets projects outlive their original creators. To me, clanker is a much preferable term for agent. Agency lies with humans, not with machines. Calling these things agents I still believe is a mistake, but alas. ↩ I ran this command. I expected this to happen. This happened instead. Here is the exact error or log. To me, clanker is a much preferable term for agent. Agency lies with humans, not with machines. Calling these things agents I still believe is a mistake, but alas. ↩

0 views
Jeff Geerling 1 weeks ago

News about Raspberry Pi 6 and Microcontroller Development

On Thursday, three of the lead Raspberry Pi engineers hosted an AMA on the r/engineering subreddit . One of the most interesting tidbits was on the Pi 6. Looking back at previous launches: Following that cycle, one would expect a Pi 6 3-4 years after the Pi 5, which would put it in 2026 or 2027. 2012: Raspberry Pi 2015: Raspberry Pi 2 (+3 years) 2016: Raspberry Pi 3 (+1 year) 2019: Raspberry Pi 4 (+3 years) 2023: Raspberry Pi 5 (+4 years)

0 views
Manuel Moreale 1 weeks ago

I feel your pain Sara

I stumbled on this piece of code recently that made me laugh, cry, sigh in despair, and think of poor Sara doing her best to make the web a better place. I guess people have forgot that is a thing that exists. Thank you for keeping RSS alive. You're awesome. Email me :: Sign my guestbook :: Support for 1$/month :: See my generous supporters :: Subscribe to People and Blogs

0 views
Jack Vanlightly 1 weeks ago

Introducing Dimster, a performance benchmarking tool for Apache Kafka

Dimster = DIMensional teSTER for Apache Kafka On GitHub: https://github.com/dimster-hq/dimster Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past, for RabbitMQ and Apache Pulsar but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past. With Claude becoming so much better since Christmas I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past, to make a performance testing tool for testing Apache Kafka. In this post I’ll introduce some aspects of Dimster that are core to its design: Dimensional testing Shareable, self-contained results with reproducibility in mind Benchmark prep and post-processing Kubernetes as a standardized runtime A benchmarking and stress testing technique I’ve used for years is something I have called “Dimensional Testing”. We can think of all the configs and workload aspects as forming N-dimensional space. Within that space we can explore the impact of points in that space along a single dimension, or even co-varying dimensions. Take a config or an aspect of a workload as a dimension, and run a series of identical benchmarks where a set of points along that dimension are explored (while everything else remains the same). The dimension could be a client config, such as batch.size or acks. It could be an aspect of the workload such as number of consumers, type of consumer, number of consumer groups, the partition count, the produce rate and so on. There are hundreds of dimensions to explore, which requires some patience and care lest you become overwhelmed. The below depicts just three dimensions, and a set of three scenarios which test performance along one or two dimensions at a time. Fig 1. Three examples of varying or co-varying an aspect of a workload as dimensions Each of the above 16 test points (across 3 scenarios) is a separate benchmark, with a fresh topic, warm-up time, recorded time, and cooldown time etc. The generated charts for throughput and various latencies are repeated for each of the three scenarios, with each test point within a scenario plotted as a series/bar on those charts. This makes it easy to compare the performance results of varying the values of a single dimension (or co-varying values across multiple dimensions). Fig 2. Each scenario maps to a set of charts, with the test points as data series. With share groups being relatively new, I could compare the performance of regular consumers against share group consumers, with identical benchmarks where the dimension explored is consumer type (CONSUMER_GROUP|SHARE_GROUP). The following test has as the base workload of ten topics with each topic having 6 partitions, 6 consumers and 4 producers. Each scenario changes the producer rate, and compares consumer groups to share groups. Record keys are used, so batch sizes will be small, which is a tougher workload than a no-key test which typically results in larger batches. The charts below show the results for an EKS deployment with Kafka deployed on 3x m6i.2xlarge with 300 MB/s provisioned gp3. At 50 MB/s we see that p99 end-to-end latency is stable, with roughly 15 ms overhead for share groups. At 200 MB/s, p99 end-to-end exhibits peaks in a periodic fashion. Dimster uses environments. The sizing of a test is determined by which environment is used. I ran some share group consumer scaling tests, with full mTLS, on Kafka clusters assigned 2, 4, and 8 CPUs. These are the equivalent of vCPUs, as my Threadripper has SMT (hyperthreading) enabled. 2-CPU environment on my Threadripper: I ran the following workload with the above environment, with the CPU requests/limit of 2, 4 and 8. Then I used the dimster compare command to generate comparison charts based on the JSON result files of each run. Each chart compares each test point side-by-side. 10k msg/s - 1000 consumers (6th test point in 1st scenario) We see that 2 CPUs fare a lot worse than 4 and 8 CPUs. 100k msg/s, 250 consumers (4th test point, 3rd scenario) The 2 CPU cluster simply can’t keep up with 100k msg/s and 250 consumers. If we unselect 2-CPU, we see that 4-CPU and 8-CPU was ok. Dimster charts are interactive. Series can be toggled, time and percentile ranges can be selected. One thing I really like about OMB is that it produces a JSON file for the results. These files are easy to store and easy to share. But there was also a lot missing for full traceability and reproducibility. Dimster includes the following in every test campaign result (a set of files in a result directory): Results :  The JSON result file which contains all the test point performance results. For each test point, it includes the effective workload and client configuration. It also includes the hardware and other metadata to know what the benchmark was run against. A CSV file generated from the result JSON file (to make it easy to put in a spreadsheet or run custom visualizations). Source configs : The source workload file itself, as well as any additional files such as any dedicated client config file, the broker config file, the version of Kafka, the version of the Kafka clients, and the CPU/memory/disk given to the brokers and clients. Log files : the log files of dimster-core, the benchmarking framework, and each Kafka broker. Charts : Throughput and latency charts (clickable, zoomable) generated from the result JSON file. Dashboards : Grafana dashboards converted to interactive HTML files. I can run a test campaign then send you the results and you’ll be able to reproduce the results because you know exactly what was run and on what. The results are also completely self-contained, if you want to see the dashboard to look at Kafka metrics during the test, it’s right there as an HTML file in the results. No need for access to Grafana and Prometheus and no need to keep monitoring infrastructure around, it can be ephemeral. Dimster comes with four test modes (which all support dimensional testing): Run : Fixed throughput benchmarks, plus: Live-interaction . Run-mode also supports live interaction with the user. The user can change the producer rate, number of producers and consumers, message size, etc.  Availability : Optionally measure availability (producer/consumer/aggregate) during the standard run-mode benchmark. Explore : Discover the highest sustainable throughput while staying under a target end-to-end latency and percentile. Drain-backlog : Build a backlog and time how long it takes for the consumers to drain it. Optionally set a producer rate during the drain phase, such as when testing if a cluster is big enough to drain a backlog while under normal producer load. Correctness : Detects data loss, data corruption, out-of-order delivery and duplicates.  Example 1: Peak sustainable throughput, 1 partition, share group consumers Explore mode on my Threadripper. The idea was to see the bottleneck of a single partition, as consumers are scaled out. The rule was for p75 e2e latency to stay below 50ms. Example 2: Consumer group vs share group with 1 ms processing time The prior example was an unrealistic synthetic test where the consumer spent no time processing. This explore test added 1 ms consumer processing time per message with 300 consumers. It compared a 300 member consumer group with 300 partitions, vs a 300 member share group, with 5, 10, 25 and 50 partitions. Share groups managed the same throughput (95% of theoretical max based on 1 ms processing time and consumer count), on only 10 partitions. Consumers groups needed 300 partitions. Personally, explore and run are my bread and butter benchmark modes. For a given workload I usually start by finding the throughput limit where Kafka transitions from normal stable performance into degraded territory. I either use run mode and use live interaction to discover the performance limit, or I use explore which is slower but I can leave to run and it discovers the limit in an automated way. For latency benchmarks, once I know the limit, I can craft benchmarks that fit inside the performance envelope for that workload on the specific version of Kafka on the specific hardware I am using. The Dimster CLI has some commands that help before running benchmarks and for post-processing. Dimster resources command The resources command calculates the network and disk throughput required to service a workload. This is important in the cloud for selecting the right instances, ensuring that baseline network and disk throughput are greater than the workload’s demands. Dimster compare command Compare different runs that were executed on different hardware, different broker configurations, different broker versions etc. Dimster pivot command You can slice and dice the data any way you want based on the CSV data. However, you can also pivot the results and generate a chart with the pivot command. This compares the Nth test point across all scenarios. Dimster is easiest to use with Kubernetes. Dimster has a CLI you use from your laptop which speaks Kubernetes and leverages it to run benchmarks on any hardware, any cloud, any laptop or workstation using the exact same orchestration logic. All it needs is a properly configured k8s cluster. It could be minikube or k3d on a laptop or workstation, or AWS EKS or Google Cloud GKE or your own in-house cluster. You can tell Dimster to deploy Apache Kafka to a stateful set in the k8s cluster: Fig 3. Dimster architecture in full deploy mode Or point Dimster (deployed to k8s) at a Kafka service or in-house Kafka cluster. When testing a Kafka service, you can provision a single powerful instance for the Dimster coordinator and worker, and deploy them to a local k8s distro such as Minikube, K3d or Kind. A single worker will happily consume all the cores and memory you give it. Fig 4. Dimster architecture in external deploy mode Or run a super-slim full setup in a tiny minikube/kind/etc local k8s distro: Fig 5. Dimster deployed in a tiny local k8s cluster The workflow is the same. If you can provide a k8s cluster, then Dimster does the rest. Deployment is really simple, monitoring, gathering results, troubleshooting is all simplified via a mix of the CLI being relatively capable, and k8s providing a well-understood platform. K8s is not obligatory , you can run dimster-core directly as a Java program, and point it at a Kafka cluster already provisioned. But you lose many features such as monitoring, live-interaction, automatic gathering of logs, automatic chart and CSV generation and so on. However, you can use the post-processing command dimster chart to generate the charts of a result JSON file. Run the Java directly via the benchmark script: ./bin/benchmark -w path/to/workload file I will be publishing a blog post regularly about Dimster and what you can do with it. So stay tuned. I invite you to go and play around with Dimster , even if it's just running benchmarks on your laptop or workstation. You can get an idea of what charts get produced, what kinds of benchmarks you can run, trying out dimensional testing etc. The docs are pretty decent and should cover most of it. It’s fully featured but still a 0.X version. Myself and a Confluent colleague are the only ones who have run it thus far, so there may be bugs you encounter, if you do encounter a problem, please open an issue with repro steps. If you want to run serious benchmarks, you’ll likely need an EKS or GKE type of Kubernetes cluster. Dimster comes with a special CLI for EKS to deploy EKS with node groups for Kafka, Dimster workers/coordinator, Grafana/Prometheus, as well as storage classes for gp3.  While evaluating consumer group vs share group consumers, I’ve been running benchmarks in k3d on my beefy Threadripper 9980X workstation with 64 cores (128 threads), 256 GB RAM and an Samsung 9100 PRO 8TB SSD, which is plenty to run an entire medium sized Kafka cluster plus workers on it. I’ll be sharing some share group benchmarks tomorrow. Happy testing! Dimensional testing Shareable, self-contained results with reproducibility in mind Benchmark prep and post-processing Kubernetes as a standardized runtime Results :  The JSON result file which contains all the test point performance results. For each test point, it includes the effective workload and client configuration. It also includes the hardware and other metadata to know what the benchmark was run against. A CSV file generated from the result JSON file (to make it easy to put in a spreadsheet or run custom visualizations). Source configs : The source workload file itself, as well as any additional files such as any dedicated client config file, the broker config file, the version of Kafka, the version of the Kafka clients, and the CPU/memory/disk given to the brokers and clients. Log files : the log files of dimster-core, the benchmarking framework, and each Kafka broker. Charts : Throughput and latency charts (clickable, zoomable) generated from the result JSON file. Dashboards : Grafana dashboards converted to interactive HTML files. Run : Fixed throughput benchmarks, plus: Live-interaction . Run-mode also supports live interaction with the user. The user can change the producer rate, number of producers and consumers, message size, etc.  Availability : Optionally measure availability (producer/consumer/aggregate) during the standard run-mode benchmark. Explore : Discover the highest sustainable throughput while staying under a target end-to-end latency and percentile. Drain-backlog : Build a backlog and time how long it takes for the consumers to drain it. Optionally set a producer rate during the drain phase, such as when testing if a cluster is big enough to drain a backlog while under normal producer load. Correctness : Detects data loss, data corruption, out-of-order delivery and duplicates.

0 views
マリウス 1 weeks ago

Photography Workflow with ~~Darktable on Linux~~ Lightroom on GrapheneOS

Disclaimer: I had initially prepared this post under the title Photography Workflow with Darktable on Linux , but after endless fights with Darktable I eventually decided to scrap that workflow altogether and look for an alternative. The workflow documented herein is unfortunately very far from the result I was striving for, yet it is sadly the best I can put together given the current state of open-source RAW development and photo editing software. After I gave Adobe the finger back in 2019 and moved my photography workflow to Capture One on a MacBook , I eventually had to reconsider this approach when I moved back to Linux on the desktop and replaced the device with a Linux laptop . I briefly tried running Capture One in a Windows VM on my laptop , but decided against it, as it was a huge PITA and lacked proper hardware acceleration. Initially I considered a fork of what is probably the best-known open-source RAW developer and photography workflow application out there, Darktable , called Ansel , but ultimately decided against it. The points that Ansel ’s author, Aurélien, brought up seemed like valid criticisms and demonstrated both his knowledge of and his passion for making Darktable a better tool. However, reading further through his website and his GitHub account, it became apparent that he might be the kind of misunderstood genius who has great ideas and ambition, but who would ultimately struggle to operate within, let alone lead the kind of community required to successfully maintain a fork of a piece of software this large. I therefore didn’t have high hopes of this lone cowboy keeping up with, let alone surpassing, the development efforts the Darktable community is currently putting in. Given that Ansel was explicitly billed as a hard-fork that would not remain compatible with the official Darktable release, going down that path felt too risky. Ansel would ultimately have to provide a migration path for existing Darktable users, as otherwise there would be little to no incentive for anyone with a functioning Darktable workflow already in place to put up with the effort. Instead, I decided to stick with Darktable . For about a year I tried to build a new workflow on top of it. The things I would miss the most from Capture One were the VSCO presets that I had brought over from Lightroom , and for which there didn’t seem to be any way to convert them into a format compatible with Darktable while producing roughly similar results. Luckily, João, a developer and photographer, made what he calls t3mujinpack , a collection of film emulation presets for Darktable . In a blog post , he provides details on which film stocks are included and how to make use of them in Darktable . His pack includes the presets I almost exclusively use from VSCO : Kodak’s Portra 160, 400 and 800. While the results aren’t 100% identical to what Capture One produces with the converted VSCO packs, neither are those exports identical to what Lightroom originally produced. Every piece of software has slight differences in its inner workings, so this is to be expected and can be adjusted for. During my travel through all of Spain in 2024 I decided to rely exclusively on Darktable for developing and editing the photos that I would ultimately upload to this site. That was a big mistake. I rarely say bad things about truly open-source software, because ultimately it is open-source, it’s driven by a community of volunteers, and everyone should be happy that these people do what they do. Also, given that it’s open-source, anyone is free to go ahead and improve what they deem worth improving. However, Darktable is, in my opinion, one of the few exceptions that seem to have derailed so badly that it’s fair to say it has reached a point of no return in terms of usability and jankiness . Let me explain by starting with one of the most annoying things: More often than not, Darktable crashes in the middle of editing sessions, apparently due to Wayland-related issues. However, since I’m also running GIMP and Blender , which I would argue do similar, or even slightly more complex things than Darktable , yet don’t run into such issues, I’d assume that this is not a problem with my Wayland setup specifically. I didn’t try to debug the issue further, as I was mainly focused on testing and establishing a workflow. Had Darktable otherwise worked perfectly fine for me and only run into this issue every once in a while, I would have dug deeper to find the root cause. Unfortunately, this was only one of many things that kept me from continuing to use Darktable . Besides the random crashes, Darktable is unbearably janky and slow. The UI feels like it’s about to fall over at any moment, regardless of whether ROCm acceleration is enabled or not. UI elements feel hacked together, the overall navigation is hostile towards regular users, and it’s impossible to find anything just by looking, because everything is hidden behind collapsed modules, tabs and a gazillion sliders and buttons. To give a single example of the sheer UI craziness that is Darktable : To rotate an image to the right (clockwise), you need to drag a slider to the left (counterclockwise). While on a touchscreen interface this might be more intuitive, when using a touchpad on a laptop or even a mouse it definitely doesn’t feel natural. After all, maybe a slider isn’t the best UI element for this operation to begin with? Another issue that I experienced was related to organizing photos. With over 4000 (RAW) photos in the library, Darktable becomes unbearable to work with. Aside from the spontaneous crashes and overall slow UI, finding specific photos in a library of that size is an excruciatingly painful task. Unlike Capture One and Lightroom , Darktable doesn’t easily support a workflow based on individual, smaller libraries, e.g. organized by location or event. There are ways to sort photos within Darktable ’s main library, but I couldn’t find an easy way to split them out into multiple small libraries. Assuming that you managed to find and edit the photos you were looking for, the headaches continue when you try to export them. It appears that Darktable is unable to export photos with pixel-perfect adherence to the crop aspect ratio . The implementation details and the proposed solution appear to be just as janky as everything else, and a quick search for in the Darktable GitHub repository uncovers a lot more of that same jankiness. I ended up running the following command over every photo exported by Darktable , just to obtain a properly shaped image, meaning I’d lose a few pixels here and there: As mentioned a long time back in an update , I ended up with a broken Darktable library, meaning that I lost all the adjustments that I did manage to export up until that point . Short story long, I eventually ditched Darktable for a plan B . After Darktable broke my library and I lost months’ worth of edits, I found myself back at square one. The idea of returning to Adobe felt like defeat, but when I looked at what was actually available for my setup, which is a Google Pixel Tablet running GrapheneOS , Adobe Lightroom for Android turned out to be the only realistic option that could handle RAW files and offer a non-destructive editing workflow. Adobe Lightroom Mobile is, on paper, a reasonably capable RAW editor for Android. It supports a wide range of camera RAW formats and offers the familiar tone curve, HSL sliders, color grading, masking, and healing tools that anyone coming from desktop Lightroom will recognize. It can read photos directly from the device’s own storage, edit them locally without an internet connection, and export to JPEG with full control over quality and output dimensions. In short, the feature set is there. The physical side of the workflow is straightforward. I attach a USB-C SD card reader to the Pixel Tablet, open a file manager, and copy the RAW files from the card into a dedicated folder on the tablet’s internal storage. From there I open Lightroom , import the photos from that folder into a local album, and work through them one by one. Once a photo is where I want it, I export it as a JPEG into the folder on the tablet’s storage. That folder is monitored by Syncthing , which synchronizes the finished exports to my other devices in the background. The performance of Adobe Lightroom on Android is, to put it mildly, terrible. Rendering a RAW preview after entering edit mode takes long enough that you find yourself staring at a loading indicator more often than at the actual photo. Scrolling through a grid of thumbnails is a choppy, stuttering affair that makes you wonder whether the application is doing something computationally expensive or is just poorly written. I acknowledge that the Pixel Tablet is an older budget device, yet Lightroom treats it as if it were running on hardware from 2005. Lightroom on Android is every bit as buggy as Adobe products traditionally are on macOS and Windows, but somehow worse, because the interface is also frequently broken in ways that make the application essentially unusable without restarting it. The UI will routinely enter a state where confirmation and action buttons either stop responding to taps, as if the touch layer has fallen out of sync with whatever is rendered on screen, or simply disappear altogether. The only resolution is to quit the app and reopen it, at which point you hope that the edit you were in the middle of survived. Entire features will similarly go dark without warning. The auto-straighten function, which should detect the horizon in a photo and level it, simply grays out and stops working at some point. No error, no indication as to why it has become unavailable, nothing. Again, restart the app, try again, maybe it works this time. These are not edge cases or exotic scenarios, but rather the normal operating experience of Adobe Lightroom Mobile . One of the things I was most concerned about before committing to this workflow was the prospect of Adobe silently uploading my photos to their cloud infrastructure. The desktop version of Lightroom has a long and well-documented history of syncing content to Adobe’s servers in ways that are easy to miss and difficult to fully disable. On Android, GrapheneOS gives you a tool that the desktop doesn’t: Per-application network permission revocation. I first disabled the cloud sync option within Lightroom ’s own settings, then went into GrapheneOS’s permission manager and removed the network permission from the Lightroom app entirely. It continues to function as a local RAW editor without any network access whatsoever. Photos stay on the device. Nothing leaves without my explicit say-so via Syncthing. Note: To keep things simple, I did not go into the fact that Lightroom is running inside an Android 16 Private Space , which also contains a sandboxed instance of Google Play Services and lets me create a virtual barrier between the rest of the FOSS apps on the Pixel Tablet and this spyware malware crap proprietary software. With this setup, however, importing data becomes slightly more tedious, as it requires the Google Files app to be able to read an attached USB-C storage device (SD card) from within the Private Space . The Google Files app is a giant UX disaster all by itself, into which, for the sake of our both’s time and mental health, I won’t dive into. One pleasant surprise was that I managed to import the VSCO Lightroom presets I purchased well over a decade ago into Lightroom Mobile on Android. The preset files still work, and the film emulations I had relied on for years, in particular the Kodak Portra series, show up in the presets panel and can be applied to photos. With Adobe being Adobe, however, this had to come with a catch. Lightroom Mobile is apparently incapable of remembering which preset was applied to a given photo. Open an edited photo that had a VSCO preset applied, and Lightroom will display a warning telling you it cannot find the preset, even though the preset is sitting right there in the presets list, available and functioning, ready to be applied to new photos. The edit itself is intact… well… at least sometimes. Other times, Lightroom simply loses the edits altogether. It’s the kind of bug that suggests the feature was never properly tested beyond the initial happy path, which is about what you’d expect from Adobe. To be frank, this workflow sucks compared to the one I had on macOS using Capture One . Lightroom is still the terrible POS it had always been, and paying money to a company like Adobe feels like funding a criminal organization. Unfortunately, there doesn’t appear to be a viable alternative, especially not one that’s libre . The remaining options would be to either pay into Apple’s walled garden by purchasing one of their newer iAmtheproduct devices and subscribing to Capture One Mobile , or to rely exclusively on Fuji’s in-camera film simulations (which sadly won’t work for the Sony ). Judging by the reviews of Capture One Mobile , however, the former option doesn’t appear too promising either. Looking at the situation in a more positive light, I nevertheless managed to replace the underlying stack on which my photography workflow runs with more privacy-respecting software ( GrapheneOS ). That’s at least something , although it seems this workflow won’t live that long either, given that Google keeps locking down their Pixel devices and GrapheneOS appears to be pivoting to Motorola-made hardware , who might not release a GrapheneOS-compatible Moto Pad anytime soon. Oh well. Pro tip: A USI 2.0 pen makes using Lightroom on a device like the Pixel Tablet significantly less painful, at least as long as the USI pen actually works properly, which sadly isn’t always the case with the Renaisser pen I own. If you’re looking for a more general review of the Google Pixel Tablet with GrapheneOS, look here .

0 views
(think) 1 weeks ago

neat: a language-agnostic nREPL client for Emacs

I think I’ll take my REPL neat My parens black and my bed at three CIDER’s too sweet for me… Last week I announced Port , a small prepl client for Emacs. Today I’m following it up with another small Emacs package. Meet neat , a tiny, deliberately language-agnostic nREPL client. For years I’ve been hearing some version of the same request: “could CIDER work with my non-Clojure nREPL server?”. Babashka, Basilisp, nREPL-CLR, even some homegrown servers people built on top of nREPL for languages I’d never heard of. 1 The answer was always the same kind of squishy “sort of, in theory, with caveats”, because while bare nREPL is genuinely language-agnostic, CIDER is not. CIDER was built for Clojure and assumes Clojure pretty much everywhere. I always thought the right answer was “let’s gradually make CIDER more language-agnostic.” That’s the kind of plan that sounds reasonable until you actually try it. The thing that pushed me over the edge was, oddly enough, building Port. Port is small, focused, and doesn’t try to be CIDER. Working on it for a couple of weeks reminded me how (deceptively) productive it is to start from a clean slate when the new requirements don’t match the assumptions baked into a mature codebase. Trying to retrofit CIDER into a language-agnostic shape would have meant fighting with every helper that ever assumed exists, every middleware contract defines, every project-type heuristic that knows about and and nothing else. A whole lot of “is the server Clojure, or is it the other thing?” branches. The Port experience reaffirmed that the right move for a genuinely different client is a new project , not a thousand cuts to an existing one. So was born. The name is short, says what it does (it’s neat, both in the small-and-tidy sense and in the “no deps, no special assumptions, just the protocol” sense), and conveniently leaves room for puns I haven’t fully committed to yet. I might land on a backronym one day. For now it’s just “neat”. neat is a small Emacs nREPL client. The code is split across four files: It only uses Emacs builtins. There are no external runtime dependencies, not even on , because neat doesn’t assume Clojure on the other end. If you write , , , or anything else that talks nREPL, you turn on in that buffer and it just works. The connection routing is also intentionally library-friendly. There’s a buffer-local override so downstream packages can implement their own routing logic, plus a global default for the simple “one server at a time” case that most people will want. Capability discovery is done at connect time via the nREPL op. neat doesn’t hardcode “this server has completions, this one doesn’t” assumptions. If the server reports a op, the CAPF backend lights up (with type annotations next to each candidate, when the server provides them). If it reports , eldoc starts working and jumps to definitions via an xref backend. If neither is there, you still get a perfectly serviceable raw REPL. Start an nREPL server. Anything that speaks the protocol will do. For a Clojure server: Then in Emacs: A REPL buffer pops up, the prompt follows the server’s reported namespace, and you can type expressions at it. Multi-line input works because only submits when the form parses as balanced under (Emacs Lisp syntax by default, which is close enough for any Lisp). Input history is persisted across sessions. If there’s a file in the project, the prompt defaults to its contents, so is enough to connect. To evaluate from a source buffer, turn on the minor mode: The familiar bindings are there, intentionally compatible with what CIDER users expect: ships the buffer contents as an op; uses the standard op instead, so the server can attribute file and line numbers to errors. Use the latter when you’re actually loading a file from disk and care about good diagnostics. sets the buffer-local , which gets sent as the field on every op from that buffer. For languages where the namespace is declared in the source (Clojure’s , etc.), swap in a parser via . For juggling multiple connections, opens a tabulated-list buffer with one row per live connection, where you can set the default or disconnect interactively. That’s roughly the whole user-facing surface today. There’s no jack-in command, no inspector, no debugger, no test runner. Likely there will never be, but if you need those you should probably be using CIDER anyways… If you write Clojure and CIDER works for you, keep using CIDER. It’s mature, full-featured, and supported, and I’m going to keep working on it for as long as people use it. Nothing about neat changes that. But if you find yourself in one of these situations: then neat might be a better fit. It’s small enough that you can read the whole thing in an afternoon, and the library/UI split ( and are perfectly usable from other packages) is genuinely designed for downstream consumers. neat is part of a broader push I’ve been chewing on for a while now: making nREPL a healthy multi-language ecosystem rather than a Clojure-only protocol. That push has three legs: This is also why I keep teasing a “reference CLI client” in conversations. An editor client is one thing, but a small command-line nREPL client written in a non-Lisp language would be a much sharper test of how language-agnostic the protocol really is. neat is plausibly a precursor to that. Time will tell how far I push this; for now I just wanted to get the Emacs side moving. As always, big thanks to Clojurists Together and everyone supporting my open source work. You make it possible for me to keep tweaking and improving CIDER, nREPL, clj-refactor, and friends, and occasionally try something “neat” on the side. isn’t replacing any of the existing Clojure tooling for Emacs. It’s just another tool in the box for the people who want it. Feedback, ideas, and contributions are most welcome over at the issue tracker . Keep hacking! https://github.com/clojure-emacs/cider/issues/3905   ↩︎ For a long time I planned to extract CIDER’s nREPL client code into a reusable package, but now that we have I probably will finally abandon this idea.  ↩︎ : bencode encode/decode. : TCP connections, request dispatch, the standard nREPL ops. : a comint-derived REPL buffer. : the entry point, customization group, and minor mode for source buffers. you write a non-Clojure language whose runtime ships an nREPL server, and you’ve been muddling through with a half-supported CIDER setup, you write Clojure but you value minimalism and don’t need the full CIDER feature set, you’re building an Emacs package that needs to talk nREPL and you want a small, dependency-free library to build on, 2 An actual nREPL specification. The spec.nrepl.org draft is (will be) the formal version of what today is “whatever nREPL the project does”. Reference clients. neat is one. The point of building a deliberately Clojure-free client is that it stress-tests the spec. Anywhere neat ends up needing to special-case the server, the spec has a gap. A compatibility test suite. The parameterised integration suite in neat already runs the same assertions against multiple servers and surfaces real divergences (Clojure batching into a single message where Basilisp emits two, for example). I’d like to grow this into a portable suite that any nREPL server can self-check against. https://github.com/clojure-emacs/cider/issues/3905   ↩︎ For a long time I planned to extract CIDER’s nREPL client code into a reusable package, but now that we have I probably will finally abandon this idea.  ↩︎

0 views
Blog System/5 1 weeks ago

A Markdown-based test suite

This article is not about AI and it is not written with AI, but the work that I’m about to present was definitely motivated by AI. And because I generally like telling stories, I have to give you that background. Do with that whatever you want, but… it’d be a pity if you left just because the AI word showed up in the first paragraph! I think the technical explanation that follows is at the very least entertaining and also interesting independently of AI. Back in December, I started toying with coding agents. One thing I tried, and for which I didn’t expect a lot of success, was to point an AI agent to the EndBASIC public documentation and ask it to write games like Space Invaders or Mario from scratch. And even though the results weren’t perfect and they didn’t work on the first try, they did work with a few tiny tweaks. Combining that with a bunch of hand-written rules, I had an agent producing EndBASIC demos with ease. This experiment was impressive because I did not expect an agent to be able to write EndBASIC code… and because it worked, it fueled my interest to pick EndBASIC’s own development back up. Three thoughts came to mind: Increase EndBASIC’s “self-documenting” aspects so that an AI agent can learn about its idiosyncrasies unsupervised. Speed up EndBASIC so that it can run more elaborate games. Extend EndBASIC with long-desired primitives like sprites and sound, to finally realize the vision behind the project. These thoughts combined sparked the rewrite of EndBASIC’s core that I’ve been pursuing since January and which should see the light of day in the upcoming release. But before that happens, I want to talk to you about just one of the cool pieces behind the new core: namely, its approach to testing. I’ve stopped writing unit tests for the compiler and VM in Rust and I’ve switched to writing them in Markdown. And I believe this has turned out to be a pretty nice approach. One of the things I had to do to convince an AI agent to write proper EndBASIC code was to hand-craft a bunch of rules to tell it how EndBASIC differs from other, more traditional BASIC dialects. That worked OK, but writing these rules by hand was error-prone and difficult to make exhaustive. So I wanted to let LLMs extract that information directly from EndBASIC. The idea was simple: if I wrote the integration tests for the new core in Markdown, the lingua franca of AI, the tests would serve as the canonical and correct documentation demonstrating language behaviors. LLMs are great at summarizing information, so if I unleashed them over a large set of these hands-on “examples”, they would probably figure stuff out, right? And they actually do! I gave the following prompt to GPT 5.4: Based on your pre-existing knowledge of BASIC dialects, I want you to read all of the files, analyze how the EndBASIC dialect differs from your knowledge, and come up with a bunch of rules for yourself to know how to write EndBASIC code later on. You can ignore the Disassembly sections. Beware that all functions and commands in these integration tests are test-only: the real functions and commands that you can use are documented in , so read those too to learn what functionality is available. Write your findings to a file. And this produced a very comprehensive file with spot-on rules: here, take a look . But leaving that aside, let’s peek into the internals of this new Markdown-based test suite. All cool so far? Want to see more similar content in the future? Subscribe now to demonstrate your interest! It’s a collection of Markdown files: Where each file acts as a container of one or more test cases : Every test case has a section title describing what the test is about and various subsections to define the test scenario: A Source code block that is the input to the compiler. If compilation fails, a Compilation errors section with the error messages and nothing else afterwards. If compilation succeeds: A Disassembly section that contains the compiled bytecode. An optional Exit code section showing the program’s exit code, if different from zero. An Output section that contains any messages printed to the console by the executed program. A Runtime errors section that contains any errors from the executed program. Here is a simple example validating the command: There is no section to validate the lexer nor parser internals right now but I’m considering to further extend the format and dump the AST too in order to simplify the tests for these components. The driver for this test suite enumerates all Markdown files in the tests directory and processes them one at a time. For each file, the driver extracts all test case titles and their Source subsections to compute all the test cases to execute. Once the driver has this subset of information from the Markdown files, the driver feeds each individual test case to the compiler and, if compilation succeeds, to the VM. All side-effects are captured and the driver emits a new Markdown file from scratch with the results of the test. Once the driver has terminated producing a new version of the Markdown file for a test, the driver compares the produced file (actual) against the pre-recorded, checked-in version (golden). If they differ, the test fails and the driver uses the tool to print the differences. And that’s it. Easy peasy, right? This keeps the driver super-simple as the only thing it has to do is parse a minimal subset of Markdown, and the diffs it produces are trivial to understand to a human. There are currently 448 test cases and 13k lines of Markdown in this test suite so maintaining them “by hand” is not an option. You wouldn’t want to implement an optimization to the compiler and then have to rewrite hundreds of disassembly chunks in the golden files to reflect the changes, would you? The thing is that, due to the design described earlier, regenerating the golden files after a core change is easy: the driver is already doing exactly that to execute the tests! The trick is, simply put, to ask the driver to rewrite the golden file instead of producing an actual file by setting the environment variable. And voila: all golden files are regenerated in place. I can then use Git to validate the changes and commit them along with the actual code change. Let’s start with the pros of this Markdown-based test suite framework: It is much easier to work with than what I had before. I used to dread touching the compiler and VM of the previous EndBASIC core implementation because tweaking tens of tests was painful. Changes required me to fiddle with positions and deeply nested types, and now the tests are trivial to tweak and diff against previous state. Pretty much any decent text editor has Markdown support, including formatting fenced code blocks. This makes it easy to skim through the test suite and modify the files and is actually the primary reason I used Markdown instead of a bespoke textual format. LLMs can “learn” with ease. OK, fair, this is just a guess: I did not try the same prompt at the beginning of this article against the old core with its Rust-based tests, and maybe the LLMs would have done a good job at reverse-engineering the rules. But because the Markdown tests are so much easier to read by humans, I have to assume that they also are for LLMs. And now, of course, some cons: Regenerating the output of a test, or all tests, is way too easy . With the older Rust-based tests, I was forced to manually punch in things like line numbers and nested AST trees. This process forced me to think through the changes in detail. With the new approach… regenerating the golden files is trivial, so it’s easy to miss little mistakes in source positions or disassembled code. Differences in disassembly are usually noisy and hard to review because every line carries an address and thus any new or deleted instruction will introduce offsets into all other addresses. I could of course choose to not include the instruction addresses in the dump, but they come in handy when manually validating jump targets, so it felt better to keep them around. Rust cannot generate first-class test cases on the fly which means that the various test cases within a Markdown file are “invisible” to the driver: I can run them all or none, but regular test filtering via doesn’t apply. I was able to “expose” the different Markdown files as different Rust-native test cases, but this involves a hardcoded list of test files—which must be kept in sync with the files on disk, and so I mitigated the chances of divergence by adding a test that cross-references the two. This idea does not generalize well. The Markdown-based test suite presented here works well for components where end-to-end testing is favorable and, more importantly, cheap , but I wouldn’t recommend it for other scenarios. Keeping tests fast is a must for quick iteration. And I think that’s about it. If the above feels too abstract, I encourage you to take a look at the driver , its helper code , and the directory with test suites . Now that you have this new trick up your sleeve, what do you think? Back in December, I started toying with coding agents. One thing I tried, and for which I didn’t expect a lot of success, was to point an AI agent to the EndBASIC public documentation and ask it to write games like Space Invaders or Mario from scratch. And even though the results weren’t perfect and they didn’t work on the first try, they did work with a few tiny tweaks. Combining that with a bunch of hand-written rules, I had an agent producing EndBASIC demos with ease. This experiment was impressive because I did not expect an agent to be able to write EndBASIC code… and because it worked, it fueled my interest to pick EndBASIC’s own development back up. Three thoughts came to mind: Increase EndBASIC’s “self-documenting” aspects so that an AI agent can learn about its idiosyncrasies unsupervised. Speed up EndBASIC so that it can run more elaborate games. Extend EndBASIC with long-desired primitives like sprites and sound, to finally realize the vision behind the project. A Source code block that is the input to the compiler. If compilation fails, a Compilation errors section with the error messages and nothing else afterwards. If compilation succeeds: A Disassembly section that contains the compiled bytecode. An optional Exit code section showing the program’s exit code, if different from zero. An Output section that contains any messages printed to the console by the executed program. A Runtime errors section that contains any errors from the executed program. It is much easier to work with than what I had before. I used to dread touching the compiler and VM of the previous EndBASIC core implementation because tweaking tens of tests was painful. Changes required me to fiddle with positions and deeply nested types, and now the tests are trivial to tweak and diff against previous state. Pretty much any decent text editor has Markdown support, including formatting fenced code blocks. This makes it easy to skim through the test suite and modify the files and is actually the primary reason I used Markdown instead of a bespoke textual format. LLMs can “learn” with ease. OK, fair, this is just a guess: I did not try the same prompt at the beginning of this article against the old core with its Rust-based tests, and maybe the LLMs would have done a good job at reverse-engineering the rules. But because the Markdown tests are so much easier to read by humans, I have to assume that they also are for LLMs. Regenerating the output of a test, or all tests, is way too easy . With the older Rust-based tests, I was forced to manually punch in things like line numbers and nested AST trees. This process forced me to think through the changes in detail. With the new approach… regenerating the golden files is trivial, so it’s easy to miss little mistakes in source positions or disassembled code. Differences in disassembly are usually noisy and hard to review because every line carries an address and thus any new or deleted instruction will introduce offsets into all other addresses. I could of course choose to not include the instruction addresses in the dump, but they come in handy when manually validating jump targets, so it felt better to keep them around. Rust cannot generate first-class test cases on the fly which means that the various test cases within a Markdown file are “invisible” to the driver: I can run them all or none, but regular test filtering via doesn’t apply. I was able to “expose” the different Markdown files as different Rust-native test cases, but this involves a hardcoded list of test files—which must be kept in sync with the files on disk, and so I mitigated the chances of divergence by adding a test that cross-references the two. This idea does not generalize well. The Markdown-based test suite presented here works well for components where end-to-end testing is favorable and, more importantly, cheap , but I wouldn’t recommend it for other scenarios. Keeping tests fast is a must for quick iteration.

0 views
Ahead of AI 2 weeks ago

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

After a short family break, I am excited to be back and catching up on a busy few weeks of open-weight LLM releases. The thing that stood out to me is how much newer architectures are focused on long-context efficiency. As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints, and LLM developers are adding a growing number of architecture tricks to reduce those costs. The main examples I want to look at are KV sharing and per-layer embeddings in Gemma 4, layer-wise attention budgeting in Laguna XS.2, compressed convolutional attention in ZAYA1-8B, and mHC plus compressed attention in DeepSeek V4. Most of these changes look like small tweaks in my architecture diagrams, but some of them are quite intricate design changes that are worth a more detailed discussion. Figure 1. LLM architecture drawings of recent, major open-weight releases (April to May). You can find the images, and more details, in my LLM architecture gallery . Not all model sizes are shown; Qwen3.6 includes the 27B and 35B-A3B variants, and ZAYA1 is represented by the 8B model (omitting ZAYA1-base and ZAYA1-reasoning-base). The architectures in the dotted boxes are covered in more detail in this article. Note that this article is about architecture designs, so I will mostly skip dataset mixtures, training schedules, post-training details, RL recipes, benchmark tables, and product comparisons. Even with that narrower scope, there is a lot to cover. And, like always, the article turned out longer than I expected, so I will keep the focus on what changes inside the transformer block, residual stream, KV cache, or attention computation. Please also note that I am only covering those topics that are interesting (new) design choices and that I haven’t covered elsewhere, yet. This list includes: KV sharing and per-layer embeddings in Gemma 4 Compressed convolutional attention in ZAYA1 Attention budgeting in Laguna XS.2 mHC and compressed attention in DeepSeek V4 Before getting into the new parts, here are the two previous articles I will refer back to. The first one gives a broader architecture background on recent MoE models, routed experts, active parameters, and model-size comparisons. The second one covers the attention background that comes up repeatedly below, including MHA, MQA, GQA, MLA, sliding-window attention, sparse attention, and hybrid attention designs. I also turned several of these explanations into short, standalone tutorial pages in the LLM Architecture Gallery . For example, readers can find compact explainers for GQA, MLA, sliding-window attention, DeepSeek Sparse Attention, MoE routing, and other concepts linked from the corresponding model cards and concept labels. For this tour of architecture advances and tweaks, we will go back to the beginning of April when Google released their new open-weight Gemma 4 suite of models. They come in 3 broad categories: the Gemma 4 E2B and E4B models for mobile and small, local (embedded) devices (aka IoT), the Gemma 4 26B mixture-of-experts (MoE) model, optimized for efficient local inference, and the Gemma 4 31B dense model, for maximum quality and more convenient post-training (since MoEs are trickier to work with) Figure 2: Gemma 4 architecture drawings. The first small architecture tweak in the E2B and E4B variants is that they adopt a shared KV cache scheme, where later layers reuse key-value states from earlier layers to reduce long-context memory and compute. This KV-sharing was not invented by Gemma 4. For instance, see Brandon et al. , “ Reducing Transformer Key-Value Cache Size with Cross-Layer Attention ” (NeurIPS 2024). But it’s the first popular architecture where I saw this concept applied. (Cross-layer attention is not to be confused with cross-attention .) Before explaining KV-sharing further, let’s briefly talk about the motivation. As I wrote and talked about in recent months, one of the main recent themes in LLM architecture design is KV cache size reduction. In turn, the motivation behind KV cache size reduction is to reduce the required memory, which allows us to work with longer contexts, which is especially relevant in the age of reasoning models and agents. For more background on KV caching, see my “Understanding and Coding the KV Cache in LLMs from Scratch” article: Practically all of the popular attention variants I described in my previous A Visual Guide to Attention Variants in Modern LLMs article are designed to reduce the KV cache size: To pick a classic example (that Gemma 4 still uses): Grouped Query Attention (GQA) already shares key-value (KV) heads across different query heads to reduce the KV cache size, as illustrated in the figure below. Figure 3: Grouped Query Attention (GQA) shares the same key (K) and value (V) heads among multiple query (Q) heads. As mentioned before, Gemma 4 uses GQA. However, in addition to the KV sharing among queries as part of GQA, Gemma 4 also shares KV projections across different layers instead of computing it as part of the attention module in each layer. This KV-sharing scheme, also called cross-layer attention, is illustrated in the figure below. Figure 4: Regular transformer blocks compute separate Q, K, and V projections in each attention module (left). Cross-layer attention designs (right) share the same K and V projections across multiple layers. As briefly hinted at in the architecture overview in Figure 2, Gemma 4 E2B uses regular GQA and sliding window attention in a 4:1 pattern. (More precisely, Gemma 4 E2B uses MQA, which is the one-KV-head special case of GQA). In the case of GQA (or MQA), the KV-sharing works like this. Later layers no longer compute their own key and value projections but reuse the KV tensors from the most recent earlier non-shared layer of the same attention type. In other words, sliding-window layers share KV with a previous sliding-window layer. Full-attention layers share KV with a previous full-attention layer. The layers still compute their own query projections, so each layer can form its own attention pattern, but the expensive and memory-heavy KV cache is reused across several layers. For example, Gemma 4 E2B has 35 transformer layers, but only the first 15 compute their own KV projections; the final 20 layers reuse KV tensors from the most recent earlier non-shared layer of the same attention type. Similarly, Gemma 4 E4B has 42 layers, with 24 layers computing their own KV and the final 18 layers sharing them. How much does this actually save? Since we share roughly half of the KVs across layers, we save approximately half of the KV cache size. For the smallest E2B model, this results in a 2.7 GB saving (at bfloat16 precision) in long 128K contexts, as shown below. (For the E4B variant, this saves about 6 GB at 128K.) Figure 5: KV cache memory savings from GQA and cross-layer KV sharing in a Gemma 4 E2B-like setup. For simplicity, additional savings from sliding window attention are not shown. The downside of KV-sharing is, of course, that it’s an “approximation” of the real thing. Or, more precisely, it reduces model capacity. However, according to the cross-layer attention paper, the impact can be minimal (for small models that were tested). The Gemma 4 E2B and E4B variants include a second efficiency-oriented design choice called per-layer embeddings (PLE). This is separate from the KV-sharing scheme above. KV sharing reduces the KV cache. PLE is instead about parameter efficiency, where it lets the small Gemma 4 models use more token-specific information without making the main transformer stack as expensive as a dense model with the same total parameter count. For instance, the “E” in Gemma 4 E2B and E4B stands for “effective”. Concretely, Gemma 4 E2B is listed as 2.3B effective parameters, or 5.1B parameters when the embeddings are counted. (Similarly, Gemma 4 E4B is listed as 4.5B effective parameters, or 8B parameters with embeddings). In short, in the “E” models, the main transformer-stack compute is closer to the smaller number, while the larger number includes the additional embedding-table layers. (For an illustration of how embedding layers work, see my “ Understanding the Difference Between Embedding Layers and Linear Layers ” code notebook.) Conceptually, the new PLE path looks like this: Figure 6: Simplified Gemma 4 block with the PLE residual path. The normal block first computes the attention and feed-forward residual updates. The resulting hidden state gates the layer-specific PLE vector, and the projected PLE update is added as an extra residual update at the end of the block. The PLE vectors themselves are prepared outside the repeated transformer blocks. In simplified form, there are two inputs to the PLE construction. First, the token IDs go through a per-layer embedding lookup. Second, the normal token embeddings go through a linear projection into the same packed PLE space. These two pieces are added, scaled, and reshaped into a tensor with one slice per layer. Note that each block then receives its own slice. Figure 7: Simplified PLE construction. The token IDs provide a per-layer embedding lookup, while the normal token embeddings are projected into the same space. The two contributions are combined and reshaped so that each transformer block receives its own layer-specific PLE slice. The important detail is that PLE does not give each transformer block a full independent copy of the normal token embedding layer. Instead, the per-layer embedding lookup is computed once. Then, as mentioned before, it gives each layer a small token-specific embedding slice (via “reshape / select layer l”. So, for each input token, Gemma 4 prepares a packed PLE tensor that contains one small vector per decoder layer. Then, during the forward pass, layer l receives only its own slice (ple_l in the Gemma4WithPLEBlock in figure 6). Inside the transformer block, the regular attention and feed-forward branches run as usual. First, the block computes the attention residual update. Then it computes the feed-forward residual update. After that second residual add, the resulting hidden state, which I denoted as z in the pseudocode in figure 6, is used to gate the layer-specific PLE vector. The gated PLE vector is projected back to the model hidden size, normalized, and added as one extra residual update. So the useful mental model is that the transformer block still has the same main attention and feed-forward path, but Gemma 4 adds a small layer-specific token vector after the feed-forward branch. This increases representational capacity through embedding parameters and small projections. This adds computational overhead but avoids the cost of scaling the entire transformer stack to the larger parameter count. But why PLEs? The simpler alternative would be to make the dense model smaller, using fewer layers, narrower hidden states, or smaller feed-forward networks. That would reduce memory and latency, but it also removes capacity from the parts of the model that do the main computation. The PLE design keeps the expensive transformer blocks closer to the smaller “effective” size, while storing additional capacity in per-layer embedding tables. These are much cheaper to use than adding more attention or FFN weights, since they are mainly lookup-style parameters that can be cached. Also, we have to take Google’s word here that this is an effective and worthwhile design choice. It would be interesting to see some comparison studies to see how this E2B design compares to a regular Gemma 4 2.3B model and a regular Gemma 4 5.1B model. Also, in principle, PLE is not inherently limited to small models. We could attach per-layer embedding slices to larger models, too. However, larger models already have sufficient capacity where these extra embeddings may not help that much. Also, for larger models, we already use MoE designs as a trick to increase capacity while keeping the compute footprint smaller. By the way, if you are interested in a relatively simple and readable code implementation, I implemented the Gemma 4 E2B and E4B models from scratch here . Figure 8: Snapshot of my Gemma 4 from-scratch implementation . Laguna is the first open-weight model by Poolside , a Europe-based company focused on training LLMs for coding applications. Several of my former colleagues joined Poolside in recent years, and they have a great team with lots of talent. It’s just nice to see more companies also releasing some of their models as open-weight variants. Anyways, the Laguna XS.2 architecture depicted below looks very standard at first glance. However, one detail that I didn’t show (/try to cram into there) is a concept we can refer to as “Layer-wise attention budgeting”. Figure 9: Poolside’s Laguna XS.2 architecture. Part of the idea behind the attention budgeting here is that instead of giving every transformer layer the same full attention budget, Laguna XS.2 varies the attention cost by layer. It has 40 layers total, with 30 sliding-window attention layers and 10 global/full attention layers. As usual, the sliding-window layers only attend over a local window (here: 512 tokens), which keeps the KV cache and attention computation cheaper. The global layers are more expensive but preserve the ability to access all information in the context window. This mixed sliding-window + global/full attention pattern is not unique to Laguna XS.2 and is used by many other architectures (including Gemma 4). But what’s new is the use of per-layer query-head counts. For instance, the Hugging Face model hub config.json includes a setting, so layers can have different numbers of query heads while keeping the KV cache shape compatible. Figure 10: Per-layer query-head budgeting in Laguna, where full attention layers use 6 query heads per KV head, and sliding window attention layers use 8 query heads per KV head. So Laguna XS.2 gives more query heads to sliding-window layers and fewer query heads to global layers, while keeping the KV heads fixed at 8. That is the actual layer-wise head budgeting in the config. Laguna XS.2 is one of the most prominent recent examples of this per-layer query-head budgeting in a production-style open model. But the broader idea of varying model capacity by layer goes back to (at least) Apple’s 2024 OpenELM . And again, what’s the point of such a design? Similar to KV-sharing, the point is to spend attention capacity where it is most useful, instead of giving every layer the same budget. Specifically, full-attention layers are expensive because they look across the whole context, so Laguna gives them fewer query heads compared to sliding window attention modules. (Besides, another smaller implementation detail is that Laguna also applies per-head attention-output gating; this is somewhat similar to Qwen3-Next and others, which I also omit here since I covered it in earlier articles.) Similar to Laguna, ZAYA1-8B is another new player on the open-weight market. It is developed by Zyphra , and one of the interesting details around the release is that the model was trained on AMD GPUs rather than the more common NVIDIA GPU (or Google TPU) setup. The main architecture detail, though, is Compressed Convolutional Attention (CCA), used together with grouped-query attention. Unlike MLA-style designs that mainly use a latent representation as a compact KV cache format, CCA performs the attention operation directly in the compressed latent space, but more on that later. (Sidenote: the ZAYA1-8B config.json lists 80 alternating layer entries rather than 40 conventional transformer blocks. These entries alternate between CCA/GQA attention and MoE feed-forward layers. But for the architecture figure, it is more convenient to visualize this as 40 repeated attention + MoE pairs, which is conceptually equivalent.) Figure 11: Zaya1 (8B) with transformer blocks featuring compressed convolutional attention. As hinted at in the figure above, ZAYA1-8B uses Compressed Convolutional Attention (CCA) together with a 4:1 GQA layout. The key point is that its attention block is built around CCA rather than a standard sliding-window attention block. What is Compressed Convolutional Attention? I would say CCA is related in spirit to Multi-head Latent Attention (MLA) in DeepSeek’s models, since both introduce a compressed latent representation into the attention block. However, they use that latent space differently. MLA mainly uses the latent representation to reduce the KV cache. In MLA, the KV tensors are stored compactly and then projected into the attention-head space for the actual attention computation. Figure 12: Regular Multi-head Attention (MHA) and Multi-head Latent (MLA) attention side by side. CCA compresses Q, K, and V and performs the attention operation directly in the compressed latent space. This is why CCA can reduce not only KV cache size, but also attention FLOPs during prefill and training. Figure 13: Multi-head Latent Attention (MLA) and Compressed Convolutional Attention (CCA) side by side. As Figure 13 above illustrates, in CCA, the compressed, latent representations enter the attention mechanism directly, and the resulting compressed attention vector is then up-projected. Note that this is called Compressed Convolutional Attention, not just Compressed Attention, since there is an additional convolutional mixing happening on the latent K and Q representations. The convolutional mixing part is not shown in Figure 12, because it would have been too crammed, but it’s relatively straightforward. As hinted at in Figure 12, the convolutional mixing happens directly on the compressed Q and K tensors. The point is that compression makes Q, K, and V narrower, which saves compute and cache, but it can also make attention less expressive. The convolutions are a cheap way to give the compressed Q and K vectors more local context before they are used to compute attention scores. (The convolutional mixing is only applied to Q and K, not V, because Q and K determine the attention scores, while V represents the content that gets averaged via these scores). Figure 14: conceptual overview of the sequence-mixing convolution Next to the sequence mixing shown in Figure 13, there is also a channel mixing component. It’s in principle similar though, so I am omitting the illustration. CCA appears to be a Zyphra-introduced attention mechanism that predates the ZAYA1-8B technical report . The standalone CCA paper, Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space , was first posted in October 2025 and explicitly introduces CCA. ZAYA1-8B then uses this mechanism as one of the core pieces. But the question is, “is it better than MLA”? According to the CCA paper’s own experiments, yes, they report CCA outperforming MLA under comparable compression settings. Figure 15: Annotated figures from the CCA paper, https://arxiv.org/abs/2510.04476 . Overall, the interesting part here is really the new attention mechanism. The model also uses a pretty extreme (= very sparse) MoE setup, with only one routed expert active per token, but that part is more familiar. CCA is more unusual because it performs the attention operation directly in a compressed latent space, and then uses convolutional mixing on the compressed Q and K representations to make this compressed attention less limiting. So, in short, ZAYA1-8B is not only trying to save compute in the feed-forward layers, but also in the attention mechanism itself. DeepSeek V4 was the biggest release of the year so far, both in terms of hype and model size. Interestingly, DeepSeek V4-Pro is also the most parameter-sparse MoE among the models in the table below, measured by active-parameter share, as summarized in the table below. Figure 16: Percent active parameter plot for MoE models. You can also find an HTML version at https://sebastianraschka.com/llm-architecture-gallery/active-parameter-ratio/ . Caveat: active parameter share is only one lens. It does not capture KV cache size, attention pattern, context length, routing overhead, hardware efficiency, or training quality. But it is a helpful, quick check when comparing sparse models. There’s a lot to say about DeepSeek V4, but since it’s been all over the news already, and to stay on topic regarding architecture tweaks, I will focus on the two most relevant parts that are new compared to previous architectures: mHC for a wider residual pathway, CSA/HCA for long-context attention compression and sparsity Looking at the DeepSeek V4 architecture drawing below, there seems to be a lot going on. The useful way to read it is to separate the residual-path change, mHC, from the attention-path changes, CSA/HCA, and compressed attention caches. Figure 17: DeepSeek V4-Pro architecture overview. Let’s start with the mHC component of DeepSeek V4. This goes back to a research paper that the DeepSeek team shared last year (31 Dec 2025, mHC: Manifold-Constrained Hyper-Connections ). However, in this paper, the technique was only tested on an experimental 27B scale model. Now, we see it in their flagship release, which is a good sign that this idea actually works well in production. The main idea behind mHC here is to modernize the design of the residual connections inside the transformer block, which is refreshing, because architecture tweaks are usually focused on the attention mechanism, normalization layer placement, and MoE parts. Now, mHC is based on previous work on hyper-connections (see Hyper-connections by Zhu et al., 2024), which we should briefly discuss first. Hyper-connections essentially modify the single residual stream inside the transformer block by replacing it with several parallel residual streams and learned mappings between them. (For those new to residual connections, I made a video on residual neural networks many years ago, where I explained the general mechanism.) The idea behind hyper-connections is to widen the residual stream. We can think of this as keeping several parallel residual streams, with an additional Res Mapping linear transformation that mixes them across layers. Since the Attention or MoE layer itself still operates on the normal hidden size, hyper-connections also add a Pre Mapping that combines the parallel residual streams into one normal hidden vector for the layer, and a Post Mapping that distributes the layer output back across the parallel residual streams. This is visually summarized in the figure below. Figure 18: Regular transformer block (top) vs transformer block with hyper-connections (bottom) using annotated figures from the mHC paper, https://arxiv.org/abs/2512.24880 . The figure below focuses on the attention-layer portion of the transformer block, but the same concept applies to the second residual branch around the MoE layer. The purpose of hyper-connections is to make the residual pathway more expressive without making the actual Attention or MoE layer wider. This is only mildly more expensive in FLOPs because the extra mappings operate over the small residual-stream axis, for example, n = 4 in DeepSeek V4, not over a huge hidden dimension. In the original hyper-connections paper, the 7B OLMo MoE experiment goes from 13.36G to 13.38G FLOPs per token, which is basically unchanged. In terms of reported gains, there were modest (but consistent) improvements, as shown in the figure below. (However, only looking at FLOPs is a bit simplistic. The widened residual state still has to be stored, moved through memory, mixed, etc. So the practical overhead can come more from memory traffic and implementation complexity than from arithmetic, which is not explicitly measured. However, given that DeepSeek V4 is all about efficiency, it seems to be a worthwhile addition.) Figure 19: Hyper-connections performance versus baseline, using an annotated figure from the hyper-connections paper, https://arxiv.org/abs/2409.19606 . Also, as shown in the figure above, metrics reached the baseline’s performance using roughly half the training tokens. The main change from regular hyper-connections (HC) to manifold-constrained hyper-connections (mHC) is that the mappings are no longer left unconstrained. In regular HC, the Res Mapping is a learned matrix that mixes the parallel residual streams, but stacking many such matrices can amplify or shrink signals unpredictably. In mHC, this residual mapping is projected onto the manifold of doubly stochastic matrices, meaning all entries are non-negative and each row and column sums to 1. This makes the residual mixing behave more like a stable redistribution of information across streams. The Pre Mapping and Post Mapping are also constrained to be non-negative and bounded, which avoids cancellation when reading from and writing back into the widened residual state. In short, mHC keeps the richer residual mixing of HC, but adds constraints so it scales more safely, which becomes more relevant for larger (deeper) models. Otherwise, the main idea of using parallel residual streams remains, as shown in the figure below. Figure 20: Transformer block with hyper-connections (HC) and manifold-constrained hyper-connections (mHC) using annotated figures from the mHC paper, https://arxiv.org/abs/2512.24880 . In the mHC paper, using a 27B parameter model for the experiments, the DeepSeek team’s optimized implementation (with fusion, recomputation, and pipeline scheduling) adds only 6.7% additional training time overhead for 4 residual streams (n = 4) throughout all transformer blocks compared to the single-stream baseline. To sum up this section, HC/mHC changes how information is carried around these layers by replacing the single residual stream with several interacting residual streams, with the additional stability constraints added in mHC, while adding minimal compute overhead. Also, it pairs well with the CSA/HCA attention changes, which modify other parts of the transformer block, which I will discuss below. The other major DeepSeek V4 architecture change is on the attention side. Again, the motivation is that at very long context lengths, attention becomes expensive not only because of the attention score computation, but also because the KV cache grows with the sequence length. DeepSeek V4 addresses this issue with a hybrid of two compressed-attention mechanisms, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). For a refresher, I recommend checking out my previous “ A Visual Guide to Attention Variants in Modern LLMs ” article, which covers Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA), among others. The first thing to note is that CSA/HCA in DeepSeek V4 is a different kind of compression than the MLA-style compression used in DeepSeek V2/V3. Where MLA mainly compresses the per-token KV representation, CSA and HCA compress along the sequence dimension. So, instead of keeping one full (or compressed) KV entry for every previous token, they summarize groups of tokens into fewer compressed KV entries. Consequently, the cache gets shorter. DeepSeek V4 also uses compact compressed entries and shared-KV attention, but the main distinction from MLA is the sequence-length compression. This is illustrated in the figure below. Figure 21: Conceptual comparison of MLA-style per-token latent caching, CSA, and HCA. MLA compresses the stored KV representation but keeps one latent entry per token. CSA shortens the sequence more mildly with m=4 and sparse top-k selection, while HCA uses much heavier sequence compression with m’=128 and dense attention over the shorter cache. The quality tradeoff for CSA/HCA is also different from MLA. As shown in the figure above, MLA compresses the representation stored for each token, but it still keeps one latent KV entry per token. CSA and especially HCA go further by reducing the number of sequence entries themselves, so the model gives up some token-level info in exchange for much lower long-context cost. Again, it’s all about reducing long-context cost, but this trade-off can hurt modeling quality if the compression is too strong, which is why DeepSeek V4 does not rely on one compression scheme alone but alternates between CSA and HCA. CSA uses a milder compression rate and a DeepSeek Sparse Attention (DSA)-style selector, HCA uses much heavier compression for cheaper global coverage, and both keep a local sliding-window branch for recent uncompressed tokens. This sparse selection in CSA builds on DeepSeek Sparse Attention (DSA), which I discussed in more detail in my earlier DeepSeek V3.2 write-up . HCA is the more aggressive variant of the two. It compresses every 128 tokens into one compressed KV entry, but then uses dense attention over those heavily compressed entries. In other words, CSA keeps more details but uses sparse selection, while HCA keeps far fewer entries and can afford dense attention over them, as illustrated in the figure below. This makes the two mechanisms somewhat complementary, which is why DeepSeek V4 interleaves CSA and HCA layers rather than using only one of them. Figure 22: CSA selects a sparse set of compressed history blocks, while HCA attends densely over more heavily compressed blocks. Both paths also include recent uncompressed KV entries through a 128-token sliding-window branch. The DeepSeek V4 paper reports that, at a 1M-token context length, DeepSeek V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache size compared with DeepSeek V3.2, which uses MLA and DeepSeek Sparse Attention (DSA). DeepSeek V4-Flash is even smaller, at 10% of the FLOPs and 7% of the KV cache size relative to DeepSeek V3.2. Figure 23. Reported 1M-context efficiency numbers from the DeepSeek V4 paper, relative to DeepSeek V3.2. By the way, I would not describe CSA/HCA as “better” than MLA in a general sense. CSA/HCA is a more aggressive long-context design. And it’s also more complicated for sure. Unfortunately, there is no ablation study in the paper. But overall, the paper reports strong overall modeling results, including DeepSeek V4-Flash-Base outperforming DeepSeek V3.2-Base on a majority of base-model benchmarks and strong 1M-token retrieval results, but these results are for the full DeepSeek V4 recipe, which also includes better data, Muon-based optimization, mHC, precision/storage optimizations, and training/inference-system changes. Personally, for now, I would treat CSA/HCA as an efficiency-focused long-context design that appears to preserve modeling quality well in their large flagship model(s) but not necessarily universally better than MLA. Overall, the interesting pattern this year is that most new open-weight models try to make long-context inference cheaper without just shrinking the model in terms of total parameters. For instance, Gemma 4 reduces KV-cache memory with cross-layer KV sharing and adds capacity via per-layer embeddings. Laguna XS.2 tweaks how much attention capacity each layer gets. ZAYA1-8B moves attention into a compressed latent space. DeepSeek V4 adds constrained residual-stream mixing and compressed long-context attention. All of these tweaks add more complexity, which seems to be where LLM architecture is going right now. My main takeaway is that the transformer block is still changing, but in fairly targeted ways. The basic recipe is still based on the original GPT decoder-only transformer architecture, but many parts are upgraded or replaced, and they get more specialized for longer contexts and more efficient inference, whereas the qualitative modeling performance seems largely driven by data quality (and quantity) and training recipes. The question many of you asked me in the past is centered on when (or if) transformers are being replaced with something else. Of course, there are other designs like diffusion models, but transformers remain the status quo for state-of-the-art architecture releases. However, with each increasing yearly release quarter, we get more and more tweaks. While it was possible to implement a basic transformer block in perhaps 50-100 lines of PyTorch code, these tweaks (esp. around the attention variants) probably 10x the code complexity. This is not an inherently bad thing as these tweaks reduce (not increase) runtime costs. However, it’s becoming increasingly difficult to gain a clear understanding of the individual components and their interactions. Figure 24: The evolution from GPT-2 (2019) to DeepSeek V4-Pro (2026) For instance, I am fairly certain that someone who is diving into LLM architectures for the first time will be totally overwhelmed when seeing the DeepSeek V4 source code. However, by starting with the original decoder-style LLM (GPT/GPT-2) and then gradually adding / learning about these new components one at a time, we can keep the learning effort manageable. The moral of the story, I guess, is to keep learning, one architecture at a time :). By the way, I am very excited to share that I finished writing Build A Reasoning Model (From Scratch) and all chapters are in early access now. The publisher and I worked hard on the final layouts in the past month, and it’s going to be send to the printer this week. (Good news: the print version will be in color this time!) This is probably my most ambitious book so far. I spent about 1.5 years writing it, and a large number of experiments went into it. It is also probably the book I worked hardest on in terms of time, effort, and polish, and I hope you’ll enjoy it. Build a Reasoning Model (From Scratch) on Manning and Amazon . The main topics are evaluating reasoning models inference-time scaling self-refinement reinforcement learning distillation There is a lot of discussion around “reasoning” in LLMs, and I think the best way to understand what it really means in the context of LLMs is to implement one from scratch! Amazon (pre-order of Kindle ebook and print paperback) Manning (complete book in early access , pre-final layout, 528 pages) Figure 1. LLM architecture drawings of recent, major open-weight releases (April to May). You can find the images, and more details, in my LLM architecture gallery . Not all model sizes are shown; Qwen3.6 includes the 27B and 35B-A3B variants, and ZAYA1 is represented by the 8B model (omitting ZAYA1-base and ZAYA1-reasoning-base). The architectures in the dotted boxes are covered in more detail in this article. Note that this article is about architecture designs, so I will mostly skip dataset mixtures, training schedules, post-training details, RL recipes, benchmark tables, and product comparisons. Even with that narrower scope, there is a lot to cover. And, like always, the article turned out longer than I expected, so I will keep the focus on what changes inside the transformer block, residual stream, KV cache, or attention computation. Please also note that I am only covering those topics that are interesting (new) design choices and that I haven’t covered elsewhere, yet. This list includes: KV sharing and per-layer embeddings in Gemma 4 Compressed convolutional attention in ZAYA1 Attention budgeting in Laguna XS.2 mHC and compressed attention in DeepSeek V4 the Gemma 4 E2B and E4B models for mobile and small, local (embedded) devices (aka IoT), the Gemma 4 26B mixture-of-experts (MoE) model, optimized for efficient local inference, and the Gemma 4 31B dense model, for maximum quality and more convenient post-training (since MoEs are trickier to work with) Figure 2: Gemma 4 architecture drawings. The first small architecture tweak in the E2B and E4B variants is that they adopt a shared KV cache scheme, where later layers reuse key-value states from earlier layers to reduce long-context memory and compute. This KV-sharing was not invented by Gemma 4. For instance, see Brandon et al. , “ Reducing Transformer Key-Value Cache Size with Cross-Layer Attention ” (NeurIPS 2024). But it’s the first popular architecture where I saw this concept applied. (Cross-layer attention is not to be confused with cross-attention .) Before explaining KV-sharing further, let’s briefly talk about the motivation. As I wrote and talked about in recent months, one of the main recent themes in LLM architecture design is KV cache size reduction. In turn, the motivation behind KV cache size reduction is to reduce the required memory, which allows us to work with longer contexts, which is especially relevant in the age of reasoning models and agents. For more background on KV caching, see my “Understanding and Coding the KV Cache in LLMs from Scratch” article: Practically all of the popular attention variants I described in my previous A Visual Guide to Attention Variants in Modern LLMs article are designed to reduce the KV cache size: To pick a classic example (that Gemma 4 still uses): Grouped Query Attention (GQA) already shares key-value (KV) heads across different query heads to reduce the KV cache size, as illustrated in the figure below. Figure 3: Grouped Query Attention (GQA) shares the same key (K) and value (V) heads among multiple query (Q) heads. As mentioned before, Gemma 4 uses GQA. However, in addition to the KV sharing among queries as part of GQA, Gemma 4 also shares KV projections across different layers instead of computing it as part of the attention module in each layer. This KV-sharing scheme, also called cross-layer attention, is illustrated in the figure below. Figure 4: Regular transformer blocks compute separate Q, K, and V projections in each attention module (left). Cross-layer attention designs (right) share the same K and V projections across multiple layers. As briefly hinted at in the architecture overview in Figure 2, Gemma 4 E2B uses regular GQA and sliding window attention in a 4:1 pattern. (More precisely, Gemma 4 E2B uses MQA, which is the one-KV-head special case of GQA). In the case of GQA (or MQA), the KV-sharing works like this. Later layers no longer compute their own key and value projections but reuse the KV tensors from the most recent earlier non-shared layer of the same attention type. In other words, sliding-window layers share KV with a previous sliding-window layer. Full-attention layers share KV with a previous full-attention layer. The layers still compute their own query projections, so each layer can form its own attention pattern, but the expensive and memory-heavy KV cache is reused across several layers. For example, Gemma 4 E2B has 35 transformer layers, but only the first 15 compute their own KV projections; the final 20 layers reuse KV tensors from the most recent earlier non-shared layer of the same attention type. Similarly, Gemma 4 E4B has 42 layers, with 24 layers computing their own KV and the final 18 layers sharing them. How much does this actually save? Since we share roughly half of the KVs across layers, we save approximately half of the KV cache size. For the smallest E2B model, this results in a 2.7 GB saving (at bfloat16 precision) in long 128K contexts, as shown below. (For the E4B variant, this saves about 6 GB at 128K.) Figure 5: KV cache memory savings from GQA and cross-layer KV sharing in a Gemma 4 E2B-like setup. For simplicity, additional savings from sliding window attention are not shown. The downside of KV-sharing is, of course, that it’s an “approximation” of the real thing. Or, more precisely, it reduces model capacity. However, according to the cross-layer attention paper, the impact can be minimal (for small models that were tested). 2. Per-Layer Embeddings and “Effective” Size (Gemma 4 E2B/E4B) The Gemma 4 E2B and E4B variants include a second efficiency-oriented design choice called per-layer embeddings (PLE). This is separate from the KV-sharing scheme above. KV sharing reduces the KV cache. PLE is instead about parameter efficiency, where it lets the small Gemma 4 models use more token-specific information without making the main transformer stack as expensive as a dense model with the same total parameter count. For instance, the “E” in Gemma 4 E2B and E4B stands for “effective”. Concretely, Gemma 4 E2B is listed as 2.3B effective parameters, or 5.1B parameters when the embeddings are counted. (Similarly, Gemma 4 E4B is listed as 4.5B effective parameters, or 8B parameters with embeddings). In short, in the “E” models, the main transformer-stack compute is closer to the smaller number, while the larger number includes the additional embedding-table layers. (For an illustration of how embedding layers work, see my “ Understanding the Difference Between Embedding Layers and Linear Layers ” code notebook.) Conceptually, the new PLE path looks like this: Figure 6: Simplified Gemma 4 block with the PLE residual path. The normal block first computes the attention and feed-forward residual updates. The resulting hidden state gates the layer-specific PLE vector, and the projected PLE update is added as an extra residual update at the end of the block. The PLE vectors themselves are prepared outside the repeated transformer blocks. In simplified form, there are two inputs to the PLE construction. First, the token IDs go through a per-layer embedding lookup. Second, the normal token embeddings go through a linear projection into the same packed PLE space. These two pieces are added, scaled, and reshaped into a tensor with one slice per layer. Note that each block then receives its own slice. Figure 7: Simplified PLE construction. The token IDs provide a per-layer embedding lookup, while the normal token embeddings are projected into the same space. The two contributions are combined and reshaped so that each transformer block receives its own layer-specific PLE slice. The important detail is that PLE does not give each transformer block a full independent copy of the normal token embedding layer. Instead, the per-layer embedding lookup is computed once. Then, as mentioned before, it gives each layer a small token-specific embedding slice (via “reshape / select layer l”. So, for each input token, Gemma 4 prepares a packed PLE tensor that contains one small vector per decoder layer. Then, during the forward pass, layer l receives only its own slice (ple_l in the Gemma4WithPLEBlock in figure 6). Inside the transformer block, the regular attention and feed-forward branches run as usual. First, the block computes the attention residual update. Then it computes the feed-forward residual update. After that second residual add, the resulting hidden state, which I denoted as z in the pseudocode in figure 6, is used to gate the layer-specific PLE vector. The gated PLE vector is projected back to the model hidden size, normalized, and added as one extra residual update. So the useful mental model is that the transformer block still has the same main attention and feed-forward path, but Gemma 4 adds a small layer-specific token vector after the feed-forward branch. This increases representational capacity through embedding parameters and small projections. This adds computational overhead but avoids the cost of scaling the entire transformer stack to the larger parameter count. But why PLEs? The simpler alternative would be to make the dense model smaller, using fewer layers, narrower hidden states, or smaller feed-forward networks. That would reduce memory and latency, but it also removes capacity from the parts of the model that do the main computation. The PLE design keeps the expensive transformer blocks closer to the smaller “effective” size, while storing additional capacity in per-layer embedding tables. These are much cheaper to use than adding more attention or FFN weights, since they are mainly lookup-style parameters that can be cached. Also, we have to take Google’s word here that this is an effective and worthwhile design choice. It would be interesting to see some comparison studies to see how this E2B design compares to a regular Gemma 4 2.3B model and a regular Gemma 4 5.1B model. Also, in principle, PLE is not inherently limited to small models. We could attach per-layer embedding slices to larger models, too. However, larger models already have sufficient capacity where these extra embeddings may not help that much. Also, for larger models, we already use MoE designs as a trick to increase capacity while keeping the compute footprint smaller. By the way, if you are interested in a relatively simple and readable code implementation, I implemented the Gemma 4 E2B and E4B models from scratch here . Figure 8: Snapshot of my Gemma 4 from-scratch implementation . 3. Layer-Wise Attention Budgeting (Laguna XS.2) Laguna is the first open-weight model by Poolside , a Europe-based company focused on training LLMs for coding applications. Several of my former colleagues joined Poolside in recent years, and they have a great team with lots of talent. It’s just nice to see more companies also releasing some of their models as open-weight variants. Anyways, the Laguna XS.2 architecture depicted below looks very standard at first glance. However, one detail that I didn’t show (/try to cram into there) is a concept we can refer to as “Layer-wise attention budgeting”. Figure 9: Poolside’s Laguna XS.2 architecture. Part of the idea behind the attention budgeting here is that instead of giving every transformer layer the same full attention budget, Laguna XS.2 varies the attention cost by layer. It has 40 layers total, with 30 sliding-window attention layers and 10 global/full attention layers. As usual, the sliding-window layers only attend over a local window (here: 512 tokens), which keeps the KV cache and attention computation cheaper. The global layers are more expensive but preserve the ability to access all information in the context window. This mixed sliding-window + global/full attention pattern is not unique to Laguna XS.2 and is used by many other architectures (including Gemma 4). But what’s new is the use of per-layer query-head counts. For instance, the Hugging Face model hub config.json includes a setting, so layers can have different numbers of query heads while keeping the KV cache shape compatible. Figure 10: Per-layer query-head budgeting in Laguna, where full attention layers use 6 query heads per KV head, and sliding window attention layers use 8 query heads per KV head. So Laguna XS.2 gives more query heads to sliding-window layers and fewer query heads to global layers, while keeping the KV heads fixed at 8. That is the actual layer-wise head budgeting in the config. Laguna XS.2 is one of the most prominent recent examples of this per-layer query-head budgeting in a production-style open model. But the broader idea of varying model capacity by layer goes back to (at least) Apple’s 2024 OpenELM . And again, what’s the point of such a design? Similar to KV-sharing, the point is to spend attention capacity where it is most useful, instead of giving every layer the same budget. Specifically, full-attention layers are expensive because they look across the whole context, so Laguna gives them fewer query heads compared to sliding window attention modules. (Besides, another smaller implementation detail is that Laguna also applies per-head attention-output gating; this is somewhat similar to Qwen3-Next and others, which I also omit here since I covered it in earlier articles.) 4. Compressed Convolutional Attention (ZAYA1-8B) Similar to Laguna, ZAYA1-8B is another new player on the open-weight market. It is developed by Zyphra , and one of the interesting details around the release is that the model was trained on AMD GPUs rather than the more common NVIDIA GPU (or Google TPU) setup. The main architecture detail, though, is Compressed Convolutional Attention (CCA), used together with grouped-query attention. Unlike MLA-style designs that mainly use a latent representation as a compact KV cache format, CCA performs the attention operation directly in the compressed latent space, but more on that later. (Sidenote: the ZAYA1-8B config.json lists 80 alternating layer entries rather than 40 conventional transformer blocks. These entries alternate between CCA/GQA attention and MoE feed-forward layers. But for the architecture figure, it is more convenient to visualize this as 40 repeated attention + MoE pairs, which is conceptually equivalent.) Figure 11: Zaya1 (8B) with transformer blocks featuring compressed convolutional attention. As hinted at in the figure above, ZAYA1-8B uses Compressed Convolutional Attention (CCA) together with a 4:1 GQA layout. The key point is that its attention block is built around CCA rather than a standard sliding-window attention block. What is Compressed Convolutional Attention? I would say CCA is related in spirit to Multi-head Latent Attention (MLA) in DeepSeek’s models, since both introduce a compressed latent representation into the attention block. However, they use that latent space differently. MLA mainly uses the latent representation to reduce the KV cache. In MLA, the KV tensors are stored compactly and then projected into the attention-head space for the actual attention computation. Figure 12: Regular Multi-head Attention (MHA) and Multi-head Latent (MLA) attention side by side. CCA compresses Q, K, and V and performs the attention operation directly in the compressed latent space. This is why CCA can reduce not only KV cache size, but also attention FLOPs during prefill and training. Figure 13: Multi-head Latent Attention (MLA) and Compressed Convolutional Attention (CCA) side by side. As Figure 13 above illustrates, in CCA, the compressed, latent representations enter the attention mechanism directly, and the resulting compressed attention vector is then up-projected. Note that this is called Compressed Convolutional Attention, not just Compressed Attention, since there is an additional convolutional mixing happening on the latent K and Q representations. The convolutional mixing part is not shown in Figure 12, because it would have been too crammed, but it’s relatively straightforward. As hinted at in Figure 12, the convolutional mixing happens directly on the compressed Q and K tensors. The point is that compression makes Q, K, and V narrower, which saves compute and cache, but it can also make attention less expressive. The convolutions are a cheap way to give the compressed Q and K vectors more local context before they are used to compute attention scores. (The convolutional mixing is only applied to Q and K, not V, because Q and K determine the attention scores, while V represents the content that gets averaged via these scores). Figure 14: conceptual overview of the sequence-mixing convolution Next to the sequence mixing shown in Figure 13, there is also a channel mixing component. It’s in principle similar though, so I am omitting the illustration. CCA appears to be a Zyphra-introduced attention mechanism that predates the ZAYA1-8B technical report . The standalone CCA paper, Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space , was first posted in October 2025 and explicitly introduces CCA. ZAYA1-8B then uses this mechanism as one of the core pieces. But the question is, “is it better than MLA”? According to the CCA paper’s own experiments, yes, they report CCA outperforming MLA under comparable compression settings. Figure 15: Annotated figures from the CCA paper, https://arxiv.org/abs/2510.04476 . Overall, the interesting part here is really the new attention mechanism. The model also uses a pretty extreme (= very sparse) MoE setup, with only one routed expert active per token, but that part is more familiar. CCA is more unusual because it performs the attention operation directly in a compressed latent space, and then uses convolutional mixing on the compressed Q and K representations to make this compressed attention less limiting. So, in short, ZAYA1-8B is not only trying to save compute in the feed-forward layers, but also in the attention mechanism itself. 5. CSA/HCA, mHC, and Compressed Attention Caches (DeepSeek V4) DeepSeek V4 was the biggest release of the year so far, both in terms of hype and model size. Interestingly, DeepSeek V4-Pro is also the most parameter-sparse MoE among the models in the table below, measured by active-parameter share, as summarized in the table below. Figure 16: Percent active parameter plot for MoE models. You can also find an HTML version at https://sebastianraschka.com/llm-architecture-gallery/active-parameter-ratio/ . Caveat: active parameter share is only one lens. It does not capture KV cache size, attention pattern, context length, routing overhead, hardware efficiency, or training quality. But it is a helpful, quick check when comparing sparse models. There’s a lot to say about DeepSeek V4, but since it’s been all over the news already, and to stay on topic regarding architecture tweaks, I will focus on the two most relevant parts that are new compared to previous architectures: mHC for a wider residual pathway, CSA/HCA for long-context attention compression and sparsity Figure 17: DeepSeek V4-Pro architecture overview. 5.1 Manifold-Constrained Hyper-Connections (mHC) Let’s start with the mHC component of DeepSeek V4. This goes back to a research paper that the DeepSeek team shared last year (31 Dec 2025, mHC: Manifold-Constrained Hyper-Connections ). However, in this paper, the technique was only tested on an experimental 27B scale model. Now, we see it in their flagship release, which is a good sign that this idea actually works well in production. The main idea behind mHC here is to modernize the design of the residual connections inside the transformer block, which is refreshing, because architecture tweaks are usually focused on the attention mechanism, normalization layer placement, and MoE parts. Now, mHC is based on previous work on hyper-connections (see Hyper-connections by Zhu et al., 2024), which we should briefly discuss first. Hyper-connections essentially modify the single residual stream inside the transformer block by replacing it with several parallel residual streams and learned mappings between them. (For those new to residual connections, I made a video on residual neural networks many years ago, where I explained the general mechanism.) The idea behind hyper-connections is to widen the residual stream. We can think of this as keeping several parallel residual streams, with an additional Res Mapping linear transformation that mixes them across layers. Since the Attention or MoE layer itself still operates on the normal hidden size, hyper-connections also add a Pre Mapping that combines the parallel residual streams into one normal hidden vector for the layer, and a Post Mapping that distributes the layer output back across the parallel residual streams. This is visually summarized in the figure below. Figure 18: Regular transformer block (top) vs transformer block with hyper-connections (bottom) using annotated figures from the mHC paper, https://arxiv.org/abs/2512.24880 . The figure below focuses on the attention-layer portion of the transformer block, but the same concept applies to the second residual branch around the MoE layer. The purpose of hyper-connections is to make the residual pathway more expressive without making the actual Attention or MoE layer wider. This is only mildly more expensive in FLOPs because the extra mappings operate over the small residual-stream axis, for example, n = 4 in DeepSeek V4, not over a huge hidden dimension. In the original hyper-connections paper, the 7B OLMo MoE experiment goes from 13.36G to 13.38G FLOPs per token, which is basically unchanged. In terms of reported gains, there were modest (but consistent) improvements, as shown in the figure below. (However, only looking at FLOPs is a bit simplistic. The widened residual state still has to be stored, moved through memory, mixed, etc. So the practical overhead can come more from memory traffic and implementation complexity than from arithmetic, which is not explicitly measured. However, given that DeepSeek V4 is all about efficiency, it seems to be a worthwhile addition.) Figure 19: Hyper-connections performance versus baseline, using an annotated figure from the hyper-connections paper, https://arxiv.org/abs/2409.19606 . Also, as shown in the figure above, metrics reached the baseline’s performance using roughly half the training tokens. The main change from regular hyper-connections (HC) to manifold-constrained hyper-connections (mHC) is that the mappings are no longer left unconstrained. In regular HC, the Res Mapping is a learned matrix that mixes the parallel residual streams, but stacking many such matrices can amplify or shrink signals unpredictably. In mHC, this residual mapping is projected onto the manifold of doubly stochastic matrices, meaning all entries are non-negative and each row and column sums to 1. This makes the residual mixing behave more like a stable redistribution of information across streams. The Pre Mapping and Post Mapping are also constrained to be non-negative and bounded, which avoids cancellation when reading from and writing back into the widened residual state. In short, mHC keeps the richer residual mixing of HC, but adds constraints so it scales more safely, which becomes more relevant for larger (deeper) models. Otherwise, the main idea of using parallel residual streams remains, as shown in the figure below. Figure 20: Transformer block with hyper-connections (HC) and manifold-constrained hyper-connections (mHC) using annotated figures from the mHC paper, https://arxiv.org/abs/2512.24880 . In the mHC paper, using a 27B parameter model for the experiments, the DeepSeek team’s optimized implementation (with fusion, recomputation, and pipeline scheduling) adds only 6.7% additional training time overhead for 4 residual streams (n = 4) throughout all transformer blocks compared to the single-stream baseline. To sum up this section, HC/mHC changes how information is carried around these layers by replacing the single residual stream with several interacting residual streams, with the additional stability constraints added in mHC, while adding minimal compute overhead. Also, it pairs well with the CSA/HCA attention changes, which modify other parts of the transformer block, which I will discuss below. 5.2 Compressed Attention via CSA and HCA The other major DeepSeek V4 architecture change is on the attention side. Again, the motivation is that at very long context lengths, attention becomes expensive not only because of the attention score computation, but also because the KV cache grows with the sequence length. DeepSeek V4 addresses this issue with a hybrid of two compressed-attention mechanisms, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). For a refresher, I recommend checking out my previous “ A Visual Guide to Attention Variants in Modern LLMs ” article, which covers Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA), among others. The first thing to note is that CSA/HCA in DeepSeek V4 is a different kind of compression than the MLA-style compression used in DeepSeek V2/V3. Where MLA mainly compresses the per-token KV representation, CSA and HCA compress along the sequence dimension. So, instead of keeping one full (or compressed) KV entry for every previous token, they summarize groups of tokens into fewer compressed KV entries. Consequently, the cache gets shorter. DeepSeek V4 also uses compact compressed entries and shared-KV attention, but the main distinction from MLA is the sequence-length compression. This is illustrated in the figure below. Figure 21: Conceptual comparison of MLA-style per-token latent caching, CSA, and HCA. MLA compresses the stored KV representation but keeps one latent entry per token. CSA shortens the sequence more mildly with m=4 and sparse top-k selection, while HCA uses much heavier sequence compression with m’=128 and dense attention over the shorter cache. The quality tradeoff for CSA/HCA is also different from MLA. As shown in the figure above, MLA compresses the representation stored for each token, but it still keeps one latent KV entry per token. CSA and especially HCA go further by reducing the number of sequence entries themselves, so the model gives up some token-level info in exchange for much lower long-context cost. Again, it’s all about reducing long-context cost, but this trade-off can hurt modeling quality if the compression is too strong, which is why DeepSeek V4 does not rely on one compression scheme alone but alternates between CSA and HCA. CSA uses a milder compression rate and a DeepSeek Sparse Attention (DSA)-style selector, HCA uses much heavier compression for cheaper global coverage, and both keep a local sliding-window branch for recent uncompressed tokens. This sparse selection in CSA builds on DeepSeek Sparse Attention (DSA), which I discussed in more detail in my earlier DeepSeek V3.2 write-up . HCA is the more aggressive variant of the two. It compresses every 128 tokens into one compressed KV entry, but then uses dense attention over those heavily compressed entries. In other words, CSA keeps more details but uses sparse selection, while HCA keeps far fewer entries and can afford dense attention over them, as illustrated in the figure below. This makes the two mechanisms somewhat complementary, which is why DeepSeek V4 interleaves CSA and HCA layers rather than using only one of them. Figure 22: CSA selects a sparse set of compressed history blocks, while HCA attends densely over more heavily compressed blocks. Both paths also include recent uncompressed KV entries through a 128-token sliding-window branch. The DeepSeek V4 paper reports that, at a 1M-token context length, DeepSeek V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache size compared with DeepSeek V3.2, which uses MLA and DeepSeek Sparse Attention (DSA). DeepSeek V4-Flash is even smaller, at 10% of the FLOPs and 7% of the KV cache size relative to DeepSeek V3.2. Figure 23. Reported 1M-context efficiency numbers from the DeepSeek V4 paper, relative to DeepSeek V3.2. By the way, I would not describe CSA/HCA as “better” than MLA in a general sense. CSA/HCA is a more aggressive long-context design. And it’s also more complicated for sure. Unfortunately, there is no ablation study in the paper. But overall, the paper reports strong overall modeling results, including DeepSeek V4-Flash-Base outperforming DeepSeek V3.2-Base on a majority of base-model benchmarks and strong 1M-token retrieval results, but these results are for the full DeepSeek V4 recipe, which also includes better data, Muon-based optimization, mHC, precision/storage optimizations, and training/inference-system changes. Personally, for now, I would treat CSA/HCA as an efficiency-focused long-context design that appears to preserve modeling quality well in their large flagship model(s) but not necessarily universally better than MLA. 6. Conclusion Overall, the interesting pattern this year is that most new open-weight models try to make long-context inference cheaper without just shrinking the model in terms of total parameters. For instance, Gemma 4 reduces KV-cache memory with cross-layer KV sharing and adds capacity via per-layer embeddings. Laguna XS.2 tweaks how much attention capacity each layer gets. ZAYA1-8B moves attention into a compressed latent space. DeepSeek V4 adds constrained residual-stream mixing and compressed long-context attention. Figure 24: The evolution from GPT-2 (2019) to DeepSeek V4-Pro (2026) For instance, I am fairly certain that someone who is diving into LLM architectures for the first time will be totally overwhelmed when seeing the DeepSeek V4 source code. However, by starting with the original decoder-style LLM (GPT/GPT-2) and then gradually adding / learning about these new components one at a time, we can keep the learning effort manageable. The moral of the story, I guess, is to keep learning, one architecture at a time :). By the way, I am very excited to share that I finished writing Build A Reasoning Model (From Scratch) and all chapters are in early access now. The publisher and I worked hard on the final layouts in the past month, and it’s going to be send to the printer this week. (Good news: the print version will be in color this time!) This is probably my most ambitious book so far. I spent about 1.5 years writing it, and a large number of experiments went into it. It is also probably the book I worked hardest on in terms of time, effort, and polish, and I hope you’ll enjoy it. Build a Reasoning Model (From Scratch) on Manning and Amazon . The main topics are evaluating reasoning models inference-time scaling self-refinement reinforcement learning distillation Amazon (pre-order of Kindle ebook and print paperback) Manning (complete book in early access , pre-final layout, 528 pages)

0 views
Evan Hahn 2 weeks ago

Make ZIP files smaller with ZIP Shrinker

I built ZIP Shrinker, a little browser tool to shrink ZIP files. It also works with formats that are secretly ZIPs underneath, like APK, EPUB, JAR, and many more. Try it out! At a high level, this tool (1) re-compresses every file in the ZIP archive with higher compression (2) removes all metadata (3) removes entries for directories. ZIP files are typically compressed with an algorithm called Deflate . There are a few tools that can re-compress Deflate data and make it smaller, usually by spending more time on the computation. I took one of these tools, libdeflate , and applied it to each compressed entry in the ZIP. I chose libdeflate because of its performance; alternatives like Zopfli can achieve marginally smaller results but take much longer. I created libdeflate.js , a WebAssembly wrapper for libdeflate, as part of this work. (I always relish my time working with WASM!) Each entry in a ZIP file can contain additional metadata like comments. These aren’t typically used, and if they’re there, my shrinker removes them. This usually doesn’t save too many bytes, but it doesn’t hurt. Removing directories is a slightly spicier decision. Usually, the existence of a file entry implies the existence of the directory it’s inside. For example, implies the existence of the directory. Some ZIPs include separate entries for directories, but because most extractors don’t need them, I remove those. This has the side effect of removing empty directories— let me know if that’s a problem for you. If you want to see how the whole project works, check out the full source code . I tested several ZIPs to see what this tool could do. Some anecdotal results: Not particularly scientific, but useful to see. This proof-of-concept shows that you can make ZIP files smaller without sacrificing backwards compatibility. It could be useful for sending an archive to someone, but could also be useful to reduce bandwidth and server costs. For example, if Project Gutenberg re-compressed all their EPUB books with this method, they might be able to save some money. Of course, ZIP isn’t always the most efficient format. Typically, other archives like can be smaller. But those aren’t backwards-compatible! ZIP also supports compression methods other than Deflate. They’re atypical, but you could use them to achieve a smaller result, too. Give my tool a try if you want a smaller ZIP.

0 views
Aran Wilkinson 2 weeks ago

Introducing jjw: a workspace manager for jj

I released jjw, a Go CLI for managing jj workspaces with bookmarks and lifecycle hooks. This post explains why I built it, how it works, and how to get started.

0 views
Unsung 2 weeks ago

Speaking of wiggling the mouse

In light of a recent Googlebook announcement that uses a mouse wiggle gesture for AI (which to me doesn’t seem like a pleasant interaction), some of us were talking about how, on macOS, mouse wiggle helps you locate the cursor by making it bigger. I am maybe a sucker for videos and podcasts where people start laughing, but here we go – a very short video about a version of Linux that “does not limit how big your pointer can get if you wiggle the mouse pointer”: = 2x) and (width >= 700px)" srcset="https://unsung.aresluna.org/_media/speaking-of-wiggling-the-mouse/yt1.2096w.avif" type="image/avif"> = 3x) or (width >= 700px)" srcset="https://unsung.aresluna.org/_media/speaking-of-wiggling-the-mouse/yt1.1600w.avif" type="image/avif"> #humor #mouse #youtube

0 views
<antirez> 2 weeks ago

A few words on DS4

I didn’t expect DwarfStar 4 (https://github.com/antirez/ds4) to become so popular so fast. It is clear that there was a need for single-model integration focused local AI experience, and that a few things happened together: the release of a quasi-frontier model that is large and fast enough to change the game of local inference, and the fact that it works extremely well with an extremely asymmetric quants recipe of 2/8 bit, so that 96 or 128GB of RAM are enough to run it. And, of course: all the experience produced by the local AI movement in the latest years, that can be leveraged more promptly because of GPT 5.5 (otherwise you can’t build DS4 in one week — and even with all this help you need to know how to gently talk to LLMs). The last week was funny and also tiring, I worked 14 hours per day on average. My normal average is 4/6 since early Redis times, but the first few months of Redis were like that. So, what’s next? Is this a project that starts and ends with DeepSeek v4 Flash? Nope, the model can change over time. The space will be occupied, in my vision, by the best current open weights model that is *practically fast* on a high end Mac or “GPU in a box” gear (like the DGX Spark and other similar setups). I bet that the next contender is DeepSeek v4 Flash itself, in the new checkpoint that will be released and, hopefully, a version specifically tuned for coding, and who knows, other expert-variants (not in the sense of MoE experts) maybe. For local inference, to have a ds4-coding, ds4-legal, ds4-medical models make a lot of sense, after all. You just load what you need depending on the question. It is the first time since I play with local inference (I play with it since the start) that I find myself using a local model for serious stuff that I would normally ask to Claude / GPT. This, I think, is really a big thing. It is also the first time that using vector steering I can enjoy an experience where the LLM can be used with more freedom. DeepSeek v4 Flash is really an impressive model, no doubt about that. If you can imagine in your mind the small good local model experience as A, and the frontier model you use online as B, DS4 is a lot more B than A. I can’t wait for the new releases, honestly (btw, thank you DeepSeek). So, after those chaotic first days, I hope the project will focus on: quality benchmarks, potentially adding a coding agent that is also part of the project, a hardware setup here in my home that can run the CI test in order to ensure long term quality, more ports, and finally but as a very important point: distributed inference (both serial and parallel). For now, thank you for all the support: it was really appreciated :) AI is too critical to be just a provided service. Comments

0 views