
Beyond Standard LLMs

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance.

After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions about what I think of alternative approaches. (I also recently gave a short talk about this at the PyTorch Conference 2025, where I promised attendees a follow-up write-up of these alternative approaches.) So here it is!

Figure 1: Overview of the LLM landscape. This article covers the architectures surrounded by black frames. The decoder-style transformers are covered in my The Big LLM Architecture Comparison article. Other, non-framed architectures may be covered in future articles.

Note that each of the topics shown in the figure above would ideally deserve at least a whole article of its own (and will hopefully get one in the future). So, to keep this article at a reasonable length, many sections are kept fairly short. However, I hope this article is still useful as an introduction to the interesting LLM alternatives that have emerged in recent years.

PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here.)

1. Transformer-Based LLMs

Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1, Mistral Small 3.1, and many more. (The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.)

Figure 2: An overview of the most notable decoder-style transformers released in the past year.

Since I have talked and written about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you'd like deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article.

(Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see those transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.)

Figure 3: A subset of the architectures discussed in my The Big LLM Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article.

If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well.
Moreover, as discussed in The Big LLM Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn't work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that have emerged in recent years.

2. (Linear) Attention Hybrids

Before we discuss the "more different" approaches, let's first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There has recently been a revival of linear attention mechanisms to improve the efficiency of LLMs.

The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today's LLMs. Besides traditional multi-head attention, it's also used in the more efficient flavors like grouped-query attention, sliding-window attention, and multi-head latent attention, as discussed in my talk.

2.1 Traditional Attention and Quadratic Costs

The original attention mechanism scales quadratically with the sequence length:

Attention(Q, K, V) = softmax(QKᵀ / √d) V, which requires materializing an n×n score matrix QKᵀ.

This is because the query (Q), key (K), and value (V) inputs are n-by-d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article.)

Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention is due to the sequence length n.

2.2 Linear attention

Linear attention variants have been around for a long time, and I remember seeing tons of papers on them in the early 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism as

Attention(Q, K, V) ≈ ϕ(Q) (ϕ(K)ᵀ V), up to a normalization factor.

Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x) + 1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QKᵀ.

I don't want to dwell too long on these older attempts. But the bottom line is that they reduced both time and memory complexity from O(n²) to O(n), which makes attention much more efficient for long sequences. However, they never really gained traction because they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM.
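To make the difference concrete, here is a small, self-contained PyTorch sketch (my own, not from the original post) that contrasts the two computations for a single head, ignoring batching and causal masking. Note that the kernelized variant is not numerically equivalent to softmax attention; it is a different mixing rule that simply avoids forming the n-by-n matrix.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 1024, 64                          # sequence length, head dimension
Q, K, V = (torch.randn(n, d) for _ in range(3))

# Standard scaled-dot-product attention: materializes an n-by-n score matrix,
# so compute and memory grow quadratically with the sequence length n.
scores = Q @ K.T / d**0.5                # shape: (n, n)
out_softmax = torch.softmax(scores, dim=-1) @ V

# Kernelized linear attention in the spirit of Katharopoulos et al. (2020),
# using the feature map phi(x) = elu(x) + 1. Keys and values are first
# aggregated into d-by-d statistics, so no n-by-n matrix is ever formed.
phi = lambda x: F.elu(x) + 1.0
Qp, Kp = phi(Q), phi(K)                  # shapes: (n, d)
kv = Kp.T @ V                            # (d, d) summary of all keys and values
z = Kp.sum(dim=0)                        # (d,) normalization statistics
out_linear = (Qp @ kv) / (Qp @ z).unsqueeze(-1)

print(out_softmax.shape, out_linear.shape)   # both torch.Size([1024, 64])
```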
2.3 Linear Attention Revival

In the second half of this year, there has been a revival of linear attention variants, as well as a bit of back-and-forth from some model developers, as illustrated in the figure below.

Figure 5: An overview of the linear attention hybrid architectures.

The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B-parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discuss in more detail below. Then, in September, the DeepSeek team announced DeepSeek V3.2. (DeepSeek V3.2's sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it's fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.)

All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants.

Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B-parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are important not only for regular chat sessions but also for agentic applications. This could have been a turning point suggesting that linear attention may not be worth pursuing after all.

However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I want to discuss in the next few sections as one example of a hybrid attention architecture.

2.4 Qwen3-Next

Let's start with Qwen3-Next, which replaced the regular attention mechanism with a Gated DeltaNet + Gated Attention hybrid. In terms of memory usage, this helps enable the native 262k-token context length (the previous 235B-A22B model supported 32k natively, and 131k with YaRN scaling). Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio, as shown in the figure below.

Figure 6: Qwen3-Next with gated attention and Gated DeltaNet.

As depicted in the figure above, the attention mechanism is implemented as either gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between these two and, as mentioned earlier, they alternate in a 3:1 ratio: three Gated DeltaNet blocks are followed by one gated attention block, and this pattern repeats across the model. Otherwise, the architecture is pretty standard and similar to Qwen3:

Figure 7: A previous "regular" Qwen3 model (left) next to Qwen3-Next (right).

So, what are gated attention and Gated DeltaNet?

2.5 Gated Attention

Before we get to the Gated DeltaNet itself, let's briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses "gated attention". This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book) below for illustration purposes:
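(The original listing is not reproduced here, so below is a minimal single-head sketch of the idea, with hypothetical parameter names rather than the exact code from the book repository: standard causal attention whose output is multiplied by a sigmoid gate computed from the same input.)

```python
import torch
import torch.nn as nn

class GatedCausalAttention(nn.Module):
    """Minimal single-head causal attention with a sigmoid output gate (illustrative sketch)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.W_gate = nn.Linear(d_in, d_out, bias=False)   # extra projection for the gate

    def forward(self, x):                                   # x: (batch, n_tokens, d_in)
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = queries @ keys.transpose(1, 2) / keys.shape[-1] ** 0.5
        n = x.shape[1]
        causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal_mask, float("-inf"))
        context = torch.softmax(scores, dim=-1) @ values    # regular attention output

        gate = torch.sigmoid(self.W_gate(x))                # per-feature values in (0, 1)
        return gate * context                               # scale the attention output up or down

attn = GatedCausalAttention(d_in=32, d_out=32)
print(attn(torch.randn(2, 8, 32)).shape)                    # torch.Size([2, 8, 32])
```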
As we can see, after computing attention as usual, the model uses a separate gating signal derived from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale certain features up or down dynamically. The Qwen3-Next developers state that this helps with training stability:

[...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model.

In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update.

2.6 Gated DeltaNet

Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network) is Qwen3-Next's linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper, as mentioned earlier.

Gated DeltaNet was originally proposed as an improved version of Mamba2, combining the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later).

(Side note: Readers familiar with the classic machine learning literature can think of this as similar to Hebbian learning, inspired by biology: "Cells that fire together wire together." It's basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.)

Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of a logistic sigmoid activation, as illustrated below. (The SiLU choice is likely meant to improve gradient flow and stability over the standard sigmoid.)

Figure 8: Gated attention compared to Gated DeltaNet.

However, as shown in the figure above, next to the output gate, the "gated" in Gated DeltaNet also refers to several additional gates:
α (decay gate) controls how fast the memory decays or resets over time,
β (update gate) controls how strongly new inputs modify the state.

In code, a simplified version of the Gated DeltaNet depicted above (without the convolutional mixing) can be implemented as follows (the code is inspired by the official implementation by the Qwen3 team):
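(The author's original listing is not included here, so the following is my own minimal single-head sketch of the gated delta-rule recurrence rather than the official Qwen3-Next code: the projection names are hypothetical, the real implementation processes tokens in parallel chunks for speed, and the output gate is omitted.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDeltaNetHead(nn.Module):
    """Simplified single-head gated delta-rule recurrence (illustrative sketch only)."""
    def __init__(self, d_in, d_head):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_head, bias=False)
        self.W_k = nn.Linear(d_in, d_head, bias=False)
        self.W_v = nn.Linear(d_in, d_head, bias=False)
        self.W_alpha = nn.Linear(d_in, 1)   # decay gate: how much old memory to keep
        self.W_beta = nn.Linear(d_in, 1)    # update gate: how strongly to write

    def forward(self, x):                                   # x: (batch, n_tokens, d_in)
        b, n, _ = x.shape
        q = self.W_q(x)
        k = F.normalize(self.W_k(x), dim=-1)                # L2-normalized keys, common in DeltaNet-style layers
        v = self.W_v(x)
        alpha = torch.sigmoid(self.W_alpha(x))              # (b, n, 1), values in (0, 1)
        beta = torch.sigmoid(self.W_beta(x))                # (b, n, 1), values in (0, 1)

        d_head = q.shape[-1]
        S = x.new_zeros(b, d_head, d_head)                  # fixed-size memory state
        outputs = []
        for t in range(n):                                  # recurrent: O(n) in the sequence length
            k_t, v_t, q_t = k[:, t], v[:, t], q[:, t]       # each (b, d_head)
            S = alpha[:, t].unsqueeze(-1) * S               # gated decay of the memory
            pred = torch.einsum("bij,bj->bi", S, k_t)       # what the memory currently predicts for k_t
            delta = v_t - pred                              # delta rule: correct the prediction error
            S = S + beta[:, t].unsqueeze(-1) * torch.einsum("bi,bj->bij", delta, k_t)
            outputs.append(torch.einsum("bij,bj->bi", S, q_t))
        return torch.stack(outputs, dim=1)                  # (b, n, d_head)

head = GatedDeltaNetHead(d_in=32, d_head=16)
print(head(torch.randn(2, 8, 32)).shape)                    # torch.Size([2, 8, 16])
```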
(Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use, to keep the code more readable and focus on the recurrent aspects.)

So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends to, or looks at, every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it's still the regular scaled-dot-product attention that scales quadratically with the context length. As a refresher, scaled-dot-product attention is computed as softmax(QKᵀ)V, where Q and K are n-by-d matrices, n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an n-by-n attention matrix that is multiplied by an n-by-d value matrix V.

Figure 9: The traditional attention mechanism (again), which scales quadratically with the number of tokens n.

In Gated DeltaNet, there's no n-by-n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what the recurrent state update in the for-loop above implements, where S is the state that gets updated for each time step t. And the gates control how that memory changes:
α (alpha) regulates how much of the old memory to forget (decay).
β (beta) regulates how much the current token at time step t updates the memory.
(And the final output gate, not shown in the snippet above, is similar to gated attention; it controls how much of the output is kept.)

So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with the context length.

The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention. Gated DeltaNet can, to some extent, still capture context, but it has to go through the memory (S) bottleneck. That memory has a fixed size and is thus more efficient, but it compresses past context into a single hidden state, similar to RNNs. That's why the Qwen3-Next and Kimi Linear architectures don't replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier.

Above, we discussed the advantage of DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don't grow the KV cache. (For more information about KV caching, see my Understanding and Coding the KV Cache in LLMs from Scratch article.) Instead, as mentioned earlier, they keep a fixed-size recurrent state, so memory stays constant with the context length.

For a regular multi-head attention (MHA) layer, we can compute the KV cache size roughly as

KV cache size ≈ 2 × n_layers × n_tokens × emb_dim × bytes_per_element.

(The 2 multiplier is there because we store both keys and values in the cache.)

For the simplified DeltaNet version implemented above, we have

state size ≈ n_layers × n_heads × head_dim × head_dim × bytes_per_element.

Note that the memory size doesn't have a context length (n_tokens) dependency. Also, we only store the memory state S instead of separate keys and values, so the 2× factor disappears. However, note that we now have a quadratic head_dim term in here. This comes from the state S, which is a head_dim-by-head_dim matrix per head. But that's usually nothing to worry about, as the head dimension is usually relatively small. For instance, it's 128 in Qwen3-Next. The full version with the convolutional mixing is a bit more complex, including the kernel size and so on, but the formulas above should illustrate the main trend and motivation behind the Gated DeltaNet.
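As a quick sanity check on these rough formulas, here is a tiny back-of-the-envelope script (mine, using the same settings as the figure below; the repository linked in the figure caption contains the authoritative calculation, which also accounts for the 3:1 layer mix):

```python
def mha_kv_cache_bytes(n_tokens, emb_dim=2048, n_layers=48, bytes_per_elem=2):
    # keys and values (factor 2), each of size n_tokens x emb_dim, for every layer (bf16 = 2 bytes)
    return 2 * n_layers * n_tokens * emb_dim * bytes_per_elem

def deltanet_state_bytes(emb_dim=2048, n_heads=16, n_layers=48, bytes_per_elem=2):
    head_dim = emb_dim // n_heads              # 128
    # one head_dim x head_dim memory state per head and layer; no n_tokens dependency
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem

for n_tokens in (1_000, 10_000, 100_000):
    print(f"{n_tokens:>7} tokens | MHA KV cache: {mha_kv_cache_bytes(n_tokens) / 1e9:6.2f} GB"
          f" | DeltaNet state: {deltanet_state_bytes() / 1e9:6.3f} GB")
```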
Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet.

2.8 Kimi Linear vs. Qwen3-Next

Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there's one block that uses full attention, as shown in the figure below.

Figure 11: Qwen3-Next and Kimi Linear side by side.

Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section).

The MLA in Kimi Linear, depicted in the upper-right box in Figure 11 above, does not use the sigmoid gate. This omission was intentional so that the authors could compare the architecture more directly to standard MLA; however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in the multi-head latent attention (MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long-context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article.

2.9 Kimi Delta Attention

Kimi Linear modifies the linear attention mechanism of Qwen3-Next with the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, which, in turn, improves long-context reasoning.
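To make the difference concrete, here is a small shape-level sketch (mine, with hypothetical projection names; both papers parameterize the actual decay more carefully): a Gated DeltaNet-style gate produces one decay value per head, while a KDA-style gate produces one per head and feature channel.

```python
import torch
import torch.nn as nn

batch, n_tokens, d_in = 2, 8, 256
n_heads, head_dim = 4, 64
x = torch.randn(batch, n_tokens, d_in)

# Gated DeltaNet-style decay gate: one scalar per token and head
to_alpha_scalar = nn.Linear(d_in, n_heads)
alpha_scalar = torch.sigmoid(to_alpha_scalar(x))            # shape: (2, 8, 4)

# Kimi Delta Attention-style decay gate: one value per token, head, and channel
to_alpha_channel = nn.Linear(d_in, n_heads * head_dim)
alpha_channel = torch.sigmoid(to_alpha_channel(x)).view(batch, n_tokens, n_heads, head_dim)

print(alpha_scalar.shape, alpha_channel.shape)              # (2, 8, 4) vs. (2, 8, 4, 64)
```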
In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next's gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.)

There's no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed.

Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as Gated DeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having higher benchmark performance.

Furthermore, according to the ablation studies in the DeepSeek-V2 paper, MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variants once again promising for larger state-of-the-art models. That being said, Kimi Linear is a 48B-parameter model, which is about 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model.

2.10 The Future of Attention Hybrids

Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example, Kimi Linear, compared to regular full attention, achieves a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art.

3. Text Diffusion Models

A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models.

You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) and were later implemented, scaled, and popularized by Stable Diffusion and others.

Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model's task is to learn how to remove the noise (from right to left).

3.1 Why Work on Text Diffusion?

With the Diffusion-LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers adopt diffusion models for generating text. And I've seen a whole bunch of text diffusion papers in 2025; when I just checked my paper bookmark list, there were 39 text diffusion models on it! Given the rising popularity of these models, I thought it was finally time to talk about them.

Figure 14: This section covers text diffusion models.

So, what's the advantage of diffusion models, and why are researchers looking into them as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let's refer to them simply as autoregressive LLMs. Now, the main selling point of text diffusion-based LLMs (let's call them "diffusion LLMs") is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps that each produce all tokens in parallel, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response.

3.2 The Denoising Process

The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year.

Figure 15: Illustration of the denoising process using the 8B LLaDA model.

As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates).

Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call architectures without a causal mask "bidirectional", as they have access to all sequence elements at once. (Note that this is similar to the BERT architecture, which is called "encoder-style" for historical reasons.)

So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations.
Text, however, consists of discrete tokens, so we can't directly add or remove "noise" in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively "denoises" (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework.
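As a toy illustration of this masking-and-unmasking view (my own sketch, not the LLaDA implementation; the `model`, the fixed number of steps, and the confidence-based unmasking schedule are all stand-ins for what a real diffusion LLM learns and tunes):

```python
import torch

def denoise(model, prompt_ids, answer_len, mask_id, num_steps=8):
    """Start from an all-[MASK] answer and unmask a few high-confidence tokens per step (toy sketch)."""
    x = torch.cat([prompt_ids, torch.full((answer_len,), mask_id)])
    masked = torch.arange(len(prompt_ids), len(x))              # answer positions still masked
    per_step = max(1, answer_len // num_steps)
    for _ in range(num_steps):
        if len(masked) == 0:
            break
        logits = model(x.unsqueeze(0)).squeeze(0)               # bidirectional pass, no causal mask
        probs = torch.softmax(logits[masked], dim=-1)
        conf, pred = probs.max(dim=-1)
        keep = conf.argsort(descending=True)[:per_step]         # unmask the most confident positions first
        x[masked[keep]] = pred[keep]
        masked = masked[torch.isin(masked, masked[keep], invert=True)]
    return x

# Toy usage with an untrained stand-in "model" (embedding + linear head):
vocab, dim = 100, 32
dummy_model = torch.nn.Sequential(torch.nn.Embedding(vocab, dim), torch.nn.Linear(dim, vocab))
print(denoise(dummy_model, prompt_ids=torch.tensor([5, 6, 7]), answer_len=6, mask_id=vocab - 1))
```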
Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential to make diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths. The problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper.

Figure 16: Annotated figure from the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding.

For example, consider the following prompt:

> "Pick a random city for travel: New York, New Orleans, Mexico City, or Panama City?"

Suppose we ask the LLM to generate a two-token answer. It might first sample the token "New" according to the conditional probability p(y_t = "New" | X). In the next iteration, it would then condition on the previously generated token and likely choose "York" or "Orleans," since both conditional probabilities p(y_{t+1} = "York" | X, y_t = "New") and p(y_{t+1} = "Orleans" | X, y_t = "New") are relatively high (because "New" frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y_t = "New" | X) and p(y_{t+1} = "City" | X), leading to awkward outputs like "New City." (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.)

In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through the attention layers in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2,000 inference steps in an autoregressive LLM when generating a 2,000-token answer.

It's an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven't used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper:

[...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality.

Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it's possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.)

In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives for smaller, on-device LLMs, or perhaps as replacements for smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state: "Rapid response: Generates content significantly faster than even our fastest model so far." And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains.

Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities.

So far, we have discussed approaches that focus on improving efficiency and making models faster or more scalable. And these approaches usually come at the cost of slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an "understanding of the world." World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, like the other topics of this article, world models would deserve a whole dedicated article (or book) of their own. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models.

Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup.

Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective.
In parallel, the agent learns an internal world model, which serves as a mental simulation of the environment and allows it to predict outcomes and plan actions before executing them in the real world.

The term "world model", as far as I know, was popularized by Ha and Schmidhuber's 2018 paper of the same name, World Models, which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn't have the new interpretation of world models on my radar until Yann LeCun's 2022 article A Path Towards Autonomous Machine Intelligence. It was essentially about mapping an alternative path to AI instead of LLMs. That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids.

Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first world model paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component.

So, how does this differ from a regular code LLM such as Qwen3-Coder? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static, text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. (A small example of what such an execution trace looks like follows at the end of this section.)

Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code's behavior. Annotated figure from https://www.arxiv.org/abs/2510.02387.

At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM.

For a first attempt, it performs surprisingly well and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort).
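To make "predicting the resulting program state" more tangible, here is a small, self-contained Python example (mine, unrelated to the CWM training pipeline or its trace format) that records the kind of line-by-line variable trace a code world model is trained to predict rather than execute:

```python
import sys

def trace_locals(func, *args):
    """Record the local variables visible as each line of `func` is reached (illustrative only)."""
    steps = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return steps

def running_total(n):
    total = 0
    for i in range(n):
        total += i
    return total

for lineno, local_vars in trace_locals(running_total, 3):
    print(lineno, local_vars)   # which line is about to run, and the variable values at that point
```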
Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387.

You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don't always have to be large. In fact, with the Hierarchical Reasoning Model (HRM), a new approach to small recursive transformers has recently gained a lot of attention in the research community.

Figure 21: LLM landscape overview; this section covers small recursive transformers.

More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge.

Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis.

The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model's "thought" or "reasoning" process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper. And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark.

Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871.

In the remainder of this section, let's take a look at TRM in a bit more detail. TRM refines its answer through two alternating updates: (1) it computes a latent reasoning state from the current question and answer, and (2) it then updates the answer based on that latent state. The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights.
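In code, the training loop described above might look roughly like this (my own structural sketch with made-up module and loss names, not the official TRM implementation; the real model operates on token grids and also learns a stopping signal):

```python
import torch
import torch.nn as nn

class TinyRecursiveModel(nn.Module):
    """Heavily simplified structural sketch of TRM-style recursive refinement."""
    def __init__(self, dim=64):
        super().__init__()
        self.latent_net = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.answer_net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def refine(self, x, y, z, n_latent_updates=4):
        for _ in range(n_latent_updates):                   # (1) update the latent reasoning state z
            z = self.latent_net(torch.cat([x, y, z], dim=-1))
        y = self.answer_net(torch.cat([y, z], dim=-1))      # (2) revise the answer y from the latent state
        return y, z

def train_step(model, optimizer, x, y, z, target, n_steps=16, n_no_grad=3):
    """Up to 16 refinement steps; several passes run without gradients, then one pass is backpropagated."""
    for _ in range(n_steps):
        with torch.no_grad():                               # cheap refinement passes
            for _ in range(n_no_grad):
                y, z = model.refine(x, y, z)
        y, z = model.refine(x, y, z)                        # this pass carries gradients
        loss = nn.functional.mse_loss(y, target)            # stand-in loss; TRM also learns when to stop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        y, z = y.detach(), z.detach()
    return y

dim = 64
model = TinyRecursiveModel(dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y0, z0, target = (torch.randn(8, dim) for _ in range(4))
train_step(model, optimizer, x, y0, z0, target)
```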
It's important to note that TRM is not a language model operating on text. However, because (a) it's a transformer-based architecture, (b) reasoning is now a central focus in LLM research and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor), I decided to include it here. While TRM could be extended to textual question-answer tasks in the future, TRM currently works on grid-based inputs and outputs. In other words, both the "question" and the "answer" are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences.

HRM consists of two small transformer modules (each 4 blocks) that communicate across recursion levels. TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that's likely to make it easier to compare against HRM.) TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few. HRM includes an explicit halting mechanism to determine when to stop iterating. TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop iterating.

Performance-wise, TRM performs really well compared to HRM, as shown in the figure below.

Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM).

The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me:

Fewer layers lead to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%.

Attention is not required. Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length.

While HRM and TRM achieve really good reasoning performance on these benchmarks, comparing them to large LLMs is not quite fair. HRM and TRM are specialized models for tasks like ARC, Sudoku, and Maze pathfinding, whereas LLMs are generalists. Sure, HRM and TRM can be adapted to other tasks as well, but they have to be specially trained on each task. So, in that sense, we can perhaps think of HRM and TRM as efficient pocket calculators, whereas LLMs are more like computers, which can do a lot of other things as well.

Still, these recursive architectures are exciting proofs of concept that highlight how small, efficient models can "reason" through iterative self-refinement. Perhaps, in the future, such models could act as reasoning or planning modules embedded within larger tool-using LLM systems. For now, LLMs remain ideal for broad tasks, but domain-specific recursive models like TRM can be developed to solve certain problems more efficiently once the target domain is well understood. Beyond the Sudoku, Maze-finding, and ARC proof-of-concept benchmarks, there are possibly lots of use cases in the physics and biology domains where such models could find use. As an interesting tidbit, the author shared that it took less than $500 to train this model, with 4 H100s for around 2 days. I am delighted to see that it's still possible to do interesting work without a data center.

I originally planned to cover all model categories in the overview figure, but since the article ended up longer than I expected, I will have to save xLSTMs, Liquid Foundation Models, Transformer-RNN hybrids, and State Space Models for another time (although Gated DeltaNet already gave a taste of state space models and recurrent designs).

As a conclusion to this article, I want to repeat the earlier words, i.e., that standard autoregressive transformer LLMs are proven and have stood the test of time so far. They are also, if efficiency is not the main factor, the best we have for now.

Traditional Decoder-Style, Autoregressive Transformers
+ Proven & mature tooling
+ "Well-understood"
+ Scaling laws
+ SOTA
- Expensive training
- Expensive inference (except for aforementioned tricks)

If I were to start a new LLM-based project today, autoregressive transformer-based LLMs would be my first choice. I definitely find the upcoming attention hybrids very promising, and they are especially interesting when working with longer contexts where efficiency is a main concern.
Linear Attention Hybrids
+ Same pros as decoder-style transformers
+ Cuts FLOPs/KV memory for long-context tasks
- Added complexity
- Trades a bit of accuracy for efficiency

On the more extreme end, text diffusion models are an interesting development. I'm still somewhat skeptical about how well they perform in everyday use, as I've only tried a few quick demos. Hopefully, we'll soon see a large-scale production deployment with Google's Gemini Diffusion that we can test on daily and coding tasks, and then find out how people actually feel about them.

Text Diffusion Models
+ Iterative denoising is a fresh idea for text
+ Better parallelism (no next-token dependence)
- Can't stream answers
- Doesn't benefit from CoT?
- Tricky tool-calling?
- Solid models but not SOTA

While the main selling point of text diffusion models is improved efficiency, code world models sit on the other end of the spectrum, where they aim to improve modeling performance. As of this writing, coding models based on standard LLMs are mostly improved through reasoning techniques, yet if you have tried them on trickier challenges, you have probably noticed that they still fall short on many of them. I find code world models particularly interesting and believe they could be an important next step toward developing more capable coding systems.

Code World Model
+ Promising approach to improve code understanding
+ Verifiable intermediate states
- Inclusion of executable code traces complicates training
- Code execution adds latency

Lastly, we covered small recursive transformers such as hierarchical and tiny reasoning models. These are super interesting proof-of-concept models. However, as of today, they are primarily puzzle solvers, not general text or coding models. So, they are not in the same category as the other non-standard LLM alternatives covered in this article. Nonetheless, they are very interesting proofs of concept, and I am glad researchers are working on them.

Right now, LLMs like GPT-5, DeepSeek R1, Kimi K2, and so forth are developed as general-purpose models for free-form text, code, math problems, and much more. They feel like a brute-force, jack-of-all-trades approach that we use on a variety of tasks, from general knowledge questions to math and code. However, when we perform the same task repeatedly, such brute-force approaches become inefficient and may not even be ideal in terms of specialization. This is where tiny recursive transformers become interesting: they could serve as lightweight, task-specific models that are both efficient and purpose-built for repeated or structured reasoning tasks. Also, I can see them as potential "tools" for other tool-calling LLMs; for instance, just as LLMs use Python or calculator APIs to solve math problems, special tiny reasoning models could fill this niche for other types of puzzle- or reasoning-like problems.

Small Recursive Transformers
+ Very small architecture
+ Good generalization on puzzles
- Special-purpose models
- Limited to puzzles (so far)

This has been a long article, but I hope you discovered some of the fascinating approaches that often stay outside the spotlight of mainstream LLMs. And if you've been feeling a bit bored by the more or less conventional LLM releases, I hope this helped rekindle your excitement about AI, because there's a lot of interesting work happening right now! This magazine is a personal passion project, and your support helps keep it alive.
If you'd like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I'm confident you'll get a lot out of these; they explain how LLMs work at a depth you won't find elsewhere.) Thanks for reading, and for helping support independent research!

Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning.

If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot! Your support means a great deal! Thank you!
(Linear) Attention Hybrids Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . 2.1 Traditional Attention and Quadratic Costs The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. 2.2 Linear attention Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. 2.3 Linear Attention Revival In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. 
It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. 2.4 Qwen3-Next Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? 2.5 Gated Attention Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. 2.6 Gated DeltaNet Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. 
The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. 2.8 Kimi Linear vs. Qwen3-Next Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. 
Gated DeltaNet is a linear attention variant that takes inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section). The MLA in Kimi Linear, depicted in the upper right box in Figure 11 above, does not use the sigmoid gate. This omission was intentional so that the authors could compare the architecture more directly to standard MLA; however, they state that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in the multi-head latent attention (MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long-context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. 2.9 Kimi Delta Attention Kimi Linear replaces the linear attention mechanism of Qwen3-Next with the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear uses channel-wise gating, with one decay value per feature dimension. According to the authors, this gives more control over the memory, which, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next's gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There's no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as Gated DeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper, MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variants once again promising for larger state-of-the-art models. That being said, Kimi Linear is a 48B-parameter model, which is still 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. 2.10 The Future of Attention Hybrids Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient.
For example, compared to regular full attention, Kimi Linear achieves a 75% KV cache reduction and up to 6x higher decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state of the art. 3. Text Diffusion Models A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models for generating images, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 (as a successor to generative adversarial networks) and were later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model's task is to learn how to remove the noise (from right to left). 3.1 Why Work on Text Diffusion? With the Diffusion-LM Improves Controllable Text Generation paper in 2022, we started to see the beginning of a trend of researchers adopting diffusion models for generating text. And I've seen a whole bunch of text diffusion papers in 2025. When I checked my paper bookmark list just now, there were 39 text diffusion models on it! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what's the advantage of diffusion models, and why are researchers looking into them as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let's refer to them simply as autoregressive LLMs. Now, the main selling point of text diffusion-based LLMs (let's call them "diffusion LLMs") is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. 3.2 The Denoising Process The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture.
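The control flow behind that animation can be sketched in a few lines: start from an all-[MASK] answer, let the model fill in every masked position, keep only the most confident predictions for this step, and repeat. The Go sketch below uses a stub in place of the real bidirectional transformer and a simple linear unmasking schedule; LLaDA's actual remasking strategy and scoring are more involved, so treat this purely as an illustration of the loop.

```go
package textdiffusion

import "sort"

const maskToken = "[MASK]"

// predictor stands in for the bidirectional transformer: given the current
// partially masked sequence, it returns a guess and a confidence score for
// every position (only the masked ones are used). This is a stub, not LLaDA.
type predictor func(tokens []string) (guesses []string, confidence []float64)

// sample starts from an all-[MASK] answer of the given length and unmasks
// it over numSteps denoising steps, keeping the most confident predictions
// at each step (similar in spirit to low-confidence remasking).
func sample(model predictor, length, numSteps int) []string {
	tokens := make([]string, length)
	for i := range tokens {
		tokens[i] = maskToken
	}
	for step := 1; step <= numSteps; step++ {
		guesses, conf := model(tokens)

		// Collect the positions that are still masked.
		var masked []int
		for i, tok := range tokens {
			if tok == maskToken {
				masked = append(masked, i)
			}
		}
		if len(masked) == 0 {
			break
		}

		// Most confident masked positions first.
		sort.Slice(masked, func(a, b int) bool {
			return conf[masked[a]] > conf[masked[b]]
		})

		// Reveal enough positions so that roughly step/numSteps of the
		// answer is unmasked after this step.
		reveal := length*step/numSteps - (length - len(masked))
		if reveal < 1 {
			reveal = 1
		}
		if reveal > len(masked) {
			reveal = len(masked)
		}
		for _, i := range masked[:reveal] {
			tokens[i] = guesses[i]
		}
	}
	return tokens
}
```

With numSteps much smaller than the answer length, each step reveals many tokens at once, which is where the potential speedup over token-by-token generation comes from.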
We call these architectures without a causal mask "bidirectional" because they have access to all sequence elements at once. (Note that this is similar to the BERT architecture, which is called "encoder-style" for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can't directly add or remove "noise" in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively "denoises" (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. 3.3 Autoregressive vs Diffusion LLMs Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential to make diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is also one of their key strengths. The problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: "Pick a random city for travel: New York, New Orleans, Mexico City, or Panama City?" Suppose we ask the LLM to generate a two-token answer. It might first sample the token "New" according to the conditional probability p(y_t = "New" | X). In the next iteration, it would then condition on the previously generated token and likely choose "York" or "Orleans," since both conditional probabilities p(y_t+1 = "York" | X, y_t = "New") and p(y_t+1 = "Orleans" | X, y_t = "New") are relatively high (because "New" frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y_t = "New" | X) and p(y_t+1 = "City" | X), leading to awkward outputs like "New City." (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as mentioned earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text.
And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2,000 sequential decoding steps an autoregressive LLM needs to generate a 2,000-token answer. 3.4 Text Diffusion Today It's an interesting trend that vision models adopt components from LLMs, like attention and the transformer architecture itself, whereas text-based LLMs are taking inspiration from pure vision models by implementing diffusion for text. Personally, besides trying a few demos, I haven't used diffusion models much yet, but I see a clear trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Another particular downside I see is that diffusion LLMs cannot use tools as part of their chain, because there is no chain. Maybe it's possible to interleave tool calls between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives for smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state: "Rapid response: Generates content significantly faster than even our fastest model so far." And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. 4. World Models So far, we discussed approaches that focus on improving efficiency and making models faster or more scalable. And these approaches usually come at the cost of slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an "understanding of the world." World models have traditionally been developed independently of language modeling, but the recent Code World Models paper from September 2025 has made them directly relevant in this context for the first time. Ideally, like the other topics in this article, world models would deserve a whole dedicated article (or book) by themselves.
However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. 4.1 The Main Idea Behind World Models The original idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world model, which serves as a mental simulation of the environment and allows it to predict outcomes and plan actions before executing them in the real world. The term "world model", as far as I know, was popularized by Ha and Schmidhuber's 2018 paper of the same name, World Models, which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn't have the new interpretation of world models on my radar until Yann LeCun's 2022 article A Path Towards Autonomous Machine Intelligence, which was essentially about mapping out an alternative path to AI instead of LLMs. 4.2 From Vision to Code That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, I found the Code World Model paper (Sep 30, 2025) to be the first world model paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. 4.3 Code World Models vs. Regular LLMs for Code So, how does this differ from a regular code LLM such as Qwen3-Coder? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static, text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code's behavior. Annotated figure from https://www.arxiv.org/abs/2510.02387.
At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but rather a world-model-augmented LLM. For a first attempt, it performs surprisingly well and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of it as a fancy majority-voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. 5. Small Recursive Transformers You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don't always have to be large. In fact, with the Hierarchical Reasoning Model (HRM), a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section covers small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model's "thought" or "reasoning" process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper. And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let's take a look at TRM in a bit more detail. 5.1 What Does Recursion Mean Here? TRM refines its answer through two alternating updates: first, it computes a latent reasoning state from the current question and answer; then, it updates the answer based on that latent state. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights.
Here are two that stood out to me. First, fewer layers lead to better generalization: reducing the model from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Second, attention is not required: replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%), although this is only feasible here because the context is small and fixed-length.
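To make the recursion from section 5.1 concrete, here is a minimal control-flow sketch in Go of the two alternating updates: refine a latent reasoning state several times, then use it to refine the answer, and repeat. The interface, the loop counts, and the halting behavior are simplifying assumptions; the paper's actual training procedure (with deep supervision) is more involved.

```go
package trm

// Vec stands in for whatever embedding or grid representation the model
// uses; the concrete shapes in the paper differ.
type Vec []float64

// tinyNet abstracts the two roles of TRM's single small network:
// refining the latent "reasoning" state and refining the answer.
// These signatures are assumptions for illustration, not the paper's API.
type tinyNet interface {
	LatentUpdate(question, answer, latent Vec) Vec
	AnswerUpdate(answer, latent Vec) Vec
}

// solve alternates the two updates: each outer cycle first refines the
// latent state several times, then uses it to improve the current answer.
func solve(net tinyNet, question, answer, latent Vec, cycles, latentSteps int) Vec {
	for c := 0; c < cycles; c++ {
		for s := 0; s < latentSteps; s++ {
			latent = net.LatentUpdate(question, answer, latent)
		}
		answer = net.AnswerUpdate(answer, latent)
	}
	return answer
}
```

The point of the sketch is that the "reasoning" comes from iterating a tiny network many times, not from scaling up the network itself.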


Claude Code Can Debug Low-level Cryptography

Over the past few days I wrote a new Go implementation of ML-DSA, a post-quantum signature algorithm specified by NIST last summer. I livecoded it all over four days, finishing it on Thursday evening. Except… Verify was always rejecting valid signatures. I was exhausted, so I tried debugging for half an hour and then gave up, with the intention of coming back to it the next day with a fresh mind. On a whim, I figured I would let Claude Code take a shot while I read emails and resurfaced from hyperfocus. I mostly expected it to flail in some maybe-interesting way, or rule out some issues. Instead, it rapidly figured out a fairly complex low-level bug in my implementation of a relatively novel cryptography algorithm. I am sharing this because it made me realize I still don’t have a good intuition for when to invoke AI tools, and because I think it’s a fantastic case study for anyone who’s still skeptical about their usefulness. Full disclosure: Anthropic gave me a few months of Claude Max for free. They reached out one day and told me they were giving it away to some open source maintainers. Maybe it’s a ploy to get me hooked so I’ll pay for it when the free coupon expires. Maybe they hoped I’d write something like this. Maybe they are just nice. Anyway, they made no request or suggestion to write anything public about Claude Code. Now you know. I started Claude Code v2.0.28 with Opus 4.1 and no system prompts, and gave it the following prompt (typos included): I implemented ML-DSA in the Go standard library, and it all works except that verification always rejects the signatures. I know the signatures are right because they match the test vector. YOu can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Look for potential reasons the signatures don’t verify. ultrathink I spot-checked and w1 is different from the signing one. To my surprise, it pinged me a few minutes later with a complete fix . Maybe I shouldn’t be surprised! Maybe it would have been clear to anyone more familiar with AI tools that this was a good AI task: a well-scoped issue with failing tests. On the other hand, this is a low-level issue in a fresh implementation of a complex, relatively novel algorithm. It figured out that I had merged and into a single function for using it from Sign, and then reused it from Verify where already produces the high bits, effectively taking the high bits of w1 twice in Verify. Looking at the log , it loaded the implementation into the context and then immediately figured it out, without any exploratory tool use! After that it wrote itself a cute little test that reimplemented half of verification to confirm the hypothesis, wrote a mediocre fix, and checked the tests pass. I threw the fix away and refactored to take high bits as input, and changed the type of the high bits, which is both clearer and saves a round-trip through Montgomery representation. Still, this 100% saved me a bunch of debugging time. On Monday, I had also finished implementing signing with failing tests. There were two bugs, which I fixed in the following couple evenings. The first one was due to somehow computing a couple hardcoded constants (1 and -1 in the Montgomery domain) wrong . It was very hard to find, requiring a lot of deep printfs and guesswork. Took me maybe an hour or two. The second one was easier: a value that ends up encoded in the signature was too short (32 bits instead of 32 bytes) . 
It was relatively easy to tell because only the first four bytes of the signature were the same, and then the signature lengths were different. I figured these would be an interesting way to validate Claude’s ability to help find bugs in low-level cryptography code, so I checked out the old version of the change with the bugs (yay Jujutsu!) and kicked off a fresh Claude Code session with this prompt: I am implementing ML-DSA in the Go standard library, and I just finished implementing signing, but running the tests against a known good test vector it looks like it goes into an infinite loop, probably because it always rejects in the Fiat-Shamir with Aborts loop. You can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Figure out why it loops forever, and get the tests to pass. ultrathink It spent some time doing printf debugging and chasing down incorrect values very similarly to how I did it, and then figured out and fixed the wrong constants. It definitely took Claude less time than it took me. Impressive. It gave up after fixing that bug even though the tests still failed, so I started a fresh session (on the assumption that the context on the wrong constants would do more harm than good investigating an independent bug), and gave it this prompt: I am implementing ML-DSA in the Go standard library, and I just finished implementing signing, but running the tests against a known good test vector they don’t match. You can run the tests with “bin/go test crypto/internal/fips140/mldsa” You can find the code in src/crypto/internal/fips140/mldsa Figure out what is going on. ultrathink It took a couple wrong paths, thought for quite a bit longer, and then found this one too. I honestly expected it to fail initially. It’s interesting how Claude found the “easier” bug more difficult. My guess is that maybe the large random-looking outputs of the failing tests did not play well with its attention. The fix it proposed was updating only the allocation’s length and not its capacity, but whatever, the point is finding the bug, and I’ll usually want to throw away the fix and rewrite it myself anyway. Three out of three one-shot debugging hits with no help is extremely impressive. Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it. As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete or “make me a PR.” For example, how nice would it be if every time tests fail, an LLM agent was kicked off with the task of figuring out why, and only notified us if it did before we fixed it? For more low-level cryptography bugs and implementations, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @filippo@abyssdomain.expert. I promise I almost never post about AI. Enjoy the silliest floof. Surely this will help redeem me in the eyes of folks who consider AI less of a tool and more of something to be hated or loved. My work is made possible by Geomys, an organization of professional Go maintainers, which is funded by Smallstep, Ava Labs, Teleport, Tailscale, and Sentry. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.) Here are a few words from some of them!
Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs — We at Ava Labs , maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network ), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.


How I turned Zig into my favorite language to write network programs in

I’ve been watching the Zig language for a while now, given that it was created for writing audio software (low-level, no allocations, real time). I never paid too much attention though; it seemed a little weird to me, and I didn’t see the real need. Then I saw a post from Andrew Kelley (creator of the language) on Hacker News, about how he reimplemented my Chromaprint algorithm in Zig, and that got me really interested. I’ve been planning to rewrite AcoustID’s inverted index for a long time; I had a couple of prototypes, but none of the approaches felt right. I was going through some rough times and wanted to learn something new, so I decided to use the project as an opportunity to learn Zig. And it was great, writing Zig is a joy. The new version was faster and more scalable than the previous C++ one. I was happy, until I wanted to add a server interface. In the previous C++ version, I used Qt, which might seem very strange for server software, but I wanted a nice way of doing asynchronous I/O and Qt allowed me to do that. It was callback-based, but Qt has a lot of support for making callbacks usable. In the newer prototypes, I used Go, specifically for the ease of networking and concurrency. With Zig, I was stuck. There are some Zig HTTP servers, so I could use those. But I also wanted to implement my legacy TCP server, and that’s a lot harder unless I want to spawn a lot of threads. Then I made a crazy decision: to use Zig also for implementing a clustered layer on top of my server, using NATS as a messaging system. So I wrote a Zig NATS client, and that gave me a lot of experience with Zig’s networking capabilities. Fast forward to today, I’m happy to introduce Zio, an asynchronous I/O and concurrency library for Zig. If you look at the examples, you will not really see where the asynchronous I/O is, but it’s there in the background, and that’s the point. Writing asynchronous code with callbacks is a pain. Not only that, it requires a lot of allocations, because you need state to survive across callbacks. Zio is an implementation of Go-style concurrency, but limited to what’s possible in Zig. Zio tasks are stackful coroutines with fixed-size stacks. When you run an I/O call, Zio will initiate the operation in the background and then suspend the current task until the I/O operation is done. When it’s done, the task will be resumed, and the result will be returned. That gives you the illusion of synchronous code, allowing for much simpler state management. Zio supports fully asynchronous network and file I/O, has synchronization primitives (mutexes, condition variables, etc.) that work with the cooperative runtime, has Go-style channels, OS signal watches, and more. Tasks can run in single-threaded or multi-threaded mode; in the latter case they can migrate from thread to thread for lower latency and better load balancing. And it’s FAST. I don’t want to be posting benchmarks here, maybe later when I have more complex ones, but the single-threaded mode is beating any framework I’ve tried so far. It’s much faster than both Go and Rust’s Tokio. Context switching is virtually free, comparable to a function call. The multi-threaded mode, while still not being as robust as Go/Tokio, has comparable performance. It’s still a bit faster than either of them, but that performance might go down as I add more fairness features. Because it implements the standard interfaces for reader/writer, you can actually use external libraries that are unaware they are running within Zio.
Here is an example of an HTTP server: When I started working with Zig, I really thought it was going to be a niche language to write the fast code in, and that I’d need a layer on top of that in a different language. With Zio, that changed. The next step for me is to update my NATS client to use Zio internally. And after that, I’m going to work on an HTTP client/server library based on Zio.

Filippo Valsorda 2 weeks ago

The Geomys Standard of Care

One of the most impactful effects of professionalizing open source maintenance is that as professionals we can invest into upholding a set of standards that make our projects safer and more reliable. The same commitments and overhead that are often objected to when required of volunteers should be table stakes for professional maintainers. I didn’t find a lot of prior art, so to compile the Geomys Standard of Care I started by surveying recent supply chain compromises to look for mitigable root causes. (By the way, you might have missed that email because it includes the name of a domain used for a phishing campaign, so it got flagged as phishing. Oops.) I also asked for feedback from experts in various areas such as CI security, and from other Geomys maintainers. The first draft is below, and we’ll maintain the latest version at geomys.org/standard-of-care. It covers general maintenance philosophy, ongoing stability and reliability, dependency management, account and CI security, vulnerability handling, licensing, and more. In the future, we want to look into adopting more binary transparency tools, and into doing periodic reviews of browser extensions and of authorized Gerrit and GitHub OAuth apps and tokens (just GitHub has four places 1 to look in!). We also welcome feedback on things that would be valuable to add, for security or for reliability. We aim to maintain our projects sustainably and predictably. We are only able to do this thanks to our retainer contracts with our clients, but these commitments are offered to the whole community, not just to paying clients. Scope. We apply this standard to projects maintained or co-maintained by Geomys, including:
- the and packages in the Go standard library and the FIPS 140-3 Go Cryptographic Module (co-maintained with the rest of the Go team)
- Staticcheck
- filippo.io/edwards25519
- filippo.io/csrf
- filippo.io/keygen
- filippo.io/intermediates (externalized from the standard library)
- age and typage
- Sunlight and filippo.io/torchwood
- yubikey-agent
For projects where we are not the sole maintainers, we prioritize working well with the rest of the team. Geomys maintainers may also have personal projects that are not held to this standard (e.g. everything in mostly-harmless). Code review. If the project accepts external contributions, we review all the code provided to us. This extends to any code generated with LLMs, as well. Complexity. A major part of the role of a maintainer is saying no. We consciously limit complexity, and keep the goals and non-goals of a project in mind when considering features. (See for example the Go Cryptography Principles.) Static analysis. We run staticcheck, by our very own @dominikh, in CI. Stability. Once a Go package reaches v1, we maintain strict backwards compatibility within a major version, similarly to the standard library’s compatibility promise. Ongoing maintenance. Not all projects are actively worked on at all times (e.g. some projects may be effectively finished, or we may work in batches). However, unless a project is explicitly archived or deprecated, we will address newly arising issues that make the project unsuitable for a previously working use case (e.g. compatibility with a new OS). Dependency management. We don’t use automatic dependency version bump tools, like Dependabot. For our purposes, they only cause churn and increase the risk of supply chain attacks by adopting new module versions before the ecosystem has had time to detect attacks. (Dependabot specifically also has worrying impersonation risks, which would make for trivial social engineering attacks.) Instead, we run govulncheck on a schedule, to get high signal-to-noise ratio notifications of vulnerable dependencies that actually affect our projects; and run isolated CI jobs with the latest versions of our dependencies (i.e.
running before ) to ensure we’re alerted early of breakages, so we can easily update to future security releases and so we’re aware of potential compatibility issues for our dependents. Phishing-resistant authentication. Phishing is by far the greatest threat to our security and, transitively, to that of our users. We acknowledge there is no amount of human carefulness that can systematically withstand targeted attacks, so we use technically phishing-resistant authentication for all services that allow impacting our projects’ users. Phishing-resistant authentication means passkeys or WebAuthn 2FA, with credentials stored in platform authenticators (e.g. iCloud Keychain), password managers (e.g. 1Password or Chrome), or hardware tokens (e.g. YubiKeys). Critical accounts that allow escalating to user impact include:
- All Google accounts linked to a Gerrit account
- Password manager
- Passkey sync (e.g. Apple iCloud)
- Website host
- Domain registrar
- Package registry (if applicable, although Go’s decentralized package management largely removes this attack surface)
If a strict mode such as Google’s Advanced Protection Program or Apple’s Advanced Data Protection is available, we enable it. If a phishable fallback authentication or account recovery method is instead required, we configure one that is secret-based (e.g. TOTP or recovery codes) and either delete the secret or commit to never using it without asking a fellow Geomys maintainer to review the circumstances that necessitated it. TOTP can’t hurt us if we don’t use it. We never enable SMS as an authentication mechanism or as an account recovery mechanism, because SIM jacking is possible even without action on our part. Long-lived credentials. We avoid long-lived persistent credentials where possible, or make them non-extractable. For example, we use git-credential-oauth instead of Gerrit cookies, and hardware-bound SSH keys with yubikey-agent or Secretive instead of personal access tokens for git pushes to GitHub. Unlike phishing-resistant authentication, we found it impractical to roll out short-lived credentials universally. Notably, we have not found a way to use the GitHub CLI without extractable long-lived credentials. CI security. We run zizmor on our GitHub Actions workflows, and we don’t use dangerous GitHub Actions triggers that run privileged workflows with attacker-controlled contexts, such as . We run GitHub Actions workflows with read-only permissions and no secrets by default. Workflows that have write permissions or access to secrets disable all use of caches (including indirectly through actions like ), to mitigate cache poisoning attacks. (Note that, incredibly, read-only workflows can write arbitrary cache entries, which is why this must be mitigated at cache use time.) Third-party access. For projects maintained solely by Geomys, we avoid providing user-impacting (i.e. push or release) access to external people, and publicly disclose any exceptions. If abandoning a project, we prefer archiving it and letting a fork spawn to handing over control to external people. This way dependents can make their own assessment of whether to trust the new maintainers. Any exceptions will be widely communicated well in advance. Under no circumstances will we release to public registration a domain, GitHub user/org, or package name that was previously assigned to a Geomys project. Availability monitoring. We have automated uptime monitoring for critical user-facing endpoints, such as the Go import path meta pages. This also provides monitoring for critical domain expiration, preventing accidental takeovers. Transparency logging.
We subscribe to new version notifications via GopherWatch, to be alerted of unauthorized module versions published to the Go Checksum Database. We monitor Certificate Transparency logs for critical domains (e.g. the roots of our Go import paths) using tools such as Cert Spotter or Silent CT. We also set CAA records on those domains limiting issuance to the minimal set of CAs required for operation. Vulnerability handling. We document the official vulnerability reporting mechanism of each project, we encourage coordinated vulnerability reporting, and we appreciate the work of security researchers. We honor embargoes of up to 90 days, and we do not share vulnerability details with people not involved in fixing them until they are public. (Paying clients do not get access to private vulnerability details. This is to honor our responsibility to the various stakeholders of an open source project, and to acknowledge that often these details are not ours to share.) Once a vulnerability is made public, we ensure it is included in the Go vulnerability database with accurate credit and metadata, including a CVE number. If the documented vulnerability reporting mechanism is unresponsive, an escalation path is available by emailing security at geomys.org. Licenses. We use permissive, well-known licenses: BSD-3-Clause, BSD-2-Clause, BSD-1-Clause, 0BSD, ISC, MIT, or (less preferably) Apache-2.0. Disclaimer. This is not a legally binding agreement. Your use of the projects continues to be controlled by their respective licenses, and/or by your contract with Geomys, which does not include this document unless explicitly specified. I am getting a cat (if I successfully defeat my allergies through a combination of LiveClear, SLIT, antihistamines, and HEPA filters), so obviously you are going to get a lot of cat pictures going forward. For more, you can follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @filippo@abyssdomain.expert. This is the work of Geomys, an organization of professional Go maintainers, which is funded by Smallstep, Ava Labs, Teleport, Tailscale, and Sentry. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs — We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.
https://github.com/settings/tokens and https://github.com/settings/personal-access-tokens and https://github.com/settings/apps/authorizations and https://github.com/settings/applications ↩

Anton Zhiyanov 2 weeks ago

Go proposal: Compare IP subnets

Part of the Accepted! series, explaining the upcoming Go changes in simple terms. Compare IP address prefixes the same way IANA does. Ver. 1.26 • Stdlib • Low impact An IP address prefix represents an IP subnet. These prefixes are usually written in CIDR notation, such as 192.168.0.0/16. In Go, an IP prefix is represented by the netip.Prefix type. The new Compare method lets you compare two IP prefixes, making it easy to sort them without having to write your own comparison code. The imposed order matches both Python's implementation and the assumed order from IANA. When the Go team initially designed the IP subnet type (netip.Prefix), they chose not to add a Compare method because there wasn't a widely accepted way to order these values. Because of this, if a developer needs to sort IP subnets, for example to organize routing tables or run tests, they have to write their own comparison logic. This results in repetitive and error-prone code. The proposal aims to provide a standard way to compare IP prefixes. This should reduce boilerplate code and help programs sort IP subnets consistently. Add the Compare method to the netip.Prefix type. Compare orders two prefixes as follows: first by validity (invalid before valid), then by address family (IPv4 before IPv6), then by masked IP address (network IP), then by prefix length, and finally by unmasked address (original IP). This follows the same order as Python's implementation and the standard IANA convention. Sorting a list of IP prefixes then becomes a one-liner, as shown in the example below. 𝗣 61642 • 𝗖𝗟 700355
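Here is a short sketch of what sorting could look like once the method lands (this assumes the Go 1.26 API ships as proposed; the example prefixes are arbitrary):

```go
package main

import (
	"fmt"
	"net/netip"
	"slices"
)

func main() {
	prefixes := []netip.Prefix{
		netip.MustParsePrefix("2001:db8::/32"),
		netip.MustParsePrefix("10.0.0.0/16"),
		netip.MustParsePrefix("10.0.0.0/8"),
		netip.MustParsePrefix("192.168.1.0/24"),
	}

	// Compare reports -1, 0, or +1, so the method expression
	// netip.Prefix.Compare plugs directly into slices.SortFunc.
	slices.SortFunc(prefixes, netip.Prefix.Compare)

	fmt.Println(prefixes)
	// IPv4 before IPv6, then by network address, then by prefix length:
	// [10.0.0.0/8 10.0.0.0/16 192.168.1.0/24 2001:db8::/32]
}
```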


Interview with a new hosting provider founder

Most of us use infrastructure provided by companies like DigitalOcean and AWS. Some of us choose to work on that infrastructure. And some of us are really built different and choose to build all that infrastructure from scratch . This post is a real treat for me to bring you. I met Diana through a friend of mine, and I've gotten some peeks behind the curtain as she builds a new hosting provider . So I was thrilled that she agreed to an interview to let me share some of that with you all. So, here it is: a peek behind the curtain of a new hosting provider, in a very early stage. This is the interview as transcribed (any errors are mine), with a few edits as noted for clarity. Nicole: Hi, Diana! Thanks for taking the time to do this. Can you start us off by just telling us a little bit about who you are and what your company does? Diana: So I'm Diana, I'm trans, gay, AuDHD and I like to create, mainly singing and 3D printing. I also have dreams of being the change I want to see in the world. Since graduating high school, all infrastructure has become a passion for me. Particularly networking and computer infrastructure. From your home internet connection to data centers and everything in between. This has led me to create Andromeda Industries and the dba Gigabit.Host. Gigabit.Host is a hosting service where the focus is affordable and performant host for individuals, communities, and small businesses. Let's start out talking about the business a little bit. What made you decide to start a hosting company? The lack of performance for a ridiculous price. The margins on hosting is ridiculous, it's why the majority of the big tech companies' revenue comes from their cloud offerings. So my thought has been why not take that and use it more constructively. Instead of using the margins to crush competition while making the rich even more wealthy, use those margins for good. What is the ethos of your company? To use the net profits from the company to support and build third spaces and other low return/high investment cost ventures. From my perspective, these are the types of ideas that can have the biggest impact on making the world a better place. So this is my way of adopting socialist economic ideas into the systems we currently have and implementing the changes. How big is the company? Do you have anyone else helping out? It’s just me for now, though the plan is to make it into a co-op or unionized business. I have friends and supporters of the project, giving feedback and suggesting improvements. What does your average day-to-day look like? I go to my day job during the week, and work on the company in my spare time. I have alerts and monitors that warn me when something needs addressing, overall operations are pretty hands off. You're a founder, and founders have to wear all the hats. How have you managed your work-life balance while starting this? At this point it’s more about balancing my job, working on the company, and taking care of my cat. It's unfortunately another reason that I started this endeavor, there just aren't spaces I'd rather be than home, outside of a park or hiking. All of my friends are online and most say the same, where would I go? Hosting businesses can be very capital intensive to start. How do you fund it? Through my bonuses and stocks currently, also through using more cost effective brands that are still reliable and performant. What has been the biggest challenge of operating it from a business perspective? Getting customers. 
I'm not a huge fan of marketing and have been using word of mouth as the primary method of growing the business. Okay, my part here then haha. If people want to sign up, how should they do that? If people are interested in getting service, they can request an invite through this link: https://portal.gigabit.host/invite/request . What has been the most fun part of running a hosting company? Getting to actually be hands on with the hardware and making it as performant as possible. It scratches an itch of eking out every last drop of performance. Also not doing it because it's easy, doing it because I thought it would be easy. What has been the biggest surprise from starting Gigabit.Host? How both complex and easy it has been at the same time. Also how much I've been learning and growing through starting the company. What're some of the things you've learned? It's been learning that wanting it to be perfect isn't realistic, taking the small wins and building upon and continuing to learn as you go. My biggest learning challenge was how to do frontend work with Typescript and styling, the backend code has been easy for me. The frontend used to be my weakness, now it could be better, and as I add new features I can see it continuing to getting better over time. Now let's talk a little bit about the tech behind the scenes. What does the tech stack look like? Next.js and Typescript for the front and backend. Temporal is used for provisioning and task automation. Supabase is handling user management Proxmox for the hardware virtualization How do you actually manage this fleet of VMs? For the customer side we only handle the initial provisioning, then the customer is free to use whatever tool they choose. The provisioning of the VMs is handled using Go and Temporal. For our internal services we use Ansible and automation scripts. [Nicole: the code running the platform is open source, so you can take a look at how it's done in the repository!] How do your technical choices and your values as a founder and company work together? They are usually in sync, the biggest struggle has been minimizing cost of hardware. While I would like to use more advanced networking gear, it's currently cost prohibitive. Which choices might you have made differently? [I would have] gathered more capital before getting started. Though that's me trying to be a perfectionist, when the reality is buy as little as possible and use what you have when able. This seems like a really hard business to be in since you need reliability out of the gate. How have you approached that? Since I've been self-funding this endeavor, I've had to forgo high availability for now due to costs. To work around that I've gotten modern hardware for the critical parts of the infrastructure. This so far has enabled us to achieve 90%+ uptime, with the current goal to add redundancy as able to do so. What have been the biggest technical challenges you've run into? Power and colocation costs. Colocation is expensive in Seattle. Around 8x the cost of my previous colo in Atlanta, GA. Power has been the second challenge, running modern hardware means higher power requirements. Most data centers outside of hyperscalers are limited to 5 to 10 kW per rack. This limits the hardware and density, thankfully for now it [is] a future struggle. Huge thanks to Diana for taking the time out of her very busy schedule for this interview! And thank you to a few friends who helped me prepare for the interview.

Emil Privér 2 weeks ago

We Re-Built Our Integration Service Using Postgres and Go

Our integration service connects our platform to external systems. Earlier this year, we reached a scaling limit at 40 integrations and rebuilt it from the ground up. The service handles three primary responsibilities: sending data to external systems, managing job queues, and prioritizing work based on criticality. The original implementation functioned but had architectural constraints that prevented horizontal scaling. We use microservices because different components have conflicting requirements. The management API handles complex business logic with normalized schemas: separate tables for translations and categories. The public API optimizes for read performance under load, using denormalized data by adding translations directly into category tables and handling filtering in Go. A monolithic architecture would require compromising performance in one area to accommodate the other. The integration service currently processes millions of events daily, with volume increasing as we onboard new customers. This post describes our implementation of a queue system using PostgreSQL and Go, focusing on design decisions and technical trade-offs. The first implementation used GCP Pub/Sub, a topic-to-many-subscription service where messages are replicated across multiple queues. This architecture introduced several scalability issues. The integration service maintained a database for integration configurations but lacked ownership of its operational data. This violated a distributed systems principle: services should own their data rather than depend on other services for it. This dependency forced our management service to serialize complete payloads into the queue. Updating a single attribute on a sub-object required sending the entire parent object with all nested sub-objects, metadata, and relationships. Different external APIs have varying data requirements: some need individual sub-objects while others require complete hierarchies. For clients with records containing 300-500 sub-objects, this resulted in significant message size inflation. GCP charges by message size rather than count, making large messages substantially more expensive than smaller ones. GCP's WebSocket delivery requires clients to buffer messages internally. With 40 integrations running separate consumers with filters, traffic spikes created memory pressure:
- Mass updates generate large objects per record
- Objects are duplicated for each configured integration
- Copies buffer across 5-10 consumer instances
- Infrastructure requires 2GB RAM and 2 cores to handle spikes, despite needing only 512MB and 1 core during normal operation
This prevented horizontal scaling and limited us to vertical scaling approaches. External APIs enforce varying rate limits. Our in-memory rate limiter tracked requests per integration but prevented horizontal scaling since state couldn't be shared across instances without risking rate limit violations. By early 2025, these issues had compounded: excessive message sizes increasing costs, memory bloat requiring oversized containers, vertical-only scaling, high operational expenses, rate limiting preventing horizontal scale, and lack of data independence. The system couldn't accommodate our growth trajectory. A complete rebuild was necessary. The v2 design addressed specific limitations:
- Horizontal scaling - Enable scaling across multiple containers
- Distributed rate limiting - Coordinate rate limits across instances
- Data ownership - Store operational data within the service
- Delta updates - Send only changed data rather than complete records
Additional improvements:
- Fair scheduling - Prevent single integrations from monopolizing resources
- Priority queuing - Process critical updates before lower-priority changes
- Self-service re-sync - Enable customers to re-sync catalogs independently
- Visibility - Provide APIs for customers to monitor sent data and queue status
The standard approach involves the producer computing payloads and sending them to the queue for consumer processing. We used this in v1 but rejected it for v2. Customers frequently make multiple rapid changes to the same record: updating a title, then a price, then a description. Each change triggers an event. Instead of sending three separate updates, we consolidate changes into a single update. We implemented a deduplication mechanism in the jobs table.
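The post doesn't show the actual schema, but one way to get this kind of deduplication in Postgres is a partial unique index on (integration, record) for not-yet-processed jobs, plus an upsert on enqueue. All table, column, and status names below are assumptions for illustration:

```go
package queue

// A sketch of a "one pending job per record" constraint. The dedup window
// effectively lasts as long as the job stays in the 'pending' state.
const dedupSchemaSQL = `
CREATE UNIQUE INDEX IF NOT EXISTS jobs_one_pending_per_record
    ON jobs (integration_id, record_id)
    WHERE status = 'pending';`

// Enqueueing then becomes an upsert: a second change to the same record
// while a pending job already exists is simply dropped instead of
// creating another job.
const enqueueSQL = `
INSERT INTO jobs (integration_id, record_id, status, created_at)
VALUES ($1, $2, 'pending', now())
ON CONFLICT (integration_id, record_id) WHERE status = 'pending'
DO NOTHING`
```

Because the job only references the record (in v2 the data to send is presumably computed at processing time), dropping the duplicate insert loses nothing: the already-pending job picks up the latest state when it runs.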
Multiple updates to the same record within a short time window are deduplicated into a single job, reducing load on both our system and recipient systems.

We chose PostgreSQL as our queue backend for several reasons:

- Performance - PostgreSQL is fast enough for our use case. We don't need sub-second message delivery.
- Simplicity - Using a managed PostgreSQL instance on GCP is significantly simpler than introducing new infrastructure.
- Familiarity - Most developers understand SQL, reducing onboarding time.
- Existing infrastructure - We already use PostgreSQL for our data, eliminating the need for additional systems.

Often, we think we need something bigger like Apache Kafka when a relational database like PostgreSQL is sufficient for our requirements.

The jobs table structure is simple: each row is one job, tracking fields that do the following:

- Link logs across services
- Specify the action to perform
- Record failure details
- Track the current workflow state
- Count retry attempts
- Schedule the next retry
- Provide timestamps for observability metrics
- Link the job to a specific integration
- Identify the platform
- Contain the job data
- Prevent duplicate execution

Postgres-backed queues require careful indexing. We use partial indexes (with WHERE clauses) only for the actively queried states. We don't index the terminal states, which contain the majority of jobs in the table and aren't needed in the job processing flow; indexing them would only pull data into memory that the processing flow never uses. Jobs are ordered by creation time for FIFO processing, with priority queue overrides when applicable.

Jobs follow a defined lifecycle:

- Created → the initial state
- Picked up → transitions to a processing state
- Success → marked completed, recording the completion time
- Failed (10 retries) → marked permanently failed, recording the failure
- Failed (retries remaining) → queued for retry, incrementing the retry count and calculating the next retry time

Timestamp fields serve observability purposes, measuring job duration and identifying bottlenecks. For retried jobs, retry timing is calculated using exponential backoff.

The worker system requirements:

- Parallel worker execution
- Horizontal scaling across containers
- Graceful shutdowns without job loss
- Distributed rate limit enforcement—we need to respect rate limits no matter how many containers we run

We evaluated two approaches: maintaining in-memory queues with multiple goroutines using for/select loops to fetch jobs, or having goroutines fetch data from the database and iterate over the results. We chose the database iteration approach for its simplicity. pgxpool handles connection pooling, eliminating the need for channel-based in-memory queues. Each worker runs in a separate goroutine, using a ticker to poll for jobs every second. Before processing, workers check for shutdown signals (a canceled context or a closed shutdown channel). When shutdown is initiated, workers stop accepting new jobs and mark in-flight jobs for retry. This prevents stalled jobs from blocking integration queues. Checking shutdown signals between jobs ensures clean shutdowns. During shutdown, we create a fresh context for retrying jobs. This prevents database write failures when the main context is canceled.

The job-claiming query implements fair scheduling to prevent high-volume integrations from monopolizing workers. Query breakdown:

Step 1: Identify busy integrations. A CTE identifies integrations with 50+ concurrent processing jobs.

Step 2: Select jobs with priority ordering. Jobs are selected from integrations not in the busy list. Priority updates are ordered first, followed by FIFO ordering. Row-level locking ties the selected rows to the current transaction, preventing duplicate processing by concurrent workers.

Step 3: Update job status. Selected jobs are updated to a processing status with a recorded start time.

This ensures fair resource allocation across integrations.

Job timeouts are critical for queue health. In the initial release, we reused the global context for job processing. When jobs hung waiting for slow external APIs, they couldn't be marked completed or failed due to context lifecycle coupling. Jobs accumulated in the processing state indefinitely. The solution: context separation. The global context controls worker lifecycle, while each job receives its own context with a timeout. Timed-out jobs are marked as failed, allowing queue progression. This also enables database writes during shutdown using a fresh context, even when the global context is canceled.

Failed jobs require retry logic with appropriate timing. Immediate retries against failing external APIs are counterproductive. We implement exponential backoff: instant first retry, 10 seconds for the second, 30 seconds for the third, up to 30 minutes. The retry-count field drives the backoff calculation. After 10 attempts, jobs are marked as permanently failed.
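Pulling these pieces together (the one-second polling ticker, the busy-integration CTE, row locking, and per-job timeouts), here is a rough sketch of how they could look with pgx. This is not the service's actual code: the table, column, and status names, the use of FOR UPDATE SKIP LOCKED, the batch size, and the two-minute job timeout are all assumptions based on the description above.

```go
// Package queueworker sketches the polling worker and fair-scheduling claim
// query described in the post. All table, column, and status names are
// assumptions; the real schema is not shown in the article.
package queueworker

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

type Job struct {
	ID            int64
	IntegrationID int64
	Payload       []byte
}

// claimSQL picks up to $1 runnable jobs, skipping integrations that already
// have 50+ jobs in flight, and marks the picked rows as processing in the
// same statement.
const claimSQL = `
WITH busy AS (
    SELECT integration_id
    FROM jobs
    WHERE status = 'processing'
    GROUP BY integration_id
    HAVING COUNT(*) >= 50
), picked AS (
    SELECT id
    FROM jobs
    WHERE status IN ('pending', 'retry')
      AND (next_retry_at IS NULL OR next_retry_at <= now())
      AND integration_id NOT IN (SELECT integration_id FROM busy)
    ORDER BY priority DESC, created_at
    LIMIT $1
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing', started_at = now()
FROM picked
WHERE jobs.id = picked.id
RETURNING jobs.id, jobs.integration_id, jobs.payload`

// claimJobs runs the claim query inside one transaction so that concurrent
// workers cannot grab the same rows.
func claimJobs(ctx context.Context, pool *pgxpool.Pool, limit int) ([]Job, error) {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return nil, err
	}
	defer tx.Rollback(ctx) // safe to call after a successful Commit

	rows, err := tx.Query(ctx, claimSQL, limit)
	if err != nil {
		return nil, err
	}
	jobs, err := pgx.CollectRows(rows, pgx.RowToStructByPos[Job])
	if err != nil {
		return nil, err
	}
	return jobs, tx.Commit(ctx)
}

// Run polls for work every second until ctx is canceled. Each job gets its
// own timeout, derived from a fresh background context, so a slow external
// API cannot wedge the worker and shutdown cannot kill an in-flight write.
func Run(ctx context.Context, pool *pgxpool.Pool, process func(context.Context, Job) error) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // shutdown: stop accepting new jobs
		case <-ticker.C:
			jobs, err := claimJobs(ctx, pool, 10)
			if err != nil {
				log.Printf("claim jobs: %v", err)
				continue
			}
			for _, job := range jobs {
				jobCtx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
				if err := process(jobCtx, job); err != nil {
					log.Printf("job %d: %v", job.ID, err) // a real system would mark the job failed or retried here
				}
				cancel()
			}
		}
	}
}
```

The important property is that selecting and marking jobs as processing happen in one transaction, so two workers polling at the same moment cannot pick up the same rows.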
Error types guide retry behavior:

- NonRetryableError - Permanent failures (e.g., validation errors). No retry.
- RetryableError - Transient failures (e.g., 500 Internal Server Error). Retry with backoff.
- Retry limit reached - Mark the job as failed.

This allows each integration to decide how to handle errors based on the external API's response. For example, a 400 Bad Request might be a permanent validation failure (NonRetryableError), while a 503 Service Unavailable is transient and should retry (RetryableError). The integration implementation determines the appropriate error type for each scenario.

Jobs occasionally become stuck in the processing state due to worker panics, database connection failures, or unexpected container termination. A cron job runs every minute, identifying jobs that have been in the processing state beyond the expected duration. These jobs are moved back into the retry flow with incremented retry counts, treating them as standard failures. This ensures queue progression despite unexpected failures.

Rate limiting across multiple containers was v2's most complex challenge. V1's in-memory rate limiter worked for single containers but couldn't share state across instances. While Redis was an option, we already had PostgreSQL with sufficient performance. The solution: a table tracking request counts per integration per second. Before external API requests, we increment the counter for the integration's current time window (rounded to the second). PostgreSQL returns the new count. If the count exceeds the limit, we sleep 250ms and retry. If under the limit, we proceed. This works because all containers share the database as the source of truth for rate limiting. Occasionally, jobs are rate-limited during heavy load due to the gap between count checking and request sending. These jobs retry immediately. The occurrence rate is acceptable.

Hope you enjoyed this article and learned something new. This system has worked really well so far, and we've had only a few minor issues that we fixed quickly. I will update this article over time.
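As an addendum to the rate-limiting section, here is a minimal sketch of the shared Postgres counter described above. The table and column names and the exact SQL are assumptions; only the increment-then-check flow, the per-second window, and the 250ms sleep come from the text.

```go
// Package ratelimit sketches the Postgres-backed rate limiter: one row per
// integration per one-second window, incremented atomically across containers.
package ratelimit

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Assumed table:
//
//	CREATE TABLE rate_limits (
//	    integration_id bigint      NOT NULL,
//	    window_start   timestamptz NOT NULL,
//	    count          int         NOT NULL DEFAULT 0,
//	    PRIMARY KEY (integration_id, window_start)
//	);
const incrementSQL = `
INSERT INTO rate_limits (integration_id, window_start, count)
VALUES ($1, date_trunc('second', now()), 1)
ON CONFLICT (integration_id, window_start)
DO UPDATE SET count = rate_limits.count + 1
RETURNING rate_limits.count`

// Wait increments the counter for the current one-second window and blocks,
// sleeping 250ms between attempts, until the integration is under its limit.
func Wait(ctx context.Context, pool *pgxpool.Pool, integrationID int64, limit int) error {
	for {
		var count int
		if err := pool.QueryRow(ctx, incrementSQL, integrationID).Scan(&count); err != nil {
			return err
		}
		if count <= limit {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(250 * time.Millisecond):
		}
	}
}
```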

alikhil 3 weeks ago

kubectl-find - UNIX-find-like plugin to find resources and perform action on them

Recently, I developed a plugin for kubectl, inspired by the UNIX find utility, to find resources and perform actions on them. And a few days ago, the number of stars in the repo reached 50! I think it's a good moment to tell more about the project.

As an engineer who works with Kubernetes every day, I use kubectl a lot. Actually, more than 50% of my terminal history commands are related to Kubernetes. Here is my top 10. Run the equivalent command if you are curious which commands top your own terminal history.

I use kubectl to check the status of pods, delete orphaned resources, trigger syncs, and much more. When I realized half my terminal history was just kubectl commands, I thought — there must be a better way to find things in Kubernetes without chaining pipes. And I imagined how nice it would be to have a UNIX find-like tool — something that lets you search for exactly what you need in the cluster and then perform actions directly on the matching resources. I searched for a krew plugin like this, but there wasn't one. So I decided to develop it myself!

I used sample-cli-plugin as a starting point. Its clean repository structure and straightforward design make it a great reference for working with the Kubernetes API. Additionally, it allows easy reuse of the extensive Kubernetes client libraries. Almost everything in the Kubernetes ecosystem is written in Go, and this plugin is no exception — which is great, as it allows building binaries for a wide range of CPU architectures and operating systems.

Use a filter to find any resource by any custom condition; the plugin uses the gojq implementation of jq. By default, kubectl-find will print the found resources to stdout. However, there are flags you can provide to perform an action on the found resources:

- to delete them
- to patch them with provided JSON
- to run a command on pods

Use krew to install the plugin.

I'm currently working on adding:

- JSON/YAML output format
- More filters
- Saved queries

If you're tired of writing long chains, give kubectl-find a try — it's already saved me countless keystrokes. Check out the repo ⭐ github.com/alikhil/kubectl-find and share your ideas or issues — I'd love to hear how you use it!
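As an aside for the curious, here is a rough sketch of how a jq-style filter can be evaluated with gojq, the library the plugin builds on. The filter string and the object below are made up for illustration; this is not kubectl-find's actual code.

```go
// Evaluate a jq-style filter against a Kubernetes object in unstructured form.
package main

import (
	"fmt"
	"log"

	"github.com/itchyny/gojq"
)

func main() {
	// A Kubernetes object as it would look after decoding into map form.
	pod := map[string]any{
		"kind": "Pod",
		"metadata": map[string]any{
			"name":   "api-7f9c",
			"labels": map[string]any{"app": "api"},
		},
		"status": map[string]any{"phase": "Failed"},
	}

	// A user-supplied filter: match pods that are not running.
	query, err := gojq.Parse(`.status.phase != "Running"`)
	if err != nil {
		log.Fatal(err)
	}

	iter := query.Run(pod)
	for {
		v, ok := iter.Next()
		if !ok {
			break
		}
		if err, isErr := v.(error); isErr {
			log.Fatal(err)
		}
		if matched, _ := v.(bool); matched {
			fmt.Println("matched:", pod["metadata"].(map[string]any)["name"])
		}
	}
}
```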

The Coder Cafe 3 weeks ago

Conflict-Free Replicated Data Types (CRDTs)

☕ Welcome to The Coder Cafe! Today, we will explore CRDTs, why they matter in distributed systems, and how they keep nodes in sync. Get cozy, grab a coffee, and let's begin!

CRDTs, short for Conflict-Free Replicated Data Types, are a family of data structures built for distributed systems. At first sight, CRDTs may look intimidating. Yet at their core, the idea is not that complex. What makes them special is that they allow updates to happen independently on different nodes while still guaranteeing that all replicas eventually converge to the same state. To understand how CRDTs achieve this, we first need to step back. We need to talk about concurrent operations and what coordination means in a distributed system. Let's take it step by step.

What does it mean for operations to be concurrent? Our first intuition might be to say they happen at the same time. That's not quite right. Here's a counterargument based on a collaborative editing example. While on a plane, Alice connects to a document and makes an offline change to a sentence. An hour later, Bob connects to the same document and edits the very same sentence, but online. Later, when Alice lands, both versions have to sync. The two edits were separated by an hour. They didn't happen at the same time, yet they are concurrent. So what's a better definition for concurrent operations? Two operations that are not causally related. In the previous example, neither operation was made with knowledge of the other. They are not causally related, which makes them concurrent. Yet, if Bob had first seen Alice's update and then made his own, his edit would depend on hers. In that case, the two operations wouldn't be concurrent anymore.

We should also understand concurrent ≠ conflict:

- If Alice fixes a missing letter in a word while Bob removes the whole word, that's a conflict.
- If Alice edits one sentence while Bob edits another, that's not a conflict.

Concurrency is about independence in knowledge. Conflict is about whether the effects of operations collide.

Now, let's talk about coordination in distributed systems. Imagine a database with two nodes, node 1 and node 2. A bunch of clients connect to it. Sometimes requests go to node 1, sometimes to node 2. Let's say two clients send concurrent and conflicting operations: client A sets Alice's balance to $200 while client B sets it to -$100. In this case, we can't have node 1 storing $200 while node 2 stores -$100. That would be a consistency violation with the two nodes disagreeing on Alice's balance. Instead, both nodes need to agree on a shared value. To do that, they have to communicate and decide on one of the following:

- Reject both operations
- Accept client A's update and set the balance to $200
- Accept client B's update and set the balance to -$100

The very action of nodes communicating and, if needed, waiting to agree on a single outcome is called coordination. Coordination is one way to keep replicas consistent under concurrent operations. But coordination is not the only way. That's where CRDTs come in. CRDT stands for Conflict-Free Replicated Data Types. In short, CRDTs are data structures built so that nodes can accept local updates independently and concurrently, without the need for coordination. If you read our recent post on availability models, you might notice we're now in the territory of total availability: a system is totally available if every non-faulty node can execute any operation. Total availability comes with weaker consistency. For CRDTs, the consistency guarantee is called Strong Eventual Consistency (SEC).
For that, CRDTs rely on a deterministic conflict resolution algorithm. Because every node applies the same rules, all replicas are guaranteed to eventually converge to the same state. Let's make this more concrete with a classic CRDT: the G-Counter (Grow-Only Counter).

Imagine a database with two nodes tracking the number of likes on a post. Node 1 receives a new like, increments its counter, and replies success to the client. Then, node 1 communicates with node 2 to send this update. Ultimately, both nodes converge to the same value: 6.

How does the conflict resolution work for a G-Counter? Each replica keeps a vector of counters, with one slot per node. In our example, the total number of likes is 5. Let's say node 1 has seen 2 likes and node 2 has seen 3 likes, so both replicas start with the vector {node 1: 2, node 2: 3}. When node 1 receives a new like, it only increments its own slot, so its replica holds {node 1: 3, node 2: 3} while node 2 is temporarily out of sync with {node 1: 2, node 2: 3}. During synchronization, both nodes merge their vectors by taking the element-wise maximum, and both replicas converge to the same state: {node 1: 3, node 2: 3}, a total of 6 likes. The beauty of this algorithm is that it's deterministic and order-independent. No matter when or how often the nodes sync, they always end up with the same state.

NOTE: Do you know Gossip Glomers? It's a series of distributed systems challenges we briefly introduced in an earlier post. Challenge 4 is to build a Grow-Only Counter. It's worth checking out if you haven't already.

CRDTs can also be combined to make a more complex CRDT. For example, if we want to track both likes and dislikes, we can use two G-Counters together. This data type is called a PN-Counter (Positive-Negative Counter). Imagine two clients act concurrently on the same post: one likes it, another dislikes it. The nodes exchange their updates and converge to the same value. In the case of a PN-Counter, the conflict resolution algorithm is similar to the G-Counter's. The difference is that it involves not one but two vectors: one for increases and one for decreases. Assume an initial state where node 1 has received 2 likes and 0 dislikes, and node 2 has received 3 likes and 0 dislikes. Now, suppose node 1 receives a new like and node 2 receives a dislike. Before the sync, node 1's replica holds increments {node 1: 3, node 2: 3} and decrements {0, 0}, while node 2's replica holds increments {node 1: 2, node 2: 3} and decrements {node 1: 0, node 2: 1}. When the replicas exchange their state, the merge rule is element-wise maximum for each vector. After the sync, both nodes converge to increments {node 1: 3, node 2: 3} and decrements {node 1: 0, node 2: 1}. The final counter of likes is 6 - 1 = 5.

Let's pause for a second. Based on what we've discussed, can you think of some use cases for CRDTs? A data structure where nodes are updated independently, concurrently, without coordination, and still guarantees that they converge to the same state?

One main use case is collaborative and offline-first systems. For example, Notion, a collaborative workspace, recently introduced a feature that lets people edit the same content offline. They rely on CRDTs, and more specifically on Peritext, a CRDT for rich-text collaboration co-authored by multiple people. Another big use case is totally available systems that put availability ahead of strong consistency. As we've seen, nodes don't need to coordinate before acknowledging a client request, which makes the system more highly available. Take Redis, for example. It can be configured in an active-active architecture with geographically distributed datacenters. Clients connect to their closest cluster and get local latencies without waiting for coordination across distant regions. And yes, this setup is built on CRDTs.
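Before moving on to more use cases, here is a minimal Go sketch of the G-Counter walkthrough above. The type and method names are mine, not from the article; it only illustrates the per-node vector and the element-wise-max merge.

```go
package crdt

// GCounter is a grow-only counter: one slot per node, and a merge that takes
// the element-wise maximum, so syncing is deterministic and order-independent.
type GCounter struct {
	counts map[string]int // node ID -> number of increments seen on that node
}

func NewGCounter() *GCounter {
	return &GCounter{counts: map[string]int{}}
}

// Increment records a local update; a node only ever bumps its own slot.
func (c *GCounter) Increment(nodeID string) {
	c.counts[nodeID]++
}

// Merge folds another replica's state into this one (element-wise max).
// The operation is commutative, associative, and idempotent, which is what
// guarantees strong eventual consistency.
func (c *GCounter) Merge(other *GCounter) {
	for node, n := range other.counts {
		if n > c.counts[node] {
			c.counts[node] = n
		}
	}
}

// Value is the total across all slots.
func (c *GCounter) Value() int {
	total := 0
	for _, n := range c.counts {
		total += n
	}
	return total
}
```

In the example above, node 1 would call Increment for its own ID, the replicas would periodically Merge each other's state, and Value would settle at 6 on both. A PN-Counter is then just two of these, one for likes and one for dislikes, with the value being the difference.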
We could also think about other applications for CRDTs, like:

- Edge & IoT: Devices update offline and merge later without a central server.
- Peer-to-peer: Peers share changes directly and match up when they reconnect.
- CDN/edge state: Keep preferences, drafts, or counters near users and sync to the origin later.

There are two main types of CRDTs:

- State-based CRDTs: Convergence happens by propagating the full state.
- Operation-based CRDTs: Convergence happens by propagating the update operations.

In the previous examples, we looked at two state-based CRDTs: the G-Counter (Grow-Only Counter) and the PN-Counter (Positive-Negative Counter). In both cases, what was exchanged between the nodes was the entire state. For example, node 1 could tell node 2 that its total number of likes is 3. With state-based CRDTs, states are merged with a function that must be:

- Commutative: We can merge in any order and get the same result.
- Idempotent: Merging something with itself doesn't change it.
- Associative: We can merge in any grouping and get the same result.

Each synchronization monotonically increases the internal state. In other words, when two replicas sync, the state can only move forward, never backward. This is enforced by a simple "can't-go-backwards" rule (a partial order), where merges use operations like max for numbers (as we've seen) or union for sets.

In operation-based CRDTs, nodes share the operations rather than the full state. Convergence relies on three properties:

- Commutativity of concurrent operations
- Causality: Either carried in the operations' metadata (for example, vector clocks) or guaranteed by the transport layer through causal delivery
- Duplicate tolerance: Handled by idempotent operations, unique operation IDs with deduplication, or a transport layer that guarantees no duplicates

One example of an operation-based CRDT is the LWW-Register (Last-Writer-Wins Register), which stores a single value. Updates are resolved using a logical timestamp (such as Lamport clocks) along with a tie-breaker like the node ID. When a node writes a value, it broadcasts an operation carrying the new value and its timestamp. On receiving it, a node applies the update if the (timestamp, node ID) pair is greater than the one it currently holds.

To summarize:

State-based CRDTs:

- Convergence is guaranteed because merging states is associative, commutative, and idempotent.
- Don't require assumptions on the delivery layer beyond eventual delivery.
- Simpler to reason about.
- Exchanging full states can be more bandwidth-intensive.

Operation-based CRDTs:

- More bandwidth-efficient; we only send the operations, not the whole state.
- Correctness usually depends on having causal order (or encoding causality in the ops) and tolerating duplicates via idempotence/dedup.
- More complex to implement (causal broadcast, vector clocks, or equivalent).

For completeness, there's also a third type we should be aware of: delta-based CRDTs. Here, convergence is achieved by sending and merging fragments of state (deltas) rather than the entire state. A quick analogy to picture the differences:

- State-based CRDT: "From time to time, send me the whole document."
- Operation-based CRDT: "When you make a change, tell me exactly what you did." → "Adding word `miles` at position 42."
- Delta-based CRDT: "When you make a change, send me just the delta that reflects it (for example, the updated sentence)." → "And miles to go before I sleep."

We talked about collaborative document editing. So you might assume a system like Google Docs is based on CRDTs, right? Well, that's not the case.
Google Docs is based on another concept called OT (Operational Transformation). The goal of OT and CRDT is the same: convergence among all nodes in a collaborative system. The main difference is that OT requires all communication to go through the same server. We haven't mentioned it until now (on purpose), but with CRDTs, there's no need for a central server to achieve convergence. Back to our collaborative editing tool: if Alice and Bob are both offline but manage to connect their laptops directly, they could still achieve convergence without talking to a central server. As we saw earlier, CRDTs embed a deterministic conflict resolution algorithm. The data type itself ensures convergence. That's the key difference: CRDTs don't need to make any assumptions about the network topology or about a central server. Some consider CRDT to be the natural successor of OT.

NOTE: So, why is Google Docs still based on OT? Historical reasons. Google Docs was launched before CRDTs existed, and it still works really well. There's no practical reason for Google to migrate from OT to CRDT, despite some discussions about it in the past.

Conclusion:

- Operations are concurrent when they aren't causally related; concurrency doesn't automatically mean conflict.
- Coordination is when replicas communicate and, if needed, wait to agree on a single outcome for concurrent updates before acknowledging clients, so they don't diverge.
- CRDTs accept independent updates on each replica and still converge via deterministic merge rules.
- Three types: state-based (share full state), operation-based (share operations), delta-based (share just the changed parts).
- CRDTs are a great fit for systems like offline-first collaboration and highly available systems.
- Unlike OT, CRDTs don't rely on a central server to reach the same result everywhere.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.

- Exploring Database Isolation Levels
- Safety and Liveness
- Ivan Zhao (Notion's CEO) tweet on the new Notion offline collaboration feature
- Diving into Conflict-Free Replicated Data Types (CRDTs) - Redis
- CRDTs: The Hard Parts (and the Hacker News discussion)
- Peritext - A CRDT for Rich-Text Collaboration
- Active-Active geo-distribution (CRDTs-based) - Redis
- Bartosz Sypytkowski's 12-part blog series on CRDT

❤️ If you enjoyed this post, please hit the like button. 💬 Have you worked with CRDTs before, or do you see another use case where they shine? Share your thoughts in the comments!

Filippo Valsorda 4 weeks ago

A Retrospective Survey of 2024/2025 Open Source Supply Chain Compromises

Lack of memory safety is such a predominant cause of security issues that we have a responsibility as professional software engineers to robustly mitigate it in security-sensitive use cases—by using memory safe languages. Similarly, I have the growing impression that software supply chain compromises have a few predominant causes which we might have a responsibility as professional open source maintainers to robustly mitigate. To test this impression and figure out any such mitigations, I collected all 2024/2025 open source supply chain compromises I could find, and categorized their root cause. (If you find more, do email me!)

Since I am interested in mitigations we can apply as maintainers of depended-upon projects to avoid compromises, I am ignoring: intentionally malicious packages (e.g. typosquatting), issues in package managers (e.g. internal name shadowing), open source infrastructure abuse (e.g. using package registries for post-compromise exfiltration), and isolated app compromises (i.e. not software that is depended upon). Also, I am specifically interested in how an attacker got their first unauthorized access, not in what they did with it. Annoyingly, there is usually a lot more written about the latter than the former. In no particular order, but kind of grouped.

XZ Utils: Long term pressure campaign on the maintainer to hand over access. Root cause: control handoff. Contributing factor: non-reproducible release artifacts.

Nx S1ingularity: Shell injection in a GitHub Action with a pull_request_target trigger and unnecessary read/write permissions 1, used to extract an npm token. Root cause: pull_request_target. Contributing factors: read/write CI permissions, long-lived credential exfiltration, post-install scripts.

Shai-Hulud: Worm behavior by using compromised npm tokens to publish packages with malicious post-install scripts, and compromised GitHub tokens to publish malicious GitHub Actions workflows. Root cause: long-lived credential exfiltration. Contributing factor: post-install scripts.

npm debug/chalk/color: Maintainer phished with an "Update 2FA Now" email. Had TOTP 2FA enabled. Root cause: phishing.

polyfill.io: Attacker purchased the CDN domain name and GitHub organization. Root cause: control handoff.

MavenGate: Expired domains and changed GitHub usernames resurrected to take control of connected packages. Root causes: domain resurrection, username resurrection.

reviewdog and tj-actions/changed-files: Contributors deliberately granted automatic write access to a GitHub Action repository 2. Malicious tag re-published to compromise the GitHub PAT of a more popular GitHub Action 3. Root cause: control handoff. Contributing factors: read/write CI permissions, long-lived credential exfiltration, mutable GitHub Actions tags.

Ultralytics: Shell injection in a GitHub Action with a pull_request_target trigger (which required read/write permissions), pivoted to the publishing pipeline via GitHub Actions cache poisoning. Compromised again later using an exfiltrated PyPI token. Root cause: pull_request_target. Contributing factors: GitHub Actions cache poisoning, long-lived credential exfiltration.

Kong Ingress Controller: GitHub Action with a pull_request_target trigger restricted to trusted users but bypassed via Dependabot impersonation 4, previously patched but still available on an old branch. GitHub PAT exfiltrated and used. Root causes: pull_request_target, Dependabot impersonation. Contributing factors: per-branch CI configuration, long-lived credential exfiltration.
Rspack: Pwn request 5 against a workflow 6 in another project, leading to a GitHub classic token of a maintainer with permissions to the web-infra-dev organization 7 (kindly confirmed via email by the Rspack Team). Similar to a previously reported and fixed vulnerability 8 in the Rspack repository. Root cause: issue_comment. Contributing factor: long-lived credential exfiltration.

eslint-config-prettier: "Verify your account" 9 npm phishing. Root cause: phishing.

num2words: "Email verification" PyPI phishing. Root cause: phishing.

@solana/web3.js: A "phishing attack on the credentials for publishing npm packages." Root cause: phishing.

rustfoundation.dev: Fake compromise remediation 10 Crates.io phishing. Unclear if successful. Root cause: phishing.

React Native ARIA & gluestack-ui: "[U]nauthorized access to publishing credentials." A colorful and long Incident Report lacks any details on the "sophisticated" entry point. Presumably an exposed npm token. Root cause: long-lived credential exfiltration(?).

lottie-player: Unclear, but mitigation involved "remov[ing] all access and associated tokens/services accounts of the impacted developer." Root cause: long-lived credential exfiltration(?) or control handoff(?).

rand-user-agent: Unclear. Malicious npm versions published; the affected company seems to have deleted the project. Presumably npm token compromise. Root cause: long-lived credential exfiltration(?).

DogWifTool: GitHub token extracted from a distributed binary. Root cause: long-lived credential exfiltration.

Surprising no one, the most popular confirmed initial compromise vector is phishing. It works against technical open source maintainers. It works against 2FA TOTP. It. Works. It is also very fixable. It's 2025 and every professional open source maintainer should be using phishing-resistant authentication (passkeys or WebAuthn 2FA) on all developer accounts, and accounts upstream of them. Upstream accounts include email, password manager, passkey sync (e.g. Apple iCloud), web/DNS hosting, and domain registrar. Some services, such as GitHub, require a phishable 2FA method along with phishing-resistant ones. In that case, the best option is to enable TOTP, and delete the secret or write it down somewhere safe and never ever use it—effectively disabling it. This does not work with SMS, since SIM jacking is possible even without action by the victim.

Actually surprisingly—to me—a number of compromises are due to, effectively, giving access to the attacker. This is a nuanced people issue. The solution is obviously "don't do that" but that really reduces to the decades-old issue of open source maintenance sustainability. In a sense, since this analysis is aimed at professional maintainers who can afford it, control handoff is easily avoided by not doing it.

Kind of incredible that a specific feature has a top 3 spot, but projects get compromised by "pwn requests" all the time. The pull_request_target trigger runs privileged CI with a context full of attacker-controlled data in response to pull requests. It makes a meek attempt to be safer by not checking out the attacker's code, instead checking out the upstream target. That's empirically not enough, with shell injection attacks causing multiple severe compromises. The zizmor static analyzer can help detect injection vulnerabilities, but it seems clear that pull_request_target is unsafe at any speed, and should just never be used. Other triggers that run privileged with attacker-controlled context should be avoided for the same reason.
The Rspack compromise, for example, was due to checking out attacker-controlled code on an issue_comment trigger if the PR receives a comment. What are the alternatives? One option is to implement an external service in a language that can safely deal with untrusted inputs (i.e. not YAML'd shell), and use webhooks. That unfortunately requires long-lived credentials (see below). GitHub itself recommends using the unprivileged pull_request trigger followed by the privileged workflow_run trigger, but it's unclear to me how much safer that would actually be against injection attacks. Finally, since two out of three compromises were due to shell injection, it might be safer to use a proper programming language, like JavaScript with actions/github-script, or any other language accessing the context via environment variables instead of YAML interpolation. This means not using any third-party actions, as well. Allowlisting actors and read-only steps are not robust mitigations, see Read/write CI permissions and Dependabot impersonation below. Overall, none of the mitigations are particularly satisfactory, so the solution might be simply to eschew features that require pull_request_target and other privileged attacker-controlled triggers. (To be honest, I am not a fan of chatty bots on issues and PRs, so I never needed them.)

Attackers love to steal tokens. There is no universal solution, but it's so predominant that we can consider piecemeal solutions. Long-lived credentials are only a root cause when they are accidentally exposed. Otherwise, they are a secondary compromise mechanism for lateral movement or persistence, after the attacker got privileged code execution. Mitigating the latter is somewhat less appealing because an attacker with code execution can find more creative ways to carry out an attack, but we can prune some low-hanging fruit. Go removes the need for package registry tokens by simply not having accounts. (Instead, the go command fetches modules directly from VCS, with caching by the Go Modules Proxy and universality and immutability guaranteed by the Go Checksum Database.) In other ecosystems, Trusted Publishing replaces long-lived private tokens with short-lived OIDC tokens, although there is no way to down-scope the capabilities of an OIDC token. GitHub Personal Access Tokens are harder to avoid for anything that's not supported by GitHub Actions permissions. Chainguard has a third-party Security Token Service that trades OIDC tokens for short-lived tokens, and their article has a good list of cases in which PATs end up otherwise necessary. Given the risk, it might be worth giving up on non-critical features that would require powerful tokens. Gerrit "git cookies" (which are actually just OAuth refresh tokens for the Gerrit app) can be replaced with… well, OAuth refresh tokens but kept in memory instead of disk, using git-credential-oauth. They can also be stored a little more safely in the platform keychain by treating them as an HTTP password, although that's not well documented. In the long term, it would be great to see the equivalent of Device Bound Session Credentials for developer and automated workflows.

Turns out you can just exfiltrate a token from a GitHub Actions runner to impersonate Dependabot with arbitrary PRs ??? I guess! Fine! Just don't allowlist Dependabot. Not sure what a deeper meta-mitigation that didn't require knowing this factoid would have been. This is also a social engineering risk, so I guess just turn off Dependabot?
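Going back to the earlier point about handling the workflow context in a real programming language rather than through YAML interpolation, here is a rough sketch (not from the post) of a CI step written in Go. It reads the event payload from the GITHUB_EVENT_PATH file that the Actions runner provides, so attacker-controlled strings are handled as data and never pass through a shell. The title-prefix check is an arbitrary example.

```go
// Sketch of a CI step that consumes attacker-controlled context safely:
// the event payload is read from the file GitHub exposes via
// GITHUB_EVENT_PATH and treated as data, never interpolated into a shell.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strings"
)

type event struct {
	PullRequest struct {
		Title string `json:"title"`
		Head  struct {
			Ref string `json:"ref"`
		} `json:"head"`
	} `json:"pull_request"`
}

func main() {
	path := os.Getenv("GITHUB_EVENT_PATH") // set by the Actions runner
	raw, err := os.ReadFile(path)
	if err != nil {
		log.Fatalf("read event payload: %v", err)
	}
	var ev event
	if err := json.Unmarshal(raw, &ev); err != nil {
		log.Fatalf("parse event payload: %v", err)
	}
	// Attacker-controlled strings are just values here; no shell ever sees them.
	title := ev.PullRequest.Title
	if !strings.HasPrefix(title, "feat:") && !strings.HasPrefix(title, "fix:") {
		fmt.Println("::error::PR title must start with feat: or fix:")
		os.Exit(1)
	}
	fmt.Println("title check passed for branch", ev.PullRequest.Head.Ref)
}
```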
Multiple ecosystems (Go and Maven, for example) are vulnerable to name takeovers, whether expired domain names or changed GitHub user/org names. The new owner of the name gets to publish updates for that package. From the point of view of the maintainer, the mitigation is just not to change GitHub names (at least without registering the old one), and to register critical domains for a long period, with expiration alerting.

Some CI compromises happened in contexts that could or should have been read-only. It sounds like giving GitHub Actions workflows only read permissions should be a robust mitigation for any compromise of the code they run. Unfortunately, and kind of incredibly, even a read-only workflow is handed a token that can write to the cross-workflow cache for any key. This cache is then used implicitly by a number of official actions, allowing cross-workflow escalation by GitHub Actions cache poisoning. This contradicts some of GitHub's own recommendations, and makes the existence of a setting to make GitHub Actions read-only by default more misleading than useful. The behavior does not extend to regular pull_request triggers, which are actually read-only (otherwise anyone could poison caches with a PR). GitHub simply doesn't seem to offer a way to opt in to it. I can see no robust mitigation in the GitHub ecosystem. I would love to be wrong, this is maddening.

Two compromises propagated by injecting npm post-install scripts, to obtain code execution as soon as a dependency was installed. This can be disabled in npm's configuration (the ignore-scripts setting), which is worth doing for defense in depth. However, it's only useful if the dependency is not going to be executed in a privileged context, e.g. to run tests in Node.js. Go, unlike most ecosystems, considers code execution during fetch or compilation to be a security vulnerability, so it has this safety margin by default.

The XZ backdoor was hidden in a release artifact that didn't match the repository source. It would be great if that was more detectable, in the form of reproducible artifacts. The road to a fail-closed world where systems automatically detect non-reproducing artifacts is still long, though.

How supply chain attacks usually work these days is that an attacker gets the ability to publish new versions for a package, publishes a malicious version, and waits for dependents to update (maybe with the help of Dependabot) or install the latest version ex novo. Not with GitHub Actions! The recommended and most common way to refer to a GitHub Action is by its major version, which is resolved to a git tag that is expected to change arbitrarily when new versions are published. This means that an attacker can instantly compromise every dependent workflow. This was an unforced error already in 2019, when GitHub Actions launched while Go had already shipped an immutable package system. This has been discussed many times since and most other ecosystems have improved somewhat. A roadmap item for immutable Actions has been silent since 2022. The new immutable releases feature doesn't apply to non-release tags, and the GitHub docs still recommend changing tags for Actions. As maintainers, we can opt in to pinning where it's somehow still not the default. For GitHub Actions, that means using unreadable commit hashes, which can be somewhat ameliorated with tooling. For npm, it means pinning exact versions rather than floating ranges.

One compromise was due to a vulnerability that was already fixed, but had persisted on an old branch.
Any time we make a security improvement (including patching a vulnerable Action) on a GitHub Actions workflow, we need to remember to cherry-pick it to all branches, including stale ones. Can't think of a good mitigation, just yet another sharp edge of GitHub Actions you need to be aware of, I suppose.

There are a number of useful mitigations, but the ones that appear to be as clearly a professional responsibility as memory safety are phishing-resistant authentication; not handing over access to attackers; and avoiding privileged attacker-controlled GitHub Actions triggers (e.g. pull_request_target). This research was part of an effort to compile a Geomys Standard of Care that amongst other things mitigates the most common security risks to the projects we are entrusted with. We will publish and implement it soon; to keep up to date, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] .

On Saturday, between 250,000 and 1,000,000 people (depending on who you believe, 0.4–1.7% of the whole population of Italy) took part in a demonstration against the genocide unfolding in Gaza. Anyway, here's a picture of the Archbasilica of San Giovanni in Laterano at the end of the march.

My work is made possible by Geomys, an organization of professional Go maintainers, which is funded by Smallstep, Ava Labs, Teleport, Tailscale, and Sentry. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.) Here are a few words from some of them!

Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews.

Ava Labs — We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

1. https://github.com/nrwl/nx/security/advisories/GHSA-cxm3-wv7p-598c#:~:text=20%20AM%20EDT-,Attack%20Vector,-Vulnerable%20Workflow
2. https://github.com/reviewdog/reviewdog/issues/2079
3. https://github.com/tj-actions/changed-files/issues/2464#issuecomment-2727020537
4. https://www.synacktiv.com/publications/github-actions-exploitation-dependabot
5. https://github.com/module-federation/core/pull/3324
6. https://github.com/module-federation/core/tree/c3aff14a4b9de2588122ec24cf456dc1fdd742f0/.github/workflows
7. https://github.com/web-infra-dev/rspack/issues/8767#issuecomment-2563345582
8. https://www.praetorian.com/blog/compromising-bytedances-rspack-github-actions-vulnerabilities/
9. https://github.com/prettier/eslint-config-prettier/issues/339#issuecomment-3090304490
10. https://github.com/rust-lang/crates.io/discussions/11889#discussion-8886064

flak 1 month ago

backporting go on openbsd

The OpenBSD ports tree generally tracks current, but sometimes backports (and stable packages) are made for more serious issues. As was the case for git 2.50.1. However, the go port has not seen a backport in quite some time. The OpenBSD release schedule aligns with the go schedule such that we always get the latest release, but not minor revisions. For instance, OpenBSD 7.7 shipped with go 1.24.1, but there are a few minor revisions after that. We maybe don't care about many of these backports, but issue 73570 is a backported fix for a bug specific to OpenBSD, so let's say we want that.

I always forget the procedures for building ports from scratch and waste a bunch of time running and cancelling and rerunning commands. So here's a recipe that worked. If we don't have the ports tree, we need to get that. If we don't have bash, we need to install that. (There's a magic formula to have ports install packages, but this is simpler.) We need to update the go port to a suitable revision. The port is currently on 1.25, but I'd rather stick with 1.24, so we go back a little ways. The OpenBSD port was never updated for 1.24.7, but those changes don't look very exciting. Maybe next time I'll try a custom update to a new version. We build the port. The bootstrap flavor is important, or we'll end up building it twice. Tick, tock, ding, ding. Running the next step will build and install a package. Check. Looks good.

redux

What if we want a version that's not in ports? I figured this post would be pretty boring, but go just released 1.24.8, which includes security fixes I'd like, so now we definitely need to try building a new version. Let's edit the Makefile. Now tell the ports system to download the new version and update the checksum. This downloads the new version and prints its checksum. Okay? Well, the go downloads page shows checksums in hex, but we can redo it to check. Looks good. Now run the build and install steps again. Uh oh. Fucking FIPS, every fucking time. I don't want to think too much about what this is doing, but the file has been renamed, so we need to update the pkg/PLIST file. Hopefully this is an aberration, as the go team is usually conservative about backporting changes, but one never knows what to expect. Now the package builds and installs correctly. And then rebuild everything that uses go.

Michael Lynch 1 month ago

Refactoring English: Month 10

Hi, I’m Michael. I’m a software developer and founder of small, indie tech businesses. I’m currently working on a book called Refactoring English: Effective Writing for Software Developers . Every month, I publish a retrospective like this one to share how things are going with my book and my professional life overall. At the start of each month, I declare what I’d like to accomplish. Here’s how I did against those goals: I did complete this successfully, but I spent too long on the post and felt somewhat underwhelmed with my final result. I wrote a first draft of a new chapter but didn’t publish it. I ended up spending more time than I planned on “The Software Essays that Shaped Me” and freelance editing clients. I was going to write this off and say that I’m not learning anything new anymore by reaching out to customers. Then, a few days ago, I heard back from a reader I’d reached out to who said he used what he learned from my book to get an article on the front page of Hacker News for the first time. So, that was pretty indisputably valuable and tells me I should be doing more of this. I brainstorm more about this below . September had a nice bump in website visitors and pre-orders. I’d like to get to the point where there’s a virtuous cycle of readers referring other readers, but I don’t think I’m there yet. Still, nice to make almost $1k for the month. In baseball, a bunt is when you hold the bat in the ball’s path rather than swinging the bat. The upside is that you’re less likely to miss, but the downside is that you won’t hit the ball very far. The best you can hope for with a bunt is making it to first base, but a bunt is almost never going to be a home run. Most of my blog posts are “swing for the fences” posts. I put in a lot of effort because I want to reach #1 on Hacker News, reddit, or search results. The problem is that my “swing for the fences” posts take me about a month to write, so if I’m publishing blog posts as I write my book, I’d have to put my book on hold for a month every time I write a blog post. I’ve been thinking about whether I could do some “bunt” posts instead. That way, I can only put my book on hold for a week rather than the whole month. I don’t want to take a topic that deserves a lot of care and just do a lazy version of it. Rather, I want to take a topic that’s easy to cover and just see how it does. My first bunt was, “I Once Appeared in The Old New Thing.” It was about an experience I had at 22 at my first real job. I didn’t have a lot of insightful things to say about it, but I thought it was an interesting story. I was able to write it in about four hours, and it felt complete for what it was. My next bunt was, “The Software Essays that Shaped Me.” I’ve seen other people share lists of their favorite software blog posts, and I thought it would be an easy, fun thing to do. Best of all, the people who appreciate good software writing might also find my book interesting. As I started to write “The Software Essays that Shaped Me,” it turned into more than just a bunt. I ended up spending almost all of September on it. I originally thought I’d list my favorite blog posts and call it a day, but that felt too boring. So, I tried to include short commentary about each post. Then, I got carried away and ended up writing commentary that was longer than the originals themselves. It took me several drafts to figure out what commentary felt interesting, and I still don’t feel like I quite succeeded. 
I ended up spending 17 hours on “The Software Essays that Shaped Me” and never stopped to evaluate whether it was still worth writing if it was going to be all that work. I think the post is interesting to people who read my blog. If someone I knew published a list of articles that influenced them, I’d find that interesting. But in comment threads about the post, people shared their own lists, and I found strangers’ lists totally uninteresting. Maybe I counteracted that some by investing a lot in my commentary, but I just don’t think a list of good blog posts can be all that interesting. Both posts did well. They both reached the front page of Hacker News, though they did it through the second chance pool , which feels a little like winning through TKO rather than a real knockout. It’s interesting that the results scaled almost linearly with the effort I invested, which I typically don’t find to be the case . Previously, when one of my Refactoring English posts did well on Hacker News, there was a noticeable uptick in readers purchasing the book . This time, “The Software Essays that Shaped Me” reached #2 and stayed on the front page for 11 hours, but only one person purchased. Maybe everyone seeing my post on Hacker News has already seen that I’m writing a book, so everyone who’s interested has already bought? I woke up the morning after my article had already fallen off the front page of Hacker News and suddenly realized: I never included the ad for the book! All the sample chapters on the book’s website include a little self-ad to tell the reader I’m writing a book on this topic, and they can buy early access. All the pages on the Refactoring English website are supposed to have a little self-ad on them for the book. I forgot to include the self-ad for the blog post, so the first 14k readers saw my post and had no idea I’m writing a book. D’oh! I’ve updated my blog template so that I can’t possibly forget to include the self-ad in the future. A few months ago, I decided to offer freelance editing services to help other developers improve writing on their blogs. My idea was that it’s an opportunity to make sure the way I explain concepts in my book makes sense to real people. The downside is that there’s a high cost to the editing. Each job takes me between four to seven hours, and it eats up my “hard thinking” of the day, so it’s tough to do my own writing in the same day. I also feel pressure to offer quick turnaround, even though nobody has asked me to hurry. But just knowing my own writing process, it sucks to be stuck for days waiting on feedback. At the beginning, freelance editing worked as I planned: it gave me good ideas for my book. As I do more jobs, I’m getting fewer ideas for my book. Now, most of the feedback I write is basically writing a personalized version of something I’ve already written for my book. I want to keep doing the editing, but only for authors who have read my book. I doubled my rates, so now my price for editing a blog post is $400. But I’m going to offer a 90% discount to readers who have read my book. At a 90% discount, it’s almost not worth charging at all, but I want clients to pay some amount so that they feel like they have skin in the game, too. I’ll continue to take on clients who haven’t read the book, but I want to charge enough that I feel like it’s worth the tradeoff of taking time from my book. $400 might still be too low, but we’ll see. I’m trying to figure out why I keep missing my goal of reader outreach. 
On its face, it doesn’t seem that hard, but it never seems like the most important thing, so I keep deferring it. There are other tasks I procrastinate because I don’t enjoy doing them, but I actually enjoy reaching out to readers. It’s fun to see what different readers are up to and how they might apply my techniques. Part of the issue is that emailing readers requires activation energy because I have to: It might help if I first gather a list of customers to email and their websites. That way, when I’m in the mood to reach out, I’m not starting from scratch every time. A few Refactoring English customers have emailed me confused because they paid but never got an email with a link to the book. I collect payment through Stripe, and Stripe redirects customers to the book’s URL after they complete payment. If the customer doesn’t notice the redirect or forgets to bookmark the page, they lose access to the book. Whenever customers tell me they can’t find the link to the book, I dig around in Stripe to look for a setting to customize post-purchase emails, give up after a few minutes, and then email the correct link to the customer. Last month, I finally sat down and searched through Stripe’s documentation and forum posts, and I can’t find any way to customize the email Stripe sends after a customer completes a one-time payment. As far as I can tell, the only option is to spin up your own web server to listen for Stripe webhooks, then send your own emails from your own email provider. All because Stripe can’t be bothered to let merchants customize any text in the payment completion emails… Setting up a web server to respond to webhooks shouldn’t be that hard for me, but it means writing code to glue together Stripe, Buttondown, and Netlify functions, and they all have their little gotchas and bugs. Especially Stripe. I’ve spent about 10 hours so far just trying to get emails to send after a customer makes a purchase, and I’m still not sure it’s working correctly. Here are the gotchas I’ve hit so far: I’m still tinkering with Hacker News Observer, a product that I still haven’t released and don’t know what to do with. For now, I’m just gathering data and using it to satisfy some curiosities about success on Hacker News. One curiosity I’ve had for a long time is whether there are times of day when it’s easier for a post to reach the front page of Hacker News, so I aggregated what percentage of posts reach the front page over the course of a day: I created a view in Hacker News observer to show front page stats by hour I initially thought I had a bug that overcounted the success rate, as the percentage of Hacker News submissions that reach the front page feels lower than 12% in my experience. Then, I looked at some random slices from the last few days, and it seems to match up. If I browse , there will typically be 2-5 stories that reached the front page. I found a 30-minute slice from a few days ago where 27% of submissions reached the front page, which is surprising. I thought that success rate would be significantly higher on the weekends, when there are fewer submissions. Weekend posts are more likely to reach the front page, but the effect is much smaller than I thought. I thought it was going to be like 5% on weekdays vs. 20% on weekends. It makes submitting on the weekend less attractive because your chances of hitting the front page are only slightly better, but if you succeed, there are substantially fewer readers. 
I’d like to try limiting the data to personal blogs like I do on HN Popularity Contest , as I’m curious to see if personal blogs have better chances at certain times. I’m experimenting with low-investment, low-payoff-style blog posts. I’m adjusting my strategy for freelance editing to work specifically with people who have read my book. My intuition was way off about the odds of reaching the front page of Hacker News. Result : Published “The Software Essays that Shaped Me” , which attracted 16k readers in the first three days Result : Didn’t publish anything new Result : Emailed two new readers Go to my list of pre-paid readers Look for ones that have a website (so I can say something personalized) Read through their website to learn more about them Write an email and word it carefully to avoid sounding AI-generated Stripe’s Go client library is compatible with exactly one version of the Stripe webhook API. No, the documentation doesn’t say which one. Run it and find out from the webhook failures! If you update your Stripe account to use the latest webhook API version and then resend a webhook for a previous event, Stripe still uses the old API version even though it claims to use the new version. Netlify silently converts HTTP header names to lowercase, so if you’re looking for the header, you have to look for . Instead of a normal v2 Go module , Stripe for some reason decided to make every package upgrade a source change as well, so when I upgrade from v83 to v84, I have to replace in every file that imports the Stripe package. Normally, you’d upgrade the version in one place without affecting imports. The Stripe webhook signing secret is different from your Stripe API key. Weekdays: 12.1% of submissions reach the front page. Weekends: 13.2% of submissions reach the front page. Published “The Software Essays that Shaped Me” Published “I Once Appeared in The Old New Thing” Published “Get xkcd Cartoons at 2x Resolution” Worked with two freelance clients for Refactoring English Set up a webhook handler to send post-purchase emails to Refactoring English customers Added “success by hour of day” feature to Hacker News observer Started contributing to the Jellyfin Roku client code Had a call with AirGradient to discuss improving relations between the company and community members Consider bailing if a low-investment post turns out to be high-investment. Stripe does not allow you to customize post-purchase emails. You have to do a bunch of other stuff to send your customers an email. Set up editing discounts for readers who have read the book. Create a list of early access customers to reach out to. Publish a new chapter of the book.

0 views
Kix Panganiban 1 month ago

Python feels sucky to use now

I've been writing software for over 15 years at this point, and most of that time has been in Python. I've always been a Python fan. When I first picked it up in uni, I felt it was fluent, easy to understand, and simple to use -- at least compared to other languages I was using at the time, like Java, PHP, and C++. I've kept myself mostly up to date with "modern" Python -- think modern tooling, newer syntax features, and strict type checking almost everywhere. For the most part, I've been convinced that it's fine. But lately, I've been running into frustrations, especially with async workflows and type safety, that made me wonder if there's a better tool for some jobs. And then I had to help rewrite a service from Python to Typescript + Bun. I'd stayed mostly detached from Typescript before, only dabbling in non-critical path code, but oh, what a different and truly joyful world it turned out to be to write code in. Here are some of my key observations:

Bun is fast. It builds fast -- including installing new dependencies -- and runs fast, whether we're talking runtime performance or the direct loading of TS files. Bun's speed comes from its use of JavaScriptCore instead of V8, which cuts down on overhead, and its native bundler and package manager are written in Zig, making dependency resolution and builds lightning-quick compared to the usual Node.js tooling or even Python's newer package managers. When I'm iterating on a project, shaving off seconds (or minutes) on installs and builds is a game-changer -- no more waiting around for dependencies to resolve or virtual envs to spin up. And at runtime, Bun directly executes Typescript without a separate compilation step. This just feels like a breath of fresh air for developer productivity.

Type annotations and type-checking in Python still feel like mere suggestions, whereas they're fundamental in Typescript. This is especially true when defining interfaces or using inheritance -- compared to ABCs (Abstract Base Classes) and Protocols in Python, which can feel clunky. In Typescript, type definitions are baked into the language -- I can define an interface or type with precise control over the shapes of data, and the compiler catches mismatches while I'm writing (provided that I've enabled it in my editor). The compiler and surrounding tooling enforce this rigorously. In Python, even with a strict type checker, type hints are optional and often ignored by the runtime, leading to errors that only surface when the code runs. Plus, Python's approach to interfaces via ABCs or Protocols feels verbose and less intuitive -- while Typescript's type system feels like a better mental model for reasoning about code.

About 99% of web-related code is async. Async is first-class in Typescript and Bun, while it's still a mess in Python. Sure -- Python's asyncio and the list of packages supporting it have grown, but it often feels forced and riddled with gotchas and pitfalls. In Typescript, async/await is a core language feature, seamlessly integrated with the event loop in environments like Node.js or Bun. Promises are a natural part of the ecosystem, and most libraries are built with async in mind from the ground up. Compare that to Python, where async/await was bolted on later (introduced in 3.5), and the ecosystem (in 2025!) is still only slowly catching up. I've run into issues with libraries that don't play nicely with asyncio, forcing me to mix synchronous and asynchronous code in awkward ways.

This experience has me rethinking how I approach projects. While I'm not abandoning Python -- it's still my go-to for many things -- I'm excited to explore more of what Typescript and Bun have to offer.
It’s like discovering a new favorite tool in the shed, and I can’t wait to see what I build with it next. Bun is fast . It builds fast -- including installing new dependencies -- and runs fast, whether we're talking runtime performance or the direct loading of TS files. Bun's speed comes from its use of JavaScriptCore instead of V8, which cuts down on overhead, and its native bundler and package manager are written in Zig, making dependency resolution and builds lightning-quick compared to or even Python’s with . When I’m iterating on a project, shaving off seconds (or minutes) on installs and builds is a game-changer -- no more waiting around for to resolve or virtual envs to spin up. And at runtime, Bun directly executes Typescript without a separate compilation step. This just feels like a breath of fresh air for developer productivity. Type annotations and type-checking in Python still feel like mere suggestions, whereas they're fundamental in Typescript . This is especially true when defining interfaces or using inheritance -- compared to ABCs (Abstract Base Classes) and Protocols in Python, which can feel clunky. In Typescript, type definitions are baked into the language - I can define an or with precise control over shapes of data, and the compiler catches mismatches while I'm writing (provided that I've enabled it on my editor). Tools like enforce this rigorously. In Python, even with strict , type hints are optional and often ignored by the runtime, leading to errors that only surface when the code runs. Plus, Python’s approach to interfaces via or feels verbose and less intuitive -- while Typescript’s type system feels like better mental model for reasoning about code. About 99% of web-related code is async. Async is first-class in Typescript and Bun, while it’s still a mess in Python . Sure -- Python's and the list of packages supporting it have grown, but it often feels forced and riddled with gotchas and pitfalls. In Typescript, / is a core language feature, seamlessly integrated with the event loop in environments like Node.js or Bun. Promises are a natural part of the ecosystem, and most libraries are built with async in mind from the ground up. Compare that to Python, where was bolted on later (introduced in 3.5), and the ecosystem (in 2025!) is still only slowly catching up. I’ve run into issues with libraries that don’t play nicely with , forcing me to mix synchronous and asynchronous code in awkward ways. Sub-point: Many Python patterns still push for workers and message queues -- think RQ and Celery -- when a simple async function in Typescript could handle the same task with less overhead. In Python, if I need to handle background tasks or I/O-bound operations, the go-to solution often involves spinning up a separate worker process with something like Celery, backed by a broker like Redis or RabbitMQ. This adds complexity -- now I’m managing infrastructure, debugging message serialization, and dealing with potential failures in the queue. In Typescript with Bun, I can often just write an function, maybe wrap it in a or use a lightweight library like if I need queuing, and call it a day. For a recent project, I replaced a Celery-based task system with a simple async setup in Typescript, cutting down deployment complexity and reducing latency since there’s no broker middleman. It’s not that Python can’t do async -- it’s that the cultural and technical patterns around it often lead to over-engineering for problems that Typescript, in my opinion, solves more elegantly.

0 views
Karan Sharma 1 month ago

State of My Homelab 2025

For the past five years, I have maintained a homelab in various configurations. This journey has served as a practical exploration of different technologies, from Raspberry Pi clusters running K3s to a hybrid cloud setup and eventually a cloud-based Nomad setup. Each iteration provided valuable lessons, consistently highlighting the operational benefits of simplicity. This article details the current state of my homelab. A primary motivation for this build was to dip my toes into "actual" homelabbing—that is, maintaining a physical server at home. The main design goal was to build a dedicated, reliable, and performant server that is easy to maintain. This led me to move away from complex container orchestrators like Kubernetes in favor of a more straightforward Docker Compose workflow. I will cover the hardware build, software architecture, and the rationale behind the key decisions.

After considerable research, I selected components to balance performance, power efficiency, and cost. The server is designed for 24/7 operation in a home environment, making noise and power consumption important considerations.

My previous setups involved Kubernetes and Nomad, but the operational overhead proved unnecessary for my use case. I have since standardized on a Git-based, Docker Compose workflow that prioritizes simplicity and transparency. The core of the system is a Git repository that holds all configurations. Each service is defined as a self-contained "stack" in its own directory, and the structure is organized by machine, making it easy to manage multiple environments. This modular approach allows me to manage each application's configuration, including its Compose file and any related files, as an independent unit. Deployments are handled by a custom script, with a command runner providing a convenient interface. The process is fundamentally simple and boils down to the Sync and Execute steps described below. Each machine's connection settings (host, user, and so on) are defined in its own config file, which can also contain hooks for custom actions. The command runner makes daily operations trivial, and the system provides fine-grained control over deployments, with support for actions like deploying, restarting, and tearing down a stack (the last of which also removes persistent volumes). To keep the system consistent, I follow a few key patterns, detailed later in this post.

The homelab comprises three distinct machines to provide isolation and redundancy. This distributed setup isolates my home network from the public internet and ensures that critical public services remain online even if the home server is down for maintenance. The machines and the services, or "stacks," running on each are listed at the end of this post. A few key services that are central to the homelab are detailed further in the next section.

I came across Technitium DNS after seeing a recommendation from @oddtazz, and it has been a revelation. For anyone who wants more than just basic ad blocking from their DNS server, it's a game-changer. It serves as both a recursive and authoritative server, meaning I don't need a separate recursive resolver to resolve from root hints. The level of configuration is incredible—from DNSSEC, custom zones, and SOA records to fine-grained caching control. The UI is a bit dated, but that's a minor point for me given the raw power it provides. It is a vastly underrated tool for any homelabber who wants to go beyond Pi-hole or AdGuard Home.

For a long time, I felt that monitoring a homelab meant spinning up a full Prometheus and Grafana stack. Beszel is the perfect antidote to that complexity.
It provides exactly what I need for basic node monitoring—CPU, memory, disk, and network usage—in a simple, lightweight package. It's incredibly easy to set up and provides a clean, real-time view of my servers without the overhead of a more complex system. For a simple homelab monitoring setup, it's hard to beat.

While Beszel monitors the servers from the inside, Gatus watches them from the outside. Running on an independent Hetzner VM, its job is to ensure my services are reachable from the public internet. It validates HTTP status codes, response times, and more. This separation is crucial; if my entire home network goes down, Gatus is still online to send an alert to my phone. It's the final piece of the puzzle for robust monitoring, ensuring I know when things are broken even if the monitoring service itself is part of the outage.

Data integrity and recoverability are critical. My strategy is built on layers of redundancy and encryption. I chose BTRFS for its modern features (checksumming, copy-on-write, and transparent compression, summarized at the end of this post). The two 4TB drives are mirrored in a RAID 1 array, providing redundancy against a single drive failure. The entire array is encrypted using LUKS2, with the key stored on the boot SSD for automatic mounting. This protects data at rest in case of physical theft or drive disposal. The relevant mount options are set in /etc/fstab.

RAID does not protect against accidental deletion, file corruption, or catastrophic failure. My backup strategy follows the 3-2-1 rule. Daily, automated backups are managed by systemd timers running the backup job. Backups are encrypted and sent to Cloudflare R2, providing an off-site copy. R2 was chosen for its zero-cost egress, which is a significant advantage for restores. The backup script covers critical application data and the Docker Compose configurations. Each backup run reports its status to a healthchecks.io endpoint, which sends a push notification on failure. I must say I appreciate its generous free tier, which is more than sufficient for my needs.

This homelab represents a shift in philosophy from exploring complexity to valuing simplicity and reliability. The upfront hardware investment of ~$1,200 is offset by eliminating recurring cloud hosting costs and providing complete control over my data and services. For those considering a homelab, my primary recommendation is to start with a simple, well-understood foundation. A reliable machine with a solid backup strategy is more valuable than a complex, hard-to-maintain cluster. The goal is to build a system that serves your needs, not one that you serve.

- CPU: The Ryzen 5 7600X provides a strong price-to-performance ratio. Its 6 cores offer ample headroom for concurrent containerized workloads and future experimentation.
- Storage: The boot drive is a 500GB NVMe for fast OS and application performance. The primary storage consists of two 4TB HDDs in a BTRFS RAID 1 configuration. To mitigate the risk of correlated failures, I chose drives from different manufacturers (WD and Seagate) purchased at different times.
- RAM: 32GB of DDR5-6000 provides sufficient memory for a growing number of services without risking contention.
- Case & PSU: The ASUS Prime AP201 is a compact MicroATX case with a clean aesthetic suitable for a home office. The Corsair SF750 (80+ Platinum) PSU was chosen for its efficiency and to provide capacity for a future GPU for local LLM or transcoding workloads.
- Sync: copies the specified stack's directory from the local Git repository to a target directory on the remote machine.
- Execute: runs the appropriate Docker Compose command on the remote machine.
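As a rough illustration of what the Sync and Execute steps might look like, here is a small Go sketch that shells out to rsync and ssh. This is an assumption on my part: the post's actual deploy script, along with its flags, paths, and conventions, isn't shown in this excerpt, and the host and remote directory below are placeholders.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// deploy syncs one stack's directory to the target machine and brings it up
// with Docker Compose. Host and remote path are illustrative placeholders.
func deploy(stack, host, remoteDir string) error {
	// Sync: copy the stack directory from the local Git repository.
	rsync := exec.Command("rsync", "-az", "--delete",
		stack+"/", fmt.Sprintf("%s:%s/%s/", host, remoteDir, stack))
	rsync.Stdout, rsync.Stderr = os.Stdout, os.Stderr
	if err := rsync.Run(); err != nil {
		return fmt.Errorf("sync %s: %w", stack, err)
	}

	// Execute: run the Compose command on the remote machine.
	ssh := exec.Command("ssh", host,
		fmt.Sprintf("cd %s/%s && docker compose up -d --remove-orphans", remoteDir, stack))
	ssh.Stdout, ssh.Stderr = os.Stdout, os.Stderr
	if err := ssh.Run(); err != nil {
		return fmt.Errorf("execute %s: %w", stack, err)
	}
	return nil
}

func main() {
	if err := deploy("gitea", "floyd-homelab-1", "~/stacks"); err != nil {
		log.Fatal(err)
	}
}
```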
- Data Persistence: Instead of using Docker named volumes, I use host bind mounts. All persistent data for a service is stored in a dedicated directory on the host. This makes backups and data management more transparent.
- Reverse Proxy Network: The Caddy stack defines a shared Docker network. Other stacks that need to be exposed to the internet are configured to join this network. This allows Caddy to discover and proxy them without exposing their ports on the host machine. I have written about this pattern in detail in a previous post.
- Port Exposure: Services behind the reverse proxy use the expose directive in their Compose files to make ports available to Caddy within the Docker network. I avoid binding ports directly with the ports directive unless absolutely necessary.

The machines:

- floyd-homelab-1 (Primary Server): The core of the homelab, running on the AMD hardware detailed above. It runs data-intensive personal services (e.g., Immich, Paperless-ngx) and is accessible only via the Tailscale network.
- floyd-pub-1 (Public VPS): A small cloud VPS that hosts public-facing services requiring high availability, such as DNS utilities, analytics, and notification relays.
- floyd-monitor-public (Monitoring VPS): A small Hetzner VM running Gatus for health checks. Its independence ensures that I am alerted if the primary homelab or home network goes offline.

The service stacks (grouped by machine in the original post; the reverse proxy runs on each node):

- Actual: A local-first personal finance and budgeting tool.
- Caddy: A powerful, enterprise-ready, open source web server with automatic HTTPS.
- Gitea: A Git service for personal projects.
- Glance: A dashboard for viewing all my feeds and data in one place.
- Immich: A photo and video backup solution, directly from my mobile phone.
- Karakeep: An app for bookmarking everything, with AI-based tagging and full-text search.
- Owntracks: A private location tracker for recording my own location data.
- Paperless-ngx: A document management system that transforms physical documents into a searchable online archive.
- Silverbullet: A Markdown-based knowledge management and note-taking tool.
- Caddy: Reverse proxy for the services on this node.
- Beszel-agent: The agent for the Beszel monitoring platform.
- Caddy: Reverse proxy for the services on this node.
- Cloak: A service to securely share sensitive text with others.
- Doggo: A command-line DNS Client for Humans, written in Golang.
- Ntfy: A self-hosted push notification service.
- prom2grafana: A tool to convert Prometheus metrics to Grafana dashboards and alert rules using AI.
- Umami: A simple, fast, privacy-focused alternative to Google Analytics.

BTRFS features:

- Checksumming: Protects against silent data corruption.
- Copy-on-Write: Enables instantaneous, low-cost snapshots.
- Transparent Compression: Saves space without significant performance overhead.

0 views
Sean Goedecke 1 month ago

How I influence tech company politics as a staff software engineer

Many software engineers are fatalistic about company politics. They believe that it's pointless to get involved, because [1]:

- Technical decisions are often made for completely selfish reasons that cannot be influenced by a well-meaning engineer
- Powerful stakeholders are typically so stupid and dysfunctional that it's effectively impossible for you to identify their needs and deliver solutions to them
- The political game being played depends on private information that software engineers do not have, so any attempt to get involved will result in just blundering around
- Managers and executives spend most of their time playing politics, while engineers spend most of their time doing engineering, so engineers are at a serious political disadvantage before they even start

The general idea here is that software engineers are simply not equipped to play the game at the same level as real political operators. This is true! It would be a terrible mistake for a software engineer to think that you ought to start scheming and plotting like you're in Game of Thrones. Your schemes will be immediately uncovered and repurposed to your disadvantage and other people's gain. Scheming takes practice and power, and neither of those things are available to software engineers. It is simply a fact that software engineers are tools in the political game being played at large companies, not players in their own right. However, there are many ways to get involved in politics without scheming. The easiest way is to actively work to make a high-profile project successful. This is more or less what you ought to be doing anyway, just as part of your ordinary job. If your company is heavily investing in some new project - these days, likely an AI project - using your engineering skill to make it successful [2] is a politically advantageous move for whatever VP or executive is spearheading that project. In return, you'll get the rewards that executives can give at tech companies: bonuses, help with promotions, and positions on future high-profile projects. I wrote about this almost a year ago in Ratchet effects determine engineer reputation at large companies. A slightly harder way (but one that gives you more control) is to make your pet idea available for an existing political campaign. Suppose you've wanted for a while to pull out some existing functionality into its own service. There are two ways to make that happen. The hard way is to expend your own political capital: drum up support, let your manager know how important it is to you, and slowly wear doubters down until you can get the project formally approved. The easy way is to allow some executive to spend their (much greater) political capital on your project. You wait until there's a company-wide mandate for some goal that aligns with your project (say, a push for reliability, which often happens in the wake of a high-profile incident). Then you suggest to your manager that your project might be a good fit for this. If you've gauged it correctly, your org will get behind your project. Not only that, but it'll increase your political capital instead of you having to spend it. Organizational interest comes in waves. When it's reliability time, VPs are desperate to be doing something. They want to come up with plausible-sounding reliability projects that they can fund, because they need to go to their bosses and point at what they're doing for reliability, but they don't have the skillset to do it on their own. They're typically happy to fund anything that the engineering team suggests. On the other hand, when the organization's attention is focused somewhere else - say, on a big new product ship - the last thing they want is for engineers to spend their time on an internal reliability-focused refactor that's invisible to customers. So if you want to get something technical done in a tech company, you ought to wait for the appropriate wave. It's a good idea to prepare multiple technical programs of work, all along different lines. Strong engineers will do some of this kind of thing as an automatic process, simply by noticing things in the normal line of work.
For instance, you might have rough plans:

- to migrate the billing code to stored-data-updated-by-webhooks instead of cached API calls
- to rip out the ancient hand-rolled build pipeline and replace it with Vite
- to rewrite a crufty high-volume Python service in Golang
- to replace the slow CMS frontend that backs your public documentation with a fast static site

When executives are concerned about billing, you can offer the billing refactor as a reliability improvement. When they're concerned about developer experience, you can suggest replacing the build pipeline. When customers are complaining about performance, you can point to the Golang rewrite as a good option. When the CEO checks the state of the public documentation and is embarrassed, you can make the case for rebuilding it as a static site. The important thing is to have a detailed, effective program of work ready to go for whatever the flavor of the month is. Some program of work will be funded whether you do this or not. However, if you don't do this, you have no control over what that program is. In my experience, this is where companies make their worst technical decisions: when the political need to do something collides with a lack of any good ideas. When there are no good ideas, a bad idea will do, in a pinch. But nobody prefers this outcome. It's bad for the executives, who then have to sell a disappointing technical outcome as if it were a success [4], and it's bad for the engineers, who have to spend their time and effort building the wrong idea. If you're a very senior engineer, the VPs (or whoever) will quietly blame you for this. They'll be right to! Having the right idea handy at the right time is your responsibility. You can view all this in two different ways. Cynically, you can read this as a suggestion to make yourself a convenient tool for the sociopaths who run your company to use in their endless internecine power struggles. Optimistically, you can read this as a suggestion to let executives set the overall priorities for the company - that's their job, after all - and to tailor your own technical plans to fit [3]. Either way, you'll achieve more of your technical goals if you push the right plan at the right time.

edit: this post got some attention on Hacker News. The comments were much more positive than on my other posts about politics, for reasons I don't quite understand. This comment is an excellent statement of what I write about here (but targeted at more junior engineers). This comment (echoed here) references a Milton Friedman quote that applies the idea in this post to political policy in general, which I'd never thought of but sounds correct: "Only a crisis—actual or perceived—produces real change. When that crisis occurs, the actions that are taken depend on the ideas that are lying around. That, I believe, is our basic function: to develop alternatives to existing policies, to keep them alive and available until the politically impossible becomes politically inevitable." There are a few comments calling this approach overly game-playing and self-serving. I think this depends on the goal you're aiming at. The ones I referenced above seem pretty beneficial to me! Finally, this comment is a good summary of what I was trying to say: "Instead of waiting to be told what to do and being cynical about bad ideas coming up when there's a vacuum and not doing what he wants to do, the author keeps a backlog of good and important ideas that he waits to bring up for when someone important says something is priority. He gets what he wants done, compromising on timing."
If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . For more along these lines, see Is it cynical to do what your manager wants? Just because they can do this doesn’t mean they want to. Technical decisions are often made for completely selfish reasons that cannot be influenced by a well-meaning engineer Powerful stakeholders are typically so stupid and dysfunctional that it’s effectively impossible for you to identify their needs and deliver solutions to them The political game being played depends on private information that software engineers do not have, so any attempt to get involved will result in just blundering around Managers and executives spend most of their time playing politics, while engineers spend most of their time doing engineering, so engineers are at a serious political disadvantage before they even start to migrate the billing code to stored-data-updated-by-webhooks instead of cached API calls to rip out the ancient hand-rolled build pipeline and replace it with Vite to rewrite a crufty high-volume Python service in Golang to replace the slow CMS frontend that backs your public documentation with a fast static site I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. ↩ What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . ↩ For more along these lines, see Is it cynical to do what your manager wants? ↩ Just because they can do this doesn’t mean they want to. ↩

0 views
Anton Zhiyanov 1 month ago

Gist of Go: Atomics

This is a chapter from my book on Go concurrency, which teaches the topic from the ground up through interactive examples. Some concurrent operations don't require explicit synchronization. We can use these to create lock-free types and functions that are safe to use from multiple goroutines. Let's dive into the topic! Non-atomic increment • Atomic operations • Composition • Atomic vs. mutex • Keep it up

Suppose multiple goroutines increment a shared counter. There are 5 goroutines, and each one increments 10,000 times, so the final result should be 50,000. But it's usually less, and the race detector reports a problem when we run the code. This might seem strange — shouldn't the increment be atomic? Actually, it's not. It involves three steps (read-modify-write):

- Read the current value of the counter.
- Add one to it.
- Write the new value back to the counter.

If two goroutines both read the same value, then each increments it and writes it back, and the counter only goes up by one instead of two. As a result, some increments to the counter will be lost, and the final value will be less than 50,000. As we talked about in the Race conditions chapter, you can make an operation atomic by using mutexes or other synchronization tools. But for this chapter, let's agree not to use them. Here, when I say "atomic operation", I mean an operation that doesn't require the caller to use explicit locks, but is still safe to use in a concurrent environment. An operation without synchronization can only be truly atomic if it translates to a single processor instruction. Such operations don't need locks and won't cause issues when called concurrently (even the write operations). In a perfect world, every operation would be atomic, and we wouldn't have to deal with mutexes. But in reality, there are only a few atomics, and they're all found in the sync/atomic package. This package provides a set of atomic types:

- Bool — a boolean value;
- Int32 / Int64 — a 4- or 8-byte integer;
- Uint32 / Uint64 — a 4- or 8-byte unsigned integer;
- Value — a value of any type;
- Pointer — a pointer to a value of a given type (generic).

Each atomic type provides the following methods: Load reads the value of a variable, and Store sets a new value. Swap sets a new value (like Store) and returns the old one. CompareAndSwap sets a new value only if the current value is still what you expect it to be. Numeric types also provide an Add method that increments the value by the specified amount, and the And / Or methods for bitwise operations (Go 1.23+). All methods are translated to a single CPU instruction, so they are safe for concurrent calls. Strictly speaking, this isn't always true. Not all processors support the full set of concurrent operations, so sometimes more than one instruction is needed. But we don't have to worry about that — Go guarantees the atomicity of operations for the caller. It uses low-level mechanisms specific to each processor architecture to do this. Like other synchronization primitives, each atomic variable has its own internal state. So, you should only pass it as a pointer, not by value, to avoid accidentally copying the state. When using Value, all loads and stores should use the same concrete type, otherwise the code will panic. Now, let's go back to the counter program and rewrite it to use an atomic counter (see the sketch below). Much better!

✎ Exercise: Atomic counter +1 more. Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it. If you are okay with just theory for now, let's continue.

An atomic operation in a concurrent program is a great thing. Such an operation usually transforms into a single processor instruction, and it does not require locks. You can safely call it from different goroutines and receive a predictable result.
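The chapter's interactive examples aren't reproduced in this excerpt. Here is a minimal sketch (my own reconstruction, not the book's exact code) of the broken counter and the atomic fix using sync/atomic:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// Plain int counter: the increment is a non-atomic read-modify-write,
	// so running this with `go run -race` reports a data race and the
	// final value is usually less than 50,000.
	var plain int
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				plain++ // data race
			}
		}()
	}
	wg.Wait()
	fmt.Println("plain counter:", plain)

	// Atomic counter: Add is a single atomic operation, so the result
	// is always exactly 50,000 and the race detector stays quiet.
	var counter atomic.Int64
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				counter.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("atomic counter:", counter.Load())
}
```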
But what happens if you combine atomic operations? Let's find out. Let's look at a function that increments a counter. As you already know, it isn't safe to call from multiple goroutines because the unsynchronized increment causes a data race. Now I will try to fix the problem and propose several options (the code isn't reproduced in this excerpt). In each case, answer the question: if you call the function from 100 goroutines, is the final value of the counter guaranteed? In the first option it is guaranteed; in the second and third options it's not guaranteed.

People sometimes think that the composition of atomic operations also magically becomes an atomic operation. But it doesn't. Take the second of the above options and call the function 100 times from different goroutines. Run the program with the -race flag — there are no data races. But can we be sure what the final value of the counter will be? Nope. The atomic calls from different goroutines are interleaved. This causes a race condition (not to be confused with a data race) and leads to an unpredictable value. Check yourself by answering the question: in which example is the increment an atomic operation? In none of them. In all examples, the increment is not an atomic operation. The composition of atomics is always non-atomic. The first example, however, guarantees the final value of the counter in a concurrent environment: if we run 100 goroutines, the counter will ultimately equal 200. The reason is that an atomic add is a sequence-independent operation. The runtime can perform such operations in any order, and the result will not change. The second and third examples use sequence-dependent operations. When we run 100 goroutines, the order of operations is different each time. Therefore, the result is also different. A bulletproof way to make a composite operation atomic and prevent race conditions is to use a mutex. But sometimes an atomic variable with CompareAndSwap is all you need. Let's look at an example.

✎ Exercise: Concurrent-safe stack. Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it. If you are okay with just theory for now, let's continue.

Let's say we have a gate that needs to be closed. In a concurrent environment, there are data races on the gate's state field. We can fix this with a mutex. Alternatively, we can use CompareAndSwap on an atomic flag instead of a mutex (a sketch appears at the end of this excerpt). The type is now more compact and simple. This isn't a very common use case — we usually want a goroutine to wait on a locked mutex and continue once it's unlocked. But for "early exit" situations, it's perfect.

Atomics are a specialized but useful tool. You can use them for simple counters and flags, but be very careful when using them for more complex operations. You can also use them instead of mutexes to exit early. In the next chapter, we'll talk about testing concurrent code (coming soon). Pre-order for $10 or read online.
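Since the chapter's gate example isn't included in this excerpt, here is a minimal sketch of what such an "early exit" type might look like with an atomic flag. This is my own reconstruction; the book's actual gate type and method names may differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Gate lets concurrent callers pass until someone closes it.
// The closed flag is an atomic.Bool, so no mutex is needed.
type Gate struct {
	closed atomic.Bool
}

// Close marks the gate as closed. CompareAndSwap returns true only
// for the first caller that flips the flag from false to true, so
// "close exactly once" logic is easy to express without a mutex.
func (g *Gate) Close() bool {
	return g.closed.CompareAndSwap(false, true)
}

// Enter reports whether the caller may pass; it exits early if the
// gate has already been closed.
func (g *Gate) Enter() bool {
	return !g.closed.Load()
}

func main() {
	var g Gate
	fmt.Println(g.Enter()) // true: gate is open
	fmt.Println(g.Close()) // true: we closed it first
	fmt.Println(g.Close()) // false: already closed
	fmt.Println(g.Enter()) // false: gate is closed
}
```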

0 views
Aran Wilkinson 1 month ago

building a mqtt client in go

Build a robust MQTT client in Go with real Home Assistant examples. Covers concurrent processing, wildcard matching, and device integrations including Aqara sensors and EV charging data.
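The post itself isn't excerpted here, but as a taste of the wildcard-matching topic it mentions, here is a minimal subscription sketch using the widely used paho.mqtt.golang client. This is my own illustration and an assumption about tooling; the article may use a different library, broker address, or topic layout.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://homeassistant.local:1883"). // illustrative broker address
		SetClientID("go-mqtt-demo")

	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}

	// '+' matches a single topic level, so this catches every device under zigbee2mqtt/.
	topic := "zigbee2mqtt/+"
	client.Subscribe(topic, 1, func(_ mqtt.Client, msg mqtt.Message) {
		fmt.Printf("%s: %s\n", msg.Topic(), msg.Payload())
	})

	// Block until Ctrl+C so the subscription stays alive.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
	client.Disconnect(250)
}
```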

0 views

The Case Against Generative AI

Soundtrack: Queens of the Stone Age - First It Giveth Before we go any further: This is, for the third time this year, the longest newsletter I've ever written, weighing in somewhere around 18,500 words. I've written it specifically to be read at your leisure — dip in and out where you'd like — but also in one go.  This is my comprehensive case that yes, we’re in a bubble, one that will inevitably (and violently) collapse in the near future. I'll also be cutting this into a four-part episode starting tomorrow on my podcast Better Offline . I deeply appreciate your time. If you like this newsletter, please think about subscribing to the premium, which I write weekly. Thanks for reading. Alright, let’s do this one last time . In 2022, a (kind-of) company called OpenAI surprised the world with a website called ChatGPT that could generate text that sort-of sounded like a person using a technology called Large Language Models (LLMs), which can also be used to generate images, video and computer code.  Large Language Models require entire clusters of servers connected with high-speed networking, all containing this thing called a GPU — graphics processing units. These are different to the GPUs in your Xbox, or laptop, or gaming PC. They cost much, much more, and they’re good at doing the processes of inference (the creation of the output of any LLM) and training (feeding masses of training data to models, or feeding them information about what a good output might look like, so they can later identify a thing or replicate it). These models showed some immediate promise in their ability to articulate concepts or generate video, visuals, audio, text and code. They also immediately had one glaring, obvious problem: because they’re probabilistic, these models can’t actually be relied upon to do the same thing every single time. So, if you generated a picture of a person that you wanted to, for example, use in a story book, every time you created a new page, using the same prompt to describe the protagonist, that person would look different — and that difference could be minor (something that a reader should shrug off), or it could make that character look like a completely different person. Moreover, the probabilistic nature of generative AI meant that whenever you asked it a question, it would guess as to the answer, not because it knew the answer, but rather because it was guessing on the right word to add in a sentence based on previous training data. As a result, these models would frequently make mistakes — something which we later referred to as “hallucinations.”  And that’s not even mentioning the cost of training these models, the cost of running them, the vast amounts of computational power they required, the fact that the legality of using material scraped from books and the web without the owner’s permission was (and remains) legally dubious, or the fact that nobody seemed to know how to use these models to actually create profitable businesses.  These problems were overshadowed by something flashy, and new, and something that investors — and the tech media — believed would eventually automate the single thing that’s proven most resistant to automation: namely, knowledge work and the creative economy.  This newness and hype and these expectations sent the market into a frenzy, with every hyperscaler immediately creating the most aggressive market for one supplier I’ve ever seen. 
NVIDIA has sold over $200 billion of GPUs since the beginning of 2023 , becoming the largest company on the stock market and trading at over $170 as of writing this sentence only a few years after being worth $19.52 a share . While I’ve talked about some of the propelling factors behind the AI wave — automation and novelty — that’s not a complete picture. A huge reason why everybody decided to “do AI” was because the software industry’s growth was slowing , with SaaS (Software As A Service) company valuations stalling or dropping , resulting in  the terrifying prospect of companies having to “ under promise and over deliver ” and “be efficient.” Things that normal companies — those whose valuations aren’t contingent on ever-increasing, ever-constant growth — don’t have to worry about, because they’re normal companies.  Suddenly, there was the promise of a new technology — Large Language Models — that were getting exponentially more powerful, which was mostly a lie but hard to disprove because “powerful” can mean basically anything, and the definition of “powerful” depended entirely on whoever you asked at any given time, and what that person’s motivations were.  The media also immediately started tripping on its own feet, mistakenly claiming OpenAI’s GPT-4 model tricked a Taskrabbit into solving a CAPTCHA ( it didn’t — this never happened), or saying that “ people who don’t know how to code already [used] bots to produce full-fledged games, ” and if you’re wondering what “full-fledged” means, it means “pong” and a cobbled-together rolling demo of SkyRoads, a game from 1993 . The media (and investors) helped peddle the narrative that AI was always getting better, could do basically anything, and that any problems you saw today would be inevitably solved in a few short months, or years, or, well, at some point I guess.  LLMs were touted as a digital panacea, and the companies building them offered traditional software companies the chance to plug these models into their software using an API, thus allowing them to ride the same generative AI wave that every other company was riding.  The model companies similarly started going after individual and business customers, offering software and subscriptions that promised the world, though this mostly boiled down to chatbots that could generate stuff, and then doubled down with the promise of “agents” — a marketing term that’s meant to make you think “autonomous digital worker” but really means “ broken digital product .” Throughout this era, investors and the media spoke with a sense of inevitability that they never really backed up with data. It was an era based on confidently-asserted “vibes.” Everything was always getting better and more powerful, even though there was never much proof that this was truly disruptive technology, other than in its ability to disrupt apps you were using with AI — making them worse by, for example, suggesting questions on every Facebook post that you could ask Meta AI, but which Meta AI couldn’t answer. “AI” was omnipresent, and it eventually grew to mean everything and nothing. OpenAI would see its every move lorded over like a gifted child, its CEO Sam Altman called the “ Oppenheimer of Our Age ,” even if it wasn’t really obvious why everyone was impressed. GPT-4 felt like something a bit different, but was it actually meaningful?  
The thing is, Artificial Intelligence is built and sold on not just faith, but a series of myths that the AI boosters expect us to believe with the same certainty that we treat things like gravity, or the boiling point of water.  Can large language models actually replace coders? Not really, no, and I’ll get into why later in the piece. Can Sora — OpenAI’s video creation tool — replace actors or animators? No, not at all, but it still fills the air full of tension because you can immediately see who is pre-registered to replace everyone that works for them.  AI is apparently replacing workers, but nobody appears to be able to prove it! But every few weeks a story runs where everybody tries to pretend that AI is replacing workers with some poorly-sourced and incomprehensible study , never actually saying “someone’s job got replaced by AI” because it isn’t happening at scale, and because if you provide real-world examples, people can actually check. To be clear, some people have lost jobs to AI, just not the white collar workers, software engineers, or really any of the career paths that the mainstream media and AI investors would have you believe.  Brian Merchant has done excellent work covering how LLMs have devoured the work of translators , using cheap, “almost good” automation to lower already-stagnant wages in a field that was already hurting before the advent of generative AI, with some having to abandon the field, and others pushed into bankruptcy. I’ve heard the same for art directors, SEO experts, and copy editors, and Christopher Mims of the Wall Street Journal covered these last year .  These are all fields with something in common: shitty bosses with little regard for their customers who have been eagerly waiting for the opportunity to slash contract labor. To quote Merchant, “the drumbeat, marketing, and pop culture of ‘powerful AI’ encourages and permits management to replace or degrade jobs they might not otherwise have.”  Across the board, the people being “replaced” by AI are the victims of lazy, incompetent cost-cutters who don’t care if they ship poorly-translated text. To quote Merchant again, “[AI hype] has created the cover necessary to justify slashing rates and accepting “good enough” automation output for video games and media products.” Yet the jobs crisis facing translators speaks to the larger flaws of the Large Language Model era, and why other careers aren’t seeing this kind of disruption. Generative AI creates outputs , and by extension defines all labor as some kind of output created from a request. In the case of translation, it’s possible for a company to get by with a shitty version, because many customers see translation as “what do these words say,” even though ( as one worker told Merchant ) translation is about conveying meaning. Nevertheless, “translation” work had already started to condense to a world where humans would at times clean up machine-generated text, and the same worker warned that the same might come for other industries. Yet the problem is that translation is a heavily output-driven industry, one where (idiot) bosses can say “oh yeah that’s fine” because they ran an output back through Google Translate and it seemed fine in their native tongue. The problems of a poor translation are obvious, but the customers of translation are, it seems, often capable of getting by with a shitty product. The problem is that most jobs are not output-driven at all, and what we’re buying from a human being is a person’s ability to think.   
Every CEO talking about AI replacing workers is an example of the real problem: that most companies are run by people who don’t understand or experience the problems they’re solving, don’t do any real work, don’t face any real problems, and thus can never be trusted to solve them. The Era of the Business Idiot is the result of letting management consultants and neoliberal “free market” sociopaths take over everything, leaving us with companies run by people who don’t know how the companies make money, just that they must always make more. When you’re a big, stupid asshole, every job that you see is condensed to its outputs, and not the stuff that leads up to the output, or the small nuances and conscious decisions that make an output good as opposed to simply acceptable, or even bad.  What does a software engineer do? They write code! What does a writer do? They write words! What does a hairdresser do? They cut hair!  Yet that’s not actually the case.  As I’ll get into later, a software engineer does far more than just code, and when they write code they’re not just saying “what would solve this problem?” with a big smile on their face — they’re taking into account their years of experience, what code does, what code could do , all the things that might break as a result, and all of the things that you can’t really tell from just looking at code , like whether there’s a reason things are made in a particular way. A good coder doesn’t just hammer at the keyboard with the aim of doing a particular task. They factor in questions like: How does this functionality fit into the code that’s already here? Or, if someone has to update this code in the future, how do I make it easy for them to understand what I’ve written and to make changes without breaking a bunch of other stuff? A writer doesn’t just “write words.” They jostle ideas and ideals and emotions and thoughts and facts and feelings into a condensed piece of text, explaining both what’s happening and why it’s happening from their perspective, finding nuanced ways to convey large topics, none of which is the result of a single (or many) prompts but the ever-shifting sand of a writer’s brain.  Good writing is a fight between a bunch of different factors: structure, style, intent, audience, and prioritizing the things that you (or your client) care about in the text. It’s often emotive — or at the very least, driven or inspired by a given emotion — which is something that an AI simply can’t replicate in a way that’s authentic and believable.  And a hairdresser doesn’t just cut hair, but cuts your hair, which may be wiry, dry, oily, long, short, healthy, unhealthy, on a scalp with particular issues, at a time of year when perhaps you want to change length, at a time that fits you, in “the way you like” which may be impossible to actually write down but they get it just right. And they make conversation, making you feel at ease while they snip and clip away at your tresses, with you having to trust that they’ll get it right.  This is the true nature of labor that executives fail to comprehend at scale: that the things we do are not units of work, but extrapolations of experience, emotion, and context that cannot be condensed in written meaning. Business Idiots see our labor as the result of a smart manager saying “do this,” rather than human ingenuity interpreting both a request and the shit the manager didn’t say. What does a CEO do? 
Uhhh, um, well, a Harvard study says they spend 25% of their time on “people and relationships,” 25% on “functional and business unit reviews,” 16% on “organization and culture,” and 21% on “strategy,” with a few percent here and there for things like “professional development.”  That’s who runs the vast majority of companies: people that describe their work predominantly as “looking at stuff,” “talking to people” and “thinking about what we do next.” The most highly-paid jobs in the world are impossible to describe, their labor described in a mish-mash of LinkedInspiraton, yet everybody else’s labor is an output that can be automated. As a result, Large Language Models seem like magic. When you see everything as an outcome — an outcome you may or may not understand, and definitely don’t understand the process behind, let alone care about — you kind of already see your workers as LLMs.   You create a stratification of the workforce that goes beyond the normal organizational chart, with senior executives — those closer to the class level of CEO — acting as those who have risen above the doldrums of doing things to the level of “decisionmaking,” a fuzzy term that can mean everything from “making nuanced decisions with input from multiple different subject-matter experts” to, as ServiceNow Bill McDermott did in 2022 , “[make] it clear to everybody [in a boardroom of other executives], everything you do: AI, AI, AI, AI, AI.”  The same extends to some members of the business and tech media that have, for the most part, gotten by without having to think too hard about the actual things the companies are saying.  I realize this sounds a little mean, and I must be clear it doesn’t mean that these people know nothing , just that it’s been possible to scoot through the world without thinking too hard about whether or not something is true. When Salesforce said back in 2024 that its “Einstein Trust Layer” and AI would be “transformational for jobs,” the media dutifully wrote it down and published it without a second thought. It fully trusted Marc Benioff when he said that Agentforce agents would replace human workers , and then again when he said that AI agents are doing “ 30% to 50% of all the work in Salesforce itself ,” even though that’s an unproven and nakedly ridiculous statement.  Salesforce’s CFO said earlier this year that AI wouldn’t boost sales growth in 2025 . One would think this would change how they’re covered, or how seriously one takes Marc Benioff.  It hasn’t, because nobody is paying attention. In fact, nobody seems to be doing their job. This is how the core myths of generative AI were built: by executives saying stuff and the media publishing it without thinking too hard.  AI is replacing workers! AI is writing entire computer programs! AI is getting exponentially more-powerful! What does “powerful” mean? That the models are getting better on benchmarks that are rigged in their favor, but because nobody fucking explains it , regular people are regularly told that AI is “powerful.”  The only thing “powerful” about generative AI is its mythology. 
The world’s executives, entirely disconnected from labor and actual production, are doing the only thing they know how to — spend a bunch of money and say vague stuff about “AI being the future.” There are people — journalists, investors, and analysts — that have built entire careers on filling in the gaps for the powerful as they splurge billions of dollars and repeat with increasing desperation that “the future is here” as absolutely nothing happens. You’ve likely seen a few ridiculous headlines recently. One of the most recent, and most absurd, is that that OpenAI will pay Oracle $300 billion over four years , closely followed with the claim that NVIDIA will “invest” “$100 billion” in OpenAI to build 10GW of AI data centers , though the deal is structured in a way that means that OpenAI is paid “progressively as each gigawatt is deployed,” and OpenAI will be leasing the chips (rather than buying them outright) . I must be clear that these deals are intentionally made to continue the myth of generative AI, to pump NVIDIA, and to make sure OpenAI insiders can sell $10.3 billion of shares .   OpenAI cannot afford the $300 billion, NVIDIA hasn’t sent OpenAI a cent and won’t do so if it can’t build the data centers, which OpenAI most assuredly can’t afford to do.  NVIDIA needs this myth to continue, because in truth, all of these data centers are being built for demand that doesn’t exist, or that — if it exists — doesn’t necessarily translate into business customers paying huge amounts for access to OpenAI’s generative AI services.  NVIDIA, OpenAI, CoreWeave and other AI-related companies hope that by announcing theoretical billions of dollars (or hundreds of billions of dollars) of these strange, vague and impossible-seeming deals, they can keep pretending that demand is there, because why else would they build all of these data centers, right?   That, and the entire stock market rests on NVIDIA’s back . It accounts for 7% to 8% of the value of the S&P 500, and Jensen Huang needs to keep selling GPUs. I intend to explain later on how all of this works, and how brittle it really is. The intention of these deals is simple: to make you think “this much money can’t be wrong.” It can. These people need you to believe this is inevitable, but they are being proven wrong, again and again, and today I’m going to continue doing so.  Underpinning these stories about huge amounts of money and endless opportunity lies a dark secret — that none of this is working, and all of this money has been invested in a technology that doesn’t make much revenue and loves to burn millions or billions or hundreds of billions of dollars. Over half a trillion dollars has gone into an entire industry without a single profitable company developing models or products built on top of models. By my estimates, there is around $44 billion of revenue in generative AI this year (when you add in Anthropic and OpenAI’s revenues to the pot, along with the other stragglers) and most of that number has been gathered through reporting from outlets like The Information, because none of these companies share their revenues, all of them lose shit tons of money , and their actual revenues are really, really small. Only one member of the Magnificent Seven (outside of NVIDIA) has ever disclosed its AI revenue — Microsoft, which stopped reporting in January 2025, when it reported “$13 billion in annualized revenue,” so around $1.083 billion a month.   Microsoft is a sales MACHINE. 
It is built specifically to create or exploit software markets, suffocating competitors by using its scale to drive down prices, and to leverage the ecosystem that it’s created over the past few decades. $1 billion a month in revenue is chump change for an organization that makes over $27 billion a quarter in PROFITS .  Don’t worry Satya, I’ll come back to you later. “But Ed, the early days!” Worry not — I’ve got that covered .  This is nothing like any other era of tech. There has never been this kind of cash-rush, even in the fiber boom . Over a decade, Amazon spent about one-tenth of the capex that the Magnificent Seven spent in two years on generative AI building AWS — something that now powers a vast chunk of the web, and has long been Amazon’s most profitable business unit .  Generative AI is nothing like Uber , with OpenAI and Anthropic’s true costs coming in at about $159 billion in the past two years, approaching five times Uber’s $30 billion all-time burn. And that’s before the bullshit with NVIDIA and Oracle. Microsoft last reported AI revenue in January . It’s October this week. Why did it stop reporting this number, you think? Is it because the numbers are so good it couldn’t possibly let people know? As a general rule, publicly traded companies — especially those where the leadership are compensated primarily in equity — tend to brag about their successes, in part because said bragging boosts the value of the thing that the leadership gets paid in. There’s no benefit to being shy. Oracle literally made a regulatory filing to boast it had a $30 billion customer , which turned out to be OpenAI, who eventually agreed (publicly) to spend $300 billion in compute over five years .  Which is to say that Microsoft clearly doesn’t have any good news to share, and as I’ll reveal later, they can’t even get 3% of their 440 million Microsoft 365 subscribers to pay for Microsoft 365 Copilot.  If Microsoft can’t sell this shit, nobody can.  Anyway, I’m nearly done, sorry, you see, I’m writing this whole thing as if you’re brand new and walking up to this relatively unprepared, so I need to introduce another company.  In 2020, a splinter group jumped off of OpenAI, funded by Amazon and Google to do much the same thing as OpenAI but pretend to be nicer about it until they have to raise from the Middle East . Anthropic has always been better at coding for some reason, and people really like its Claude models.  Both OpenAI and Anthropic have become the only two companies in generative AI to make any real progress, either in terms of recognition or in sheer commercial terms, accounting for the majority of the revenue in the AI industry.  In a very real sense, the AI industry’s revenue is OpenAI and Anthropic. In the year where Microsoft recorded $13bn in AI revenues, $10 billion came from OpenAI’s  spending on Microsoft Azure. Anthropic burned $5.3 billion last year — with the vast majority of that going towards compute . Outside of these two companies, there’s barely enough revenue to justify a single data center. Where we sit today is a time of immense tension. Mark Zuckerberg says we’re in a bubble , Sam Altman says we’re in a bubble , Alibaba Chairman and billionaire Joe Tsai says we’re in a bubble , Apollo says we’re in a bubble , nobody is making money and nobody knows why they’re actually doing this anymore, just that they must do it immediately.  And they have yet to make the case that generative AI warranted any of these expenditures.  
That was undoubtedly the longest introduction to a newsletter I’ve ever written, and the reason why I took my time was because this post demands a level of foreshadowing and exposition, and because I want to make it make sense to anyone who reads it — whether they’ve read my newsletter for years, or whether they’re only just now investigating their suspicions that generative AI may not be all it’s cracked up to be. Today I will make the case that generative AI’s fundamental growth story is flawed, and explain why we’re in the midst of an egregious bubble. This industry is sold by keeping things vague, and knowing that most people don’t dig much deeper than a headline, a problem I simply do not have. This industry is effectively in service of two companies — OpenAI and NVIDIA — who pump headlines out through endless contracts between them and subsidiaries or investments to give the illusion of activity. OpenAI is now, at this point, on the hook for over a trillion dollars, an egregious sum for a company that already forecast billions in losses, with no clear explanation as to how it’ll afford any of this beyond “we need more money” and the vague hope that there’s another SoftBank or Microsoft waiting in the wings to swoop in and save the day. I’m going to walk you through where I see this industry today, and why I see no future for it beyond a fiery apocalypse. Everybody (reasonably!) harps on about hallucinations — which, to remind you, is when a model authoritatively states something that isn’t true — but the truth is far more complex, and far worse than it seems. You cannot rely on a large language model to do what you want. Even the most highly-tuned models on the most expensive and intricate platform can’t actually be relied upon to do exactly what you want. A “hallucination” isn’t just when these models say something that isn’t true. It’s when they decide to do something wrong because it seems the most likely thing to do, or when a coding model decides to go on a wild goose chase, failing the user and burning a ton of money in the process. The advent of “reasoning” models — those engineered to ‘think’ through problems in a way reminiscent of a human — and the expansion of what people are (trying) to use LLMs for demands that the definition of an AI hallucination be widened, referring not merely to factual errors, but to fundamental errors in understanding the user’s request or intent, or what constitutes a task, in part because these models cannot think and do not know anything. However successful a model might be in generating something good *once*, it will also often generate something bad, or it’ll generate the right thing but in an inefficient and over-verbose fashion. You do not know what you’re going to get each time, and hallucinations multiply with the complexity of the thing you’re asking for, or with whether a task contains multiple steps (which is a fatal blow to the idea of “agents”). You can add as many levels of intrigue and “reasoning” as you want, but Large Language Models cannot be trusted to do something correctly, or even consistently, every time. Model companies have successfully convinced everybody that the issue is that users are prompting the models wrong, and that people need to be “trained to use AI,” but what they’re doing is training people to explain away the inconsistencies of Large Language Models, and to assume individual responsibility for what is an innate flaw in how large language models work.
Large Language Models are also uniquely expensive. Many mistakenly try to claim this is like the dot com boom or Uber, but the basic unit economics of generative AI are insane. Providers must purchase tens or hundreds of thousands of GPUs costing $50,000 apiece, and hundreds of millions or billions of dollars of infrastructure for large clusters. And that’s without mentioning things like staffing, construction, power, or water. Then you turn them on and start losing money. Despite hundreds of billions of dollars of GPUs sold, nobody seems to make any money, other than NVIDIA, the company that makes them, and resellers like Dell and Supermicro who buy the GPUs, put them in servers, and sell them to other people. This arrangement works out great for Jensen Huang, and terribly for everybody else. I am going to explain the insanity of the situation we find ourselves in, and why I continue to do this work undeterred. The bubble has entered its most pornographic, aggressive and destructive stage, where the more obvious it becomes that they’re cooked, the more ridiculous the generative AI industry will act — a dark juxtaposition against every new study that says “generative AI does not work” or new story about ChatGPT’s uncanny ability to activate mental illness in people. So, let’s start simple: NVIDIA is a hardware company that sells GPUs, including the consumer GPUs that you’d see in a modern gaming PC, but when you read someone say “GPU” within the context of AI, they mean enterprise-focused GPUs like the A100, H100, H200, and more modern GPUs like the Blackwell-series B200 and GB200 (which combines two GPUs with an NVIDIA CPU). These GPUs cost anywhere from $30,000 to $50,000 (or as high as $70,000 for the newer Blackwell GPUs), and require tens of thousands of dollars more of infrastructure — networking to “cluster” server racks of GPUs together to provide compute, and massive cooling systems to deal with the massive amounts of heat they produce, as well as the servers themselves that they run on, which typically use top-of-the-line data center CPUs, and contain vast quantities of high-speed memory and storage. While the GPU itself is likely the most expensive single item within an AI server, the other costs — and I’m not even factoring in the actual physical building that the server lives in, or the water or electricity that it uses — add up. I’ve mentioned NVIDIA because it has a virtual monopoly in this space. Generative AI effectively requires NVIDIA GPUs, in part because it’s the only company really making the kinds of high-powered cards that generative AI demands, and because NVIDIA created something called CUDA — a collection of software tools that lets programmers write software that runs on GPUs, which were traditionally used primarily for rendering graphics in games. While there are open-source alternatives, as well as alternatives from Intel (with its Arc GPUs) and AMD (NVIDIA’s main rival in the consumer space), these aren’t nearly as mature or feature-rich. Due to the complexities of AI models, one cannot just stand up a few of these things either — you need clusters of thousands, tens of thousands, or hundreds of thousands of them for it to be worthwhile, pushing any investment in GPUs into the hundreds of millions or billions of dollars, especially considering they require completely different data center architecture to make them run.
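To make the scale of that GPU spend concrete, here’s a rough back-of-the-envelope sketch in Go. It uses only the per-GPU prices cited above; the cluster sizes are illustrative, and the totals cover the GPUs alone — no networking, cooling, servers, buildings, power, or water.

```go
package main

import "fmt"

func main() {
	// Per-GPU price range cited above: roughly $30,000 for older parts,
	// up to around $70,000 for newer Blackwell-class GPUs.
	lowPerGPU := 30_000.0
	highPerGPU := 70_000.0

	// Illustrative cluster sizes in the range discussed: thousands to
	// hundreds of thousands of GPUs.
	clusterSizes := []int{10_000, 50_000, 100_000}

	for _, n := range clusterSizes {
		low := float64(n) * lowPerGPU / 1e9
		high := float64(n) * highPerGPU / 1e9
		// GPU spend alone, before any of the other infrastructure.
		fmt.Printf("%7d GPUs: roughly $%.1fB to $%.1fB in GPUs alone\n", n, low, high)
	}
}
```

Even the smallest of those clusters is a nine-figure purchase before a single server rack, chiller, or building shows up on the invoice.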
A common request — like asking a generative AI model to parse through thousands of lines of code and make a change or an addition — may use multiple of these $50,000 GPUs at the same time, and so if you aspire to serve thousands, or millions of concurrent users, you need to spend big. Really big. It’s these factors — the vendor lock-in, the ecosystem, and the fact that generative AI only works when you’re buying GPUs at scale — that underpin the rise of NVIDIA. But beyond the economic and technical factors, there are human ones, too. To understand the AI bubble is to understand why CEOs do the things they do. Because an executive’s job is so vague, they can telegraph the value of their “labor” by spending money on initiatives and making partnerships. AI gave hyperscalers the excuse to spend hundreds of billions of dollars on data centers and buy a bunch of GPUs to go in them, because that, to the markets, looks like they’re doing something. By virtue of spending a lot of money in a frighteningly short amount of time, Satya Nadella received multiple glossy profiles, all without having to prove that AI can really do anything, be it a job or make Microsoft money. Nevertheless, AI allowed CEOs to look busy, and once the markets and journalists had agreed on the consensus opinion that “AI would be big,” all that these executives had to do was buy GPUs and “do AI.” We are in the midst of one of the darkest forms of software in history, described by many as an unwanted guest invading their products, their social media feeds, their bosses’ empty minds, and resting in the hands of monsters. Every story of its success feels bereft of any real triumph, with every literal description of its abilities involving multiple caveats about the mistakes it makes or the incredible costs of running it. Generative AI exists for two reasons: to cost money, and to make executives look busy. It was meant to be the new enterprise software and the new iPhone and the new Netflix all at once, a panacea where software guys pay one hardware guy for GPUs to unlock the incredible value creation of the future. Generative AI was always set up to fail, because it was meant to be everything, was talked about like it was everything, is still sold like it’s everything, yet for all the fucking hype, it all comes down to two companies: OpenAI, and, of course, NVIDIA. NVIDIA was, for a while, living high on the hog. All CEO Jensen Huang had to do every three months was say “check out these numbers” and the markets and business journalists would squeal with glee, even as he said stuff like “the more you buy the more you save” — a line that in part nods to the (very real and sensible) idea of accelerated computing, but that, framed within the context of the cash inferno that’s generative AI, seems ludicrous. Huang’s showmanship worked really well for NVIDIA for a while, because the growth was easy. Everybody was buying GPUs. Meta, Microsoft, Amazon, Google (and to a lesser extent Apple and Tesla) make up 42% of NVIDIA’s revenue, creating, at least for the first four, a degree of shared mania where everybody justified buying tens of billions of dollars of GPUs a year by saying “the other guy is doing it!” This is one of the major reasons the AI bubble is happening, because people conflated NVIDIA’s incredible sales with “interest in AI,” rather than everybody buying GPUs. Don’t worry, I’ll explain the revenue side a little bit later. We’re here for the long haul.
Anyway, NVIDIA is facing a problem — that the only thing that grows forever is cancer. On September 9 2025, the Wall Street Journal said that NVIDIA’s “wow” factor was fading, going from beating analyst estimates by nearly 21% in its Fiscal Year Q2 2024 earnings to scraping by with a mere 1.52% beat in its most-recent earnings — something that, for any other company, would be a good thing, but framed against the delusional expectations that generative AI has inspired, is a figure that looks nothing short of ominous. Per the Wall Street Journal: In any other scenario, 56% year-over-year growth would lead to an abundance of Dom Perignon and Huang signing hundreds of boobs, but this is NVIDIA, and that’s just not good enough. Back in February 2024, NVIDIA was booking 265% year-over-year growth, but in its February 2025 earnings, NVIDIA only grew by a measly 78% year-over-year. It isn’t so much that NVIDIA isn’t growing, but that to grow year-over-year at the rates that people expect is insane. Life was a lot easier when NVIDIA went from $6.05 billion in revenue in Q4 FY2023 to $22 billion in revenue in Q4 FY2024, but for it to grow even 55% year-over-year from Q2 FY2026 ($46.7 billion) to Q2 FY2027 would require it to make $72.385 billion in revenue in the space of three months, mostly from selling GPUs (which make up around 88% of its revenue). This would put NVIDIA in the ballpark of Microsoft ($76 billion in the last quarter) and within the neighborhood of Apple ($94 billion in the last quarter), predominantly making money in an industry that a year-and-a-half ago barely made the company $6 billion in a quarter. And the market needs NVIDIA to perform, as the company makes up 8% of the value of the S&P 500. It’s not enough for it to be wildly profitable, or to have a monopoly on selling GPUs, or to have effectively 10x’d its stock in a few years. It must continue to grow at the fastest rate of anything ever, making more and more money selling these GPUs to a small group of companies that immediately start losing money once they plug them in. While a few members of the Magnificent Seven could be depended on to funnel tens of billions of dollars into a furnace each quarter, there were limits, even for companies like Microsoft, which had bought over 485,000 GPUs in 2024 alone. To take a step back, companies like Microsoft, Google and Amazon make their money by either selling access to Large Language Models that people incorporate into their products, or by renting out servers full of GPUs to run inference (as said previously, the process of generating an output from a model or series of models) or train AI models for companies that develop and market models themselves, namely Anthropic and OpenAI. The latter revenue stream is where Jensen Huang found a solution to that eternal growth problem: the neocloud, namely CoreWeave, Lambda and Nebius. These businesses are fairly straightforward. They own (or lease) data centers that they then fill full of servers that are full of NVIDIA GPUs, which they then rent on an hourly basis to customers, either on a per-GPU basis or in large batches for larger customers, who guarantee they'll use a certain amount of compute and sign up to long-term (i.e. more than an hour at a time) commitments.
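A quick aside before we get into what a neocloud actually is: if you want to check that growth maths, here’s a rough sketch in Go that uses only the figures already cited above (the 55% scenario, the roughly 88% GPU share, and the Microsoft and Apple quarters for scale). Nothing in it is new data.

```go
package main

import "fmt"

func main() {
	// Figures cited above, in billions of dollars.
	q2FY2026 := 46.7 // NVIDIA's Q2 FY2026 revenue
	growth := 0.55   // the "even 55%" year-over-year scenario

	q2FY2027 := q2FY2026 * (1 + growth)
	fmt.Printf("55%% YoY growth from $%.1fB means roughly $%.3fB in Q2 FY2027\n", q2FY2026, q2FY2027)

	// Around 88% of NVIDIA's revenue comes from selling GPUs.
	fmt.Printf("Of which about $%.1fB would have to come from GPU sales\n", q2FY2027*0.88)

	// For scale, the most recent quarters cited above for Microsoft and Apple.
	fmt.Println("Microsoft last quarter: ~$76B; Apple last quarter: ~$94B")
}
```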
A neocloud is a specialist cloud compute company that exists only to provide access to GPUs for AI, unlike Amazon Web Services, Microsoft Azure and Google Cloud, all of which have healthy businesses selling other kinds of compute, with AI (as I’ll get into later) failing to provide much of a return on investment.  It’s not just the fact that these companies are more specialized than, say, Amazon’s AWS or Microsoft Azure. As you’ve gathered from the name, these are new, young, and in almost all cases, incredibly precarious businesses — each with financial circumstances that would make a Greek finance minister blush.  That’s because setting up a neocloud is expensive . Even if the company in question already has data centers — as CoreWeave did with its cryptocurrency mining operation — AI requires completely new data center infrastructure to house and cool the GPUs , and those GPUs also need paying for, and then there’s the other stuff I mentioned earlier, like power, water, and the other bits of the computer (the CPU, the motherboard, the memory and storage, and the housing).  As a result, these neoclouds are forced to raise billions of dollars in debt, which they collateralize using the GPUs they already have , along with contracts from customers, which they use to buy more GPUs. CoreWeave, for example, has $25 billion in debt on estimated revenues of $5.35 billion , losing hundreds of millions of dollars a quarter. You know who also invests in these neoclouds? NVIDIA! NVIDIA is also one of CoreWeave’s largest customers (accounting for 15% of its revenue in 2024), and just signed a deal to buy $6.3 billion of any capacity that CoreWeave can’t otherwise sell to someone else through 2032 , an extension of a $1.3 billion 2023 deal reported by the Information . It was the anchor investor ($250 million) in CoreWeave’s IPO , too. NVIDIA is currently doing the same thing with Lambda, another neocloud that NVIDIA invested in, which also  plans to go public next year. NVIDIA is also one of Lambda’s largest customers, signing a deal with it this summer to rent 10,000 GPUs for $1.3 billion over four years . In the UK, NVIDIA has just invested $700 million in Nscale , a former crypto miner that has never built an AI data center , and that has, despite having no experience, committed $1 billion (and/or 100,000 GPUs) to an OpenAI data center in Norway . On Thursday, September 25, Nscale announced it had closed another funding round, with NVIDIA listed as a main backer — although it’s unclear how much money it put in . It would be safe to assume it’s another few hundred million.  NVIDIA also invested in Nebius , an outgrowth of Russian conglomerate Yandex, and Nebius provides, through a partnership with NVIDIA, tens of thousands of dollars’ worth of compute credits to companies in NVIDIA’s Inception startup program. NVIDIA’s plan is simple: fund these neoclouds, let these neoclouds load themselves up with debt, at which point they buy GPUs from NVIDIA, which can then be used as collateral for loans, along with contracts from customers, allowing the neoclouds to buy even more GPUs. It’s like that Robinhood infinite money glitch… …except, that is, for one small problem. There don’t appear to be that many customers. As I went into recently on my premium newsletter , NVIDIA funds and sustains Neoclouds as a way of funnelling revenue to itself, as well as partners like Supermicro and Dell , resellers that take NVIDIA GPUs and put them in servers to sell pre-built to customers. 
These two companies made up 39% of NVIDIA’s revenues last quarter .  Yet when you remove hyperscaler revenue — Microsoft, Amazon, Google, OpenAI and NVIDIA — from the revenues of these neoclouds, there’s barely $1 billion in revenue combined, across CoreWeave, Nebius and Lambda . CoreWeave’s $5.35 billion revenue is predominantly made up of its contracts with NVIDIA, Microsoft (offering compute for OpenAI), Google ( hiring CoreWeave to offer compute for OpenAI ), and OpenAI itself, which has promised CoreWeave $22.4 billion in business over the next few years. This is all a lot of stuff , so I’ll make it really simple: there is no real money in offering AI compute, but that isn’t Jensen Huang’s problem, so he will simply force NVIDIA to hand money to these companies so that they have contracts to point to when they raise debt to buy more NVIDIA GPUs.  Neoclouds are effectively giant private equity vehicles that exist to raise money to buy GPUs from NVIDIA, or for hyperscalers to move money around so that they don’t increase their capital expenditures and can, as Microsoft did earlier in the year , simply walk away from deals they don’t like. Nebius’ “$17.4 billion deal” with Microsoft even included a clause in its 6-K filing that Microsoft can terminate the deal in the event the capacity isn’t built by the delivery dates, and Nebius has already used the contract to raise $3 billion to… build the data center to provide compute for the contract. Here, let me break down the numbers: From my analysis, it appears that CoreWeave, despite expectations to make that $5.35 billion this year, has only around $500 million of non-Magnificent Seven or OpenAI AI revenue in 2025 , with Lambda estimated to have around $100 million in AI revenue , and Nebius around $250 million without Microsoft’s share , and that’s being generous. In simpler terms, the Magnificent Seven is the AI bubble, and the AI bubble exists to buy more GPUs, because (as I’ll show) there’s no real money or growth coming out of this, other than in the amount that private credit is investing — “ $50 billion a quarter, for the low end, for the past three quarters .” I dunno man, let’s start simple: $50 billion a quarter of data center funding is going into an industry that has less revenue than Genshin Impact . That feels pretty bad. Who’s gonna use these data centers? How are they going to even make money on them? Private equity firms don’t typically hold onto assets, they sell them or take them public. Doesn’t seem great to me! Anyway, if AI was truly the next big growth vehicle, neoclouds would be swimming in diverse global revenue streams. Instead, they’re heavily-centralized around the same few names, one of which (NVIDIA) directly benefits from their existence not as a company doing business, but as an entity that can accrue debt and spend money on GPUs. These Neoclouds are entirely dependent on a continual flow of private credit from firms like Goldman Sachs ( Nebius , CoreWeave , Lambda for its IPO ), JPMorgan ( Lambda , Crusoe , CoreWeave ), and Blackstone ( Lambda , CoreWeave ), who have in a very real sense created an entire debt-based infrastructure to feed billions of dollars directly to NVIDIA, all in the name of an AI revolution that's yet to arrive. The fact that the rest of the neocloud revenue stream is effectively either a hyperscaler or OpenAI is also concerning. 
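To put those estimates next to each other — and these are my estimates from above, not disclosed figures — here’s a small sketch in Go that adds them up and shows just how much of CoreWeave’s expected 2025 revenue hangs on NVIDIA, the hyperscalers, and OpenAI.

```go
package main

import "fmt"

func main() {
	// My estimates from above for 2025 revenue that does NOT come from the
	// Magnificent Seven, NVIDIA, or OpenAI (in millions of dollars).
	type neocloud struct {
		name        string
		independent float64
	}
	clouds := []neocloud{
		{"CoreWeave", 500},
		{"Lambda", 100},
		{"Nebius", 250},
	}

	total := 0.0
	for _, c := range clouds {
		total += c.independent
		fmt.Printf("%-9s ~$%.0fM in independent revenue\n", c.name, c.independent)
	}
	fmt.Printf("Combined: ~$%.2fB — \"barely $1 billion\"\n", total/1000)

	// CoreWeave expects $5.35B in total 2025 revenue, so the share of its
	// business that hangs on NVIDIA, the hyperscalers, and OpenAI is roughly:
	fmt.Printf("CoreWeave dependence on the big names: ~%.0f%%\n", (1-500/5350.0)*100)
}
```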
Hyperscalers are, at this point, the majority of data center capital expenditures, and have yet to prove any kind of success from building out this capacity, outside, of course, Microsoft’s investment in OpenAI, which has succeeded in generating revenue while burning billions of dollars. Hyperscaler revenue is also capricious, but even if it isn’t, why are there no other major customers? Why, across all of these companies, does there not seem to be one major customer who isn’t OpenAI? The answer is obvious: nobody that wants it can afford it, and those who can afford it don’t need it. It’s also unclear what exactly hyperscalers are doing with this compute, because it sure isn’t “making money.” While Microsoft makes $10 billion in revenue from renting compute to OpenAI via Microsoft Azure, it does so at cost, and was charging OpenAI $1.30-per-hour for each A100 GPU it rents, a loss of $2.20 an hour per GPU, meaning that it is likely losing money on this compute, especially as SemiAnalysis has the total cost per hour per GPU at around $1.46 with the cost of capital and debt associated for a hyperscaler, though it’s unclear if that’s for an H100 or A100 GPU. In any case, how do these neoclouds pay for their debt if the hyperscalers give up, or NVIDIA doesn’t send them money, or, more likely, private credit begins to notice that there’s no real revenue growth outside of circular compute deals with neoclouds’ largest supplier, investor and customer? They don’t! In fact, I have serious concerns that they can’t even build the capacity necessary to fulfil these deals, but nobody seems to worry about that. No, really! It appears to be taking Oracle and Crusoe around 2.5 years per gigawatt of compute capacity. How exactly are any of these neoclouds (or Oracle itself) able to expand to actually capture this revenue? Who knows! But I assume somebody is going to say “OpenAI!” Here’s an insane statistic for you: OpenAI will account for — in both its own revenue (projected $13 billion) and in its own compute costs ($16 billion, according to The Information, although that figure is likely out of date, and seemingly only includes the compute it’ll use, and not that it has committed to build, and thus has spent money on) — about 50% of all AI revenues in 2025. That figure takes into account the $400m ARR for ServiceNow, Adobe, and Salesforce; the $35bn in revenue for the Magnificent Seven from AI (not profit, and based on figures from the previous year); revenue from neoclouds like CoreWeave, Nebius, and Lambda; and the estimated revenue from the entire generative AI industry (including Anthropic and other smaller players, like Perplexity and Anysphere) for a total of $55bn. OpenAI is the generative AI industry — and it’s a dog of a company. As a reminder, OpenAI has leaked that it’ll burn $115 billion in the next four years, and based on my estimates, it needs to raise more than $290 billion in the next four years based on its $300 billion deal with Oracle alone. That alone is a very, very bad sign, especially as we’re three years and $500 billion or more into this hype cycle with few signs of life outside of, well, OpenAI promising people money. Credit to Anthony Restaino for this horrifying graphic: This is not what a healthy, stable industry looks like. Alright, well, things can’t be that bad on the software side.
As I covered on my premium newsletter a few weeks ago, everybody is losing money on generative AI, in part because the cost of running AI models is increasing, and in part because the software itself doesn’t do enough to warrant the costs associated with running it, which are already subsidized and unprofitable for the model providers. Outside of OpenAI (and to a lesser extent Anthropic), nobody seems to be making much revenue, with the most “successful” company being Anysphere, makers of AI coding tool Cursor, which hit $500 million “annualized” (so $41.6 million in one month) a few months ago, just before Anthropic and OpenAI jacked up the prices for “priority processing” on enterprise queries, raising its operating costs as a result. In any case, that’s some piss-poor revenue for an industry that’s meant to be the future of software. Smartwatches are projected to make $32 billion this year, and as mentioned, the Magnificent Seven expects to make $35 billion or so in revenue from AI this year. Even Anthropic and OpenAI seem a little lethargic, both burning billions of dollars while making, by my estimates, no more than $2 billion and $6.26 billion in 2025 so far, despite projections of $5 billion and $13 billion respectively. Outside of these two, AI startups are floundering, struggling to stay alive and raising money in several-hundred-million-dollar bursts as their negative-gross-margin businesses falter. As I dug into a few months ago, I could find only 12 AI-powered companies making more than $8.3 million a month, with two of them slightly improving their revenues, specifically AI search company Perplexity (which has now hit $150 million ARR, or $12.5 million in a month) and AI coding startup Replit (which also hit $150 million ARR in September). Both of these companies burn ridiculous amounts of money. Perplexity burned 164% of its revenue on Amazon Web Services, OpenAI and Anthropic last year, and while Replit hasn’t leaked its costs, The Information reports its gross margins in July were 23%, which doesn’t include the costs of its free users, which you simply have to factor in with LLMs, as free users are capable of costing you a hell of a lot of money. Problematically, your paid users can also cost you more than they bring in as well. In fact, every user loses you money in generative AI, because it’s impossible to do cost control in a consistent manner. A few months ago, I did a piece about Anthropic losing money on every single Claude Code subscriber, and I’ll walk you through it in a very simplified fashion: Anthropic is, to be clear, the second-largest model developer, and has some of the best AI talent in the industry. It has a better handle on its infrastructure than anyone outside of big tech and OpenAI. It still cannot seem to fix this problem, even with weekly rate limits. While one could assume that Anthropic is simply letting people run wild, my theory is far simpler: even the model developers have no real way of limiting user activity, likely due to the architecture of generative AI. I know it sounds insane, but at the most advanced level, model providers are still prompting their models, and whatever rate limits may be in place appear to, at times, get completely ignored, and there doesn’t seem to be anything they can do to stop it. No, really. Anthropic counts amongst its capitalist apex predators one lone Chinese man who spent $50,000 of its compute in the space of a month fucking around with Claude Code.
Even if Anthropic was profitable — it isn’t, and will burn billions this year — a customer paying $200-a-month running up $50,000 in costs immediately devours the margin of any user running the service that day, if not that week or month. Even if Anthropic’s costs are half the published rates, one guy amounted to 125 users’ monthly revenue. That’s not a real business! That’s a bad business with out-of-control costs, and it doesn’t appear anybody has these costs under control. A few weeks ago, Replit — an unprofitable AI coding company — released a product called “Agent 3,” which promised to be “10x more autonomous” and offer “infinitely more possibilities,” “[testing] and [fixing] its code, constantly improving your application behind the scenes in a reflection loop.” In reality, this means you’d go and tell the model to build something and it would “go do it,” and you’ll be shocked to hear that these models can’t be relied upon to “go and do” anything. Please note that this was launched a few months after Replit raised its prices, shifting to obfuscated “effort-based” pricing that would charge “the full scope of the agent’s work.” Agent 3 has been a disaster. Users found tasks that previously cost a few dollars were spiralling into the hundreds of dollars, with The Register reporting one customer found themselves with a $1,000 bill after a week: Another user complained that “costs skyrocketed, without any concrete results”: As I previously reported, in late May/early June, both OpenAI and Anthropic cranked up the pricing on their enterprise customers, leading to Replit and Cursor both shifting their prices. This abuse has now trickled down to their customers. Replit has now released an update that lets you choose how autonomous you want Agent 3 to be, which is a tacit admission that you can’t trust coding LLMs to build software. Replit’s users are still pissed off, complaining that Replit is charging them for activity when the agent doesn’t do anything, a consistent problem across its Reddit. While Reddit is not the full summation of all users across every company, it’s a fairly good barometer of user sentiment, and man, are users pissy. Traditionally, Silicon Valley startups have relied upon the same model of “grow really fast and burn a bunch of money, then ‘turn the profit lever.’” AI does not have a “profit lever,” because the raw costs of providing access to AI models are so high (and they’re only increasing) that the basic economics of how the tech industry sells software don’t make sense. I’ll reiterate something I wrote a few weeks ago: In simpler terms, it is very, very difficult to imagine what one user — free or otherwise — might cost, and thus it’s hard to charge them on a monthly basis, or tell them what a service might cost them on average. This is a huge problem with AI coding environments. According to The Information, Claude Code was driving “nearly $400 million in annualized revenue, roughly doubling from a few weeks ago” on July 31 2025. That annualized revenue works out to about $33 million a month in revenue for a company that predicts it will make at least $416 million a month by the end of the year, and for a product that has become the most-popular coding environment in the world, from the second-largest and best-funded AI company in the world. …is that it? Is that all that’s happening here?
$33 million, all of it unprofitable, after it felt, at least based on social media chatter and discussions with multiple different software engineers, that Claude Code had become synonymous with anything to do with LLMs. To be clear, Anthropic’s Sonnet and Opus models are consistently some of the most popular for programming on OpenRouter, an aggregator of LLM usage, and Anthropic has been consistently named as “the best at coding.” Some bright spark out there is going to say that Microsoft’s GitHub Copilot has 1.8 million paying subscribers, and guess what, that’s true, and in fact, I reported it! Here’s another fun fact: the Wall Street Journal reported that Microsoft loses “on average more than $20-a-month-per-user,” with “...some users [costing] the company as much as $80.” And that’s for the most-popular product! If you believe the New York Times or other outlets that simply copy and paste whatever Dario Amodei says, you’d think that the reason that software engineers are having trouble finding work is because their jobs are being replaced by AI. This grotesque, abusive, manipulative and offensive lie has been propagated throughout the entire business and tech media without anybody sitting down and asking whether it’s true, or even getting a good understanding of what it is that LLMs can actually do with code. Members of the media, I am begging you, stop doing this. I get it, every asshole is willing to give a quote saying that “coding is dead,” and every executive is willing to burp out some nonsense about replacing all of their engineers, but I am fucking begging you to either use these things yourself, or speak to people who do. I am not a coder. I cannot write or read code. Nevertheless, I am capable of learning, and have spoken to numerous software engineers in the last few months, and basically reached a consensus of “this is kind of useful, sometimes.” However, one very silly man once said that I don’t speak to people who use these tools, so I went and spoke to three notable, experienced software engineers, and asked them to give me the straight truth about what coding LLMs can do. In simple terms, LLMs are capable of writing code, but can’t do software engineering, because software engineering is the process of understanding, maintaining and executing code to produce functional software, and LLMs do not “learn,” cannot “adapt,” and (to paraphrase Carl Brown) break down the more of your code and variables you ask them to look at at once. It’s very easy to believe that software engineering is just writing code, but the reality is that software engineers maintain software, which includes writing and analyzing code among a vast array of different personalities and programs and problems. Good software engineering harkens back to Brian Merchant’s interviews with translators — while some may believe that translators simply tell you what words mean, true translation is communicating the meaning of a sentence, which is cultural, contextual, regional, and personal, and often requires the exercise of creativity and novel thinking. My editor, Matthew Hughes, gave an example of this in his newsletter: Similarly, coding is not just “a series of text that programs a computer,” but a series of interconnected characters that refers to other software in other places that must also function now and explain, on some level, to someone who has never, ever seen the code before, why it was done this way.
This is, by the way, why we have yet to see any tangible proof that AI is replacing software engineers… because it can’t. Of all the fields supposedly at risk from “AI disruption,” coding feels (or felt) the most tangible, if only because the answer to “can you write code with LLMs” wasn’t an immediate, unilateral no. The media has also been quick to say that AI “writes software,” which is true in the same way that ChatGPT “writes novels.” In reality, LLMs can generate code, and do some software engineering-adjacent tasks, but, like all Large Language Models, break down and go totally insane, hallucinating more as the tasks get more complex. And, as I pointed out earlier, software engineering is not just coding. It involves thinking about problems, finding solutions to novel challenges, designing stuff in a way that can be read and maintained by others, and that’s (ideally) scalable and secure. The whole fucking point of an “AI” is that you hand shit off to it! That’s what they’ve been selling it as! That’s why Jensen Huang told kids to stop learning to code, as with AI, there’s no point. And it was all a lie. Generative AI can’t do the job of a software engineer, and it fails while also costing abominable amounts of money. Coding LLMs seem like magic at first, because they (to quote a conversation with Carl Brown) make the easy things easier, but they also make the harder things harder. They don’t even speed up engineers — they actually make them slower! Yet coding is basically the only obvious use case for LLMs. I’m sure you’re gonna say “but I bet the enterprise is doing well!” and you are so very, very wrong. Before I go any further, let’s establish some facts: All of this is to say that Microsoft has one of the largest commercial software empires in history, thousands (if not tens of thousands) of salespeople, and thousands of companies that literally sell Microsoft services for a living. And it can’t sell AI. A source that has seen materials related to sales has confirmed that, as of August 2025, Microsoft has around eight million active licensed users of Microsoft 365 Copilot, amounting to a 1.81% conversion rate across the 440 million Microsoft 365 subscribers. If each of these users paid annually at the full rate of $30-a-month, that would amount to about $2.88 billion in annual revenue for a product category that makes $33 billion a fucking quarter. And I must be clear, I am 100% sure these users aren’t all paying $30 a month. The Information reported a few weeks ago that Microsoft has been “reducing the software’s price with more generous discounts on the AI features, according to customers and salespeople,” heavily suggesting discounts had already been happening. Enterprise software is traditionally sold at a discount anyway — or, put a different way, with bulk pricing for those who sign up a bunch of users at once. In fact, I’ve found evidence that it’s been doing this a while, with a 15% discount on annual Microsoft 365 Copilot subscriptions for orders of 10-to-300 seats mentioned by an IT consultant back in late 2024, and another that’s currently running through September 30, 2025 via Microsoft’s Cloud Solution Provider program, with up to 2,400 licenses discounted if you pay upfront for the year. Microsoft seems to do this a lot, as I found another example of an offer that ran from January 1 2025 through March 31 2025. An “active” user is someone who has taken one action on Copilot in any Microsoft 365 app in the space of 28 days. Now, I know.
That word, active. Maybe you’re thinking “Ed, this is like the gym model! There are unpaid licenses that Microsoft is getting paid for!” Fine! Let’s assume that Microsoft also has, based on research that suggests this is the case for all software companies, another 50% — four million — of paid Copilot licenses that aren’t being used. That still makes this 12 million users, which is still a putrid 2.72% conversion rate. So, why aren’t people paying for Copilot? Let’s hear from someone who talked to The Information: Microsoft 365 Copilot has been such a disaster that Microsoft will now integrate Anthropic’s models in an attempt to make it better. Oh, one other thing: sources also confirm GPU utilization for Microsoft 365’s enterprise Copilot is barely scratching 60%. I’m also hearing that SharePoint — another popular enterprise app from Microsoft with 250 million users — had less than 300,000 weekly active users of its AI copilot features in August. So, The Information reported a few months ago that Microsoft’s projected AI revenues would be $13 billion, with $10 billion of that from OpenAI, leaving about $3 billion of total revenue across Microsoft 365 Copilot and any other foreseeable feature that Microsoft sells with “AI” on it. This heavily suggests that Microsoft is making somewhere between $1.5 billion and $2 billion on Azure or Microsoft 365 Copilot, though I suppose there are other places it could be making AI revenue too. Right? I guess. In any case, Microsoft’s net income (read: profit) in its last quarterly earnings was $27.2 billion. One of the comfortable lies that people tell themselves is that the AI bubble is similar to the fiber boom, or the dot com bubble, or Uber, or that we’re in the “growth stage,” or that “this is what software companies do, they spend a bunch of money then ‘pull the profit lever.’” This is nothing like anything you’ve seen before, because this is the dumbest shit that the tech industry has ever done. AI data centers are nothing like fiber, because there are very few actual use cases for these GPUs outside of AI, and none of them are remotely hyperscale revenue drivers. As I discussed a month or so ago, data center development accounted for more of America’s GDP growth than all consumer spending combined, and there really isn’t any demand for AI in general, let alone at the scale that these hundreds of billions of dollars are being sunk into. The conservative estimate of capital expenditures related to data centers is around $400 billion, but given the $50 billion a quarter in private credit, I’m going to guess it breaks $500 billion, all to build capacity for an industry yet to prove itself. And this NVIDIA-OpenAI “$100 billion funding” news should only fill you full of dread, but also it isn’t fucking finalized, stop reporting it as if it’s done, I swear to god- Anyway, according to CNBC, “the initial $10 billion tranche is locked in at a $500 billion valuation and expected to close within a month or so once the transaction has been finalized,” and “successive $10 billion rounds are planned, each to be priced at the company’s then-current valuation as new capacity comes online.” At no point is anyone asking how, exactly, OpenAI builds data centers to fill full of these GPUs. In fact, I am genuinely shocked (and a little disgusted!) by how poorly this story has been told.
Let’s go point by point: To be clear, when I say OpenAI needs at least $300 billion over the next four years, that’s if you believe its projections, which you shouldn’t. Let’s walk through its (alleged) numbers, while plagiarizing myself: According to The Information, here's the breakdown (these are projections): OpenAI's current reported burn is $116 billion through 2030, which means there is no way that these projections include $300 billion in compute costs, even when you factor in revenue. There is simply no space in these projections to absorb that $300 billion, and from what I can tell, by 2029, OpenAI will have actually burned more than $290 billion, assuming that it survives that long, which I do not believe it will. Don’t worry, though. OpenAI is about to make some crazy money. Here are the projections that CFO Sarah Friar signed off on: Just so we are clear, OpenAI intends to 10x its revenue in the space of four years, selling software and access to models in an industry with about $60 billion of revenue in 2025. How will it do this? It doesn’t say. I don’t know OpenAI CFO Sarah Friar, but I do know that signing off on these numbers is, at the very least, ethically questionable. Putting aside the ridiculousness of OpenAI’s deals, or its funding requirements, Friar has willfully allowed Sam Altman and OpenAI to state goals that defy reality or good sense, all to take advantage of investors and public markets that have completely lost the plot. I need to be blunter: OpenAI has signed multiple different deals and contracts for amounts it cannot afford to pay, that it cannot hope to raise the money to pay for, that defy the amounts of venture capital and private credit available, all to sustain a company that will burn $300 billion and has no path to profitability of any kind. So, as I said above, CNBC reported on September 23, 2025 that the NVIDIA deal will be delivered in $10 billion tranches, the first of which is “expected to close within a month,” and the rest delivered “as new capacity comes online.” This is, apparently, all part of a plan to build 10GW of data center capacity with NVIDIA. A few key points: So, let’s start simple: data centers take forever to build. As I said previously, based on current reports, it’s taking Oracle and Crusoe around 2.5 years per gigawatt of data center capacity, and nowhere in these reports does one reporter take a second to say “hey, what data centers are you talking about?” or “hey, didn’t Sam Altman say back in July that he was building 10GW of data center capacity with Oracle?” But wait, now Oracle and OpenAI have made another announcement that says they’re only doing 7GW, but they’re “ahead of schedule” on 10GW? Wait, is NVIDIA’s 10GW the same 10GW as Oracle and OpenAI are working on? Is it different? Nobody seems to know or care! Anyway, I cannot be clear enough how unlikely it is that (as NVIDIA has said) “the first gigawatt of NVIDIA systems will be deployed in the second half of 2026,” and that’s if it has bought the land and got the permits and ordered the construction, none of which has happened yet. But let’s get really specific on costs! Crusoe’s 1.2GW of compute for OpenAI is a $15 billion joint venture, which means a gigawatt of compute runs about $12.5 billion. Abilene’s eight buildings are each meant to hold 50,000 NVIDIA GB200 GPUs and their associated networking infrastructure, so let’s say a gigawatt is around 333,333 Blackwell GPUs.
Though this math is a little funky due to NVIDIA promising to install its new Rubin GPUs in these theoretical data centers, that means these data centers will require a little under $200 billion worth of GPUs, on top of roughly $125 billion of data center construction at that $12.5-billion-per-gigawatt rate. By my maths that’s $325 billion. I’m so tired of this. A number of you have sent me the following image with some sort of comment about how “this is how it’ll work,” and you are wrong, because this is neither how it works nor how it will work nor accurate on any level. In the current relationship, NVIDIA Is Not Sending OpenAI $100 Billion, nor will it send it that much money, because 90% of OpenAI’s funding is gated behind building 9 or 10 gigawatts of data center capacity. In the current relationship, OpenAI does not have the money to pay Oracle. Also, can Oracle even afford to give that much money to NVIDIA? It had negative free cash flow last quarter, already has $104 billion in debt, and its biggest new customer cannot afford a single fucking thing it’s promised. The only company in this diagram that actually can afford to do any of this shit is NVIDIA, and even then it only has $56 billion cash on hand. In any case, as I went over on Friday, OpenAI has promised about a trillion dollars between compute contracts across Oracle, Microsoft, Google and CoreWeave, 17 gigawatts of promised data centers in America between NVIDIA and “Stargate,” several more gigawatts of international data centers, custom chips from Broadcom, and their own company operations. How exactly does this get paid for? Nobody seems to ask these questions! Why am I the asshole doing this? Don’t we have tech analysts that are meant to analyse shit? AHhhhh- Every time I sit down to write about this subject the newsletters seem to get longer, because people are so painfully attached to the norms and tropes of the past. This post is, already, 17,500 words — a record for this newsletter — and I’ve still not finished editing and expanding it. What we’re witnessing is one of the most egregious wastes of capital in history, sold by career charlatans with their reputations laundered by a tech and business media afraid to criticize the powerful and analysts that don’t seem to want to tell their investors the truth. There are no historic comparisons here — even Britain’s abominable 1800s railway bubble, which absorbed half of the country’s national income, created valuable infrastructure for trains, a vehicle that can move people to and from places. GPUs are not trains, nor are they cars, or even CPUs. They are not adaptable to many other kinds of work, nor are they “the infrastructure of the future of tech,” because they’re already quite old and, with everybody focused on buying them, you’d absolutely have seen one other use case by now that actually mattered. GPUs are expensive, power-hungry, environmentally destructive and require their own kinds of cooling and server infrastructure, making every GPU data center an environmental and fiscal bubble unto itself. And, whereas the Victorian train infrastructure still exists in the UK — though it has been upgraded over the years — a GPU has a limited useful lifespan. These are cards that can — and will — break after a period of extended usage, whether that period is five years or longer, and they’ll inevitably be superseded by something better and more powerful, meaning that the resale value of that GPU will only go down, with a price depreciation akin to that of a new car.
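If you want to check that maths yourself, here’s a rough back-of-the-envelope sketch in Go. Everything comes from the figures cited above except the per-GPU price: I’ve assumed roughly $60,000 for a Blackwell-class part, which is simply the number that lands near the “little under $200 billion” figure, so treat it as my guess rather than a quoted price.

```go
package main

import "fmt"

func main() {
	// Figures cited above for the Crusoe/OpenAI joint venture in Abilene.
	jointVentureCost := 15.0 // $15B for 1.2GW of capacity
	gigawatts := 1.2
	buildings := 8.0
	gpusPerBuilding := 50_000.0 // GB200s per building

	costPerGW := jointVentureCost / gigawatts            // ≈ $12.5B per gigawatt
	gpusPerGW := buildings * gpusPerBuilding / gigawatts // ≈ 333,333 GPUs per gigawatt
	fmt.Printf("~$%.1fB and ~%.0f Blackwell GPUs per gigawatt\n", costPerGW, gpusPerGW)

	// The NVIDIA plan is pitched as 10GW. The per-GPU price is my assumption.
	targetGW := 10.0
	perGPU := 60_000.0
	gpuSpend := gpusPerGW * targetGW * perGPU / 1e9 // in billions
	buildSpend := costPerGW * targetGW              // in billions
	fmt.Printf("GPUs: ~$%.0fB, data centers: ~$%.0fB, total: ~$%.0fB\n",
		gpuSpend, buildSpend, gpuSpend+buildSpend)
}
```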
I am telling you, as I have been telling you for years, again and again and again , that the demand is not there for generative AI, and the demand is never, ever arriving. The only reason anyone humours any of this crap is the endless hoarding of GPUs to build capacity for a revolution that will never arrive. Well, that and OpenAI, a company built and sold on lies about ChatGPT’s capabilities . ChatGPT’s popularity — and OpenAI’s hunger for endless amounts of compute — have created the illusion of demand due to the sheer amount of capacity required to keep their services operational, all so they can burn $8 billion or more in 2025 and, if my estimates are right, nearly a trillion dollars by 2030 . This NVIDIA deal is a farce — an obvious attempt by the largest company on the American stock market to prop up the one significant revenue-generator in the entire industry, knowing that time is running out for it to create new avenues for eternal growth. I’d argue that NVIDIA’s deal also shows the complete contempt that these companies have for the media. There are no details about how this deal works beyond the initial $10 billion, there’s no land purchased, no data center construction started, and yet the media slurps it down without a second thought. I am but one man, and I am fucking peculiar. I did not learn financial analysis in school, but I appear to be one of the few people doing even the most basic analysis of these deals, and while I’m having a great time doing so, I am also exceedingly frustrated at how little effort is being put into prying apart these deals. I realize how ridiculous all of this sounds. I get it. There’s so much money being promised to so many people, market rallies built off the back of massive deals , and I get that the assumption is that this much money can’t be wrong, that this many people wouldn’t just say stuff without intending to follow through, or without considering whether their company could afford it.  I know it’s hard to conceive that hundreds of billions of dollars could be invested in something for no apparent reason, but it’s happening, right god damn now, in front of your eyes, and I am going to be merciless on anyone who attempts to write a “how could we see this coming?”  Generative AI has never been reliable, has always been unprofitable, and has always been unsustainable, and I’ve been saying so since February 2024 . The economics have never made sense, something I’ve said repeatedly since April 2024 , and when I wrote “How Does OpenAI Survive?” in July 2024 , I had multiple people suggest I was being alarmist. Here’s some alarmism for you: the longer it takes for OpenAI to die, the more damage it will cause to the tech industry.  On Friday, when I put out my piece on OpenAI needing a trillion dollars , I asked analyst Gil Luria if the capital was there to build the 17 Gigawatts that OpenAI had allegedly planned to build. He said the following: That doesn’t sound good! Anyway, as I discussed earlier, venture capital could run out in six quarters, with investor and researcher Jon Sakoda estimating that there will only be around $164 billion of dry powder (available capital) in US VC firms by the end of 2025. 
In July, The French Tech Journal reported (using PitchBook data) that global venture capital deal activity reached its lowest first-half total since 2018, with $139.4 billion in deal value in the first half of 2025, down from $183.4 billion in the first half of 2024, meaning that any further expansion or demands for venture capital from OpenAI will likely sap the dwindling funds available from other startups. Things get worse when you narrow things to US venture capital. In a piece from April, EY reported that VC-backed investment in US companies hit $80 billion in Q1 2025, but “one $40 billion deal” accounted for half of the investment — OpenAI’s $40 billion round, of which only $10 billion has actually closed, and that didn’t happen until fucking June. Without the imaginary money from OpenAI, US venture would have declined by 36%. The longer that OpenAI survives, the longer it will sap the remaining billions from the tech ecosystem, and I expect it to extend its tendrils to private credit too. The $325 billion it needs just to fulfil its NVIDIA contract, albeit over 4 years, is an egregious sum that I believe exceeds the available private capital in the world. Let’s get specific, and check out the top 10 private equity firms’ available capital! Assuming that all of this capital is currently available, the top 10 private equity firms in the world have around $477 billion of available capital. We can, of course, include investment banks — Goldman Sachs had around $520 billion of cash on hand at the end of its last quarter, and JPMorgan over $1.7 trillion — but JPMorgan has only dedicated $50 billion in direct lending commitments as of February 2025, and while Goldman Sachs expanded its direct private credit lending by $15 billion back in June, that appears to be an extension of its “more than $20 billion” direct lending close from mid-2024. Include both of those, and that brings us up to — if we assume that all of these funds are available — $562 billion in capital and about $164 billion in US venture available to spend, and that’s meant to go to more places than just OpenAI. Sure, sure, there’s more than just the top 10 private equity firms and there’s venture money outside of the US, but what could it be? Like, another $150 billion? You see, OpenAI needs to buy those GPUs, and it needs to build those data centers, and it needs to pay its thousands of staff and marketing and sales costs too. While OpenAI likely wouldn’t be the one raising the money for the data centers — and honestly, I’m not sure who would do it at this point — somebody is going to need to build TWENTY GIGAWATTS OF DATA CENTERS if we’re to believe both Oracle and NVIDIA. You may argue that venture funds and private credit can raise more, and you’re right! But at this point, there have been few meaningful acquisitions of AI companies, and zero exits from the billions of dollars put into data centers. Even OpenAI admits in its own announcement about new Stargate sites that this will be a “$400 billion investment over 3 years.” Where the fuck is that money coming from? Is OpenAI really going to absorb massive chunks of all available private credit and venture capital for the next few years? And no, god, stop saying the US government will bail this out. It will have to bail out hundreds of billions of dollars, there is no scenario where it’s anything less than that, and I’ve already been over this.
While the US government has spent equivalent sums in the past to support private business (the total $440 billion disbursed during the Great Recession’s TARP program, where the Treasury bought toxic assets from investment banks to stop them from imploding à la Lehman, springs to mind), it’s hard to imagine any case where OpenAI is seen as being as vital to the global financial system — and the economic health of the US — as the banking system. Sure, we spent around $1tn — if we’re being specific, $953bn — on the Paycheck Protection Program during the Covid era, but that was to keep people employed at a time when the economy outside of Zoom and Walmart had, for all intents and purposes, ceased to exist. There was an urgency that doesn’t apply here. If OpenAI goes tits up, SoftBank loses some money — nothing new there — and Satya Nadella has to explain why he spent tens of billions of dollars on a bunch of data centers filled with $50,000 GPUs that are, at this point, ornamental. And while there will be — and have been — disastrous economic consequences, they won’t be as systemically catastrophic as those of the pandemic or the global financial crisis. To be clear, it’ll be bad, but not as bad. And there’s also the problem of moral hazard — if the government steps in, what’s to stop big tech chasing its next fruitless rainbow? — and optics. If people resented bailing out the banks after they acted like profligate gamblers and lost, how will they feel bailing out fucking Sam Altman and Jensen Huang? I do apologize for the length of this piece, but the significance of this bubble requires depth. There is little demand, little real money, and little reason to continue, and the sheer lack of responsibility and willingness to kneel before the powerful fills me full of angry bile. I understand many journalists are not in a position where they can just write “this shit sounds stupid,” but we have entered a deeply stupid era, and by continuing to perpetuate the myth of AI, the media guarantees that retail investors and regular people’s 401(k)s will suffer. It is now inevitable that this bubble bursts. Deutsche Bank has said the AI boom is unsustainable outside of tech spending “remaining parabolic,” which it says “is highly unlikely,” and Bain Capital has said that $2 trillion in new revenue is needed to fund AI’s scaling, and even that math is completely fucked as it talks about “AI-related savings.” Even when stared in the face by a ridiculous idea — $2 trillion of new revenue in a global software market that’s expected to be around $817 billion in 2025 — Bain still oinks out some nonsense about the “savings from applying AI in sales, marketing, customer support and R&D,” yet another myth perpetuated, I assume, to placate the fucking morons sinking billions into this. Every single “vibe coding is the future,” “the power of AI,” and “AI job loss” story written perpetuates a myth that will only lead to more regular people getting hurt when the bubble bursts. Every article written about OpenAI or NVIDIA or Oracle that doesn’t explicitly state that the money doesn’t exist, that the revenues are impossible, that one of the companies involved burns billions of dollars and has no path to profitability, is an act of irresponsible make-believe and mythos. I am nobody. I am not a financier. I am not anybody special. I just write a lot, and read a lot, and can do the most basic maths in the world.
I am not trying to be anything other than myself, nor do I have an agenda, other than the fact that I like doing this and I hate how this story is being told. I never planned for this newsletter to get this big, and now that it has, I’m going to keep doing the same thing every week. I also believe that the way to stop this happening again is to have a thorough and well-sourced explanation of everything as it happens, ripping down the narratives as they’re spun and making it clear who benefits from them and how and why they’re choosing to do so. When things collapse, we need to be clear about how many times people chose to look the other way, or to find good-faith ways to interpret bad-faith announcements and leaks. So, how could we have seen this coming? I don’t know. Did anybody try to fucking look?

Armin Ronacher 1 month ago

90%

“I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code” — Dario Amodei

Three months ago I said that AI changes everything. I came to that after plenty of skepticism. There are still good reasons to doubt that AI will write all code, but my current reality is close. For the infrastructure component I started at my new company, I’m probably north of 90% AI-written code. I don’t want to convince you — just to share what I learned, in part because I approached this project differently from my first experiments with AI-assisted coding. The service is written in Go with few dependencies and an OpenAPI-compatible REST API. At its core, it sends and receives emails. I also generated SDKs for Python and TypeScript with a custom SDK generator. In total: about 40,000 lines, including Go, YAML, Pulumi, and some custom SDK glue. I set a high bar, especially for being able to operate it reliably. I’ve run similar systems before and knew what I wanted. Some startups are already near 100% AI-generated code. I know, because many build in the open and you can see their code. Whether that works long-term remains to be seen. I still treat every line as my responsibility, judged as if I wrote it myself. AI doesn’t change that. There are no weird files that don’t belong there, no duplicate implementations, and no emojis all over the place. The comments still follow the style I want and, crucially, often aren’t there. I pay close attention to the fundamentals of system architecture, code layout, and database interaction. I’m incredibly opinionated. As a result, there are certain things I don’t let the AI do, because I know it won’t reach the point where I could sign off on the commit. That’s why it’s not 100%. As a contrast: another quick prototype we built is a mess of unclear database tables, markdown file clutter in the repo, and boatloads of unwanted emojis. It served its purpose — validate an idea — but wasn’t built to last, and we had no expectations to that end.

I began in the traditional way: system design, schema, architecture. At this stage I don’t let the AI write code, but I loop it in as a kind of rubber duck. The back-and-forth helps me see mistakes, even if I don’t need or trust the answers. I did get the foundation wrong once: I initially argued myself into a more complex setup than I wanted, and that’s a part I later used the LLM to redo and clean up while it was still early. For AI-generated or AI-supported code, I now end up with a stack that looks like something I often wanted but found too hard to do by hand:

Raw SQL: This is probably the biggest change to how I used to write code. I really like using an ORM, but I don’t like some of its effects. In particular, once you approach the ORM’s limits, you’re forced to switch to handwritten SQL. That mapping is often tedious because you lose some of the powers the ORM gives you. Another consequence is that it’s very hard to find the underlying queries, which makes debugging harder. Seeing the actual SQL in your code and in the database log is powerful. You always lose that with an ORM. The fact that I no longer have to write SQL because the AI does it for me is a game changer. I also use raw SQL for migrations now (a sketch of what this looks like follows below).

OpenAPI first: I tried various approaches here. There are many frameworks you can use. I ended up first generating the OpenAPI specification and then using code generation from there to the interface layer. This approach works better with AI-generated code. The OpenAPI specification is now the canonical source that both the clients and the server shim are based on.
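To make the raw-SQL point concrete, here is a minimal sketch using Go’s standard database/sql package. The Message type, the messages table, and the lib/pq driver are illustrative assumptions for the example, not details of the actual service; the point is that the query text sits in the code and shows up verbatim in the database log.

```go
// Hypothetical sketch: querying Postgres with raw SQL through database/sql
// instead of an ORM. Schema and names are made up for illustration.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"os"
	"time"

	_ "github.com/lib/pq" // Postgres driver; registers itself with database/sql
)

type Message struct {
	ID        int64
	Recipient string
	Status    string
	CreatedAt time.Time
}

// pendingMessages runs a hand-visible SQL query; the same text appears in the
// database log, which is part of the appeal of skipping the ORM.
func pendingMessages(ctx context.Context, db *sql.DB, limit int) ([]Message, error) {
	rows, err := db.QueryContext(ctx, `
		SELECT id, recipient, status, created_at
		FROM messages
		WHERE status = 'pending'
		ORDER BY created_at
		LIMIT $1`, limit)
	if err != nil {
		return nil, fmt.Errorf("query pending messages: %w", err)
	}
	defer rows.Close()

	var out []Message
	for rows.Next() {
		var m Message
		if err := rows.Scan(&m.ID, &m.Recipient, &m.Status, &m.CreatedAt); err != nil {
			return nil, fmt.Errorf("scan message: %w", err)
		}
		out = append(out, m)
	}
	return out, rows.Err()
}

func main() {
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	msgs, err := pendingMessages(context.Background(), db, 10)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d pending messages\n", len(msgs))
}
```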
Today I use Claude Code and Codex. Each has strengths, but the constant is Codex for code review after PRs. It’s very good at that. Claude is still indispensable when debugging and when I need a lot of tool access (e.g., why do I have a deadlock, why is there corrupted data in the database). The two working together is where it gets most magical: Claude might find the data, Codex might understand it better.

I cannot stress enough how bad the code from these agents can be if you’re not careful. While they understand system architecture and how to build something, they can’t keep the whole picture in scope. They will recreate things that already exist. They create abstractions that are completely inappropriate for the scale of the problem. You constantly need to learn how to bring the right information into the context. For me, this means pointing the AI to existing implementations and giving it very specific instructions on how to follow along. I generally create PR-sized chunks that I can review. There are two paths to this:

Agent loop with finishing touches: Prompt until the result is close, then clean up.

Lockstep loop: Earlier I went edit by edit. Now I lean on the first method most of the time, keeping a todo list for cleanups before merge.

It requires intuition to know when each approach is more likely to lead to the right result. Familiarity with the agent also helps you understand when a task will not go anywhere, avoiding wasted cycles. The most important piece of working with an agent is the same as in regular software engineering: you need to understand your state machines, how the system behaves at any point in time, and your database. It is easy to create systems that appear to behave correctly but have unclear runtime behavior when relying on agents. For instance, the AI doesn’t fully comprehend threading or goroutines. If you don’t keep the bad decisions at bay early, you won’t be able to operate the system in a stable manner later. Here’s an example: I asked it to build a rate limiter. It “worked” but lacked jitter and made poor storage decisions. Easy to fix if you know rate limiters, dangerous if you don’t.
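To make the rate-limiter point concrete, here is a minimal sketch of the kind of thing that is meant: an in-memory, per-key token bucket that adds jitter to the suggested retry delay so rejected clients don’t all come back at the same instant. This is an illustrative assumption rather than the service’s actual implementation; a production limiter would also need shared storage and eviction of stale keys.

```go
// Hedged sketch, not the author's implementation: a per-key token bucket
// with jittered retry-after suggestions. In-memory only; no eviction.
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   time.Time
}

type Limiter struct {
	mu      sync.Mutex
	buckets map[string]*bucket
	rate    float64 // tokens refilled per second
	burst   float64 // maximum bucket size
}

func NewLimiter(rate, burst float64) *Limiter {
	return &Limiter{buckets: map[string]*bucket{}, rate: rate, burst: burst}
}

// Allow reports whether the key may proceed; if not, it returns a jittered
// duration the caller should wait before retrying.
func (l *Limiter) Allow(key string) (bool, time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[key] = b
	}

	// Refill based on elapsed time, capped at the burst size.
	b.tokens += now.Sub(b.last).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now

	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}

	// Time until one token is available, plus up to ~25% random jitter so
	// rejected callers don't retry in lockstep.
	wait := time.Duration((1 - b.tokens) / l.rate * float64(time.Second))
	jitter := time.Duration(rand.Int63n(int64(wait)/4 + 1))
	return false, wait + jitter
}

func main() {
	l := NewLimiter(2, 5) // 2 requests/second, burst of 5
	for i := 0; i < 8; i++ {
		ok, retry := l.Allow("customer-42")
		fmt.Printf("request %d: allowed=%v retryAfter=%v\n", i+1, ok, retry)
	}
}
```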
Agents also operate on conventional wisdom from the internet and in turn do things I would never do myself. They love to use dependencies (particularly outdated ones). They love to swallow errors and take away all tracebacks. I’d rather uphold strong invariants and let code crash loudly when they fail than hide problems. If you don’t fight this, you end up with opaque, unobservable systems.

For me, this has reached the point where I can’t imagine working any other way. Yes, I could probably have done it without AI. But I would have built a different system in parts, because I would have made different trade-offs. This way of working unlocks paths I’d normally skip or defer. Here are some of the things I enjoyed a lot on this project:

Research + code, instead of research and code later: Some things that would have taken me a day or two to figure out now take 10 to 15 minutes. It lets me directly play with one or two implementations of a problem. It moves me from abstract contemplation to hands-on evaluation.

Trying out things: I tried three different OpenAPI implementations and approaches in a day.

Constant refactoring: The code looks more organized than it would otherwise have been because the cost of refactoring is quite low. You need to know what you’re doing, but if things are set up well, refactoring becomes easy.

Infrastructure: Claude got me through AWS and Pulumi. Work I generally dislike took a few days instead of weeks. It also debugged the setup issues as it was going through them. I barely had to read the docs.

Adopting new patterns: While the agents suck at writing tests, they turned out great at setting up test infrastructure I didn’t know I needed. I got a recommendation on Twitter to use testcontainers for testing against Postgres. The approach runs migrations once and then creates database clones per test (see the sketch after this list). That turns out to be super useful. It would have been quite an involved project to migrate to; Claude did it in an hour for all tests.

SQL quality: It writes solid SQL that I could never write from memory. I just need to review it, which I can do; to this day I struggle to remember the syntax when writing SQL by hand.
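For reference, here is a hedged sketch of that “migrate once, clone per test” setup, assuming the testcontainers-go Postgres module and Postgres template databases. The exact module calls differ between testcontainers-go versions (postgres.Run vs. the older postgres.RunContainer), and the schema, database names, and driver are made up for the example; it shows the shape of the pattern, not the project’s actual test harness.

```go
// Hedged sketch: start one Postgres container, run migrations once into a
// template database, then clone the template per test. API details vary by
// testcontainers-go version; names here are illustrative.
package store_test

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
	"sync/atomic"
	"testing"
	"time"

	_ "github.com/lib/pq"
	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/modules/postgres"
	"github.com/testcontainers/testcontainers-go/wait"
)

var (
	templateDSN string       // points at the migrated template database
	cloneSeq    atomic.Int64 // gives each test its own database name
)

func TestMain(m *testing.M) {
	ctx := context.Background()
	pg, err := postgres.Run(ctx, "postgres:16-alpine",
		postgres.WithDatabase("app_template"),
		postgres.WithUsername("test"),
		postgres.WithPassword("test"),
		testcontainers.WithWaitStrategy(
			wait.ForLog("database system is ready to accept connections").
				WithOccurrence(2).
				WithStartupTimeout(30*time.Second)),
	)
	if err != nil {
		panic(err)
	}
	defer pg.Terminate(ctx)

	templateDSN, err = pg.ConnectionString(ctx, "sslmode=disable")
	if err != nil {
		panic(err)
	}

	// Run migrations once against the template (a stand-in migration here).
	tpl, err := sql.Open("postgres", templateDSN)
	if err != nil {
		panic(err)
	}
	if _, err := tpl.Exec(`CREATE TABLE messages (id BIGSERIAL PRIMARY KEY, recipient TEXT NOT NULL)`); err != nil {
		panic(err)
	}
	tpl.Close() // the template must be idle before it can be cloned

	m.Run()
}

// testDB clones the migrated template into a fresh database for one test.
func testDB(t *testing.T) *sql.DB {
	t.Helper()

	// Administrative work happens on the maintenance database, not the template.
	admin, err := sql.Open("postgres", strings.Replace(templateDSN, "app_template", "postgres", 1))
	if err != nil {
		t.Fatal(err)
	}
	defer admin.Close()

	name := fmt.Sprintf("app_test_%d", cloneSeq.Add(1))
	if _, err := admin.Exec(fmt.Sprintf("CREATE DATABASE %s TEMPLATE app_template", name)); err != nil {
		t.Fatal(err)
	}

	db, err := sql.Open("postgres", strings.Replace(templateDSN, "app_template", name, 1))
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() { db.Close() }) // dropping the clone is skipped for brevity
	return db
}
```

Cleanup of the per-test clones and limits on parallelism are left out; the point is that the migration cost is paid once and each test still gets an isolated, fully migrated database.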
Is 90% of code going to be written by AI? I don’t know. What I do know is that, for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way. At the same time, for me, AI doesn’t own the code. I still review every line, shape the architecture, and carry the responsibility for how it runs in production. But the sheer volume of what I now let an agent generate would have been unthinkable even six months ago.

That’s why I’m convinced this isn’t some far-off prediction. It’s already here — just unevenly distributed — and the number of developers working like this is only going to grow. That said, none of this removes the need to actually be a good engineer. If you let the AI take over without judgment, you’ll end up with brittle systems and painful surprises (data loss, security holes, unscalable software). The tools are powerful, but they don’t absolve you of responsibility.
