Latest Posts (20 found)
Jim Nielsen - 23 days ago

You Might Debate It — If You Could See It

Imagine I’m the design leader at your org and I present the following guidelines I want us to adopt as a team for doing design work:

- Typography: Use expressive, purposeful fonts and avoid default stacks (Inter, Roboto, Arial, system).
- Motion: Use a few meaningful animations (page-load, staggered reveals) instead of generic micro-motions.
- Background: Don't rely on flat, single-color backgrounds; use gradients, shapes, or subtle patterns to build atmosphere.
- Overall: Avoid boilerplate layouts and interchangeable UI patterns. Vary themes, type families, and visual languages.

How do you think that conversation would go? I can easily imagine a spirited debate where some folks disagree with any or all of my points, arguing that they should be struck as guidelines from our collective ethos of craft. Perhaps some are boring, or too opinionated, or too reliant on trends. There are lots of valid, defensible reasons. I can easily see this discussion being an exercise in frustration, where we debate for hours and get nowhere — “I suppose we can all agree to disagree”. And yet — thanks to a link to Codex’s front-end tool guidelines in Simon Willison’s article about how coding agents work — I see that these are exactly the kind of guidelines that are tucked away inside an LLM that’s generating output for many teams. It’s like a Trojan Horse of craft: guidelines you might never agree to explicitly are guiding LLM outputs, which means you are agreeing to them implicitly. It’s a good reminder about the opacity of the instructions baked into generative tools. We would debate an open set of guidelines for hours, but if they’re opaquely baked into a tool without our knowledge, does anybody even care? When you offload your thinking, you might be on-loading someone else’s you’d never agree to — personally or collectively. Reply via: Email · Mastodon · Bluesky

1 view

ADK Climb Club is now web-friendly!

Just finished up a project that I’ve been meaning to get to for a year: bringing ADK Climb Club to the open web. We’ve had a landing page for a while, but all the info about our meetups was going out via Instagram and WhatsApp. But not everyone wants to use those apps, and I heard from them! So, I buckled down and imported all the old posts, and hooked up my auto-crossposter. Now, everything that we post to Instagram shows up on our website as a native, web-friendly blog post. And I enabled (free) email subscriptions (thanks Micro.blog!), so folks can get an email each time that we share information about a meetup. Although Instagram is still our “primary” platform — that’s where our biggest audience is and where we pick up new members — I feel much better about the club being more accessible on the open web, and that people can stay in the loop with posts pushed out to them without having to sign up for a Meta app. If you’re a climber (or are climbing curious) and near Lake Placid, NY on a Wednesday night, you should come check us out! HeyDingus is a blog by Jarrod Blundy about technology, the great outdoors, and other musings. If you like what you see — the blog posts, shortcuts, wallpapers, scripts, or anything — please consider leaving a tip, checking out my store, or just sharing my work. Your support is much appreciated! I’m always happy to hear from you on social, or by good ol' email.

0 views

Kaktovik Numerals

Read on the website: Kaktovik numerals are a surprisingly good counting system. They allow many arithmetic operations to be done visually and effortlessly, though they take some getting used to. Thus this page!

0 views

Writing an LLM from scratch, part 32g -- Interventions: weight tying

In Sebastian Raschka's book "Build a Large Language Model (from Scratch)", he writes that weight tying, while it reduces the parameter count of a model, in his experience makes it worse. As such, apparently people don't use it in modern LLMs. Intuitively, that makes sense -- I'll explain why in this post. But as I'm trying various interventions to see if I can get my model -- based on Raschka's code, but trained for a fraction of the time that the original GPT-2 model was -- to perform as well as the original in terms of the loss it gets on a test set, I thought it would be worth seeing if it really is a negative for this particular tiny model of 163M parameters. After all, the original weights use weight tying, and I did find that QKV bias appeared to help -- and that's another old-school technique that they used, which has since dropped out of fashion. Might this one help too? Worth a try! Let's give it a go. I'll start with a quick refresher on what weight tying is, and how it works. This is really targeted at people who've been reading along with this series -- if it's all new to you, you might find my post on Maths for LLMs a useful catch-up guide first. In our LLM code, right at the start, we use an embedding layer to take our input token IDs, and turn them into embeddings -- each token becomes a vector in a high-dimensional space (768 in our case), which we see as representing in some manner the "meaning" of the token. A useful way to think about that is that we could start with a one-hot vector for the token -- that is, with our 50,257-token vocabulary, it would be 50,257 items long, and have zeros in every position apart from the position corresponding to the token's ID. We'll treat that as being a vector in a "vocab space".
The process of converting the token into an embedding turns out to be equivalent to multiplying that vocab space representation by an embedding matrix -- one with one row per possible token, the values in that row being the values for the appropriate embedding.¹ Because matrix multiplications can be seen as projections between different spaces, we can see that as a projection from our vocab space to the embedding space. Once we've projected our sequence of tokens into a sequence of embeddings, we do all of the steps required for the LLM -- we add in positional information, run it through the Transformer layers, normalise it, and then we have a new sequence of embeddings. The embedding at position n in that output sequence, if our model is working well, should be something that represents an appropriate next-token prediction for the portion of the input sequence from zero to position n. What we want as our final output is to map that back to the vocab space. We want logits: a list of numbers that (after being run through softmax) will represent the probability that our next token is a particular one. Just as we mapped from vocab space to embedding space with (conceptually) a matrix multiplication at the start of the process, we can map back with another one. More specifically, if we treat the embedding matrix as having the same number of rows as there are tokens in the vocabulary (which we'll call d_vocab) and columns as there are embedding dimensions (d_emb), then the original vocab-space-to-embedding-space matrix will have the shape d_vocab × d_emb. So it's projecting from a d_vocab-dimensional space to a d_emb-dimensional one. Similarly, our matrix to do the projection at the end is just a matrix with the numbers of rows and columns swapped around -- d_emb × d_vocab -- to do a projection in the other direction. The trick with weight tying is to see that these two projections can potentially be just the opposite of each other.
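That equivalence is easy to check for yourself. Here's a minimal PyTorch sketch (illustrative only -- sizes from this post, not the book's actual code) showing that an embedding lookup and a one-hot matrix multiplication give the same vector:

```python
import torch

torch.manual_seed(0)
d_vocab, d_emb = 50257, 768  # GPT-2 small sizes used in this post

emb = torch.nn.Embedding(d_vocab, d_emb)

token_id = 1234
# The efficient path: a direct row lookup in the embedding matrix.
lookup = emb(torch.tensor([token_id])).squeeze(0)

# The conceptual path: a one-hot "vocab space" vector times W_emb.
one_hot = torch.zeros(d_vocab)
one_hot[token_id] = 1.0
projected = one_hot @ emb.weight  # (d_vocab,) @ (d_vocab, d_emb) -> (d_emb,)

print(torch.allclose(lookup, projected))  # True
```

The one-hot route does around 38 million multiplications to read out 768 numbers, which is why real implementations just do the lookup.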
If we assume that the embedding space on the way in to the LLM is essentially the same as the embedding space on the way out, then we can use one projection to go into it from vocab space, and the opposite to go back. The "opposite" in this case is the transpose -- that is, if we use W_emb for our embedding matrix and W_out for the output one, we have W_out = W_emb^T. That means we can re-use all of the embedding parameters for the output projection matrix, and fewer parameters means not only a smaller model, but hopefully faster training. Sounds like a win! But of course, there's no such thing as a free lunch. By constraining the output head to be the transpose of the input one, we're essentially enforcing that assumption above: we're saying that the embedding space on the way out must be the same as the embedding space on the way in. That limits what the LLM can do -- if it were able to use different embedding spaces at each end, it would have more flexibility, which might help it learn to model things better. That's the theory: what does it mean in practice? Let's take a quick look at the GPT-2 code -- just the top-level class. For our embedding layer, we use PyTorch's Embedding class, and for the output head we use a Linear layer. Now, Embedding provides us with access to the underlying matrix with a field: weight (Tensor) -- the learnable weights of the module of shape (num_embeddings, embedding_dim) initialized from 𝒩(0, 1). So, that's exactly the d_vocab × d_emb matrix that we'd expect -- it's the input dimension as the rows, and the output dimension as the columns. If we look at Linear, we see something very similar: weight (torch.Tensor) -- the learnable weights of the module of shape (out_features, in_features). The values are initialized from 𝒰(−√k, √k) where k = 1/in_features. That's actually the other way around, output dimension as the rows and input as the columns. If you're wondering why, remember that we transpose the weights matrix for a neural network before using it.
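Here's a sketch of what tying looks like in code -- a toy skeleton with invented names, not Raschka's actual model. Because nn.Linear already stores its weight as (out_features, in_features), its weight has the same shape as the embedding's (d_vocab, d_emb) matrix and can simply share it:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy skeleton illustrating weight tying (names are invented)."""
    def __init__(self, d_vocab=1000, d_emb=64):
        super().__init__()
        self.tok_emb = nn.Embedding(d_vocab, d_emb)            # weight: (d_vocab, d_emb)
        self.out_head = nn.Linear(d_emb, d_vocab, bias=False)  # weight: (d_vocab, d_emb) too
        # Weight tying: the output head re-uses the embedding parameters.
        self.out_head.weight = self.tok_emb.weight

model = TinyLM()
print(model.out_head.weight is model.tok_emb.weight)  # True: one shared tensor
print(sum(p.numel() for p in model.parameters()))     # 64000: counted only once
```

Because the two attributes point at the same tensor, PyTorch's parameters() counts the shared weight just once, which is exactly where the parameter saving comes from.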
But that's actually really convenient in our situation, because if we want to use the same weights for both, they're already "compatible"! And that means that adding weight tying to our code above is as simple as adding two lines at the end. For the model code, it literally is just that! There is a tiny inefficiency in that PyTorch is going to spend a bit of time initialising the output head's weights to appropriately-sized random values, only to have them all replaced -- but that actually works in our favour, because it means that we'll use up the same amount of the random number stream when creating the LLM in both the weight-tying and non-weight-tying cases, which is a bit better for reproducibility. There is one other change needed, though. I ran a test train with that code, and checkpointing failed: Safetensors doesn't like it when you reuse weights like we're doing here. The good news is that the help page the error links to is exactly about this problem with weight tying, and the suggested fix -- replacing the save call, and similarly the load -- appears to work fine. Saving and loading checkpoints works, and it's compatible with the old checkpoint files too. So that's good news :-) So, that's how we code it. How much actual saving do we get in terms of the parameter count by doing this? A quick-and-easy way to count the parameters is just to create an instance of the model and see. So, we've gone from a 163M-parameter model to a 124M-parameter one. That's certainly quite some saving -- 38,597,376 fewer parameters, which is a reduction of almost a quarter. We can also sanity check the size of that saving -- our output head was, as we know, a d_emb × d_vocab matrix, so it should have 50257 × 768 parameters -- which is, indeed, 38,597,376. Excellent. Now, there's one thing we should consider here. We're training on a Chinchilla-optimal number of tokens, 20x our parameter count. Is that what we want to keep stable?
Or is the total number of training tokens the important bit, so we wind up technically overtraining? My instinct is that the total training tokens is the important thing. Chinchilla optimality is a training heuristic rather than a true aspect of the model, so sticking with it would mean that we're training a model with fewer parameters on less data. It seems very unlikely that would do anything other than produce a worse model! So: we'll keep the same number of training tokens, and just introduce weight tying. How does it train? I kicked it off on the usual 8x A100 40 GiB machine, and after a little while I checked the loss chart. Yikes! It started off with a loss of about 460. Normally, we start with a loss of about 11. The normal loss makes a lot of sense. If you consider it in terms of perplexity, that value of 11 comes out at e^11 ≈ 59,874 -- that is, the model is giving pretty much equal probabilities to every one of the 50,257 possible tokens. A loss of 460 means that the model is making incorrect predictions and is very certain about them. How could that be? Well, let's look at the documentation again: weight (Tensor) -- the learnable weights of the module of shape (num_embeddings, embedding_dim) initialized from 𝒩(0, 1). weight (torch.Tensor) -- the learnable weights of the module of shape (out_features, in_features). The values are initialized from 𝒰(−√k, √k) where k = 1/in_features. They're initialised completely differently. Embeddings are set to values in a normal distribution (that is, a Gaussian bell curve) with a mean of 0 and a standard deviation of 1. But linear layers are set to random values in a uniform distribution (that is, a completely flat one) within a range based on the number of input features. In particular, those numbers for the linear layer are really small!
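The mismatch is easy to see by instantiating the two layer types with this post's dimensions (a sketch, not the training code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_vocab, d_emb = 50257, 768

emb = nn.Embedding(d_vocab, d_emb)           # init: N(0, 1)
lin = nn.Linear(d_emb, d_vocab, bias=False)  # init: U(-sqrt(k), sqrt(k)), k = 1/768

bound = (1 / d_emb) ** 0.5
print(round(bound, 4))                         # 0.0361

# The linear layer's weights all sit inside that tiny range...
print(lin.weight.abs().max().item() <= bound)  # True

# ...while the embedding weights have a standard deviation near 1.
print(round(emb.weight.std().item(), 1))       # 1.0
```

Tying the weights means the output head starts with the embedding's N(0, 1) values instead of values bounded by ±0.036 -- roughly a thirty-fold difference in scale.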
Our output head has in_features set to 768, so √k would be √(1/768) ≈ 0.0360. So instead of getting that kind of "ideal" linear layer initialisation within the range (−0.0360, 0.0360), we're getting numbers which roughly 2/3 of the time will be in the range (−1, 1), and the rest of the time will be even further from zero -- we could be getting -3 or +4, or potentially even crazier numbers! That means that the output logits (coming from a linear layer with higher weights) will be larger, which in turn will push softmax to come up with higher probabilities. I considered changing things to initialise the weights differently, but given that the loss had fallen to 8 or so by the second checkpoint, I decided to just let the run complete. Here's the final loss chart, with the Y axis fixed to run from 0 to 12. That's a nice smooth curve, at least! Timing-wise, the run was about 180 seconds faster than our baseline model training run, only a 1.5% speedup -- clearly the lower number of parameters doesn't actually save us much time. Loss-wise, the final train loss on the baseline model was 3.743, so that's not particularly promising. Still, the proof is, as ever, in the evals. Smoke test first: borderline coherent, but maybe worse than normal? Let's see what our test set loss looks like. That's bad -- let's see it in our comparison table. Our worst model so far :-( Weight tying certainly didn't help our train. It is worth noting that the GPT-2 small weights -- which do use it -- got 3.500 on the same test set as we're using for that table, so it is possible to get a better model with weight tying. But there was clearly something different about their train, and my suspicion, as I've said before, is that it was trained for many more epochs (I estimated 40), slowly grinding that loss down. But what I'm trying to do in this mini-series of interventions is find tricks that will allow us to approach the original weights' loss without a very long training run.
And for the purposes of that, I think we can safely say that weight-tying is not one of those. Next time around, our last intervention test! What happens if we switch off the use of automatic mixed precision (AMP)? That is something I added right back at the start as a performance enhancement; it means that PyTorch can do certain calculations in 16-bit rather than 32-bit if it thinks there's no harm in doing so. Might we get better loss by training without it? ¹ In reality we don't multiply a one-hot vector by a matrix, as that would be extremely inefficient -- PyTorch just does a lookup into the embedding matrix. If we get token ID 1234, then it just reads out the contents of row 1234, and that's our embedding. But for the purposes of this post, it's best to see that as more of an (extremely effective) performance tweak rather than what's happening conceptually.  ↩

0 views

Code as a Tool of Process

Steve Krouse wrote a piece that has me nodding along: Programming, like writing, is an activity, where one iteratively sharpens what they're doing as they do it. (You wouldn't believe how many drafts I've written of this essay.) There’s an incredible amount of learning and improvement, i.e. sharpening, to be had through the process of iteratively building something. As you bring each aspect of a feature into reality, it consistently confronts you with questions like, “But how will this here work?” And “Did you think of that there?” If you jump over the process of iteratively building each part and just ask AI to generate a solution, you miss the opportunity to understand the intricacies of each part, which together amount to the whole. I think there are a lot of details that never bubble to the surface when you generate code from English, as it’s simply not precise enough for computers. Writing code is a process that confronts you with questions about the details. If you gloss over the details, things are going to behave unexpectedly, and users will discover the ambiguity in your thinking rather than you (see also: “bugs”). Writing code is a tool of process. As you go, it sharpens your thinking and helps you discover and then formulate the correctness of your program. If you stop writing code and start generating it, you lose a process which helped sharpen and refine your thinking. That’s why code generation can seem so fast: it allows you to skip over the slow, painful process of sharpening without making it obvious what you’re losing along the way. You can’t understand the trade-offs you’re making if you’re not explicitly confronted with making them. To help me try to explain my thinking (and understand it myself), allow me a metaphor. Imagine mining for gold. There are gold nuggets in the hills. And we used to discover them by using pick axes and shovels. Then dynamite came along. Now we just blow the hillside away.
Nuggets are fragmented into smaller pieces. Quite frankly, we didn’t even know if there were big nuggets or small flecks in the hillside because we just blasted everything before we found anything. After blasting, we take the dirt and process it until all we have left is a bunch of gold — most likely in the form of dust. So we turn to people, our users, and say “Here’s your gold dust!” But what if they don’t want dust? What if they want nuggets? Our tools and their processes don’t allow us to find and discover that anymore. Dynamite is the wrong tool for that kind of work. It’s great in other contexts. If you just want a bunch of dust and you’re gonna melt it all down, maybe that works fine. But for finding intact, golden nuggets? Probably not. It’s not just the tool that helps you, it’s the process the tool requires. Picks and shovels facilitate a certain kind of process. Dynamite another. Code generation is an incredible tool, but it comes with a process too. Does that process help or hurt you in achieving your goals? It’s important to be cognizant of the trade-offs we make as we choose tools and their corresponding processes for working, because it’s trade-offs all the way down. Reply via: Email · Mastodon · Bluesky

0 views
Martin Fowler Yesterday

Bliki: Architecture Decision Record

An Architecture Decision Record (ADR) is a short document that captures and explains a single decision relevant to a product or ecosystem. Documents should be short, just a couple of pages, and contain the decision, the context for making it, and significant ramifications. They should not be modified if the decision is changed, but linked to a superseding decision. As with most written documents, writing ADRs serves two purposes. Firstly they act as a record of decisions, allowing people months or years later to understand why the system is constructed in the way that it is. But perhaps even more valuable, the act of writing them helps to clarify thinking, particularly with groups of people. Writing a document of consequence often surfaces different points of view - forcing those differences to be discussed, and hopefully resolved. A general rule is to follow an “inverted pyramid” style of writing, commonly associated with news stories. The key is to put the most important material at the start, and push details to later in the record. The common advice is to keep decision records in the source repository of the code base to which they apply, in a dedicated directory. This way they are easily available to those working on the code base. For similar reasons they should be written in a lightweight markup language, such as markdown, so they can be easily read and diffed just like any code. We can use a build task to publish them to a product team's website. Storing them in a product repository won't work for ADRs that cover a broader ecosystem than a single code base. Some folks also feel that keeping ADRs in git makes it too hard for non-developers to work with them. Each record should be its own file, numbered in a monotonic sequence as part of its file name, with a name that captures the decision, so that they are easy to read in a directory listing. Each ADR has a status.
“proposed” while it is under discussion, “accepted” once the team accepts it and it is active, “superseded” once it is significantly modified or replaced - with a link to the superseding ADR. Once an ADR is accepted, it should never be reopened or changed - instead it should be superseded. That way we have a clear log of decisions and how long they governed the work. ADRs contain not just the decision, but also a brief rationale for the decision. This should summarize the problem that led to this decision being needed and the trade-offs that were taken into account. A good way to think of them follows the notion of “forces” when writing a pattern. As part of this it's valuable to explicitly list all the serious alternatives that were considered, together with their pros and cons. Any decision has consequences. Sometimes these are clearly implied from the rationale, but sometimes it's worth clearly stating them in an explicit section. Decisions are usually made under some degree of uncertainty, so it's handy to record the confidence level of the decision. This is a good place to mention any changes in the product context that should trigger the team to reevaluate the decision. ADRs play a central role in the Advice Process, where they are not only used to document decisions, but the act of writing them is used to elicit expertise and alignment. In this case they should also include advice gathered in forming the ADR, although in order to keep things brief, it may be better to summarize the advice in the ADR and keep a full record of advice separately. The most important thing to bear in mind here is brevity. Keep the ADR short and to the point - typically a single page. If there's supporting material, link to it. While ADRs are a form for recording decisions in software architecture, the broader concept of writing short decision records is worth considering in other contexts.
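As an illustration, a minimal ADR covering the sections described above might look like this. This is a hypothetical example, loosely following Nygard's original title/status/context/decision/consequences format; the decision itself is invented:

```markdown
# 7. Use PostgreSQL for the order service

## Status

Accepted (supersedes ADR-3; confidence: high)

## Context

The order service needs transactional guarantees across several tables.
We considered DynamoDB (pro: managed scaling; con: limited transactions)
and MySQL (pro: team familiarity; con: weaker JSON support).

## Decision

We will use PostgreSQL as the order service's primary data store.

## Consequences

Operations takes on running and backing up a PostgreSQL cluster.
Revisit if write volume grows beyond a single primary's capacity.
```

Note how the most important material (the decision and its status) leads, and the alternatives with pros and cons are listed briefly in the context.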
This kind of decision log creates a valuable historic record that can do much to explain why things are the way they turned out. Michael Nygard coined the term “Architecture Decision Record” with an ADR-formatted article in 2011. While he did not originate the idea of a decision log, he did make the case for a lightweight document, with a focus on the decision itself. In this he was particularly inspired by Philippe Kruchten talking about decision registers and decision logs, and by the writing style of software patterns. His article is better than pretty much everything else written on the topic; my only reason for writing this one is to point to some developments since. On this site, there are brief examples of ADR formats in articles by Harmel-Law and by Rowse and Shepherd. adr-tools is a simple command line tool to manage ADRs. It includes a set of ADRs for itself that are a good example of the form. Andrew Harmel-Law, Brandon Cook, David Lucas, Francisco Dias, Giuseppe Matheus Pereira, John King, Kief Morris, Michael Joyce, Neil Price, Shane Gibson, Steven Peh, and Vijay Raghavan Aravamudhan discussed drafts of this post on our internal chat. Michael Nygard gave some background on the origins of his writing.

0 views

Dissecting and Modeling the Architecture of Modern GPU Cores

Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, and Antonio Gonzalez. MICRO'25. The purpose of this paper is to understand the microarchitecture of recent NVIDIA GPUs, to be able to update architectural simulators that are used for research purposes. The authors uncovered lots of interesting tidbits. Take this information with a grain of salt; it is derived from careful experimentation rather than NVIDIA documentation. The paper uses the term sub-core to represent the hardware module which can execute warp-wide instructions. Each SM comprises four sub-cores. Fig. 3 illustrates the components within a sub-core and shows how 4 sub-cores share instruction and data caches: Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Instruction Issue The responsibility of resolving inter-instruction hazards (within a given warp) is split between the compiler and the hardware. There are two mechanisms the compiler can use to inform the hardware how it should avoid hazards: The instruction encoding allows any instruction to set the value of a per-warp stall counter. When the hardware issues such an instruction, it sets the stall counter to the specified value. On each clock cycle thereafter, the counter is decremented by one. The hardware will not issue more instructions for the warp until the counter reaches zero. This is useful for handling hazards with a fixed latency. Variable-latency hazards are resolved with dependence counters. The hardware tracks the value of six dependence counters per warp. The instruction encoding allows the compiler to specify up to two counters which should be incremented when an instruction is issued. One of these counters is decremented when the instruction writes to the register file, and the other is decremented when the instruction reads from the register file (to resolve WAR hazards).
Additionally, the compiler can specify that a given instruction cannot issue until the value of specific dependence counters are zero. In fig. 2 above, the values of these counters are checked in one block, and the counters are incremented in another. The warp scheduler prefers to pick a warp and stick with it (i.e., it is not a round-robin scheduler). If the current warp cannot be scheduled (e.g., the stall counter is greater than zero, or there was a cache miss), then the scheduler switches to another warp. The warp scheduler issues instructions in program order (within a warp). There is no out-of-order execution support. The register file has a limited number of ports, and instructions must be controlled to avoid attempting too many reads or writes in parallel. Register file port contention is not handled by the warp scheduler; instead, it is handled further down the pipe. For example, one stage in fig. 2 will stall fixed-latency instructions until register file read ports are available. The register file cache (RFC) is a hardware component that reduces contention on the register file read ports. The RFC has storage for 6 vectors (and tags). The compiler can mark a source operand of an instruction such that the hardware will store the source operand in the cache for a subsequent operation to use. Note that the RFC does not store per-warp values and is only useful for caching data within one warp. This plays nicely with the “pick a warp and stick to it” scheduling policy. Listing 4 has some example code sequences demonstrating how the compiler can direct the operation of the RFC: Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Memory Access Most of the resources that are shared between sub-cores are shared for efficiency reasons. A single sub-core will not generate memory requests at a high throughput, and there is locality of reference between the memory accesses in multiple sub-cores. The block in fig.
3 is shared in order to properly support thread group shared memory (as a thread group is spread across all sub-cores in an SM). The shared memory access modules can handle one request every two cycles. That means if all 4 sub-cores are contending on memory, each one can make a request every 8 cycles. There is a FIFO of depth ~4 between each sub-core and the shared memory structures. Typical read-after-write latency in shared memory is between 20-40 cycles. The authors built a simulation model based on their experiments. Mean absolute percentage error (MAPE) is one metric for measuring how accurate a simulation model is compared to real hardware. Table 4 shows that the model derived from the findings in this paper is a better performance model for recent NVIDIA GPUs than the baseline: Source: https://dl.acm.org/doi/10.1145/3725843.3756041
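The two compiler-directed hazard mechanisms described in this summary are simple enough to mimic in a toy model. This Python sketch (invented names; nothing from the paper's simulator) shows a warp becoming eligible to issue only after its stall counter drains, and only once the dependence counters it waits on reach zero:

```python
class Warp:
    """Toy model of per-warp hazard tracking (names are invented)."""
    def __init__(self):
        self.stall_counter = 0
        self.dep_counters = [0] * 6   # six dependence counters per warp

    def can_issue(self, wait_on=()):
        # Eligible only if the stall counter has drained and every
        # dependence counter the instruction waits on is zero.
        return self.stall_counter == 0 and all(
            self.dep_counters[i] == 0 for i in wait_on)

    def issue(self, stall=0, increment=()):
        self.stall_counter = stall    # fixed-latency hazards
        for i in increment:           # variable-latency ops in flight
            self.dep_counters[i] += 1

    def tick(self):
        # One clock cycle: the stall counter counts down toward zero.
        if self.stall_counter > 0:
            self.stall_counter -= 1

    def writeback(self, counter):
        # Hardware decrements the counter when the result lands in the
        # register file, releasing dependent instructions.
        self.dep_counters[counter] -= 1

w = Warp()
w.issue(stall=2)                      # instruction with a fixed 2-cycle hazard
history = []
for _ in range(3):
    history.append(w.can_issue())
    w.tick()
print(history)                        # [False, False, True]

w.issue(increment=[0])                # variable-latency op using counter 0
print(w.can_issue(wait_on=[0]))       # False until the write-back arrives
w.writeback(0)
print(w.can_issue(wait_on=[0]))       # True
```

The real hardware tracks this per warp in parallel, with the counter values and wait conditions encoded into each instruction by the compiler.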

0 views

Working on Complex Systems

☕ Welcome to The Coder Cafe! Today, I’m sharing the talk I gave at the Monster SCALE Summit 2026 on working on complex systems. Get cozy, grab a coffee, and let’s begin! Introduction If you’ve been a subscriber since mid-2025, you were already here when I published the post that performed best on my newsletter: Working on Complex Systems. I really loved writing it, and I always had in mind to revisit it at some point. So, when someone at ScyllaDB reached out to invite me to speak at Monster SCALE Summit, I saw the perfect opportunity to turn it into a talk. The video isn’t a 1:1 mapping of the original content. I expanded it with more examples and new ideas. In it, I define what complex systems are, then discuss their common characteristics, and finally explore patterns for navigating them. Hope you enjoy it! Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. More From the Distributed Systems Category: Latency and User Experience · Probabilistic Increment · Bloom Filters · Monster SCALE Summit - 2026 Tech Talks

matduggan.com Yesterday

Hosting a Snowflake Proxy

In the nightmarish world of 2026 it can be difficult to know how to help at all. There are too many horrors happening too quickly to know where one can inject even a small amount of assistance. However, I wanted to quickly post about something I did that was easy, low-impact, and hopefully helps a tiny fraction of a fraction of a percent of people. So I was browsing Mastodon when someone posted a link asking for people to host Snowflake proxies. Snowflake is a lightweight proxy best explained by David Fifield below. Effectively it is a lightweight, easy-to-run way to bypass censorship that doesn't require running a VPN and involves almost zero technical knowledge. It's quite the design, and one that I kept shaking my head thinking "man, I never would have thought of this in a million years" as I read more about how it works. So I have a box sitting on an internet connection where I'm lucky enough to have plenty of excess capacity. I figured "why not share it". I thought I'd post the process here in case people were curious but were worried about how much bandwidth or how many resources it might use. Setting it up on a Debian box took like 5 minutes. That's it. So this has been running for two weeks, and in that two weeks I've served up the following amount of traffic: CPU usage is quite low; memory is slightly higher than I would have thought, but that's likely a function of running for so long. Remember, you can modify the systemd service file to limit memory if you are interested in running this yourself but are concerned about crossing a gig of memory. All in all I haven't noticed I'm running this at all. Obviously it's great to run the browser extension to increase the pool of IP addresses and keep them from becoming static and blockable, but if you have a dedicated box with a large amount of bandwidth and are looking for a quick 20 minute project to help out people trying to deal with internet censorship, this seems like a good one to me.
Get the package from here: https://packages.debian.org/sid/snowflake-proxy , install it, and make sure the service is enabled and running.
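For reference, a minimal sketch of the setup on a Debian box. The package and its systemd unit are both named snowflake-proxy; the 1G memory cap is an illustrative value, not one from this post:

```shell
# Install the standalone Snowflake proxy
sudo apt install snowflake-proxy

# The package ships a systemd unit; enable it and check that it is running
sudo systemctl enable --now snowflake-proxy
sudo systemctl status snowflake-proxy

# Optional: cap the service's memory with a systemd override
# (this opens an editor; the [Service] stanza below is what you would add)
sudo systemctl edit snowflake-proxy
#   [Service]
#   MemoryMax=1G
sudo systemctl restart snowflake-proxy
```

Once it is running, `journalctl -u snowflake-proxy -f` lets you follow the service's log output if you want to watch it serve traffic.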

Brain Baking Yesterday

Please Compensate The Work You Appreciate

The other day, I had a casual conversation with colleagues about buying music. Nobody gave a rat’s ass; they all just either downloaded the files or used Spotify. Most conversations on this topic end like this, so I expected the response from more than a few individuals, but not from everyone. I was deemed the silly fool who buys stuff and supports artists. Yet at the same time, we all bemoan the fact that creative individuals are losing their jobs due to the rise of generative AI. To that I say: maybe that’s our own fault for not properly compensating these people in the first place? Please compensate the work you appreciate. Showing appreciation is not enough to bring food to the table. If you get paid for the work you are doing each month, don’t you think it’s only logical that these people also get paid? Where do you think that money should be coming from? It’s weird to still encounter that much reluctance to support makers in 2026. Most Brain Baking readers will (hopefully) find this obvious, and this article won’t have a big impact on the reasoning of my colleagues, but it doesn’t hurt to reiterate this, so I’ll mention it again: please compensate the work you appreciate. Below are a few remarks I hear every time I bring up this topic (related to music & software in general). I’m not buying music albums, I’m not as rich as you are. Even though the person meant this as a joke, the underlying message was: “I don’t want to spend that much money on music/a creative product”. I buy one to two albums each month, and on Bandcamp artists can decide for themselves how much to charge. You don’t have to save up for this, but you are paying for three streaming services? Right. See also: You Shouldn’t Use Spotify . I can share the Spotify subscription with my sister to make it even cheaper. Did you know things called libraries exist where you can, you know, borrow stuff, including CDs?
Did you know that once you buy a digital album, you can do whatever the hell you want with it, including, you know, lending it to your sister? But those artists already have millions, no way I’m giving them more. This is a tougher nut to crack indeed. Michael Jackson is dead (remember, 2Pac isn’t), so where does that money end up? Even when he was still alive, supporting an artist who already has eight-figure numbers on their account might be harder to justify. I’d say you should prioritise buying and supporting smaller (indie/local) groups. Maybe in this case you can also turn to the second-hand market and at least support your local music shop that way. I only like popular pop music and they already earn more than enough. Consider the previous example; for instance Jackson’s album Bad . It wasn’t only Michael who was involved in the creation of that particular album you like. So all these people don’t deserve to have a meal? Consider this: if everybody thought like you, would that artist still be rich—or Die Tryin’ (got it? 50 Cent? No?)? Micro$oft is bad. You’re right. Today, you should boycott Microsoft —but there used to be a time when they weren’t evil and helped propel software (and its development) into the modern age. If nobody had bought MS-DOS or Windows 3.1, if no OEM deal had ever been made to package Win95/98 with your new beige Compaq tower, the contemporary software landscape might have looked a lot bleaker. What does this teach us? Compensate the work you appreciate only if it’s ethically sound 1 . You can’t find all these things on Bandcamp. Right again, but the remainder can be found on plenty of other platforms such as Apple Music. This is not an excuse to neglect compensating the artist. But thirty percent is pinched off by Apple! Yes. That means seventy percent remains for the artist. And if you don’t buy anything but stream or download music, a hundred percent of zero remains for the artist.
I’ll leave that calculation up to you as an exercise in critical thinking. I used to buy CDs in stores but don’t anymore; these stores are gone. Unfortunately, most brick-and-mortar stores are struggling, indeed. Perhaps also because most people stopped buying music and just download and/or stream stuff instead? The last thing I bought wasn’t good. I’m sorry to hear that. Did you also consider that buying the bad thing might put the creator in a financial situation where they can produce something else that potentially might be better—with your help, that is? Bigger creative projects that take months or years require funding beforehand. I presume you are aware of the disadvantages of being funded by venture capital. I’m not paying anything for free software. Open source does not mean free in the sense that the people who created these packages don’t deserve to eat. Supporting a project sends an important signal to its maintainers: the thing you are doing is relevant, please continue doing so. Sending an appreciative letter also helps but doesn’t pay bills, and since we’re living in an increasingly bill-paying society, many expert developers simply quit working on free software. What do you think all those “donate” buttons are for? I only buy hardware, not software. I’ll be sure to tell my software engineering friends and colleagues to retrain into hardware engineers as soon as possible. I’m not using paid service x because free Google service y exists. You’re still paying, buddy. Just not with money, but perhaps with something that is worth even more than the green currently in your wallet. It’s called your personal data. Going to a music gig already costs an arm and a leg, no way I’m also buying the album. What kind of an argument is this? So you like the band enough to drop money on a concert but you’re against paying for music just to make a statement? Next time simply stay home and buy the album instead; that’s 80% cheaper and you can listen to it again and again.
I don’t have room to collect CDs. Who said anything about collecting? Then buy them digitally. At this point, we’re just arguing for the sake of arguing… I think it’s strange that many people still completely ignore all these arguments for compensating artists. These arguments alone are pretty useless: it’s not the awareness that’s the problem. Most illegal downloaders or lazy Spotify users are well aware of the ethical concerns and financial consequences. Knowing is not enough to get people to act. Most people have heard of global warming and know we’re slowly but surely destroying the earth, yet we happily keep on driving cars, eating meat, flying planes. If you know what does move people, please let me know. Can you appreciate work that is not ethical? Sure you can; there are plenty of cool-looking video games made by extreme right-thinking dickheads. Whether or not to support those dickheads is up to you.  ↩︎ By Wouter Groeneveld on 24 March 2026.  Reply via email .

Susam Pal Yesterday

Wander 0.2.0

Wander 0.2.0 is the second release of Wander, a small, decentralised, self-hosted web console that lets visitors to your website explore interesting websites and pages recommended by a community of independent personal website owners. To try it, go to susam.net/wander . This release brings a number of improvements. When I released version 0.1.0, it was the initial version of the software I was using for my own website. Naturally, I was the only user at first, and I only added trusted web pages to the recommendation list of my console. But ever since I announced this project on Hacker News , it has received a good amount of attention. It has been less than a week since I announced it there, but over 30 people have set up a Wander console on their personal websites. There are now over a hundred web pages being recommended by this network of consoles. With the growth in the number of people who have set up a Wander console came several feature requests, most of which have already been implemented. This release makes these new features available. Since Wander 0.2.0, the file of remote consoles is executed in a sandbox to ensure that it has no side effects on the parent Wander console page. Similarly, the pages recommended by the network are also loaded into a sandbox . This release also brings several customisation features. Console owners can customise their Wander console by adding custom CSS or JavaScript. Console owners can also block certain URLs from ever being recommended on their console. This is especially important in providing a good wandering experience to visitors. Since this network is completely decentralised, console owners can add any web page they like to their console. Sometimes they inadvertently add pages that do not load successfully in the console due to frame embedding restrictions. This leads to an uneven wandering experience because these page recommendations occasionally make it to other consoles where they fail to load.
Console owners can now block such URLs in their console to decrease the likelihood of these failed page loads. This helps make the wandering experience smoother. Another significant feature in this release is the expanded Console dialog box. This dialog box now shows various details about the console and the current wandering session. For example, it shows the console's configuration: recommended pages, ignored URLs and linked consoles. It also shows a wandering history screen where you can see each link that was recommended to you along with the console that recommendation came from. There is another screen that shows all the consoles discovered during the discovery process. Those who care about how Wander works would find this dialog box quite useful. To check it out, go to my Wander console and explore. To learn more about Wander, how it works and how to set it up, please read the project README at codeberg.org/susam/wander . Read on website | #web | #technology


Our Big Dumb Government

Read this article from Heise this morning which basically says that all networking routers are now illegal to purchase in the United States. Well, actually it says all non-US made ones, but that's pretty much all of them. Now obviously this is some form of corruption, some government official is getting a big old paycheck from a US based company (Comcast?) that will benefit from this. Maybe the goal is to force all consumers to rent their equipment rather than buy. Maybe it's to shove government spyware onto routers. Probably it's both. Whatever the reason, all I can say is fuck this government. And yeah, in the grand scheme of horrible things they've done (started a war, run a secret police that kidnaps people on the streets, etc), this is small. But seriously, fuck everyone in power in the United States. People love to respond with "if you don't like it, get out". Man I would FUCKING LOVE TO. But guess what, it's not that easy. When you have a family, property, belongings, pets, careers, you can't just pack up and move. Also, most countries don't want US citizens, big surprise.

Giles's blog Yesterday

Writing an LLM from scratch, part 32f -- Interventions: weight decay

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In my training code, I have this code to create the optimiser: In my last post I looked into the learning rate, the parameter in that code, and found a value for that, plus some extra code to schedule it -- that is, to vary it over time -- which gave better training results. This time I want to go into the weight decay. What is it, what is it for, and is 0.1 really the best value? I was a little concerned going into this that in order to understand this hyperparameter, I'd need to have a good understanding of how the optimiser works; I've been building what I think is a solid mental model of optimisers, but I don't think I understand them well enough to explain them yet, and I've been hoping to delay posting about them to a separate blog post series after this one. The good news is that while weight decay is an important aspect of how optimisers work -- the "W" in AdamW, the thing that makes it different to the older Adam optimiser, is a nod to its different treatment of weight decay -- you don't need to know how the optimiser itself works to understand what weight decay is. Instead, you just need to consider an older and more fundamental aspect of building ML systems -- regularisation. In order to dig into that, let's start with overfitting. Let's imagine a simple classification task: we want to build a model that can -- for any point on this chart -- predict whether a cross or a circle should go there, training it using the sample data points that we already have: Let's say that we train a powerful model on this dataset, and it comes up with this: Now, ab initio we don't know whether that's a good result or not; we need to use our validation set to evaluate it. Let's say that the validation points are these blue ones: We can see that it looks like our powerful model has overfit. 
The training set is all nicely split by the boundary, but the validation points are not. A common solution to that kind of issue, one you might see in introductory ML courses, is to try using a less powerful model. A less powerful model in this case might come up with a less "wiggly" line to separate the two categories, perhaps because it didn't have enough parameters to make it wiggle so much, so you might find that it came up with a classifier that looked more like this: So: we use our validation set to detect overfitting, and we can adjust the complexity of our model to try to avoid it. Now, this is all very well, but it does require manual intervention. We had to do a training run, identify that we were overfitting, and then decide on parameters for the new simpler model (how many parameters should it have?). We could, perhaps, have gone too far and wound up with something like this: ...and underfit. There's no way of knowing, when we start out, what the right number of parameters is, so we need to try various values and then try to work out the optimum balance. Regularisation techniques are designed to try to automate this -- to prevent overfitting without all that tedious mucking about with the model. We've already looked at Dropout , which is one of the standard ways to do that. Although my own mental model of what it does goes some way beyond just helping to prevent overfitting, I may well be wrong -- and given that our LLM train never sees the same training data twice, being a single-epoch run, removing it turned out to improve our model . Another technique is just stopping the training run when you start seeing the validation loss rise, also known as "early stopping". That's such an obvious thing to do that I came up with it independently back when I was doing my early experiments with fine-tuning .
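The early-stopping idea is simple enough to sketch in a few lines of Python. This is an illustrative toy, not code from the actual training setup; the patience value and the loss numbers are invented:

```python
def early_stop_step(val_losses, patience=2):
    """Return the step at which training would stop: the first step where the
    validation loss has failed to improve for `patience` consecutive checks."""
    best = float("inf")
    bad_checks = 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_checks = 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return step
    return len(val_losses) - 1  # never triggered: run to the end

# Toy loss curve: improves for a while, then starts rising.
losses = [3.9, 3.7, 3.6, 3.65, 3.7, 3.8]
print(early_stop_step(losses))  # 4 -- steps 3 and 4 both failed to improve on 3.6
```

The same function works for the "train loss starts rising" variant described above; you would just feed it the running train-loss measurements instead of validation losses.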
Now, we don't have a separate validation set for these training runs, but because we're doing a single epoch, the training data it sees is just as "new to it" as a held-back validation set would be, so we could use a similar trick and treat "train loss starts rising" instead of validation loss rising as a reason to stop the train early. It's not exactly the same thing, but perhaps it would be close enough. But in all of the trains in this series, that's never happened -- while sometimes the train loss blips up for a bit, in the longer term it keeps going down. But there are other techniques that rely on a neat trick. Let's think back to the manual, boring way of trying to find how many parameters are appropriate for a modelling task. We try one number, find that it overfits, then try a lower one, find that it underfits, then try something in the middle and find that it's better but still not perfect one way or the other, and rinse and repeat until we find something we're happy with. This kind of searching through a solution space to find an optimum is exactly what we're doing when training a model. It would be really nice to automate it in the same way. One trick is: if we want to minimise the complexity of our model so that it doesn't overfit, we can try adding a measure of the model's complexity to the loss function -- and then our normal process of gradient descent will try to minimise that, just like it will try to minimise the loss from the training results themselves. And that brings us on to weight decay. Regularisation by weight decay starts off with the hypothesis that the "size" of all of the model's weights, taken together, is a measure of the model's complexity. If the model's weights are small, then it's a simpler model than if they're large. 1 The "size" in this sense is the square of the L2 norm -- that's something we came across in gradient clipping .
The L2 norm is basically all of the weights squared, added together and then the resulting sum square-rooted. You can think of it as the length of the vector that the weights represent -- that is, for our 163M-parameter model, it would be the length of the model's weights considered as a vector in 163-million-dimensional space. 2 And by using its square, we get something that penalises larger values more (and we also save the time in calculating a square root). To me, it's not intuitively obvious that that measure really does express the complexity of the model in any clear sense. After all, you'd think that doubling all parameters would leave it no more complex than it was before, but it would double the L2 norm. 3 But I imagine there is solid maths behind it to say that it does work in a more general way, so in the interests of not disappearing down a mathematical rabbit hole at this stage, I'll take it as given. So: we're using the squared L2 norm as a measure of model complexity, and we're going to add that on to the training loss as a way to try to minimise both. The next question is, how do we balance between the two -- the training loss and the model complexity penalty? This is, in a somewhat hand-wavy way, similar to the decision of how much of the current loss function's gradient to use when adjusting the weights. For that, we use η, the learning rate, to scale the gradients before applying them:

w ← w − η · ∂ℒ/∂w

And the balance between the "real" loss and the model complexity penalty is done in a similar way -- we have a number, the weight decay, normally represented by a lower-case lambda, λ, and we multiply the squared L2 norm by that, something like this:

ℒ′ = ℒ + λ · N²

...where I'm using ℒ for the normal loss on the training inputs vs the targets, N² for the squared L2 norm of the weights, and ℒ′ for the combined loss. And ℒ′ is what we -- in theory -- actually try to minimise using our optimiser.
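As a quick concrete check of those definitions, here is the squared L2 norm and the combined loss in plain Python. The weights, loss value, and λ below are made-up toy numbers, not values from the model:

```python
def squared_l2_norm(weights):
    # All of the weights squared and added together -- the L2 norm without
    # the final square root, which also penalises large weights more strongly.
    return sum(wi * wi for wi in weights)

def combined_loss(train_loss, weights, weight_decay):
    # L' = L + lambda * N^2: training loss plus the complexity penalty.
    return train_loss + weight_decay * squared_l2_norm(weights)

toy_weights = [3.0, 4.0]                      # a "model" with two weights: N = 5
print(squared_l2_norm(toy_weights))           # 25.0
print(combined_loss(2.0, toy_weights, 0.1))   # 2.0 + 0.1 * 25.0 = 4.5
```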
But there's actually a neat simplification that we can apply to make this even easier. Firstly, let's make one small change to the equation above: we'll halve the squared L2 norm before multiplying it by λ:

ℒ′ = ℒ + λ · N²/2

That obviously doesn't change the underlying maths, it just means that we'd need to use larger values for λ to get the same effect. You'll see why that's useful in a bit. Now let's think about normal gradient descent. Again, we work out the gradient of the loss function for each weight, and subtract that times the learning rate η from the weight's value to update it:

w ← w − η · ∂ℒ/∂w

Let's reformulate that a bit. The gradient of the loss function for the weight is its partial derivative against that weight, so we can write the above like this for the version of the loss function including weight decay, ℒ′:

w ← w − η · ∂ℒ′/∂w

Now, we defined ℒ′ above as ℒ + λ · N²/2, so we can substitute that in there:

w ← w − η · ∂(ℒ + λ · N²/2)/∂w

Now, let's think about that L2 norm, N. It's the square root of the sum of all of the weights squared, or equivalently we can square it (like we do in the formula above) and say:

N² = w₀² + w₁² + w₂² + …

Let's drop that in:

w ← w − η · ∂(ℒ + λ · (w₀² + w₁² + w₂² + …)/2)/∂w

Now, the derivative of a bunch of things added together is just each of them differentiated separately and then added together. Let's apply that to the two terms in the brackets:

w ← w − η · (∂ℒ/∂w + ∂(λ · (w₀² + w₁² + …)/2)/∂w)

...and now pull the constant λ and the 2 out of the second partial derivative:

w ← w − η · (∂ℒ/∂w + (λ/2) · ∂(w₀² + w₁² + …)/∂w)

Then we apply the rule for the derivative of a bunch of things added together again:

w ← w − η · (∂ℒ/∂w + (λ/2) · (∂w₀²/∂w + ∂w₁²/∂w + …))

Now, we're doing a partial derivative versus one specific weight, w, which is one of the w₀, w₁, and so on in there. From that perspective, all of the other weights are constant -- which means that their derivative with respect to w is zero. So we can just get rid of all of them apart from the one that actually is w, and we wind up with this:

w ← w − η · (∂ℒ/∂w + (λ/2) · ∂w²/∂w)

The derivative of w² with respect to w is just 2w.
Thanks to that crafty halving of the N² earlier, that means that we can go to this:

w ← w − η · (∂ℒ/∂w + λ · w)

Multiplying that −η across the bracketed terms, we get:

w ← w − η · ∂ℒ/∂w − η · λ · w

That's exactly the same as the normal gradient descent update, using the unmodified loss function without weight decay -- except that we're additionally subtracting the weight's original value scaled down by both the learning rate η and the weight decay value λ. Much simpler :-) (As an aside: the description above is correct for "traditional" simple gradient descent and -- loosely -- for Adam, but AdamW's trick is to do things somewhat differently. That's something I'll go into in more detail when I get round to writing my post on optimisers.) So: weight decay is a regularisation technique that tries to prevent our model from getting any more complex than it needs to be. We have one number, λ, which determines how much to weight complexity against the normal training loss. And, as we can see from the code: ...right now we're setting λ to 0.1. Is that the right value? As usual, the GPT-2 paper is light on the details of the hyperparameters they used, but nostalgebraist wrote a really nice post on Tumblr where they dug into what the number might have been. As they say: It does say it follows the first GPT paper in most respects, and that paper used weight decay of 0.01. Their link for the paper appears to be mistaken, as it's a different (albeit very interesting) paper from 2020, a year after the GPT-2 one, but I believe this is the paper normally called the GPT-1 one . They do indeed use 0.01 there: We also employed a modified version of L2 regularization proposed in [37], with w = 0.01 on all non bias or gain weights.
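The end result of the derivation -- that adding the λ · N²/2 penalty amounts to just also subtracting ηλw from each weight -- can be checked numerically. A toy sketch in plain Python; the loss function, weights, and hyperparameter values are all made up for illustration:

```python
import math

# Toy quadratic "training loss" over two weights, purely for illustration:
# L = (w0 - 1)^2 + (w1 + 2)^2, so each dL/dwi is easy to write down exactly.
def train_loss_grad(w):
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]

eta, lam = 0.1, 0.01   # learning rate and weight decay (made-up values)
w = [0.5, -0.5]
g = train_loss_grad(w)

# Route 1: gradient descent on the penalised loss L' = L + (lam/2) * sum(wi^2),
# whose per-weight gradient is dL/dwi + lam * wi.
route1 = [wi - eta * (gi + lam * wi) for wi, gi in zip(w, g)]

# Route 2: the plain update on L, followed by shrinking each weight by eta*lam*wi.
route2 = [wi - eta * gi - eta * lam * wi for wi, gi in zip(w, g)]

print(all(math.isclose(a, b) for a, b in zip(route1, route2)))  # True
```

The two routes differ only in floating-point rounding, which is exactly what the algebra above says they should.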
The link to the GPT-3 paper looks right, though, and as they say, it uses a weight decay of 0.1: All models use weight decay of 0.1 to provide a small amount of regularization They then do a bit of maths to work out whether the GPT-2 weights are likely to have been regularised by something like weight decay, and come to the conclusion that they probably used 0.01, just like the GPT-1 paper. It seems plausible, but of course not certain. But: tentatively, GPT-2 used 0.01, while we're using 0.1, perhaps because the GPT-3 paper does. What other data points do we have? The Hugging Face "Smol training playbook" has some interesting stuff (including not using weight decay on embeddings, which they say they found helped), but the value that they use is 0.1, which they call "a very vanilla setting". And: Interestingly, over the last few years the AdamW hyperparameters have barely moved: The same triplet is reused in Llama 1, 2, and 3 and DeepSeek-V1, V2, and V3-671B, with no changes. Anyway, assuming they're right about weight decay value for the models they mention (and I assume they've done the research -- I had the link to the DeepSeek paper to hand, and that one certainly says 0.1), it looks like 0.1 is pretty much standard these days. And a quick double-check of what a typical value would be -- asking ChatGPT, Claude, Gemini and Grok -- they all recommend 0.1 as a solid sensible default with AdamW (though they all also say that values between 0.01 and 0.1 are reasonable). So on that basis, I think we can say that 0.1 is a reasonable default, and has pretty much become the standard, but it might be worth trying 0.01 just to see if it does help with tiny models like ours. Are there any dissenting voices to the 0.1 orthodoxy? I came across a paper from a team at Cerebras Systems , " Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training ". 
It's essentially a Chinchilla-like attempt to get scaling laws, but rather than looking just at optimal tokens per parameter in order to work out what you should scale up when adding on more compute, they're trying to find optimal batch sizes and values for weight decay. That's certainly relevant to our interests :-) However, it is very dense and in-depth, and fully understanding it at this stage would need quite a lot of work -- very much a side quest. Definitely something to come back to later, but for now, I'll just try to extract the stuff we need. Let's start off with the optimal batch size, as they have it right there on the first page. We're not going to use it, but it will be interesting to compare with what we're using, and what the DeepSeek paper that I looked at in the last post suggested. They fit this formula: ...where D is the total number of tokens that you're training on. That's quite different to the formula in the DeepSeek paper, which was: ...where C is the number of FLOPs 4 . C scales up linearly with the number of tokens D , but also with the number of parameters in the model N , so you can see the DeepSeek formula as a function of N and D -- as your model gets bigger, so does B opt -- whereas this Cerebras paper is saying that it's just a function of D , unaffected by model size. They did train over a number of different sizes (from 111M parameters up to 1.7B) and their formula seems to hold, so it's not just that they didn't treat model size as relevant. Well, let's see what their formula comes up with. We have 3,260,252,160 tokens in our train, so their formula for B opt comes out as: That's much closer to the 97-or-so sequences that appeared to be optimal when I did some rough-and-ready curve-fitting than the 373 that the DeepSeek formula gave for our setup :-) OK, so what about the weight decay? They don't give a direct formula for that, but they do give a formula for the optimal τ , the AdamW timescale. 
Without going into exactly what that means right now (that's one for my optimisers post later), they relate it to other numbers that we do know with this formula: ...where B is the batch size, D is the amount of data, and of course λ and η are weight decay and learning rate respectively. So if we know the optimal τ we can work out the optimal λ for our training run; solving for λ , we get: So let's work out the τ opt . Their fitted formula is this: ...where TPP is tokens-per-parameter. For us, with our Chinchilla-optimal TPP of 20, we get: Now, we're using a batch size B of 96, and (as before) D is 3,260,252,160. Our learning rate η is 0.0004 for this train -- remember, although in the last post we found that a scheduled learning rate with a peak at 0.0014 was better, in this post we're testing changing weight decay in isolation. 5 So, we just need to plug our τ opt into this: Before we do: having a batch size and a number of tokens in the same formula feels like a unit mismatch. In particular, as part of the explanation of that formula, they tie it back to a value S , the total number of optimisation steps, which they define as D / B . For that to work, either both need to be in terms of tokens, or both need to be in terms of sequences. They clearly say that "B is reported in units of sequences". I'm not sure how to explain this, except by saying that perhaps the D is meant to be in terms of sequences too, even though I'm pretty sure that it's meant to be in terms of tokens in the equation for the batch size. 6 Well, let's assume that is the case, and plug in numbers for sequences. We have 3,260,252,160 training tokens split into 1,024-token sequences, which is 3,183,840 sequences, so that comes out as: (Note that we'd get the same numbers if we plugged in numbers for tokens in both cases, as it would just multiply the top and the bottom by 1,024.) That comes out as 0.33724. Wow!
That's even higher than the "traditional" 0.1, never mind the 0.01 that is the best guess we have for GPT-2. Even if I'm missing something here (I certainly can't say I've read the paper in as much detail as it deserves), that actually gives us a nice number to try out as an experiment. We already have a loss on our test set for a model trained with a weight decay of 0.1, as that was what we used in our baseline train. It looks like it might be worth doing two more, one with the GPT-2 estimate of 0.01, and one with this Cerebras-inspired 0.33724, neatly bracketing it. Let's give them a go! Firstly, the training run with λ = 0.01 : Looks like a nice smooth train -- one small loss spike near the start but it quickly recovered. The output was: That's not a bad final train loss (which does tend to indicate a good model). Let's look at the evals; firstly, the smoke test -- how would it complete "Every effort moves you"? Passably coherent. Let's take a look at the loss it gets on our test set: Not bad at all! Time to upload it to Hugging Face and to add it to the table so that we can compare it to the other interventions we've tried so far. So, it's better than gradient clipping and the QKV bias, but slightly worse than removing dropout and much worse than scheduling (and increasing) the learning rate. Now, that suggests to me that the much-higher Cerebras-inspired weight decay will be worse. My logic is this: if both decreasing it and increasing it improved loss, that would suggest that we have an inverted-U loss curve for weight decay like this: Now, it seems vanishingly unlikely that those downward trends on either side would continue so that you could get arbitrarily low loss by increasing or decreasing weight decay even more. 
So the curve would perhaps look a bit more like this W-shaped one:

My intuition is that having multiple minima -- especially ones that just happen to be on either side of the "standard" value for weight decay -- seems less likely than the alternative: that the higher number will be worse because we're actually on a U-shaped curve more like this:

Of course, my intuition could be completely off on this, and it's definitely still worth doing the test! Here's the loss chart with that:

You can see right away that it was a much choppier train, with quite a few loss spikes, some quite late on. The output at the end reflected this:

...a significantly worse loss at the end. Still, we should do the evals. Firstly the smoke test:

Not too bad, but the loss test is the important one:

That's terrible! Our first result for loss on the test set for an intervention that is actually worse than the baseline. Much worse:

However, at this point I started wondering. When I was looking at the learning rate, the number I selected based on the DeepSeek paper worked well with learning rate scheduling, but failed to converge without. The weight decay number is multiplied by the current learning rate before it's used to reduce weights' values, so it will be affected by both scheduling and η. It seemed likely that Cerebras used a learning rate schedule, and double-checking the paper:

We present results with a single (standard) learning rate schedule ... For a given TPP, all models have the exact same warmup phase: a linear warmup of the learning rate from 0 to the maximum value. ... We use the µP-tuned and adjusted peak η, for 111M models. The learning rate increases linearly to the peak for the first 10% of steps, then decreases from the peak to 0 for the remainder of steps.

Seems pretty certain.
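The λ-η coupling is easiest to see in a minimal sketch of AdamW's decoupled weight decay step (illustrative names, not the training code from this series):

```python
# Minimal sketch of AdamW-style decoupled weight decay: the per-step
# shrink applied to each weight is eta * lambda, so scheduling eta also
# rescales the effective decay. (Adam moment updates omitted.)

def decoupled_decay_step(w, grad_update, eta, weight_decay):
    """One simplified step: the decay shrinks the weight directly,
    scaled by the current learning rate, independent of the gradient."""
    w = w - eta * weight_decay * w  # decay term: shrink by eta * lambda
    w = w - eta * grad_update       # (Adam-preconditioned) gradient step
    return w

# With eta = 0.0004 and lambda = 0.33724, each step shrinks weights by a
# factor of 1 - 0.0004 * 0.33724, i.e. about 0.0135% per step -- and any
# schedule that changes eta changes that shrink factor in lockstep.
```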
Now, I've been following a fairly strict rule of testing interventions in isolation; however, the learning rate and the weight decay parameters are so intertwined that perhaps that's just not reasonable here. I decided to do two more trains, both with learning rate scheduling. I'd use the same schedule as in the last blog post -- a warmup from pretty-much zero to the peak over 10% of the run, followed by a cosine decay to 10% of the peak. In the first, I'd use the same learning rate as our baseline model, 0.0004. In the second, I'd use the one we got from the DeepSeek paper, which did really well when scheduled: 0.0014.

Well, that's less choppy, at least -- the scheduling calmed down the later parts of the run, as you'd expect given that the learning rate was dropping. The output:

Still a kind of high training loss at the end, though. The smoke test:

Not too bad, and the test set loss:

Unfortunately still worse than the baseline of 3.692, albeit better than the one without learning rate scheduling. I'm not going to add it to the table, as this was more in the way of an exploratory training run. Let's see how we do with the larger DeepSeek-suggested learning rate.

For this one, I kept the weight decay at 0.33724. (This was an error, as I realised later -- more on that shortly.)

Ouch, super-choppy loss -- and the loss at the end of the train isn't promising either. Terrible loss at the end. The smoke test gives this:

...which is not too bad, but the test set loss:

...is still pretty terrible (though still a tad better than the one without the learning rate scheduling). Another one to throw away, I think.
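For reference, the schedule used in these trains (linear warmup from near zero to the peak over the first 10% of steps, then cosine decay to 10% of the peak) can be sketched like this; the step count of 33,165 is just the 3,183,840 sequences divided by the batch size of 96:

```python
import math

# Sketch of the schedule used here: linear warmup to the peak over the
# first 10% of steps, then cosine decay down to 10% of the peak.

def lr_at(step, total_steps, peak_lr, warmup_frac=0.1, floor_frac=0.1):
    warmup_steps = int(total_steps * warmup_frac)
    floor_lr = peak_lr * floor_frac
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    # cosine from the peak down to the floor over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return floor_lr + (peak_lr - floor_lr) * 0.5 * (1 + math.cos(math.pi * progress))

peak = 0.0014            # the DeepSeek-derived peak used for the second train
total = 3_183_840 // 96  # 33,165 optimisation steps
```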
But then something occurred to me: the formula to go from the optimal AdamW time horizon τ_opt to the optimal weight decay λ_opt is this:

It has the learning rate η in it -- I even made a footnote saying that I was going to have to remember to recalculate the weight decay value when that changed :-S

Luckily, though, running the real numbers through that:

...which is almost exactly the same as the 0.1 that we've been using for all of our other experiments. So that actually suggests that the Cerebras equations come up with a reasonably usable number for weight decay if you use the DeepSeek-optimal level for the learning rate, and schedule it in a normal warmup-cosine decay manner. But it's still not as good -- for this model -- as using the GPT-2 number. 7

With that, I think it's time to wrap this intervention up! Let's look at our results table again:

We've found that reducing the weight decay from the now-standard 0.1 to a GPT-2-inspired 0.01 improves the loss our model gets on the test set; it's the third-best intervention so far, after getting rid of dropout and updating our learning rate -- and the difference between it and the dropout intervention is pretty small. It did surprise me that the Cerebras-inspired number did so badly, though. To recap:

I think that for now, I should not head any further down this rabbit hole and just take the win -- we have a weight decay parameter that works better than the one we had, and so that's something that can go into our set of working interventions. I can revisit the Cerebras paper later when I've spent more time studying optimisers.

As to why this old-fashioned GPT-2 value might work better than the current default of 0.1: I think that could plausibly be due to scale. The 0.1 value appears to come from the GPT-3 paper, which was essentially an experiment in scaling up GPT-2. Perhaps larger models need larger weight decays? And the model we're working with here is really small, at 163M parameters.
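Since the optimal λ scales as 1/η at a fixed timescale, the recalculation that came out "almost exactly 0.1" can be sanity-checked purely from numbers already in this post (a quick sketch, not the paper's own derivation):

```python
# Sanity check: at a fixed AdamW timescale, the optimal weight decay
# scales as 1/eta, so rescaling the eta = 0.0004 result to the scheduled
# peak of 0.0014 should land close to the usual 0.1.

lam_at_low_lr = 0.33724  # the Cerebras-inspired value for eta = 0.0004
recalculated = lam_at_low_lr * (0.0004 / 0.0014)
print(round(recalculated, 4))  # → 0.0964, "almost exactly" 0.1
```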
So, that's weight decay done! Of the list of planned interventions I wanted to try, only training in full-fat 32 bits (rather than AMP) and weight-tying remain. I think I'll look into the second of those next. Stay tuned! Here's a link to the next post in this series.

More precisely, from Deep Learning: "Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or that put weight on fewer of the features." ...where J(w) is the loss function we're trying to minimise in our training run, combining the "real" loss and a measure of the model's size. ↩

I can't decide whether that makes it easier or harder to understand ;-) ↩

Wild speculation: how about something using the Shannon entropy of the weights...? ↩

Specifically the non-embedding training FLOPs. ↩

Note to self: don't forget to adjust it if we do decide to combine this with the learning rate update. Also: I'm pretty sure from reading the paper that the η that they're using in these formulae is the peak -- they certainly are using learning rate scheduling, albeit with a decay-to-zero rather than the decay-to-10% we used. ↩

Plugging in the number of sequences into the batch size formula gives us an optimal value of 9.47, which definitely doesn't look right based on the trains I've done. ↩

Assuming that the GPT-2 value for weight decay "stacks up" well with the learning rate update and the scheduling from the last post. There may be some useful tests to do when we try to put this all together. ↩

β1 = 0.9, β2 = 0.95
Grad norm clipping = 1.0
Weight decay = 0.1 (Llama 3 405B drops this to 0.01)

With our too-low learning rate of 0.0004, it performed terribly.
When we added scheduling, it was a bit better but still not great.
When we used a DeepSeek-optimal learning rate (and actually did the right calculations to get the real value for weight decay based on that), we got a number which was very close to our baseline train, and seems very unlikely on the face of it to have a significantly different resulting test set loss.


March 2026 blend of links

I promise you I try to avoid linking to more than two articles on the same topic in each edition — and I really want to avoid making my readers feel too depressed reading this blog — but everything seems to be about A.I. or some sort of automation these days, either directly or indirectly. I also notice that most of the topics revolve around the how and rarely the why, as if accelerating tasks to the max, regardless of their purpose, is unquestionably a good thing.

Emily Tucker’s Open Letter to Georgetown Students, In Response to Recent Announcements by the University about “Generative A.I.” – “It’s a big win for them, in their quest to persuade you of your powerlessness, that they have gotten your university to [adopt] their marketing language for its official statements, to shape its academic programming around the presumption of their indefinite economic primacy, and to pay for you to have free access to technologies that will make it harder — the more you use them — to know yourself to be a free intellectual, creative and moral agent.” (via Dan Gillmor)

Overthinking: A.I. wasn't the first to break my heart – This article from Ana Rodrigues read a little too close to home for my own comfort; the feelings described and words chosen are very accurate and indeed increasingly familiar to a growing number of people.

We’re Training Students To Write Worse To Prove They’re Not Robots, And It’s Pushing Them To Use More A.I. – “[…] the AI detection tool flagged the essay as “18% A.I. written.” The culprit? Using the word “devoid.” When the word was swapped out for “without,” the score magically dropped to 0%.”

The Future Smells Like Paper – “The technology should remove bureaucratic friction while preserving ceremonial weight. Make the process transparent without making it trivial. You can't automate meaning. You can only create conditions where it might emerge.” (via iA Writer)

What I mean when I say that I hate Gen A.I.
– “I hate that I do it, and I am angry that I am forced - but I am an adult and I do what I must. I couldn't care less if I write the code I "make", but I am disenchanted with humanity. As a young boy I was full of optimism, I thought we can strive to be better. I was wrong. Money is all that matters.” (via Brain Baking)

Backseat Software – So many quotable parts in this beauty of an article by Mike Swanson. Before writing this very sentence, I successively pasted 3 to 4 quotes, each better than the previous one. What a great read; actually very hard to get through, as you'll want to stop every other paragraph to take notes. (via The Talk Show)

TextEdit and the Relief of Simple Software – An interesting perspective from someone deeply involved in the activity of writing on a computer, but seemingly not as passionate about software as one would assume. I’ll keep an eye on Kyle Chayka’s future columns, as I wouldn’t be surprised if this one is just a first step into the inevitable quest of finding a better writing app on the Mac. I’ve been there, both as a TextEdit-only user and as a text-editing software snob. I even play with Vim in the Terminal from time to time, just so I can feel like Dana Scully typing a report. (via Michael Tsai)

SubEthaEdit – Perfect transition to a really excellent text editor, for people who love “real” Mac apps, with a neat collaboration feature.

The Shape of Paris – At first, I just wanted to watch the first couple of seconds of this to see if it was worth saving for later or not, and I ended up watching it in full. Beautiful scenery that somehow made me nostalgic for the eight years of my life I lived in Paris. Also, has any other sport or hobby ever beaten skateboarding in terms of style and looks? I don’t think so; it’s the epitome of cool. (via Kottke)

Shady Characters – Not as cool as a skateboard video in Paris, but this whole website looks incredible thanks to its exquisite typography.
Subscribed to the RSS feed, and there is also a book, which I’ve just ordered.

Previous blend of links editions

DHH 2 days ago

Denmark desperately needs more inequality

The Danish election is tomorrow. One of the central themes in the incumbent campaign has been a proposed wealth tax. The fig leaf for this proposal was "smaller classrooms in the early grades", but that quickly fell off, and the debate centered on "inequality". And it's true that inequality is a problem in Denmark: There's not nearly enough!

I know that sounds sacrilegious. Even most of the business-friendly press and parties in Denmark dance around this topic. Which makes political sense, because the word "inequality" leads most people to think of poverty and destitution. But that's not the reality in the little kingdom that could.

Denmark has an enormous state apparatus (half of GDP and a third of all workers!) that offers equal access to everything from health care to education and a million programs in between. It could surely be slimmed and trimmed, but on the whole, it works remarkably well. The average Dane is incredibly well cared for by any international standard (high-trust society, hurray!). By those same standards, it's the 8th most equal country in the world on income, as measured by the Gini coefficient (0.28).

But this is where the numbers start spellbinding the debate. Because the Danish Gini coefficient perversely "degrades" if new businesses succeed: any time successful founders and high-paid employees earn incomes above the median, they "worsen" inequality. This is obviously nonsense. When the pie gets bigger, it gets better for all, as long as nobody is robbed of their existing slice.

Denmark should clearly want new successful businesses! It should love to see founders reap big rewards when the risks pay off. It should celebrate early employees making fortunes on stock grants. But all too often, it just doesn't.

Just to put it on a pin: Danes hate flashy cars with a passion that stretches back much further than the current green excuses. But buying a $300,000 Ferrari in Denmark is one of the most patriotic things you can possibly do!
You'll end up paying almost three times the price for the privilege, and sending two-thirds of that to the treasury in taxes. Truly a contribution to the common cause worthy of admiration, not scorn!

But because the debate around inequality is anchored in a fixed-pie paradigm, scorn is all you're likely to get. Anyone who does well in Denmark is immediately suspected of having succeeded at the expense of others. Probably through some form of nefarious exploitation, even if we can't prove what?! There is a core national politics of grievance and envy. But, however human that may be, the future progress and prosperity of the country depends on rejecting this zero-sum delusional dogma.

The Danish economy is currently doing well compared to the rest of the EU, but it's dangerously dependent on a handful of vintage corporations pulling the bulk of the load. This simply has to change if the Danes wish to retain their high standards of living going forward. No corporation lasts forever. Novo Nordisk was Europe's most valuable company at the start of last year; now it's worth half that, and is out of the top ten. And who knows what the closing of the Strait of Hormuz will do to Maersk. These two companies alone represent roughly a quarter of all Denmark's exports!

Meanwhile, new business formation just hit an all-time low. And only a tiny portion of the big employers in Denmark were created in the last thirty years. And thus, almost all the wealth that funds the highly-prized welfare state is coming from really old companies, many of them over a hundred years old. This is wonderful in many ways. The Danes should be rightfully proud to host Maersk (1904), Novo (1923), Vestas (1945), Lego (1932), and other international heavyweights. But it can't rely on this aging corporate vintage to forever bear fruit for tomorrow. Tomorrow needs to be tended to by planting new seeds. New companies. New growth. New capital.
And that's just not going to happen if the Danish state declares itself at war with capital formation or accumulation. It should be so lucky to have more rich people, with more capital, and the talent to deploy it toward a better, shared future (or spend it on heavily-taxed Ferraris!). The ballot boxes open tomorrow morning. It's predicted to be a close one. Fingers crossed for a prosperous choice.


‘CanisterWorm’ Springs Wiper Attack Targeting Iran

A financially motivated data theft and extortion group is attempting to inject itself into the Iran war, unleashing a worm that spreads through poorly secured cloud services and wipes data on infected systems that use Iran’s time zone or have Farsi set as the default language.

Experts say the wiper campaign against Iran materialized this past weekend and came from a relatively new cybercrime group known as TeamPCP. In December 2025, the group began compromising corporate cloud environments using a self-propagating worm that went after exposed Docker APIs, Kubernetes clusters, Redis servers, and the React2Shell vulnerability. TeamPCP then attempted to move laterally through victim networks, siphoning authentication credentials and extorting victims over Telegram.

A snippet of the malicious CanisterWorm that seeks out and destroys data on systems that match Iran’s timezone or have Farsi as the default language. Image: Aikido.dev.

In a profile of TeamPCP published in January, the security firm Flare said the group weaponizes exposed control planes rather than exploiting endpoints, predominantly targeting cloud infrastructure over end-user devices, with Azure (61%) and AWS (36%) accounting for 97% of compromised servers.

“TeamPCP’s strength does not come from novel exploits or original malware, but from the large-scale automation and integration of well-known attack techniques,” Flare’s Assaf Morag wrote. “The group industrializes existing vulnerabilities, misconfigurations, and recycled tooling into a cloud-native exploitation platform that turns exposed infrastructure into a self-propagating criminal ecosystem.”

On March 19, TeamPCP executed a supply chain attack against the vulnerability scanner Trivy from Aqua Security, injecting credential-stealing malware into official releases on GitHub Actions.
Aqua Security said it has since removed the harmful files, but the security firm Wiz notes the attackers were able to publish malicious versions that snarfed SSH keys, cloud credentials, Kubernetes tokens and cryptocurrency wallets from users. Over the weekend, the same technical infrastructure TeamPCP used in the Trivy attack was leveraged to deploy a new malicious payload which executes a wiper attack if the user’s timezone and locale are determined to correspond to Iran, said Charlie Eriksen , a security researcher at Aikido . In a blog post published on Sunday, Eriksen said if the wiper component detects that the victim is in Iran and has access to a Kubernetes cluster, it will destroy data on every node in that cluster. “If it doesn’t it will just wipe the local machine,” Eriksen told KrebsOnSecurity. Image: Aikido.dev. Aikido refers to TeamPCP’s infrastructure as “ CanisterWorm ” because the group orchestrates their campaigns using an Internet Computer Protocol (ICP) canister — a system of tamperproof, blockchain-based “smart contracts” that combine both code and data. ICP canisters can serve Web content directly to visitors, and their distributed architecture makes them resistant to takedown attempts. These canisters will remain reachable so long as their operators continue to pay virtual currency fees to keep them online. Eriksen said the people behind TeamPCP are bragging about their exploits in a group on Telegram and claim to have used the worm to steal vast amounts of sensitive data from major companies, including a large multinational pharmaceutical firm. “When they compromised Aqua a second time, they took a lot of GitHub accounts and started spamming these with junk messages,” Eriksen said. “It was almost like they were just showing off how much access they had. 
Clearly, they have an entire stash of these credentials, and what we’ve seen so far is probably a small sample of what they have.” Security experts say the spammed GitHub messages could be a way for TeamPCP to ensure that any code packages tainted with their malware will remain prominent in GitHub searches. In a newsletter published today titled GitHub is Starting to Have a Real Malware Problem , Risky Business reporter Catalin Cimpanu writes that attackers often are seen pushing meaningless commits to their repos or using online services that sell GitHub stars and “likes” to keep malicious packages at the top of the GitHub search page. This weekend’s outbreak is the second major supply chain attack involving Trivy in as many months. At the end of February, Trivy was hit as part of an automated threat called HackerBot-Claw , which mass exploited misconfigured workflows in GitHub Actions to steal authentication tokens. Eriksen said it appears TeamPCP used access gained in the first attack on Aqua Security to perpetrate this weekend’s mischief. But he said there is no reliable way to tell whether TeamPCP’s wiper actually succeeded in trashing any data from victim systems, and that the malicious payload was only active for a short time over the weekend. “They’ve been taking [the malicious code] up and down, rapidly changing it adding new features,” Eriksen said, noting that when the malicious canister wasn’t serving up malware downloads it was pointing visitors to a Rick Roll video on YouTube. “It’s a little all over the place, and there’s a chance this whole Iran thing is just their way of getting attention,” Eriksen said. “I feel like these people are really playing this Chaotic Evil role here.” Cimpanu observed that supply chain attacks have increased in frequency of late as threat actors begin to grasp just how efficient they can be, and his post documents an alarming number of these incidents since 2024. 
“While security firms appear to be doing a good job spotting this, we’re also gonna need GitHub’s security team to step up,” Cimpanu wrote. “Unfortunately, on a platform designed to copy (fork) a project and create new versions of it (clones), spotting malicious additions to clones of legitimate repos might be quite the engineering problem to fix.” Update, 2:40 p.m. ET: Wiz is reporting that TeamPCP also pushed credential stealing malware to the KICS vulnerability scanner from Checkmarx , and that the scanner’s GitHub Action was compromised between 12:58 and 16:50 UTC today (March 23rd).

Made of Bugs 2 days ago

From error-handling to structured concurrency

How should we think about error-handling in concurrent programs? In single-threaded programs, we’ve mostly converged on a standard approach, with a diverse zoo of implementations and concrete patterns. When an error occurs, it is propagated up the stack until we find a stack frame which is prepared to handle it. As we do so, we unwind the stack frames in order, giving each frame the opportunity to clean up or destroy resources as appropriate.
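The shape of that pattern is easy to see in a few lines of illustrative Python: the error propagates upward until it reaches a frame prepared to handle it, and each intervening frame's cleanup runs innermost-first as the stack unwinds:

```python
# Illustrative sketch: an error propagates up the stack, and each frame
# gets a chance to clean up (innermost first) before a handler sees it.

cleanup_order = []

def inner():
    try:
        raise RuntimeError("boom")
    finally:
        cleanup_order.append("inner cleanup")  # runs as this frame unwinds

def outer():
    try:
        inner()
    finally:
        cleanup_order.append("outer cleanup")  # runs next, further up

try:
    outer()
except RuntimeError:
    cleanup_order.append("handled")            # the frame prepared to handle it

print(cleanup_order)  # → ['inner cleanup', 'outer cleanup', 'handled']
```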

David Bushell 2 days ago

Top ten Figma betrayals

Figma is the industry standard for painting pretty pictures of websites. It’s where designers spend my designated dev time pushing pixels around one too many artboards. Figma promises to remove the proverbial fence between design and development. In reality it provides the comfort of an ideal viewport that doesn’t exist.

I don’t mind Figma (the software), although I prefer Penpot myself. I still dabble in the deceptive arts of web design. Don’t be thinking I’m out here hating on designers. I like to stick my nose inside a Figma file and point out issues before they escalate. Below I cover classic Figma betrayals that I bet you’ve experienced. Betrayals happen when software promises more than it can deliver.

Take a gander at this amazing website design I whipped up in Figma to illustrate the most common betrayals. I told you I was a designer! I’ll evolve this design throughout the post. Figma has deemed 1440×1024 to be “Desktop” resolution so I’ve started there. In this mockup I’ve added a full-width banner of our hero Johnny Business.

I’ve built this website more times than I care to remember. I’ll repeat the same question here I ask every time I build it: what happens at other viewport sizes?

Do I scale the banner proportionally? On wider viewports this is likely to push content out of sight. It might even require scrolling to see the entire image on Johnny’s ultra-wide 8K. The phrase “above the fold” will be spoken in a Teams call, can we avoid that?

Do I also set a maximum height on the banner? This is going to decapitate poor Johnny! He paid a lot for that haircut.

What are we doing below the “Desktop” viewport, by the way? Let’s design for the 402×874 resolution Figma calls “iPhone 17” because it was first on the list. Note the absolute perfect crop of Johnny’s sockless businessing. Okay, next question: how do we move between “mobile” and “desktop”? That’s a very specific focal point. We can’t just change it willy-nilly! Code has rules; logic.
A website must be responsive between all breakpoints. Are we going to use multiple images? At what breakpoint do they swap? Because that perfectly cropped mobile image doesn’t scale up very far.

Hold the phone! A shadow stakeholder has asked for a redesign to “make it pop!” The ultra-wide problem has been solved with a centred fixed-width style. If that is the intention? Does either the banner or header stretch to the edge of the viewport? More importantly, that image and text have no room to move. I’ve only reduced the viewport by 200 pixels and it’s already crashing into Johnny’s face. Are we expecting breakpoints every 100 pixels? — No, wait! Please don’t spend more time designing more breakpoints! Okay, I’ll hold until more breakpoints are designed. Are we extending my development deadline? No. Okay.

As development continues I’ve got more bad news to share. Figma is very happy allowing us to enter arbitrary line breaks for the perfect text fit. That’s not how the web works. One of these options is probably what we’ll see if text is left to naturally break. Yes, we can technically allow for a manual line break. That’s a pain in the content management system, but sure. Text is still forced to wrap on a smaller viewport, then what? Oh that? Now you want the manual line break to magically disappear? (╯°□°)╯︵ ┻━┻

I lied when I said “top ten” Figma betrayals. The issues above can appear in hundreds of guises across any component. If you’re betrayed once you’ll be hit again and again. Figma is not exactly conducive to responsive web design. Designing more breakpoints often leads to more questions, not less.

Another betrayal I pull my hair out over is the three card pattern packed with content. This leads to an immediate breakpoint where one card drops awkwardly below. I dread this because the word “carousel” will be uttered and my sobbing is heard far and wide. Carousels are not a content strategy.
I was once inspecting a Figma file only to witness the enemy cursor drive by and drop several dots underneath an image. The audacity!

Figma betrayals are classic waterfall mistakes that are solved by human conversation. Developers need to be part of the design process to ask these questions. Content authors should be involved before and not after a design is complete. You’ll note I never answered the questions above because what might work for my fictional design isn’t universal. On a tangential topic Matthias Ott notes:

Think about what actually happens when a designer and an engineer disagree about an interaction pattern. There’s a moment of tension – maybe even frustration. The engineer says it’ll be fragile. The designer says it’s essential for the experience. Neither is wrong, necessarily. But the conversation – if your process allows for it to happen – that back-and-forth where both sides have to articulate why they believe what they believe, is where the design becomes robust and both people gain experience. Not in the Figma file. Not in the pull request. In the friction between two people who care about different things and are forced to find a shared answer.

The Shape of Friction - Matthias Ott

Figma is not friction-free and that’s fine. We can’t expect any software in the hands of a single person to solve problems alone. Software doesn’t know what questions to ask. Not then with Clippy, not now with Copilot. Humans should talk to one another, not the software. Together we can solve things early the easy way, or later the hard way. One thing that has kept me employed is the ability to identify questions early and not allow Fireworks, Photoshop, Sketch, XD, and now Figma to lead a project astray.

Thanks for reading! Follow me on Mastodon and Bluesky. Subscribe to my Blog and Notes or Combined feeds.

matduggan.com 2 days ago

Markdown Ate The World

I have always enjoyed the act of typing words and seeing them come up on screen. While my favorite word processor of all time might be WordPerfect (here), I've used almost all of them. These programs were what sold me on the entire value proposition of computers. They were like typewriters, which I had used in school, except easier in every single way. You could delete things. You could move paragraphs around. It felt like cheating, and I loved it.

As time has gone on, what makes up a "document" in word processing has increased in complexity. This grew as word processors moved on from being proxies for typewriters and into something closer to a publishing suite. In the beginning, programs like WordPerfect, WordStar, MultiMate, etc. had flat binary files with proprietary formatting codes embedded in there. When word processors were just proxies for typewriters, this made a lot of sense. But as Microsoft Word took off in popularity and quickly established itself as the dominant word processor, we saw the rise of the .doc file format. This was an exponential increase in complexity from what came before, which made sense because suddenly word processors were becoming "everything tools" — not just typing, but layout, images, revision tracking, embedded objects, and whatever else Microsoft could cram in there.

At its base the .doc is a Compound File Binary Format, which is effectively just a FAT file system, with the file broken into sectors that are chained together with a File Allocation Table. It's an interesting design. A normal file system would end up with sort of a mess of files to try and contain everything that the .doc has, but if you store all of that inside of a simplified file system contained within one file, then you can optimize for performance and reduce the overhead that comes with storing separate objects in a flat file.
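The sector-chaining design just described can be sketched in a toy model (purely illustrative, not the real CFB on-disk layout): a FAT maps each sector index to the next one in its chain, and reading a stream means walking the chain:

```python
# Toy sketch of a FAT sector chain (not the real CFB layout): fat[i]
# holds the next sector after i, with sentinels for end-of-chain and
# free sectors. Sectors 1 and 4 are deleted content still taking space.

END_OF_CHAIN = -2
FREE = -1

fat     = [3, FREE, 5, 2, FREE, END_OF_CHAIN]
sectors = [b"Hel", b"old", b"wor", b"lo ", b"junk", b"ld!"]

def read_stream(start_sector):
    """Follow the FAT chain from start_sector, concatenating sectors."""
    data, sector = b"", start_sector
    while sector != END_OF_CHAIN:
        data += sectors[sector]
        sector = fat[sector]
    return data

print(read_stream(0))  # → b'Hello world!'
```

The flip side of the design: a single corrupted FAT entry strands everything after it in the chain.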
It also optimizes writes, because you don't need to rewrite the entire file when you add an object, and it keeps revision history simple. But from a user perspective, they're "just" dealing with a single file. (Reference)

The .doc exploded and quickly became the default file format for humanity's written output. School papers, office memos, résumés, the Great American Novel your uncle was definitely going to finish — all .doc files. But there was a problem with these files. They would become corrupted all of the goddamn time. Remember, these were critical documents traveling from spinning rust drives on machines that crashed constantly compared to modern computers, often copied to floppy disks or later to cheap thumb drives you got from random vendor giveaways at conferences, and then carried to other computers in backpacks and coat pockets. The entire workflow had the structural integrity of a sandwich bag full of soup.

So when Word was saving your critical file, it was actually doing a bunch of different operations: updating sectors, rewriting the File Allocation Table, and updating the directory structures that tie it all together. These weren't atomic operations, so it was super easy in an era when computers constantly crashed or had problems to end up in a situation where some structures were updated and others weren't. Compare that to a flat text file, where you would either get the old version or a truncated new version. You might lose content, but you almost never ended up with an unreadable file. As someone doing helpdesk IT, I constantly ended up with people that had just corrupted, unreadable files.

And here's the part that really twisted the knife: the longer you worked on the same file, the more important that file likely was. But Word didn't clean up after itself. As a .doc accumulated images, tracked changes, and revision history, the internal structure grew more complex and the file got larger. But even when you deleted content from the document, the data wasn't actually removed from the file.
It was marked as free space internally but left sitting there, like furniture you moved to the curb that nobody ever picked up. The file bloated. The internal fragmentation worsened. And the probability of corruption increased in direct proportion to how much you cared about the contents.

Users had to be trained both to save the file often (as AutoRecover wasn't reliable enough) and to periodically "Save As" a new file to force Word to write a clean version from scratch. This was the digital equivalent of being told that your car works fine, you just need to rebuild the engine every 500 miles as routine maintenance.

The end result was that Microsoft Word quickly developed a reputation among technical people as horrible to work with. Not because it was a bad word processor — it was actually quite good at the word processing part — but because when a user showed up at the Help Desk with tears in their eyes, the tools I had to help them were mostly useless. I could scan the raw file for text patterns, which often pulled out the content, but without formatting it wasn't really a recovered file — it was more like finding your belongings scattered across a field after a tornado. Technically your stuff, but not in any useful arrangement.

Sometimes you could rebuild the FAT or try alternative directory entries to recover slightly older versions. But in general, if the .doc encountered a structural error, the thing was toast and your work was gone forever. This led to a never-ending series of helpdesk sessions where I had to explain to people that yes, I understood they had worked on this file for months, but it was gone and nobody could help them. I became a grief counselor who happened to know about filesystems.
Thankfully, people quickly learned to obsessively copy their files to multiple locations with different names — thesis_final.doc, thesis_final_v2.doc, thesis_FINAL_FINAL_REAL.doc — but this required getting burned at least once, which is sort of like saying you learned your car's brakes didn't work by driving into a bus.

So around 2007 we see the shift from .doc to .docx, which incorporates a lot of hard lessons from the problems of .doc. First, it's just a bundle, specifically a ZIP archive. Now in theory, this is great. Your content is human-readable XML. Your images are just image files. If something goes wrong, you can rename the file to .zip, extract it, and at least recover your text by opening document.xml in Notepad. The days of staring at an opaque binary blob and praying were supposed to be over.

However, in practice, something terrible happened. Microsoft somehow managed to produce the worst XML to ever exist in human history. Let me lay down the scope of this complexity, because I have never seen anything like it in my life. Here is the standards website for ECMA-376. Now you know you are in trouble when you see a 4-part download that looks like the following:

- Part 1 “Fundamentals And Markup Language Reference”, 5th edition, December 2016
- Part 2 “Open Packaging Conventions”, 5th edition, December 2021
- Part 3 “Markup Compatibility and Extensibility”, 5th edition, December 2015
- Part 4 “Transitional Migration Features”, 5th edition, December 2016

If you download Part 1, you are given a PDF. Now if you open that PDF, get ready for it. It's a 5039-page PDF. I have never conceived of something this complicated. It's also functionally unreadable, and I say this as someone who has, on multiple occasions in his life, read a car repair manual cover to cover because I didn't have anything else to do. I once read the Haynes manual for a 1994 Honda Civic like it was a beach novel. This is not that. This is what happens when a standards committee gets a catering budget and no deadline.
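The "rename it to .zip" recovery trick really is that simple, because the document body lives at word/document.xml inside the archive. A rough sketch (the function name and the crude tag-stripping are mine):

```python
import re
import zipfile

def recover_docx_text(path) -> str:
    # A .docx is just a ZIP archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    # Crude last-resort recovery: throw away the tags, keep the text.
    # All formatting is lost, but the words survive.
    return re.sub(r"<[^>]+>", "", xml)
```

Even if other parts of the archive are damaged, any member whose compressed data is intact can usually still be extracted, which is a world away from the all-or-nothing failure mode of the old compound binary format.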
There was an accusation at the time that Microsoft was making OOXML deliberately more complicated than it needed to be — that the goal was to claim it was an "open standard" while making the standard so incomprehensibly vast that it would take a heroic effort for anyone else to implement it. I think this is unquestionably true. LibreOffice has a great blog post on it that includes a striking comparison: the ODF format produces an exponentially less complicated XML file than the OOXML format. Either you could do the incredible amount of work to become compatible with this nightmarish specification, or you could effectively find yourself cut out of the entire word processing ecosystem.

Without question this was done by Microsoft in order to have their cake and eat it too. They would be able to tell regulators and customers that this wasn't a proprietary format and that nobody was locked into the Microsoft Office ecosystem for the production of documents, which had started to become a concern among non-US countries as they realized that all of their government documents and records were effectively locked into Microsoft's software.

However, the somewhat ironic thing is that it ended up not mattering that much, because soon the only desktop application that would matter was the browser. The file formats of word processors were their own problems, but more fundamentally, the nature of how people consumed content was changing. Desktop-based applications became less and less important post-2010, and users got increasingly frustrated with the incredibly clunky workflow of Microsoft Word and traditional files: emailing them back and forth endlessly or working over file shares. So while .docx was a superior format from the perspective of "opening the file without it becoming corrupted", it was fundamentally incompatible with the smartphone era.
Even though you could open these files, soon the expectation was that whatever content you wanted people to consume should be viewable through a browser. As "working for a software company" went from being a niche profession to being something that seemingly everyone you met did, the de facto platform for issues, tracking progress, discussions, etc. moved to GitHub. This was where I (and many others) first encountered Markdown and started using it on a regular basis.

John Gruber, co-creator of Markdown, has a great breakdown of "standard" Markdown, and there are specific flavors that have branched off over time. You can see that here . The important part though is: it lets you very quickly generate webpages that work on every browser on the planet with almost no memorization, and (for the most part) the same thing works on GitHub, in Slack, in Confluence, etc. You no longer had to ponder whether the person you were sending to had the right license to see the thing you were writing in the correct format.

This, combined with the rise of Google Workspace with Google Docs, Slides, etc., meant your technical staff were having conversations through Markdown pages and your less technical staff were operating entirely in the cloud. Google was better than Microsoft at the sort of stuff Word had always been used for: tracking revisions, handling feedback, sharing securely, etc. It had a small subset of the total features, but as we all learned, nobody knew about the more advanced features of Word anyway.

By 2015 the writing was on the wall. Companies stopped giving me an Office license by default, switching instead to "you can request a license". This, to anyone who has ever worked for a large company, is the kiss of death. If I cannot be certain that you can successfully open the file I'm working on, there is absolutely no point in writing it inside of that platform.
Combine that with the corporate death of email, replaced by Slack and Teams, and the entire workflow died without a lot of fanfare. Then with the rise of LLMs and their use (perhaps overuse) of Markdown, we've reached peak Markdown. Markdown is the format of our help docs, and many of our websites are generated exclusively from Markdown. It's now the most common format that I write anything in. This post was originally written in Markdown inside of Vim.

There are a lot of reasons why I think Markdown ended up winning, in no small part because it solved a real problem in an easy-to-understand way. Writing HTML is miserable and overkill for most tasks; Markdown removed the need to do that, and your output was consumable in a universal and highly performant way that required nothing of your users except access to a web browser.

But I also think it demonstrates an interesting lesson about formats. .doc and .docx, along with ODF, are pretty highly specialized things designed to handle the complexity of what modern word processing can do. LibreOffice lets you do some pretty incredible things that cover a huge range of possible needs. Markdown doesn't do most of what those formats do. You can't set margins. You can't do columns. You can't embed a pivot table or track changes or add a watermark that says DRAFT across every page in 45-degree gray Calibri. Markdown doesn't even have a native way to change the font color.

And none of that mattered, because it turns out most writing isn't about any of those things. Most writing is about getting words down in a structure that makes sense, and then getting those words in front of other people. Markdown does that with less friction than anything else ever created. You can learn it in ten minutes, write it in any text editor on any device, read the source file without rendering it, diff it in version control, and convert it to virtually any output format. The files are plain text. They will outlive every application that currently renders them.
They don't belong to any company. They can't become corrupted in any meaningful way — the worst thing that can happen to a Markdown file is you lose some characters, and even then the rest of the file is fine. After decades of nursing .doc files like they were delicate flowers that you had to transport home strapped to your car roof, the idea of a format that simply cannot structurally fail is not just convenient. It's a kind of liberation.

I think about this sometimes when I'm writing in Vim at midnight, just me and a blinking cursor and a plain text file that will still be readable when I'm dead. No filesystem-within-a-filesystem. No sector allocation tables. No 5,039-page specification. Just words, a few hash marks, and never having to think about it again.
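Markdown's simplicity is easy to underestimate: a useful subset fits in a screenful of code. A hypothetical toy converter for headings, bold text, and paragraphs (nowhere near a real implementation, just to show how little machinery the core format demands):

```python
import re

def md_to_html(md: str) -> str:
    # Toy converter: only #-headings, **bold**, and paragraphs.
    html = []
    for block in md.strip().split("\n\n"):
        m = re.match(r"(#{1,6}) (.+)", block)
        if m:
            # A run of 1-6 hashes maps directly to <h1>..<h6>.
            level = len(m.group(1))
            html.append(f"<h{level}>{m.group(2)}</h{level}>")
        else:
            # Everything else is a paragraph; handle **bold** inline.
            text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", block)
            html.append(f"<p>{text}</p>")
    return "\n".join(html)
```

Real parsers (and the CommonMark spec) handle vastly more edge cases, but the fact that a plausible sketch is this short is a big part of why every platform could afford to support the format.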

fLaMEd fury 2 days ago

MF DOOM: Long Island to Leeds

What’s going on, Internet? A quick post to share some thoughts on a great little podcast I just listened to, MF DOOM: Long Island to Leeds . Even if you’re not a fan of underground hip-hop, or even hip-hop in general, you may have heard of MF Doom. MF Doom is your favourite rapper’s favourite rapper.

The podcast, hosted by AFRODEUTSCHE and Adam Batty, takes us through the story of how the reclusive MC from Long Island, New York came up in the underground hip-hop scene and wound up in Leeds, England, where he passed away in 2020.

It’s crazy to me how many America-based hip-hop artists were born outside of the USA but were still able to make it big in American hip-hop. His particular circumstances when it came to leaving and coming back to the States after touring were quite unfortunate, but they highlight the stance the United States takes on immigration. I’d hate to think how this would have gone down if he were going through it today with the current immigration climate over there.

I’ve always been aware of MF Doom and listened to his music, but I’m not a mega fan. What this podcast has done for me is bump some of his albums up to the top of my vinyl wish list. The podcast is made up of five 30-minute episodes. Even if you’re not a big hip-hop head, give it a listen 🤙

Hey, thanks for reading this post in your feed reader! Want to chat? Reply by email or add me on XMPP , or send a webmention . Check out the posts archive on the website.
