Squash and Stretch
Have you ever heard of Disney’s 12 Basic Principles of Animation? In this tutorial, we’ll explore how we can use the very first principle to create SVG micro-interactions that feel way more natural and believable. It’s one of those small things that has a big impact.
From Painfully Explicit to Implicit in Lean
Note: AI was used to edit this post. As a proof-of-human thought and input, I am also publishing the original draft which was written fully before asking AI to edit the post with me. This post is aimed at Lean language beginners that are interested in writing proofs in Lean, but still feel lost when reading Lean code. A very simplified mental model of Lean is that at the core there are two systems:
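To ground the title for readers who have not seen the distinction: Lean lets you move arguments from painfully explicit to implicit by changing parentheses to curly braces, after which the elaborator infers them. A minimal sketch (my own illustration, not code from the post):

```lean
-- The identity function with a painfully explicit type argument:
def idExplicit (α : Type) (a : α) : α := a

-- The same function with the type argument implicit:
def idImplicit {α : Type} (a : α) : α := a

#eval idExplicit Nat 3  -- the caller must spell out the type
#eval idImplicit 3      -- Lean infers α := Nat from the argument
```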
Analogue Prototyping
There is a lot to say about prototyping . Chris Hecker talked about advanced prototyping at GDC 2006, and provided a hierarchy of priorities that goes like this: Analogue prototyping comes in right away at Step 1: Don’t . By not launching straight into your game engine, you can save giant heaps of time between hypothesis and implementation. You can also figure out what kinds of references will be relevant before you reach Step 4: Gather References . There’s another side to analogue prototyping as well. In the book Challenges for Game Designers , Brenda Romero says: “A painter gets better by making lots of paintings; sculptors hone their craft by making sculptures; and game designers improve their skills by designing lots of games. […] Unfortunately, designing a complete video game (and implementing it, and then seeing all the things you did right and wrong) can take years, and we’d all like to improve at a faster rate than that.” Brenda Romero Using cards, dice, and paper leads to some of the fastest prototyping possible. It can be just ten minutes between idea and test, fitting really well into those two days of Step 2: Just Do It . Of course, it can also take weeks and require countless iterations, but that’s part of the game designer’s job after all. This post focuses on what to gain from analogue prototypes of digital games, and the practical process involved. It’s also unusually full of real work, since this is something I’ve done quite a bit for my personal projects and is therefore not under NDA. If you’re curious about something or need to tell me I’m wrong, don’t hesitate to comment or e-mail me at [email protected] . Why you should care about analogue prototyping when all you want to do is the next amazing digital game may seem like a mystery. A detour that leads to having your fingers glued together and a bunch of leftover paper clippings you can’t use for anything. 
In Chris Hecker’s talk, the first suggestion is that you should cheat before you put too much time into anything else. Since you will be cutting and gluing and sleeving, and some of that work takes time, this counts double with analogue prototypes. The easiest way to cheat is to use proxies. If you have a collection of boardgames, this is easy. You can also go out and buy some used games cheap, or ask friends if they have some lying around that they don’t use. Perhaps that worn copy of Monopoly that almost caused a family breakup can finally get some table time again, in a different form. Aesthetics matter. If you want a shortcut to how a game feels to play, getting something that looks the part helps. Go to your local dollar store or second-hand shop and pick up some plastic toys or a game with miniatures similar to what you are after. They can merely be there to act as centerpieces for your prototype. The easiest and most efficient reference board that exists is a standard chessboard: a square grid of manageable size. You can also use a Go board, with the extra benefit that the Go stones also make for excellent proxy components. Beyond those two, you can really use any other board game board too. Just make sure to remember where you got it from if you want to play those games in the future. Or you can even pick up games with missing parts at yard sales, usually super cheap, and scavenge proxy parts from those. For some types of games, finding a good real-world map, perhaps even a tourist map or subway map, can be an excellent shortcut. Not just for wargames, but for anything with a spatial component. The guide map from a theme park or museum works, too. Packs of 52 standard playing cards are fantastic proxies. You can use suits, ladders, make face cards have a different meaning, and much more. Countless prototypes have used these excellent decks to handle anything from combat resolution to hidden information.
It’s also possible to go even further and make your own game use regular playing cards and the known poker combos as a feature. Balatro comes to mind. Many families have a Yatzy set lying around, providing you with a small handful of standard six-sided dice. You can do a lot with just this simple, straightforward randomisation element. But don’t limit yourself to six-sided dice if you don’t have to. Get yourself a set of Dungeons & Dragons polyhedrals and you’ll have four-, eight-, ten-, twelve- and twenty-sided dice rounding out your randomisation armory. HeroScape deserves an honorable mention among fantasy wargames because of its diversity. You can build all manner of strange scenery from just a core HeroScape set and use it effectively to represent almost anything. The same goes for Lego. The main issue with these kinds of proxies is that they can take up a lot of space. Particularly HeroScape, since it has a predefined scale. With Lego, you just need to figure out a scale and stick to it. If there’s a game the people you will play with are especially familiar with, you can skip having to design one of your systems by substituting a mechanic from a game you already know. Say, if you know that you will want statistics in your game, you can copy the traditional lineup of six abilities from Dungeons & Dragons, as well as their scale, to get started. Even if you know that you will want a different lineup later, this means you can test the elements that are more unique to your game faster. An effective way to minimise cut-and-paste time is to print your cards very small. Preferably so all of them fit on a single piece of paper. They will be a bit trickier to shuffle this way, but that’s rarely an issue in testing. This way, you need less paper and you can cut everything faster. Going from eight cards per sheet to 32 is a pretty big difference. Just avoid miniaturizing to the point that you need a magnifying glass.
There’s no need to get fancy with real cardstock. Here are some things you can use. I usually just keep any interesting sheets from deliveries I receive. Say, the sturdy sheet of paper used in a plastic sleeve to make sure a comic book doesn’t bend in the mail. Perfect for gluing counters. There are three things you need to consider for paper: size, weight, and texture. For size, since I’m in Europe, I use the standardized A-sizes. A0 is a giant piece of paper, A1 is half as big, A2 half as big again, and so on. The standard office paper format is A4, roughly equivalent to U.S. Letter. This can easily be folded into A5 pamphlets. I also keep A3 papers around (twice the size of A4), but those I use to draw on, not for printing. I don’t have a big enough home to fit a floor printer. The next thing is paper weight, measured in grams per square meter (GSM). Most home printers can’t handle paper heavier than 120-200 GSM. I always keep standard paper (80 GSM) around, and some heavier papers too. If I print counters or cards I sometimes use the sturdier stock. For reference, Magic cards are printed on 300 GSM black-core paper stock. The black core is there so you can’t see through the card, and it’s taken directly from the gambling circuit. Lastly, the paper’s texture. If you want to work a little on the presentation, it can be nice to find canvas paper or other sturdier variants. I’ve found that glossy photo paper is almost entirely useless in my own printer, however, always smearing or distorting the print. So when I buy any higher-GSM paper I try to find paper with a coarser texture. There are many different kinds of cardboard, and you should try to keep as many around as possible. Some can be good for gluing boards or counters onto, while others can help make your prototype sturdier. This isn’t as important as paper, but gets used frequently enough that it felt worth mentioning. There will be a lot of rambling about cards later, and how to use them.
For now, I only refer to loose cards you can use to prop up your thin paper printouts. These are not strictly necessary, but they make shuffling easier. I don’t play much Magic: The Gathering anymore, but I still have lots and lots of leftover Magic cards, so those are the ones that get used as backing in most of my prototypes. You can cheaply buy colored wooden cubes as well as glass and plastic beads in bulk. It’s not always obvious what you may need, so keeping some different types around can be helpful. More specific pieces, like coins or pawns, can also be useful, but unless these components provide unique affordances, the exact kinds of components you have access to are rarely important. It’s usually enough to be able to move them around and separate them into groups. Storage is another thing that needs solving. If you mostly print paper and iterate on rules, a binder can be quite helpful. Especially paired with plastic sleeves, so you can group iterations of your rules together and store them easily. If you also need to transport your prototypes, the kinds of storage boxes you find in office supply stores will have you sorted. You can push your analogue prototyping really far and build a whole workshop. A 3D printer for making scenery and miniatures, a laser cutter for custom MDF components, and a big floor-sized professional printer that takes over a whole room. If you have the space and the resources for that, you do you, but let’s focus on the smallest possible toolbox for making analogue prototypes. If you want to buy a printer, just be aware that all of them still drop connections and fail to print to this day, the same problems that have plagued printers since forever. I use a color laser printer with duplex (double-sided) printing support and the ability to print slightly heavier paper, up to 220 GSM. This has been more than enough for my needs. Specifically, the duplex feature helps a lot if you want to print rulebooks.
Having a good store of pencils and pens, including alcohol- and water-based markers, is more than enough. You can go deeper into the pen rabbit hole by looking at Niklas Wistedt’s spectacular tutorial on how to draw dungeon maps: it’ll have you covered in the pen and pencil department. Some tools you keep around to hold piles of paper or cards together. Paper clips are extra handy, because they can also be used as improvised sliders pointing at health numbers or other variables. Rubber bands are handy for keeping decks of cards together inside a box and for transportation. Almost any paper-based activity will be a futile effort without decent scissors on hand. Just beware that cutting things out by hand takes more time than you think. If you have a game with many cards, you may have to put on a couple of episodes of your favorite show as you cut them out. If you need more precision than scissors can provide, the next rung on the cutting ladder is to get a proper cutting mat, a steel ruler, and a set of good sharp knives. These can be craft scalpels, metal handles with interchangeable blades (Americans insist on calling these “X-Acto knives”), or carpet knives. Once you have rules and test documents printed, you’ll quickly disappear under a veritable ocean of paper. Though smaller sheaves can be pinned together with a paper clip, staplers are even better. A standard small office stapler is enough. But if you want to staple booklets and not just sheaves, it can be worth it to get a long-reach stapler capable of punching through 20 sheets or more. Attaching paper to other paper can be done in more ways than with clips or staples. Sometimes you want to use glue or adhesive tape. Keeping a standard glue stick and a can of spray glue around is perfect. Regular tape and double-sided tape are also great for many things, even if the main use for tape may just be to make larger-scale maps out of individual pieces of paper.
As mentioned previously, it can take some time to cut out all the cards you want to print. You can cut this time down to a fraction, metaphorically and physically, by getting a paper guillotine. These can usually take a few sheets at a time and will give you clean cuts along identified lines. Yelling “vive la France” when you drop the blade is optional. Lastly, a more decadent piece of machinery that isn’t strictly needed is a paper laminator. These will heat up a plastic pocket and melt the edges together to provide the paper with a plastic surface. It makes the paper much sturdier and has the added benefit of allowing you to use dry-erase markers to make notes and adjustments right on the sheet itself. There is a lot of software out there that can be used to make cards, boards, illustrations, and whatever else you may need. The following is merely a list of what I personally use. Since you will often want to test things at different sizes, vector graphics are generally more useful for board game prototyping than pixel graphics. This is by no means a hard rule, but the resolution of pixel images tends to limit how large you can scale them, while vector graphics have no such limitations. My go-to for vector graphics is Illustrator, but there are more affordable alternatives like Affinity available as well. My other go-to piece of software for analogue shenanigans is InDesign, another Adobe program that can also be replaced by an Affinity product. I’m just personally so stuck in the Adobe ecosystem, after decades of regular use, that it’s too late for me to switch. You can’t teach an old dog new tricks, as the saying goes. InDesign is great for multiple reasons. Not least of all its ability to use comma-separated value (CSV) files to populate unique pages or cards with data, a feature called Data Merge. Speaking of spreadsheets, all system designers have a lovely relationship with their tool of choice.
This can be Microsoft Excel, OpenOffice Calc, or Google Sheets, but the many convenient features of spreadsheets are a huge part of our bread and butter. I don’t even want to know how many sheets I create in an average year. Very broadly speaking, when making an analogue prototype, I will make use of spreadsheets for these reasons: The fantastic Tabletop Simulator is not just a great place to play tabletop games, it’s also a great place to test your own games. Renowned board game designer Cole Wehrle has recorded some workshops for people interested in this specific adventure, and let’s just say that once you have this up and running it will make it a lot easier to test your game. Especially if the members of your team don’t all live in the same city. Its biggest strength is how quickly you can push new versions to anyone with a module already installed. If you share your module through the Steam Workshop, it’s even easier. For most analogue prototypes, this isn’t doable, simply because of NDAs and rights issues. So much stuff! Let’s put it all together. The way I’ve talked about this, there are really six steps to the process of making an analogue prototype: This is more important than you may think. An analogue prototype can easily become a design detour. Because of this, you need to formulate why you are making this analogue prototype. “Test if it’s fun with infinitely respawning enemies” could be a goal. “See what works best: party or individual character” could be another one. But it can also be a lot narrower, for example designed to test the gold economy in your game. Perhaps even to balance it. The point is that you need a goal, and you need to stick to it and cut out everything that doesn’t serve that goal. If you need to test how travelling works on the map, you probably don’t need a full-fledged combat system, for example. Facts are the smallest units of decision in your game’s design.
Stuff that every decision maker on your team has agreed on, and that can therefore safely inform your analogue prototype. This can be super broad, like “the player plays a hamster,” or it can be more specific, like “the player character always has exactly one weapon.” You need these facts to keep your prototype grounded, but you don’t necessarily need to refer to them all at once. Pick the ones that are most important to your goal. With a goal and some facts, you need to figure out what systems you will use. Try to narrow them down more than you may think necessary. Don’t make a “combat system,” but rather one “attack system” and another “defense system.” The reason is that what you are after is the resource exchanges that come from this, and the dynamics of the interactions. The attack system may take player choices as input and dish out damage as output, while the defense system may accept armor and damage as input and send health loss as output. You can refer to the examples of building blocks in this post for inspiration. This is where we come to the biggest strength of analogue prototyping: real humans provide a lot more nuance and depth than any prototype, analogue or digital, can do on its own. One player can take on the role of referee or game master, similar to how it would work in a tabletop role-playing game. In many wargames of the past, this was called an umpire: someone who would know all the rules and act as a channel between the players and the systems. If you have built a particularly complicated analogue prototype, a good way to test it can be to act as a referee and simply ask players what they want to do instead of teaching them the details of the rules. Players can play each other’s opponents, representing different factions, interest groups, or feature sets via their analogue mechanics. If you built an analogue prototype of StarCraft, you’d probably do it this way, with three players taking on one faction each.
One player can play the enemies, while another plays the economy system, or the spawning system. The goal here is to put one player in charge of the decisions made within the related system. If someone wants to trade their stock for a new spaceship, and this isn’t covered by the rules, the economy-system player can decide on the exchange rate and the spawning-system player can say that this spawns a patrol of rival ships. Just take ample notes, so you don’t forget the nuances that come out of this process. There are many different ways to use the components you collected previously. Some of them may not be intuitive at all. The humble die: perhaps the most useful component in your toolbox. Just look at the following list and be amazed: People have been using playing cards for leisure activities since at least medieval times. Just as for dice, you’ll see why right here, and perhaps these things will fit your needs better than dice: Humans are spatial beings that think in three dimensions. Even such a simple thing as a square grid where you put miniatures will create relationships of behind, in front of, far away from, close to, etc. Not all analogue prototypes need this, but if you do need it, here are some alternatives to explore: With the fast iterations of analogue prototypes, you can usually just change a word or an image somewhere and print a new page. This means you may have many copies of the same page after a while. To prepare for this situation, make sure to have a system for versioning. It doesn’t have to be too involved, especially if you’re the only designer working on this prototype, but you need to do something. I usually just iterate a number in the corner of each page. The 3 becomes a 4. I may also write the date, if that seems necessary.
I may also add a colored dot (usually red) to pages that have been deprecated, since just the number itself won’t say much and you may end up referring to the wrong pages if you don’t have an indicator like this.

Step 1: Don’t: Steal it, fake it, or rehash stuff you have already made before you start a new prototype.
Step 2: Just Do It: If it takes less than two days, just do it. As the saying goes, it’s easier to ask for forgiveness than for permission.
Step 3: Fail Early: When something feels like a dud even at an early stage, you can assume that it is in fact a dud. There’s nothing wrong with abandoning a prototype. In fact, learning to kill things early is a skill.
Step 4: Gather References: Prototypes can only really help with small problems. Big problems you must break apart and figure out. Collect references: white papers, mockup screenshots, music, asset store packs, and so on. Anything that helps you understand the problem space.

The same psychology applies. Rewards, risk-taking, information overload: many of our intrinsic and extrinsic motivators are triggered just as much by board games as by digital games. The distance is not nearly as far as we may tell ourselves.
Players can represent complex systems. A player has all the complexity of a living, breathing human, making odd decisions and concocting strange plans. This lets you use players as representations of systems, from enemy behaviors to narrative.
Analogue games are “pure” systems. If you can’t make sense of your mechanic in its naked form, you probably can’t expect your players to make sense of it either.
Similar affordances. Generating random numbers with dice, shuffling cards, moving things around a limited space; analogue gaming is always extremely close to digital gaming, even to the point that we use similar verbs and parlance.
Holism. Probably the best part of the analogue format is that you can actually represent everything in your game in one way or another.
It doesn’t have to be a big complex system, as long as you provide something to act as that system’s output.

Listing all the actions, components, elements, etc., that are relevant. Just getting things into a list can show you if something is realistic or not.
Cross-matrices for fleshing out a game’s state-space. If I know the features I want, and the terrains that exist, a cross-matrix can explore what those mean: a feature-terrain matrix.
Notes on playtests. How many players played, what happened, who won and why, etc.
Calculators of various kinds, incorporating more spreadsheet scripting. Can be used to check probabilities, damage variation, feature dominance, etc.
Session logging. If I want to be more detailed, I can log each action from a whole session and see if there are things that can be added or removed.

Set a Goal. Identify Facts. Systemify the Facts. Consider the Roles of Players. Tie It Together with Components.

Types of dice: you can use any number of sides, and make use of the corresponding probabilities. Dividing 1 by the number of sides gives you the probability of any single result. So 1/6 ≈ 0.167 means there’s a ~17% chance to roll any given side on a six-sided die. Use the dice that best represent the percentage chances you have in mind.
Singles: rolling a single die and reading the result. Pretty straightforward.
Sums: rolling two or more dice and adding the results together.
Pools: rolling a handful of dice and checking for specific individual results or adding them together.
Buckets: rolling a lot of dice and checking for specific results. The only reason buckets of dice are separated from dice pools here is that they have a different “feel” to them; they are functionally identical.
Add/Subtract: add or subtract one die from the result of another, or use mathematical modifiers to add or subtract from another result.
X- or X+ : require specific results per die.
In these cases X- would mean “X or lower,” and X+ would mean “X or higher.”
Patterns: like Yatzy, or what the first The Witcher called “Dice Poker:” you want doubles, triples, full houses, etc.
Reroll: allowing rerolls of some or all of the dice you just rolled. Makes the rolling take longer, but also provides increased chances of reaching the right result. Some games allow rerolling in realtime and then use other time elements to restrict play. So you can frantically keep trying to get that 6, but if an hourglass runs out first, you lose.
Spin: spinning the die to the specific side you want.
Trigger: if you roll a specific result, something special happens. It could be the natural 20 that causes a critical hit in Dungeons & Dragons, or it can be that a roll of 10 means you roll another ten-sided die and add it to your result.
Hide: you roll or set your result under a cupped hand or a physical cup, hiding the result until everyone reveals at the same time or the game rules require it.
Statistics: common sense may say that you can’t possibly roll a fifth one after the first four, but in reality you can. Dice are truly random.

Shuffle: shuffling cards is a great way to randomise outcomes. This can be done in many different ways as well: you can shuffle a “bomb” into half of the pile and then shuffle the other half to place on top, for example. There are many ways to mix up how to shuffle a deck of cards.
Uniqueness: each card can only be drawn once, which means that you can make each card in a deck unique, and you can affect the mathematics of probability by adding multiple copies of the same card. The board game Maria, for example, uses standard playing cards, but in different numbers.
Front and back: the face and back of the cards can have different print on them, or the back can just inform you what kind of card it is so you can shuffle them together in setup.
Of course, the fact that you can hide the faces from other players is also what makes bluffing in poker interesting.
Turn, sideways: what Magic calls “tapping” and other games may call exhausting or something else. Some cards can be turned sideways (in landscape mode instead of portrait mode) by default.
Turn, over: flipping a card to its other side can serve to show you new information or to hide its face from everyone around the table. It can represent a card being exhausted, or injured, or other state changes, like a person transforming into a werewolf.
Over/under: cards can be placed physically over or under other cards, to show various kinds of relationships. An item equipped by a character, or a condition suffered by an army, for example.
Card grids: cards can be placed in a grid to generate a board, or to act as a sheet selection for a character. One card could be your character class, another could be a choice of quest, etc. It’s a neat way to test combinations.
Hide cards: if you want to get really physical, you can hide cards on your person, under boards, and so on. This was one way you could play Killer, by hiding notes your opponents would find.
Card text: if you print your own cards, you can have any text you want on them. Reminders, rules exceptions, etc.
Deck composition: how you put decks together will affect how the game plays, and predesigning decks for different tests can be very effective. Perhaps you remove all the goblins in one playtest and have only goblins in another.
Deck building: decks can also be constructed through play, similarly to how Slay the Spire works. A style of mechanic where you can start small and then grow in complexity throughout a session.
States: cards can be in different states. On the table, in your hand, available from an open tableau, shuffled into a deck, discarded to a discard pile, and even removed from the game due to in-game effects.
Semantics: something that Magic: The Gathering’s designer, Richard Garfield, was particularly good at was figuring out interesting names for the things you were doing. You don’t just play a card, you’re casting a spell. It’s not a discard pile, it’s your graveyard. These kinds of semantics can be strong nods back to the digital game you are making, or they can serve a more thematic purpose.
Statistics: with every card you draw, the deck shrinks, increasing the chances of drawing the specific card you may want. You are guaranteed to draw every card if you go through a whole deck, which is one of the biggest strengths of decks of cards.

Node or point maps: picture a corkboard with pins and red thread, or just simple circular nodes with lines between them. You can draw this easily on a large sheet of paper and just write simple names next to each circle to provide context.
Sector maps: one step above the node or point map is the sector map, where regions share proximity. Grand strategy games have maps like this, where provinces share borders. Another example is more abstract role-playing games, where a house’s interior is maybe divided into two sectors and the whole exterior area around it is another sector. It’s excellent for broad-stroke maps.
Square grids: if you want a grid, the square grid is probably the most intuitive. But it also has a mathematical problem: diagonals reach about 1.4 times (√2) as far as cardinals. This means you need to either not allow diagonals, or allow them and account for the problems that will emerge.
Hexagon grids: these are more accurate and classic wargame fare, but they will also often force you to adapt your art to the grid in ways that are not as intuitive as with a square grid.
Freeform: finally, you can just take any satellite image or nice drawn map, perhaps an overhead screenshot from a level you’ve made, and use it as a map in a freeform capacity.
This may force you to use a tape measure or some other way to measure distances, but if the distances are not important, that matters a lot less. For example, if your game shares sensibilities with Marvel’s Midnight Suns.
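The dice mechanics listed above are easy to sanity-check with exact enumeration before you ever print a prototype sheet. Here is a minimal Python sketch; the helper name `exact_probability` is my own, not from any library:

```python
from fractions import Fraction
from itertools import product

def exact_probability(n_dice, sides, success):
    """Exact chance that a roll satisfies `success`, a predicate over
    the tuple of face values, found by enumerating every outcome."""
    rolls = list(product(range(1, sides + 1), repeat=n_dice))
    hits = sum(1 for roll in rolls if success(roll))
    return Fraction(hits, len(rolls))

# Singles: a 6 on one d6 is 1/6, the ~17% mentioned above.
p_six = exact_probability(1, 6, lambda r: r[0] == 6)
# Sums: a total of 7 on 2d6 is the most likely total (6/36).
p_seven = exact_probability(2, 6, lambda r: sum(r) == 7)
# Pools with a 5+ target: at least one success on 4d6.
p_pool = exact_probability(4, 6, lambda r: any(d >= 5 for d in r))
print(p_six, p_seven, p_pool)  # 1/6 1/6 65/81
```

Swapping the predicate lets you check sums, pools, buckets, patterns, or X+ targets with the same three lines, which is essentially what the spreadsheet calculators mentioned earlier do.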
How I use org-roam
While Org-mode is fantastic in its core functionality, there is a lovely little extension that creates a way to build a wiki for all personal knowledge, ideas, writing, work, and so much more: org-roam. A “clone” of Roam Research; if you are familiar with Logseq or Obsidian, this will have you feeling right at home (albeit, actually at home inside emacs). It has taken some time to figure out how I wanted to use org-roam, but I think I have cracked the code. I will discuss how I’ve been capturing, filing away, and taking action on everything that pops into my head. As a small overview, org-roam gives you the ability to create notes (big whoop). The power comes in the backlink to any previous note that may be in your system, similar to how Wikipedia links between articles. As I write in any org-roam document (node), I see suggestions of past notes I have taken, giving the option to immediately create a link back to them. This is fine on its own, but then you start to see inter-linking between ideas, which becomes massively helpful for research and for creating new connections of information that one would generally be blind to in other methods of note taking. Org-roam uses an SQLite database (which some critique), as well as an ID system in which everything (files, org headers) has a unique ID. This ID is what forms the link between our notes. Let’s discuss how I’m using this. As with my org-mode flow, the goal is to not only capture, but to reduce the friction of capture to almost nothing. I have capture templates for a handful of files in my general org-mode file (listed at the end of this post). What I was lacking was a way to integrate with org-roam and create backlinks across the notes I was taking on everything. Enter the new capture system. I use a capture command to hit a daily org-roam file (~/org/roam/daily/2026-04-10.org, for example) which is my capture file for everything for the day. I write everything in this file.
I mean everything. I then take 5 minutes at the end of every day and file away these items into org-roam nodes if they are “seeds” (in the digital garden sense), actionable items, or things I want to look into at some point, or just leave them in the daily file to be archived for posterity. Whenever I want to write something on the computer, emacs is the place I do so, in which I have autocomplete, spell checking, and macros right at my fingertips. I hit a keybind that universally reaches out to emacs and opens the org-roam-dailies-capture-today buffer if I am not on workspace 1 (emacs), capture the thought/writing/email/text/content, and move on with my day. What this also allows is the use of my capture system via termux on my phone. I simply leave my ~/org/roam/daily/date.org file open every morning in termux, running emacsclient against my workstation, and go about my day. This means all notes live in one place, I don’t generally have to go into “note to self” in signal or xmpp and move things around, and org-roam works out of the box for backlinking and clean-up. Is it ideal? No, but it is still better than the various mobile orgmode apps I have tried. I treat the phone just as a capture node; all organizing and refiling happens on my bigger screen at the end of the day. The major benefit of this methodology is that we have content which is greppable forevermore. If I write, it is written in emacs. Anything more than a sentence or two is in my daily file. I don’t care what it is, I can grep it for all time, version control it, and it is ready to expand upon in the future. By the end of the day, I may have dozens of captures in my daily file. I sit down, open the file up, and review. If the item is actionable or has a date/deadline associated with it, then it is filed to inbox.org/calendar.org. If it is an idea that is a seed of something larger, it is filed into its own org-roam node that can then grow on its own.
If something needs to be filed under an existing roam-node, that occurs here as well, and backlinks organically take shape as I write. Finally, if the item is none of these things, it just lives in the daily file as an archive that can be revisited later with ripgrep as stated above. I have bound to project-wide for this, which I use frequently for finding anything. Refiling is simply accomplished by: which will give you files and org headings under which to refile everything. As we grow our notes database, we will start to see that we have autosuggestions offered via cape and corfu. They look like so: allowing a direct link to previous notes’ IDs, which are portable across the filesystem, so you can move files around to logically work in a hierarchy if you so choose. The standard advice is to keep a flat file system in which all notes are in one directory, but I like organization too much and have created nested directories for this. These links and IDs are handled via the function that can be set to fire automatically on file changes. Oh, the fabled “neuronal link graph” that was popularised by Obsidian - how could we forget about that? opens a D3-rendered graph that looks nice, but I have not really found use for it other than pretty screenshots to show how “deep(ly autistic)” I am. I find this to be the easiest way to maintain a note taking system that actually grows with the author, while staying sane and keeping everything organized. The notes that we create allow us to understand deeply, and to make connections that are otherwise missed. As in my discussion with Prot, writing everything down has greatly impacted my thinking and allowed growth in areas that are deeply meaningful. Org-roam (and holistically org itself) is once again, just text files. So, you can very easily take any .org file and back it up and hold onto it for all time, as you will never have any proprietary lock-in.
The database is just an sqlite database, which is the most portable and easily malleable database in existence. The two interlink to give you peace of mind were you ever to leave emacs (haha, you won’t). If you don’t want the “heaviness” of org-roam’s database structure, you could use Prot’s denote package that is a more simplified (yet still highly powerful) method. I just like the autosuggestions and speed of roam, but your mileage may vary. So there you have it, the way that I am using org-roam to create a mind map/second brain and keep notes on everything I come across on a daily basis. How are you using org-roam, or do you have a note taking system you swear by? Post below or send me an email! As always, God bless, and until next time. If you enjoyed this post, consider Supporting my work , Checking out my book , Working with me , or sending me an Email to tell me what you think. inbox.org: Actionable items with a TODO - these are then filed away to projects or kept in this file until acted upon. calendar.org: Scheduled or deadlined items bookmarks.org: web bookmarks contacts.org: every contact I have and reach out to system. notes.org: but this is being replaced as we will see text messages emails (if not already sent via mu4e) notes to self LLM prompts websites I visit journal entries this very post, that will then become a blog post in my writing project code snippets things I want to remember
Running NixOS Micro VMs on MacOS
microvm.nix is a framework to run NixOS-based micro VMs on various platforms. In particular, it can use vfkit to run micro VMs on macOS that use the macOS virtualization framework to provide a more performant VM than QEMU. microvm.nix works well, but the documentation is a bit lacking. I had to figure out some gotchas while setting this up on my MacBook Pro M4, so I decided to write this note. This tutorial requires Nix and Nix Darwin to be installed on the macOS machine. To build a micro VM, we need a NixOS builder machine running AArch64 Linux. Thankfully, it is really easy to set one up with Nix Darwin. Assuming we have Nix Darwin set up with a Nix flake like: First, we add the Nix Linux builder config: Now, we switch the system config to build and start the Linux builder: We should verify that the builder is working: It may take up to a minute for the builder to start. Once SSH works, we can proceed. We create a file with the micro VM configuration: This configures a micro VM with 4 vCPUs, 8 GB RAM, and a 40 GB disk. The disk image is used to store the Nix packages downloaded within the VM. It is mounted at . The host’s Nix store is mounted read-only at . The option combines these two with overlays to create the VM’s Nix store at . We can share additional directories from the host and mount them in the VM, as we do here for the directory from the macOS host. The next couple of lines set up networking in the VM. The vfkit hypervisor supports only NAT networking. This means: There are ways to work around this using gvisor-tap-vsock and vmnet-helper, but we are not going into them here. We can uncomment the line if we want a graphical NixOS VM. Finally, the workaround for the big gotcha! By default Nix does builds in a sandbox, and the sandbox is created (and deleted) on the root filesystem. However, microvm.nix uses a temporary filesystem residing in RAM for the root filesystem.
This means that the Nix builds may cause the root FS and RAM to fill up, causing out-of-memory or out-of-disk-space errors. To prevent that, we disable the sandbox and set the build directory to be at on the disk image we mounted. Next, we integrate the VM config with the Nix Darwin flake: Let’s go over the tricky bits. The wrapper script rebinds Ctrl + ] to send the interrupt, suspend and quit signals instead of the usual Ctrl + C so that we can use Ctrl + C inside the VM without it causing the VM to shut down. We add the script to our system packages. Lastly, the defines the actual micro VM using the file. Finally, we build and install the micro VM: And, now we can run it from any directory: Note that the disk image file will be created in the directory in which we run the above command. After this, we can remove the Linux builder config and switch again to stop and delete it. Now we have a performant micro VM running NixOS to play around with in our macOS machine. That’s all I had for this note. I hope this helps. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! Thanks for reading this post via feed. Feeds are great, and you're great for using them. ♥ This post was originally published on abhinavsarkar.net . Read more of my posts and notes . The VM can make outgoing connections to the host/internet. The host cannot initiate connections to the VM.
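For reference, the micro VM definition described in this note has roughly the following shape. This is a sketch written from memory of microvm.nix's option names (hypervisor, vcpu, mem, volumes, shares); treat it as illustrative and check the microvm.nix documentation for the exact schema:

```nix
{
  microvm = {
    hypervisor = "vfkit";       # use the macOS virtualization framework
    vcpu = 4;
    mem = 8 * 1024;             # in MB, i.e. 8 GB RAM

    # Disk image for Nix packages downloaded inside the VM
    volumes = [{
      image = "var.img";
      mountPoint = "/var";
      size = 40 * 1024;         # in MB, i.e. 40 GB
    }];

    # Host Nix store, mounted read-only into the VM
    shares = [{
      proto = "virtiofs";
      tag = "ro-store";
      source = "/nix/store";
      mountPoint = "/nix/.ro-store";
    }];
  };
}
```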
Proxying GoatCounter Requests for a Hugo Blog on CloudFront to bypass Ad Blockers
I’ve been running GoatCounter on my site using the script . The problem is that ad blockers like uBlock Origin block it (understandably). To get around this, I set up proxying so that the GoatCounter requests go to an endpoint under my domain, and from there CloudFront handles them and sends them to GoatCounter. Most ad blockers work based on domain, and GoatCounter's domain is on the blocklists. Since the browser is now sending requests to the same domain as my site, it shouldn’t trigger any ad blockers. This post explains how I did it in case it’s useful for anyone else. It’s possible to self-host GoatCounter, but my approach was easier to do and means less infrastructure to maintain. Perhaps in the future. I know there are concerns around analytics being privacy-invasive. GoatCounter is privacy-respecting. I care about privacy. I am of the belief that GoatCounter is harmless. I just like to keep track of the visitors on my site. Read the GoatCounter developer’s take if you want another opinion: Analytics on personal websites. Clicking through the AWS console to configure CloudFront distributions is a pain in the ass. I took the time to finally get the infrastructure for my blog managed as infrastructure-as-code with Pulumi and Python. So while you can click around the console and do all of this, I will be showing how to configure everything with Pulumi. If you don’t want to use IaC, you can still find all of these options/settings in AWS itself. To set up GoatCounter proxying via CloudFront, we’ll need to make a few changes. CloudFront functions are JavaScript scripts that run before a request reaches a CloudFront distribution’s origin. In this case, the function strips the from . We need to strip for two reasons: Here is the code for the function: And here is the CloudFront function resource defined in Pulumi (using Python) that includes the JavaScript from above.
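The function body itself is short. Here is a sketch of what such a CloudFront Function looks like; the /goat prefix is a placeholder I made up, not necessarily the actual path used on the site (CloudFront Functions receive the request as event.request and must return it):

```javascript
// Viewer-request CloudFront Function: strip a hypothetical "/goat"
// prefix so the origin sees the path GoatCounter expects.
function handler(event) {
    var request = event.request;
    // e.g. "/goat/count" -> "/count"
    request.uri = request.uri.replace(/^\/goat/, '');
    if (request.uri === '') {
        request.uri = '/';
    }
    return request;
}
```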
This is a new resource defined in the same Python file where my existing distribution is defined: Here is my existing CloudFront distribution being updated with a new origin and cache behavior in Pulumi code. At the time of writing, CloudFront only allows to be a list of HTTP methods in specific combinations. The value must be one of these: Since the GoatCounter JavaScript sends a request, and the third option is the only one that includes , we’re forced to use all HTTP verbs. It should be harmless though. With the CloudFront function defined and the distribution updated in my Pulumi code, I ran to apply the changes. Finally, I updated goatcounter.js to use the new endpoint. So instead of I changed it to my own domain at the very top of the snippet: After this, I built my site with Hugo and deployed it on S3/CloudFront by updating the freshly built HTML/CSS/JS in my S3 bucket and then invalidating the existing CloudFront cache. Now, GoatCounter should no longer be blocked by uBlock Origin. I tested by loading my site in an incognito browser window and checked that uBlock Origin was no longer blocking anything on my domain. Everything looks good! If you’re using GoatCounter you should consider sponsoring the developer. It’s a great project. The steps, in summary: Create a new CloudFront function resource. Add a second origin to the distribution. Add an ordered cache behavior to the distribution (which references the CloudFront function using its ARN). Update the GoatCounter script to point to this new endpoint. I chose to proxy requests that hit the endpoint on my site to make sure there’s no collision with post titles/slugs. I’ll never use the path for posts.
GoatCounter accepts requests under , not .

Links referenced in this post:
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cloudfront-functions.html
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/DownloadDistS3AndCustomOrigins.html
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/DownloadDistValuesCacheBehavior.html
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html
https://www.goatcounter.com/help/js
https://www.goatcounter.com/help/backend
https://www.goatcounter.com/help/countjs-host
Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?
Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular test set, my model gave a loss of 3.944 -- quite a lot more than the original GPT-2's 3.500 on the same dataset. I wanted to see whether I could train a model on my own hardware (or on something that didn't cost too much to rent in the cloud) that got closer to the original model's performance. So over the last few months, I've done a bunch of further training runs, each one testing a specific intervention -- a stand-alone change that I expected to change the loss, either for better or for worse. Specifically: At the end of all of that, I had this table showing the effect of each intervention in terms of loss on the test set. They're sorted from least-effective to most-effective, and you can see the baseline in there too: Winners and losers are reasonably clear: So, for an optimal train, we'd just use the effective interventions, right? Well, not quite. Full-fat float32 I decided wasn't worth the effort, as it meant that the train took more than twice as long, and (because it required a larger machine) cost more than three times as much. The others did look like solid changes, but there was one concern. The effect of each intervention is actually pretty small. For example, gradient clipping reduced the loss by 0.014, from 3.692 to 3.678. That's a 0.3% improvement. Even the best intervention, scheduling the learning rate, only improved things by 2%. Could it be that some or all of these improvements were not real, but just a result of the random nature of training deep neural networks? Could the differences just be in the noise?
They seemed small enough for that to be possible. I've trained seven more models over the last few days to try to get a feel as to how big an effect noise has for this kind of training run. The results appear to show that variations in the initial weights matter quite a lot, but randomness in the training loop (given the same initial weights) actually has a fairly minimal impact. That surprised me a bit! Let's go through the details. When I did the original baseline training run -- creating the model that was the comparison point for all of the interventions -- I wanted to minimise the amount of random number-induced differences between the training runs in this interventions series. I did this by setting the random seed at the start -- specifically, I had this code: At the time I wrote it, this seemed pretty complete -- the seed is set on Python's own random number generator, on PyTorch's, and on the separate ones it uses for CUDA. However, in a separate project, where I was fine-tuning a Qwen model as a classifier, I'd found that this wasn't enough. In order to get full reproducibility, I'd had to lock things down a bit more, with this additional code: So: was my random number seed code enough for this case? Or would I get a different model if I ran the same code a second time? That was easy enough to do; I spun up a machine, and just ran the "baseline" train again. 3 hours 24 minutes later: Interestingly, that was exactly the same final train loss as the original baseline train. Here's the model . I ran my normal smoke test, asking it to complete "Every effort moves you" ...so that was OK -- the model was generating reasonably coherent text. Then I ran the eval to find its loss on the test set: Exactly the same as the original baseline! That was certainly promising. 
Now, the use of three decimal places for the output from the loss eval is just a formatting thing, so I bumped it up to 6 dps, and the new model got this: Running that against the original baseline model: Again, exactly the same. Finally, more out of idle interest than anything else, I decided to see if the models were at least different: That is, quite frankly, amazing to me. I was expecting pretty close results, but what we're seeing here is that two separate models, trained on the same data, but on different machines more than a month apart, have weights that are bit-wise identical. No random noise at all. That's actually really reassuring! It makes me much more comfortable that we're standing on a stable foundation here. Now it was time to see what effect changing that random seed would have. Let's think about what the random seed does. When we call , we're initialising Python's pseudo-random number generator so that it will start at a particular point -- after we've called it, it will generate the same sequence of "random" numbers each time it's asked for a new one. So the effect of this code: ...is to initialise three separate pseudo-random number generators to be in a known deterministic state, so they'll all generate the same sequence in every run. So, the first thing to do was to see what happened if we changed that number. I decided to do two training runs, each with exactly the same code as the baseline, but with different random seeds. Firstly, I changed it from 42 to 22 1 : That training run completed: Here's the model . Time for the evals; the smoke test: ...and the loss test: So, that's 3.673453 compared to 3.691526, an improvement of 0.018 over the run with a seed of 42. That's more than the 0.014 improvement we got from gradient clipping (and indeed, the 0.013 from full-fat float32 training), and quite close to the 0.023 improvement from adding attention weight bias. Time for another training run: Another 3h24m later: Here's the model . 
The smoke test: ...and the test set loss: A further improvement! That's 0.038 better than our original baseline, which beats adding on attention weight bias (though it's worse than the weight decay update). Now, three data points is rather a small number for any kind of statistical analysis, but just out of interest, let's do the basics. GeeksForGeeks has a good refresher here if you're a bit rusty. Firstly, our mean is ...and our variance 2 is: If we take the square root of that, we get the standard deviation (SD): So, if we assume a normal distribution, what would that say about our results? Here's the results table again. If we assume that the results are on a normal distribution: That seemed a bit saddening -- were all of the results apart from scheduling the learning rate within the noise? Well, so as I said, three data points is too small a number to take those results without a fistful of salt. I was thinking of perhaps trying another few random seeds to see what would happen, and perhaps to tighten those numbers up a bit, but then something occurred to me -- randomness was being used in two different ways in the training run, and perhaps we could separate them? Where do we use the random numbers? Well, immediately after we set the seeds, we create our uninitialised model for training: One of the random number generators -- Python's, PyTorch's, or one of the CUDA ones -- will be used to generate the initial weights that we're going to start training. That means that for the same model setup , we'll always start with exactly the same weights. But if the model settings change such that we initialise different things in a different order, then we'll have different weights. After we've done that, we go into the training loop. That can have randomness in it; although the AdamW optimiser itself is deterministic, we are (in all but one of these training runs) using dropout, which drops a random bunch of activations at various points -- 10% of them with our config. 
And it seems entirely possible that each of the interventions could change the order of execution of different steps in non-obvious ways, which would lead to dropout being applied in different ways in different runs. So, the question was: what kinds of randomness -- in terms of the initial weights, or in terms of the training run -- did each intervention potentially change vs the baseline? Disregarding the full-fat float32 run: Given that, I wanted to get two measures of how sensitive to noise each phase of the training run was: the initialisation of weights at the start, and the training run itself. I decided to start by nailing down exactly what the training run started with. We already had a baseline training run with a specific state of the random number generator at the start; in our "real" baseline, we seeded with 42 at the start, and then initialised our weights. After that, the random number generator would have reached some specific state based on its initial seed and how many numbers had been generated so far. Now, in theory, we could get the RNG into that specific state by seeding it with some number A at that point. We don't know what A is, of course. But it seems vanishingly unlikely that it would be something we'd come up with -- specifically, we can be pretty sure that A ≠ 23 and A ≠ 67 . So, I put the old initial seed of 42 back in, but re-seeded after the model had been initialised: Firstly, with a re-seed value of 23: I let that run.... ...and got this model . Time for the normal evals: Next, I did another training run, the same as the previous one, but with 67 instead of 23 for the re-seed: That one ran: ...producing this model , which eval'ed like this 3 : Let's bring those together: That's a mean of ~3.684462, with a variance of ~0.0000752 and a standard deviation of ~0.008672. Those are tiny compared to the numbers from the two trains we did with the change of the seed prior to the model initialisation. 
That actually surprised me a bit; we're using dropout in all of these training runs, and it's dropping a random 10% of activations in every forward training pass. With our different training run starting seeds, they should be getting very different dropout patterns. Hand-wavingly, perhaps over the three million or so sequences we're training on, it averages out? Still a little counterintuitive, though. Anyway, let's take a look at the intervention results again, this time highlighting the ones that we believe will be starting with the same weights: Using the "99.7% should be within three SDs" heuristic, we get a range of 3.658446 - 3.710478. Of the intervention runs with (I believe) stable weights, only the no-AMP and the gradient clipping ones are within that range. That made me feel quite positive. If my beliefs are correct about which runs have the same weights, then noise in the training runs seems unlikely to be causing the differences -- that is, perhaps the results from the interventions for those same-weight training runs are real signal and not just noise. What would happen if instead of pinning the seed for generating the weights and varying the starting seed for the training run, we varied the weight seed and pinned the training one? We'd already done a training run with a seed of 42 before generating the weights and a re-seed to 23 after that: So I decided to see what would happen if I varied the pre-weights initialisation seed. Let that train: ...getting this model . Evals: Next, one with 67 as the weights initialisation seed: That trained: ...getting this model , and 4 : OK, so here we have: Compared to the SD we got when we varied just the initial seed, 0.0154919, it's not too far off. 
Using the 3-SD rule, we get a range of 3.637030 - 3.709400, and looking at the table again, this time with the ones that we don't expect to have the same weights highlighted: ...we can see that the QKV bias is well within that range (as are all of the interventions apart from the two negative-effect ones and scheduling the learning rate). Right, what does all of that tell us? This post obviously isn't even trying to be statistically rigorous. The number of training runs I've done and the amount of data is way too small for that. However, training runs are expensive (Lambda have raised their prices again, so these cost more than US$50 each!), so there's a limit to how much I can do. But even with the limited amount of data, something seems pretty clear: "One of these things is not like the others". Keeping the model weights stable and only allowing variation in randomness across the training run itself meant that almost all of the differences between training runs disappeared. Could this be a result of the small number of samples? I guess conceivably it might, but it seems vanishingly unlikely. So I feel reasonably confident in saying that the bulk of the variation in results that we can chalk up to random noise in these training runs comes from variations in the model weights' initialisation. Additionally, the first training run in this post -- the re-run of the baseline model with no changes -- gave exactly the same numbers as the original baseline run. So we can be confident that all of the models with no changes to the weight initialisation started with the same weights. Of course, I could be wrong about which models really did have the same weights, but given that they were running the same code with the same seed, I'm pretty much sure. That makes me fairly confident that the intervention runs that had the same initial weights gave a real signal about whether or not the intervention in question actually helped. 
The only exception is gradient clipping, which fell within the three-SD range for the same-weights tests -- and it's essentially free, adding just 100 seconds to a three hour training run. That's a really interesting result! As I said earlier, given that dropout is making us ignore a random 10% of activations during the training run, I would have thought that changing which random 10% were being ignored would have a much larger effect. And that's not even considering other sources of random noise in the training run. I was less surprised that model weight initialisation was important, though. It's pretty obvious that your starting position in the loss landscape is going to affect where you end up at the end of the training run. Still, we now have a reasonable level of trust that our interventions gave a real signal, so I think we have everything in place to see how they stack together, and do a best-effort training run. Can we approach the original GPT-2 small weights' performance on our test set loss? It should be fun to find out :-) Numbers chosen based on a misremembering of this XKCD . For some reason (perhaps because it rhymes) I thought that the old-timey funny number thing was "22 skidoo" rather than "23 skidoo". ↩ On working through this later: with n samples from a dataset, it is (as I understand it) best to use n − 1 as the denominator here (Bessel's correction) for the "sample variance". If we had every possible value, then it would be correct to use n . However, while this changes a few details in the analysis, I don't think it changes the final conclusion of the post meaningfully (it would just bump up the SDs by 22% or so), so I've left it as-is. ↩ I found it interesting that this model does the "you and I" hypercorrection that so many people do when trying to write formally! Based on the (correct) correction of "me and you move back home" to "you and I move back home", I think as a result of excessive pattern-matching. 
↩ Another grammatical error based on pattern-matching -- it would make sense that the possessive form of "it" in English was "it's", just like the possessive form of "John" is "John's". ↩ I trained a baseline model on an 8x A100 40 GiB per GPU machine on Lambda (which was better than my original locally-trained model, I believe due to the larger batch size that the larger machine made possible). I tried adding gradient clipping to see if that would help by limiting the effects of loss spikes. I tried removing dropout , given that these days people tend not to use it (because we're doing single-epoch training runs). I tried adding bias to the attention weight matrices -- something that was popular back in the GPT-2 era, and was used by the original weights, but which my code did not use. Instead of just using the learning rate of 0.0004 that was used in the code from the book, I looked into what values people use these days, and learned how to schedule it over the course of the training run . Similarly, I learned more about weight decay and tried some alternative values. Then I tried making my model more like the original GPT-2 one by introducing weight tying to see if that would help. Finally, I decided to try training in "full-fat" float32 instead of using PyTorch's AMP and TF32 matrix multiplication performance enhancements. Weight tying and the number for weight decay I derived from a paper by Cerebras Research (probably without understanding it properly) were negatives. Full-fat float32, gradient clipping, attention biases, the GPT-2 weight decay parameter, removing dropout, and scheduling (and updating) the learning rate were positives. We would expect ~68.2% of results to be within one SD of the mean -- that is, between 3.6573651 and 3.6883489. Interestingly, our actual baseline result is outside that range! But it does include both the gradient clipping and the QKV bias results. 
We would additionally expect ~95.4% of the results to be within two SDs, which is 3.6418732 to 3.7038408. That includes our baseline and our weight decay result (though not our experiment removing dropout -- the six-DP loss number for that is 3.641282). Finally, we'd expect ~99.7% of results to be within three SDs, which is a range from 3.6263813 to 3.7193327. That covers all of our positive results apart from scheduling learning rate! Gradient clipping: randomness only affected the training run -- the weights it started with would have been exactly the same as the baseline model's. Removing dropout: although this is a parameter on the model, I don't think it changes the initial weights. But in the training run, it certainly does affect randomness by removing its use of the random number generator. Adding bias to the attention weights. This will change both the initial weights -- because we have those bias weights, things will be initialised differently -- and as a result, the training run, as the random number generator will have been sampled a different number of times prior to the run. Changing and scheduling the learning rate certainly should not change the initial weights, but it might conceivably have a non-obvious effect on training. Likewise weight decay; no effect I can see on the initial weights, but it could well change training dynamics. Weight-tying. When I added it to the code , I tried to do so in such a way that the other weights would be unaffected -- I created exactly the same weights as I would without weight tying, then threw away the output head and replaced it with a reference to the input embedding weights. So I think that in theory, this one won't have changed the other model weights (apart from ignoring the initialised-but-thrown-away output head), but it could well have changed the training run. 
Our normal baseline: weights initialised with seed 42, and training run starts with a "seed" of our imaginary A value from above: 3.691526 The first run above: weights initialised with seed 42, and training run starts with a seed of 23: 3.681356 The second run above: weights initialised with seed 42, and training run starts with a seed of 67: 3.680505 The first run above: weights initialised with seed 42, and training run starts with a seed of 23: 3.681356 Mean: ~3.673215 Variance: ~0.000145 SD: ~0.012062 Varying the random seed at the start, prior to initialising weights, and not constraining the starting point for the training runs, gave a mean of 3.672857, with an SD of 0.0154919. Keeping the same seed for model weights (so that they all started with the same weights), and varying the seed for the training run, gave a mean of 3.684462, with an SD of 0.008672. Varying the seed for the model weights (so that they all started with different weights), and keeping the training run seed pinned, gave a mean of 3.673215 and an SD of 0.012062.
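The footnote's point about Bessel's correction is easy to check numerically. A quick sketch with made-up losses (deliberately not the post's values, chosen for round numbers) using Python's statistics module:

```python
import statistics

# Three hypothetical losses, standing in for three training runs
losses = [3.69, 3.67, 3.65]

mean = statistics.fmean(losses)
pop_sd = statistics.pstdev(losses)    # n in the denominator, as the post uses
sample_sd = statistics.stdev(losses)  # n - 1 in the denominator (Bessel)

# With n = 3, Bessel's correction scales the SD by sqrt(3/2), about 1.22 --
# the "22% or so" bump mentioned in the footnote.
ratio = sample_sd / pop_sd
```

For n = 3 the ratio is sqrt(3/2) ≈ 1.2247 regardless of the data, so the choice of denominator shifts every SD in the analysis by the same factor and leaves the relative comparisons intact.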
↩ Another grammatical error based on pattern-matching -- it would make sense that the possessive form of "it" in English was "it's", just like the possessive form of "John" is "John's". ↩
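The variance arithmetic in this post is easy to check. Here's a few lines of Python using the three distinct run losses quoted above; the population-vs-sample distinction is the Bessel's correction point from the footnote, and with only three samples the correction works out to the "22% or so" mentioned there:

```python
import math

def mean_and_sds(xs):
    """Mean, population SD (divide by n), and sample SD (Bessel: n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return mean, math.sqrt(ss / n), math.sqrt(ss / (n - 1))

# Baseline plus the seed-23 and seed-67 runs quoted above.
losses = [3.691526, 3.681356, 3.680505]
mean, pop_sd, sample_sd = mean_and_sds(losses)
# With n = 3, Bessel's correction inflates the SD by a factor of
# sqrt(3 / 2), roughly a 22% bump, consistent with the footnote.
```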
How I use VeraCrypt to keep my data secure
I’ve been using VeraCrypt for encrypted vaults for a while now. I mount and dismount vaults multiple times a day, and typing out the full command each time gets old fast: , , , , . There’s nothing wrong with the CLI, it’s just repetitive, and repetitive is what aliases are for. The GUI exists, but I spend most of my time in a terminal and launching a GUI app to mount a file feels like leaving the house to check if the back door is locked. So I wrote some aliases and functions. They’ve replaced the GUI for me entirely. Before getting into the aliases: VeraCrypt is the right tool for this specific job, but it’s worth being clear about what that job is. I’m encrypting discrete chunks of data stored as container files, not entire drives. If I wanted to encrypt a USB pen drive or an external hard disk, I’d use LUKS instead, which is better suited to full-device encryption on Linux. VeraCrypt’s strength is the container format: a single encrypted file that you can copy anywhere, sync to cloud storage, and open on almost any platform. I format my vaults as exFAT specifically for this: it works on Windows, macOS, Linux, and iOS via Disk Decipher . That cross-platform use case is what makes it worth the extra ceremony. This post covers what I ended up with and why. It’s worth saying upfront: this works for me, for my use case, right now. It doesn’t follow that it’s the right fit for anyone else. LUKS, Cryptomator , and plenty of other tools solve similar problems in different ways, and any of them might be a better fit depending on what you’re trying to do. I’m not attached to this setup permanently either. If something better comes along, or my requirements change, I’ll adapt. 
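To give a flavour of what these helpers look like, here's a minimal sketch. The function names, the mount-point default, and the veracrypt `--text`/`--mount`/`--dismount` invocations are my assumptions about a generic setup, not the exact code from this post; the path-prefix helper shows the trailing-slash trick discussed later:

```shell
# Sketch only: function names, mount points, and defaults are assumptions.

# Check whether path $1 is inside mounted path $2. The trailing slash
# prevents /mnt/vault from matching the sibling /mnt/vault2.
_inside_mount() {
  case "$1/" in
    "$2"/*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Mount a container and cd into it (assumes the veracrypt CLI is installed).
vcmount() {
  container="$1"
  mountpoint="${2:-$HOME/mnt/$(basename "$container" .hc)}"
  mkdir -p "$mountpoint"
  veracrypt --text --mount "$container" "$mountpoint" && cd "$mountpoint"
}

# Step out of the vault before dismounting, so the dismount can't fail
# because the shell's working directory is in use.
vcdismount() {
  mountpoint="$1"
  if _inside_mount "$PWD" "$mountpoint"; then cd "$HOME"; fi
  veracrypt --text --dismount "$mountpoint" && rmdir "$mountpoint"
}
```

The real functions also pause and resume the sync clients around these steps.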
The two simplest aliases are to list what’s currently mounted, and to create new vaults: is a full function because it needs to handle a few things: creating the mount directory, defaulting to the current directory if no path is specified, and (when only one vault is mounted in total) automatically -ing into it so I can get straight to work: The auto-cd only triggers when it’s the sole mounted vault. If I’ve already got other vaults open, it stays out of the way. Both sync clients are paused before mounting to prevent them trying to upload a vault that’s actively being written to — a reliable way to end up with a corrupted or conflicted file. I keep several vault files in the same directory, so was a natural next step: mount all and files in a given directory with a single shared password: The glob qualifier in zsh means the glob returns nothing (rather than erroring) if no files match. Worth knowing if you’re adapting this for bash, where you’d handle the empty case differently. Dismounting is where I hit the most friction. The function handles both single-volume and all-at-once dismounting, and cleans up the mount directories afterwards: The alias just calls with no arguments: dismount everything, clean up the directories. The bit I added most recently is the before dismounting. If I’m working inside a vault and run , the dismount would fail silently because the directory was in use. The fix checks whether is under any of the mounted paths and steps out first. The trailing slash on both sides ( ) avoids the edge case where one vault path is a prefix of another. One more thing that makes this feel native rather than bolted on: tab completion for mounted volumes when running , and completion for / files when using or : One feature worth mentioning, even if I don’t use it daily: VeraCrypt supports hidden volumes . The idea is that you create a second encrypted volume inside the free space of an existing one. 
The outer volume gets a decoy password and some plausible-looking files. The hidden volume gets a separate password and your actual sensitive data. When VeraCrypt mounts, it tries the password you entered against the standard volume header first, then checks whether it matches the hidden volume header. Because VeraCrypt fills all free space with random data during creation, an observer cannot tell whether a hidden volume exists at all. It’s indistinguishable from random noise. In practice: if you’re ever compelled to hand over your password, you hand over the outer volume’s password. Nothing in the file itself proves there’s anything else there. This is what “plausible deniability” means in this context. It’s not a feature most people will ever need, but it exists and it’s well-implemented. My vault files are stored in Dropbox rather than Proton Drive, which I realise sounds odd given that Proton Drive is the more privacy-focused option. The reason is practical: the Proton Drive iOS app fails to sync VeraCrypt vaults reliably. The developer of Disk Decipher (an iOS VeraCrypt client) recently dug into this and was incredibly helpful in tracking down the cause. Looking at the Proton Drive app logs, he found: . The hypothesis is that VeraCrypt creates revisions faster than Proton Drive’s file provider can handle. What makes it worse is that the problem surfaces immediately: just mounting a vault and dismounting it again is enough to trigger the error. That’s a single write operation. There’s no practical workaround on the iOS side. It’s an annoying trade-off. Dropbox has significantly more access to my files at the infrastructure level, but the vault files themselves are encrypted before they ever leave the machine, so what Dropbox sees is opaque either way. For now, it works. I’m keeping an eye on Proton Drive’s iOS progress. Google Drive is an obvious option I haven’t mentioned: that’s intentional. 
I’m actively working on reducing my Google dependency, so it’s not something I’m considering here. Technically, on Linux, you could use rsync to swap Dropbox out for almost any provider. What keeps me on Dropbox for this specific use case is how it handles large files: it chunks them and syncs only the changed parts rather than re-uploading the whole thing. For vault files that can be several gigabytes, that matters. As you’ll have noticed in the code above, and both pause Dropbox and Proton Drive before mounting, and restarts them once the last vault is closed. The sync clients fail silently if they’re not running, so the same code works on machines where neither is installed. Since writing this, the picture has got worse. Mounir Idrassi, VeraCrypt’s developer, posted on Sourceforge confirming what’s actually happening: Microsoft terminated the account used to sign VeraCrypt’s Windows drivers and bootloader. No warning, no explanation, and their message explicitly states no appeal is possible. He tried every contact route and reached only chatbots. The signing certificate on existing VeraCrypt builds is from a 2011 CA that expires in June 2026. Once that expires, Windows will refuse to load the driver, and the driver is required for everything: container mounting, portable mode, full disk encryption. The bootloader situation is worse still, sitting outside the OS and requiring firmware trust. The post landed on Hacker News , where Jason Donenfeld, who maintains WireGuard, posted that the same thing has happened to him: account suspended without warning, currently in a 60-day appeals process. His point was direct: if a critical RCE in WireGuard were being actively exploited right now, he’d have no way to push an update. Microsoft would have his hands entirely tied. This isn’t a one-off. A LibreOffice developer was banned under similar circumstances last year. 
The pattern is open source security tool developers losing distribution rights, without warning, with an appeals process that appears largely decorative. Larger projects may eventually get restored through media pressure. Most won’t have that option. I’m on Linux, so none of this touches me directly. If you’re on Windows and relying on VeraCrypt, “watch it closely” has become genuinely urgent. All of these live in my dotfiles .
Value numbering
Welcome back to compiler land. Today we’re going to talk about value numbering , which is like SSA, but more. Static single assignment (SSA) gives names to values: every expression has a name, and each name corresponds to exactly one expression. It transforms programs like this: where the variable is assigned more than once in the program text, into programs like this: where each assignment to has been replaced with an assignment to a new fresh name. It’s great because it makes clear the differences between the two expressions. Though they textually look similar, they compute different values. The first computes 1 and the second computes 2. In this example, it is not possible to substitute in a variable and re-use the value of , because the s are different. But what if we see two “textually” identical instructions in SSA? That sounds much more promising than non-SSA because the transformation into SSA form has removed (much of) the statefulness of it all. When can we re-use the result? Identifying instructions that are known at compile-time to always produce the same value at run-time is called value numbering . To understand value numbering, let’s extend the above IR snippet with two more instructions, v3 and v4. In this new snippet, v3 looks the same as v1: adding v0 and 1. Assuming our addition operation is some ideal mathematical addition, we can absolutely re-use v1; no need to compute the addition again. We can rewrite the IR to something like: This is kind of similar to the destructive union-find representation that JavaScriptCore and a couple other compilers use, where the optimizer doesn’t eagerly re-write all uses but instead leaves a little breadcrumb / instruction 1 . We could then run our copy propagation pass (“union-find cleanup”?) and get: Great. But how does this happen? How does an optimizer identify reusable instruction candidates that are “textually identical”? Generally, there is no actual text in the IR . 
One popular solution is to compute a hash of each instruction. Then any instructions with the same hash (that also compare equal, in case of collisions) are considered equivalent. This is called hash-consing . When trying to figure all this out, I read through a couple of different implementations. I particularly like the Maxine VM implementation. For example, here is the (hashing) and functions for most binary operations, slightly modified for clarity: The rest of the value numbering implementation assumes that if a function returns 0, it does not wish to be considered for value numbering. Why might an instruction opt-out of value numbering? An instruction might opt out of value numbering if it is not “pure”. Some instructions are not pure. Purity is in the eye of the beholder, but in general it means that an instruction does not interact with the state of the outside world, except for trivial computation on its operands. (What does it mean to de-duplicate/cache/reuse ?) A load from an array object is also not a pure operation 2 . The load operation implicitly relies on the state of the memory. Also, even if the array was known-constant, in some runtime systems, the load might raise an exception. Changing the source location where an exception is raised is generally frowned upon. Languages such as Java often have requirements about where exceptions are raised codified in their specifications. We’ll work only on pure operations for now, but we’ll come back to this later. We do often want to optimize impure operations as well! We’ll start off with the simplest form of value numbering, which operates only on linear sequences of instructions, like basic blocks or traces. Let’s build a small implementation of local value numbering (LVN). We’ll start with straight-line code—no branches or anything tricky. Most compiler optimizations on control-flow graphs (CFGs) iterate over the instructions “top to bottom” 3 and it seems like we can do the same thing here too. 
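Here's what that top-to-bottom pass might look like as a toy sketch over a made-up three-address IR. Everything here (the instruction shape, the pure flag, the names) is invented for illustration; the dictionary keyed on (op, args) plays the role of the hash table, and impure instructions simply opt out:

```python
def local_value_numbering(block):
    """block: list of (dest, op, args, pure) tuples. Returns the optimized
    block plus a dest -> canonical-dest substitution map."""
    canonical = {}   # (op, operand names) -> first dest that computed it
    rewrite = {}     # later dest -> earlier equivalent dest
    out = []
    for dest, op, args, pure in block:
        # Rewrite operands through earlier substitutions (copy propagation).
        args = tuple(rewrite.get(a, a) for a in args)
        key = (op, args)
        if pure and key in canonical:
            rewrite[dest] = canonical[key]      # reuse, don't recompute
            continue
        if pure:
            canonical[key] = dest               # first sighting: remember it
        out.append((dest, op, args, pure))
    return out, rewrite

block = [
    ("v1", "add", ("v0", 1), True),
    ("v2", "add", ("v1", 1), True),
    ("v3", "add", ("v0", 1), True),    # same as v1: eliminated
    ("v4", "add", ("v3", 1), True),    # rewrites to add(v1, 1): same as v2
    ("v5", "load", ("arr", 0), False), # impure: never deduplicated
]
optimized, renames = local_value_numbering(block)
```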
From what we’ve seen so far optimizing our made-up IR snippet, we can do something like this: The find-and-replace, remember, is not a literal find-and-replace, but instead something like: (if you have been following along with the toy optimizer series) This several-line function (as long as you already have a hash map and a union-find available to you) is enough to build local value numbering! And real compilers are built this way, too. If you don’t believe me, take a look at this slightly edited snippet from Maxine’s value numbering implementation. It has all of the components we just talked about: iterating over instructions, map lookup, and some substitution. This alone will get you pretty far. Code generators of all shapes tend to leave messy repeated computations all over their generated code and this will make short work of them. Sometimes, though, your computations are spread across control flow—over multiple basic blocks. What do you do then? Computing value numbers for an entire function is called global value numbering (GVN) and it requires dealing with control flow (if, loops, etc). I don’t just mean that for an entire function, we run local value numbering block-by-block. Global value numbering implies that expressions can be de-duplicated and shared across blocks. Let’s tackle control flow case by case. First is the simple case from above: one block. In this case, we can go top to bottom with our value numbering and do alright. The second case is also reasonable to handle: one block flowing into another. In this case, we can still go top to bottom. We just have to find a way to iterate over the blocks. If we’re not going to share value maps between blocks, the order doesn’t matter. But since the point of global value numbering is to share values, we have to iterate them in topological order (reverse post order (RPO)). This ensures that predecessors get visited before successors. If you have , we have to visit first and then . 
Because of how SSA works and how CFGs work, the second block can “look up” into the first block and use the values from it. To get global value numbering working, we have to copy ’s value map before we start processing so we can re-use the instructions. Maybe something like: Then the expressions can accrue across blocks. can re-use the already-computed from because it is still in the map. …but this breaks as soon as you have control-flow splits. Consider the following shape graph: We’re going to iterate over that graph in one of two orders: A B C or A C B. In either case, we’re going to be adding all this stuff into the value map from one block (say, B) that is not actually available to its sibling block (say, C). When I say “not available”, I mean “would not have been computed before”. This is because we execute either A then B or A then C. There’s no world in which we execute B then C. But alright, look at a third case where there is such a world: a control-flow join. In this diagram, we have two predecessor blocks B and C each flowing into D. In this diagram, B always flows into D and also C always flows into D. So the iterator order is fine, right? Well, still no. We have the same sibling problem as before. B and C still can’t share value maps. We also have a weird question when we enter D: where did we come from? If we came from B, we can re-use expressions from B. If we came from C, we can re-use expressions from C. But we cannot in general know which predecessor block we came from. The only block we know for sure that we executed before D is A. This means we can re-use A’s value map in D because we can guarantee that all execution paths that enter D have previously gone through A. This relationship is called a dominator relationship and this is the key to one style of global value numbering that we’re going to talk about in this post. A block can always use the value map from any other block that dominates it. 
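That dominator rule turns out to be a small extension of the local pass: seed each block's value map with a copy of its immediate dominator's map. Here's a sketch using an invented three-address IR and a diamond-shaped CFG (block structure and the idom table are assumptions for illustration):

```python
def gvn(blocks, idom, rpo):
    """Dominator-based GVN sketch. blocks: name -> [(dest, op, args)];
    idom: name -> immediate dominator (None for entry); rpo: block order."""
    maps = {}     # block name -> value map after processing that block
    rewrite = {}  # dest -> canonical dest
    for name in rpo:
        # A block may reuse anything computed in a dominator: every path
        # reaching it went through that dominator. Siblings share nothing.
        values = dict(maps[idom[name]]) if idom[name] else {}
        for dest, op, args in blocks[name]:
            args = tuple(rewrite.get(a, a) for a in args)
            key = (op, args)
            if key in values:
                rewrite[dest] = values[key]
            else:
                values[key] = dest
        maps[name] = values
    return rewrite

# Diamond: A -> {B, C} -> D. A is the immediate dominator of B, C, and D.
blocks = {
    "A": [("v1", "add", ("v0", 1))],
    "B": [("v2", "add", ("v0", 1))],   # duplicates v1; A dominates B
    "C": [("v3", "mul", ("v0", 2))],
    "D": [("v4", "add", ("v0", 1)),    # duplicates v1; A dominates D
          ("v5", "mul", ("v0", 2))],   # NOT merged with v3: C doesn't dominate D
}
idom = {"A": None, "B": "A", "C": "A", "D": "A"}
renames = gvn(blocks, idom, ["A", "B", "C", "D"])
```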
For completeness’ sake, in the diamond diagram, A dominates each of B and C, too. We can compute dominators a couple of ways 4 , but that’s a little bit out of scope for this blog post. If we assume that we have dominator information available in our CFG, we can use that for global value numbering. And that’s just what—you guessed it—Maxine VM does. It iterates over all blocks in reverse post-order, doing local value numbering, threading through value maps from dominator blocks. In this case, their method gets the immediate dominator : the “closest” dominator block of all the blocks that dominate the current one. And that’s it! That’s the core of Maxine’s GVN implementation . I love how short it is. For not very much code, you can remove a lot of duplicate pure SSA instructions. This does still work with loops, but with some caveats. From p7 of Briggs GVN : The φ-functions require special treatment. Before the compiler can analyze the φ-functions in a block, it must previously have assigned value numbers to all of the inputs. This is not possible in all cases; specifically, any φ-function input whose value flows along a back edge (with respect to the dominator tree) cannot have a value number. If any of the parameters of a φ-function have not been assigned a value number, then the compiler cannot analyze the φ-function, and it must assign a unique, new value number to the result. It also talks about eliminating useless phis, which is optional, but would strengthen the global value numbering pass: it makes more information transparent. But what if we want to handle impure instructions? Languages such as Java allow for reading fields from the / object within methods as if the field were a variable name. This makes code like the following common: Each of these references to and is an implicit reference to or , which is semantically a field load off an object.
You can see it in the bytecode (thanks, Matt Godbolt): When straightforwardly building an SSA IR from the JVM bytecode for this method, you will end up with a bunch of IR that looks like this: Pretty much the same as the bytecode. Even though no code in the middle could modify the field (which would require a re-load), we still have a duplicate load. Bummer. I don’t want to re-hash this too much but it’s possible to fold Load and store forwarding into your GVN implementation by either: See, there’s nothing fundamentally stopping you from tracking the state of your heap at compile-time across blocks. You just have to do a little more bookkeeping. In our dominator-based GVN implementation, for example, you can: Not so bad. Maxine doesn’t do global memory tracking, but they do a limited form of load-store forwarding while building their HIR from bytecode: see GraphBuilder which uses the MemoryMap to help track this stuff. At least they would not have the same duplicate instructions in the example above! We’ve now looked at one kind of value numbering and one implementation of it. What else is out there? Apparently, you can get better results by having a unified hash table (p9 of Briggs GVN ) of expressions, not limiting the value map to dominator-available expressions. Not 100% on how this works yet. They note: Using a unified hash-table has one important algorithmic consequence. Replacements cannot be performed on-line because the table no longer reflects availability. Which is the first time that it occurred to me that hash-based value numbering with dominators was an approximation of available expression analysis. There’s also a totally different kind of value numbering called value partitioning (p12 of Briggs GVN ). See also a nice blog post about this by Allen Wang from the Cornell compiler course . I think this mostly replaces the hashing bit, and you still need some other thing for the available expressions bit. 
Ben Titzer and Seth Goldstein have some good slides from CMU , where they talk about the worklist dataflow approach. Apparently this is slower but gets you more available expressions than just looking to dominator blocks. I wonder how much it differs from dominator+unified hash table. While Maxine uses hash table cloning to copy value maps from dominator blocks, there are also compilers such as Cranelift that use scoped hash maps to track this information more efficiently. (Though Amanieu notes that you may not need a scoped hash map and instead can tag values in your value map with the block they came from, ignoring non-dominating values with a quick check. The dominance check makes sense but I haven’t internalized how this affects the set of available expressions yet.) You may be wondering if this kind of algorithm even helps at all in a dynamic language JIT context. Surely everything is too dynamic, right? Actually, no! The JIT hopes to eliminate a lot of method calls and dynamic behaviors, replacing them with guards, assumptions, and simpler operations. These strength reductions often leave behind a lot of repeated instructions. Just the other day, Kokubun filed a value-numbering-like PR to clean up some of the waste. ART has a recent blog post about speeding up GVN. Go forth and give your values more numbers.
There’s been an ongoing discussion with Phil Zucker on SSI, GVN, acyclic egraphs, and scoped union-find. TODO summarize:
Commutativity; canonicalization
Seeding alternative representations into the GVN
Aegraphs and union-find during GVN
https://github.com/bytecodealliance/rfcs/blob/main/accepted/cranelift-egraph.md
https://github.com/bytecodealliance/wasmtime/issues/9049
https://github.com/bytecodealliance/wasmtime/issues/4371
Writing this post is roughly the time when I realized that the whole time I was wondering why Cinder did not use union-find for rewriting, it actually did!
Optimizing instruction by replacing with followed by copy propagation is equivalent to union-find. ↩
In some forms of SSA, like heap-array SSA or sea of nodes, it’s possible to more easily de-duplicate loads because the memory representation has been folded into (modeled in) the IR. ↩
The order is a little more complicated than that: reverse post-order (RPO). And there’s a paper called “A Simple Algorithm for Global Data Flow Analysis Problems” that I don’t yet have a PDF for that claims that RPO is optimal for solving dataflow problems. ↩
There’s the iterative dataflow way (described in the Cooper paper (PDF)), Lengauer-Tarjan (PDF), the Engineered Algorithm (PDF), hybrid/Semi-NCA approach (PDF), … ↩
The local value numbering pseudocode, for reference: initialize a map from instruction numbers to instruction pointers; for each instruction, if it wants to participate in value numbering and its value number is already in the map, replace all pointers to it in the rest of the program with the corresponding value from the map; otherwise, add it to the map.
The two load-store forwarding options: doing load-store forwarding as part of local value numbering and clearing memory information from the value map at the end of each block, or keeping track of effects across blocks.
The cross-block bookkeeping: track heap write effects for each block; at the start of each block B, union all of the “kill” sets for every block back to its immediate dominator; finally, remove the stuff that got killed from the dominator’s value map.
V8 Hydrogen
Build your own Dial-up ISP with a Raspberry Pi
Last year my aunt let me add her original Tangerine iBook G3 clamshell to my collection of old Macs 1 . It came with an AirPort card—a $99 add-on Apple made that ushered in the Wi-Fi era. The iBook G3 was the first consumer laptop with built-in Wi-Fi antennas, and by far the cheapest way to get a computer onto an 802.11 wireless network.
SQLAlchemy 2 In Practice - Chapter 3 - One-To-Many Relationships
This is the third chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! In the previous chapter you learned how to execute a variety of queries on the table. Interestingly, some of those queries were designed to obtain product manufacturers and not products, and this required duplicates to be removed by grouping the results.
My self-sovereign / local / private / secure LLM setup, April 2026
Bring back MiniDV with this Raspberry Pi FireWire HAT
In my last post, I showed you how to use FireWire on a Raspberry Pi with a PCI Express IEEE 1394 adapter. Now I'll show you how I'm using a new FireWire HAT and a PiSugar3 Plus battery to make a portable MRU, or 'Memory Recording Unit', to replace tape in older FireWire/i.Link/DV cameras. The alternative is an old used MRU like Sony's HVR-MRC1 , which runs around $300 on eBay 1 .
Look Ma, I made a JAR! (Building a connector for Kafka Connect without knowing Java)
As a non-Java coder, for the last ten years I’ve stumbled my way through the JVM-centric world of "big data" (as it was called then), relying on my wits with SQL and config files to just about muddle through. One of the things that drew me to Kafka Connect was that I could build integrations between Kafka and other systems without needing to write Java, and the same again for ksqlDB and Flink SQL—now stream processing was available to mere RDBMS mortals and not just the Java adonises. One thing defeated me though; if a connector didn’t exist for Kafka Connect, then I was stuck. I’d resort to cobbled-together pipelines leaning heavily on kafkacat (now kcat), such as I did in this blog post . I built some cool analytics on top of maritime AIS data about ships' locations, but the foundations were shaky at best: no failure logic, no schema handling, no bueno. What I really needed was a connector for Kafka Connect. However for that, you need Java. I don’t write Java. But Claude can write Java.
SQLAlchemy 2 In Practice - Chapter 2 - Database Tables
This is the second chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! This chapter provides an overview of the most basic usage of the SQLAlchemy library to create, update and query database tables.
Got a thing
~with distinction~
How to Install a Gem
This post was originally given as a talk at SF Ruby Meetup . The slides are also available. Hello, and welcome to How To Install A Gem . My name is André Arko, and I go by @indirect on all the internet services. You might know me from being 1/3 of the team that shipped Bundler 1.0, or perhaps the 10+ years I spent trying to keep RubyGems.org up and running for everyone to use. More recently, I’ve been working on new projects: , a CLI to install Ruby versions and gems at unprecedented speeds, and gem.coop , a community gem server designed from the ground up so Bundler can install gems faster and more securely than ever before. So, with that introduction out of the way, let’s get started: do you know how to install a gem? Okay, that’s great! You can come up and give this talk instead of me. I’ll just sit over here while you write the rest of this post. Slightly more seriously, do you know how converts the name that you give it into a URL to download a .gem file? It’s called the “compact index”, and we’ll see how it works very soon. Next, who in the audience knows how to unpack a .gem file? Do you know what format .gem files use, and what’s inside them? We’ll look at gem structure and gemspec files as well. Then, do you know where to put the files from inside the gem? Where do all of these files and directories get put on disk so we can use them later? Does anyone know off the top of their head? Once those files have been unpacked into the correct places, the last thing we need to know is how to require them. How do these unpacked files on disk get found by Ruby, so you can and have that actually work? This exercise was mostly to show that using gems every day actually skips over most of the way they work underneath. So let’s look at what a gem is, and examine how they work. By the end of this talk, you’ll know what’s inside a gem, how RubyGems figures out what to download, and where and how that download gets installed so you can use it.
And if you already know everything we just talked about, please feel free to go straight to rv.dev and start sending us pull requests! First, we’re going to look at how the name of a gem becomes a URL for a .gem file. Let’s use as our example. Historically, there have been at least five or six different ways to look up information about a gem based on its name, but today there is one canonical way: the compact index. It’s so simple that you can do it yourself using curl. Just run , and you’ll be able to read the exact output that every tool uses to look up the versions of a gem that exist. Each line in the file describes one version of the gem, so let’s look at one line. We can break down that line with , and tackle each part one at a time. First, . That’s the version of that this line is about. So we now know for sure that exists. Next, a list of dependencies. The gem (version ) declares dependencies on a bunch of other gems: , , , , , , , , , , , , and . Each dependency has a version requirement attached, and for almost every gem it is exactly version , and only version . For , Rails is a little bit more flexible, and allows any version and up. The final section contains a checksum, a ruby requirement, and a rubygems requirement. The checksum is a sha256 hash of the .gem file that contains the gem, so after we download the gem we can check to make sure we have the right file by comparing that checksum. For this version of Rails, the required Ruby version is or greater, and the required RubyGems version is or greater. It’s up to the client to do something with that information, but hopefully you’ll see an error if you are using Ruby or RubyGems that’s too old. Great! So now we know the important information: Rails version is real, and strong, and is our friend. We can download it, and check the checksum against the checksum we were given in the info file line.
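Before downloading anything, the line format can be made concrete with a short Ruby sketch. The sample line below is abbreviated and its checksum is fake; the real compact index format is the version, then comma-separated name:requirement dependencies, then a pipe and comma-separated metadata:

```ruby
# Sketch of parsing one compact-index info line:
#   VERSION DEP:REQ,DEP:REQ,...|checksum:...,ruby:...,rubygems:...
def parse_info_line(line)
  version_and_deps, metadata = line.split("|", 2)
  version, deps = version_and_deps.split(" ", 2)
  dependencies = (deps || "").split(",").map { |dep| dep.split(":", 2) }
  requirements = metadata.split(",").map { |req| req.split(":", 2) }.to_h
  { version: version, dependencies: dependencies, requirements: requirements }
end

# Abbreviated sample line with a fake checksum.
line = "8.0.0 actionpack:= 8.0.0,activesupport:= 8.0.0" \
       "|checksum:abc123,ruby:>= 3.2.0,rubygems:>= 0"
info = parse_info_line(line)
```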
Let’s do that now: Notice that the checksum produced by exactly matches the checksum we previously saw in our line from the info file: . That lets us know that we got the right file, and there were no network or disk errors. Now that we have the gem, we can investigate: what exactly is inside a gem? At this point, we’re going to pivot from the gem to the gem. There’s a good reason for that, and the reason is… the gem doesn’t actually have any files in it. So it’s a bad example. In order to show off what a gem looks like when it has files in it, we’ll use instead. So, we have our .gem file downloaded with curl. What do we do now? The first piece of secret knowledge that we need: gems are tarballs. That means we can open them up with regular old . Let’s try it. So what’s inside the .gem tarball is… another tarball. And also two gzipped files. Let’s look at the files first. As you might expect from its name, the file is a gzipped YAML file, containing checksums for the other two files. It’s maybe a bit silly to have multiple layers of checksumming here, but it does confirm that the outer layer of tarball and zip was removed without any errors. Okay, so what’s inside ? The answer is… Ruby, sort of. It’s a YAML-serialized instance of the class. We can see exactly what was put into this object at the time the gem was built. After snipping out the YAML that lists the dependencies (which we already looked at, because they are included in the info file), what’s left is some relatively simple information about the gem. Author, author’s email, description, homepage, license, various URLs. For the purposes of installing and using the gem, we care about exactly six pieces of information: , , , , , and . We’re going to combine those items with the files in the remaining file to get our unpacked and installed gem. Now that we know what’s in the gem specification, let’s look at what’s inside the data tarball. 
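The nesting described here is easy to see for yourself. Rather than downloading a real gem, this sketch builds a toy .gem with the same three members (the member names are the real ones from the gem format) and lists it back:

```shell
# A .gem is a plain tar archive holding two gzipped members plus an inner
# tarball of the gem's files. Build a toy one to see the structure:
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p lib
echo 'puts "hi from demo"' > lib/demo.rb
tar -czf data.tar.gz lib                                   # the gem's files
printf 'name: demo\nversion: 0.0.1\n' | gzip > metadata.gz # gemspec as YAML
printf '{}\n' | gzip > checksums.yaml.gz                   # inner checksums
tar -cf demo-0.0.1.gem metadata.gz data.tar.gz checksums.yaml.gz
tar -tf demo-0.0.1.gem        # lists the three members, like a real gem
gunzip -c metadata.gz         # plain YAML once gunzipped
```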
It matches up very closely with the long list of entries in the array in the gemspec. So now we have a bunch of files. Where are we going to put these files? Enter: the magic of RubyGems. The scheme that RubyGems has come up with is largely shaped by the constraints of how Ruby finds files to require, which we’re going to look at soon. For now, it is enough for us to know that RubyGems keeps track of a list of directories, a lot like the way works for your shell to find commands to run. To find the current list of directories, you can run . Here’s what that looks like: From this list, we can see that RubyGems organizes its own files into a few directories. To install a gem, we’re going to need to put the files we have into each of those directories, with specific paths and filenames. Just to recap, the files we need to place somewhere are: So let’s move the files into the directories RubyGems offers. First, cache the .gem file so RubyGems doesn’t need to download it again later: Then, add the gem specification so that RubyGems will be able to find it. There’s a small twist here, which is that the directory doesn’t contain YAML files, it contains Ruby files. So we also need to convert the YAML file back into a Ruby object, and then write out the Ruby code to create that object into a file that RubyGems can load later. Next, we need to put the files that make up the contents of the gem into the directory. One more thing we need to do: set up the executables provided by the gem. You can check out the files that RubyGems generates by looking in , but for our purposes we just need to tell RubyGems what gem and executable it needs to run, so we can do that: And with that, we’ve installed the gem! You can run the file that we just created to prove it: As we wrap up here, there are three aspects of gems that we haven’t touched on at all: docs, extensions, and plugins. We don’t have time to talk about them today in this meetup talk slot.
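As a sketch of that layout, here is how the install locations can be derived from a gem home directory. The helper name is made up for this example (in real code the base directory would come from RubyGems itself, e.g. `Gem.paths.home`), and the gem name/version are just this talk's example gem:

```ruby
# Derive the install locations RubyGems uses inside a gem home directory.
# (In real code the base would come from Gem.paths.home.)
def install_paths(gem_home, name, version)
  full_name = "#{name}-#{version}"
  {
    cache: File.join(gem_home, "cache", "#{full_name}.gem"),
    spec:  File.join(gem_home, "specifications", "#{full_name}.gemspec"),
    gem:   File.join(gem_home, "gems", full_name),
    bin:   File.join(gem_home, "bin"),
  }
end

paths = install_paths("/tmp/gem_home", "railties", "8.1.3")
puts paths[:cache]   # "/tmp/gem_home/cache/railties-8.1.3.gem"
```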
Hopefully a future (longer) version of this talk will have space to include all of those things, because they are all super interesting, I promise. In the meantime, I will have to direct you to the docs for RDoc to learn more about docs, to the source code of or RubyGems itself if you want to learn more about gem extensions and plugins. There’s one last thing to figure out before we wrap up: how does find a gem for us to be able to use it? To explain that, we’ll have to drop down to some basic Ruby, and then look at the ways that RubyGems monkeypatches Ruby’s basic to make it possible to have gems with versions. The first thing to know about is that it works exactly like does in your shell. There’s a global Ruby variable named , and it’s an array of paths on disk. When you try to require something, Ruby goes and looks inside each of those paths to see if the thing you asked for is there. You can test this out for yourself in just a few seconds! Let’s try it. The Ruby CLI flag lets you add directories to the variable, and then the function looks inside that directory to find a file with the name that you gave to require. No magic, just a list to check against for files on disk. Now that you understand how the variable makes work, how does RubyGems work? You can’t just put ten different versions of into the and expect to still work. RubyGems handles multiple versions of the same file by monkeypatching . Let’s look at what happens when we , which is a file located inside the gem that we just installed. RubyGems starts by looking at all of the gem specifications, including the one we saved earlier. In each specification, it combines the name and version with the values in to come up with a path on disk. So for our just-installed gem, that would mean a path of: . RubyGems knows that directory contains a file named , so it is a candidate to be “activated”, which is what RubyGems calls it when a gem is added to your . 
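You can watch the load-path lookup happen in a few lines of Ruby — the file and constant names here are made up for the demo:

```ruby
require "tmpdir"

Dir.mktmpdir do |dir|
  # Create a library file in a directory Ruby knows nothing about yet.
  File.write(File.join(dir, "hello_demo.rb"), "HELLO_DEMO = :loaded\n")

  # Adding the directory to $LOAD_PATH is all it takes for require to
  # find it (the -I CLI flag does exactly this).
  $LOAD_PATH.unshift(dir)
  require "hello_demo"

  puts HELLO_DEMO   # prints "loaded"
end
```

Before the `unshift`, the same `require "hello_demo"` would raise a `LoadError` — the file exists, but it isn't in any directory on the list.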
As long as internal bookkeeping shows that no other versions of have already been added to the , we’re good! RubyGems adds this specific directory to the , and delegates to the original implementation of . Require finds the file at , reads it, and evaluates it. With that, we’ve done it! We have found, downloaded, unpacked, and installed a gem so that Ruby is able to run a command and load Ruby files, without ever touching the command. If you’re interested in contributing to an open source project that works a lot with gems, we would love to work with you on , where we are working to create the fastest Ruby and gem manager in the world. And of course, if your company could use faster, easier, or more secure gems for developers, for CI, and for production deployments, we can help. We’d love to talk to you and you can find our contact information at spinel.coop .

- railties-8.1.3.gem (the .gem file itself)
- metadata.gz (the YAML Gem::Specification object from inside the gem)
- the unpacked data.tar.gz files (the contents of the gem)
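The activation step described above can be sketched as a tiny function: given the gem home, name, and version, compute the gem's lib directory and prepend it to `$LOAD_PATH`. This is a toy — real RubyGems also tracks activated specifications so it can detect version conflicts:

```ruby
# A toy version of gem "activation": make one specific gem version's lib
# directory visible to require by prepending it to $LOAD_PATH.
def activate_gem(gem_home, name, version)
  lib_dir = File.join(gem_home, "gems", "#{name}-#{version}", "lib")
  $LOAD_PATH.unshift(lib_dir) unless $LOAD_PATH.include?(lib_dir)
  lib_dir
end

lib = activate_gem("/tmp/gem_home", "railties", "8.1.3")
puts lib   # "/tmp/gem_home/gems/railties-8.1.3/lib"
```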
Writing an LLM from scratch, part 32g -- Interventions: weight tying
In Sebastian Raschka's book "Build a Large Language Model (from Scratch)", he writes that weight tying, while it reduces the parameter count of a model, in his experience makes it worse. As such, apparently people don't use it in modern LLMs. Intuitively, that makes sense -- I'll explain why in this post. But as I'm trying various interventions to see if I can get my model -- based on Raschka's code, but trained for a fraction of the time that the original GPT-2 model was -- to perform as well as the original in terms of the loss it gets on a test set, I thought it would be worth seeing if it really is a negative for this particular tiny model of 163M parameters. After all, the original weights use weight tying, and I did find that QKV bias appeared to help -- and that's another old-school technique that they used, which has since dropped out of fashion. Might this one help too? Worth a try! Let's give it a go. I'll start with a quick refresher on what weight tying is, and how it works. This is really targeted at people who've been reading along with this series -- if it's all new to you, you might find my post on Maths for LLMs a useful catch-up guide first. In our LLM code, right at the start, we use an embedding layer to take our input token IDs, and turn them into embeddings -- each token becomes a vector in a high-dimensional space (768 in our case), which we see as representing in some manner the "meaning" of the token. A useful way to think about that is that we could start with a one-hot vector for the token -- that is, with our 50,257-token vocabulary, it would be 50,257 items long, and have zeros in every position apart from the position corresponding to the token's ID. We'll treat that as being a vector in a "vocab space".
The process of converting the token into an embedding turns out to be equivalent to multiplying that vocab space representation by an embedding matrix -- one with one row per possible token, the values in that row being the values for the appropriate embedding. 1 Because matrix multiplications can be seen as projections between different spaces, we can see that as a projection from our vocab space to the embedding space. Once we've projected our sequence of tokens into a sequence of embeddings, we do all of the steps required for the LLM -- we add in positional information, run it through the Transformer layers, normalise it, and then we have a new sequence of embeddings. The embedding at position n in that output sequence, if our model is working well, should be something that represents an appropriate next-token prediction for the portion of the input sequence from zero to position n . What we want as our final output is to map that back to the vocab space. We want logits: a list of numbers that (after being run through softmax) will represent the probability that our next token is a particular one. Just as we mapped from vocab space to embedding space with (conceptually) a matrix multiplication at the start of the process, we can map back with another one. More specifically, if we treat the embedding matrix as having the same number of rows as there are tokens in the vocabulary (which we'll call d vocab ) and columns as there are embedding dimensions ( d emb ), then the original vocab-space-to-embedding-space matrix will have this shape: So it's projecting from a d vocab -dimensional space to a d emb -dimensional one. Similarly, our matrix to do the projection at the end is just a matrix with the numbers of rows and columns swapped around: ...to do a projection in the other direction. The trick with weight tying is to see that these two projections can potentially be just the opposite of each other.
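Here's a quick numeric check of that equivalence, using NumPy with tiny stand-in sizes (a 10-token vocab and 4-dimensional embeddings instead of the real 50,257 × 768):

```python
import numpy as np

d_vocab, d_emb = 10, 4                         # tiny stand-ins for 50,257 and 768
rng = np.random.default_rng(0)
W_emb = rng.standard_normal((d_vocab, d_emb))  # one row per token

token_id = 3
one_hot = np.zeros(d_vocab)
one_hot[token_id] = 1.0

# Multiplying the one-hot vector by the matrix selects exactly row token_id,
# which is why the embedding "lookup" is equivalent to this projection.
assert np.allclose(one_hot @ W_emb, W_emb[token_id])
```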
If we assume that the embedding space on the way in to the LLM is essentially the same as the embedding space on the way out, then we can use one projection to go into it from vocab space, and the opposite to go back. The "opposite" in this case is the transpose -- that is, if we use W emb for our embedding matrix and W out for the output one, we have: That means we can re-use all of the embedding parameters for the output projection matrix, and fewer parameters means not only a smaller model, but hopefully faster training. Sounds like a win! But of course, there's no such thing as a free lunch. By constraining the output head to be the transpose of the input one, we're essentially enforcing that assumption above: we're saying that the embedding space on the way out must be the same as the embedding space on the way in. That limits what the LLM can do -- if it were able to use different embedding spaces at each end, it would have more flexibility, which might help it learn to model things better. That's the theory: what does it mean in practice? Let's take a quick look at the GPT-2 code -- just the for the top level class: For our embedding layer, we use PyTorch's class, and for the output head we use . Now, provides us with access to the underlying matrix with a field: (Tensor) -- the learnable weights of the module of shape ( , ) initialized from 𝒩 ( 0 , 1 ) . So, that's exactly the d vocab × d emb matrix that we'd expect -- it's the input dimension as the rows, and the output dimension as the columns. If we look at , we see something very similar: weight (torch.Tensor) – the learnable weights of the module of shape ( , ) The values are initialized from 𝒰 ( − √k , √k ) where k = 1 / in_features That's actually the other way around, output dimension as the rows and input as the columns. If you're wondering why, remember that we transpose the weights matrix for a neural network before using it.
But that's actually really convenient in our situation, because if we want to use the same weights for both, they're already "compatible"! And that means that adding weight tying to our code above is as simple as adding two lines at the end: For the model code, it literally is just that! There is a tiny inefficiency in that PyTorch is going to spend a bit of time initialising the weights in to appropriately-sized random values, only to have them all replaced -- but that actually works in our favour, because it means that we'll use up the same amount of the random number stream when creating the LLM in both the weight-tying and non-weight-tying cases, which is a bit better for reproducibility. There is one other change needed, though. I ran a test train with that code, and checkpointing failed like this: Safetensors doesn't like it when you reuse weights like we're doing here. The good news is that the help page the error links to is exactly about this problem with weight tying, and the suggested fix -- to replace ...and similarly for loading -- appears to work fine. Saving and loading checkpoints works, and it's compatible with the old checkpoint files too. So that's good news :-) So, that's how we code it. How much actual saving do we get in terms of the parameter count by doing this? A quick-and-easy way to count the parameters is just to create an instance of the model and see: So, we've gone from a 163M-parameter model to a 124M-parameter one. That's certainly quite some saving -- 38,597,376 fewer parameters, which is a reduction of almost a quarter. We can also sanity check the size of that saving -- our output head was, as we know, a d emb × d vocab matrix, so it should have 50257 × 768 parameters -- which is, indeed, 38,597,376. Excellent. Now, there's one thing we should consider here. We're training on a Chinchilla-optimal number of tokens, 20x our parameter count. Is that what we want to keep stable? 
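Here's a minimal, self-contained sketch of the tying trick with toy sizes (this is not the post's actual model class). `nn.Embedding`'s weight has shape (d_vocab, d_emb), and `nn.Linear(d_emb, d_vocab)` stores its weight as (out_features, in_features) = (d_vocab, d_emb), so the two tensors can be shared directly:

```python
import torch
import torch.nn as nn

d_vocab, d_emb = 100, 16   # toy sizes standing in for 50,257 and 768

tok_emb = nn.Embedding(d_vocab, d_emb)             # weight: (d_vocab, d_emb)
out_head = nn.Linear(d_emb, d_vocab, bias=False)   # weight: (d_vocab, d_emb) too

# Weight tying: both layers now share a single parameter tensor.
out_head.weight = tok_emb.weight

tokens = torch.randint(0, d_vocab, (1, 5))
logits = out_head(tok_emb(tokens))   # shape: (batch=1, seq=5, d_vocab)

assert out_head.weight is tok_emb.weight
assert logits.shape == (1, 5, d_vocab)
```

After the assignment, a gradient step on either layer updates the shared tensor, which is where the parameter saving comes from.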
Or is the total number of training tokens the important bit, so we wind up technically overtraining? My instinct is that the total training tokens is the important thing. Chinchilla optimality is a training heuristic rather than a true aspect of the model, so sticking with it would mean that we're training a model with fewer parameters on less data. It seems very unlikely that would do anything other than produce a worse model! So: we'll keep the same number of training tokens, and just introduce weight tying. How does it train? I kicked it off on the usual 8x A100 40 GiB machine, and after a little while I checked the loss chart. It looked like this: Yikes! It started off with a loss of about 460. Normally, we start with a loss of about 11. The normal loss makes a lot of sense. If you consider it in terms of perplexity, that value of 11 comes out at e^11 ≈ 59,874 -- that is, the model is giving pretty much equal probabilities to every one of the 50,257 possible tokens. A loss of 460 means that the model is making incorrect predictions and is very certain about them. How could that be? Well, let's look at the documentation again. (Tensor) -- the learnable weights of the module of shape ( , ) initialized from 𝒩 ( 0 , 1 ) . weight (torch.Tensor) – the learnable weights of the module of shape ( , ) The values are initialized from 𝒰 ( − √k , √k ) where k = 1 / in_features They're initialised completely differently. Embeddings are set to values in a normal distribution (that is, a Gaussian bell curve) with a mean of 0 and a standard deviation of 1. But linear layers are set to random values in a uniform distribution (that is, a completely flat one) within a range based on the number of input features. In particular, those numbers for the linear layer are really small!
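To put numbers on that mismatch, here's a small sketch using the post's d_emb = 768 (per the PyTorch docs, the Linear init bound is √k with k = 1/in_features):

```python
import math
import random

in_features = 768

# nn.Linear default init: U(-sqrt(k), sqrt(k)) with k = 1/in_features
bound = math.sqrt(1 / in_features)
print(f"Linear init bound: ±{bound:.4f}")   # ±0.0361

# nn.Embedding default init: N(0, 1). Almost every draw lands far outside
# the range a Linear layer of this size would normally start in.
random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(10_000)]
outside = sum(abs(d) > bound for d in draws) / len(draws)
print(f"N(0,1) draws outside the Linear range: {outside:.0%}")   # ~97%
```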
Our output head has set to 768, so that means that the k would be: So instead of getting that kind of "ideal" linear layer initialisation within the range ( − 0.0360 , 0.0360 ) , we're getting numbers which roughly 2/3 of the time will be in the range ( − 1 , 1 ) , and the rest of the time will be even further from zero -- we could be getting -3 or +4, or potentially even crazier numbers! That means that the output logits (coming from a linear layer with higher weights) will be larger, which in turn will push softmax to come up with higher probabilities: I considered changing things to initialise the weights differently, but given that the loss had fallen to 8 or so by the second checkpoint, I decided to just let the run complete. Here's the final loss chart, with the Y axis fixed to run from 0 to 12: That's a nice smooth curve, at least! The output is: Timing-wise, that's about 180 seconds faster than our baseline model training run, only a 1.5% speedup -- clearly the lower number of parameters doesn't actually save us much time. Loss-wise, the final train loss on the baseline model was 3.743, so that's not particularly promising. Still, the proof is, as ever, in the evals. Smoke test first: Borderline coherent, but maybe worse than normal? Let's see what our test set loss looks like. That's bad -- let's see it in our comparison table: Our worst model so far :-( Weight tying certainly didn't help our train. It is worth noting that the GPT-2 small weights -- which do use it -- got 3.500 on the same test set as we're using for that table, so it is possible to get a better model with weight tying. But there was clearly something different about their train, and my suspicion, as I've said before, is that it was trained for many more epochs ( I estimated 40 ), slowly grinding that loss down. But what I'm trying to do in this mini-series of interventions is find tricks that will allow us to approach the original weights' loss without a very long training run. 
And for the purposes of that, I think we can safely say that weight-tying is not one of those. Next time around, our last intervention test! What happens if we switch off the use of automatic mixed precision (AMP)? That is something I added right back at the start as a performance enhancement; it means that PyTorch can do certain calculations in 16-bit rather than 32-bit if it thinks there's no harm in doing so. Might we get better loss by training without it? In reality we don't multiply a one-hot vector by a matrix, as that would be extremely inefficient -- PyTorch just does a lookup into the embedding matrix. If we get token ID 1234, then it just reads out the contents of row 1234, and that's our embedding. But for the purposes of this post, it's best to see that as more of a (extremely effective) performance tweak rather than what's happening conceptually. ↩
Using FireWire on a Raspberry Pi
After learning Apple killed off FireWire (IEEE 1394) support in macOS 26 Tahoe, I started looking at alternatives for old FireWire equipment like hard drives, DV cameras, and A/V gear. I own an old Canon GL1 camera, with a 'DV' port. I could plug that into an old Mac (like the dual G4 MDD above) with FireWire—or even a modern Mac running macOS < 26, with some dongles—and transfer digital video footage between the camera and an application like Final Cut Pro.