Line-based Lisp Editing
Not all environments have Lisp-aware structural editing. Some are only line-oriented. How does one go about editing Lisp line-by-line?
This post collects some notes on using LaTeX to render mathematical documents and formulae, mostly focused on a Linux machine. For background, I typically use LaTeX for one of two (related) purposes: rendering math for my blog posts, which are usually written using reStructuredText and sometimes include diagrams generated with TikZ, and writing personal (unpublished) notes on math-y subjects entirely in LaTeX. The latter are typically short (10-20 pages), single-subject booklets. I don't currently use LaTeX for either precise typesetting or for authoring very large, book-sized documents.

For day-to-day authoring, I find TeXstudio to be excellent. It has everything I need for local editing with a convenient preview window. I really like that TeXstudio doesn't hide the fact that it's just a graphical veneer on top of command-line LaTeX tooling, and lets you examine what it's doing through logs. Web-based solutions like Overleaf also exist; I can see myself using one, especially when collaborating with others or authoring LaTeX from a diverse set of computers and OSes, but for local editing of git-backed text files, TeXstudio is great.

pandoc is very capable at converting documents from LaTeX to other formats. Recently I find it easier to write math-heavy blog posts in LaTeX and then convert them to reStructuredText with pandoc; for example, the recent post on Hilbert spaces was written this way and then converted with a single pandoc invocation. The resulting reStructuredText is very readable and requires very little tweaking before final publishing. pandoc supports many formats, so if you use Markdown or something else, it should work similarly well.

A useful feature of LaTeX tooling is the ability to render a specific formula in standalone mode to an image. We can write the formula into its own file; call it standaloneformula.tex. In case you were wondering, the formula used here is the Gaussian integral.
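A minimal such file might look like the sketch below (my reconstruction, using the standalone document class, which crops the output tightly around the content):

```latex
% standaloneformula.tex -- a minimal sketch; the standalone class crops the
% page tightly around the formula, which is handy for converting it to an image.
\documentclass[border=2pt]{standalone}
\usepackage{amsmath}
\begin{document}
$\displaystyle \int_{-\infty}^{\infty} e^{-x^2}\, dx = \sqrt{\pi}$
\end{document}
```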
Once we have that standalone .tex file, there are a number of things we can do. First, the texlive package should be installed, e.g. with apt. Now we can run the tools from texlive, for example pdflatex, which creates a PDF file that's useful for previews. To convert the .tex file to an image in SVG format, we use a two-step process through DVI and dvisvgm; if we want a PNG file instead of SVG, dvipng does the same job.

The latexmk tool can rebuild a .tex file into a PDF whenever the input file changes, so running it in watch mode and opening the PDF in a separate window, we can observe live refreshes of edits without having to recompile explicitly. While useful in some scenarios, I find that TeXstudio already does this well.

The same tooling flow works for TikZ diagrams! A standalone LaTeX document containing a single tikzpicture element can also be rendered to an SVG or PNG using the same exact commands.

If you'd rather not install all these tools directly but use Docker instead, the texlive image can be used to do the same things, with the same invocations run through docker.

When a formula like \frac{n+1}{n^2-1} is embedded in text, it should be aligned properly to look good with the surrounding text. The information required to do this is emitted by tools like dvisvgm and dvipng: note the height=..., depth=... line in their output. The height is the total height of the formula, and the depth is its height below the "baseline" (how much it should stick down from the line). In my blog, these two are translated to attributes on the image element embedding the SVG: the height becomes style="height: ..." and the depth becomes vertical-align: ... .
Setting a wallpaper in Sway with swaybg is easy. Unfortunately, there is no way to set a random wallpaper automatically out of the box, so here is a little helper script to do that. The script is based on a post from Sylvain Durand (https://sylvaindurand.org/dynamic-wallpapers-with-sway/) with some slight modifications; I just linked the script in my sway config instead of setting a background there. The script spawns a new swaybg instance, changes the wallpaper, and kills the old instance, so there is no flickering of the background when changing. An always up-to-date version can be found in my dotfiles.
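The script itself lives in the linked dotfiles; as a rough sketch of the same spawn-new-then-kill-old idea (in Python rather than shell, with a made-up wallpaper directory and interval), it could look like this:

```python
#!/usr/bin/env python3
"""Rough sketch of the wallpaper-rotation idea: start a new swaybg for a random
image, then kill the previous instance so the background never flickers."""
import random
import subprocess
import time
from pathlib import Path

WALLPAPER_DIR = Path.home() / "pictures/wallpapers"  # made-up location
INTERVAL = 300  # seconds between wallpaper changes

old_instance = None
while True:
    image = random.choice(sorted(WALLPAPER_DIR.glob("*.jpg")))
    # Spawn the new swaybg instance first...
    new_instance = subprocess.Popen(["swaybg", "-i", str(image), "-m", "fill"])
    time.sleep(1)  # ...give it a moment to draw...
    if old_instance is not None:
        old_instance.terminate()  # ...and only then kill the old one.
    old_instance = new_instance
    time.sleep(INTERVAL)
```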
Like all developers, I've been using git since the dawn of time, back when its commands were an inscrutable jumble of ill-fitting incantations, and it has remained this way until today. Needless to say, I just don't get git. I never got it, even though I've read a bunch of stuff on how it represents things internally. I've been using it for years knowing what a few commands do, and whenever it gets into a weird state because I fat-fingered something, I have my trusty alias that deletes the .git directory, clones the repo again into a temp folder, and moves the .git directory from that into my directory, and I've managed to eke out a living for my family this way.

Over the past few years, I've been seeing people rave about Jujutsu, and I always wanted to try it, but it never seemed worth the trouble, even though I hate how hard git makes some things. I idly read a few tutorials, trying to understand how it works, but in the end I decided it wasn't for me. One day I randomly decided to try again, but this time I asked Claude how to do with Jujutsu whatever operation I wanted to do with git. That's when the mental model of jj clicked for me, and I finally understood everything, including how git works. I never thought a VCS would spark joy in me, but here we are, and I figured maybe I can write something that will make jj click for you as well. It also doesn't hurt that Jujutsu is completely interoperable with git (and thus with providers like GitHub), and I can have all the power of Jujutsu locally on my git repos, without anyone knowing I'm not actually using git.

The problem I had with the other tutorials, without realizing it, was that there was a fundamental tension between two basic things: the best way to explain jj to someone who knows git is to use all the git terms they already know (because that makes it easy for them), but also to tell them to think about the git terms they know differently (because otherwise they'll form the wrong mental model). You can't really explain something by saying "a jj commit is like a git commit, except where it's not", so I'll try to do things a bit differently. This will be a short post (or, at least, not as long as other jj tutorials): I'll explain the high-level mental model you should have, and then give a FAQ for how to do various git things with jj.

Just a disclaimer before we start: this is going to be far from an exhaustive reference. I'm not an expert in either git or Jujutsu, but I know enough to hopefully make jj click for you enough to learn the rest on your own, so don't be too annoyed if I omit something. Also, you're going to read here some things about the way Jujutsu likes doing things that will offend you to your very core, and your first reaction will be "madness, this cannot possibly work". When you think this, I want you to relax; it's fine, it does work, it just means I haven't managed to make the whole thing click together for you yet. Just read on. I'm not going to show you any Jujutsu commands here. I might refer to them by name, but I want you to understand the mental model enough to go look stuff up on your own; Jujutsu only has, like, three commands you're going to use for everything anyway (yes, you can do everything you do with git with them). (By the way, if you're going to be trying things out while reading this post, definitely get jjui, it lets you visually work with the repository in a way that makes everything much easier to understand.)
First of all, all the basic git things you're already familiar with are there in jj: commits, branches, operations on those. All those things carry over, with some small differences. The main difference is in the general way the two work: jj simplifies git's model a lot by getting rid of some inconsistencies, and makes it much easier to understand what's going on "under the hood", because the "under the hood" is now so much smaller and simpler that it can just be over the hood.

The mental model that you probably have with git is something like an assembly line. You take a bunch of components, you form them into a widget, you put the widget into a box, you write "General bug fixes" onto the box, seal it, and send it off, never to be seen again by anyone. That's what git thinks of as a commit. You have some work that is The Thing You're Working On Now, and then at some point that's kind of done, you select which pieces of that work you want to immortalize, and you commit them, freezing them in time forever from then on. (I know you can edit commits, but this is largely git's mental model: commits are immutable.)

Jujutsu, in contrast, is more like playing with Play-Doh. You take a lump, cut it into two, shape one piece into something, give it a name, change your mind, give it another name, take a bit of the second piece and stick it on the first piece, and generally go back and forth all around your play area, making changes. Jujutsu wants you to be able to go back to an old commit, change it (gasp!), go to another branch (three commits back from that HEAD), change that commit too, move whole branches of your tree to other parts of it, whatever you want. Your worktree in Jujutsu is a free-for-all where you can rearrange things as you like.

Basically, in git, you manipulate the code, put it in a commit, and you're largely done. In Jujutsu, the commits themselves are also the object of manipulation. This isn't the most natural workflow in git, as git makes it much harder than jj does, but maybe this is the workflow you already have in git (with extensive squashing/rebasing/amending). In that case, grasping the Jujutsu workflow will probably be easier, and will make things easier for you. Yes yes, nobody wants their commits changing from under them; that's why Jujutsu doesn't let you easily change commits that have been pushed to a remote, you can relax now.

However, if you spend a moment thinking about what I said above, you'll probably realize that a few things need to be different from git for this to work (and they are). Indeed, Jujutsu commits are mutable (until you push them). Right now you're thinking of commits as something that can't change, but this is one of the things you need to accept. You can (and will) go back to a previous commit (that you haven't yet pushed) to fix a bug in it that you just hit, and it's as simple as checking out (jj calls it editing) that commit and making the change. You don't have to commit again! Jujutsu does whatever it needs to do under the hood whenever you run a jj command; to you it just looks like your edits are automatically persisted in the commit, in real time. To clarify, Jujutsu doesn't create new commits while this goes on, you just see one "open" commit that you keep making changes to your code in.

Indeed, there is no staging area like git has. git splits code to either be in the repo (in a commit), or outside it (staged/unstaged). Jujutsu doesn't have that, you are always in a commit.
This is important: In git, you're outside a commit until you create one. In Jujutsu, you are always inside a commit. Nothing is ever outside a commit; "outside a commit" isn't a thing in Jujutsu. Even the very commit command in Jujutsu is an alias that adds a message to the commit you're on, and then creates a new (empty) one that you'll now be working on. Even when you create a new repo, you start in a commit. This is the most important difference between jj and git, and the one thing you should think a bit about, as it enables many really interesting workflows.

Always being in a commit means that yes, you will have commits that are half-finished work. Maybe lots of them! I usually indicate this in the commit message, to remind myself. You are impressively perceptive for a hypothetical straw man in whose mouth I'm putting words. Exactly, commits might not have a commit message. They start out blank, and you can add a commit message at any point, whenever you have an idea of what that commit will do. It might be when you start working on it, it might be half-way through, or it might be at the end. Personally, I usually add the message at the end, but that's just preference.

Yes, since everything is always in a commit, there's nothing to stash. In git, if you have some uncommitted changes and want to check out an old commit, you need to stash them first. In Jujutsu, since all your changes are automatically persisted in a commit at all times, you can have some new changes (which, if this were git, would be uncommitted), you can check out an older commit, then come back to your new changes in the latest commit, and they'll all be there.

If you're going to be jumping around the tree all the time, making commits and branches, they can't require names. Jujutsu lets you create branches by just creating a commit, you don't need to name the branch. In Jujutsu (and in git!), branches are simply two or more commits with the same parent; it's just that git artificially makes you think of branches as special, because it makes you name them. In Jujutsu, creating a branch is as simple as checking out the commit you want to branch from, and creating a new commit on top of it. This is one thing Jujutsu simplifies over git. In git, branches are a fairly heavy thing: you have to name them, you have the mental model of "being" on the branch, and your workflow is centered around them. In Jujutsu, you just… add a new commit, and if that commit has siblings, well, that's now a branch.

I haven't talked about conflicts much, because, unlike git, in practice they haven't really been anything special. Jujutsu doesn't stop the world at all, it doesn't even particularly complain, it just marks a commit as conflicted, but you can continue working on other places in the worktree and then later come back at your leisure and fix that commit's conflicts! Whereas in git you have to quit what you're doing and fix the conflicts right now, jj is more "by the way, when you have some time, let me know what this commit should look like". The resolutions also cascade to all subsequent commits, which is fantastic. You only fix conflicts once, and jj takes care of the rest.

Under the hood, jj automatically and transparently commits whatever you're working on when you invoke the jj command (it can also be configured to do it on its own whenever a file in the repo changes). This is safe, as these intermediate changes won't be pushed anywhere, but this means that you get snapshots for free!
If you've ever had Claude get to a working solution, but then trip over itself and mess it up, jj can help: you can use the oplog to go back to the way your repo looked a few minutes ago, even if you didn't explicitly commit anything! Even using a command that just looks at stuff (like log or status) will take a snapshot of your repo, allowing you to return to it if something goes wrong. No more losing unstaged changes, ever! This has saved my ass a few times already.

By now you probably have lots of questions, I'll try to answer some of them here. If you have more questions, just send them to me and I'll add them here, along with the answer.

You don't really branch off main, in that you usually won't need to create two commits off main, you'll only create one. In git, we branch off of main, and now our mental model is that "we're in that branch". In reality, if you look at the graph on the right, it's all still just a line; we've just made a mental "bend" in the graph to tell ourselves that we're on a branch. As far as the graph is concerned, though, nothing special really happened, we just added more commits. The only real difference is that "main" stops at the third commit, whereas "my branch" stops at the sixth commit. Other than that, the entire history is just one line.

Jujutsu, on the other hand, doesn't care what you think. It only cares what parents, children, and siblings commits have. There are two reasons you might want to branch: either history legitimately diverges into multiple directions, or you want to communicate to other people (or to yourself) that this part of the history is different (e.g. it contains some feature); the latter is also the case when you want to create a new branch so you can open a PR for it. To Jujutsu, this repo's history is a straight line, so there is no actual "branching". The only reason to have branches here is communication, so Jujutsu asks you to label the commits that you want on the branches yourself. You can see these tags on the example on the right, and it's the same as the git example above. There are still three commits in main, and three more in my branch. Jujutsu calls these labels "bookmarks", and they correspond to whatever git uses to tag branches. Bookmarks are what you'll tag your commits with to tell git what your branches are.

Continuing the earlier example, if we create a second commit off main (even if that's a merge commit, i.e. a commit with two parents), that's when the tree actually diverges. In the graph on the right, the commit where we branched off is now a parent to two commits, and history is no longer linear. This isn't special, it's just how things are, but this is what's actually a real "branch" to Jujutsu. The way that git does things, i.e. creating a branch without history actually diverging, is just for us humans and our communication needs.

Jujutsu doesn't require you to name its branches. You can happily work without any branch names at all, and you can easily see what branch is for what from the commit descriptions. You can name them, if you prefer, but you don't have to. This sounds a bit alien right now, but it's actually a really nice way to work. I'm worried I've lost you here, but it doesn't matter. You'll understand all of this easily when you play around with the tree a bit in jjui.

You can add a commit message to the current commit at any time, using the describe command. You can do this at any time; you can even go back to other commits and amend their messages (again with describe).

You don't! Everything is already in a commit! What you do is you interactively select some of the changes in the current commit (whether this commit is blank/new or an old commit, it doesn't matter), and you split that commit into two. Jujutsu can also do this automatically!
If you have a commit with a bunch of small changes to various files, jj can absorb these changes into the closest ancestor commit where each thing changed. This is pretty magical, as you can add a few one-liner bugfixes here and there, and jj will just automatically include them in the commits where those lines were touched.

Without getting too much into specifics, you just edit the commit you want. This checks it out and you can make changes to it; however, keep in mind that, if the commit was previously pushed to a remote, jj will give you a warning that you shouldn't change commits you've pushed. jjui makes navigation around the repo really easy, so use it for checking out commits as well.

You just… move it. In jjui, go to the commit you want to move, press r (for rebase), go to the commit you want to move it after, press enter, and that's it.

There isn't really a soft reset, as there isn't a staging area for your changes to be reset in. Simply check out (edit) the commit you want to change; that's a soft reset in Jujutsu. For a hard reset (i.e. to throw away a commit), you abandon that commit. jjui will, again, make it much easier to do this.

No matter what you do, you can undo it. Not just changes, but any jj operation: you can undo rebases, pulls, anything. You can also use the oplog (again, jjui makes this really easy) to go back to how the whole repo looked at any point in time. Don't be afraid to try things, with jj it's really easy to undo any mistake.

Simply edit it and make the changes you want.

There are no unstaged changes in jj. All changes are in a commit; if you want to move the changes in your current commit to another branch, simply move your current commit to the target branch by rebasing. I can never remember what "rebase X onto Y" does, so just move the commit with your changes to be a child of your branch's tip (again, use jjui for this).

To do that, you need to push a new branch. Go to the commit you want to push, then probably create a new one on top of that (I tend to create a new commit when I'm done with an old one, just so I remember I'm done, but this is personal preference). Then, bookmark that commit with the branch name you want to give your PR, and push the commit along with the bookmark. That's all, now you can open the PR. Here, jj exposes the low-level operations much more than git: you need to move the bookmark on your own to the commit you want to push (git does that automatically for you), and you need to push the bookmark manually as well. This is very helpful for understanding how things work under the hood, but usually you'll set a jj alias to do this in one step. Personally, I have an alias (described below) to find the bookmark name, move it to the latest commit, and push.

My aliases give me one command to add jj to an existing git repo, one to describe the current commit and create a new one on top of it (that's what commit does under the hood), and one convenience alias for pushing that:

- Looks backward in history
- Finds the last bookmark there (if this were git, this would be my branch name)
- Checks if the current commit has changes in it
- If it does, it creates a new commit
- Moves the bookmark to the parent commit (the one I was on before I ran the command)
- Fetches changes from upstream (to update my tree)
- Pushes the changes to the remote

I use this a lot! Jujutsu doesn't do anything that git can't do, but it removes so much friction that you'll actually end up doing things all the time that git could do, but that were so fiddly with git that you never actually did them. Creating a branch for a minute just to try an idea out even though you're in the middle of some changes, going back to a previous commit to add a line you forgot, moving commits around the tree, all of these things are so easy that they're now actually your everyday workflow.
With git, I never used to switch branches in the middle of work, because I was too worried that stashing multiple things onto the stack would eat my work. I'd never go back to a previous commit and amend it, because here be dragons. I was extremely afraid of rebasing because I always got one conflict per commit and had to unconflict the same thing fifty times. Jujutsu gives you the confidence and understanding to do all of these things, and if you fuck something up (which I haven't yet, miraculously!) the oplog is right there to fix everything to how it was 30 seconds ago.

I hope this tutorial made sense, but I'm worried it didn't. Please contact me on Twitter or Bluesky, or email me directly, if you have feedback or corrections.
Though I spend the majority of my time working with microcontroller-class devices, I also have an embarrassingly robust collection of single-board computers (SBCs), including a few different Raspberry Pi models, the BeagleV Starlight Beta (RIP), and more. Typically, when setting up these devices for whatever automation task I have planned for them, I'll use "headless mode" and configure initial user and network credentials when writing the operating system to the storage device, using a tool like Raspberry Pi's Imager.
I bought a really small 8x8 LED panel a while ago because I have a problem. I just can't resist a nice WS2812 LED panel, much like I can't resist an e-ink display. These days I manage to stay sober, but once in a while I'll see a nice cheap LED panel and fall off the wagon. It has now been thirteen minutes that I have gone without buying LED panels, and this is my story. This isn't really going to be super interesting, but there are some good lessons, so I thought I'd write it up anyway.

On the right you can see the LED panel I used; it's a bare PCB with a bunch of WS2812 (NeoPixel) addressable LEDs soldered onto it. It was the perfect excuse for trying out WLED, which I've wanted to take a look at for ages, and which turned out to be absolutely fantastic.

As with every light-based project, one of the big issues is proper diffusion. You don't want your LEDs to show up as the points of light they are; we really like nice, big, diffuse lights, so you need a way to do that. My idea was to print a two-layer white square out of PLA (which would be translucent enough to show the light, but not so translucent that you could see the LEDs behind it). I also printed a box for the square to go in front of. I printed the diffuser (the white square) first, held it over the LED panel, and increased or decreased the distance of the square from the LEDs until the LEDs didn't look like points, but the colors also didn't blend into the neighboring squares' colors. This turned out to be around 10mm, so that's how thick I made the box.

The eagle-eyed among you may want to seek medical assistance, but if you have normal human eyes, you may have noticed that there's nowhere in the box for the microcontroller to go, and you would be correct. For this build, I decided to use an ESP8266 (specifically, a WeMos dev board), but I didn't want to make the whole box chunky just to fit a small microcontroller in there, so I did the next best thing: I designed a hole in the back of the box for the cables that connect to the LED panel, and I glued the ESP8266 to the back of the box. YOLO. Look, it works great, ok? The cables are nice and shortish, even though they go to the entirely wrong side of the thing, the USB connector is at a very weird place, and the ESP8266 is exposed to the elements and the evil eye. It's perfect.

Here's the top side, with the diffuser. And here's the whole mini tiny cute little panel showing some patterns from WLED (did I mention it's excellent? It is). That's it! I learned a few things and made a cute box of lights. I encourage you to make your own, it's extremely fun and mesmerizing and I love it and gave it to a friend because I never used it and it just took up space and then made a massive 32x32 version that I also never use and hung it on my wall. Please feel free to Tweet or toot at me, or email me directly.
Step by step instructions for monitoring your yellingist feathered neighbors.
Stick microphone out window, catch chorps, feel joy.
This post assumes you know algebra, but no linear algebra. Let's dive in. There are two big ideas I want to introduce in the first chapter: Gaussian elimination (which is not strictly a linear algebra thing, and had been around for years before linear algebra came along), and row picture versus column picture, which is a linear algebra thing.

Let's say you have a bunch of nickels and pennies, and you want to know how many of each you need to have 23 cents. You could write that as an equation that looks like this: 5x + y = 23. x is the number of nickels you need, y is the number of pennies you need. And you need to figure out the x and y values that would make the left-hand side work out to 23. This one is pretty easy, you can just work it out yourself. You'd need four nickels and three pennies. So x is four, y is three. This kind of equation is called a linear equation. And that's because when you plot this equation, everything is flat and smooth. There are no curves or holes. There isn't an x² in the equation, for example, to make it curved. Linear equations are great because they're much easier to work with than curved equations. Aside: another solution for the above is 23 pennies. Or -4 nickels + 43 pennies.

The point is you have two variables (x and y for nickels and pennies), and you are trying to combine them in different ways to hit one number. The trouble starts when you have two variables, and you need to combine them in different ways to hit two different numbers. That's when Gaussian elimination comes in. In what world would you have to hit two different numbers? Does that seem outlandish? It's actually very common! Read on for an example.

Food example

Now let's look at a different example. In the last one we were trying to make 23 cents with nickels and pennies. Here we have two foods. One is milk, the other is bread. They both have some macros in terms of carbs and protein: each milk has 1 carb and 2 protein, and each bread has 2 carbs and 1 protein. Now we want to figure out how many of each we need to eat to hit a target of 5 carbs and 7 protein. This is a very similar question to the one we just asked with nickels and pennies, except instead of one equation, we have two equations. Again we have an x and a y. Let's find their values. To solve these kinds of questions, we usually use Gaussian elimination. If you've never used Gaussian elimination, strap in.

Gaussian elimination

Step one is to rewrite this as a set of two equations: x + 2y = 5 for the carbs and 2x + y = 7 for the protein. Now you subtract multiples of one equation from another to try to narrow down the value of one variable. Let's double that second equation: 4x + 2y = 14. See how we have a 2y and a 2y now? Now we can subtract the first equation from it to eliminate y. We're left with one equation and one variable, 3x = 9. We can solve for x: x = 3. Aha, we know x. Now we can plug that into one of the equations to find y. We plug it in and find out that y equals 1, and there we have the answer: three milks, one bread, is what we need. This method is called Gaussian elimination, even though it was not discovered by Gauss. If you haven't seen Gaussian elimination before, congratulations, you just learned a big idea! Gaussian elimination is something we will talk about more. It's part of what makes linear algebra useful.
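Written out step by step, with the milk and bread macros above (milk: 1 carb, 2 protein; bread: 2 carbs, 1 protein; target: 5 carbs, 7 protein), the elimination looks like this:

```latex
\begin{align*}
x + 2y &= 5  && \text{(carbs: } x \text{ milks, } y \text{ breads)} \\
2x + y &= 7  && \text{(protein)} \\
4x + 2y &= 14 && \text{(the second equation, doubled)} \\
(4x + 2y) - (x + 2y) &= 14 - 5 && \text{(subtract the first equation)} \\
3x &= 9 \;\Rightarrow\; x = 3, \quad y = 1
\end{align*}
```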
We can also find the solution by drawing pictures. Let's see how that works.

Picture version

Let's plot one of these lines. First, we need to rewrite the equations in terms of y: y = (5 - x)/2 for carbs, and y = 7 - 2x for protein. Reminder: the first equation is for carbs, the second for protein; x is the number of milks, y is the number of breads. Now let's plot the graph for the first equation. Now, what does this line represent? It's all the combinations of bread and milk that you can have to get exactly five carbs. So you can eat no milk and two-and-a-half breads, or two milks and one-and-a-half breads, or five milks and no bread, to get to exactly five carbs. All of those combinations would mean you have eaten exactly five carbs. You can pick any point that sits on this line to get to your goal of eating five carbs. Note: you can see the line goes into the negative as well. Technically, 5 breads and -5 milks will give you 5 carbs as well, but you can't drink negative milks. For these examples, let's assume only positive numbers for the variables.

Now, let's plot the other one. This is the same thing, but for protein. If you eat any of these combinations, you'll have met the protein goal. You can pick a point that sits on the first line to meet the carb goal. You can pick a point that sits on the second line to meet the protein goal. But you need a point that sits on both lines to hit both goals. How would a point sit on both lines? Well, it would be where the lines cross. Since these are straight lines, the lines cross only once, which makes sense because there's only a single milk and bread combo that would get you to exactly five grams of carbs and seven grams of protein. Now we plot the lines together, see where they intersect, and that's our answer. Bam! We just found the solution using pictures.

So that's a quick intro to Gaussian elimination. But you don't need linear algebra to do Gaussian elimination. This is a technique that has been around for 2,000 years. It was discovered in Asia, it was rediscovered in Europe, I think in the 1600s or something, and no one was really talking about "linear algebra". This trick is just very useful. That's the first big idea you learned. You can stop there if you want. You can practice doing this sort of elimination. It's a very common and useful thing.

The column picture

What we just saw is called the "row picture". Now I want to show you the column picture. I'm going to introduce a new idea, which is: instead of writing this series of equations, what if we write just one equation? Remember how we had one equation for the nickels and pennies question? What if we write one like that for food? Not a system of equations, just a single equation? What do you think that would look like? Something like this: x times the milk vector plus y times the bread vector equals the target vector of carbs and protein. It's an equation where the coefficients aren't numbers, they're an "array" of numbers. The big idea here is: what if we have a linear equation, but instead of numbers, we have arrays of numbers? What if we treat an array of numbers the way we treat a number? Can that actually work? If so, it is pretty revolutionary. Our whole lives we have been looking at just numbers, and now we're saying, what if we look at arrays of numbers instead? Let's see how it could work in our food example. What if the coefficients are arrays of numbers? Well, this way of thinking is actually kind of intuitive. You might find it even more intuitive than the system of equations version. Each of these coefficients is called a vector. If you're coming from computer science, you can kind of think of a vector as an array of numbers (i.e. the order matters). Let's see how we can use vectors to find a solution to the bread and milk question.

Step one: graph the vectors. Yeah, we can graph vectors. We can graph them either as a point, like I've done for the target vector here, or as an arrow, which is what I've done with the vector for bread and the vector for milk: use the two numbers in the vector as the x and y coordinates.
That is another big idea here: we always think of a set of coordinates giving a point, but you can think of a vector as an arrow instead of just a point. Now what we're asking is how much milk and how much bread do we need to get to that point? This is a pretty simple question. It's simple enough that we can actually see it. Let me add some milks, and let me add a bread. Bingo bango, we're at the point. Yeah, we literally add them on, visually. I personally find this more intuitive. I think the system of equations picture can confuse me sometimes, because the initial question was, "how much bread and how much milk should I eat?" The vector way, you see it in terms of breads and milks. The row way, you see one line for the carbs and another line for the protein, and the x and y axes are the amounts of milk and bread, which results in the same thing, but it's a little more roundabout, a little more abstract. This one is very direct.

The algebra way

We just saw that we can graph vectors too. Graphing them works differently from graphing the rows, but there is a graph we can make, and it works, which is pretty cool. What about the algebra way? Here is the equation again, and since we already know the answer (three milks, one bread), I'll just plug that in. Now, the question is: how does the left side equal the right side? The first question is how you define this multiplication. Well, in linear algebra, if you multiply a scalar by a vector, you just multiply each number in that vector by it. Now you are left with two vectors. How do you add two vectors? Well, in linear algebra, you just add the individual elements of each vector. And you end up with the answer (the whole calculation is written out below). Congratulations, you've just had your first taste of linear algebra. It's a pretty big step, right? Instead of numbers, we're working with arrays of numbers. In future chapters, we will see why this is so powerful. That's the first big concept of linear algebra: row picture vs column picture.

Finally, I'll just leave you with this last teaser: how would you write these two equations in matrix notation? It's the exact same thing as before. You can write it as scalars times columns, as we had done before, or you can write it as a matrix times a vector (also shown below). Either one works. Matrices are a big part of linear algebra. But before we talk about matrices, we will talk about the dot product, which is coming up next.
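Here are the column-picture equation, the plugged-in solution, and the matrix form from the teaser, using the same milk and bread numbers as before:

```latex
\begin{align*}
x \begin{bmatrix} 1 \\ 2 \end{bmatrix}
  + y \begin{bmatrix} 2 \\ 1 \end{bmatrix}
  &= \begin{bmatrix} 5 \\ 7 \end{bmatrix}
  && \text{(one equation: } x \text{ milks, } y \text{ breads)} \\
3 \begin{bmatrix} 1 \\ 2 \end{bmatrix}
  + 1 \begin{bmatrix} 2 \\ 1 \end{bmatrix}
  &= \begin{bmatrix} 3 \\ 6 \end{bmatrix}
  + \begin{bmatrix} 2 \\ 1 \end{bmatrix}
  = \begin{bmatrix} 5 \\ 7 \end{bmatrix}
  && \text{(scalar times vector, then vector addition)} \\
\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
  &= \begin{bmatrix} 5 \\ 7 \end{bmatrix}
  && \text{(the same system as a matrix times a vector)}
\end{align*}
```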
Additional reading

Check out Gilbert Strang's lectures on linear algebra on YouTube.
The moment I learned how to program, I wanted to experiment with my new superpowers. Building a BMI calculator in the command line wouldn't cut it. I didn't want to read another book, or follow any other tutorial. What I wanted was to experience chaos. Controlled, beautiful, instructive chaos that comes from building something real and watching it spectacularly fail. That's why whenever someone asks me how they can practice their newfound skill, I suggest something that might sound old-fashioned in our framework-obsessed world. Build your own blog from scratch. Not with WordPress. Not with Next.js or Gatsby or whatever the cool kids are using this week. I mean actually build it. Write every messy, imperfect line of code.

A blog is deceptively simple. On the surface, it's just text on a page. But underneath? It's a complete web application in miniature. It accepts input (your writing). It stores data (your posts). It processes logic (routing, formatting, displaying). It generates output (the pages people read).

When I was in college, I found myself increasingly frustrated with the abstract nature of what we were learning. We'd implement different sorting algorithms, and I'd think: "Okay, but when does this actually matter?" We'd study data structures in isolation, divorced from any practical purpose. It all felt theoretical, like memorizing chess moves without ever playing a game. Building a blog changed that completely. Suddenly, a data structure wasn't just an abstract concept floating in a textbook. It was the actual list of blog posts I needed to sort by date. A database wasn't a theoretical collection of tables; it was the real place where my article drafts lived, where I could accidentally delete something important at 2 AM and learn about backups the hard way. This is what makes a blog such a powerful learning tool. You can deploy it. Share it. Watch people actually read the words your code is serving up. It's real. That feedback loop, the connection between your code and something tangible in the world, is irreplaceable.

So how do you start? I'm not going to give you a step-by-step tutorial. You've probably already done a dozen of those. You follow along, copy the code, everything works perfectly, and then... you close the browser tab and realize you've learned almost nothing. The code evaporates from your memory because you never truly owned it. Instead, I'm giving you permission to experiment. To fumble. To build something weird and uniquely yours.

You can start with a single file. Maybe it's a lone PHP file that clumsily echoes "Hello World" onto a blank page. Or perhaps you're feeling adventurous and fire up a Node.js server that handles a simple GET request without Express. Pick any language you are familiar with and make it respond to a web request. That's your seed. Everything else grows from there.

Once you have that first file responding, the questions start arriving. Not abstract homework questions, but real problems that need solving. Where do your blog posts live? Will you store them as simple Markdown or JSON files in a folder? Or will you take the plunge into databases, setting up MySQL or PostgreSQL and learning SQL to insert and query your articles? I started my first blog with flat files. There's something beautiful about the simplicity. Each post is just a text file you can open in any editor. But then I wanted tags, and search, and suddenly I was reinventing databases poorly. That's when I learned why databases exist.
Not from a lecture, but from feeling the pain of their absence. You write your first post. Great! You write your second post. Cool! On the third post, you realize you're copying and pasting the same HTML header and footer, and you remember learning something about DRY (don't repeat yourself) in class. This is where you'll inevitably invent your own primitive templating system. Maybe you start with simple includes, pulling a shared header in at the top of each page in PHP. Maybe you write a JavaScript function that stitches together HTML strings. Maybe you create your own bizarre templating syntax. It will feel like magic when it works. It will feel like a nightmare when you need to change something and it breaks everywhere. And that's the moment you'll understand why templating engines exist.

I had a few blog posts written down on my computer when I started thinking about this next problem: how do you write a new post? Do you SSH into your server and directly edit a file with vim? Do you build a crude, password-protected page with a textarea that writes to your flat files? Do you create a whole separate submission form? This is where you'll grapple with forms, authentication (or a hilariously insecure makeshift version of it), file permissions, and the difference between GET and POST requests. You'll probably build something that would make a security professional weep, and that's okay. You'll learn by making it better.

It's one thing to write code in a sandbox, but a blog needs to be accessible on the Internet. That means getting a domain name (ten bucks a year). Finding a cheap VPS (five bucks a month). Learning to SSH into that server. Wrestling with Nginx or Apache to actually serve your files. Discovering what "port 80" means, why your site isn't loading, why DNS takes forever to propagate, and why everything works on your laptop but breaks in production. These aren't inconveniences, they're the entire point. This is the knowledge that separates someone who can write code from someone who can ship code.

Your blog won't use battle-tested frameworks or well-documented libraries. It will use your solutions. Your weird routing system. Your questionable caching mechanism. Your creative interpretation of MVC architecture. Your homemade caching will fail spectacularly under traffic (what traffic?!). Your clever URL routing will throw mysterious 404 errors. You'll accidentally delete a post and discover your backup system doesn't work. You'll misspell a variable name and spend three hours debugging before you spot it. You'll introduce a security vulnerability so obvious that even you'll laugh when you finally notice it. None of this is failure. This is the entire point.

When your blog breaks, you'll be forced to understand the why behind everything. Why do frameworks exist? Because you just spent six hours solving a problem that Express handles in three lines. Why do ORMs exist? Because you just wrote 200 lines of SQL validation logic that Sequelize does automatically. Why do people use TypeScript? Because you just had a bug caused by accidentally treating a string like a number. You'll emerge from this experience not just as someone who can use tools, but as someone who understands what problems those tools were built to solve. That understanding is what transforms a code-copier into a developer.

Building your own blogging engine used to be a rite of passage. Before Medium and WordPress and Ghost, before React and Vue and Svelte, developers learned by building exactly this. A simple CMS. A place to write.
Something that was theirs. We've lost a bit of that spirit. Now everyone's already decided they'll use React on the frontend and Node on the backend before they even know why. The tools have become the default, not the solution. Your blog is your chance to recover that exploratory mindset. It's your sandbox. Nobody's judging. Nobody's watching. You're not optimizing for scale or maintainability or impressing your coworkers. You're learning, deeply and permanently, by building something that matters to you.

So here's my challenge: Stop reading. Stop planning. Stop researching the "best" way to do this. Create a folder. Create a file. Pick a language and make it print "Hello World" in a browser. Then ask yourself: "How do I make this show a blog post?" And then: "How do I make it show two blog posts?" And then: "How do I make it show the most recent one first?"

Build something uniquely, personally, wonderfully yours. Make it ugly. Make it weird. Make it work, then break it, then fix it again. Embrace the technical chaos. This is how you learn. Not by following instructions, but by discovering problems, attempting solutions, failing, iterating, and eventually (accidentally) building something real. Your blog won't be perfect. It will probably be kind of a mess. But it will be yours, and you will understand every line of code in it, and that understanding is worth more than any tutorial completion certificate.

If you don't know what that first blog post will be, I have an idea. Document your process of building your very own blog from scratch. The blog you build to learn programming becomes the perfect place to share what programming taught you. Welcome to development. The real kind, where things break and you figure out why. You're going to love it.
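To make the "single file" seed from earlier concrete, here is one possible starting point, sketched in Python with only the standard library; the route scheme and the in-memory post storage are placeholders of mine, not a prescription:

```python
# blog.py -- a hypothetical first seed: one file, no framework, posts in memory.
from http.server import BaseHTTPRequestHandler, HTTPServer

POSTS = {  # later these could live in flat files or a database
    "hello": "<h1>Hello World</h1><p>My first post.</p>",
}

class BlogHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        slug = self.path.strip("/")
        # "/" lists the posts; "/<slug>" shows one post.
        body = POSTS.get(slug) if slug else "<br>".join(
            f'<a href="/{s}">{s}</a>' for s in POSTS
        )
        if body is None:
            self.send_error(404, "No such post")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), BlogHandler).serve_forever()
```

From here, the natural next questions follow on their own: swap the dict for files on disk, add dates, sort newest first.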
How do we actually evaluate LLMs? It's a simple question, but one that tends to open up a much bigger discussion. When advising or collaborating on projects, one of the things I get asked most often is how to choose between different models and how to make sense of the evaluation results out there. (And, of course, how to measure progress when fine-tuning or developing our own.) Since this comes up so often, I thought it might be helpful to share a short overview of the main evaluation methods people use to compare LLMs. Of course, LLM evaluation is a very big topic that can't be exhaustively covered in a single resource, but I think that having a clear mental map of these main approaches makes it much easier to interpret benchmarks, leaderboards, and papers.

I originally planned to include these evaluation techniques in my upcoming book, Build a Reasoning Model (From Scratch), but they ended up being a bit outside the main scope. (The book itself focuses more on verifier-based evaluation.) So I figured that sharing this as a longer article with from-scratch code examples would be nice. In Build A Reasoning Model (From Scratch), I am taking a hands-on approach to building a reasoning LLM from scratch. If you liked "Build A Large Language Model (From Scratch)", this book is written in a similar style in terms of building everything from scratch in pure PyTorch. Reasoning is one of the most exciting and important recent advances in improving LLMs, but it's also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. So, in this book, I am taking a hands-on approach to building a reasoning LLM from scratch. The book is currently in early access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live.

PS: There's a lot happening on the LLM research front right now. I'm still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article. But now, let's discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses.

There are four common ways of evaluating trained LLMs in practice: multiple choice, verifiers, leaderboards, and LLM judges, as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories.

Figure 1: An overview of the four different evaluation methods covered in this article.

Furthermore, the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation, as shown in the figure above. (There are also other measures, such as training loss, perplexity, and rewards, but they are usually used internally during model development.) The following subsections provide brief overviews and examples of each of the four methods. We begin with a benchmark-based method: multiple-choice question answering.

Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu). To illustrate this approach, figure 2 shows a representative task from the MMLU dataset.
Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset.

Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly. Multiple-choice benchmarks, such as MMLU, test an LLM’s knowledge recall in a straightforward, quantifiable way similar to standardized tests, many school exams, or theoretical driving tests.

Note that Figure 2 shows a simplified version of multiple-choice evaluation, where the model’s predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring. I implemented them here on GitHub. (As this builds on the concepts explained here, I recommend checking it out after completing this article.) The following subsections illustrate how the MMLU scoring shown in Figure 2 can be implemented in code.

1.2 Loading the model

First, before we can evaluate it on MMLU, we have to load the pre-trained model. Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub. Instead of copy & pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via pip or uv.

1.3 Checking the generated answer letter

In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience.

Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset.

For this, we will work with an example from the MMLU dataset. Next, we define a function to format the LLM prompts. Executing the function on the MMLU example gives an idea of what the formatted LLM input looks like; it starts with the question: How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes? The model prompt provides the model with a list of the different answer choices and ends with text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required.
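The original code blocks did not survive the transfer to this page, so here is a minimal sketch of the two helpers used in this and the next subsection: one that formats an MMLU record into a prompt, and one that pulls the first A/B/C/D letter out of the generated text. The field names (question, choices, answer) match the Hugging Face MMLU dataset; the answer options below are illustrative stand-ins, and the function names are mine rather than the article's original implementation.

```python
import re

def format_mmlu_prompt(example):
    # example = {"question": str, "choices": [str, str, str, str], "answer": int}
    letters = ["A", "B", "C", "D"]
    lines = [example["question"]]
    for letter, choice in zip(letters, example["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")  # nudges the model to reply with a single letter
    return "\n".join(lines)

def extract_answer_letter(generated_text):
    # Return the first standalone A/B/C/D the model prints, or None.
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else None

example = {
    "question": "How many ways are there to put 4 distinguishable balls "
                "into 2 indistinguishable boxes?",
    "choices": ["7", "11", "16", "8"],  # illustrative options, not the dataset's exact record
    "answer": 3,                        # index of the correct choice ("8")
}

prompt = format_mmlu_prompt(example)
predicted = extract_answer_letter("The answer is B")   # -> "B"
is_correct = predicted == "ABCD"[example["answer"]]    # -> False
```

The accuracy over a whole subset is then simply the fraction of examples for which `is_correct` is True.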
Loading different MMLU samples

You can load examples from the MMLU dataset directly via the datasets library (which can be installed via pip or uv). Above, we used a single subset; to get a list of all available subsets, the library provides a lookup function.

Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM. Then, with all that setup out of the way, we define the main scoring function, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of the letters A/B/C/D that the model prints. We can then check the generated letter against the correct answer; in this case, the generated answer is incorrect.

This was just one of the 270 examples from this subset of MMLU. The screenshot (Figure 4) below shows the performance of the base model and the reasoning variant when executed on the complete subset. The code for this is available here on GitHub.

Figure 4: Base and reasoning model performance on the MMLU subset

Assuming the questions have an equal answer probability, a random guesser (with uniform probability of choosing A, B, C, or D) is expected to achieve 25% accuracy. So both the base and the reasoning model are not very good.

Multiple-choice answer formats

Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model’s predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (Probability-based scoring is discussed in chapter 4 of the book.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input.

Figure 5: Other MMLU scoring methods are described and shared on GitHub here

However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple-choice benchmarks like MMLU is that they only measure an LLM’s ability to select from predefined options and thus are not very useful for evaluating reasoning capabilities, besides checking if and how much knowledge the model has forgotten compared to the base model. They do not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn’t necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps.

Method 2: Using verifiers to check answers

Related to the multiple-choice question answering discussed in the previous section, verification-based approaches quantify an LLM’s capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below.

Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset.

When we compare the extracted answer with the provided answer, as shown in the figure above, we can employ external tools, such as code interpreters or calculator-like tools/software.
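The verifier itself can be quite small. Below is a minimal sketch (not the book's implementation) that extracts the final \boxed{...} answer from a free-form response and compares it, after light normalization, against the reference answer:

```python
import re

def extract_boxed_answer(response_text):
    # Grab the content of the last \boxed{...} in the model's response.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response_text)
    return matches[-1].strip() if matches else None

def normalize(answer):
    # Light normalization: strip whitespace, trailing periods, "$", and spaces.
    return answer.strip().rstrip(".").replace("$", "").replace(" ", "")

def verify(response_text, reference_answer):
    extracted = extract_boxed_answer(response_text)
    if extracted is None:
        return False
    return normalize(extracted) == normalize(reference_answer)

response = (
    "The two numbers are 3 and 4, so the product is 3 * 4 = 12. "
    "The final answer is \\boxed{12}."
)
print(verify(response, "12"))  # True
```

A production-grade math verifier typically goes further (for example, comparing expressions symbolically), but simple string comparison of the boxed answer already covers a surprising number of cases.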
The downside is that this method can only be applied to domains that can be easily (and ideally deterministically) verified, such as math and code. Also, this approach can introduce additional complexity and dependencies, and it may shift part of the evaluation burden from the model itself to the external tool. However, because it allows us to generate an unlimited number of math problem variations programmatically and benefits from step-by-step reasoning, it has become a cornerstone of reasoning model evaluation and development.

I wrote a comprehensive 35-page chapter on this topic in my “Build a Reasoning Model (From Scratch)” book, so I am skipping the code implementation here. (I submitted the chapter last week. If you have the early access version, you’ll receive an email when it goes live and will be able to read it then. In the meantime, you can find the step-by-step code here on GitHub.)

Figure 7: Excerpt from the verification-based evaluation approach available here on GitHub

Method 3: Comparing models using preferences and leaderboards

So far, we have covered two methods that offer easily quantifiable metrics such as model accuracy. However, neither of the aforementioned methods evaluates LLMs in a more holistic way, including judging the style of the responses. In this section, as illustrated in Figure 8 below, we discuss a judgment-based method, namely LLM leaderboards.

Figure 8: A mental model of the topics covered in this article, with a focus on the judgment- and benchmark-based evaluation methods.

Having already covered benchmark-based approaches (multiple choice, verifiers) in the previous sections, we now introduce judgment-based approaches to measure LLM performance, with this subsection focusing on leaderboards. The leaderboard method described here is a judgment-based approach where models are ranked not by accuracy values or other fixed benchmark scores but by user (or other LLM) preferences on their outputs. A popular leaderboard is LM Arena (formerly Chatbot Arena), where users compare responses from two user-selected or anonymous models and vote for the one they prefer, as shown in Figure 9.

Figure 9: Example of a judgment-based leaderboard interface (LM Arena). Two LLMs are given the same prompt, their responses are shown side by side, and users vote for the preferred answer.

These preference votes, which are collected as shown in the figure above, are then aggregated across all users into a leaderboard that ranks different models by user preference. A current snapshot of the LM Arena leaderboard (accessed on October 3, 2025) is shown below in Figure 10.

Figure 10: Screenshot of the LM Arena leaderboard that shows the current leading LLMs based on user preferences on text tasks

In the remainder of this section, we will implement a simple example of a leaderboard. To create a concrete example, consider users prompting different LLMs in a setup similar to Figure 9. We collect pairwise votes, where each tuple in the votes list represents a pairwise preference between two models, written as (winner, loser). So, a tuple like ("GPT-5", "Claude-3") means that a user preferred a GPT-5 answer over a Claude-3 answer. We will now turn this list into a leaderboard. For this, we will use the popular Elo rating system, which was originally developed for ranking chess players. Before we look at the concrete code implementation, in short, it works as follows: each model starts with a baseline score, and after each comparison and preference vote, the model’s rating is updated.
(In Elo, the update magnitude depends on how surprising the outcome is.) Specifically, if a user prefers the current model over a highly ranked model, the current model will get a relatively large rating update and rank higher in the leaderboard. Vice versa, if it wins against a low-ranked opponent, the update is smaller. (And if the current model loses, it is updated in a similar fashion, but with rating points getting subtracted instead of added.) The code to turn these pairwise votes into a leaderboard takes only a few lines (a sketch is shown at the end of this section); it produces a ranking where the higher the score, the better.

So, how does this work? For each pair, we compute the expected score of the winner using the following formula:

E_\text{winner} = \frac{1}{1 + 10^{(R_\text{loser} - R_\text{winner}) / 400}}

This value is the model’s predicted chance to win in a no-draw setting based on the current ratings. It determines how large the rating update is. First, each model starts at the same baseline rating. If the two ratings (winner and loser) are equal, we have E_\text{winner} = 0.5, which indicates an even match; in this case, the winner gains half of the update step size K and the loser loses the same amount. Now, if a heavy favorite (a model with a high rating) wins, we have E_\text{winner} \approx 1; the favorite gains only a small amount and the loser loses only a little. However, if an underdog (a model with a low rating) wins, we have E_\text{winner} \approx 0, and the winner gets almost the full K points while the loser loses about the same magnitude.

Order matters

The Elo approach updates ratings after each match (model comparison), so later results build on ratings that have already been updated. This means the same set of outcomes, when presented in a different order, can end with slightly different final scores. This effect is usually mild, but it can happen especially when an upset happens early versus late. To reduce this order effect, we can shuffle the vote pairs, run the function multiple times, and average the ratings.

Leaderboard approaches such as the one described above provide a more dynamic view of model quality than static benchmark scores. However, the results can be influenced by user demographics, prompt selection, and voting biases. Benchmarks and leaderboards can also be gamed, and users may select responses based on style rather than correctness. Finally, compared to automated benchmark harnesses, leaderboards do not provide instant feedback on newly developed variants, which makes them harder to use during active model development.

Other ranking methods

LM Arena originally used the Elo method described in this section but recently transitioned to a statistical approach based on the Bradley-Terry model. The main advantage of the Bradley-Terry model is that, being statistically grounded, it allows the construction of confidence intervals to express uncertainty in the rankings. Also, in contrast to Elo ratings, the Bradley-Terry model estimates all ratings jointly using a statistical fit over the entire dataset, which makes it immune to order effects. To keep the reported scores in a familiar range, the Bradley-Terry model is fitted to produce values comparable to Elo. Even though the leaderboard no longer officially uses Elo ratings, the term “Elo” remains widely used by LLM researchers and practitioners when comparing models. A code example showing the Elo rating is available here on GitHub.

Figure 11: A comparison of Elo and Bradley-Terry rankings; the source code is available here on GitHub.
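Here is the promised sketch of the Elo update described above. The vote list and model names are hypothetical, and the K value of 32 and the 1,000-point baseline are common defaults rather than anything mandated by the method:

```python
from collections import defaultdict

def elo_leaderboard(votes, k=32, base_rating=1000.0):
    # votes: list of (winner, loser) tuples from pairwise preference votes
    ratings = defaultdict(lambda: base_rating)
    for winner, loser in votes:
        # Expected score of the winner given the current ratings
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)

# Hypothetical votes: the first entry in each tuple is the preferred model.
votes = [
    ("GPT-5", "Claude-3"),
    ("GPT-5", "Llama-4"),
    ("Claude-3", "Llama-4"),
    ("Llama-4", "GPT-5"),
]

for model, rating in elo_leaderboard(votes):
    print(f"{model:10s} {rating:7.1f}")
```

Shuffling the votes and averaging the ratings over several runs, as suggested above, is a two-line extension (random.shuffle inside a loop).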
Method 4: Judging responses with other LLMs

In the early days, LLMs were evaluated using statistical and heuristics-based methods, including a measure called BLEU, which is a crude measure of how well generated text matches reference text. The problem with such metrics is that they require exact word matches and don’t account for synonyms, rewordings, and so on. One solution to this problem, if we want to judge the written answer text as a whole, is to use relative rankings and leaderboard-based approaches as discussed in the previous section. However, a downside of leaderboards is the subjective nature of the preference-based comparisons, since they involve human feedback (as well as the challenges associated with collecting this feedback).

A related method is to use another LLM with a pre-defined grading rubric (i.e., an evaluation guide) to compare an LLM’s response to a reference response and judge the response quality accordingly, as illustrated in Figure 12.

Figure 12: Example of an LLM-judge evaluation. The model to be evaluated generates an answer, which is then scored by a separate judge LLM according to a rubric and a provided reference answer.

In practice, the judge-based approach shown in Figure 12 works well when the judge LLM is strong. Common setups use leading proprietary LLMs via an API (e.g., the GPT-5 API), though specialized judge models also exist. (E.g., one of the many examples is Phudge; ultimately, most of these specialized models are just smaller models fine-tuned to have similar scoring behavior as proprietary GPT models.) One of the reasons why judges work so well is that evaluating an answer is often easier than generating one.

To implement a judge-based model evaluation as shown in Figure 12 programmatically in Python, we could either load one of the larger Qwen3 models in PyTorch and prompt it with a grading rubric and the model answer we want to evaluate, or we could use other LLMs through an API, for example the ChatGPT or Ollama API. As we already know how to load Qwen3 models in PyTorch, to make it more interesting, in the remainder of the section we will implement the judge-based evaluation shown in Figure 12 using the Ollama API in Python. Specifically, we will use the 20-billion parameter gpt-oss open-weight model by OpenAI, as it offers a good balance between capabilities and efficiency. For more information about gpt-oss, please see my From GPT-2 to gpt-oss: Analyzing the Architectural Advances article.

4.1 Implementing an LLM-as-a-judge approach in Ollama

Ollama is an efficient open-source application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library, which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text with LLMs (inference) and does not support training or fine-tuning LLMs.

To execute the following code, please install Ollama by visiting the official website at https://ollama.com and following the provided instructions for your operating system:
For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command-line usage, select “yes.”
For Linux users: Use the installation command available on the Ollama website.
Before implementing the model evaluation code, let’s first download the gpt-oss model and verify that Ollama is functioning correctly from the command-line terminal. Execute the ollama run gpt-oss:20b command on the command line (not in a Python session) to try out the 20-billion parameter gpt-oss model. The first time you execute this command, the model, which takes up 14 GB of storage space, will be automatically downloaded.

Note that the gpt-oss:20b in the ollama run gpt-oss:20b command refers to the 20-billion parameter gpt-oss model. Using Ollama with the gpt-oss:20b model requires approximately 13 GB of RAM. If your machine does not have sufficient RAM, you can try using a smaller model, such as the 4-billion parameter qwen3:4b model via ollama run qwen3:4b, which only requires around 4 GB of RAM. For more powerful computers, you can also use the larger 120-billion parameter gpt-oss model by replacing gpt-oss:20b with gpt-oss:120b. However, keep in mind that this model requires significantly more computational resources.

Once the model download is complete, we are presented with a command-line interface that allows us to interact with the model. For example, try asking the model, “What is 1+2?”. You can end this ollama run gpt-oss:20b session using the input /bye.

In the remainder of this section, we will use the Ollama API. This approach requires that Ollama is running in the background. There are three different options to achieve this:
1. Run the ollama serve command in the terminal (recommended). This runs the Ollama backend as a server, usually on http://localhost:11434. Note that it doesn’t load a model until it’s called through the API (later in this section).
2. Run the ollama run gpt-oss:20b command as earlier, but keep the session open and don’t exit it via /bye. As discussed earlier, this opens a minimal convenience wrapper around a local Ollama server. Behind the scenes, it uses the same server API as ollama serve.
3. Use the Ollama desktop app. Opening the desktop app runs the same backend automatically and provides a graphical interface on top of it.

Figure 13: Two different options to keep the Ollama server (/application) running so we can use it via the Ollama API in Python.

Ollama runs locally on our machine by starting a local server-like process. When running ollama serve in the terminal, as described above, you may encounter an error message saying that the address is already in use. If that’s the case, try binding a different address via the OLLAMA_HOST environment variable, for example OLLAMA_HOST=127.0.0.1:11435 ollama serve (and if this address is also in use, increment the port number by one until you find one that is free).

The following check verifies that the Ollama session is running properly before we use Ollama to evaluate the test set responses generated in the previous section. Ensure that the check reports that Ollama is running; if it doesn’t, please verify that the ollama serve command or the Ollama application is actively running (see Figure 13).

In the remainder of this article, we will interact with the local gpt-oss model, running on our machine, through the Ollama REST API using Python; a small helper function (sketched below) is all we need for this. As an example, asking it “What is 1+2?” returns the response “3”. (It differs from what we’d get if we ran ollama run or the Ollama application due to different default settings.) Using the function, we can evaluate the responses generated by our model with a prompt that includes a grading rubric asking the gpt-oss model to rate our target model’s responses on a scale from 1 to 5, given a correct answer as a reference.
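Here is that helper as a minimal sketch. It posts to the REST endpoint that ollama serve exposes on localhost:11434; the function name, the fixed generation options, and the absence of error handling are my own simplifications rather than the article's original implementation:

```python
import json
import urllib.request

def query_model(prompt, model="gpt-oss:20b", url="http://localhost:11434/api/chat"):
    # Send a single-turn chat request to the local Ollama server.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,           # return the full answer in one response
        "options": {
            "temperature": 0.0,    # keep the judge as deterministic as possible
            "seed": 123,
        },
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    return result["message"]["content"]

print(query_model("What is 1+2?"))  # e.g., "3"
```

For judging, the prompt passed to this function simply wraps the question, the reference answer, the candidate answer, and the rubric ("rate the answer on a scale from 1 to 5 ...") into a single string.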
The prompt we use for this wraps the question, the correct reference answer, the model’s answer, and the grading instructions into a single string. The model answer placed in the prompt is intended to represent the response produced by our own model in practice; for illustration purposes, we hardcode a plausible model answer here rather than generating it dynamically. (However, feel free to use the Qwen3 model we loaded at the beginning of this article to generate a real one.) Next, we generate the rendered prompt for the Ollama model; the way the prompt ends nudges the judge to produce the score directly. Letting the gpt-oss:20b model judge the response, we can see that the answer receives the highest score, which is reasonable, as it is indeed correct.

While this was a simple example stepping through the process manually, we could take this idea further and implement a for-loop that iteratively queries the model (for example, the Qwen3 model we loaded earlier) with questions from an evaluation dataset, evaluates the responses via gpt-oss, and calculates the average score. You can find an implementation of such a script, where we evaluate the Qwen3 model on the MATH-500 dataset, here on GitHub.

Figure 14: A comparison of the Qwen3 0.6B base and reasoning variants on the first 10 examples in MATH-500 evaluated by gpt-oss:20b as a judge. You can find the code here on GitHub.

Related to symbolic verifiers and LLM judges, there is a class of learned models called process reward models (PRMs). Like judges, PRMs can evaluate reasoning traces beyond just the final answer, but unlike general judges, they focus specifically on the intermediate steps of reasoning. And unlike verifiers, which check correctness symbolically and usually only at the outcome level, PRMs provide step-by-step reward signals during training in reinforcement learning. We can categorize PRMs as “step-level judges,” which are predominantly developed for training, not pure evaluation. (In practice, PRMs are difficult to train reliably at scale. For example, DeepSeek R1 did not adopt PRMs and instead used verifiers for its reasoning training.)

Judge-based evaluations offer advantages over preference-based leaderboards, including scalability and consistency, as they do not rely on large pools of human voters. (Technically, it is possible to outsource the preference-based rating behind leaderboards to LLM judges as well.) However, LLM judges also share similar weaknesses with human voters: results can be biased by model preferences, prompt design, and answer style. Also, there is a strong dependency on the choice of judge model and rubric, and they lack the reproducibility of fixed benchmarks.

In this article, we covered four different evaluation approaches: multiple choice, verifiers, leaderboards, and LLM judges. I know this was a long article, but I hope you found it useful for getting an overview of how LLMs are evaluated. A from-scratch approach like this can be verbose, but it is a great way to understand how these methods work under the hood, which in turn helps us identify weaknesses and areas for improvement. That being said, you are probably wondering, “What is the best way to evaluate an LLM?” Unfortunately, there is no single best method since, as we have seen, each comes with different trade-offs.
In short:

Multiple-choice
(+) Relatively quick and cheap to run at scale
(+) Standardized and reproducible across papers (or model cards)
(-) Measures basic knowledge recall
(-) Does not reflect how LLMs are used in the real world

Verifiers
(+) Standardized, objective grading for domains with ground truth
(+) Allows free-form answers (with some constraints on final answer formatting)
(+) Can also score intermediate steps if using process verifiers or process reward models
(-) Requires verifiable domains (for example, math or code), and building good verifiers can be tricky
(-) Outcome-only verifiers evaluate only the final answer, not reasoning quality

Arena-style leaderboards (human pairwise preference)
(+) Directly answers “Which model do people prefer?” on real prompts
(+) Allows free-form answers and implicitly accounts for style, helpfulness, and safety
(-) Expensive and time-intensive for humans
(-) Does not measure correctness, only preference
(-) Nonstationary voter populations can affect stability

LLM-as-a-judge
(+) Scalable across many tasks
(+) Allows free-form answers
(-) Dependent on the judge’s capability (ensembles can make this more robust)
(-) Depends on rubric choice

While I am usually not a big fan of radar plots, one can be helpful here to visualize these different evaluation areas, as shown below.

Figure 15: A radar chart showing conceptually that we ideally want to pay attention to different areas when evaluating an LLM to identify its strengths and weaknesses.

For instance, a strong multiple-choice rating suggests that the model has solid general knowledge. Combine that with a strong verifier score, and the model is likely also answering technical questions correctly. However, if the model performs poorly on LLM-as-a-judge and leaderboard evaluations, it may struggle to write or articulate responses effectively and could benefit from some RLHF.

So, the best evaluation combines multiple areas. But ideally it also uses data that directly aligns with your goals or business problems. For example, suppose you are implementing an LLM to assist with legal or law-related tasks. It makes sense to run the model on standard benchmarks like MMLU as a quick sanity check, but ultimately you will want to tailor the evaluations to your target domain, such as law. You can find public benchmarks online that serve as good starting points, but in the end, you will want to test with your own proprietary data. Only then can you be reasonably confident that the model has not already seen the test data during training.

In any case, model evaluation is a very big and important topic. I hope this article was useful in explaining how the main approaches work, and that you took away a few useful insights for the next time you look at model evaluations or run them yourself. As always, Happy tinkering!

This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I’m confident you’ll get a lot out of these; they explain how LLMs work in a depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I’d really appreciate a brief review. It helps us authors a lot! Your support means a great deal! Thank you!
Walkthrough for a small embedded system based on ATmega328p which controls a PC cooler fan. The focus of this experiment is on using only open-source CLI software tooling for the solution. Additionally, no development boards were used, just the bare microcontroller, which should be helpful in transitioning to building your own boards.
I got asked how I manage papers, notes, and citations for doing research. I started writing out a very long Slack message, but it quickly passed the threshold where I ought to just turn it into a blog post. The short of it: I’m an incorrigible Emacs user, so I do a lot through my editor of choice on my laptop. That said, Zotero is a fabulous piece of technology, and I rely on it heavily to get my work done.

Use Zotero in some capacity. Zotero is great. You should use it at a minimum for collecting papers and keeping paper metadata. It’s completely free and open source. It has excellent apps for iOS and Android so you can read and mark up papers on a tablet and access everything on your desktop, but that’s optional. It’s so smart about finding citation information: drag a PDF into it and it will look for the DOI or something and auto-populate the relevant bibliographic information. It’s not perfect, but it’s still pretty darn helpful. When you’re starting out, I recommend using Zotero’s hosted syncing purely because it’s so easy to use. If you start being a paper packrat and need more than the 300 MB limit, you can self-host or pay a little for more storage. (I’m using 797 MB after several years of heavy Zotero use—I even have a few books in my library!) The lovely thing is you don’t have to commit to syncing up-front. You can start with purely local storage too if you want.

If you’re a LaTeX user like me, you should use the Better BibTeX package. You can configure it to make a .bib file for your entire library or just certain collections. I keep a big .bib file for my entire library and then separate files for each paper I write. As long as I am the sole author, that is. My advisor prefers manually managing bibliographies, so what I tend to do is manually copy the reference information from my main file into the file for our shared paper.

I’m as close to an Emacs maximalist as you will find. Nevertheless, I prefer reading and most note-taking outside of Emacs. I read and annotate papers on my iPad, and Zotero syncs the annotations to my desktop. When I’m writing papers, I use the Citar package in Emacs. This makes it easy to find references and insert citations. It works for Markdown, Org-mode, and LaTeX files. If you’re wondering whether or not it can do a particular thing, the answer is going to be “yes” or “there’s a package to do that” or “it’s easy to add that functionality” or “I don’t know but Claude could probably get you pretty close in modifying it to do that.” I’ll still take some notes on a paper inside of Emacs, but Zotero is how I primarily manage annotations. When I do a literature review I’ll make a big note in Emacs and just link to the papers that I’m referencing.

If you are a plain-text maximalist and like to sync everything via Git or something, then you should be using Emacs. If you are strong enough to resist the pull of closed-format tools for this long, Emacs is for you. It is not a text editor; it is a toolkit to build your ideal text editor. If getting started is intimidating, try out my starter kit, which is basically a set of sane defaults with helpful commentary on how to customize it further. Using Emacs will enable you to build a workflow that is exactly tailored to your idiosyncrasies. It’s an investment, but a worthy one.

So, if you are committed to the Emacs + plain text way, here is what I would recommend:
- Still use Zotero to store papers & associated metadata. Don’t use it for annotations though.
- Use Emacs and install the Citar package.
  It ships with a function that can help you jump from Emacs → Zotero entry. I use this a lot.
- Use the Denote Zettelkasten-style personal knowledge management (PKM) system. This provides utilities to create notes with tags, links (and automatic backlinks!), etc., all in plain text. Sync this with Git or whatever.
- Tie Denote and Citar together with the citar-denote package. Now, when you search for a paper with Citar, you can open a notes file for that paper. When you do, you’ll get a split screen: paper on the right, notes file on the left.
- If you use the pdf-tools package (and you should) then you can even add annotations to the PDF inside of Emacs!

The most important thing is that you build your own system. You have to own it. You might find it easier to adopt someone else’s system, but you should be intentional about every habit you acquire. Be prepared to iterate. I used to be rather rigid with how I organized papers. I found that extreme structure was more constricting than helpful, so there’s a little messiness with how I’m organized, and I’m OK with that. If you want to know exactly how I configure any of the above packages in Emacs, feel free to contact me.
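If you just want something to start from, here is a rough sketch of how these pieces can be wired together. This is a generic example, not my actual configuration, and the paths are placeholders you would adapt:

```elisp
;; A rough starting point, not my actual setup; adjust the paths to taste.
(use-package citar
  :custom
  (citar-bibliography '("~/bib/library.bib"))   ; the file Better BibTeX exports
  (citar-notes-paths '("~/notes/")))

(use-package denote
  :custom
  (denote-directory "~/notes/"))

;; Glue so Citar can create and open Denote notes for a reference.
(use-package citar-denote
  :after (citar denote)
  :config
  (citar-denote-mode 1))
```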
This is a chapter from my book on Go concurrency, which teaches the topic from the ground up through interactive examples.

Some concurrent operations don't require explicit synchronization. We can use these to create lock-free types and functions that are safe to use from multiple goroutines. Let's dive into the topic!

Non-atomic increment • Atomic operations • Composition • Atomic vs. mutex • Keep it up

Suppose multiple goroutines increment a shared counter: there are 5 goroutines, and each one increments 10,000 times, so the final result should be 50,000. But it's usually less, and running the code a few more times confirms it. The race detector also reports a problem. This might seem strange — shouldn't the increment be atomic? Actually, it's not. It involves three steps (read-modify-write):

Read the current value of the counter.
Add one to it.
Write the new value back.

If two goroutines both read the same value, and each then increments it and writes it back, the counter grows by one instead of two. As a result, some increments to the counter will be lost, and the final value will be less than 50,000.

As we talked about in the Race conditions chapter, you can make an operation atomic by using mutexes or other synchronization tools. But for this chapter, let's agree not to use them. Here, when I say "atomic operation", I mean an operation that doesn't require the caller to use explicit locks, but is still safe to use in a concurrent environment. An operation without synchronization can only be truly atomic if it translates to a single processor instruction. Such operations don't need locks and won't cause issues when called concurrently (even the write operations).

In a perfect world, every operation would be atomic, and we wouldn't have to deal with mutexes. But in reality, there are only a few atomics, and they're all found in the sync/atomic package. This package provides a set of atomic types:

Bool — a boolean value;
Int32 / Int64 — a 4- or 8-byte integer;
Uint32 / Uint64 — a 4- or 8-byte unsigned integer;
Value — a value of any type;
Pointer[T] — a pointer to a value of type T (generic).

Each atomic type provides the following methods: Load reads the value of a variable; Store sets a new value; Swap sets a new value (like Store) and returns the old one; CompareAndSwap sets a new value only if the current value is still what you expect it to be. Numeric types also provide an Add method that increments the value by the specified amount, and the And / Or methods for bitwise operations (Go 1.23+).

All methods are translated to a single CPU instruction, so they are safe for concurrent calls. Strictly speaking, this isn't always true. Not all processors support the full set of concurrent operations, so sometimes more than one instruction is needed. But we don't have to worry about that — Go guarantees the atomicity of operations for the caller. It uses low-level mechanisms specific to each processor architecture to do this.

Like other synchronization primitives, each atomic variable has its own internal state. So, you should only pass it as a pointer, not by value, to avoid accidentally copying the state. When using Value, all loads and stores should use the same concrete type; otherwise the code will panic.

Now, let's go back to the counter program and rewrite it to use an atomic counter. Much better!

✎ Exercise: Atomic counter +1 more
Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it. If you are okay with just theory for now, let's continue.

An atomic operation in a concurrent program is a great thing. Such an operation usually translates into a single processor instruction, and it does not require locks. You can safely call it from different goroutines and receive a predictable result.
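The interactive code listings aren't embedded in this excerpt, so here is a self-contained sketch of the counter in both flavors: the racy plain-int version and the fixed one using atomic.Int64. The structure and variable names are mine; only the 5 × 10,000 setup comes from the text.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// Racy version: total++ is a read-modify-write, not a single instruction.
	var total int
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				total++ // data race: run with `go run -race .` to see the report
			}
		}()
	}
	wg.Wait()
	fmt.Println("plain int:   ", total) // usually less than 50000

	// Atomic version: Add is safe to call from multiple goroutines.
	var counter atomic.Int64
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				counter.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("atomic.Int64:", counter.Load()) // always 50000
}
```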
But what happens if you combine atomic operations? Let's find out. Let's look at a function that increments a counter. As you already know, it isn't safe to call from multiple goroutines because the increment causes a data race. Now I will try to fix the problem and propose several options. In each case, answer the question: if you call the function from 100 goroutines, is the final value of the counter guaranteed?

First option: is the value guaranteed? It is guaranteed.
Second option: is the value guaranteed? It's not guaranteed.
Third option: is the value guaranteed? It's not guaranteed.

People sometimes think that the composition of atomic operations also magically becomes an atomic operation. But it doesn't. Take the second of the above examples and call the function 100 times from different goroutines. Run the program with the -race flag — there are no data races. But can we be sure what the final value of the counter will be? Nope. The individual atomic calls are interleaved from different goroutines. This causes a race condition (not to be confused with a data race) and leads to an unpredictable value.

Check yourself by answering the question: in which example is the increment an atomic operation? In none of them. In all three examples, the increment as a whole is not an atomic operation. The composition of atomics is always non-atomic.

The first example, however, guarantees the final value of the counter in a concurrent environment: if we run 100 goroutines, the counter will ultimately equal 200. The reason is that it only uses a sequence-independent operation. The runtime can perform such operations in any order, and the result will not change. The second and third examples use sequence-dependent operations. When we run 100 goroutines, the order of operations is different each time. Therefore, the result is also different.

A bulletproof way to make a composite operation atomic and prevent race conditions is to use a mutex. But sometimes an atomic variable with CompareAndSwap is all you need. Let's look at an example.

✎ Exercise: Concurrent-safe stack
Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it. If you are okay with just theory for now, let's continue.

Let's say we have a gate that needs to be closed. In a concurrent environment, there are data races on the gate's state field. We can fix this with a mutex. Alternatively, we can use CompareAndSwap on an atomic value instead of a mutex: the gate type becomes more compact and simple. This isn't a very common use case — we usually want a goroutine to wait on a locked mutex and continue once it's unlocked. But for "early exit" situations, it's perfect.
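Here is a sketch of what such a gate might look like with an atomic.Bool; the type and method names are my own illustration, not the book's listing.

```go
package gate

import "sync/atomic"

// Gate lets many goroutines do work until someone closes it.
type Gate struct {
	closed atomic.Bool
}

// Close marks the gate as closed. It reports whether this call was
// the one that actually closed it (false if it was already closed).
func (g *Gate) Close() bool {
	// CompareAndSwap flips false -> true exactly once, atomically.
	return g.closed.CompareAndSwap(false, true)
}

// Open reports whether the gate is still open; callers use it for an
// early exit instead of blocking on a mutex.
func (g *Gate) Open() bool {
	return !g.closed.Load()
}
```

A worker goroutine checks `g.Open()` at the top of its loop and simply returns as soon as the gate has been closed.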
A matrix can be seen as a function, so mathematically writing p'=M \cdot p would be equivalent to the code with: Doing p'=M \cdot p is rotating the space p lies into, which means it gives the illusion the object is rotating clockwise . Though, in the expression , I can't help but be bothered by the redundancy of , so I would prefer to write instead. Since matrices are not commutative, this will instead do a counter-clockwise rotation of the object. The inlined rotation ends up being: To make the rotation clockwise, we can of course use , or we can transpose the matrix: . This is problematic though: we need to repeat the angle 4 times, which can be particularly troublesome if we want to create a macro and/or don't want an intermediate variable for the angle. But I got you covered: trigonometry has a shitton of identities, and we can express every according to a (and the other way around). For example, here is another formulation of the same expression: Now the angle appears only once, in a vectorized cosine call. GLSL has and functions, but it doesn't expose anything for \pi nor \tau constants. And of course, it doesn't have and implementation either. So it's obvious they want us to use \arccos(-1) for \pi and \arccos(0) for \pi/2 : To specify as a normalized value, we can use . On his Unofficial Shadertoy blog, Fabrice Neyret goes further and provide us with a very cute approximation , which is the one we will use: I checked for the best numbers in 2 digits , and I can confirm they are indeed the ones providing the best accuracy. On this last figure, the slight red/green on the outline of the circle represents the loss of precision. With 3 digits, and can respectively be used instead of and . This is good when we want a dynamic rotation angle (we will need that for the camera panning typically), but sometimes we just need a hardcoded value: for example in the of our combined noise function. is fine but we can do better. Through Inigo's demos I found the following: . It makes a rotation angle of about 37° (around 0.64 radians) in a very tiny form. Since 0.5 was pretty much arbitrary, we can just use this matrix as well. And we can make it even smaller (thank you jolle ): One last rotation tip from Fabrice's bag of tricks: rotating in 3D around an axis can be done with the help of GLSL swizzling: We will use this too. is the same , if we need to save one character and can't transpose the matrix. One last essential before going creative is the camera setup. We start with the 2D pixel coordinates which we are going to make resolution independent by transforming them into a traditional mathematical coordinates system: Since we know our demo will be rendered in landscape mode, dividing by is enough. We can also save one character using : To enter 3D space, we append a third component, giving us either a right or a left-handed Y-up coordinates system. This choice is not completely random. Indeed, it's easier/shorter to add a 3rd dimension at the end compared to interleaving a middle component. Compare the length of to (Z-up convention). In the former case, picking just a plane remains short and easy thanks to swizzling: instead of . To work in 3D, we need an origin point ( for ray origin) and a looking direction ( for ray direction). is picked arbitrarily for the eye position, while is usually calculated thanks to a helper: Which is then used like that, for example: I made a Shadertoy demo to experiment with different 3D coordinate spaces if you are interested in digging this further. 
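Since the helper itself appeared as code in the original post, here is a sketch of the kind of look-at construction the text refers to, written in the usual Shadertoy style. The function names, the eye position, and the shading are placeholders for illustration, not the demo's exact code.

```glsl
// Build an orthonormal camera basis pointing from the eye (ro) to a target (ta).
mat3 lookat(vec3 ro, vec3 ta) {
    vec3 fwd = normalize(ta - ro);                     // forward axis
    vec3 right = normalize(cross(vec3(0, 1, 0), fwd)); // right axis, Y-up convention
    vec3 up = cross(fwd, right);                       // recomputed up axis
    return mat3(right, up, fwd);
}

void mainImage(out vec4 o, in vec2 u) {
    // resolution-independent coordinates, as described above
    vec2 p = (u - .5 * iResolution.xy) / iResolution.y;
    vec3 ro = vec3(0, 1, -3);                          // arbitrary eye position
    vec3 rd = lookat(ro, vec3(0)) * normalize(vec3(p, 1));
    o = vec4(.5 + .5 * rd, 1);                         // just visualize the ray direction
}
```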
All of this is perfectly fine because it is flexible, but it's also way too much unnecessary code for our needs, so we need to shrink it. One approach is to pick a simple origin and straight target point so that the matrix is as simple as possible. And then later on apply some transformations on the point. If we give and , we end up with an identity matrix, so we can ditch everything and just write: This can be shortened further: since the vector is normalized anyway, we can scale it at will, for example by a factor , saving us precious characters: And just like that, we are located at the origin , looking toward Z+, ready to render our scene.

It's finally time to build our scene. We're going to start with our function previously defined, but we're going to tweak it in various ways to craft a mountain height map function. Here is our first draft: We're exploiting one important correlation of the noise function: at every octave, the amplitude is halving while the frequency is doubling. So instead of having 2 running variables, we just have an amplitude getting halved every octave, and we divide our position by (which is the same as multiplying by a frequency that doubles itself). I actually like this way of writing the loop because we can stop the loop when the amplitude is meaningless ( acts as a precision stopper). Unfortunately, we'll have to change it to save one character: is too long for the iteration, so we're going to double instead by using , which saves one character. So the loop will be written the other way around: . It's not exactly equivalent, but it's good enough (and we can still tweak the values if necessary). We're going to inline the constants and rotate, and use one more cool trick: can be shortened: we just need another . Luckily we have , so we can simply write . Similarly, if we needed we could have written (it works also like that: to shorten ). We can also get rid of the braces of the loop by using the in its local scope. In the end, this is our function:

To render this in 3D, we are going to do some ray-marching, the main technique used in most Shadertoy demos. I will assume familiarity with the technique, but if that's not the case, An introduction to Raymarching (YouTube) by kishimisu and Painting with Math: A Gentle Study of Raymarching by Maxime Heckel were good resources for me. In short: we start from a position in space called the ray origin and we project it toward a ray direction . At every iteration we check the distance to the closest solid in our scene, and step forward by that distance, hoping to converge closer and closer to the object boundary. We end up with this main loop template:

This works fine for solids expressed with 3D distance fields , that is functions that for a given point give the distance to the object. We will use it for our mountain, with one subtlety: the noise height map of the mountain is not exactly a distance (it is only the distance to what's below our current point ): Because of this, we can't step by the distance directly, or we're likely to go through mountains during the stepping ( ). A common workaround here is to step a certain percentage of that distance to play it safe. Technically we should figure out the theoretical proper shrink factor , but we're going to take a shortcut today and just cut it down arbitrarily. Using trial and error I ended up with 20% of the distance. After a few simplifications, we end up with the following (complete) code: We start at so I dropped the variable entirely.
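Here is a minimal self-contained sketch of that height-map marching idea, with a stand-in noise function and the 20% damped step. The constants, the stand-in terrain, and the shading are illustrative, not the demo's code.

```glsl
float hmap(vec2 p) {
    return .5 * sin(p.x) * sin(p.y);   // stand-in for the real octave-based height map
}

void mainImage(out vec4 o, in vec2 u) {
    vec2 R = iResolution.xy;
    vec3 ro = vec3(0, 1, 0),                          // eye slightly above the ground
         rd = normalize(vec3((u - .5 * R) / R.y, 1)); // looking toward Z+
    float t = 0.;
    for (int i = 0; i < 100; i++) {
        vec3 p = ro + rd * t;
        float d = p.y - hmap(p.xz); // height above the terrain: not a true distance field
        t += .2 * d;                // so only step 20% of it, to avoid tunneling through peaks
    }
    o = vec4(vec3(exp(-.05 * t)), 1); // crude distance fog, just to see the shape
}
```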
Also, to avoid the division by 0 in in , is moved right at the beginning (we could also initialize to a value slightly different than 0). You may be curious about the power at the end; this is just a combination of luminance perception with the gamma 2.2 (sRGB) transfer function. It only works well for grayscale; for more information, see my previous article on blending .

Compared to the mountain, the clouds and fog will need a 3-dimensional noise. Well, we don't need to be very original here; we simply extend the 2D noise to 3D: The base frequency is lowered to to make it smoother, and the goes from 2 to 3 dimensions. Notice how the rotation is only done on the y-axis (the one pointing up): don't worry, it's good enough for our purpose. We also add a phase (meaning we are offsetting the sinusoid) of ( is the time in seconds, slowed down by the multiply) to slowly morph it over time. The base frequency and time scale being identical is a happy "coincidence" to be factored out later (I actually forgot about it until jolle reminded me of it). You also most definitely noticed isn't explicitly initialized: while this is only true for WebGL, it guarantees zero initialization, so we're saving a few characters here.

For volumetric material (clouds and fog), the loop is a bit different: instead of calculating the distance to the solid for our current point , we compute the density of our target "object". Funny enough, it can be thought of as a 3D SDF but with the sign flipped: positive inside (because the density increases as we go deeper) and negative outside (there is no density, we're not in it). For simplicity, we're going to rewrite the function like this: Compared to the solid ray-marching loop, the volumetric one doesn't bail out when it reaches the target. Instead, it slowly steps into it, damping the light as the density increases: The core idea is that the volumetric material emits some radiance but also absorbs the atmospheric light. The deeper we get, the smaller the transmittance gets, until it converges to 0 and stops all light. All the thresholds you see are chosen by tweaking them through trial and error, not by any particular logic. They are also highly dependent on the total number of iterations. Steps get larger and larger as the distance increases; this is because we don't need as much precision per "slice", but we still want to reach a long distance.

We want to be positioned below the clouds, so we're going to need a simple sign flip in the function. The fog will take its place at the bottom, except upside down (the sharpness will give a mountain-hug feeling) and at a different position. becomes:

Having a single ray-marching loop combining the two methods (solid and volumetric) can be challenging. In theory, we should stop the marching when we hit a solid, bail out of the loop, and do some fancy normal calculations along with light positioning. We can't afford any of that, so we're going to start doing art from now on. We start from the volumetric ray-marching loop, and add the distance to the mountain: If gets small enough, we can assume we hit a solid: In the volumetric case, the attenuation is calculated with the Beer-Lambert law. For the solid, we're simply going to make it fairly high: This has the effect of making the mountain behave like a very dense gas.
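Before the solid and volumetric loops get merged, here is a compact sketch of the volumetric loop on its own: a density instead of a distance, a little emitted radiance, and a transmittance that decays Beer-Lambert style. The density function and every constant here are placeholders chosen for illustration, not the demo's values.

```glsl
float density(vec3 p) {
    // a soft blob standing in for the 3D cloud noise
    return max(0., 1. - length(p - vec3(0, 2, 6)) * .3);
}

void mainImage(out vec4 o, in vec2 u) {
    vec2 R = iResolution.xy;
    vec3 ro = vec3(0),
         rd = normalize(vec3((u - .5 * R) / R.y, 1));
    float t = .1, T = 1., L = 0.;        // distance, transmittance, accumulated light
    for (float i = 1.; i < 100.; i++) {
        vec3 p = ro + rd * t;
        float d = density(p);
        L += T * d * .03;                // the medium emits a little radiance...
        T *= exp(-d * .03);              // ...and absorbs light on the way through
        t += .02 + .002 * i;             // steps grow with distance: less precision far away
    }
    o = vec4(vec3(L + .2 * T), 1);       // leftover transmittance lights the background
}
```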
We're also going to disable the light emission from the solid (it will be handled differently down the line): The transmittance is not going to be changed when we hit a solid, as we just want to accumulate light onto it: Finally, we have to combine the volumetric stepping ( ) with the solid stepping ( ) by choosing the safest step length, that is the minimum: We end up with the following:

We can notice the mountain from the negative space and the discreet presence of the fog, but it's definitely way too dark. So the first thing we're going to do is boost the radiance, as well as the absorption for the contrast: This will make the light actually overshoot, so we also have to replace the current gamma 2.2 correction with a cheap and simple tone mapping hack : . Halving the color is yet another tweak that is not based on anything but trial and error. There might be cleverer ways to reach the same result, but I leave that up to the reader: The clouds and fog are much better but the mountain is still trying to act cool. So we're going to tweak it in the loop: This boosts the overall emission. While we're at it, since the horizon is also sadly dark, we want to blast some light into it: When the density is null (meaning we're outside the clouds and fog), an additional light is added, proportional to how far we are from any solid (the sky gets the most boost, basically). The mountain looks fine but I wanted a more eerie atmosphere, so I changed the attenuation: Now instead of being a hard value, the attenuation is correlated with the proximity to the solid (when getting close to it). This has nothing to do with any physics formula or anything; it's more of an implementation trick which relies on the ray-marching algorithm. The effect it creates is those crack-like polygon edges on the mountain. To add more to the effect, the emission boost is tweaked into: This makes the bottom of the mountain darker quadratically: only the tip of the mountain would have the glowing cracks.

We've been working in grayscale so far, which is usually a sound approach to visual art in general. But we can afford a few more characters to move the scene toward a decent piece of art from the 21st century. Adding the color just requires very tiny changes. First, the emission boost is going to target only the red component of the color: And similarly, the overall addition of light into the horizon/atmosphere is going to get a reddish/orange tint:

We're almost done. For the last tweak, we're going to add a cyclic panning rotation of the camera, and adjust the moving speed: I'm currently satisfied with the "seed" of the scene, but otherwise it would have been possible to nudge the noise in different ways. For example, remember that the can be replaced with in either or both the volumetric and mountain-related noises. Similarly, the offsetting could be changed into for a different morphing effect. And of course the rotations can be swapped (either by changing into or transposing the values).

At this point, our code went through the early stages of code golfing, but it still needs some work to reach perfection. Stripped of its comments, it looks like this: The first thing we're going to do is notice that the mountain, the clouds, and the fog all use the exact same loop.
Factoring them out and inlining the whole thing in the main function is the obvious move: Next, we are going to make the following changes: Onto the next pass of tricks: I'm also reordering some instructions a bit for clarity 🙃 The last touch is going to be nasty: we're going to reorder the instructions such that the 2nd loop is located at the very beginning of the 1st one: "Why?!" you may ask. Before answering this question, let's see why it still works: the first iteration ends up being executed with , where most calculations just cancel themselves out, leading to one wasted iteration (out of 100). Visually, it makes zero difference. But thanks to this weird change, we end up with a bunch of instructions that we can pack into the last placeholder of the main loop, comma separated. This notably allows us to drop the of the main loop:

And here we are. All we have to do now is remove all unnecessary spaces and line breaks to obtain the final version. I'll leave you here with this readable version.

I'm definitely breaking the magic of that artwork by explaining everything in detail here. But it should be replaced with an appreciation for how many concepts, how much math, and how much art can be packed into so little space. Maybe this is possible because they fundamentally overlap? Nevertheless, writing such a piece was extremely refreshing and liberating. As developers, we're so used to navigating through mountains of abstractions, dealing with interoperability issues, and pissing glue code like robots. Here, even though GLSL is a very crude language, I can't help but be in awe at how much beauty we can produce with a standalone shader. It's just... pure code and math, and I just love it.
This post was originally given as a talk for JJ Con . The slides are also available. Welcome to “stupid jj tricks”. Today, I’ll be taking you on a tour through many different jj configurations that I have collected while scouring the internet. Some of what I’ll show is original research or construction created by me personally, but a lot of these things are sourced from blog post, gists, GitHub issues, Reddit posts, Discord messages, and more. To kick things off, let me introduce myself. My name is André Arko, and I’m probably best known for spending the last 15 years maintaining the Ruby language dependency manager, Bundler. In the world, though, my claim to fame is completely different: Steve Klabnik once lived in my apartment for about a year, so I’m definitely an authority on everything about . Thanks in advance for putting into the official tutorial that whatever I say here is now authoritative and how things should be done by everyone using , Steve. The first jj tricks that I’d like to quickly cover are some of the most basic, just to make sure that we’re all on the same page before we move on to more complicated stuff. To start with, did you know that you can globally configure jj to change your name and email based on a path prefix? You don’t have to remember to set your work email separately in each work repo anymore. I also highly recommend trying out multiple options for formatting your diffs, so you can find the one that is most helpful to you. A very popular diff formatter is , which provides syntax aware diffs for many languages. I personally use , and the configuration to format diffs with delta looks like this: Another very impactful configuration is which tool jj uses to handle interactive diff editing, such as in the or commands. While the default terminal UI is pretty good, make sure to also try out Meld, an open source GUI. In addition to changing the diff editor, you can also change the merge editor, which is the program that is used to resolve conflicts. Meld can again be a good option, as well as any of several other merging tools. Tools like mergiraf provide a way to attempt syntax-aware automated conflict resolution before handing off any remaining conflicts to a human to resolve. That approach can dramatically reduce the amount of time you spend manually handling conflicts. You might even want to try FileMerge, the macOS developer tools built-in merge tool. It supports both interactive diff editing and conflict resolution. Just two more configurations before we move on to templates. First, the default subcommand, which controls what gets run if you just type and hit return. The default is to run , but my own personal obsessive twitch is to run constantly, and so I have changed my default subcommand to , like so: The last significant configuration is the default revset used by . Depending on your work patterns, the multi-page history of commits in your current repo might not be helpful to you. In that case, you can change the default revset shown by the log command to one that’s more helpful. My own default revset shows only one change from my origin. If I want to see more than the newest change from my origin I use to get the longer log, using the original default revset. I’ll show that off later. Okay, enough of plain configuration. Now let’s talk about templates! Templates make it possible to do many, many things with jj that were not originally planned or built in, and I think that’s beautiful. 
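Before moving on to templates, here is roughly what those basic settings look like in a config.toml. The paths, email, and the default revset are placeholders to adapt; the conditional-scope syntax and the ui/revsets keys are the documented ones, but double-check them against your jj version.

```toml
# Use the work identity inside ~/work, the default one everywhere else.
[[--scope]]
--when.repositories = ["~/work"]
[--scope.user]
email = "andre@work.example"

[ui]
# Bare `jj` runs `jj status` instead of `jj log`.
default-command = "status"

[revsets]
# Default log: just my current stack plus trunk (placeholder revset).
log = "@ | ancestors(immutable_heads()..@, 2) | trunk()"
```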
First, if you haven’t tried this yet, please do yourself a favor and go try every builtin jj template style for the command. You can list them all with , and you can try them each out with . If you find a builtin log style that you especially like, maybe you should set it as your default template style and skip the rest of this section. For the rest of you sickos, let’s see some more options. The first thing that I want to show you all is the draft commit description. When you run , this is the template that gets generated and sent to your editor for you to complete. Since I am the kind of person who always sets git commit to verbose mode, I wanted to keep being able to see the diff of what I was committing in my editor when using jj. Here’s what that looks like: If you’re not already familiar with the jj template functions, this uses to combine strings, to choose the first value that isn’t empty, to add before+after if the middle isn’t empty, and to make sure the diff status is fully aligned. With this template, you get a preview of the diff you are committing directly inside your editor, underneath the commit message you are writing. Now let’s look at the overridable subtemplates. The default templates are made of many repeated pieces, including IDs, timestamps, ascii art symbols to show the commit graph visually, and more. Each of those pieces can be overrides, giving you custom formats without having to change the default template that you use. For example, if you are a UTC sicko, you can change all timestamps to render in UTC like , with this configuration: Or alternatively, you can force all timestamps to print out in full, like (which is similar to the default, but includes the time zone) by returning just the timestamp itself: And finally you can set all timestamps to show a “relative” distance, like , rather than a direct timestamp: Another interesting example of a template fragment is supplied by on GitHub, who changes the node icon specifically to show which commits might be pushed on the next command. This override of the template returns a hollow diamond if the change meets some pushable criteria, and otherwise returns the , which is the regular icon. It’s not a fragment, but I once spent a good two hours trying to figure out how to get a template to render just a commit message body, without the “title” line at the top. Searching through all of the built-in jj templates finally revealed the secret to me, which is a template function named . With that knowledge, it becomes possible to write a template that returns only the body of a commit message: We first extract the title line, remove that from the front, and then trim any whitespace from the start of the string, leaving just the description body. Finally, I’d like to briefly look at the possibility of machine-readable templates. Attempting to produce JSON from a jj template string can be somewhat fraught, since it’s hard to tell if there are quotes or newlines inside any particular value that would need to be escaped for a JSON object to be valid when it is printed. Fortunately, about 6 months ago, jj merged an function, which makes it possible to generate valid JSON with a little bit of template trickery. For example, we could create a output of a JSON stream document including one JSON object per commit, with a template like this one: This template produces valid JSON that can then be read and processed by other tools, looks like this. 
Templates have vast possibilities that have not yet been touched on, and I encourage you to investigate and experiment yourself. Now let’s look at some revsets. The biggest source of revset aliases that I have seen online is from @thoughtpolice’s jjconfig gist, but I will consolidate across several different config files here to demonstrate some options. The first group of revsets roughly corresponds to “who made it”, and composes well with other revsets in the future. For example, it’s common to see a type alias, and a type alias to let the current user easily identify any commits that they were either author or committer on, even if they used multiple different email addresses. Another group uses description prefixes to identify commits that have some property, like WIP or “private”. It’s then possible to use these in other revsets to exclude these commits, or even to configure jj to refuse to push them. Thoughtpolice seems to have invented the idea of a , which is a group of commits on top of some parent: Building on top of the stack, it’s possible to construct a set of commits that are “open”, meaning any stack reachable from the current commit or other commits authored by the user. By setting the stack value to 1, nothing from trunk or other remote commits is included, so every open commit is mutable, and could be changed or pushed. Finally, building on top of the open revset, it’s possible to define a “ready” revset that is every open change that isn’t a child of wip or private change: It’s also possible to create a revset of “interesting” commits by using the opposite kind of logic, as in this chain of revsets composed by . You take remote commits and tags, then subtract those from our own commits, and then show anything that is either local-only, tracking the remote, or close to the current commit. Now let’s talk about jj commands. You probably think I mean creating jj commands by writing our own aliases, but I don’t! That’s the next section. This section is about the jj commands that it took me weeks or months to realize existed, and understand how powerful they are. First up: . When I first read about absorb, I thought it was the exact inverse of squash, allowing you to choose a diff that you would bring into the current commit rather than eject out of the current commit. That is wildly wrong, and so I want to make sure that no one else falls victim to this misconception. The absorb command iterates over every diff in the current commit, finds the previous commit that changed those lines, and squashes just that section of the diff back to that commit. So if you make changes in four places, impacting four previous commits, you can to squash all four sections back into all four commits with no further input whatsoever. Then, . If you’re taking advantage of jj’s amazing ability to not need branches, and just making commits and squashing bits around as needed until you have each diff combined into one change per thing you need to submit… you can break out the entire chain of separate changes into one commit on top of trunk for each one by just running and letting jj do all the work for you. Last command, and most recent one: . You can use fix to run a linter or formatter on every commit in your history before you push, making sure both that you won’t have any failures and that you won’t have any conflicts if you try to reorder any of the commits later. 
To configure the fix command, add a tool and a glob in your config file, like this: Now you can just and know that all of your commits are possible to reorder without causing linter fix conflicts. It’s great. Okay. Now we can talk about command aliases. First up, the venerable . In the simplest possible form, it takes the closest bookmark, and moves that bookmark to , the parent of the current commit. What if you want it to be smarter, though? It could find the closest bookmark, and then move it to the closest pushable commit, whether that commit was , or , or . For that, you can create a revset for , and then tug from the closest bookmark to the closest pushable, like this: Now your bookmark jumps up to the change that you can actually push, by excluding immutable, empty, or descriptionless commits. What if you wanted to allow tug to take arguments, for those times when two bookmarks are on the same change, or when you actually want to tug a different bookmark than the closest one? That’s also pretty easy, by adding a second variant of the tug command that takes an argument: This version of tug works just like the previous one if no argument is given. But if you do pass an argument, it will move the bookmark with the name that you passed instead of the closest one. How about if you’ve just pushed to GitHub, and you want to create a pull request from that pushed bookmark? The command isn’t smart enough to figure that out automatically, but you can tell it which bookmark to use: Just grab the list of bookmarks attached to the closest bookmark, take the first one, pass it to , and you’re all set. What if you just want single commands that let you work against a git remote, with defaults tuned for automatic tugging, pushing, and tracking? I’ve also got you covered. Use to colocate jj into this git repo, and then track any branches from upstream, like you would get from a git clone. Then, you can to find the closest bookmark to , do a git fetch, rebase your current local commits on top of whatever just got pulled, and then show your new stack. When you’re done, just . This push handles looking for a huggable bookmark, tugging it, doing a git push, and making sure that you’re tracking the origin copy of whatever you just pushed, in case you created a new branch. Last, but definitely most stupid, I want to show off a few combo tricks that manage to deliver some things I think are genuinely useful, but in a sort of cursed way. First, we have counting commits. In git, you can pass an option to log that simply returns a number rather than a log output. Since jj doesn’t have anything like that, I was forced to build my own when I wanted my shell prompt to show how many commits beyond trunk I had committed locally. In the end, I landed on a template consisting of a single character per commit, which I then counted with . That’s the best anyone on GitHub could come up with, too . See? I warned you it was stupid. Next, via on Discord, I present: except for the closest three commits it also shows at the same time. Simply create a new template that copies the regular log template, while inserting a single conditional line that adds if the current commit is inside your new revset that covers the newest 3 commits. Easy. And now you know how to create the alias I promised to explain earlier. Last, but definitely most stupid, I have ported my previous melding of and over to , as the subcommand , which I alias to because it’s inspired by , the shell cd fuzzy matcher with the command . 
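(An aside in config form before the combo tricks: here is roughly what the fix tool setup and the "smarter" tug alias from above can look like. The tool choice, the glob, and the exact pushable criteria are illustrative sketches to adapt, not the talk's config.)

```toml
# Run rustfmt over every mutable commit's *.rs files on `jj fix`.
[fix.tools.rustfmt]
command = ["rustfmt", "--edition", "2021"]
patterns = ["glob:'**/*.rs'"]

[revset-aliases]
'closest_bookmark(to)' = 'heads(::to & bookmarks())'
'closest_pushable(to)' = 'heads(::to & mutable() & ~description(exact:"") & (~empty() | merges()))'

[aliases]
# Move the nearest bookmark up to the nearest commit that is actually pushable.
tug = ["bookmark", "move", "--from", "closest_bookmark(@)", "--to", "closest_pushable(@)"]
```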
This means you can to see a list of local bookmarks, or to see a list of all bookmarks including remote branches. Then, you can to do a fuzzy match on , and execute . Jump to work on top of any named commit trivially by typing a few characters from its name. I would love to also talk about all the stupid shell prompt tricks that I was forced to develop while setting up a zsh prompt that includes lots of useful jj information without slowing down prompt rendering, but I’m already out of time. Instead, I will refer you to my blog post about a jj prompt for powerlevel10k , and you can spend another 30 minutes going down that rabbit hole whenever you want. Finally, I want to thank some people. Most of all, I want to thank everyone who has worked on creating jj, because it is so good. I also want to thank everyone who has posted their configurations online, inspiring this talk. All the people whose names I was able to find in my notes include @martinvonz, @thoughtpolice, @pksunkara, @scott2000, @avamsi, @simonmichael, and @sunshowers. If I missed you, I am very sorry, and I am still very grateful that you posted your configuration. Last, I need to thank @steveklabnik and @endsofthreads for being jj-pilled enough that I finally tried it out and ended up here as a result. Thank you so much, to all of you.
Agents don't need to see websites with markup and styling; anything other than plain Markdown is just wasted money spent on context tokens. I decided to make my Astro sites more accessible to LLMs by having them return Markdown versions of pages when the header has or preceding . This was very heavily inspired by this post on X from bunjavascript . Hopefully this helps SEO too, since agents are a big chunk of my traffic. The Bun team reported a 10x token drop for Markdown and frontier labs pay per token, so cheaper pages should get scraped more, be more likely to end up in training data, and give me a little extra lift from assistants and search. Note: You can check out the feature live by running or in your terminal. Static site generators like Astro and Gatsby already generate a big folder of HTML files, typically in a or folder through an command. The only thing missing is a way to convert those HTML files to markdown. It turns out there's a great CLI tool for this called html-to-markdown that can be installed with and run during a build step using . Here's a quick Bash script an LLM wrote to convert all HTML files in to Markdown files in , preserving the directory structure: Once you have the conversion script in place, the next step is to make it run as a post-build action. Here's an example of how to modify your scripts section: Moving all HTML files to first is only necessary if you're using Cloudflare Workers, which will serve existing static assets before falling back to your Worker. If you're using a traditional reverse proxy, you can skip that step and just convert directly from to . Note: I learned after I finished the project that I could have added to my so I didn't have to move any files around. That field forces the worker to always run frst. Shoutout to the kind folks on reddit for telling me. I pushed myself to go out of my comfort zone and learn Cloudflare Workers for this project since my company uses them extensively. If you're using a traditional reverse proxy like Nginx or Caddy, you can skip this section (and honestly, you'll have a much easier time). If you're coming from traditional reverse proxy servers, Cloudflare Workers force you into a different paradigm. What would normally be a simple Nginx or Caddy rule becomes custom configuration, moving your entire site to a shadow directory so Cloudflare doesn't serve static assets by default, writing JavaScript to manually check headers and using to serve files. SO MANY STEPS TO MAKE A SIMPLE FILE SERVER! This experience finally made Next.js 'middleware' click for me. It's not actually middleware in the traditional sense of a REST API; it's more like 'use this where you would normally have a real reverse proxy.' Both Cloudflare Workers and Next.js Middleware are essentially JavaScript-based reverse proxies that intercept requests before they hit your application. While I'd personally prefer Terraform with a hyperscaler or a VPS for a more traditional setup, new startups love this pattern, so it's worth understanding. Here's an example of a working file to refer to a new worker script and also bind your build output directory as a static asset namespace: Below is a minimal worker script that inspects the header and serves markdown when requested, otherwise falls back to HTML: Pro tip: make the root path serve your sitemap.xml instead of markdown content for your homepage such that an agent visiting your root URL can see all the links on your site. 
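Here is a minimal sketch of that worker idea in TypeScript. The /md path scheme, the ASSETS binding name, and the wrangler assets setup it relies on are assumptions made for illustration, not the post's exact code.

```ts
// Serve Markdown to clients that ask for it via the Accept header,
// fall back to the normal HTML assets otherwise.
interface Env {
  ASSETS: { fetch(request: Request): Promise<Response> };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const accept = request.headers.get("Accept") ?? "";
    const url = new URL(request.url);

    if (accept.includes("text/markdown")) {
      // Per the tip above: agents hitting the root get the sitemap instead.
      if (url.pathname === "/") {
        return env.ASSETS.fetch(new Request(new URL("/sitemap.xml", url.origin).toString()));
      }
      // Map /posts/foo/ to /md/posts/foo/index.md (path scheme assumed for this sketch).
      const mdPath =
        "/md" + (url.pathname.endsWith("/") ? url.pathname + "index" : url.pathname) + ".md";
      const md = await env.ASSETS.fetch(new Request(new URL(mdPath, url.origin).toString()));
      if (md.ok) {
        return new Response(md.body, {
          headers: { "Content-Type": "text/markdown; charset=utf-8" },
        });
      }
    }
    return env.ASSETS.fetch(request); // everything else gets the regular HTML build
  },
};
```

This assumes the built site (and the converted Markdown tree) is bound as a static assets namespace named ASSETS in the wrangler config, as described above.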
It's likely much easier to set this system up with a traditional reverse proxy file server like Caddy or Nginx. Here's a simple Caddyfile configuration that does the same thing: I will leave Nginx configuration as an exercise for the reader or perhaps the reader's LLM of choice. By serving lean, semantic Markdown to LLM agents, you can achieve a 10x reduction in token usage while making your content more accessible and efficient for the AI systems that increasingly browse the web. This optimization isn't just about saving money; it's about GEO (Generative Engine Optimization) for a changed world where millions of users discover content through AI assistants. Astro's flexibility made this implementation surprisingly straightforward. It only took me a couple of hours to get both the personal blog you're reading now and patron.com to support this feature. If you're ready to make your site agent-friendly, I encourage you to try this out. For a fun exercise, copy this article's URL and ask your favorite LLM to "Use the blog post to write a Cloudflare Worker for my own site." See how it does! You can also check out the source code for this feature at github.com/skeptrunedev/personal-site to get started. I'm excited to see the impact of this change on my site's analytics and hope it inspires others. If you implement this on your own site, I'd love to hear about your experience! Connect with me on X or LinkedIn .
In 2025, there's no longer a single subscription that you can pay for to watch any new movie or TV show that comes out. Netflix, Disney, HBO, and even Apple now push you to pay a separate subscription just to watch that one new show that everyone's talking about -- and I'm sick of it. Thanks to a friend of mine, I recently got intrigued by the idea of seedboxing again. In a nutshell, instead of spending $ to pay for 5 different streaming services, you pay a single fee to have someone in an area with lax torrenting laws host a VPS for you -- where you can run a torrent client and a Plex server, download content, and stream it to your devices. I tried a few seedbox services, but the pricing didn't really work for me. And since I'm in the Philippines, many of them suffer from high latency, and even raw download speeds can be spotty. So I put my work hat on and decided to try spinning up my own media server, and I chose this stack: https://github.com/Rick45/quick-arr-Stack For people just getting into home media servers like myself, this stack can essentially be run with just , with a few modifications to the env values as necessary. (For Windows users running this on WSL like me, you'll need to change all containers using networking to instead, and expose all ports one by one. Most of them only need one port, except for the Plex container, which lists them here .) Once it's up, you get: The quick arr Stack repo has a much longer and thorough explanation of each component, as well as how to configure them. Once it's all up and running -- you now have access to any TV show or movie that you want, without paying ridiculous subscription fees to all those streaming apps! Deluge -- the torrent client Plex Media Server -- this should be obvious unless you don't know what Plex is; it hosts all your downloaded content and allows you to access it via the Plex apps or through a web browser Radarr -- a tool for searching for movies, presented in a much nicer interface than manually searching for individual torrents Sonarr -- a tool for searching for TV shows, and as I understand it, a fork of Radarr Prowlarr -- converts search requests from Radarr and Sonarr into torrent downloads in Deluge Bazarr -- automatically downloads subtitles for your downloaded media Torrenting is illegal. That should be obvious. Check your local laws to make sure you're not breaking any. The stack includes an optional VPN client, which you could use if you want to be less detectable. You'll need to configure the right torrent trackers in Prowlarr. Some are great for movies, some for TV shows, and there are different ones for anime. There doesn't seem to be a single tracker that does it all. Even then, some trackers might not work. For example, l337's Cloudflare firewall is blocking Prowlarr. Not all movies and TV shows will be easy to find, so if you're looking for some obscure media, you might need to go with a Usenet tracker. This setup requires a pretty stable internet connection (with headroom for both your torrenting and your regular use), and tons of storage. Depending on how much media you're downloading, you'll probably need to delete watched series consistently or use extremely large drives. Diagnosing issues (Prowlarr can't see Sonarr! Plex isn't updating! Downloads aren't appearing in Deluge!) requires some understanding of Docker containers, Linux, and a bit of command-line work. It's certainly not impossible, but might be off-putting for beginners.
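For reference, here is a heavily trimmed sketch of what such a compose file can look like, using the linuxserver.io images. The paths, ports, and service list are placeholders; the real quick-arr-Stack repo wires up more services plus the optional VPN.

```yaml
services:
  deluge:
    image: lscr.io/linuxserver/deluge:latest
    ports:
      - "8112:8112"               # web UI
    volumes:
      - ./downloads:/downloads

  radarr:
    image: lscr.io/linuxserver/radarr:latest
    ports:
      - "7878:7878"
    volumes:
      - ./movies:/movies
      - ./downloads:/downloads

  plex:
    image: lscr.io/linuxserver/plex:latest
    network_mode: host            # on WSL, switch to bridge and publish Plex's ports one by one
    volumes:
      - ./movies:/movies
```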
I’ve been learning Persian (Farsi) for a while now, and I’m using a bunch of tools for it. The central one is certainly Anki, a spaced repetition app to train memory. I’m creating my own never-ending deck of cards, with different types of content, for different purposes. The most frequent type of card is grammar-focused phrases (very rarely single words), which sometimes come from my own daily life, but also very often directly from videos of the Persian Learning YouTube channel, created by Majid, who is, in my opinion, a very talented and nice Persian teacher.