Posts in Tutorial (20 found)
Maurycy Yesterday

How to write your own website:

I recently wrote an essay on why you should set up a personal website rather than using social media. Doing so lets you own your space on the internet, customize it, and free your readers from constant advertising and algorithmic feeds designed to keep you stuck doomscrolling all day. However, despite how much time we spend using it, creating something for the internet is seen as arcane wizardry by most people. This is a fairly accessible guide to getting started. You'll need a text editor (any will do) and a browser (you already have one).

All pages are written in HTML, which is a simple text-based format. To start with, a file containing nothing but plain text is a perfectly valid HTML document. To try this, just create a text file with a ".html" extension and open it in your favorite browser. Do this now: experimenting is the best way to learn how everything works.

Plain text is boring, so let's add some formatting. The angle bracket things are tags: "<b>" is an opening tag, and "</b>" is the matching closing tag. The word surrounded by brackets ("b") is the tag name, which tells the browser what to do: in this case, bolding the enclosed text. The other formatting tags are <em> (emphasis), <u> (underline), <sub> (subscript), <sup> (superscript), <small> (small text), <mark> (highlight) and <del> (deleted). You don't have to memorize this list, but go and try a few out. There's also <br/> (break), which adds a line break. It's special because there's no closing tag: it's always immediately closed and can't contain any text. I like to add a slash after the tag name to indicate this.

A big wall of text can get quite ugly, so it's good to break it up with <p> (paragraph) tags. Each paragraph will be visually separated from other content on the page.

Together, the matching tags and their contents form an element. Elements can contain other elements, but it's important that they are closed in the correct order: closing an outer tag before the inner one is wrong, but closing the inner one first is fine. Browsers will attempt to render invalid HTML, but the results may not be what you intended, so it's best to make it easy for them. On that topic, it's good practice to put all your content inside a <body> element which is itself inside an <html> element. This isn't mandatory, but it helps browsers render your page correctly: in the case of an old browser, you don't want metadata (we'll add some later) getting confused for page content.

Ok, back to text-wall-avoidance: the <ul> and <ol> (unordered/ordered list) tags create, well, lists. Each item should be wrapped in <li> (list item) tags. You can add literal angle brackets (and ampersands) to a page with &gt; (>), &lt; (<) and &amp; (&). These entities will render as the corresponding character, but won't form tags. Headings use <h1> (heading 1) through <h6> (heading 6), with larger numbers using smaller font sizes.

Links are just <a> (anchor) tags, but they have something new: an attribute, placed after the tag name but before the closing bracket. The "href=" attribute sets where the link points to. A lot of other tags can also have attributes: for example, ordered lists with the "reversed" attribute count backwards.
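To make that concrete, here's a small page using the tags above — the text is adapted from the post's running example, and the link target is just a placeholder:

```html
<html>
  <body>
    <h1>My new site</h1>
    <p>Check out my new site: I have many <b>epic</b> things here.</p>
    <p>About this site:</p>
    <ul>
      <li>It has epic things</li>
      <li>... and is handwritten <em>HTML</em></li>
    </ul>
    <p>Read more on <a href="https://example.com">this other site</a>.</p>
  </body>
</html>
```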
The URL in "href=" can be relative: If linking up multiple pages on the same site, instead of this: … just write this: Images work similarly to links, except that they are self-closing elements like <br/>: Check out this picture of a nebula I took! (If you don’t have a URL for your image, skip to the hosting section to set one up) That’s all the essentials, but there’s a lot of other useful tags. For example <details> creates a dropdown that works with ctrl-f: This is a dropdown with just HTML. It works well with browser features (ctrl-f, fragment identifiers, screen readers, etc) by default. (better usability than 99% of commercial sites!) …but I can’t cover everything without writing a whole book. (The Mozzila docs are a fantastic reference) At this point, you should have something like this: I made this site to write about things I do. More updates soon™ . Here's my picture of the Dumbbell Nebula: Let’s start by giving the page a machine-readable title: Like with <body>, the <head> tag isn’t required, but it is good to include it: Otherwise, any metadata that the browser doesn’t understand might be mistaken for content. The page still looks kinda bad: Text extending the edges of the page isn’t exactly easy to read. It’s not too bad when crammed into my blog, but longer paragraphs will look terrible on large monitors. To fix this, we need to add some style and layout information using the <style> tag: Unlike other tags, the contents of <style> isn’t HTML, but CSS: a whole other langauge embedded within the file. CSS is compoosed of blocks, each begining with a selector to control what gets effected. Here, this is just the name of a tag: "head" The selector is followed by a series of declarations wraped in curly braces. My example only has one: "max-width: 30em;" This caps the width of the element at 30 times the font size: I made this site to write about things I do. More updates soon™ . Here's my picture of the Dumbbell Nebula: The page is looking rather asymetrical, so let’s center the column. For fixed-width elements, this can be done using the "margin" property: I made this site to write about things I do. More updates soon™ . Here's my picture of the Dumbbell Nebula: (For varable width elements, use flexbox for centering and other fancy layouts. A single line of text can be centered with "text-align=center") Personally, I like dark themed sites, so lets change some of the colors: I made this site to write about things I do. More updates soon™ . Here's my picture of the Dumbbell Nebula: The "color" style will carry over to every element inside of the styled tag, so there’s no need to individually change the text-color of every element. However, the links do need to be changed because they override the color by default. That’s it. Everything you need to replicate my blog, minus a few small bits like the sans-serif font, nagivation box, etc. Of course, your website can and should be different: It’s yours . I highly recomend you read some documenation and play around with CSS. There’s also way more to it then I can possbly cover here. Every website you see was created with it, and it even supports animations and basic interactivity . … also, check out your browser’s devtools (ctrl-shift-i): It will have a nice GUI for editing which shows you the result in real time and shows you what’s going on under the hood. If you ever run out of tags, you can just make up your own and style them as needed. 
If you ever run out of tags, you can just make up your own and style them as needed. As long as the name includes a hyphen, it's guaranteed not to be included in any future version of HTML. The specification even lists <math-α> and <emotion-😍> as allowed custom element names. I've used this heavily on this page: all the example websites aren't screenshots, they are <fake-frame> elements styled up to look like a browser window. Custom tags are also very handy for styling text.

At this point you should have a reasonably nice page ready to put up on the internet. The easiest way to do this is to use a static file hosting service like Github Pages or Cloudflare Pages. Both of these have generous free tiers that should last a very long time. If you don't like big companies, there are plenty of similar, smaller services. These can be more limited: the popular Neocities charges $5/mo to use a custom domain. Another option is to rent a server ($3-$5/mo) or, if you have good internet, run one yourself. This is by far the most fiddly option: I would not recommend it unless you like playing with computers.

All of these (except a server) will give you a subdomain by default. For example, Github Pages will give you your-username.github.io. However, I do recommend setting up a custom domain: this will let you switch providers seamlessly should anything happen.

All of these services work in a similar way: upload a file with some name, and it will be given a URL with that same name. The one exception is that files called "index.html" will be viewable at the root of the folder they are in. You should put an index.html in the root of your site to serve as the homepage, but apart from that, the organization is up to you.
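As a concrete picture of that, a small site might be laid out like this — the folder and file names are only an illustration:

```text
my-site/
├── index.html            → served at the root of your domain (the homepage)
├── about.html            → /about.html
└── photos/
    ├── index.html        → /photos/
    └── dumbbell-nebula.jpg
```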

0 views
Karboosx 3 days ago

Google Apps Script is amazing for automation

Ever heard of Google Apps Script? You're missing out! This free Google tool lets you automate tasks across all Google apps with custom JS scripts, no servers needed. I'll show you how I use it to magically update my nerdiflix site every time I add a video to a YouTube playlist. It's like having your own digital assistant for all the boring stuff!

0 views
Ginger Bill 4 days ago

Mitigating the Billion Dollar Mistake

This article is a continuation of Was it really a Billion Dollar Mistake?. After reading a lot of the comments on numerous social media sites about the original article, I think I need to clarify a lot more. The main points I wanted to clarify:

- Null pointer dereferences are empirically the easiest class of invalid memory addresses to catch at runtime, and are the least common kind of invalid memory addresses that happen in memory unsafe languages.
- I do think it was a costly mistake, but the "obvious solutions" to the problem are probably just as costly, if not more so, in very subtle ways which most people neglected to understand in the article 1.
- I think that even if Tony Hoare hadn't "invented" null pointers, within a couple of years someone else would have. I don't think it's a "mistake" the programming world was ever going to avoid.
- I am talking about languages that run on modern systems with virtual memory, not embedded systems where you interact with physical memory directly. Those platforms in my opinion need much different kinds of languages, which unfortunately do not exist yet.
- I was also talking about languages akin to C and Odin, not languages that run on a VM or have "everything be a reference".

A lot of commenters based their complaints on their experience with languages like Java/C#/Python/etc, and the issues with null-pointer-exceptions (NPEs) in them. What I think a lot of people seemed to forget is that in those languages, virtually everything is a pointer, unlike in a language like C/Go/Odin which has explicit pointers. When everything is a pointer, it is exponentially more likely that you will hit a pointer that is invalid. And in the case of a managed (garbage collected) language, that invalid pointer will most definitely be a null pointer. This is why I can understand the problem of having null pointers in such languages. But I think this still misses the point I was trying to state: the reason null even exists in those languages is because you can declare a variable without an explicit initialization value.

Because you can declare such a thing in a language like Java, there are three options to try and mitigate this design flaw:

1. Allow for null pointers (and just deal with it).
2. Make all pointers implicitly maybe types (e.g. in Java).
3. Require explicit initialization of every element everywhere to assume null cannot happen, along with things like maybe types.

Unfortunately existing languages like Java cannot have these problems solved, but newer languages that want to stylize themselves similarly could solve them. One of the issues is that languages like Java added maybe/option/optional types too late AND they are not the default behaviour. The first approach is the current status quo, the second approach keeps the implicit value declarations but adds more checks, whilst the third approach requires doing explicit value declarations.

The enforcement of maybe types as the default pointer/reference type leads to two possibilities:

1. Requiring each reference to be checked if it is null.
2. Checking if a value is null and propagating that up the expression tree.

Version 1 amounts to an explicit check at every use, but because of the ergonomic pains it can also lead to unwrapping cases, which are practically equivalent to NPEs. At least with an explicit unwrap, it is a little clearer that a panic could happen. But it can also just be an early-out, like with Odin's or_return. Version 2 is a bit weirder, since it doesn't remove the concept of null but propagates it further up the expression tree.

The first approach is unergonomic to use, especially in a language where virtually everything is a pointer/reference, and with the addition of unwrapping which just panics on null, it has practically reinvented NPEs with more steps. As for the second approach, I'd argue it is very bug prone if it were the default, since you cannot trivially know where the null arose from — it was just passed up the stack 2. Therefore most people think the third approach to mitigating null pointers is the "obvious" and "trivial" approach: explicit individual initialization of every value/element everywhere.

One thing I commonly saw was people saying that I "missed the point": that null safety is not about protecting from common invalid memory access, but rather about encoding, in the type system itself, the states a pointer can be in — whether it cannot be null or whether it might be null. I already knew this, and I find it bizarre 3 that people did not understand that from the article. The point I was trying to get across, which most people seemed to either ignore or not understand, was that the approach of requiring explicit initialization of every element everywhere comes with costs and trade-offs. Most people who bring this up as "the solution" think there is either no cost, or that the cost is worth it.
The former group is just wrong, and the latter group is who the article was aimed at in the first place: you don't actually understand the costs fully if you are answering the way that you do. I understand this sounds "condescending" to some people, but I am not trying to be. The point I am arguing is far from the common view/wisdom, and thus I tried to explain my position. Why would a person listen to someone with a "fringe" view? "Fringe" views are typically wrong in other areas of life, so it makes sense to apply that heuristic to the domain of programming too. I don't care whether people agree with me or not; rather, I wish people actually understood the argument and then commented.

But as a systems programmer, I deal with memory all the time, and null pointers are the least common kind of invalid memory that I have to deal with — and the other kinds are not handled by the type system, nor would they be handled by solving the problems of null. No, this is not saying "well, just because you cannot solve problem X with Y, don't solve either"; it's saying that they are different problems, and empirically they just have different kinds of severity and ways to mitigate them. I am not saying you shouldn't try to solve either problem if you are designing your own language, but rather that while they are both kinds of invalid memory, the solutions to mitigate them are completely different in kind 4.

For a managed language like Java, the cost of explicit initialization of every element everywhere is so little in comparison to the rest of the language that the approach is honestly fine. But for a language like the one I have designed and created—Odin—non-zero initialization becomes extremely costly as things scale. The simple/naïve cases look harmless enough in pseudo-C, but if you use a lot of pointers everywhere, the initialization becomes a lot more complex, and non-linear too.

People argue the need to express non-nullable pointers, and either version 1 of the previous approach or this explicit approach are effectively the only ways of doing this. You could tell the compiler to assume the pointer is never null (e.g. with a non-null attribute or an assume-style hint), but those are not guarantees in the type system, just you telling the compiler to assume it is never null. Non-nullability is not possible outside of those two approaches.

This was the entire point I was making with the Individual-Element Mindset versus the Group-Element Mindset: the individual-element mindset lends itself well to thinking about individual elements like this. And as such, it doesn't really think about how the cost of thinking in individual elements compounds into something expensive. I've been in projects where a lot of the time in a program is spent in the destructors (or Drop-style traits) of individual elements, when all they are doing is trivial things which could have been trivially done in bulk. Most people don't consider these as "costs", nor that there are trade-offs to this approach to programming; rather it's "just the way it is".

There is the other aspect where, if the explicit initialization is applied to every type, not just ones which contain pointers/references, then it can be less ergonomic to type and adds visual noise 5. This constant syntactic noise can be tiring and detracts from what is actually going on. With the implicit zero initialization that I have in Odin, it has worked out really well. Many might expect it to be confusing, but it isn't: you can rely on it, and it becomes very natural to use.
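To make the contrast concrete, here is a small Odin sketch of my own (not code from the article): the zero-value declaration on the one hand, and the spell-everything-out style on the other.

```odin
package example

import "core:fmt"

Entity :: struct {
	name:   string,
	health: int,
	parent: ^Entity, // pointers zero-initialize to nil
}

main :: proc() {
	// Implicit zero initialization: every field gets its zero value.
	e: Entity
	fmt.println(e) // Entity{name = "", health = 0, parent = nil}

	// The "explicit individual initialization everywhere" style, by
	// contrast, requires something like this for every single value:
	f := Entity{name = "", health = 0, parent = nil}
	fmt.println(f)
}
```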
As the creator and main architect of Odin, a lot of Odin's design has been about fixing the problems I and many others faced with C, whilst still not veering too far from the general feel of C. Odin does have nullable pointers by default, but in practice they are a very rare problem due to numerous features and constructs of the language.

One of the reasons for pointer problems in C is the lack of a proper array type. Odin has proper array types and does not implicitly demote arrays to pointers. Odin has slices, which replace a lot of the need for pointers and pointer arithmetic, and because array types (including slices) are bounds checked, that already solves many of the problems that occur in C when treating pointers as arrays, which may or may not have an associated length to check against.

Odin also has tagged unions and multiple return values. Tagged unions should be "obvious" to the people who had been complaining about the initial article, but the use of tagged unions isn't necessarily there to solve the null pointer problem. Odin's Maybe is an example of a maybe/option type, which is just a built-in discriminated union. And due to the design of Odin's union, if a union has only one variant and that variant is any pointer-like type, no explicit tag is stored: the nil state of the pointer-like value also represents the nil state of the union. This means that wrapping a pointer in a Maybe costs nothing over the raw pointer.

Another reason why C has problems with null pointers is the lack of a way to state that a parameter to a procedure is optional. C doesn't have default values for parameters, nor any way in its type system to express this. C's type system is just too poor and too weak. This is why people unfortunately use null pointers as a way to do this, since a null argument can stand in for "not provided". However, it is rare to see nil used in Odin code to indicate optional pointers, except when interfacing with foreign code or for optional parameters to a procedure. This is because the need for a pointer itself is quite rare. There are multiple reasons why:

- Odin has slice types.
- Odin has multiple return values to allow for out-only parameters, which can be ignored with _ if not needed.
- Odin isn't an "everything is a pointer" kind of language: pointers have to be explicitly typed to exist.
- Writing explicit pointer types in value declarations is less common due to type inference.

However, one of the main reasons why null pointers are rarely a problem in Odin is multiple return values. Multiple return values, when used in this manner, are akin (but not semantically equivalent) to something like a result type in other languages 6. When a procedure returns a pointer, it is either assumed to never be nil OR accompanied by another value to indicate its validity, commonly in the form of a boolean or an error value. And coupled with the usual control-flow constructs, or_return, and named return values, a lot of those issues never arise. Odin is designed around multiple return values rather than result/option-style constructs, but this approach does, in practice, solve the same kinds of problems.

Before people go "well, the assumption is not enforced in the type system", remember where all of this derives from: Odin allows for declarations of variables without an explicit initialization value. And as the designer of Odin, I think enforcing that is both quite a high cost (see the individual-element vs grouped-elements mindsets) and far from the original approach to programming C. I know this is not going to convince people, because it's effectively trying to make someone think like another person, which is never easy, let alone always possible in the first place. And it's not a mere "aesthetic preference" either: this one little design decision has MASSIVE architectural consequences which lead to numerous performance problems and maintenance costs as a project grows.
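A rough Odin sketch of the two things described above — the Maybe union and the pointer-plus-ok multiple return values; the find_entity procedure is a made-up example, not from the article:

```odin
package example

// Maybe lives in base:runtime; its definition is essentially a
// one-variant union, so Maybe(^T) needs no extra tag storage:
// Maybe :: union($T: typeid) {T}

Entity :: struct {
	id: int,
}

// Multiple return values: the pointer comes with an ok flag that
// says whether it is valid, instead of callers probing for nil.
find_entity :: proc(entities: []Entity, id: int) -> (e: ^Entity, ok: bool) {
	for &candidate in entities {
		if candidate.id == id {
			return &candidate, true
		}
	}
	return nil, false
}

use_it :: proc(entities: []Entity) {
	if e, ok := find_entity(entities, 42); ok {
		e.id += 1 // only reachable when the pointer is valid
	}
}
```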
Null pointer exceptions (NPEs) are in a category of constructs in a language which I class as "panic/trap on failure". What I find interesting is that there are numerous other things in this category, but many people will take a different approach to those constructs compared to NPEs, due to whatever reason or bias they have.

The canonical example is integer division by zero. Instinctively, what do you think division by zero of an integer should result in? I'd argue most people will say "trap", even though a lot of modern hardware (e.g. ARM64 and RISC-V) does not trap, and only the more common x86-related architectures do. Odin does currently 7 define the behaviour of division by zero as "trap", only because of this assumption, but we have considered changing this default behaviour. Odin does allow the programmer to control this behaviour at a global level or on a per-file basis if they want a different behaviour for division by zero (and, consequently, modulo by zero). But some languages such as Pony, Coq, Isabelle, etc. actually define integer division by zero to be zero, because it helps a lot with theorem provers.

But there is the other question of production code. One of the main arguments against NPEs (especially in languages like Java) is that they cause a crash. So in the case of division by zero, do you want that to happen? Or would you prefer all integer division to be explicitly handled, or to default to a predictable/useful value, like zero?—which prevents crashing in the first place. Another common example of "panic on failure" is languages with runtime bounds checking. If an index is out of bounds, most languages just panic. It's rare to find a language that returns an option value on every array access to prevent an out-of-bounds access. Not even languages like OCaml do this. NPEs, division by zero (if it traps), and runtime bounds checking are all examples of this kind of "panic on failure", but people rarely treat them as being the same, even though they are the same kind of problem.

Are null pointers actually a big problem for me in practice, then? Honestly, no. I understand it might be common for beginners in a language like C to have many pointer-related problems, but they will also have loads of other problems too. However, as you get more competent at programming, that kind of problem is honestly the least of your problems.

I honestly think a lot of this discussion is fundamentally a misunderstanding of different perspectives rather than anything technical. A lot of what some people think are their "technical opinions" are merely aesthetic judgements. And to be clear, aesthetic judgements are not bad, but they are not necessarily technical. But I'd argue most people are not applying their opinions consistently when it comes to the category of "panic on failure", and NPEs are no different; they only seem like more of a problem to them either because of the existence of the name "the Billion Dollar Mistake", or because they encounter them more.

I know a lot of people view the explicit individual initialization of every element everywhere approach as the "obvious solution", as it seems like low-hanging fruit. As a kid, I was told not to pick low-hanging fruit, especially anything below my waist. Just because it looks easy to pick, a lot of it might have been left unpicked for a reason. That does not mean that you should or should not pick that fruit, but rather that you need to consider the trade-offs.
If you honestly think the costs of explicit individual initialization of every element everywhere are worth it for the language you are working in or developing, then great! But at least know the trade-offs of that approach. For Odin, I thought it was not worth the cost—compared to the alternative ways of mitigating the problem empirically.

Most of the bad criticisms just came from people who didn't read the article or didn't read past a couple of paragraphs. That's why I wanted to state this very clearly.  ↩︎

This is partially why I do not like exceptions as error handling in many languages. It is not obvious where things are thrown/raised from, and they encourage the practice of ignoring them until the latest possible place. I discuss that problem in The Value Propagation Experiment Part 2.  ↩︎

I understand what type systems do and their benefits, and it is a little insulting when people assume my knowledge (or lack of it) without doing a modicum of review.  ↩︎

In the case of the other invalid memory addresses, linear/affine substructural type systems with lifetime semantics can help with this (e.g. Rust), but they come at another cost in terms of language ergonomics and restrictions. Language design is hard.  ↩︎

I know typing is never the bottleneck in programming, but the visual noise aspect is a big one when you are trying to scan (not necessarily read) code. I want to see the pattern and not be swamped with syntactic noise.  ↩︎

I know a result type is a kind of sum type and multiple return values are more akin to a product type, but given how different languages want to be used and expressed, this works out fine in practice for the same kinds of problems. Please don't give me an FP rant.  ↩︎

At the time of writing, I am not sure which approach is the better one: trap or zero by default, but we allow for all four options in the Odin compiler: trap, zero, all bits set, or the same value. Division by zero for floats results in "Inf", and that's not necessarily as much of a problem in practice, so why would integer division by zero be as bad?  ↩︎

0 views

Strategy Guide For Kingdom: Two Crowns

Kingdom: Two Crowns is the third, and definitive, installment in the game series, released in 2018 by Raw Fury and now available on Android and iOS, among other platforms. The gist is that you play a monarch in medieval Europe, with the option to co-op with a friend, and build your kingdom from the ground up. Hire villagers and employ them as archers, builders, farmers, knights, and others, then expand continuously to upgrade your kingdom and move to other islands. The catch is that there is a monster called the Greed, which manifests at night as these little purple guys who want to break down your defenses and steal your gold and crown. If you lose your crown, you "die."

Death in the game is not too bad; sent back to Island One with no money, you can simply work your way back to the island you were on before and keep most progress. Additionally, it can be beneficial to die, as that also resets the difficulty counter. As each day passes, the Greed gets, well, greedier. More of them spawn, they're harder to defeat, and in late stages more powerful versions appear, such as the Breeder, which spawns more little Greeds and takes a long while to kill. When you die, the global day count is set back by 100, meaning you get to keep a lot of progress and things are easier for a while.

There are multiple skins of the game available to choose from. This guide focuses on the base game, or Europe. Different skins have variations in where things are located and how they work.

Since the difficulty has not ramped up yet at the start, it's best to focus on unlocking technologies and expanding to other islands rather than maxing out on upgrades for defenses. Which ages you've unlocked determines what you're able to build and accomplish:

In the Wooden Age, all things buildable are wood, and only wood. This age is particularly limiting in buildings and upgrades, and while passable in defense, it notably lacks any way to retaliate against the Greed. While the Kingdom may be capable of holding against the Greed, they will adapt, urging the Kingdom to advance into the Stone Age. Notable wooden purchases available in New Lands and Two Crowns: the boat – the way off the island; the bank – a place for storing spare coins. — Kingdom Wiki - Technology

You will start out here, and the best way to progress is to focus on 1. basic walls and archer tower defenses, and 2. income, so you can head to Island Two as soon as possible.

Why go to Island Two? On the first island, there are great things to unlock, but you will not have progressed enough yet to unlock them. For example, the Griffin needs gems to unlock (gems are only found on Island Two and onwards), as do the Ballista Hermit and the Archery Statue. The boat is much easier to get up and running on Island Two, so it makes sense to leave as soon as you are able. This also gives you an easier start on the second island, because your Global Day Count, which affects difficulty, will be lower when you arrive, since fewer days have passed.

Income can be achieved in a few ways. The first main one is archers, who hunt animals and defend the kingdom from the Greed. Next, you can hire farmers; their vendor requires a sector with a wooden back wall (expanding to have more upgraded walls further out creates more sectors of the kingdom). Farmers remain one of the most lucrative sources of income in the game, being able to harvest crops every day and forage plants in the winter. It is best to ensure there is a wall in front before building a farm, though, so the Greed don't simply walk up and attack your farmers.
Something I've noticed is that building archer towers out in the fields where the wildlife is can be an efficient way to make archers hunt more. Instead of wandering around as much, an archer in a tower will shoot any animals that come near. Staggering towers out into huntable areas leads to more animals shot. Clearing the trees to let grass grow, leading to more rabbit burrows, is also advisable. Additionally, obtaining the Stag Mount on Island Two allows you to attract deer to your hunters to shoot.

For workers, it may be good to avoid expanding into recruitment camps until you need to, as at all stages of the game it's good to maximize workers. Destroying all trees between your base and a campsite will get rid of it. If you find an open area past the camps that allows wall construction, you can extend the kingdom past the camps and chop down all trees except the two immediately flanking the camp. — Kingdom Wiki - Starting, surviving, & winning

And another tip from the same guide, concerning efficient use of your mount: Remember, if you completely run your horse dry of stamina, it'll take longer before you can run again. It's a good idea to stop running & walk as soon as the horse starts puffing, to maximize speed. If you stop the horse in a grassy area, the horse will eat some grass and fully recover within a second or two.

Another notable source of income, only available on the first two islands, is the Merchant. He walks to the middle of the Kingdom, near the Town Center, and pays you eight coins. In return, you pay him back one coin, and he will return the next day with a new shipment. With a net gain of seven coins, this can be a reliable source of coins on the starting islands.

Concerning the Town Center: when you upgrade to the highest wood tier, you unlock the Banker, who can be incredibly useful, as he can deposit coins for you and accrue interest on them. Interest earnings are daily and depend on the number of coins stored. If a total of only one or two coins has been deposited, there will be no interest, and that amount won't change with time. If at least three, and up to one hundred, coins are stored, the Banker increases the funds by seven percent (rounded up) per day. When more than one hundred coins are stored, the interest rate becomes a solid eight coins per day. Technically speaking, once this condition is met, every five days the Monarch can refill their coin purse completely. This will work every five days indefinitely by utilizing the earnings from interest alone. — Kingdom Wiki - Banker

Once you leave for Island Two, you can unlock the Stone Age. The Stone Age is the first obtainable technology. It's a defining moment in the fight against the Greed, as with squires, the Kingdom may now assault portals, and upon their ruins build powerful teleporters. Notable stone purchases: the Shield – the equipment for squires; the Catapult – an area-effect weapon; the Teleporter – to travel long distances or spy from afar; Pikes (Europe). — Kingdom Wiki - Technology

As mentioned, the Stone Age is required for what is, in my opinion, the meat of the game. Once you start destroying portals, you can deal with less Greed and expand more, eventually being able to eliminate the Greed from a given island once you reach the Iron Age. You can also hire Pikemen once you have a stone-walled sector; they are incredible subjects that can fish to produce income (including during the winter) and effectively defend the kingdom from the Greed at the walls. Once the Town Center has been upgraded to stone, you can pay for four shields on it, which act as a sort of "job vendor": unemployed subjects can be hired as Squad Leaders at the Town Center.
Once someone has picked up a shield, it is replaced with a banner of the same colors, and you will know they've been employed. When a Squad Leader/Knight dies, you will see a ripped banner, which you can pay to replace with a shield in order to make a new Squad Leader. This leader brings a squad of archers to the end of the kingdom walls, ready to be ordered to attack and destroy a Greed portal. While it is common for squad leaders to be defeated and have to be re-employed while attacking a portal, damage to the portal is permanent, so you can keep trying until it is fully demolished.

To get stronger leaders, you can upgrade them to Knights, which requires a forge. To get a forge, you need the Town Center's last tier, the Iron Keep; additionally, it requires a large enough empty space protected by an iron wall, that is, a sector with an iron back wall. For this you need Iron technology, found on the Fourth Island.

Also on the Second Island are three gem chests, meaning you can start collecting gems to use on new mounts, statues, and Hermits. 1 The Stag Mount, Scythe Statue, Dog, and Stable Hermit are also on the island. I recommend getting any statue when you can, as it gives a blessing that applies across all islands until a monarch loses their crown, after which you can pay a coin fee to reactivate the statue. For example, the blessing the Scythe Statue gives is an increased number of supported farm plots.

Once you have worked the Second Island enough to want to unlock the Iron Age, head to the Third Island. The boat will take a bit more time to build now, and something to consider is expanding the Kingdom walls past the boat remnants while your builders work on it: if you purchase new parts but they have not been built yet, the Greed can steal them. Also, if you're wondering how to avoid crashing your boat every time you go to an island: you need to destroy the dock portal of a given island in order to build a Lighthouse. This structure ensures the boat will not be destroyed when you land on an island containing a Lighthouse. You can then upgrade a Lighthouse to prevent it from decaying like the rest of the island when you are gone for too long.

The Third Island is not too exciting, although there are some important considerations on it. There's another mount, more gems, the Builder Statue (increases maximum wall HP), and the Bakery Hermit. The Bakery Hermit allows high-tier archer towers to be upgraded into bakeries for six coins; the bakery is an unmanned structure that produces treats designed to lure vagrants out of their camps. This makes recruiting them easier, especially on larger islands. You don't want to put a bakery out in the wilderness, though, since the Greed can steal the treats. Hermits can also be brought with you to new islands, so the Bakery Hermit is probably most useful on Islands Four and Five.

Island Four has the Iron Mine, to bring you into the Iron Age. It also has offensive mounts, such as the Bear and the Lizard, and the Warrior Hermit, who can turn high-tier archer towers into a Warrior Tower. This tower allows you to recruit additional squad leaders by paying for more shields.

The Iron Age brings the best weapons and defences the Kingdom has ever seen. To gain access to iron, Monarchs must locate and construct the iron mine. Reaching iron will gradually shift the Kingdom's strategy from defense to offense.
Notable things made possible through iron: the Iron wall – the strongest type of wall; the Forge – where swords turn squires into knights; the Bomb – the ultimate anti-greed weapon. — Kingdom Wiki - Technology

The benefits of the Iron Age (the final technology) are much stronger fortifications, forges to turn squad leaders into knights, and the Bomb, which is the final step to eliminating the Greed from a given island. Once enough portals are destroyed to reach the cave, you can launch the ultimate attack against the Greed hive. The steps you ideally need to take are:

1. Hire as many squad leaders and supporting archers as possible, upgrading to knights where possible.
2. Purchase a bomb.
3. March to the portal at the crack of dawn with the squads and builders pushing the bomb.
4. Pay coins to the builders and bomb to initiate entering the portal (remember that both members in a co-op game need to enter the portal).
5. Fight your way through the Greed hordes inside until you reach the center of the hive.
6. Pay coins to ignite the bomb.
7. Run as fast as you can to the exit, as you have 15-30 seconds to escape with your crown intact. 2

In the late game, focus on upgrading everything to iron and destroying portals, as well as making catapults and fire barrels to defend yourself. Rinse and repeat on all your islands, and you're golden.

Hermits are potential subjects who know how to build useful, specialized structures in the Kingdom. ↩

I would ensure you have a mount with considerable speed/stamina for this. Even the default horse will do. ↩

0 views
Giles's blog 1 week ago

Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that. In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machine that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way? Here's what happened.

As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing. So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?"

Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I found that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. It can train the model in less than four hours, it happens to be the right size for batches that minimise loss (more on that later), and it can do that train for about US$35, excluding validation. If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results.

Back when I was messing around with fine-tuning LLMs using the Hugging Face ecosystem -- their "Transformers" library and so on -- one of the experiments I did was to fine-tune a 0.5B Qwen model on an 8x GPU machine. As part of that, I came across an excellent HF page summarising different kinds of multi-GPU training techniques, three of which are relevant here. Now, from what I understand, due to all of the copying around of models, plus the issues inherent with the GIL in Python, DDP (DistributedDataParallel) is actually better than DP (DataParallel) despite being more complicated -- and more flexible! Per Hugging Face:

DDP is recommended because it reduces communication overhead between GPUs, efficiently utilizes each GPU, and scales to more than one machine.

It might be a while before I want to try multi-machine training, but it would be awesome to have code that's ready to do that without needing any extra work. Now, how to implement it? Hugging Face have a library called Accelerate, which does everything for you:

Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code!

That does sound very useful, but I worry that by using it I won't learn as much. It also rather ties you in to the HF ecosystem. That's not necessarily a bad thing -- I enjoyed using their stuff in my fine-tuning project -- but I'm trying for a somewhat lower-level view in this series. So, let's use the PyTorch-native stuff.
There's a "getting started" tutorial, so we can follow that. It has two options for running with DDP: one with a bit of extra setup code -- the first example, under "Basic Use Case" -- and one that uses torchrun to make things easier. The second sounds best. The code changes actually look really simple; given a normal single-GPU training script, you need to do some setup at the start, wrap the model itself in a DistributedDataParallel object (which is what you actually do the train on), and add a bit of teardown at the end.

The way to look at this is that torchrun will spin off one process per GPU, each running exactly the same code. Each has a "rank", which is an integer saying which of the per-GPU processes it is -- 0 for GPU 0, 1 for GPU 1, and so on. There's a bit of a gotcha here, though -- we look at an environment variable called LOCAL_RANK at the start, but we then get a (non-"local") rank from torch.distributed a bit later on. This is due to the multi-machine possibilities with DDP -- if you have multiple machines, then the local rank is "which GPU on this machine does this process relate to", but there will also be a "global" rank, which is unique across all machines. This distinction won't matter that much during this one-machine test, but it's worth keeping in mind if we want to keep the code in a shape where it could potentially scale to multiple machines.

Anyway, after the processes are spun up, they will do their training, and the synchronisation and passing around of gradients during the backward pass will all happen invisibly in the background, so when we do our optimiser step, it will have the full set of gradients. That means that we'll presumably also need to use the rank -- that is, which of the n per-GPU processes the current code is running in -- when selecting which dataset items to train on. More about that later.

Let's start writing some code! I'll use a new repo, into which I can put just the code needed for this train. I'll also structure it a little better than last time, with separate "runs", each of which has a model config and training parameters, and will later on have its own checkpoints. You can think of these as being one per machine size that I'm trying out -- I'll create a run directory for each one. Here's a first cut, simply loading up a model config from a run's directory, using it to create the model, and then doing the wrapping above -- no training at all. Running it with torchrun (under the tooling I'm using for all new projects) looks promising.

Now, unfortunately we only have one GPU locally, and the code assumes that it's one process per GPU (I believe that's a hard limitation for PyTorch's DDP), so running with more than one process blows up. So we can't do an in-depth test locally. But at least we know that the basic infra is there and working.

Now let's move the other training code from the single-GPU script into that file, pretty much blindly. The result does almost nothing beyond what the last train did, apart from wrapping the model in a DistributedDataParallel object -- the only other changes are to use this "runs" directory that we've introduced. As a quick hack, we should try running it. It does a validation and checkpoint before it starts, and we can make that happen quickly by hacking the validation loop to only do a couple of iterations. (Foreshadowing: that hack will come back to haunt us later!)
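Putting the pieces from the tutorial together, the scaffolding has roughly this shape -- a sketch of the pattern with a stand-in model and training loop, not this project's actual script:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets LOCAL_RANK to 0, 1, ... -- one process per GPU.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Initialise the process group so that dist.get_rank() etc. work.
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(1024, 1024).to(device)  # stand-in for the real model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters())

    for step in range(10):  # stand-in training loop
        x = torch.randn(6, 1024, device=device)
        loss = ddp_model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are synchronised across processes here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc-per-node=<num_gpus> this_script.py
```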
Running the hacked-up script, then hitting control-C after the validation completes, things look OK, and we have what look like solid checkpoints. However, loading one of those checkpoints fails. It turns out that the problem is in the code that saves it: the state dict we're saving comes from the DDP wrapper around our model. My guess is that it does actually include all of the weights for the model, hence the correct-looking size for the checkpoint file, but they're renamed -- the wrapper exposes the underlying model as an attribute called module, so every parameter name gets a "module." prefix. Fixing that -- saving the underlying model's state dict rather than the wrapper's -- sorts it out, and we can load our checkpoints again. Here's the updated file.

I think we're going to have to revisit checkpointing and validation again; we don't want to do them in all of our processes, probably only on global rank 0, and we'll need to somehow synchronise everything so that the other processes don't carry on training while we're doing it. But before we get on to that, there are a couple of other things to change. At the top of the file we're defining some constants that look wrong.

We'll handle the dumbest of these first; it was actually silly that in the old code we had a constant for sequence length. We're using the context length of the model for that, so it's duplicated information. Let's get it from the model config instead -- here's the updated file. That was nice and simple.

The code that we have specifies the batch size for each GPU -- that is, with the current value of 6, we'll have six sequences in each batch on each one. Like I mentioned earlier, that's called a "micro-batch" in distributed training like this 1 -- a per-GPU batch, as opposed to the overall global size across all GPUs -- so we could just rename it, and then we'd have 6 × n gpus as a global batch size. However, it feels to me like this is a useful metaparameter to be able to tweak from outside the code. I can see machines with per-GPU VRAM varying from 40 GiB to 160 GiB on Lambda Labs, and pretty clearly that will mean there will be a varying largest micro-batch size on each type. So this is something we'll want to configure on a per-run basis; let's add a new file to our run config, load that up, and pass it through. That's a simple enough fix; no need to note the diff, but here's the code.

The validation settings we'll need to think about. The size of our validation set is based on what one process running on my local RTX 3090 can validate in five minutes, and the interval (for which I fairly arbitrarily put 2000 in the code when copying it across) was calibrated for roughly every half-hour. Those numbers in turn were aimed at the 44 hours of training time I expected locally. For this train, we'll (hopefully!) be taking significantly less time. We'll have eight GPUs, so naively that's 5.5 hours of train time, and each will have more VRAM, so we should be able to bump up the batch size and potentially get even faster than that. Depending on which kind of cards we're using, they may be faster, too -- I found that an A100 is slower (with the same batch size) than the RTX 3090 in my fine-tuning experiments, but the H100 and B200 are likely faster. I think this is another thing for the train config; we should have the validation interval (in terms of iterations) and the number of batches to do for validation in there. Here's the updated code.
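(In sketch form, with illustrative names, the checkpointing fix above boils down to this:)

```python
# Saving: ddp_model.state_dict() would prefix every key with "module.",
# so save the underlying model's weights instead.
torch.save(ddp_model.module.state_dict(), checkpoint_path)

# Loading: the plain, un-wrapped model then accepts the keys as-is.
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
```

With that sorted, let's move on to the dataset.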
With the code as it is right now, all of our per-GPU processes iterate over the same dataset in the same order. That means that they'll all be training on the same data; the synchronisation that is happening "magically" in the background means that they'll all train on the first item, work out gradients, and step their optimiser -- so they'll essentially (modulo randomness) make the same updates. Pretty pointless! What we want is for each of the n per-GPU processes to train on 1 / n of the data.

We have two useful helpers in torch.distributed:

- get_rank(), which gets the global rank of this process. In our one-machine case, it returns 0 for the process on GPU 0, 1 for the one on GPU 1, and so on. We're already using it in that setup code we looked at earlier.
- get_world_size(), which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one).

So, the simplest thing to do is to use the world size as a step, and the rank as an offset. Here's the code with that.

Now, remember that the same code is running for every one of our per-GPU processes. That means that all of them will do the training with forward and backward passes, and their own optimiser steps, all synchronised by PyTorch DDP magic. But they will also do their own validations -- which is kind of pointless -- and they'll also try to save their own checkpoints, which would be messy, because they could quite easily interfere with each other; after all, all of the processes are running on the same machine and would be writing to the same filesystem. So, as a first cut, let's just wrap an if statement around the eval and checkpointing stuff so that only rank zero runs them (and break that line apart a bit, as it was getting long).

That looks OK, but there's an extra wrinkle: all of the processes are running the same code, so while the rank zero one does the eval, the others will continue through the script -- they will go right back around our loop and start training on the next batches, which is bad. We want our processes to proceed in lockstep, iteration by iteration. Luckily, the solution is simple: the barrier() function in torch.distributed basically says "stop here until all of our processes have reached this point". So we can use two of those -- one before the eval loop, to make sure that all of the processes have finished their training part of the iteration before we do the eval on rank zero, and one after the eval, so that the non-rank-zero processes will wait. One bit of complexity -- we want to do those barriers only if it's an eval iteration, but we want to do them for all processes. So we have to break up the if statement a bit.
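In code, that shape is roughly this -- a fragment with made-up helper names (train_one_batch, run_eval_and_checkpoint), not the post's exact code:

```python
import torch.distributed as dist

rank = dist.get_rank()             # 0 .. world_size - 1, globally unique
world_size = dist.get_world_size()

# Each process trains on every world_size-th item, offset by its rank.
for i in range(rank, len(train_dataset), world_size):
    train_one_batch(train_dataset[i])

    # NB: keying this check off i turns out to be a bug (i differs per
    # rank); how that gets fixed is covered further down.
    if i % eval_interval == 0:
        dist.barrier()             # let every process finish the step first
        if rank == 0:
            run_eval_and_checkpoint()   # only global rank 0 validates and saves
        dist.barrier()             # the other ranks wait here instead of racing ahead
```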
The real version of that seems to work OK (code here), but it does give a warning suggesting that we should pass the device ID in when we call barrier(). Let's dig into that a bit. Here's the copypasta that I took from the PyTorch tutorial earlier in this post; let's work through what it is doing.

The LOCAL_RANK environment variable is being set by torchrun to 0, 1, 2, etc. as appropriate, to tell us which process we are on this machine. So the first line is telling PyTorch to use the device with that index for this process. The next line is getting the current accelerator -- that is, an object that represents which acceleration hardware we're using in this process. I think that the best way to see the combination of these two lines is that the first says "use GPU 0" (or 1, or 2, or...), and then the second says "get the object describing the GPU you're using right now". So it's a slightly indirect way of getting the object containing the details of the GPU in question.

Next, we work out which backend to use. A backend in this context is an abstraction of whatever system the device in question is programmed with -- in the case of an Nvidia GPU, it would be some kind of thing that encapsulates CUDA. Once that's done, we call init_process_group, passing in the backend that we're using. We're saying "initialise the internal data structures for torch.distributed so that they're all set up properly to work with the backend we specified". After that, we can do stuff like getting the global rank with get_rank() and so on, because torch.distributed has been properly initialized. Presumably at this point we're talking to any other machines in a multi-machine cluster, so we can find out what our world size is and that kind of thing.

The extra line at the end, which works out which device to use, actually looks erroneous to me. All of our code is assuming one process per GPU, so I think we can just use the device we already have there as well. Let's rewrite it (with some useful comments). That seems to work well! Here's the code.

However, I ran it past ChatGPT (largely to validate my understanding of what was going on), and it highlighted something slightly misleading about it. Right now, we're training on a single node, with one process per GPU. But again, one of the neat-o things about this DDP stuff is that it should be able to scale to multiple nodes. Now, remember that LOCAL_RANK is just the rank of the current process on the specific node that it's running on -- hence the name. If we had two machines, each with 8 GPUs, then there would be a process with rank zero on each of them. The "real" rank -- that is, across all machines -- is the one that you can get from torch.distributed once it has been initialised. One of the things it does during that initialisation is to talk to all of the other nodes and work that kind of thing out -- which of the local rank zero processes across all of the machines is the global rank zero process. So we need to use the local rank when working out which GPU we should be running on, but we should not treat it as a global rank. That's actually fine in this case, as we're calling get_rank() inside the training loop when we actually need the global one (when indexing into the dataset, or when deciding if we're the process that should be doing evals and checkpoints). The only place where we might be confusing matters is in a print statement, which is not important anyway, as the training loop also prints out its rank. So, let's tweak it a little more for clarity. That seems to work well! Here's the code.

Time to run it past ChatGPT to see if I've made any dumb errors. Turns out that (unsurprisingly) I have... Let's go back to the code that decides whether or not it's an iteration where we need to do a validation run and a checkpoint. The problem is that our loop index is different in the different processes! Remember, we step through the dataset using the world size as a step and the rank as an offset in order to pick out the correct training items. So let's think about it: in the first run through the loop, with 8 GPUs, the processes would be looking at indices 0 through 7; in the next run through, indices 8 through 15. So taking the index modulo the eval interval will give different results for each process. That might not sound like the end of the world -- it will only be zero for one of them, so long as the eval interval is larger than the number of GPUs -- but remember that our validation code wraps the barriers inside that very check. If different processes have different values for the index, then barrier() will only be called in the one(s) for which the check is true. But barrier() means "wait until all processes have reached this barrier". So the ones that call it will lock up completely until the other processes get there, and everything will at best get out of sync, and at worst will lock up completely.
I think that the problem here is that I'm conflating two things: the index of the global step -- that is, one iteration across all GPUs -- and the dataset element that we want to use. In the original one-GPU case that made sense; iteration 0 was on dataset element 0, iteration 1 was on element 1, and so on. But now the offset into the dataset, and the global step, are quite different things. This is quite deeply embedded in the code, but we can fix it! Let's start off by changing our checkpoint code, just to rename things. It keeps track of a variable that is our offset into the training dataset, and uses that both to index into the dataset, and to work out how far through the train we are. The latter is a much better thing to store in a checkpoint, so instead of saving the dataset offset, we'll store (and restore) the global step. Basically, just a rename so that the variables and stored JSON match the new reality. Here's the updated code. Now we need to make a number of minor changes to the training loop just to match that rename of the value that we're checkpointing (e.g. for the code to generate the training chart) but the most important change is to our loop. Instead of iterating over our dataset with a step and an offset so that we can index into it, we firstly work out how many global steps there will be: ...then we iterate from our initial global step -- zero if we're starting a fresh train, or whatever global step we were on in a loaded checkpoint plus one if we're doing a continued train from a checkpoint -- up to that total: That means that we need to use the global step, the world size, and our current rank to work out which dataset item we should be training on for this process at this global step. Let's say that we have eight processes; on the 0th global step, we should have rank 0 training on dataset item 0, rank 1 on item 1, and so on. On the next global step, rank 0 should train on item 8, rank 1 on 9, and so on. So: That's actually much more elegant than the earlier code, and seems to work fine. Here it is. Phew, glad to have caught that before I started spending money on machines -- it would have been confusing if everything locked up. Thanks, ChatGPT! Another thing raised by ChatGPT is about the validation. We don't want to validate across all of the validation dataset -- we're using a number from the training config. I have this code: This looked like a nice, quick way to get the first N elements of the validation dataset. But ChatGPT told me it would raise. It didn't, though -- why? The problem is that I had set the number of eval batches to 2 in my training config for testing. Stepping through what that slice does, when we run it: Python calls __getitem__ on the dataset, passing in a slice object as the index, so this code is called with it: Now, because that code doesn't do anything clever with slices, they're passed straight down to the tensors that make up the inputs and the targets. So it's actually equivalent to this: Or, to rewrite the whole loop (omitting some details for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if the right-hand side has more than two elements -- the normal Python "ValueError: too many values to unpack". Nasty! AI code review certainly helped me dodge a bullet on that one. Let's fix it, it's not a big change: we can just do this: ...and that works! So here's the code now. So, I think we have one final issue, which is the training and validation datasets.
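Before we get to that, here's a minimal sketch of where the loop's indexing scheme has ended up -- again with placeholder names rather than the repo's actual ones:

```python
# One global step = one synchronised iteration across all GPUs.
total_global_steps = len(train_dataset) // world_size

for global_step in range(initial_global_step, total_global_steps):
    # Rank r trains on item (global_step * world_size + r): step 0 covers
    # items 0..world_size-1, step 1 the next world_size items, and so on.
    dataset_index = global_step * world_size + rank
    train_step(train_dataset[dataset_index])

    # global_step is identical in every process, so this condition -- and
    # therefore the barriers, the eval, and the checkpointing -- stays in
    # lockstep across all of them.
    if global_step % eval_interval == 0:
        dist.barrier()
        if rank == 0:
            run_eval()
            save_checkpoint(global_step)
        dist.barrier()
```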
In our single-GPU train, we worked out ahead of time how much of FineWeb (or FineWeb-Edu) to train on -- the Chinchilla-optimal number -- and generated a dataset that contained a round number of 6-sequence, 1024-token batches that was the smallest such round number that was larger than our target. We also worked out exactly how large (in terms of batches) our validation dataset needed to be so that each validation run would take five minutes. There was one big issue with that system; when I decided to do an "extended" train on more of the FineWeb-Edu dataset, in order to see whether I could get the loss down further, I had to do some nasty hackery in order to generate a new one. So it would be nice to not have that problem this time around. Additionally, we're likely to be tweaking the batch size quite a lot in this experiment while we find what the appropriate level is to fit onto the cloud GPUs, and also varying how much validation we do -- and additionally, we have the world size to worry about. I think that the best way to give us the flexibility we need will be to pre-convert the complete FineWeb and FineWeb-Edu datasets into the format we need -- each sequence in the dataset converted to GPT-2 tokens, and then those sequences concatenated together, with the token 50257 separating them. It would be good to properly nail down the validation dataset at the same time. So we can have a script that loads up the original dataset as downloaded from Hugging Face, splits it into 99% train, 1% validation, does the conversion, and then saves them as safetensors files. If we use for those (which is just large enough for our 50,257-token vocab), we can fit the ~10B tokens in each dataset's train split into 20 GiB of disk. Not too bad. But there will still be the issue of getting them onto our cloud machines. Let's generate the data, and then work out how to handle that. I tried initially with the code I used last time, adapted to run through the entire dataset . It does the 99%/1% train/validation split, and then for each of those generates a single massive tensor of tokens like this: It almost worked! To my surprise, it got all the way to the end, and only blew up with an out-of-memory error when it was trying to save the result -- and it did that completely silently, so I thought it had worked right up until I tried to check the file on disk to see how large it was, and it wasn't there. The obvious tweak: set the list to just after the , to free up the memory it's using. Given that it was the save that triggered the OOM, you'd think that that would be enough -- but it turned out not to be so. Rather than mess around with this for much longer, I just decided to add on 128 GiB of swap to my machine temporarily: ...and that was enough to make it run. So I've now generated pre-tokenised, pre-concatenated train and validation sets for both FineWeb and FineWeb-Edu: Now, thinking about how to get it up to the Lambda Labs machines. I have normal 1 Gb residential broadband, so conceivably I could upload 20 GiB in about 200 seconds. But that's assuming that there's no network congestion, so I would expect it to take longer. The LL machines are quite expensive, and I don't want to waste money keeping them up while I'm just uploading data. There are possibilities here: I think the best option is to use option (1), but with the option of also doing (2). The HF dataset will still take time to download to LL, even over the faster network connection. 
That download time might not be a problem -- but if it is, I can download the data once on a cheap instance and use a persistent disk too. Essentially I'd be using the persistent disk as a "cache", and still get the benefits of the easily-shareable datasets on Hugging Face. So, that decided, let's find out how we can upload a whacking great 20 GiB safetensors file as a dataset on Hugging Face. It turns out that resources like datasets on HF are just Git repositories using the LFS (Large File Storage) plugin to be able to handle, well, large files. Conveniently, given the tool I'm using to manage my project, there's a plugin that allows me to use their CLI tools with minimal effort, so: Both datasets show up on my profile page on Hugging Face, so that's looking good. Now it's time to try to upload the data. We'll need to install Git's LFS support first: Now let's try the FineWeb one: OK, so we need some kind of extra thing to tell it we can use large files on top of the LFS stuff: Right, now let's try again: Weird that it prompted for the credentials twice, but it did appear to try to do something there -- but obviously it didn't work. Let's see if Git over SSH is any better. ...then the same stuff to copy in the files and create the metadata file, then: Looks like the same error. Odd. Let's try using HF's upload tools rather than Git -- feels like a bit of a cop-out, but maybe it'll work better. That did indeed take about 200 seconds to run, but the upload speed was only about 10 MiB/s -- from the output, I think it must have been compressing it. Anyway, it looks like it succeeded, so let's upload the others! ...and that's done :-) Next, a bit of manual editing of the dataset cards on the Hugging Face website, and we have our two new public datasets: That looks solid. So, the next thing: change our codebase so that we have some quick and easy way to download them (I'm feeling a little wary of using Git for that after the upload issue), and then to use the downloaded files in our training code. We already have the code to download a dataset; the stuff that I wrote to download FineWeb and FineWeb-Edu originally. Here's the important bit: ...so we can adapt that to download all files in an arbitrary dataset: ...and call that from our training script, using a new command-line argument and a new element in our train config JSON file: I was thinking that we'd need extra guard code to not download the dataset again if it's already there, but it looks like the HF hub library handles that all nicely for us. So we have a way to specify which dataset we should use for a training run, and code to download it. Now we just need to adjust the code that loads our datasets so that instead of looking in the old location, it looks in the directory returned by our new download code: ...and update things so that it just blindly uses the directory provided rather than trying to look in a subdirectory: That all works! We successfully download the datasets and try to use them. Here's the code. But now we have a problem; when the dataset code tries to reshape the huge tensor that we have as our inputs: ...it craps out: That makes perfect sense. Our original files were carefully sized for a batch size of six, and 1024-token sequences. We need some way to work out an appropriate slice of both the training and the validation data. Most of the trains are likely to be Chinchilla-optimal, or at least use a Chinchilla-optimal number of tokens -- rounded up appropriately to match our micro-batch size, sequence length, and world size. But I'd like it to be more configurable.
What I'll do is add a key to the training config dictionary, along with a so that we can (for example) train on the first Chinchilla-optimal tokens, then do an extended train continuing on from there. The idea is that we can use as a base, and train on the smallest number of full batches that contains at least that many tokens. For validation, I think that the key that we already have is actually quite nice. Validation is time-bound, and the number of batches is the easiest lever to pull to handle that. However, a would be nice for symmetry. So, here are some numbers for debugging: Now let's use them. Initially, we have this to load the train dataset: Let's work through that one first then make appropriate changes to the validation one. The pieces of information we need to work out which tokens to use are: Let's update our function so that it takes those parameters in that order: ...and now we can write an updated that uses those numbers to get the right number of tokens: Validation is less obvious; I think that the best way to do this (given that the validation dataset is small) is just to have a "magic" value for , which means "just get a round number of full batches starting at . It's also worth remembering that we only do evals on the rank 0 process, so we could in theory pass in a world size of 1 -- but I think that passing in the real world size might be a good idea, because it gives us one fewer thing to change if, in the future, we move towards distributed evals. ...and we change to be able to handle the magic : I also added in a quick sanity check to make sure that we don't get weird behaviour if the is past the end of the original dataset. That all looks good! Running it kicks off training, and validation is running happily every ten global steps, but just with three samples, as configured in the JSON file. Here's the code . One thing that hasn't shown up while running this code locally is that our training loop has this: With one GPU, that's fine, but on a multi-GPU machine, that is going to happen in all of our per-GPU processes -- so they'll all be spamming out progress bars, which will be ugly. So, as a first cut: Now, in order to compare different machines (say, an 8x H100 vs an 8x A100) it would be nice to get tokens-per-second numbers while training. We can do that in the progress bar too! It has a method that adds stuff to the end of the bar, just after the elapsed time and iterations/second numbers. For that, we'll need to have the object available in a variable: ...and now we can count the total tokens seen in the training run, plus keep track of the start time -- just before the start of the training loop: ...then inside, after the training step: That will give us a running average of tokens per second over the train as a whole since the start. Running that, we get a nice progress bar like this (you'll need to scroll to the right): Note that the tokens per second is worse than the just less than 20k that we got when running the single-GPU test previously, but that's due to the testing setup I have -- I'm doing an eval every 10 global steps. Changing that to 1,000,000 so that we just get a single eval when we start, then letting it run for a while to settle down from the initial eval, we get this: ...which is close enough to what we had before. Finally, let's print out some summary information at the end: Ran that on a super-short train with about 50 iterations-worth of tokens, and: Looking good. Here's the code . 
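For reference, the tokens-per-second readout boils down to something like this -- a sketch using tqdm's postfix mechanism, with placeholder variable names rather than the repo's:

```python
import time
from tqdm import tqdm

# Only rank 0 shows a progress bar; the other processes create a disabled one.
progress = tqdm(range(initial_global_step, total_global_steps),
                disable=(rank != 0))
tokens_seen = 0
start_time = time.time()

for global_step in progress:
    # ...the usual forward/backward/optimiser step goes here...

    # Every process trains on micro_batch_size sequences of sequence_length
    # tokens per step, so the run as a whole gets through one global batch.
    tokens_seen += micro_batch_size * sequence_length * world_size
    if rank == 0:
        tokens_per_second = tokens_seen / (time.time() - start_time)
        progress.set_postfix_str(f"{tokens_per_second:,.0f} tok/s")
```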
I think we now have something where it's worth spinning up a Lambda Labs machine to run it. Let's kick off a training run on the cheapest two-GPU machine that they have available right now. That's actually not all that cheap, it's a $6.38/hour 2x H100 80 GiB SXM5. But I'm not planning to do a full train on it yet, this is just a sanity test. I won't attach a filesystem this time, either -- let's see how things go without the caching of the datasets that I was considering. First thing: do we have ? Nope. OK, let's install it: Right, now let's clone our repo and set up our environment: And now I think we can just try running it! It took 18 seconds to download the dataset! I don't think we need to worry about the caching thing with persistent disks, at least at this point. But there are a couple of issues here. I didn't put the number of processes in the command line -- I should be using Also, we don't have the XKCD font family. I'll ignore that for now. OK, that's looking good! Let's make our validations happen less often, and see how high we can get the micro-batches with the 80 GiB VRAM we have on each of our two GPUs. Doing a binary chop, I set the micro-batch size to 100 (OOM), then to 50 (OOM), then to 25 (worked), then to 37 (OOM), then 31 (OOM), then 28 (worked), and finally 29 (OOM). So we have a batch size of 28 for our 80 GiB machines. Leaving it for a little while to settle down, and we get to about 142,000 tokens/second. Now, on the 3090, we were training at 20,000 tokens/second. That means that this machine is running at about 7 times the speed. Given that our original train finished in 48 hours, we'd expect the train to finish in about 6, which indeed is the estimated time on the tqdm progress bar. At $6.38 per hour, that comes to $38.28. Not bad! And this instance is actually quite pricey on a per-GPU basis -- it's $3.19 per GPU/hour, whereas there is an 8x H100 that costs $2.99 per GPU/hour. I'm almost tempted to let it run. But the purpose of this run was to work out the bugs. We're going to want to track the training chart -- remember that after every validation run, our training code generates a chart showing the training and validation loss so far, like this one . I ran the normal quick-and-dirty Python webserver command on the instance, inside the directory containing the training chart: My browser didn't connect to it, but looking at the Lambda Labs interface, there's a new "Firewall" section, where you configure rules for allowing incoming connections to your instances. That's sensible, and the default rules are just "allow SSH from any IP" and "allow ping from any IP". Adding one letting anyone access port 8000 fixed the problem, and I saw a directory listing; clicking on the chart showed exactly what I'd expect, but without the XKCD fonts. Nice. Let's work out how to fix that XKCD font thing. Looking around, it seems like there are approximately twenty thousand ways to do it. Here's one that seems to work; firstly, install the font on the system: Now, that installs a font that has the family name 'xkcd Script` (with that erratic capitalisation). So we need to change the code to pick up pretty much anything that looks like it's XKCD, so instead of this: ...we can do this: That seems to work OK. So, now, I think we have the beginnings of a script to set up a Lambda Labs machine so that we can use it. Let's write a with this: ...and give it another go on a fresh machine. Shut this one down -- total cost so far $7.28. Now there are no 2-GPU instances available. 
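(For reference, the font-matching part of that fix boils down to something like this -- a sketch; the chart code in the repo wires it into the rcParams slightly differently:)

```python
import matplotlib.font_manager as font_manager
import matplotlib.pyplot as plt

# Find every installed font whose family name mentions "xkcd", whatever the
# capitalisation -- so the newly-installed "xkcd Script" family will match.
xkcd_families = sorted({
    font.name
    for font in font_manager.fontManager.ttflist
    if "xkcd" in font.name.lower()
})

if xkcd_families:
    plt.rcParams["font.family"] = xkcd_families
```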
There is a super-cheap 1x A10 (basically the datacenter version of a 3090), though, so let's use that -- we're as certain as we can be that the multi-GPU stuff works, and the proof of the pudding will be whether we can train a model that works. After spinning up our 1x A10 machine: Looking good! I think we have something that (in theory) should work. That cost $0.05. I think it's time to do our first train on a big instance. There are four 8x instances available on Lambda Labs for me right now: I think I'm going to want to train on all of those, to try to work out some kind of metric (dollars per megatoken?) to compare them. But let's start with something reasonably low-end -- in fact, let's try the cheapest, and see what happens. Spin one up, and first thing; after the setup, we need to work out the micro-batch size. Last time we used 28, but this machine has GPUs with half as much VRAM. I did a binary chop again... it turns out to be 13. Now let's think about validation frequency. Let's try to get a feel for how long it will take. We can set the eval batches to (say) 100, so that we can see how fast evals are, but also set the interval to 10,000,000 so that it never does one after the first. It took 11 seconds to run 100 validation batches, and after a few minutes, it settles down at 254,000 tokens/second or so, and is estimating 3h15m to completion. Nice! The cards are an earlier generation to the H100s we used in the two-GPU test, so they're slower, and they have half the VRAM. So eight of them are, working together, about twice as fast as two H100s. Doesn't sound completely crazy. So, in our local train, we spent 5 minutes evaluating every 30 minutes. So our eval time was 16% of our train time. Probably a bit high, but let's run with it. If we're going to take 3 hours training time, then 16% of that is about 28 minutes. Previously we did about 88 evals (44 hours train time, with an eval after each half hour). That seems a bit too high. So let's say that we want to do 50 evals. 28 minutes eval time in total, with 50 of them, means about 30 seconds per eval. If 100 eval batches take 11 seconds, let's approximate it to 300 eval batches. As to the interval between them -- if we want to do 50 over 3h15m, or 195 minutes, then that's one every (let's approximate) 4 minutes. We seem to have settled down to 2.57 iterations per second, so that's about every 617 iterations. Let's bake those in and let it rip. After the run: OK, let's download everything. Looking at the checkpoints, the latest (that is, the last one at the end of the training) and best (the checkpoint that had the lowest validation loss) are the same one, meaning that validation loss kept falling consistently: So let's just download using the "best" symlink to get the weights for that checkpoint: And now we can shut the cloud machine down. Now that the clock is no longer ticking and we aren't spending money on an unused machine, here's the training chart: It looks like we had a couple of gradient spikes there. I'm going to add some gradient clipping code at some point, but I think I'll hold off for a little bit -- I want to do a few cloud trains first to work out the best instance sizes to use, and only then start exploring the possibilities for making the models better. Apart from that, it looks pretty normal. Looking at the billing page on Lambda Labs, that machine was up for about 4 hours and 35 minutes, costing US$10.32 per hour, for a total cost of US$47.35. 
Of that 4h35m, the actual training run took 13,904 seconds, or about 3h52m -- somewhat more than the 3h15m that was predicted at the start of the run. The validation will have accounted for most of that -- we did 50 evals, at 30 seconds each, so that's 25 minutes. That means that 3h40m is accounted for, and the remainder can just be chalked up to noise, I guess. That leads to one question: do we actually need to be doing validation for these trains? I've been doing validation loops in these trains largely out of habit -- when you're training an ML model, it's just "what you do". The reason you'd normally hold out a validation set is simple: if you're training over multiple epochs, then eventually your model is going to start overfitting to the training data 2 . You validate as you go along so that you can spot any points where, while the training loss continues to drop, the validation loss -- which is loss on data that the model hasn't been trained on -- starts rising. That's the classic indicator of overfitting. But for these models we're not doing multiple epochs -- we're just training through a stream of constantly new tokens. So, in fact, there's no real difference between the training data and the validation data, apart from the fact that the validation data is constant. From the model's perspective, it's all new stuff (modulo any repetitions in the dataset, which is possible but I think not likely to be super-common in something as curated as FineWeb). Now, in this post I'm aiming to identify the best options for training in the cloud -- cost in terms of dollars and time. I don't want to change the model itself or the training strategy because I want whatever I come up with to be roughly equivalent to the models I trained on my own machine. Exploring enhancements is for the next post. (Of course, given that the batch size is one of the levers I want to experiment with, and training on larger machines already means that I'm doing micro-batches larger than the batch size of 6 that I used locally, and then the overall batches are 8 times larger, that's not quite true.) Validation, however, doesn't actually affect the training runs in any direct way. I could in theory remove it. However, that is a relatively large change to the code, as I've kind of linked it in with my checkpointing code. I think that what I'll do for now is leave it in. Validation will scale at the same rate as training (so long as I leave the eval batches constant) so leaving it there will give me a clean comparison between machine types. And I can keep notes on how much time was spent on validation for each train so that I can subtract it from the total time if that proves useful. However, when I start tweaking the training code with changes beyond the batch size, I should probably try removing validation first. Anyway, while validation during the training run might not be important, evaluating the model at the end and seeing how it compares to others is! Let's do that next. There were two important post-train evals that I did on the models that I trained locally: There was also a simple smoke test -- how does the model predict that the phrase ... should continue? I should do the same three tests here. A simple autoregressive generation script is easy enough to knock together, and: All we're looking for here is basic coherency, and I think this is good enough to pass that filter. Next, the loss-style testing. What I think I want to be able to do here is just take a file and run an eval against a standard dataset.
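The core of that is simple enough: load the weights, then measure average next-token cross-entropy over a slice of held-out tokens. Here's a hedged sketch -- GPTModel, the tensor layout, and the file names are stand-ins for whatever the repo actually uses:

```python
import torch
import torch.nn.functional as F
from safetensors.torch import load_file

def evaluate_loss(model, tokens, batch_size, seq_len, device):
    """Mean next-token cross-entropy over consecutive chunks of `tokens`.

    `tokens` is assumed to be a 1-D LongTensor of GPT-2 token ids.
    """
    model.to(device)
    model.eval()
    tokens_per_batch = batch_size * seq_len
    num_batches = (len(tokens) - 1) // tokens_per_batch
    losses = []
    with torch.no_grad():
        for b in range(num_batches):
            start = b * tokens_per_batch
            chunk = tokens[start:start + tokens_per_batch + 1].to(device)
            x = chunk[:-1].view(batch_size, seq_len)   # inputs
            y = chunk[1:].view(batch_size, seq_len)    # targets, shifted by one
            logits = model(x)                          # (batch, seq, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
            losses.append(loss.item())
    return sum(losses) / len(losses)

# Hypothetical usage -- GPTModel, config and the filename are stand-ins:
# model = GPTModel(config)
# model.load_state_dict(load_file("model.safetensors"))
# print(evaluate_loss(model, eval_tokens, batch_size=6, seq_len=1024, device="cuda"))
```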
I did not generate my own test set, but I did generate a much-larger-than-necessary eval set, 1% of both FineWeb and FineWeb-Edu -- that's 100 million tokens or so in both cases. In the validation that I was doing during the train just now, I did 300 batches of 1,024 tokens with a micro-batch size of 13. That only ran on the rank 0 process, so that's about 300 × 13 × 1,024 ≈ 4 million tokens. Not even 4% of the validation data. Now, for the local eval, I think it makes sense to make it run for about five minutes -- that's just for my own convenience, I don't want to spend very long -- and I know from the previous local train that I can do 3,200 batches of six 1,024-token sequences in that time: So, somewhat arbitrarily, let's use the 19,660,800 tokens starting at position 50,000,000 in the FineWeb validation dataset for our tests -- they'll never be used for training or validation during the training loop. It's kind of a hack, but it'll do for now. Here's the code. It should be easy enough to understand; it did require one tweak to our existing function, though: Originally, that function worked out the actual number of tokens to use by working out the size of each global batch, dividing our requested minimum number of tokens by that size and taking the floor, adding on one, then multiplying that by the global batch size. That works fine in cases where the requested minimum is not a multiple of the global batch size -- it gives us a round number of batches that contains at least that many tokens. But if it is already a multiple of the global batch size, it gives us an extra batch at the end. So I added a special case to avoid that. Anyway, running that gives us a loss: That's actually quite a lot lower than we were seeing with the locally-trained models on the test dataset I was using then -- but, of course, it's a different dataset so it's not strictly comparable. Let's run the same test against them: That's really interesting! Those numbers are really close to the numbers I got in the last post. That does make some kind of sense, though -- while the numbers aren't strictly comparable, as I said, both the dataset that I was using then and the one I'm using now are essentially random stuff from FineWeb, so I guess they must be more similar than I thought. But, importantly, the loss on the newly-trained model is much lower -- 3.674 rather than > 3.9 for all three of the older locally-trained models. Now, the only big difference between this training run and the ones that I did locally is the batch size. As I said in the last post, while I felt that the difference between my batch size of six and the (reported) batch size of 512 for the original GPT-2 was the least-likely cause of the differences in the results, Gemini told me that it thought it was the most likely cause. It looks like Gemini (and, I should note, people on Hacker News) might have been right! Batch size is super-important. Let's do the same eval with the OpenAI weights. I wrote a quick script (in my old 'LLM from scratch' repo, which has the code used in the book) to load up the GPT-2 weights and save them as a safetensors file. When I ran that, I got an interesting error: That was easy enough to fix; in the book's code we assign the weights that have been loaded from the OpenAI TensorFlow checkpoint files with a function that looks like this: Just adding one extra call to the last line fixed the error: ...and as a result, I had safetensors files for the original OpenAI models: So now we can run our test against them: Excellent.
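As an aside, that rounding logic (special case included) boils down to something like this -- a sketch; the real function's name and signature are different:

```python
def round_up_to_global_batches(min_tokens: int, micro_batch_size: int,
                               sequence_length: int, world_size: int) -> int:
    """Smallest whole number of global batches covering at least min_tokens."""
    global_batch_tokens = micro_batch_size * sequence_length * world_size
    if min_tokens % global_batch_tokens == 0:
        # Already a whole number of global batches -- no need to round up.
        return min_tokens
    # Round down to whole batches, then add one more to get past min_tokens.
    return ((min_tokens // global_batch_tokens) + 1) * global_batch_tokens
```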
Let's start putting together a table of these results: That's pretty amazing. Having a batch size of 13 micro-batches over eight GPUs, or 104 in total, seems to have massively improved the model -- it's much closer to the original weights. It will be interesting to see whether I get further improvements when I move to the larger machines, which (due to having more VRAM) will have larger possible micro-batches, so we'll get larger global batch sizes. It certainly makes me think that I could have got much better results locally by using gradient accumulation, which would mimic the effects of a larger batch size by running multiple smaller batches through, without doing an optimiser step each time, then doing one big update once enough has gone through. But all of that is for another day. Let's try the instruction fine-tuning test now. I decided to pretty much re-use my adapted version of the code from the book; that meant that I was borrowing quite a lot of Raschka's code, which he has released under the Apache 2 license. I normally use the MIT license for my code, but I'm not married to it, so I relicensed the whole repo as Apache 2 with some specific headers to say which parts came from "Build a Large Language Model (from Scratch)", and added this code. It downloads the Alpaca dataset from the site for the book, splits it into train/validation/test splits, trains on the training set, evaluating each epoch and bailing out (and restoring the previous epoch's weights) when validation loss starts rising, then runs through the test set generating responses, and sends them all off to the OpenAI API for GPT-5.1 to judge them. Running it against our new model gets a score of 17.09. Let's try the various other models and build out our table: Interesting! In the last run, I found the instruction fine-tune numbers came out as FineWeb-Edu extended > FineWeb > FineWeb-Edu, but here we have FineWeb-Edu > FineWeb > FineWeb-Edu extended -- exactly the opposite! I do have to wonder, though, how precise a measure this is. While the training should be fairly consistent (though I don't have a random seed in there to enforce it), the fact that we're using an LLM as a judge means that there is an element of randomness coming in here. Indeed, I re-ran the test for the FineWeb-Edu extended train, just to see what I got, and it came up with an even-worse 12.12. So I don't think we can read a huge amount into these numbers -- well, unless we can get the numbers significantly up. While it looks like a 2.5-point difference might just be randomness, I doubt that a 10-point difference could be. I think we've done the tests that we need for this model now, and we have a testing procedure in place. So let's train some further models on different instance sizes, and gather numbers. This is the biggest machine available on Lambda Labs right now, and is only sporadically available; one happens to be there now, so let's give it a go. First, we need to create the runs/8xb200m160 directory, initially with a training config that is a clone of the one I did for the last train, then spin up the machine. As before, we need to log in, clone the repo, then in it run the setup script, install the dependencies, and try to run the training script: It crapped out because there was no datasets directory, which is an annoyance. We should create it if it doesn't exist. Create the directory, and run it again. It took a while to download the dataset, because every per-GPU process downloads it separately.
That only took a minute or two, but it was a waste of time; I think we should only download it from the rank 0 process with some barriers to make the other processes pause. Next, we need to do a binary chop on the micro-batch size, starting with a low of 13 (which I know will be fine because it worked on the 40 GiB GPUs that we used last time), and a high of 100 (fairly random, just something I'm pretty sure will fail). While doing that, a few things are standing out, both to do with validation. When the script starts, it does one training iteration, then goes straight into validation. Then it starts the training run proper. However: We're going to need to work out some kind of fix for that, because it's taken me 17 minutes from spinning up the machine to getting a size for our micro-batches -- which happens to be 64. On a machine that costs US$39.92/hour, that's an expensive test! We'll look into that later. Anyway, a batch size of 64 is pretty neat, as with 8 GPUs, that means we have a global batch size of 512 -- exactly the same as in the original GPT-2 paper! So, let's kick off the train. It takes about 7 minutes to get to the first checkpoint, at which point it's averaging 801,221 tokens/second. That pattern repeats, and with about one minute to do the validation, we're spending about 12.5% of the time on this machine validating. Hmm. A further indication that we might want to remove the validation stuff if it's not adding on any value. Eventually, it finishes: So, that's 1h9m50s. The final validation loss is not as good as the previous run on the 8x A100 40 GiB machine, where we got down to 3.675. Given that we're using the same validation dataset as the previous, that's meaningful: this is not as good a model, it seems. Again, latest and best checkpoints are the same one: So we can download everything: ...and here's the training chart: OK, so that's smoother than the last one -- no loss spikes. Maybe the larger batch size smoothed them? Let's think a bit about the cost of this train. From Lambda Labs, we had that machine running for a little over 1h30m. At US$39.92/hour, the total cost was US$60.25. Yikes. So, knocking off the 1h10 or so for the train, we have 20m to allow for -- which matches up quite well to the 17 minutes of fiddling with batch sizes, and then 3 minutes to download all of the files. If this blog post isn't going to cost significantly more than it needs to, we need to get that down. Of the US$60.25, just over US$13 was spent on identifying the batch size. Only US$46.57 was spent on the train itself. We also did 11 validation runs as part of that; at a minute each, those cost US$7.32. So, excluding validation, we're below US$40 for the train. Now, let's run our tests. First, the smoke test: we get this: "...on all other website for..." is a bit rubbish. Still, on to the loss: That's in line with the training loss -- worse than the loss I got with the one trained on the smaller machine, with its corresponding smaller batch size, but still better than any of our local trains. Still interesting, though -- larger batches are not guaranteed to get bigger results. More investigation needed there! On to the instruction fine-tuning test. That gives us a score of 13.89 -- the worst that we've seen yet! I think I'll put together a full table including these results later; I want to try training on some other, differently sized machines first, and we can aggregate the results at the end. 
But before we do that, let's make some changes to the scripts to fix some of those QoL issues we encountered in that last train. The first irritation was that it errored out saying that the datasets directory was not a directory when it didn't exist. The script takes a datasets directory as one of its command-line options, and it's reasonable that it checks that it really is a directory (rather than, say, a file or a symlink): ...but if it doesn't exist, it might as well create it first. Now, I could just put this before the check: ...but remember, this code is run by multiple processes -- so they could easily trip over a race condition here. What I want is to have just one of them do this; I've deemed the rank 0 process the "special" one for validation, printing the progress bar, and so on, so we may as well treat it that way here. But -- there's a difference! Rank zero is the one that should be printing stuff out, it's true. And right now, we only have one node participating in this train. But I do want to avoid simple errors that would make it hard to run multi-node in the future. Now, if we have multiple nodes, then each one will have its own filesystem (unless we're using NFS or something like that), so we'll need a separate "datasets" directory for all of them. What we want is to do these checks on one process on each node. Usefully, we have the local rank variable that is defined earlier in the setup code, which is per-node. Again, let's imagine we have two nodes with two GPUs each. Node 0 might be running the processes with global rank 0 and 1, and node 1 might have global ranks 2 and 3. On node 0, the processes would have local ranks 0 and 1 respectively, but on node 1, they'd also be local ranks 0 and 1. So, the full code becomes this: Note the barrier; we don't want the other processes to check whether it is a directory until the local rank 0 process has had a chance to create it. (Of course, if we were running this on a setup where all of the nodes shared a filesystem, it wouldn't work -- in that case we'd want to use the global rank that we can get from torch.distributed instead. But we can burn that bridge if we ever come to it ;-) Phew, that was a bit more work than I expected! But it sets us up nicely for the next QoL fix on my to-do list. I don't like the fact that every process downloaded the whole dataset. The downloads actually handled it pretty gracefully -- none of the processes tripped over any of the others. Indeed, it looks like there was some kind of global queueing going on, so they downloaded it one after the other. But it did take time -- maybe a minute or two in total, and with the clock ticking on that ~US$40/hour machine, that felt a bit stress-inducing. So: I think it would be best to only do that from the rank 0 process as well. The code that downloads the dataset is just after the bit we've been looking at: ...and looks like this: Now, the docs for the download function say that the relevant parameter is: If provided, the downloaded files will be placed under this directory. ...and the return value is this: We happen to be passing in a Path object for that parameter, and we're not in dry-run mode (the default). So all we're doing by returning that wrapped in a Path object is a slightly indirect way of returning the path that we're passing in. For tidiness, I really want to gate the download call with the same rank stuff as we did for the directory creation. So, let's change the setup so that the function takes the path to the directory where we want this specific dataset to be, not the generic "all datasets" directory.
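For the record, the shape of that directory check is something like this -- a sketch with a placeholder datasets_dir variable, assuming the process group has already been initialised:

```python
import os
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # per-node rank, set by torchrun

# One process per node creates the directory; on a multi-node setup without a
# shared filesystem, each node's local rank 0 does it for that node.
if local_rank == 0:
    os.makedirs(datasets_dir, exist_ok=True)

# Nobody looks at the directory until it definitely exists.
dist.barrier()

if not os.path.isdir(datasets_dir):
    raise SystemExit(f"{datasets_dir} is not a directory")
```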
And given that we're now passing this specific dataset path into the function, we don't need to return it: Now it's just a wrapper around a single call to the HF hub download function, which I'm not entirely sure about (it's a code smell that I'm probably creating an unnecessary level of abstraction) but I think I'm happiest leaving it that way for now, as it does hide away a bit of messiness in the HF hub API. 3 That means that we can now combine the directory-checking logic that we fixed above with download-on-local-rank-zero-only code like this: Here's the updated code with those fixes. Now, let's move on to validation. I'm increasingly of the opinion that the validation steps are just adding to the cost without much in the way of benefit. Additionally, the validation is taking a different amount of time for each batch size, and happens a different number of times in each train -- remember, it's a fixed number of batches every so-many global steps, and the batch size varies based on the micro-batch size, which is different for different amounts of GPU VRAM, and the total number of global steps in a train also varies based on the size of each batch. So that means that if we want to compare apples to apples in any final comparison of the time and money cost of training models on different kinds of Lambda Labs machines, we'll want to exclude the validation cost -- once we've settled on a machine type, we're going to want to fine-tune the validation size for that in much more detail than I have to date, assuming we don't drop it entirely. However: I'm loath to make such a fundamental change halfway through this comparison. It's tightly coupled to the checkpointing code, and the charting code, and so on. So I think that for this post, I'm just going to keep it there, and keep track of how much time (roughly) we're spending on each validation step for each train, so that we can remove it and get a "pure" train-time only comparison between the different kinds of machines. It's not pretty, but I think it's better than changing horses mid-stream. On the other hand, the validation is a real pain when doing the binary chop to find out the maximum micro-batch size for our VRAM before we start the training run. That's because we have to wait for one validation to run before we get into the full training loop, which makes it slower. On top of that, having to do a manual binary chop is a PITA. What I think would be a true QoL improvement for the future trains is something that does the binary chop for us, using a dummy training loop. We run it once on each new machine type, get a micro-batch size to plug into our training parameters, and then let it rip. This will re-use so much of the code from the training script that I think it actually is just an alternative way of running it. After a bit of hacking, I came up with this updated code -- the diff is a bit hairy, but essentially: That takes just over six seconds to find the correct batch size on my local machine; with multiple GPUs, I expect it will be slower (there's a spinup overhead to start all of the per-GPU processes), but I'm sure it won't be as bad as the manual binary chops with validation that I was doing, and will be less error-prone. (I'll sketch the heart of that chop in a moment.) Right! We've done some QoL stuff, let's try another machine size on Lambda Labs :-) These are the machines that Andrej Karpathy is recommending for training nanochat; they cost US$23.92/hour, so let's see how we do with them.
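Here's the promised sketch of the heart of that chop -- run_training is a stand-in for the repo's actual training entry point with the new "no validation, only a few steps" flags:

```python
import torch

def batch_size_fits(micro_batch_size: int) -> bool:
    """Run a few dummy training steps and report whether we hit an OOM."""
    try:
        run_training(micro_batch_size=micro_batch_size, max_steps=3, validate=False)
        return True
    except RuntimeError as e:
        # PyTorch raises a plain RuntimeError for CUDA OOMs, so sniff the message.
        if "out of memory" in str(e).lower():
            torch.cuda.empty_cache()
            return False
        raise

def find_max_micro_batch_size(low: int = 1, high: int = 70) -> int:
    """Binary chop between a size known to fit and one known not to."""
    assert batch_size_fits(low), "even a micro-batch size of 1 doesn't fit!"
    assert not batch_size_fits(high), "the upper bound fits -- raise it"
    while high - low > 1:
        mid = (low + high) // 2
        if batch_size_fits(mid):
            low = mid
        else:
            high = mid
    return low
```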
Here are the setup steps on the new machine: Now let's download our dataset and find our micro-batch size: That took less than a minute to run -- nice! Now we can put that micro-batch size in the training config. It does seem a little small -- after all, we could fit a batch of 64 into 160 GiB -- but I'll do some analysis later. Actually, before we kick off the train, let's see how long all of the preparatory steps took to run before we can do that -- not just the micro-batch-size script, but also the installation of the dependencies, the clone, and any overhead from boot time, etc.: Five minutes total. Not bad. Let's start the train: The initial validation run took 38 seconds, and then we started off. At 4m37s in, we get the first real validation run; at that point, it's running at 493k tokens/second. Eventually, it finishes, having taken about 1h50m including all of the validations. Here's the training chart: Two things stand out here: Further evidence that gradient clipping is likely to be an excellent addition to our training loop! It's also worth noting that the train loss spikes at the same time as the validation loss, so getting rid of the latter would still allow us to get a "best" checkpoint to compare with the latest at the end of the train. The machine was up and running for 2h9m, costing US$23.92/hour, for a total cost of US$51.47. The train took 6,650.197 seconds, so about 1h50m. Allowing for five minutes setup time, that's 1h55m accounted for. There's an extra 14m there -- that was because downloading those two checkpoints to my machine took quite a long time due to local network issues. Might want to look into ways to avoid that later. And for later cost-accounting purposes, we should note that it took 38 seconds or so for each validation run, and we can see on the chart that there were 24 of them. So, firstly, let's give our two models -- the best one and the latest one -- a smoke test: Both of those look OK! Now let's try the loss test. I started running it, but when it started downloading the dataset, I realised that it needed updating to allow for the changes I made to the dataset-downloading code -- oops! That done, let's give it a run for both of our models: As you'd expect, the best checkpoint has somewhat better loss, at 3.725, than the last one, with 3.734. Once again, better than our local trains, but not quite as good as the result with the first cloud train on that 8x A100 40 GiB machine, which was 3.674. Again, I'll put together a table comparing all of these results at the end. Does that make any real difference with the instruction fine-tune test? The test prints a lot out, but the headline numbers: So that was interesting! However, I am getting ever less convinced that the IFT test is a useful one; the randomness of the LLM-as-a-judge responses means that I don't think it can be consistent. Perhaps a better way to do this would be to batch up all of the models, and then give GPT-5.1 answers from "model A", "model B", and so on all in one query, and then to ask it to give them scores all at the same time. That would hopefully make things at least a bit more consistent. Something to ponder later, I think. In the meantime, one extra thing I wanted to dig into before going on to the last train for this post: I mentioned that I thought that the batch size for that last run, 27, was a bit small considering that we'd managed to fit a size of 64 into the 160 GiB/GPU machine.
But after thinking about it for a bit, it occurs to me that during my experiments doing fine-tuning, I came to the conclusion that memory use scaled linearly with batch size , with a fixed amount per element in the batch (the activations for the model for that batch element), plus an overhead (the model itself, the optimiser, and perhaps other stuff). We have batch sizes for: Now, that is slightly messy data because each memory "measurement" is the size of the card's VRAM, not the amount of VRAM we actually used -- there might have been anything from zero to just less than one extra batch element's worth of "spare" space -- but we can see what we get with a simple linear regression: And if we plot that, we get this: Nice! That fits really well. So we have an overhead of about 11.5 GiB, then about 2.35 GiB per batch element on top of that. That is, of course, somewhat sad news for anyone trying to repro this on a GPU with 12 GiB -- looks like it would be just too small to even fit in a single-element batch after the overhead :-( Anyway, that's been a bit of a side quest. Let's try our last machine size for what has (once again) turned into a bit of a monster of a blog post... This is the same kind of instance as the first train in this post, except that it has double the VRAM per GPU. Let's see what we can do with it. Once again, we create the run file, commit and push, then spin up the machine. On it, we clone the repo, run then . Next, we can find our micro-batch size: Interesting, we managed to squeeze an extra one in compared to the H100's batch size of 27, despite having exactly the same amount of VRAM! Not sure what might have caused that. It took 4 minutes to get to this point, so let's get that batch size into the config and kick off the run. The initial validation takes 1m06s, which is consistent throughout the train. The first real val run at 8m15s in, and the estimated train time is 2h35m, with a tokens-per-second of 286,188. At the end: Again, the latest and the best global steps are the same (despite some loss spikes): ...so we just need to download that and shut down the machine. How much did that cost us? The machine was running for 3h25m, costing US$14.32 / hour, for a total of US$48.76. Our train took 11,532 seconds, which is 3h12m, and our setup took about 4 minutes -- maybe five including the time required to update the train config with the micro-batch size, so we have 7 minutes on top of that, which is about the amount of time it took to download the model. Let's run some evals! Our smoke test gives us this: Coherent enough, I think! Now the loss on our test dataset; it comes out as 3.730, so pretty similar to our other cloud trains, apart from the oddly-low one on the 40 GiB GPUs. Now let's see what GPT-5.1 thinks of the instruction fine-tuned version. It only needs two epochs of fine-tuning, and believes that "The author of 'Pride and Prejudice' is 'Pride and Prejudice'", which is not promising, and gets a score in the same kind of range as the other models, 11.71. So: we've trained four models on four different machine sizes. Let's see how they stack up against each other, against our locally-trained models, and the original OpenAI GPT-2 weights. So, I've trained four of my 163M-parameter GPT-2 models, using almost exactly the same dataset -- the Chinchilla-optimal number of tokens, rounded up to make an even number of batches. 
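(One quick footnote to that VRAM side quest before the comparison: the regression really is a one-liner with NumPy. The numbers below are the micro-batch sizes and VRAM figures from the trains above:)

```python
import numpy as np

micro_batch_sizes = np.array([6, 13, 27, 64])   # what fitted on each card
vram_gib = np.array([24, 40, 80, 160])          # that card's VRAM

# Degree-1 polynomial fit: VRAM ~= per_element * batch_size + overhead
per_element, overhead = np.polyfit(micro_batch_sizes, vram_gib, 1)
print(f"~{overhead:.1f} GiB overhead + ~{per_element:.2f} GiB per batch element")
# -> roughly 11.5 GiB overhead plus 2.35 GiB per batch element
```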
I trained them on four different multi-GPU machines on Lambda Labs: I've done some evals on each of the models, so let's put those results together in one table -- results for the trains in this blog post, alongside those for the original OpenAI GPT-2 weights, both small and medium, and for the models I got when training locally. For all models, I've provided: I've sorted the models in order of increasing loss on the test set -- so, the best model by that measure is first. The instruction fine-tune results are kind of all over the place, and I'll look into that later 5 . For now, let's focus on the test loss. We have a pretty clear pattern, where the local trains are grouped together at around 4.0, and the cloud trains at around 3.7. For the local trains, as I noticed last time around, FineWeb is counter-intuitively better than FineWeb-Edu. There are two interesting things about the cloud trains: I think that what we're seeing here is that larger batches are better, but only up to a point. It's as if there's some kind of curve like this: I got that by taking the log of the batch size, then asking NumPy to do a polynomial regression -- that is, work out a, b and c so that the formula loss ≈ a·(log B)² + b·(log B) + c, where B is the global batch size, fits it as well as possible: It's kind of interesting that it's such a good fit with such an ad-hoc formula! We have a nice smooth curve hitting almost all of the points, and our optimal batch size looks like it's just a little below that 104 we managed with the smaller cloud machine, at about 97. But it's certainly not something that I'd like to read too much into. Best to treat it as purely illustrative: "it might be something like this". I think digging into that might be an interesting experiment at some later point. A bit of checking around the Internet (and a chat with ChatGPT) suggests that it's something people have looked into in some detail, unsurprisingly. An interesting point ChatGPT raised is that with our pretty much fixed "budget" of tokens -- we're always training on something close to the Chinchilla-optimal number -- then a larger batch size means that we're doing fewer optimiser steps. Intuitively, that sounds like a problem. The larger batches mean that each move across the loss landscape is "better", or at least more stable. But we're doing fewer of those moves over the course of the train. There's obviously a tension between those two. You can imagine a degenerate case where the batch is so large you can fit the entire run into one iteration, so you do just one update of the parameters; that obviously wouldn't work very well. Anyway, for the purposes of this post, let's flag it as interesting and move on. Let's take a look at costs. Here's another table for those -- for each cloud model, I've listed: What do these numbers tell us, given what we were trying to do here? Like I said at the start, this was a pretty expensive learning experience: I wound up spending US$215.16 on Lambda Labs instances over the course of putting this all together. But it was worth it! At the start of this post (if you can remember so far back), I said I wanted to achieve two things: Yes, absolutely. The trains I did, if we exclude the validation time, each cost between US$35.56 and US$39.14. In time, also excluding validation, the slowest ran for about 3h25m, and the fastest just less than an hour.
Now, in a future post I want to try making the changes that I listed at the end of my last post to see if I can get the loss lower: If I'm to do those, what I'll need to do is start with a baseline train on one particular size of machine, and then try introducing each change separately to see what happens to loss. I'll want to use a fixed seed for random number generation, so that I start with the same initial weights each time. Given what these experiments have already shown about loss -- that the smallest, cheapest machine has better loss than the other more expensive ones due to what I assume is the batch size -- then that actually feels like exactly the right machine to choose for this. It does take a while to train anything, but three and a half hours is pretty acceptable, I think -- I can do a train or two per day. An 8x A100 with 40 GiB VRAM per GPU is the way forward. So: next steps. I want to: This is going to be fun. Stay tuned! I erroneously called this a "mini-batch" in earlier versions of this post and in the code -- fixed in this commit . The code in this post reflects the correct terminology, but if you follow the links to the earlier versions you will, of course, see the mistaken name.  ↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩ I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right. DataParallel (DP). With this: The default GPU (normally ) is in charge of the process. It gets a batch of data, divides it up into per-GPU "micro-batches", and sends each of those to a thread for each of the other GPUs. It then sends an up-to-date version of the model to each GPU. Next, all of the per-GPU threads do a forward pass on their replica using their specific micro-batch, and send their outputs to the thread for the default GPU. The default GPU thread aggregates all of those outputs (similarly to how the losses across all of our batches and the prefix sequences are aggregated in the normal single-GPU case ) to work out an overall loss. It then does a backward pass. This will start on the default GPU, as the aggregation step is the first thing that it will come to when going backwards through the steps that came up with that overall loss. However, it will then come to operations that happened on the other GPUs and those are (somehow) parallelised. Once that is done, each GPU has gradients that represent how their copies of the model contributed to the overall loss. 
Finally, they send those gradients back to the default GPU, which combines them (I think of this as just being an average, though I gather it's more complex) and applies them, producing an updated model. Then the process repeats; the updated model on the default GPU will be sent to the other GPUs in the second step of the next iteration. DistributedDataParallel (DDP). This does less work on the default GPU and does less copying around. Each GPU has its own process (rather than thread), and is essentially responsible for its own training loop. Right at the very start, the default GPU's process sends the model to all of the others. Then all processes go into their training loop: Firstly, each one works out its own micro-batch (which means you need to have code to make sure that the datasets are properly split across the GPUs) Each model does its own forward pass, then its own backward pass, working out its own independent gradients. As it comes up with those gradients, it broadcasts them to a "reducer", which handles the aggregation. This is done in a distributed way -- there's not just one reducer handling everything. When all models have completed the backward pass, the reducer has a set of combined gradients, which is visible from the per-GPU processes. Each GPU process does its own optimizer step using those combined gradients. That means that there's no model copy required -- each GPU has applied the same gradient update, so they already have in-sync models, assuming everything went well. ZeRO. This is a much more complex system, and I went into how it works in this blog post . , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) = 0 for the process with rank 0 = 1 for the process with rank 1 = 7 for the process with rank 7 = 8 for the process with rank 0 = 9 for the process with rank 1 = 15 for the process with rank 7 Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack" But if is set to 2, which it happened to be in my case, then it will silently fail -- our first eval loop will get the first X from the validation set as , and the second X as . Zoom through the records in the dataset in batches of 1,000. For each batch: Tokenising each batch, so we get a list of lists of tokens. Convert that list of lists into a single list tokens separating each item. Convert that list into a PyTorch tensor. Add the tensor to a list. After that's all done, use to convert the list into a single tensor, and then save that with . I can upload the datasets to Hugging Face; their network connection will be better than mine, so I can just pay the price in time of uploading everything from home once, and then I can download them faster from HF to LL. 
That also has the benefit of meaning that after this experiment I can safely delete the local files, but then download them again if I need them. And if anyone else wants to repro this experiment, the data will be easily available to them. Lambda Labs have persistent filesystems that you can use. They cost $0.20/GB/month, so that would be about $5/month for all of my datasets. So I could upload the data to a cheap instance with a persistent filesystem mounted, shut down that instance but keep the filesystem, and then mount it on each machine I use to run tests.
The world size -- that is, how many per-GPU processes we are running
The micro-batch size
The sequence length
An 8x B200, with 160 GiB per GPU, at $39.92/hour
An 8x H100, with 80 GiB per GPU, at $23.92/hour
An 8x A100, with 80 GiB per GPU, at $14.32/hour
An 8x A100, with 40 GiB per GPU, at $10.32/hour
The loss they got on the validation set from the first train. Strictly speaking, I was kind of cheating and using that as a test set. The score given by the OpenAI GPT 5.1 model for an instruction-following dataset. This was the one provided in the book -- an Alpaca-style Q&A dataset, with a well-defined train and test set. Each model was fine-tuned on a training set of 85% of the data until loss on a validation set of 5% of the data started rising, and then tested on the remaining 10%. Sebastian Raschka, being a pro, was splitting up the data properly :-) If we're going to do validation then it does make some sense to do one at the start -- but doing one training iteration first seems kind of arbitrary (though it's clear how that drops out of the existing code). The validation runs on this machine are taking longer than they were on the less-powerful A100 GPUs! That confused me for a bit, until I realised that it hadn't been slower with the batch-size 13 test, only with the larger ones later in the binary chop. If we're using larger batches, then there's more work to do for the validation. Doing this binary chop by hand is annoying and error-prone, and worse, we have to wait for one of those (long) validation runs before we get into proper training. The initial training iteration can succeed, while later ones hit memory limits -- it seems like we need to wait for three or four training iterations before we can be sure that we have a workable batch size. Not quite sure why that is; perhaps it's something in the optimiser or the scaler? If : Local snapshot path. If : A list of DryRunFileInfo objects containing download information. I updated the function so that it takes flags to tell it whether or not to do validation (default true) and an optional maximum number of steps, which is by default. With those default values, it does exactly the same as before, of course. I created a function, which does all of the dataset-loading stuff that the original function did, and then calls with a -wrapped model. So that maintains the current flow. Next, I added a flag to the script; if that's not set, it just calls . However, if it is set, it instead calls a new function, which determines the largest batch size we can fit onto the current hardware for the current run, and (on the rank 0 process only, to avoid log spam) prints it out. does what it says on the tin; it confirms that we can train with a batch size of 1, and that we can't with batch size 70 (chosen because the limit was 64 on that massive B200 machine), then chops between them to find the largest batch size that doesn't OOM.
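For illustration, the binary chop could look something like the sketch below. This is not the post's code: train_for_a_few_steps is a hypothetical stand-in for "build a dataset with this batch size and run a short train with no validation", and the OOM check just inspects the RuntimeError message, as described above.

```python
# Hedged sketch: find the largest batch size that trains without an OOM.
# train_for_a_few_steps is a hypothetical stand-in for "build the dataset
# with this batch size and run a short train with no validation".
import torch

def fits(batch_size: int) -> bool:
    try:
        train_for_a_few_steps(batch_size, max_steps=3)
        return True
    except RuntimeError as e:
        if "out of memory" in str(e).lower():
            torch.cuda.empty_cache()  # release whatever the failed run grabbed
            return False
        raise  # anything else is a real bug, not an OOM

def largest_workable_batch_size(low: int = 1, high: int = 70) -> int:
    # Invariant: `low` is known to fit, `high` is known not to.
    assert fits(low) and not fits(high)
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low
```

In practice you'd want more than one or two steps per probe, since, as noted above, the first training iteration can succeed while later ones hit memory limits.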
It uses for that -- that just constructs a dataset with the appropriate batch size, then runs a three-step train with no validation to see if it raises an OOM. PyTorch rather messily just raises a generic for those, but we can look inside the exception's message to see if it is an OOM. Create the run file, commit and push. Spin up the machine. On it: clone the repo. We had two nasty loss spikes. As a result of the second of those, the best iteration as per validation loss is not the last one. Best checkpoint: 4 epochs of fine-tuning, and a score of 11.98 -- another record low! Amusingly, it confidently said "The author of 'Pride and Prejudice' is Sarah Palin". Latest checkpoint: 5 epochs of fine-tuning, and a rather good score of 17.91.
24 GiB locally, which was 6
40 GiB in the first train in this series, which was 13
80 GiB in the last one, giving us 27
160 GiB in the one on the huge machine, giving us 64
An 8x A100 40 GiB
An 8x A100 80 GiB
An 8x H100 80 GiB
An 8x B200 160 GiB
The loss on my test set. The results it got on an instruction fine-tune test based on Sebastian Raschka's. The global batch size (that is, for single GPU runs, just the batch size, but for the multi-GPU ones, where each batch is made up of per-GPU micro-batches, the per-GPU batch size times the number of GPUs). 4 They're all consistently better than the local ones. The one on the smaller machine is better than the ones on the larger ones; indeed, it looks like the larger the machine, the worse the loss. How long the training run took. How much the machine cost per hour. How much the training run cost. How much of that was doing validation (which I'm now thinking is pointless on single-epoch trains like this). How much it would have cost, and how long it would have taken, if it had been run without validation. I wanted to learn how to change a simple single-GPU training loop to make it multi-GPU. Could I get the training time for a full base model down from 48 hours to something more manageable -- and, hopefully, not too expensive?
Removing dropout.
Tweaking the learning rate (and maybe adding the warmup and cosine learning-rate decay stuff I've read about).
Reverting the architectural differences between our model and the original GPT-2: reintroducing weight tying between the token embeddings and the final linear layer, and also bias in the attention weights.
Trying full-fat 32-bit precision.
Fixing the exploding gradients issue with gradient clipping.
Dig into the instruction fine-tuning tests a little more -- as I've said above, I'm not 100% happy with how comparable it really is between models, at least given how I've been running it so far.
Upload the models we have to Hugging Face. I have a new motherboard ready for my PC, and replacing the old one has a risk that I might mess up and break the NVMe drive I have them stored on. I was holding off on this because it would mean sharing Raschka's GPT code, but having noticed that he's already licensed it all under the Apache license, I can release them under the same one.
Strip out the validation stuff. We can use training loss to track our progress, and losing evals during the train will help keep the cost down.
Finally, do the trains to see how each of the levers above affects loss.
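To tie the DataParallel/DDP discussion earlier in this post to something concrete, here is a minimal, self-contained DDP sketch (a toy linear model on random data, not this post's GPT-2 code). It shows the pieces a single-GPU loop needs to grow: one process per GPU, a DistributedSampler to split the data, and a DDP wrapper to synchronise gradients.

```python
# Minimal single-GPU -> DDP sketch; the model and data are tiny placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).to(local_rank)   # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])      # gradient sync happens in backward()

    # DistributedSampler hands each rank a disjoint slice of the dataset,
    # so each GPU sees its own micro-batch.
    dataset = TensorDataset(torch.randn(4096, 128), torch.randn(4096, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()                        # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run it with something like torchrun --nproc_per_node=8 ddp_sketch.py; torchrun sets the RANK, LOCAL_RANK and WORLD_SIZE environment variables for each process.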

JSLegendDev 1 week ago

The Phaser Game Framework in 5 Minutes

Phaser is the most popular JavaScript/TypeScript framework for making 2D games. It’s performant, and popular games like Vampire Survivors and PokéRogue were made with it. Because it’s a web-native framework, games made with Phaser are lightweight and generally load and run better on the web than the web exports produced by major game engines. For that reason, if you’re looking to make 2D web games, Phaser is a great addition to your toolbelt. In this post, I’ll explain the framework’s core concepts in around 5 minutes.
— SPONSORED SEGMENT — In case you want to bring your web game to desktop platforms, today’s sponsor, GemShell, allows you to build executables for Windows/Mac/Linux in what amounts to a click. It also makes Steam integration easy. For more info, visit 👉 https://l0om.itch.io/gemshell You have a tool/product you want featured in a sponsored segment? Contact me at [email protected]
A Phaser project starts with defining a config to describe how the game’s canvas should be initialized. To make the game scale according to the window’s size, we can set the scale property in our config. The mode property is set to FIT, so the canvas scales while preserving its own aspect ratio. As for keeping the canvas centered on the page, the autoCenter property is used with the CENTER_BOTH value. Most games are composed of multiple scenes, and switching between them is expected during the course of gameplay. Since Phaser uses the object-oriented paradigm, a scene is created by defining a class that inherits from the Phaser.Scene class. To be able to reference the scene elsewhere in our code, it’s important to give it a key. For this purpose, and for being able to use the methods and properties of the parent class, we need to call the super constructor and pass to it the key we want to use. The two most important methods of a Phaser scene are the and methods. The first is used for, among other things, creating game objects like text and sprites and setting things like scores. It runs once, every time the scene becomes active. The latter, which runs once per frame, is used, for example, to handle movement logic. Once a scene is created, we still need to add it to our game. This is done in the Phaser config under a property called scenes, which expects an array. The order of scenes in this array is important. The first element will be used as the default scene of the game. To switch scenes, we can call the method of the scene manager. Before we can render sprites, we need to load them. For this purpose, a Phaser scene has access to the method where asset loading logic should be placed. To load an image, we can use the image method of the loader plugin. Then, in the method, we can render a sprite by calling the method of the Game Object Factory plugin. The first two params are for specifying the X and Y coordinates, while the third param is for providing the key of the sprite to render. Because we created our sprite game object in the method, we don’t have access to it in our method; that’s why you’ll often see the pattern of assigning a game object to an instance field so it becomes accessible to other methods of the scene. Finally, movement logic code is placed in the method which runs every frame. Rendering text is similar to sprites. Rather than using the method, we use the method. If you want to hold data or define custom methods for a sprite game object, a better approach is to define a class that inherits from the Phaser.GameObject.Sprite class.
Once the class is defined, we can use it in our scene’s code. While asset loading can be done in any Phaser scene, a better approach is to create a scene dedicated to loading assets, which then switches to the main game scene once loading is complete. This can be achieved like shown below : Another important aspect of any game is the ability to play animations. Usually for 2D games, we have spritesheets containing all the needed frames to animate a character in a single image. An example of a spritesheet In Phaser, we first specify the dimensions of a frame in the loading logic of the spritesheet so that the framework knows how to slice the image into individual frames. Then, we can create an animation by defining its starting and ending frames. To provide the needed frames we call the method of Phaser’s animation manager. Finally once the animation is created, it can be played by using the method of the sprite game object. If you want the animation to loop back indefinitely, add the repeat property and set it to -1. A game needs to be interactive to be called a “game”. One way to handle input is by using event listeners provided by Phaser. For keyboard input, we can use : And for handling mouse and touch input we can use . At one point, you might need to share data between scenes. For this purpose, you can use Phaser’s registry. Here is an example of its usage. To play sounds (assuming you have already loaded the sound first) you can use the method of the sound manager. You can specify the sound’s volume in the second param of that method. If you need to be able to stop, pause or play the same sound at a later time, you can add it to the sound manager rather than playing it immediately. This comes in handy when you transition from one scene to another and you have a sound that loops indefinitely. In that case, you need to stop the sound before switching over otherwise the sound will keep playing in the next scene. By default, Phaser offers an Arcade physics system which is not meant for complex physics simulations. However, it’s well suited for most types of games. To enable it, you can add the following to your Phaser config. You can add an existing game object to the physics system the same way you add one to a scene. This will create a physics body for that game object which is accessible with the body instance field. You can view this body as a hitbox around your sprite if you turn on the debug mode in your project’s config. Example of a Phaser game with debug set to true To create bodies that aren’t affected by gravity, like platforms, you can create a static group and then create and add static bodies to that group. Here’s an example : You can also add already existing physics bodies to a group. Now, you might be wondering what groups are useful for? They shine in collision handling logic. Let’s assume you have multiple enemies attacking the player. To determine when a collision occurs between any enemy and the player, you can set up the following collision handler : There are many concepts I did not have time to cover. If you want to delve further into Phaser, I have a project based course you can purchase where I guide you through the process of building a Sonic themed infinite runner game. This is a great opportunity to put in practice what you’ve learned here. If you’re interested, here’s the link to the course : https://www.patreon.com/posts/learn-phaser-4-147473030 . 
That said, you can freely play the game being built in the course as well as have access to the final source code. Original Phaser game live demo : https://jslegend.itch.io/sonic-ring-run-phaser-4 Demo of the version built in the course : https://jslegend.itch.io/sonic-runner-tutorial-build Final source code : https://github.com/JSLegendDev/sonic-runner-phaser-tutorial If you enjoy technical posts like this one, I recommend subscribing to not miss out on future releases.

Steve Klabnik 1 week ago

Getting started with Claude for software development

2025 was an interesting year in many ways. One way in which it was interesting for me is that I went from an AI hater to a pretty big user. And so I’ve had a few requests for a “using Claude” guide, so I figure new year, why not give it a shot? The lack of this kind of content was something that really frustrated me starting out, so feels like a good thing to contribute to the world. This post is going to be for software developers that are interested in learning about using these tools as of early 2026. I’m going to spend this post talking about some background, and then the first steps towards getting your feet wet. If folks like it, I’ll follow up with more. There’s a lot here. I’m going to be speaking about Claude directly, because it’s the tool I use the most, but a lot of this should apply to other platforms as well. The first thing I want to say on this topic is that there’s a reason that this is the first post of possibly many: there’s a lot here, as I just said above. This matters more than you might think at first. Becoming productive with LLMs is not actually easy, no matter what other people tell you. A lot of advice in this space is given by people who don’t have a teaching background, and have forgotten how much work they’ve put in to get to where they are. I liken it to vim: everyone acknowledges that modal editing is a different way of working. We joke about how hard it is to learn how to quit vim if you’ve accidentally started it up. But many people also acknowledge that the learning curve is worth it, due to the power you get. I think of LLMs like vim: they’re not super easy to get real results from, but the time invested can be worth it. It’s also worth saying up front: maybe it’s not worth it, for you. I don’t fault anyone for not wanting to spend time learning a new tool, especially in a space that’s moving as fast as this one. Effectively everything I’m going to talk about in this post has really only come into its own in the last 12-14 months. Maybe in another 12 months this post will be useless. I don’t know. But just like we might not find the time to learn vim to be worth it over just using a more normal editor, that doesn’t mean that deciding that all of this isn’t worth your time isn’t a rational, reasonable decision to make. You’re not going to be “left behind” or whatever some of the boosters say, in the same way that you don’t need to learn vim to do software dev. We aren’t doing vim vs emacs wars here. We’re saying “hey if you want to learn vim this is how I think you can, and if not, that’s fine.” Furthermore, because there’s so much to cover, this post is going to be background and step 1. Because otherwise it would be too dang long. You can’t just read ten thousand words on this stuff and be an expert, you have to go actually use the things. So you should be taking time between each post to go and do that, and so not putting it all in one place should give you natural breakpoints to go and actually try this stuff out. The very first thing I want to say on this entire topic is something that I think about a lot. I have generally had better outcomes with LLMs than a lot of people I know. And that’s bothered me. And I’m not sure exactly why that is. But I do have one idea. I like to approach this space in a … maybe “scientific” way is too strong, but at least a rational one. I try things out, discard what doesn’t seem to work, and keep what seems to work. I try and think critically about this space. 
I do think that the whole "vibe" term, while complicated in this space, is also important. Vibes do matter, actually. I have some more science-y and some more folks-y reasons that I believe this. But I do think that the attitude you bring towards this process partially dictates your success, and I think you should be conscious of that while you go on this journey. Is that too woo-y for you? Okay, let me make it concrete: I un-ironically believe that swearing at Claude makes it perform worse. I think you will get better results working with an LLM if you treat them like you’d treat a respected co-worker, and you will get worse results if you berate, insult, or otherwise mistreat them. This matters because I think that for a lot of LLM-skeptical people who give this a shot, they may not actually go "Hey claude what’s your fucking problem" (though I have literally seen this happen), but they will tend to let their frustrations show a bit more when things don’t work out. Use your emotional regulation skills. It’s very okay to be critical in response to whatever Claude does, but do it in a way that wouldn’t get you reported to HR in a healthy company. Do this: Why did you do it that way? I would have preferred if we did <this> instead. Not this: Stop making such basic mistakes. You know that we do <this> and not <that>, idiot. I think that being kind to people is good for you, but I also believe that, even if you’re a misanthrope, you should consider this a skill to get increased output from the tool. I think a bit of anthropomorphization is actually a good thing here. We’ll come back to that later during the more practical steps, but basically, that’s the higher level principle at work: an LLM is not a person. But it is working based off of language that people use. That’s its API. And so interacting with it in the way you’d interact with a co-worker is, in my mind, the right way to do it. Maybe I’ll elaborate on this belief someday. Or maybe not. I do this for personal belief reasons more than anything else. But it is something I want to share. Okay! Now that we’ve got that out of the way, let’s talk about the various ways you can use Claude! There’s a number of them, actually, but I want to focus on two: on the web at https://claude.ai , and with Claude Code . Using Claude in these ways is fundamentally different. Both have pros and cons. For real actual software development, you want to use Claude Code. This is due to the "agentic loop", which you’ll learn more about in a bit. But for your first steps, using it via the web is okay. It’s mostly just important to know that your experience using the web interface is not going to be the same as using Claude Code. If I only had access to the web interface, I wouldn’t be so bullish on this stuff. But it is and can be useful. Especially when getting your feet wet, as long as you can understand that they’re meaningfully different. This gets into another topic that matters: money. Another reason I do not fault anyone for not spending time with these tools is that vim is free, whereas Claude is very much not. However. There are three major factors in the money equation: Claude Web vs Claude Code, which models you have access to, and the actual cost. Let’s talk about them. You can load up https://claude.ai and talk to Claude right now for free. But you cannot use Claude Code without paying. So if you want to start incredibly small, using the website at first before you fork over money can make sense. Again, that’s fine, just know that the experience is different.
But it may be a good way to start. In 2024 and 2025, there was a good argument that you needed to be on a paid plan because that’s how you got access to the latest models. While this is still true to some degree, models have advanced far enough that the changes are less important over time. I do think that in the first half of 2026, it still does matter to a degree. Basically, the difference between Claude 3, 4, and 4.5 is significant, but for me, Claude 4 was good enough a year ago to get real work done. I’m not 100% sure which one you get for free today, but it’s at least fine. And I think that by the time the next round of models come out, the ones you’ll have access to for free will be basically good enough to make this question moot. But do know that you get what you pay for, and paying for things does get you better performance. (Speaking of models, you’ll hear Claude referred to in three ways: Haiku, Sonnet, and Opus. As the names imply, worst to best there, though also, fastest to slowest. Sonnet, especially the 4.5 version, is pretty good for everything. Opus 4.5 is wonderful. Haiku is great for certain things.) As for actual cost: there’s $20/month, $100/month, and $200/month plans, as well as “you pay per API call.” You might be tempted to think “I’ll just pay per API call and keep my usage down.” This is a reasonable thing to think, and also a terrible mistake to make. You get a lot of bang for your buck with the plans. To give you an idea, I recently hit my weekly limit last night on the $200/month plan, and my estimated usage for that week (which again, I’m paying $50 for) would have been $1440.73 if I were paying by the API call. Now, I am a very heavy user, but the point stands: as someone trying out these tools, it is way easy to spend more than $20 of API tokens. If you want to give these tools a real shot, come up with a spare $20, sign up for the cheap plan, and then cancel after your experiment is over. You get access to Claude Code and you’ve capped your spend. It’s a win/win. There’s some good secondary effects of trying to be frugal here but I think that’s more of an intermediate than an advanced topic, to be honest. I think worrying about the money while you build these skills is a distraction. Cap your spend via a plan so that way you can not stress out about breaking the bank. Okay, with all of that background out of the way: let’s talk about your first steps here. Everyone is interested in the ability of LLMs to generate code. But I think that’s actually step 2, not step 1. The way I want you to start using these tools is purely read-only at first. This is also why the website is okay to get started with too; Claude Code is far better at generating code than the site is, but we’re not going to start by writing code. Find some code you’ve written recently. It can be literally anything. Load up https://claude.ai , and type: Hi Claude! Can we talk about this code? And then paste your code in. You don’t need any fancy prompting techniques. You don’t even need to say what language it is. Just give it some code. It could be ten lines, it could be a hundred. I wouldn’t recommend a thousand to start. Claude will probably respond with some sort of basic analysis of what you’ve done, and then a question. I gave it ~50 lines of code a friend and I were discussing recently, and it gave me this back: Sure! This looks like <description of what it does>. You’ve got <three things that the code does>. What’s on your mind about it? 
Are you thinking through the design, running into a specific issue, or wanting feedback on a particular aspect? From here, you have a ton of options of which way to go, but they really depend on what you’ve pasted in. Here’s some fun prompt ideas: Do you think this code is idiomatic? If you could improve one thing about this code, what might it be? If I wanted to modify this code to do <something>, how would you go about doing that? Are there any bugs in this code? Are there any security implications of this code I may not have thought about? And so on. Anyway, the goal here is to just get used to this whole thing. It’s a bit weird! It’s very different than talking to a compiler. If Claude says something you disagree with, push back a little, just like you would a co-worker: I’m not sure I agree with that. The reason why is that in some other part of the system, there’s <behavior> and so that would impact this sort of decision. Why did you suggest that? I’d like to understand more. Claude will absolutely not be right all of the time. And that’s okay! The goal is to work together, not that this is a magic tool that suddenly solves all of your problems. Once you’ve done this a few times, you might want to graduate to Claude Code. The reason for this is that you can start to scale up your questions. Once you’ve installed it and logged in, you’ll be at a terminal prompt. It might bug you about creating a CLAUDE.md, don’t worry about that for now. Continue having conversations with Claude about your codebase. The reason that this is a a big step up is that before, you had to paste all of the code in. Now, Claude can go find your code itself. Some prompts for you to try: Please give me a code review of my codebase and suggest five things I could do to improve it. Can you find any bugs in <component>? I’m curious about the performance of <component>, can we talk about it? One thing I like to do here is have Claude double check my intuitions. A few months ago, working on an application in Rust, I was considering a refactoring. I hadn’t done it because I was worried that it would be tedious, take a while, and maybe not improve the codebase. It might! But it might not. But putting in the day or two to do the refactor wasn’t really worth finding out if maybe that would be wasted. So, I asked Claude. This is an example of a bit longer of a prompt: Hi Claude! I am considering refactoring my code. In a function like this: <paste code>, I don’t like how I did things, and I’m considering doing it like this instead: <paste code>. However, I know that changes the signature, which impacts other code in the codebase. A few questions for you: 1. how many function signatures would need to be updated if I made this change? 2. can you show me what the code would look like if I did this refactoring on one of my simpler endpoints? 3. can you show me what the code would look like if I did this refactoring on one of my most complex endpoints? Claude came back and said something like “250 signatures would need to change, here’s the before and after using these two examples from your codebase.” Now, Claude isn’t perfect: maybe it was actually 260 signatures. But the point is, this helped me characterize my intuition here: it would be a non-trivial amount of work. But I also got to see its impact on real code I had written, which helped me decide if this refactoring would actually help me in some of the more hairy parts of the codebase. Note that there’s not really any special “prompt engineering” going on here. 
You don’t need to do "as a senior software engineer" or stuff like that. Just talk to it like you’d talk to a person. It’s fine. That doesn’t mean that prompts are useless, but this sort of optimization is an intermediate to advanced topic, and frankly, I’m skeptical that at this point the "as an x" technique even helps. More on that someday. The point is, you can start asking more complex questions as you get more comfortable with the tool. Because Claude works asynchronously, you can just fire off questions like these in the background, and come back to them when it’s done. Well, sorta. Let’s talk about permissions before we wrap this up. By default, Claude will put you in an "ask before edits" mode. This is a good way to start. It’ll check in with you before doing certain things, and you can say yes or no. Please consider what it’s about to do, and give the answer you’re comfortable with. Advanced users basically let Claude do whatever it wants, but you’re not there yet, and there’s risks involved that aren’t obvious to you just yet as a new user, so even though it can be a bit annoying to say yes every time it asks, I’d encourage you to start off with minimal permissions. It gives you the option to say "commands like this one are okay for the rest of my session" and so when it wants to or something, that can be nice to agree to, but I’d encourage you to not use it for writing code just yet, and tell it no if it asks. We’ll do that in a follow-up post. So that’s my intro to getting started with Claude. Spend $20, talk to it like you’d talk to a person, and use it as a means of getting feedback on your code, don’t have it write anything just yet. Graduate to larger and larger questions as you get comfortable with what it can do. Gently push back when you think it gets out of line. But your goal here is a baseline understanding of what the tool is capable of, not to vibe code out an entire app in an afternoon. These skills may seem too basic, but I promise you, it gets harder from here, and so you’ll want a solid foundation in read-only questions before we graduate to having Claude write some code. I hope this was helpful to you. Here’s my post about this post on BlueSky: steveklabnik.com/writing/gett...

xenodium 1 week ago

Bending Emacs - Episode 9: World times

A new year, a new Bending Emacs episode, so here it goes: Bending Emacs Episode 9: Time around the world. Emacs comes with a built-in world clock: To customize displayed timezones, use: Each entry requires a valid timezone string (as per entries in your system's ) and a display label. I wanted a slightly different experience than the built-in command (more details here), so I built the time-zones package. is available on MELPA, so you can install with: Toggle help with the key, and add cities with the key. Shifting time is possible via the / keys, in addition to other features available via the help menu. Hope you enjoyed the video! Liked the video? Please let me know. Got feedback? Leave me some comments. Please go like my video, share with others, and subscribe to my channel. If there's enough interest, I'll continue making more videos! Enjoying this content or my projects? I am an indie dev. Help make it sustainable by ✨ sponsoring ✨ Need a blog? I can help with that. Maybe buy my iOS apps too ;)


Easy (Horizontal Scrollbar) Fixes for Your Blog CSS

Read on the website: There are narrow screen CSS problems I often email people because of. These three fixes should be enough for most.

JSLegendDev 1 week ago

Learn Phaser 4 by Building a Sonic Themed Infinite Runner Game in JavaScript

Phaser is the most popular JavaScript/TypeScript framework for making 2D games. It is performant, and popular games like Vampire Survivors and Pokérogue were made with it. Because it’s a web-native framework, games built with it are lightweight and generally load and run better on the web than web exports produced by major game engines. For this reason, if you’re a web developer looking to make 2D web games, Phaser is a great addition to your toolbelt. To make the process of learning Phaser easier, I have released a course that takes you through the process of building a Sonic themed infinite runner game with Phaser 4 and JavaScript. You can purchase the course here : https://www.patreon.com/posts/learn-phaser-4-147473030 . Total length of the course is 1h 43min. More details regarding content and prerequisites are included in the link. That said, you can freely play the game being built in the course as well as have access to the final source code. Original Phaser game live demo : https://jslegend.itch.io/sonic-ring-run-phaser-4 Demo of the version built in the course : https://jslegend.itch.io/sonic-runner-tutorial-build Final source code : https://github.com/JSLegendDev/sonic-runner-phaser-tutorial

Anton Zhiyanov 1 week ago

Go 1.26 interactive tour

Go 1.26 is coming out in February, so it's a good time to explore what's new. The official release notes are pretty dry, so I prepared an interactive version with lots of examples showing what has changed and what the new behavior is. Read on and see! new(expr)  • Type-safe error checking  • Green Tea GC  • Faster cgo and syscalls  • Faster memory allocation  • Vectorized operations  • Secret mode  • Reader-less cryptography  • Goroutine leak profile  • Goroutine metrics  • Reflective iterators  • Peek into a buffer  • Process handle  • Signal as cause  • Compare IP subnets  • Context-aware dialing  • Fake example.com  • Optimized fmt.Errorf  • Optimized io.ReadAll  • Multiple log handlers  • Test artifacts  • Modernized go fix  • Final thoughts This article is based on the official release notes from The Go Authors and the Go source code, licensed under the BSD-3-Clause license. This is not an exhaustive list; see the official release notes for that. I provide links to the documentation (𝗗), proposals (𝗣), commits (𝗖𝗟), and authors (𝗔) for the features described. Check them out for motivation, usage, and implementation details. I also have dedicated guides (𝗚) for some of the features. Error handling is often skipped to keep things simple. Don't do this in production ツ Previously, you could only use the built-in with types: Now you can also use it with expressions: If the argument is an expression of type T, then allocates a variable of type T, initializes it to the value of , and returns its address, a value of type . This feature is especially helpful if you use pointer fields in a struct to represent optional values that you marshal to JSON or Protobuf: You can use with composite values: And function calls: Passing is still not allowed: 𝗗 spec • 𝗣 45624 • 𝗖𝗟 704935 , 704737 , 704955 , 705157 • 𝗔 Alan Donovan The new function is a generic version of : It's type-safe and easier to use: is especially handy when checking for multiple types of errors. It makes the code shorter and keeps error variables scoped to their blocks: Another issue with is that it uses reflection and can cause runtime panics if used incorrectly (like if you pass a non-pointer or a type that doesn't implement ): doesn't cause a runtime panic; it gives a clear compile-time error instead: doesn't use , executes faster, and allocates less than : Since can handle everything that does, it's a recommended drop-in replacement for new code. 𝗗 errors.AsType • 𝗣 51945 • 𝗖𝗟 707235 • 𝗔 Julien Cretel The new garbage collector (first introduced as experimental in 1.25) is designed to make memory management more efficient on modern computers with many CPU cores. Go's traditional garbage collector algorithm operates on graph, treating objects as nodes and pointers as edges, without considering their physical location in memory. The scanner jumps between distant memory locations, causing frequent cache misses. As a result, the CPU spends too much time waiting for data to arrive from memory. More than 35% of the time spent scanning memory is wasted just stalling while waiting for memory accesses. As computers get more CPU cores, this problem gets even worse. Green Tea shifts the focus from being processor-centered to being memory-aware. Instead of scanning individual objects, it scans memory in contiguous 8 KiB blocks called spans . The algorithm focuses on small objects (up to 512 bytes) because they are the most common and hardest to scan efficiently. 
Each span is divided into equal slots based on its assigned size class , and it only contains objects of that size class. For example, if a span is assigned to the 32-byte size class, the whole block is split into 32-byte slots, and objects are placed directly into these slots, each starting at the beginning of its slot. Because of this fixed layout, the garbage collector can easily find an object's metadata using simple address arithmetic, without checking the size of each object it finds. When the algorithm finds an object that needs to be scanned, it marks the object's location in its span but doesn't scan it immediately. Instead, it waits until there are several objects in the same span that need scanning. Then, when the garbage collector processes that span, it scans multiple objects at once. This is much faster than going over the same area of memory multiple times. To make better use of CPU cores, GC workers share the workload by stealing tasks from each other. Each worker has its own local queue of spans to scan, and if a worker is idle, it can grab tasks from the queues of other busy workers. This decentralized approach removes the need for a central global list, prevents delays, and reduces contention between CPU cores. Green Tea uses vectorized CPU instructions (only on amd64 architectures) to process memory spans in bulk when there are enough objects. Benchmark results vary, but the Go team expects a 10–40% reduction in garbage collection overhead in real-world programs that rely heavily on the garbage collector. Plus, with vectorized implementation, an extra 10% reduction in GC overhead when running on CPUs like Intel Ice Lake or AMD Zen 4 and newer. Unfortunately, I couldn't find any public benchmark results from the Go team for the latest version of Green Tea, and I wasn't able to create a good synthetic benchmark myself. So, no details this time :( The new garbage collector is enabled by default. To use the old garbage collector, set at build time (this option is expected to be removed in Go 1.27). 𝗣 73581 • 𝗔 Michael Knyszek In the Go runtime, a processor (often referred to as a P) is a resource required to run the code. For a thread (a machine or M) to execute a goroutine (G), it must first acquire a processor. Processors move through different states. They can be (executing code), (waiting for work), or (paused because of the garbage collection). Previously, processors had a state called used when a goroutine is making a system or cgo call. Now, this state has been removed. Instead of using a separate processor state, the system now checks the status of the goroutine assigned to the processor to see if it's involved in a system call. This reduces internal runtime overhead and simplifies code paths for cgo and syscalls. The Go release notes say -30% in cgo runtime overhead, and the commit mentions an 18% sec/op improvement: I decided to run the CgoCall benchmarks locally as well: Either way, both a 20% and a 30% improvement are pretty impressive. And here are the results from a local syscall benchmark: That's pretty good too. 𝗖𝗟 646198 • 𝗔 Michael Knyszek The Go runtime now has specialized versions of its memory allocation function for small objects (from 1 to 512 bytes). It uses jump tables to quickly choose the right function for each size, instead of relying on a single general-purpose implementation. The Go release notes say "the compiler will now generate calls to size-specialized memory allocation routines". 
But based on the code, that's not completely accurate: the compiler still emits calls to the general-purpose function. Then, at runtime, dispatches those calls to the new specialized allocation functions. This change reduces the cost of small object memory allocations by up to 30%. The Go team expects the overall improvement to be ~1% in real allocation-heavy programs. I couldn't find any existing benchmarks, so I came up with my own. And indeed, running it on Go 1.25 compared to 1.26 shows a significant improvement: The new implementation is enabled by default. You can disable it by setting at build time (this option is expected to be removed in Go 1.27). 𝗖𝗟 665835 • 𝗔 Michael Matloob The new package provides access to architecture-specific vectorized operations (SIMD — single instruction, multiple data). This is a low-level package that exposes hardware-specific functionality. It currently only supports amd64 platforms. Because different CPU architectures have very different SIMD operations, it's hard to create a single portable API that works for all of them. So the Go team decided to start with a low-level, architecture-specific API first, giving "power users" immediate access to SIMD features on the most common server platform — amd64. The package defines vector types as structs, like (a 128-bit SIMD vector with sixteen 8-bit integers) and (a 512-bit SIMD vector with eight 64-bit floats). These match the hardware's vector registers. The package supports vectors that are 128, 256, or 512 bits wide. Most operations are defined as methods on vector types. They usually map directly to hardware instructions with zero overhead. To give you a taste, here's a custom function that uses SIMD instructions to add 32-bit float vectors: Let's try it on two vectors: Common operations in the package include: The package uses only AVX instructions, not SSE. Here's a simple benchmark for adding two vectors (both the "plain" and SIMD versions use pre-allocated slices): The package is experimental and can be enabled by setting at build time. 𝗗 simd/archsimd • 𝗣 73787 • 𝗖𝗟 701915 , 712880 , 729900 , 732020 • 𝗔 Junyang Shao , Sean Liao , Tom Thorogood Cryptographic protocols like WireGuard or TLS have a property called "forward secrecy". This means that even if an attacker gains access to long-term secrets (like a private key in TLS), they shouldn't be able to decrypt past communication sessions. To make this work, ephemeral keys (temporary keys used to negotiate the session) need to be erased from memory immediately after the handshake. If there's no reliable way to clear this memory, these keys could stay there indefinitely. An attacker who finds them later could re-derive the session key and decrypt past traffic, breaking forward secrecy. In Go, the runtime manages memory, and it doesn't guarantee when or how memory is cleared. Sensitive data might remain in heap allocations or stack frames, potentially exposed in core dumps or through memory attacks. Developers often have to use unreliable "hacks" with reflection to try to zero out internal buffers in cryptographic libraries. Even so, some data might still stay in memory where the developer can't reach or control it. The Go team's solution to this problem is the new package. It lets you run a function in secret mode . After the function finishes, it immediately erases (zeroes out) the registers and stack it used. Heap allocations made by the function are erased as soon as the garbage collector decides they are no longer reachable. 
This helps make sure sensitive information doesn't stay in memory longer than needed, lowering the risk of attackers getting to it. Here's an example that shows how might be used in a more or less realistic setting. Let's say you want to generate a session key while keeping the ephemeral private key and shared secret safe: Here, the ephemeral private key and the raw shared secret are effectively "toxic waste" — they are necessary to create the final session key, but dangerous to keep around. If these values stay in the heap and an attacker later gets access to the application's memory (for example, via a core dump or a vulnerability like Heartbleed), they could use these intermediates to re-derive the session key and decrypt past conversations. By wrapping the calculation in , we make sure that as soon as the session key is created, the "ingredients" used to make it are permanently destroyed. This means that even if the server is compromised in the future, this specific past session can't be exposed, which ensures forward secrecy. The current implementation only supports Linux (amd64 and arm64). On unsupported platforms, invokes the function directly. Also, trying to start a goroutine within the function causes a panic (this will be fixed in Go 1.27). The package is mainly for developers who work on cryptographic libraries. Most apps should use higher-level libraries that use behind the scenes. The package is experimental and can be enabled by setting at build time. 𝗗 runtime/secret • 𝗣 21865 • 𝗖𝗟 704615 • 𝗔 Daniel Morsing Current cryptographic APIs, like or , often accept an as the source of random data: These APIs don't commit to a specific way of using random bytes from the reader. Any change to underlying cryptographic algorithms can change the sequence or amount of bytes read. Because of this, if the application code (mistakenly) relies on a specific implementation in Go version X, it might fail or behave differently in version X+1. The Go team chose a pretty bold solution to this problem. Now, most crypto APIs will just ignore the random parameter and always use the system random source ( ). The change applies to the following subpackages: still uses the random reader if provided. But if is nil, it uses an internal secure source of random bytes instead of (which could be overridden). To support deterministic testing, there's a new package with a single function. It sets a global, deterministic cryptographic randomness source for the duration of the given test: affects and all implicit sources of cryptographic randomness in the packages: To temporarily restore the old reader-respecting behavior, set (this option will be removed in a future release). 𝗗 testing/cryptotest • 𝗣 70942 • 𝗖𝗟 724480 • 𝗔 Filippo Valsorda , qiulaidongfeng A leak occurs when one or more goroutines are indefinitely blocked on synchronization primitives like channels, while other goroutines continue running and the program as a whole keeps functioning. Here's a simple example: If we call and don't read from the output channel, the inner goroutine will stay blocked trying to send to the channel for the rest of the program: Unlike deadlocks, leaks do not cause panics, so they are much harder to spot. Also, unlike data races, Go's tooling did not address them for a long time. Things started to change in Go 1.24 with the introduction of the package. Not many people talk about it, but is a great tool for catching leaks during testing. 
Go 1.26 adds a new experimental profile designed to report leaked goroutines in production. Here's how we can use it in the example above: As you can see, we have a nice goroutine stack trace that shows exactly where the leak happens. The profile finds leaks by using the garbage collector's marking phase to check which blocked goroutines are still connected to active code. It starts with runnable goroutines, marks all sync objects they can reach, and keeps adding any blocked goroutines waiting on those objects. When it can't add any more, any blocked goroutines left are waiting on resources that can't be reached — so they're considered leaked. Here's the gist of it: For even more details, see the paper by Saioc et al. If you want to see how (and ) can catch typical leaks that often happen in production — check out my article on goroutine leaks . The profile is experimental and can be enabled by setting at build time. Enabling the experiment also makes the profile available as a net/http/pprof endpoint, . According to the authors, the implementation is already production-ready. It's only marked as experimental so they can get feedback on the API, especially about making it a new profile. 𝗗 runtime/pprof • 𝗚 Detecting leaks • 𝗣 74609 , 75280 • 𝗖𝗟 688335 • 𝗔 Vlad Saioc New metrics in the package give better insight into goroutine scheduling: Here's the full list: Per-state goroutine metrics can be linked to common production issues. For example, an increasing waiting count can show a lock contention problem. A high not-in-go count means goroutines are stuck in syscalls or cgo. A growing runnable backlog suggests the CPUs can't keep up with demand. You can read the new metric values using the regular function: The per-state numbers (not-in-go + runnable + running + waiting) are not guaranteed to add up to the live goroutine count ( , available since Go 1.16). All new metrics use counters. 𝗗 runtime/metrics • 𝗣 15490 • 𝗖𝗟 690397 , 690398 , 690399 • 𝗔 Michael Knyszek The new and methods in the package return iterators for a type's fields and methods: The new methods and return iterators for the input and output parameters of a function type: The new methods and return iterators for a value's fields and methods. Each iteration yields both the type information ( or ) and the value: Previously, you could get all this information by using a for-range loop with methods (which is what iterators do internally): Using an iterator is more concise. I hope it justifies the increased API surface. 𝗗 reflect • 𝗣 66631 • 𝗖𝗟 707356 • 𝗔 Quentin Quaadgras The new method in the package returns the next N bytes from the buffer without advancing it: If returns fewer than N bytes, it also returns : The slice returned by points to the buffer's content and stays valid until the buffer is changed. So, if you change the slice right away, it will affect future reads: The slice returned by is only valid until the next call to a read or write method. 𝗗 Buffer.Peek • 𝗣 73794 • 𝗖𝗟 674415 • 𝗔 Ilia Choly After you start a process in Go, you can access its ID: Internally, the type uses a process handle instead of the PID (which is just an integer), if the operating system supports it. Specifically, in Linux it uses pidfd , which is a file descriptor that refers to a process. Using the handle instead of the PID makes sure that methods always work with the same OS process, and not a different process that just happens to have the same ID. Previously, you couldn't access the process handle. 
Now you can, thanks to the new method: calls a specified function and passes a process handle as an argument: The handle is guaranteed to refer to the process until the callback function returns, even if the process has already terminated. That’s why it’s implemented as a callback instead of a field or method. is only supported on Linux 5.4+ and Windows. On other operating systems, it doesn’t execute the callback and returns an error. 𝗗 Process.WithHandle • 𝗣 70352 • 𝗖𝗟 699615 • 𝗔 Kir Kolyshkin returns a context that gets canceled when any of the specified signals is received. Previously, the canceled context only showed the standard "context canceled" cause: Now the context’s cause shows exactly which signal was received: The returned type, , is based on , so it doesn’t provide the actual value — just its string representation. 𝗗 signal.NotifyContext • 𝗖𝗟 721700 • 𝗔 Filippo Valsorda An IP address prefix represents an IP subnet. These prefixes are usually written in CIDR notation: In Go, an IP prefix is represented by the type. The new method lets you compare two IP prefixes, making it easy to sort them without having to write your own comparison code: orders two prefixes as follows: This follows the same order as Python’s and the standard IANA (Internet Assigned Numbers Authority) convention. 𝗗 Prefix.Compare • 𝗣 61642 • 𝗖𝗟 700355 • 𝗔 database64128 The package has top-level functions for connecting to an address using different networks (protocols) — , , , and . They were made before was introduced, so they don’t support cancellation: There’s also a type with a general-purpose method. It supports cancellation and can be used to connect to any of the known networks: However, it’s a bit less efficient than network-specific functions like — because of the extra overhead from address resolution and network type dispatching. So, network-specific functions in the package are more efficient, but they don’t support cancellation. The type supports cancellation, but it’s less efficient. The Go team decided to resolve this contradiction. The new context-aware methods ( , , , and ) combine the efficiency of the existing network-specific functions with the cancellation capabilities of : I wouldn’t say that having three different ways to dial is very convenient, but that’s the price of backward compatibility. 𝗗 net.Dialer • 𝗣 49097 • 𝗖𝗟 490975 • 𝗔 Michael Fraenkel The default certificate already lists in its DNSNames (a list of hostnames or domain names that the certificate is authorized to secure). Because of this, doesn’t trust responses from the real : To fix this issue, the HTTP client returned by now redirects requests for and its subdomains to the test server: 𝗗 Server.Client • 𝗖𝗟 666855 • 𝗔 Sean Liao People often point out that using for plain strings causes more memory allocations than . Because of this, some suggest switching code from to when formatting isn’t needed. The Go team disagrees. Here’s a quote from Russ Cox: Using is completely fine, especially in a program where all the errors are constructed with . Having to mentally switch between two functions based on the argument is unnecessary noise. With the new Go release, this debate should finally be settled. For unformatted strings, now allocates less and generally matches the allocations for . Specifically, goes from 2 allocations to 0 allocations for a non-escaping error, and from 2 allocations to 1 allocation for an escaping error: This matches the allocations for in both cases. The difference in CPU cost is also much smaller now.
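If you want to check the numbers on your own machine, a benchmark along these lines should do (this is my own sketch of the escaping case, not the benchmark from the CL):

```go
package errbench

import (
	"errors"
	"fmt"
	"testing"
)

// Storing the error in a package-level variable forces it to escape to the
// heap, i.e. the "escaping error" case discussed above.
var sink error

func BenchmarkErrorsNew(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = errors.New("permission denied")
	}
}

func BenchmarkErrorfPlain(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = fmt.Errorf("permission denied")
	}
}
```

Running it with go test -bench=. -benchmem on Go 1.25 and then on 1.26 should show the two cases converging.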
Previously, it was ~64ns vs. ~21ns for vs. for escaping errors, now it's ~25ns vs. ~21ns. Here are the "before and after" benchmarks for the change. The non-escaping case is called , and the escaping case is called . If there's just a plain error string, it's . If the error includes formatting, it's . Seconds per operation: Bytes per operation: Allocations per operation: If you're interested in the details, I highly recommend reading the CL — it's perfectly written. 𝗗 fmt.Errorf • 𝗖𝗟 708836 • 𝗔 thepudds Previously, allocated a lot of intermediate memory as it grew its result slice to the size of the input data. Now, it uses intermediate slices of exponentially growing size, and then copies them into a final perfectly-sized slice at the end. The new implementation is about twice as fast and uses roughly half the memory for a 65KiB input; it's even more efficient with larger inputs. Here are the geomean results comparing the old and new versions for different input sizes: See the full benchmark results in the commit. Unfortunately, the author didn't provide the benchmark source code. Ensuring the final slice is minimally sized is also quite helpful. The slice might persist for a long time, and the unused capacity in a backing array (as in the old version) would just waste memory. As with the optimization, I recommend reading the CL — it's very good. Both changes come from thepudds , whose change descriptions are every reviewer's dream come true. 𝗗 io.ReadAll • 𝗖𝗟 722500 • 𝗔 thepudds The package, introduced in version 1.21, offers a reliable, production-ready logging solution. Since its release, many projects have switched from third-party logging packages to use it. However, it was missing one key feature: the ability to send log records to multiple handlers, such as stdout or a log file. The new type solves this problem. It implements the standard interface and calls all the handlers you set up. For example, we can create a log handler that writes to stdout: And another handler that writes to a file: Finally, combine them using a : I'm also printing the file contents here to show the results. When the receives a log record, it sends it to each enabled handler one by one. If any handler returns an error, doesn't stop; instead, it combines all the errors using : The method reports whether any of the configured handlers is enabled: Other methods — and — call the corresponding methods on each of the enabled handlers. 𝗗 slog.MultiHandler • 𝗣 65954 • 𝗖𝗟 692237 • 𝗔 Jes Cok Test artifacts are files created by tests or benchmarks, such as execution logs, memory dumps, or analysis reports. They are important for debugging failures in remote environments (like CI), where developers can't step through the code manually. Previously, the Go test framework and tools didn't support test artifacts. Now they do. The new methods , , and return a directory where you can write test output files: If you use with , this directory will be inside the output directory (specified by , or the current directory by default): As you can see, the first time is called, it writes the directory location to the test log, which is quite handy. If you don't use , artifacts are stored in a temporary directory which is deleted after the test completes. Each test or subtest within each package has its own unique artifact directory. 
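Here’s a rough sketch of how a test might use this, assuming ArtifactDir is a plain method on *testing.T that returns the directory path (the method name comes from the reference below; the file names and contents are made up):

```go
package mypkg

import (
	"os"
	"path/filepath"
	"testing"
)

func TestProcessing(t *testing.T) {
	// With artifact collection enabled (see the flags discussed above), this
	// file ends up under the output directory; otherwise it goes to a
	// temporary directory that is deleted after the test.
	dump := filepath.Join(t.ArtifactDir(), "input-dump.json")
	if err := os.WriteFile(dump, []byte(`{"records": 3}`), 0o644); err != nil {
		t.Fatal(err)
	}

	t.Run("subcase", func(t *testing.T) {
		// The subtest gets its own, separate artifact directory.
		trace := filepath.Join(t.ArtifactDir(), "trace.log")
		if err := os.WriteFile(trace, []byte("step 1 ok\n"), 0o644); err != nil {
			t.Fatal(err)
		}
	})
}
```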
Subtest outputs are not stored inside the parent test’s output directory — all artifact directories for a given package are created at the same level: The artifact directory path normally looks like this: But if this path can’t be safely converted into a local file path (which, for some reason, always happens on my machine), the path will simply be: (which is what happens in the examples above) Repeated calls to in the same test or subtest return the same directory. 𝗗 T.ArtifactDir • 𝗣 71287 • 𝗖𝗟 696399 • 𝗔 Damien Neil Over the years, the command became a sad, neglected bag of rewrites for very ancient Go features. But now, it’s making a comeback. The new is re-implemented using the Go analysis framework — the same one uses. While and now use the same infrastructure, they have different purposes and use different sets of analyzers: By default, runs a full set of analyzers (currently, there are more than 20). To choose specific analyzers, use the flag for each one, or use to run all analyzers except the ones you turned off. For example, here we only enable the analyzer: And here, we enable all analyzers except : Currently, there’s no way to suppress specific analyzers for certain files or sections of code. To give you a taste of analyzers, here’s one of them in action. It replaces loops with or : If you’re interested, check out the dedicated blog post for the full list of analyzers with examples. 𝗗 cmd/fix • 𝗚 go fix • 𝗣 71859 • 𝗔 Alan Donovan Go 1.26 is incredibly big — it’s the largest release I’ve ever seen, and for good reason: All in all, a great release! You might be wondering about the package that was introduced as experimental in 1.25. It’s still experimental and available with the flag. P.S. To catch up on other Go releases, check out the Go features by version list or explore the interactive tours for Go 1.25 and 1.24. P.P.S. Want to learn more about Go? Check out my interactive book on concurrency. a vector from array/slice, or a vector to array/slice. Arithmetic: , , , , . Bitwise: , , , , . Comparison: , , , , . Conversion: , , . Masking: , , . Rearrangement: . Collect live goroutines. Start with currently active (runnable or running) goroutines as roots. Ignore blocked goroutines for now. Mark reachable memory. Trace pointers from roots to find which synchronization objects (like channels or wait groups) are currently reachable by these roots. Resurrect blocked goroutines. Check all currently blocked goroutines. If a blocked goroutine is waiting for a synchronization resource that was just marked as reachable — add that goroutine to the roots. Iterate. Repeat steps 2 and 3 until there are no more new goroutines blocked on reachable objects. Report the leaks. Any goroutines left in the blocked state are waiting for resources that no active part of the program can access. They’re considered leaked. Total number of goroutines since the program started. Number of goroutines in each state. Number of active threads. First by validity (invalid before valid). Then by address family (IPv4 before IPv6). Then by masked IP address (network IP). Then by prefix length. Then by unmasked address (original IP). Vet is for reporting problems. Its analyzers describe actual issues, but they don’t always suggest fixes, and the fixes aren’t always safe to apply. Fix is (mostly) for modernizing the code to use newer language and library features. Its analyzers produce fixes that are always safe to apply, but don’t necessarily indicate problems with the code.
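To give the "modernizing" side of that split a concrete shape, here is the kind of rewrite such an analyzer performs. This is my own illustration of the pattern, not actual tool output:

```go
package example

import "slices"

// Before: a hand-rolled containment loop, the kind of code a modernizer targets.
func hasAdmin(roles []string) bool {
	for _, r := range roles {
		if r == "admin" {
			return true
		}
	}
	return false
}

// After: the equivalent one-liner using the slices package (Go 1.21+).
func hasAdminModern(roles []string) bool {
	return slices.Contains(roles, "admin")
}
```

The behavior is identical; the analyzer’s job is to recognize the old pattern and offer the shorter, library-based form.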
It brings a lot of useful updates, like the improved builtin, type-safe error checking, and goroutine leak detector. There are also many performance upgrades, including the new garbage collector, faster cgo and memory allocation, and optimized and . On top of that, it adds quality-of-life features like multiple log handlers, test artifacts, and the updated tool. Finally, there are two specialized experimental packages: one with SIMD support and another with protected mode for forward secrecy.


Prolog Basics Explained with Pokémon

The project that inspired this post is a little silly—I am about to describe the mechanics of a children’s video game in great detail—but this particular problem is what finally made Prolog click for me, an epiphany I’ve been hunting for ever since reading Bruce Tate’s "Seven Languages in Seven Weeks." This exercise has taught me a lot about the kinds of interfaces I’m trying to build in somewhat more practical domains. For certain kinds of relationships, logic programming is by far the most concise and expressive programming system I’ve ever used. To understand why, let’s talk about Pokémon. Pokémon is a video game series/multimedia franchise/lifestyle brand set in a world where humans live alongside a menagerie of colorful animal characters. "Pokémon" is both the name of the franchise and the generic term for the animal characters themselves, which all have their own individual species names. There are over a thousand distinct species of Pokémon, from Bulbasaur (#1) to Pecharunt (#1025). There are all sorts of Pokémon games now, but the main series has always been about catching and battling them. During a battle, your team of six Pokémon faces off against another team. Each Pokémon is equipped with four moves that it can choose from to (usually) do damage to its opponent. You need to reduce the HP (Hit Points) of all your opponent’s Pokémon to zero before they are able to do so to you. Each Pokémon has unique traits that affect how it battles. They have a set of base stats, a large pool of possible moves, a handful of abilities, and a typing. As you will see in a moment, the immense number of combinations here is the motivation for trying to track this with software. Typing is especially important. Moves have a type, like Fire or Rock, and Pokémon can have up to two types. A move with a type that is Super Effective against the opposing Pokémon will do double damage; a move that is Not Very Effective will do half damage. It’s a little more intuitive with examples. The Fire-type move Flamethrower will do 2x to Grass-type Pokémon, because Grass is weak to Fire, but the Water-type move Surf will only do ½ damage to them, because Grass resists Water. Type modifiers can stack. Scizor is a Bug/Steel type, and both Bug and Steel are weak to Fire, so Fire moves will do 4x damage to Scizor. Water is weak to Electric, but Ground is immune, so if you use an Electric-type move against Water/Ground Swampert, you’ll do zero damage, since 0×2 is still 0. Naturally, there is a chart to help you keep track. Those are effectively the mechanics of the Pokémon video games as I understood them when I was 8. Click moves to do damage, try to click moves with good type matchups. These games are for children and, at the surface level, they’re not very hard. Before I explain how wonky the Pokémon mechanics can get under the hood, I first need to explain how logic programming works. Pokémon is a great fit for logic programming because Pokémon battles are essentially an extremely intricate rules engine. Let’s start by creating a file with a bunch of facts. In Prolog, we declare "predicates." Predicates define relationships: is a , is a , and so on. We refer to this predicate as , because the name of the predicate is and it has one argument. These facts are loaded into an interactive prompt called the "top-level." You query the top-level by typing a statement into the prompt; Prolog tries to find all the ways to make that statement true.
When there’s more than one possible solution, the top-level displays the first solution and then awaits user input. You can then have it display one more solution, all the solutions, or stop entirely. In this first example, we type and hit Enter. The top-level replies Squirtle is, in fact, a Pokémon. Not all things are Pokémon. Let’s add Pokémon types in there, as the predicate . Recall that some Pokémon have just one type while others have two. In the latter case, that’s modeled with two facts. Bulbasaur is a Grass type, and Bulbasaur is a Poison type; both are true. The paradigm is similar to a One-To-Many relation in a SQL database. Interactively, we can confirm whether Squirtle is a Water type. Can we state that Squirtle is a Grass type? No, because Squirtle is a Water type. Suppose we didn’t know what type Squirtle was. We can ask! In Prolog, names that start with an upper-case letter are variables. Prolog tries to "unify" the predicate with all possible matches for the variable. There’s only one way to make this particular predicate true though: has to be , because Squirtle’s only type is Water. For Pokémon with two types, the predicate unifies twice. Semantically, that leading semicolon on the third line means "or." is true when or when . Any of the terms can be a variable, which means we can ask questions in any direction. What are all the Grass types? Just make the first argument the variable, and set the second argument to . I cut it off, but the prompt would happily list all 164 of them. Commas can be used to list multiple predicates—Prolog will unify the variables such that all of them are true. Listing all the Water/Ice types is just a matter of asking what Pokémon exist that unify with both the Water and Ice types. Even though is a variable, in the context of the query, both instances of it have to be the same (just like in algebra). The query only unifies for values of where both those predicates hold. For instance, the Water/Ice type Dewgong is a solution because our program contains the following two facts: Therefore, subbing in for the variable satisfies the query. Squirtle, by contrast, is just a Water type: exists, but not . The query requires both to unify, so is not a possible value for . Pokémon have lots of data that you can play around with. Iron Bundle is a strong Water/Ice-type Pokémon with high Special Attack. How high exactly? With Special Attack that high, we want to make use of strong Special moves. What Special moves does Iron Bundle know? Freeze-Dry is a particularly good Special move. Here’s a query for all Ice-type Pokémon with Special Attack greater than 120 that learn Freeze-Dry. One last concept before we move on: Rules. Rules have a head and a body, and they unify if the body is true. A move is considered a damaging move if it’s either a Physical Move or a Special Move. The predicate defines all the moves that do direct damage. This will unify with any moves that do direct damage. Nothing I’ve shown so far is, logically speaking, very ambitious—just "and" and "or" statements about various facts. It’s essentially a glorified lookup table. Still, take a moment to appreciate how much nicer it is to query this database than a plausible alternative, like SQL. For the facts we’ve seen so far, I would probably set up SQL tables like this: Then query it like so: For comparison, here’s the equivalent Prolog query again: I’m not ripping on SQL—I love SQL—but that’s the best declarative query language most people interact with.
It’s amazing to me how much simpler and more flexible the Prolog version is. The SQL query would become unmanageably complex if we continued to add clauses, while the Prolog query remains easy to read and edit (once you get the hang of how variables work). With the basics established, here’s some context on the project I’m working on. Pokémon battles have an outrageous number of mechanics that all interact in complex and probabilistic ways. Part of the appeal of these games is the futile attempt to keep them all in your head better than your opponent, using that information to out-predict and out-maneuver their plans. It’s sort of like very silly Poker. The challenge, if you want to build software for this game, is to model all that complexity without losing your mind. Prolog is stunningly good at this, for two main reasons: To illustrate that, here’s how I implemented priority moves for my Pokémon draft league. Pokémon draft is pretty much what it sounds like. Pokémon are given a point value based on how good they are, each player is given a certain amount of points to spend, and you draft until every player has spent their points. Your team ends up with about 8-11 Pokémon and each week you go head to head against another person in the league. My friend and WMI collaborator Morry invited me to his a couple years ago and I’ve been hooked on the format ever since. The games are 6v6, so a big part of the battle is preparing for all the possible combinations of six your opponent could bring, and putting together six of your own that can handle all of them. Naturally, you can only build teams with the Pokémon you drafted. I just made that predicate my name: . What Pokémon do I have that learn Freeze-Dry? None. Rats. One very important type of move is priority moves. Earlier I mentioned that the Speed stat controls which Pokémon moves first. Some nuance: the Pokémon that used the move with the highest priority goes first, and if they both selected a move of the same priority, then the one with the higher Speed goes first. Most moves have a priority of zero. Ah, but not all! Accelerock has a priority of 1. A Pokémon that uses Accelerock will move before any Pokémon that uses a move with priority 0 (or less), even if the latter Pokémon has a higher Speed stat. I define a predicate that unifies with a Pokémon, the priority move it learns, and what priority that move is. A simple query that asks "what priority moves does my team learn" returns a lot of answers. Although this is technically correct (the best kind), most of these answers are not actually useful. Helping Hand and Ally Switch have very high priority, but they only have a purpose in Double Battles, which isn’t the format I’m playing. To fix this, I define all the Double Battle moves and exclude them. I’m going to exclude the move Bide too, which is functionally useless. The predicate means "true if this goal fails", and means "these two terms are different." We get the following results: Much better, but there’s a handful of moves in there that go first because they protect the user from damage or status, like Detect. That’s not really what I mean by priority move—I’m interested in moves that will surprise my opponent with damage or an adverse side effect, like Quick Attack and Sucker Punch. With those rules in place, we arrive at a very useful answer! It’s even more useful to look up what priority moves my opponent for the week has. At this point, I showed the program to Morry and he hit me with a challenge.
Pokémon with the Prankster ability get an additional +1 priority on their status moves. Could the rule be extended to note that? I happen to have one such Pokémon on my team. This took me 3 minutes, using Prolog’s if/then construct, . Now the same query includes all of Tornadus’ status moves, with their increased priority. At the top, I said that this experience had taught me about the kinds of interfaces I want to build. One of those lessons is fairly obvious: Prolog can be a little clunky, but it’s an elegant language for expressing and querying relations like the ones described here. That has implications if you, like me, are interested in the judicious use of declarative DSLs for programming. The other lesson is what kinds of tools work for non -programmers. I’m not the first person to think “it would be nice to know what priority moves my opponent’s team has.” The Pokémon community has resources like this, built in the best programming interface of all time: the humble spreadsheet. I use a copy of “Techno’s Prep Doc” , which is one of those spectacularly-advanced Google Sheets you come across in the wild sometimes. You put in the teams and it generates tons of useful information about the matchup. It has a great interface, support for a variety of formats, scannable visuals, and even auto-complete. I was curious about the formula for finding priority moves. It’s gnarly. With a little bit of clicking around, I was basically able to figure out what this does. There’s a “Backend” sheet that lists all the moves. It’s effectively a hard-coded version of my Prolog query. The lookup formula does some filtering, VLOOKUP-ing, and kinda-metaprogramming (INDIRECT returns a cell reference ) to find all the Pokémon on your team that are in that Backend list, and display them. There are a number of reasons that I, personally, would prefer to work on a version of this database implemented in Prolog instead of one implemented with spreadsheet VLOOKUPs. I plan to build webapps with this that do things the existing suite of Pokémon tooling can’t. (If I can ever get scryer-prolog to compile to WASM , that is.) Furthermore, the Prolog paradigm is clearly more extensible. The spreadsheet backend is a hard-coded list of notable moves; my database can look up any move. I still can’t really believe this query, which finds all the Special moves that Tornadus learns which are super-effective against any member of Justin’s team. Nothing like that exists in any tool that I know of—it’s the kind of thing I normally try to figure out by endlessly switching tabs. With the grammar established by my program, I put this together in like 30 seconds. I’m not interested in how structured programming is more extensible than spreadsheets, though. I already know why I don’t do all my programming in spreadsheets. A question I find very important is: What is it about this particular problem, and the kinds of people who were motivated to solve it, where the most well-maintained solution available is a spreadsheet? I believe there are a great many problems like that in the world, and a lot of improvements on that programming paradigm yet to be properly realized. Thanks to Morry Kolman for reading a draft of this blog . Some moves miss a certain percentage of the time, doing no damage. Some moves raise or lower a Pokémon's stats. Pokémon can hold items that have various effects. Damage calculations aren't constant; moves do normally-distributed damage within the calculated range. 
Pokémon can get frozen, burned, paralyzed, poisoned, or fall asleep; these all have various adverse effects. There are a variety of field effects (like weather, terrain, Trick Room) which alter move damage, turn order, and other things. Pokémon each have an ability that has various effects, e.g. Levitate makes you immune to Ground moves, Drizzle turns the weather to Rain when the Pokémon switches in, Sheer Force disables a move’s side effects but multiplies its damage by 1.3x. Players have points they (invisibly) allocate to each Pokémon before the game, to boost chosen stats. Depending on how they built the team, each Pokémon might do more damage or take hits better than you were expecting. The challenge, if you want to build software for this game, is to model all that complexity without losing your mind. Prolog is stunningly good at this, for two main reasons: Take a look at the damage calculator to get an idea of what I mean. The query model excels at describing ad-hoc combinations. The data model is perfectly suited to layering rules in a consistent way. I joined the draft league in Season 3, lost in finals, then won Seasons 4 and 5. We just started Season 6. If you want it, you can have the crown. There are a number of coders in this draft league and I have gotten precisely zero of them to try out my Prolog program. That’s kind of the point though! It needs to be a website… The Prolog implementation I’m using is Scryer Prolog, a modern Prolog implementation that emphasizes standards and formal correctness. The creator, Markus Triska, has a terrific online book, "The Power of Prolog," and accompanying YouTube channel that has soundtracked my breakfast for weeks. Scryer Prolog is also designed to encourage more constructs that preserve logical completeness and monotonicity, which means I’m not really supposed to use the or predicates. I couldn’t really figure out how to express what I wanted with the replacements offered, though. Happy to edit if anyone wants to help. Also, on Markus’ website: "My goal is to provide programs that work as intended, reliably and conveniently, with zero surprises. Programs that you can run for multiple decades without any issues such as crashes, resource leaks or other unexpected behaviour." This guy and I have some similar interests! I did some fun metaprogramming to get all the data into Prolog predicates using the Pokémon Showdown NodeJS API. Yes, putting the accent on the "e" everywhere but the code blocks was very annoying.

neilzone 1 week ago

yt-dlp's --download-archive flag

Today, I learned about the flag for . From the readme: For instance: This means that, if the download stops working for whatever reason, you have a list of the files from the playlist which have been downloaded already. When you re-run the command, yt-dlp will not attempt to download the files whose IDs are already listed in archive.txt. Very handy! But what if you have already started downloading a playlist, and did not use the flag? You can create a suitable file from a listing of your downloads directory, although exactly how you do this will depend on your preferences for interacting with a computer. In the directory of the downloaded files, I used to get a file with the list of downloaded files. I then used vim’s integrated search-and-replace function to get the format right. This involved: (Yes, I could have done it with sed or awk, without vim. I did not.) downloads the files as usual adds the ID of each downloaded file to archive.txt (which is probably specific to archiving from YouTube)

(think) 1 week ago

How to Vim: Navigating Prose in Style

I don’t know about you, but I’m not using Vim solely for programming. I also write documentation in it, plus most of my blog posts (like this one). When dealing with prose (regular text), it’s good to know a couple of essential Vim motions: Vim’s check for beginning/end of sentence is not very precise, but it mostly gets the job done. And because paragraphs are just blocks of text surrounded by blank lines, that’s handy in programming contexts as well. The forward sentence motion positions the cursor on the first character in the next sentence, or on the line after the paragraph (if the sentence is the last in a paragraph). The backward sentence motion operates similarly - it goes to the first character in the previous (or current) sentence. The paragraph motions will take you to the empty lines before or after a paragraph. Due to the simple definition of a paragraph in Vim, those are quite reliable. I guess in the world of motions like the ones provided by and you might be wondering if learning the rudimentary motions is worth it at all. In my experience it’s never a bad idea to be able to use someone else’s setup, and the built-in functionality is naturally the smallest common denominator. That’s all I have for you today. Keep hacking! and allow you to move backward/forward in sentences and allow you to move backward/forward in paragraphs

(think) 1 week ago

How to Vim: Alternative Approach to Find and Replace

The classic way to do "find and replace" in Vim is pretty well known: This will replace all instances of in the current buffer (that’s what the is about) with . The flag means you’ll get prompted for confirmation for every replacement. Not bad, right? Still, often you need to replace just a few instances of something, so the above might be a bit too much typing. Imagine you’re dealing with the following text: If you want to replace the instances with the fastest way to do this would be something like: Pretty sweet and quite interactive in my opinion. It also allows you to easily skip matches you don’t want to replace. And there are a few other tricks you can keep in mind: So, there you have it - another way to do "find and replace" in Vim! Keep hacking! - this will take you to the beginning of - this will replace with - this will take you to the next match and repeat the last edit you did You can use to select the word under the cursor and start a search with it If you’re searching for something more complex (e.g. it has multiple words) you can use instead of . means the next search match.

(think) 1 week ago

How to Vim: Insert Thing at Point in Command Mode

Most Vim users probably know that they can use to insert the contents of registers, while typing some command. For instance - you can insert the clipboard contents with . Fewer people probably know that while typing a command you can use to insert the object under the cursor. There are several objects supported by this: When is set the cursor position at the end of the currently displayed match is used. With the part of the word that was already typed is not inserted again. I find those keybindings handy when doing commands like: For me is the most commonly used keybinding, but all of them have their uses from time to time. As usual you can find much more on the topic in Vim’s user manual. Try and see where this takes you. That’s all I have for you today. I hope some of you will find some of those commands useful. - the Filename under the cursor - the Filename under the cursor, expanded with as in - the Word under the cursor - the WORD under the cursor - the line under the cursor

Danny McClelland 1 week ago

Using Proton Pass CLI to Keep Linux Scripts Secure

If you manage dotfiles in a public Git repository, you’ve probably faced the dilemma of how to handle secrets. API keys, passwords, and tokens need to live somewhere, but committing them to version control is a security risk. Proton has recently released a CLI tool for Proton Pass that solves this elegantly. Instead of storing secrets in files, you fetch them at runtime from your encrypted Proton Pass vault. The CLI is currently in beta. Install it with: This installs to . Then authenticate: This opens a browser for Proton authentication. Once complete, you’re ready to use the CLI. List your vaults: View an item: Fetch a specific field: Get JSON output (useful for parsing multiple fields): I have several tools that need API credentials. Rather than storing these in config files, I created wrapper scripts that fetch credentials from Proton Pass at runtime. Here’s a wrapper for a TUI application that needs API credentials: The key insight: fetching JSON once and parsing with is faster than making separate API calls for each field. The Proton Pass API call takes a few seconds. For frequently-used tools, this adds noticeable latency. The solution is to cache credentials in the Linux kernel keyring: With caching: The cache expires after one hour, or when you log out. Clear it manually with: The CLI also has built-in commands for secret injection. The command passes secrets as environment variables: The command processes template files: These use a URI syntax: to reference secrets. For applications that read credentials from config files (like WeeChat’s ), the wrapper can update the file before launching: The CLI can also act as an SSH agent, loading keys stored in Proton Pass: This is useful if you store SSH private keys in your vault. This approach keeps secrets out of your dotfiles repository entirely. The wrapper scripts reference Proton Pass item names, not actual credentials. Your secrets remain encrypted in Proton’s infrastructure and are only decrypted locally when needed. The kernel keyring cache is per-user and lives only in memory. It’s cleared on logout or reboot, and the TTL ensures credentials don’t persist indefinitely. For public dotfiles repositories, this is a clean solution: commit your wrapper scripts freely, keep your secrets in Proton Pass. First run: ~5-6 seconds (fetches from Proton Pass) Subsequent runs: ~0.01 seconds (from kernel keyring)

(think) 1 week ago

How to Vim: Using Effectively the Command History

One of the frustrating aspects of Vim for me is that in insert mode you’re quite limited in what you can do. That’s fine most of the time, except when you’re in command-line mode (meaning you’re typing something like ). In command-line mode there’s no way to switch to normal mode, so if you make a typo or want to edit a command you’ve previously invoked, that’s a bit painful. Recently I’ve discovered a way to mitigate the problem - namely the command history window. (see for details) You can trigger it in a couple of ways: When the command history window opens you can edit its contents like any other buffer and normally you’d find some command that’s of interest to you, perhaps edit it, and afterwards press RET to run it. You can also close the window with either or . Note that the command history window is special and while you’re in it you can’t really move around (e.g. switch to another window with ) For me the main use of the command history window is to reuse and tweak longer and commands, but I can imagine it having other uses as well. It’s certainly a good addition to any Vimmer’s tool belt. Going back to the original problem I posted - how do you fix a typo while entering some command? Imagine you wrote the following: Just press and you can instantly fix the command in the command window. Now you can just do a (remember you’re in normal mode) and press RET to fire the fixed command. That’s all I have for you today. Keep hacking! P.S. If only Vim’s insert mode supported ’s keybindings or something similar… You can, however, get something similar using plugins like rsi.vim and readline.vim. Press and afterwards do a
