Posts in CSS (20 found)
Maurycy Yesterday

How to write your own website:

I recently wrote an essay on why you should set up a personal website rather than using social media. Doing so lets you own your space on the internet, customize it, and free your readers from constant advertising and algorithmic feeds designed to keep you stuck doomscrolling all day. However, despite how much time we spend using it, creating something for the internet is seen as arcane wizardry by most people. This is a fairly accessible guide to getting started.

You'll need a text editor (any will do) and a browser (you already have one). All pages are written in HTML, which is a simple text-based format. To start with, a plain line of text saved in a file is already a perfectly valid HTML document. To try this, just create a text file with a ".html" extension and open it in your favorite browser. Do this now: experimenting is the best way to learn how everything works.

Plain text is boring, so let's add some formatting. The angle bracket things are tags: "<b>" is an opening tag, and "</b>" is the matching closing tag. The word surrounded by brackets ("b") is the tag name, which tells the browser what to do: in this case, bolding the enclosed text. The other formatting tags are <em> (emphasis), <u> (underline), <sub> (subscript), <sup> (superscript), <small> (small text), <mark> (highlight) and <del> (deleted). You don't have to memorize this list, but go and try a few out. There's also <br/> (break), which adds a line break. It's special because there's no closing tag: it is always immediately closed and can't contain any text. I like to add a slash after the tag name to indicate this.

A big wall of text can get quite ugly, so it's good to break it up with <p> (paragraph) tags. Each paragraph will be visually separated from other content on the page. Together, the matching tags and their contents form an element. Elements can contain other elements, but it's important that they are closed in the correct order. Browsers will attempt to render invalid HTML, but the results may not be what you intended: it's best to make it easy for them.

On that topic, it's good practice to put all your content inside a <body> element which is itself inside an <html> element. This isn't mandatory, but it helps browsers render your page correctly: in the case of an old browser, you don't want metadata (we'll add some later) getting confused for page content.

OK, back to text-wall-avoidance: the <ul> and <ol> (unordered/ordered list) tags create, well, lists. Each item should be wrapped in <li> (list item) tags. You can add angle brackets to a page with &gt; (>), &lt; (<) and &amp; (&). These entities will render as the corresponding character, but won't form tags.

Headings use <h1> (heading 1) through <h5> (heading 5), with larger numbers using smaller font sizes. At this point the example page reads something like "This site has epic things and I wrote it myself. To do: Figure out how to add links."

About that. Links are just <a> (anchor) tags, but they have something new: an attribute after the tag name but before the closing bracket. The "href" attribute sets where the link points to. A lot of other tags can also have attributes: for example, ordered lists with the "reversed" attribute count backwards.
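Putting the tags covered so far together, a page in the spirit of the original examples might look something like this (the wording and the link target are placeholders):

    <html>
      <body>
        <h1>My epic site</h1>
        <p>Check out my new site: I have <b>many</b> epic things here.</p>
        <p>About this site (unordered):</p>
        <ul>
          <li>It has epic things</li>
          <li>... and is handwritten HTML</li>
        </ul>
        <p>It uses these tags (ordered):</p>
        <ol>
          <li>&lt;html&gt; and &lt;body&gt;</li>
          <li>&lt;p&gt;</li>
          <li>&lt;ul&gt;, &lt;ol&gt; and &lt;li&gt;</li>
        </ol>
        <p>To do: Figure out how to <a href="https://example.com">add links</a>.</p>
      </body>
    </html>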
The URL in "href=" can be relative: If linking up multiple pages on the same site, instead of this: … just write this: Images work similarly to links, except that they are self-closing elements like <br/>: Check out this picture of a nebula I took! (If you don’t have a URL for your image, skip to the hosting section to set one up) That’s all the essentials, but there’s a lot of other useful tags. For example <details> creates a dropdown that works with ctrl-f: This is a dropdown with just HTML. It works well with browser features (ctrl-f, fragment identifiers, screen readers, etc) by default. (better usability than 99% of commercial sites!) …but I can’t cover everything without writing a whole book. (The Mozzila docs are a fantastic reference) At this point, you should have something like this: I made this site to write about things I do. More updates soon™ . Here's my picture of the Dumbbell Nebula: Let’s start by giving the page a machine-readable title: Like with <body>, the <head> tag isn’t required, but it is good to include it: Otherwise, any metadata that the browser doesn’t understand might be mistaken for content. The page still looks kinda bad: Text extending the edges of the page isn’t exactly easy to read. It’s not too bad when crammed into my blog, but longer paragraphs will look terrible on large monitors. To fix this, we need to add some style and layout information using the <style> tag: Unlike other tags, the contents of <style> isn’t HTML, but CSS: a whole other langauge embedded within the file. CSS is compoosed of blocks, each begining with a selector to control what gets effected. Here, this is just the name of a tag: "head" The selector is followed by a series of declarations wraped in curly braces. My example only has one: "max-width: 30em;" This caps the width of the element at 30 times the font size: I made this site to write about things I do. More updates soon™ . Here's my picture of the Dumbbell Nebula: The page is looking rather asymetrical, so let’s center the column. For fixed-width elements, this can be done using the "margin" property: I made this site to write about things I do. More updates soon™ . Here's my picture of the Dumbbell Nebula: (For varable width elements, use flexbox for centering and other fancy layouts. A single line of text can be centered with "text-align=center") Personally, I like dark themed sites, so lets change some of the colors: I made this site to write about things I do. More updates soon™ . Here's my picture of the Dumbbell Nebula: The "color" style will carry over to every element inside of the styled tag, so there’s no need to individually change the text-color of every element. However, the links do need to be changed because they override the color by default. That’s it. Everything you need to replicate my blog, minus a few small bits like the sans-serif font, nagivation box, etc. Of course, your website can and should be different: It’s yours . I highly recomend you read some documenation and play around with CSS. There’s also way more to it then I can possbly cover here. Every website you see was created with it, and it even supports animations and basic interactivity . … also, check out your browser’s devtools (ctrl-shift-i): It will have a nice GUI for editing which shows you the result in real time and shows you what’s going on under the hood. If you ever run out of tags, you can just make up your own and style them as needed. 
If you ever run out of tags, you can just make up your own and style them as needed. As long as the name includes a hyphen, it's guaranteed not to be included in any future version of HTML. The specification even lists <math-α> and <emotion-😍> as allowed custom element names. I've used this heavily on this page: the example websites aren't screenshots, they are <fake-frame> elements styled up to look like a browser window. Custom tags are also very handy for styling text (there's a small sketch of this at the end of the post).

At this point you should have a reasonably nice page ready to put up on the internet. The easiest way to do this is to use a static file hosting service like GitHub Pages or Cloudflare Pages. Both of these have generous free tiers that should last a very long time. If you don't like big companies, there are plenty of similar, smaller services. These can be more limited: the popular Neocities charges $5/mo to use a custom domain. Another option is to rent a server ($3-$5/mo) or, if you have good internet, run one yourself. This is by far the most fiddly option: I would not recommend it unless you like playing with computers.

All of these (except a server) will give you a subdomain by default. For example, GitHub Pages will give you your-username.github.io. However, I do recommend setting up a custom domain: this will let you switch providers seamlessly should anything happen.

All of these will work in a similar way: upload a file with some name, and it will be given a URL with that same name. The one exception is that files called "index.html" will be viewable at the root of the folder they are in. You should put an index.html in the root of your site to serve as the homepage, but apart from that, the organization is up to you.
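For example, a made-up <cool-text> tag (the name is arbitrary) can be styled like any built-in element:

    <style>
      cool-text {
        color: #ff7edb;
        font-variant: small-caps;
      }
    </style>

    <p>Some <cool-text>extra fancy</cool-text> words in an otherwise plain paragraph.</p>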


The frustration of a perfect setup

No matter how I look at the list of apps I currently use, whether first-party or third-party, I can't find anything to change, not a program to replace, not a service to swap for another. I think I am happy with my setup. It feels strange to admit, but somehow, I can't quite believe it; I must be missing something, something surely can be tweaked. What happens after peak setup?

This frustration comes from the fact that looking at new apps, digging into settings, trying new online services, working out how each of these things operates with the others, is one of my favourite hobbies. I mean, a quick glance at the archive of this site will tell you that, not only do I love writing about apps and digital tools, but I love playing with their configurations; I'm like a kid with Lego bricks, building things, taking them apart, and building them again, with a huge smile, in a slightly different and improved way.

Now that my application setup appears to be "final", it feels as though all my toys and Lego bricks are neatly stored away in their respective drawers, sorted by colour, by type, and by size. It's perfect, and seeing my beautiful collection all nice and tidy like that is a very satisfying sensation, except I'm looking at it seated on the empty floor of my childhood bedroom, alone and bored. What is there to do when nothing needs to be improved?

I recently wrote about my HTML and CSS "explorations" with this blog. Satisfied with the results, I think this job is done. The same goes for how Eleventy works on my machine: everything has been optimised, refined, future-proofed (especially Node.js): nothing to see here! Even the hosting is something I'm very happy with. My only gripe with xmit is that there is no possibility for me to pay for it.

The other apps on my Mac — the ones that don't live in the Terminal like Eleventy, Node.js & npm, and xmit — are also perfect at what they do, and I can't think of anything better to explore, let alone to use. If this is not your first visit, you already know how I feel about BBEdit. Well, I feel just about the same about NetNewsWire, which is as close to perfection as an app can get as far as I'm concerned. It feels part of the OS (even more so than current system apps if I'm being honest), it is stable, it is simple to use, and it runs smoothly on my soon-to-be six-year-old MacBook Air.

Being happy with Safari is by far the strongest proof that my setup is final. Using StopTheScript to block JavaScript on most media sites, along with the performance and privacy benefits of using a DNS resolver like Quad9, has proven to be an efficient way to keep Safari light and responsive, even if my web experience is getting a little more interrupted than I would like, due to all the crap websites throw at first-time visitors these days.

Yesterday, I had a look at apps like Yoink, Karabiner Elements, Hazel, and also got a taste of Mullvad Browser and News Explorer. Some of these apps were tried purely out of curiosity, to see if they would fit right in my "workflow"; others were basically reassurance that my current system and choices were the best I could have made. *1

Among all the parties involved in this setup, the obvious candidate for a replacement is my Intel-powered MacBook Air. Yet, this old computer is currently in great shape: the recent factory-settings reset I had to do surely helped.
But its best feature is not being able to run macOS Tahoe: stuck on macOS Sequoia, it's protecting me from Liquid Glass on the Mac and the "icons in menus everywhere" experience. My personal laptop is a breath of fresh air after spending hours on my work computer running Tahoe. *2

So, what will be able to make that itch go away? When nothing is broken, don't fix it, as they say. But surely, there must be something that I'm missing, surely there is a program, somewhere, that would delight me, that would put a smile on my face. I want a new box of Lego bricks, I want to empty my drawers on the floor and see if I can do better.

*1 In case you're wondering, all of these apps are excellent, but not enough to replace what I already use, or to justify adding a new item to my list. For example, Mullvad Browser, like Firefox, isn't scriptable; News Explorer has more features than NetNewsWire, but is not as polished; Yoink looks incredibly useful, but I prefer my own ways for now, &c.

*2 Its replacement will have to wait until the new generation comes out, probably in March; then I can decide on whether I want to stick to the Air family, keep mine a bit longer, or upgrade for a far nicer screen and go with the Pro.

Lea Verou 1 week ago

Web dependencies are broken. Can we fix them?

Abstraction is the cornerstone of modern software engineering. Reusing logic and building higher-level solutions from lower-level building blocks is what makes all the technological wonders around us possible. Imagine if every time anyone wrote a calculator they also had to reinvent floating-point arithmetic and string encoding! And yet, the web platform has outsourced this fundamental functionality to third-party tooling. As a result, code reuse has become a balancing of tradeoffs that should not have existed in the first place.

In NodeJS, you just npm install a package and reference specifiers straight away in your code. Same in Python, with pip install. Same in Rust, with cargo add. In healthy ecosystems you don't ponder how or whether to use dependencies. The ecosystem assumes dependencies are normal, cheap, and first-class. You just install them, use them, and move on. "Dependency-free" is not a badge of honor.

Instead, dependency management in the web platform consists of bits and bobs of scattered primitives, with no coherent end-to-end solution. Naturally, bundlers such as Webpack, rollup, and esbuild have picked up the slack, with browserify being the one that started it all, in 2012.

There is nothing wrong with bundlers when used as a performance optimization to minimize waterfall effects and overhead from too many HTTP requests. You know, what a bundler is supposed to do. It is okay to require advanced tools for advanced needs, and performance optimization is generally an advanced use case. Same for most other things bundlers and build tools are used for, such as strong typing, linting, or transpiling. All of these are needs that come much later than dependency management, both in a programmer's learning journey and in a project's development lifecycle. Dependency management is such a basic and ubiquitous need, it should be a part of the platform, decoupled from bundling. Requiring advanced tools for basic needs is a textbook usability cliff. In other ecosystems, optimizations happen (and are learned) after dependency resolution. On the web, optimization is the price of admission! This is not normal.

Bundlers have become so ubiquitous that most JS developers cannot even imagine deploying code without them. READMEs are written assuming a bundler, without even mentioning the assumption. It's just how JS is consumed. My heart breaks for the newbie trying to use a drag and drop library, only to get mysterious errors about specifiers that failed to resolve.

However, bundling is not technically a necessary step of dependency management. Importing files through URLs is natively supported in every browser, via ESM imports. HTTP/2 makes importing multiple small files far more reasonable than it used to be — at least from a connection overhead perspective. You can totally get by without bundlers in a project that doesn't use any libraries. But the moment you add that first dependency, everything changes. You are suddenly faced with a huge usability cliff: which bundler to use, how to configure it, how to deploy with it, a mountain of decisions standing between you and your goal of using that one dependency. That one drag and drop library. For newcomers, this often comes very early in their introduction to the web platform, and it can be downright overwhelming.

It is technically possible to use dependencies without bundlers, today.
There are a few different approaches, and — I will not sugarcoat it — they all suck. There are three questions here:

- Use specifiers or URLs?
- How to resolve specifiers to URLs?
- Which URL do my dependencies live at?

There is currently no good answer to any of them, only fragile workarounds held together by duct tape. Using a dependency should not need any additional song and dance besides "install this package" + "now import it here". That's it. That's the minimum necessary to declare intent. And that's precisely how it works in NodeJS and other JS runtimes. Anything beyond that is reducing signal-to-noise ratio, especially if it needs to be done separately for every project or, worse, for every dependency. You may need to have something to bite hard on while reading the next few sections. It's going to be bad.

Typically, package managers like npm take care of deduplicating compatible package versions and use a directory like node_modules to install packages. In theory, one could deploy node_modules as part of their website and directly reference files in client-side JS. For example, to use Vue (a sketch of this and of the CDN flavor follows below). It works out of the box, and is a very natural thing to try the first time you install a package and notice the node_modules folder. Great, right?

No. Not great. First, deploying your entire node_modules directory is both wasteful and a security risk. In fact, most serverless hosts (e.g. Netlify or Vercel) automatically remove it from the publicly deployed files after the build is finished. Additionally, it violates encapsulation: paths within a package are generally seen as an implementation detail of the package itself, and packages expose specifier exports that they map to internal paths. If you decide to circumvent this and link to files directly, you now need to update your import paths whenever you update the package. It is also fragile, as not every module is installed directly in node_modules — though those explicitly marked as app dependencies are.

Another common path is importing from CDNs like Unpkg and JSDelivr; for Vue, it looks much the same, just with a CDN URL. It's quick and easy. Nothing to install or configure! Great, right?

No. Not great. It is always a bad idea to introduce a dependency on a whole other domain you do not control, and an even worse one when linking to executable code. First, there is the obvious security risk. Unless you link to a specific version, down to the patch number, and/or use SRI, the resource could turn malicious overnight under your nose if the package is compromised. And even if you link to a specific version, there is always the risk that the CDN itself could get compromised. Who remembers polyfill.io?

But even supply-chain attacks aside, any third-party domain is an unnecessary additional point of failure. I still remember scrambling to change JSDelivr URLs to Unpkg during an outage right before one of my talks, or having to hunt down all my repos that used RawGit URLs when it sunset, including many libraries.

The DX is also suboptimal. You lose the immediacy and resilience of local, relative paths. Without additional tooling (Requestly, hosts-file edits, etc.), you now need to wait for CDN roundtrips even during local development. Wanted to code on a flight? Good luck. Needed to show a live demo during a talk, over clogged conference wifi? Maybe sacrifice a goat to the gods first. And while CDNs maintain encapsulation slightly better than raw file imports, as they let you reference a package by its name for its default export, additional specifiers (subpath exports) typically still require importing by file path.
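Both flavors look roughly like this; the file paths are illustrative, since every package lays out its files differently and CDN URL formats vary:

    <script type="module">
      // Option 1: a deployed node_modules folder, reaching into the package's internal files
      import { createApp } from "/node_modules/vue/dist/vue.esm-browser.js";

      // Option 2: the same file served from a third-party CDN
      // import { createApp } from "https://unpkg.com/vue@3/dist/vue.esm-browser.js";

      // Assumes a <div id="app"></div> somewhere on the page.
      createApp({ template: "<p>Hello from Vue</p>" }).mount("#app");
    </script>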
"But with public CDNs, I benefit from the resource having already been cached by another website the user visited!" Oh my sweet summer child. I hate to be the one to break it to you, but no, you don't, and that has been the case since about 2020. Double-keyed caching obliterated this advantage. In case you were not aware, yes, your browser will redownload every single resource anew for every single website (origin) that requests it. Yes, even if it's exactly the same. This changed to prevent cross-site leaks: malicious websites could exfiltrate information about your past network activity by measuring how long a resource took to download, and thus infer whether it was cached. Those who have looked into this problem claim that there is no other way to prevent these timing attacks than to actually redownload the resource. No way for the browser to even fake a download by simply delaying the response. Even requiring resources to opt in (e.g. via CORS) was ruled out, the concern being that websites could then use it as a third-party tracking mechanism.

I personally have trouble accepting that such wasteful bandwidth usage was the best balance of tradeoffs for all Web users, including those in emerging economies and different locales [1]. It's not that I don't see the risks — it's that I am acutely aware of the cost, a cost that is disproportionately borne by those not in the Wealthy Western Web. How likely is it that a Web user in Zimbabwe, where 1 GB of bandwidth costs 17% of the median monthly income, would choose to download React or nine weights of Roboto thousands of times to avoid seeing personalized ads? And how patronizing is it for people in California to be making this decision for them?

A quick and dirty way to get local URLs for local development and CDN URLs for the remote site is to link to relative URLs, and add a URL rewrite to a CDN if the file is not found: e.g. with Netlify rewrites, a single rule can proxy anything under /node_modules/ to a CDN. Since node_modules is not deployed, this will always redirect to the remote URL, while still allowing for local URLs during development. Great, right?

No. Not great. Like the mythical hydra, it solves one problem and creates two new ones. First, it still carries many of the same issues of the approaches it combines:

- Linking to CDNs is inherently insecure
- It breaks encapsulation of the dependencies

Additionally, it introduces a new problem: the two files need to match, but the naïve approach above would always just link to the latest version. Sure, one could alleviate this by building the rewrite rules with tooling, so they pin the specific versions in your dependency tree. But the point is not that it's insurmountable, but that it should not be this hard.

Another solution is a lightweight build script that copies either entire packages or specific exports into a directory that will actually get deployed. When dependencies are few, this can be as simple as an npm script (see the sketch below). So now we have our own nice subset of node_modules and we don't depend on any third-party domains. Great, right?

No. Not great. Just like most other solutions, this still breaks encapsulation, forcing us to maintain a separate, ad-hoc index of specifiers to file paths. Additionally, it has no awareness of the dependency graph. Dependencies of dependencies need to be copied separately.
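A minimal version of such a copy step might look like the following package.json fragment (the package name and paths are made up for illustration):

    {
      "scripts": {
        "copy-deps": "mkdir -p public/vendor && cp node_modules/some-lib/dist/index.js public/vendor/some-lib.js"
      }
    }

The page then imports "/vendor/some-lib.js" by plain URL, which is exactly the ad-hoc specifier-to-path index just described.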
But wait a second. Did I say dependencies of dependencies? How would that even work?

It gets unimaginably worse, my friend, that's what happens. There is no reasonable way for a library author to link to dependencies without excluding certain consumer workflows. There is no local URL a library author can use to reliably link to dependencies, and CDN URLs are highly problematic. Specifiers are the only way here. So the moment you include a dependency that uses dependencies, you're forced into specifier-based dependency management workflows, whether these are bundlers, or import-map-flavored JSON vomit in every single HTML page (discussed later).

As a fig leaf, libraries will often provide a "browser" bundle that does not use specifiers, which consumers can import instead of their normal entry point. This combines all their dependencies into a single dependency-free file that you can import from a browser. This means they can use whatever dependencies they want, and you can still import that bundle using regular ESM imports in a browser, sans bundler. Great, right?

No. Not great. It's called a bundle for a reason. It bundles all their dependencies too, and now they cannot be shared with any other dependency in your tree, even if it's exactly the same version of exactly the same package. You're not avoiding bundling, you're outsourcing it, and multiplying the size of your JS code in the process. And if the library author has not done that, you're stuck with little to do, besides a CDN that rewrites specifiers on the fly like esm.sh, with all the CDN downsides described above.

As someone who regularly releases open source packages (some with billions of npm installs), I find this incredibly frustrating. I want to write packages that can be consumed by people using or not using bundlers, without penalizing either group, but the only way to do that today is to basically not use any dependencies. I cannot even modularize my own packages without running into this! This doesn't scale.

Browsers can import specifiers, as long as the mapping to a URL is explicitly provided through an import map: a block of JSON inside a <script type="importmap"> element (there's a sketch at the end of this section). Did you notice something? Yes, this is an HTML block. No, I cannot link to an import map that lives in a separate file. Instead, I have to include the darn thing in. Every. Single. Page. The moment you decide to use JS dependencies, you now need an HTML templating tool as well. 🙃

"💡 Oh I know, I'll generate this from my library via DOM methods!" I hear you say. No, my sweet summer child. It needs to be present at parse time. So unless you're willing to document.write it (please don't), the answer is a big flat NOPE.

"💡 Ok, at least I'll keep it short by routing everything through a CDN or the same local folder." No, my sweet summer child. Go to sleep and dream of globs and URLPatterns. Then wake up and get to work, because you actually need to specify. Every. Single. Mapping. Yes, transitive dependencies too. You wanted to use dependencies? You will pay with your blood, sweat, and tears. Or, well, another build tool.

So now I need a build tool to manage the import map, like JSPM. It also needs to talk to my HTML templating tool, which I now had to add so it can spit out these import maps on. Every. Single. HTML. Page.

There are three invariants that import maps violate:

- Locality: Dependency declarations live in HTML, not JS. Libraries cannot declare their own dependencies.
- Composability: Import maps do not compose across dependencies and require global coordination.
- Scalability: Mapping every transitive dependency is not viable without tooling.

Plus, you still have all of the issues discussed above, because you still need URLs to link to. By trying to solve your problem with import maps, you now got multiple problems.
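For reference, an import map looks roughly like this (the package names and paths are made up; in practice, every transitive dependency needs its own entry):

    <script type="importmap">
    {
      "imports": {
        "some-lib": "/vendor/some-lib/index.js",
        "some-lib/extras": "/vendor/some-lib/extras.js",
        "its-dependency": "/vendor/its-dependency/index.js"
      }
    }
    </script>

This block has to be present, and kept in sync, in every HTML page that imports the modules.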
To sum up, in their current form, import maps don't eliminate bundlers — they recreate them in JSON form, while adding an HTML dependency and worse latency.

Given the current state of the ecosystem, not using bundlers in any nontrivial application does seem like an exercise in masochism. Indeed, per State of JS 2024, bundlers were extremely popular, with Webpack having been used by 9 in 10 developers and having close to 100% awareness! But sorting by sentiment paints a different picture, with satisfaction, interest, and positivity dropping year after year. Even those who never question the status quo can feel it in their gut that this is not okay. This is not a reasonable way to manage dependencies. This is not a healthy ecosystem.

Out of curiosity, I also ran two polls on my own social media. Obviously, this suffers from selection bias, due to the snowball sampling nature of social media, but I was still surprised to see such a high percentage of bundle-less JS workflows:

- Twitter/X poll: 17.6% of respondents
- Mastodon poll: 40% (!) of respondents

I'm very curious how these folks manage the problems discussed here.

Oftentimes when discussing these issues, I get the question "but other languages are completely compiled, why is it a problem here?". Yes, but their compiler is official and always there. You literally can't use the language without it. The problem is not compilation, it's fragmentation. It's the experience of linking to a package via a browser import only to see errors about specifiers. It's adding mountains of config and complexity to use a utility function. It's having no clear path to write a package that uses another package, even if both are yours.

Abstraction itself is not something to outsource to third-party tools. This is the programming equivalent of privatizing fundamental infrastructure — roads, law enforcement, healthcare — systems that work precisely because everyone can rely on them being there. Like boiling frogs, JS developers have resigned themselves to immense levels of complexity and gruntwork as simply how things are. The rise of AI introduced swaths of less technical folks to web development, and their overwhelm and confusion is forcing us to take a long hard look at the current shape of the ecosystem — and it's not pretty. Few things must always be part of a language's standard library, but dependency management is absolutely one of them. Any cognitive overhead should be going into deciding which library to use, not whether to include it and how.

This is also actively harming web platform architecture. Because bundlers are so ubiquitous, we have ended up designing the platform around them, when it should be the opposite. For example, because import.meta.url is unreliable when bundlers are used, components have no robust way to link to other resources (styles, images, icons, etc.) relative to themselves, unless these resources can be part of the module tree. So now we are adding features to the web platform that break any reasonable assumption about what HTML, CSS, and JS are, like JS imports for CSS and HTML, which could have been handled with much simpler primitives if web platform features could be relied on. And because using dependencies is nontrivial, we are adding features to the standard library that could have been userland or even browser-provided dependencies.

To reiterate, the problem isn't that bundlers exist — it's that they are the only viable way to get first-class dependency management on the web. JS developers deserve better. The web platform deserves better.
As a web standards person, my first thought when spotting such a lack is "how can the web platform improve?". And after four years in the TAG, I cannot shake the holistic architectural perspective of "which part of the Web stack is best suited for this?"

Before we can fix this, we need to understand why it is the way it is. What is the fundamental reason the JS ecosystem overwhelmingly prefers specifiers over URLs? On the surface, people often quote syntax, but that seems to be a red herring. There is little DX advantage of a bare specifier over a local URL, or even an extensionless URL path (which can be configured to have a JS MIME type). Another oft-cited reason is immutability: remote URLs can change, whereas specifiers cannot. This also appears to be a red herring: local URLs can be just as immutable as specifiers.

Digging deeper, it seems that the more fundamental reason has to do with purview. A URL is largely the same everywhere, whereas a specifier can resolve to different things depending on context. A specifier is app-controlled, whereas a URL is not. There needs to be a standard location for a dependency to be located and referenced from, and that needs to be app-controlled. Additionally, specifiers are universal. Once a package is installed, it can be imported from anywhere, without having to work out paths. The closest HTTP URLs can get to this is root-relative URLs, and that's still not quite the same.

Specifiers are clearly the path of least resistance here, so the low-hanging fruit would be to make it easier to map specifiers to URLs, starting by improving import maps. An area with huge room for improvement is import maps themselves: both making it easier to generate and include them, and making the maps smaller, leaner, and easier to maintain. The biggest need here is external import maps, even if only via a src attribute. This would eliminate the dependency on HTML templating and open the way for generating them with a simple build tool. This was actually part of the original import map work, and was removed from the spec due to lack of implementer interest, despite overwhelming demand. In 2022, external import maps were prototyped in WebKit (Safari), which prompted a new WHATWG issue. Unfortunately, it appears that progress has since stalled once more.

External import maps do alleviate some of the core pain points, but are still globally managed in HTML, which hinders composability and requires heavier tooling. What if import maps could be imported into JS code? If JS could import import maps, this would eliminate the dependency on HTML altogether, allowing scripts to localize their own import info, and the graph to be progressively composed instead of globally managed. Going further, import maps delivered via an HTTP header would even allow webhosts to generate them for you and send them down the wire completely transparently. This could be the final missing piece for making dependencies truly first-class. Imagine a future where you just install packages and use specifiers without setting anything up, without compiling any files into other files, with the server transparently handling the mapping!

However, import maps need URLs to map specifiers to, so we also need some way to deploy the relevant subset of node_modules to public-facing URLs, as deploying the entire directory is not a viable option. One solution might be a way to explicitly mark dependencies as client-side, possibly even specific exports.
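Purely as an illustration of that idea (nothing like this exists today, and the field name is invented), such a marking could be a small manifest entry that tooling or a host reads when deciding what to expose:

    {
      "dependencies": {
        "some-lib": "^2.1.0",
        "build-only-tool": "^5.0.0"
      },
      "clientDependencies": [
        "some-lib",
        "some-lib/extras"
      ]
    }

Only the packages or exports listed under the hypothetical "clientDependencies" key would be deployed and added to the generated import map; build-only dependencies would stay out of the public site.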
This would decouple detection from processing app files: in complex apps it can be managed via tooling, and in simple apps it could even be authored manually, since it would only include top-level dependencies.

Even if we had better ways to mark which dependencies are client-side and map specifiers to URLs, these are still pieces of the puzzle, not the entire puzzle. Without a way to figure out what depends on what, transitive dependencies will still need to be managed globally at the top level, defeating any hope of a tooling-light workflow. The current system relies on reading and parsing thousands of files to build the dependency graph. This is reasonable for a JS runtime where the cost of file reads is negligible, but not for a browser where HTTP roundtrips are costly. And even if it were, this does not account for any tree-shaking.

Think of how this works when using URLs: modules simply link to other URLs, and the graph is progressively composed through these requests. What if specifiers could work the same way? What if we could look up and route specifiers when they are actually imported?

Here's a radical idea: what if specifiers were just another type of URL, and specifier resolution could be handled by the server in the same way a URL is resolved when it is requested? They could use a protocol that can be omitted in certain contexts, such as ESM imports. How would these URLs be different than regular local URLs?

- Their protocol would be implied in certain contexts — that would be how we can import bare specifiers in ESM.
- Their resolution would be customizable (e.g. through import maps, or even regular URL rewrites).
- Despite looking like absolute URLs, their resolution would depend on a request header identifying the importing module (thus allowing different modules to use different versions of the same dependency). A request without that header would fail.
- HTTP caching would work differently; basically in a way that emulates the current behavior of the JS module cache.

Architecturally, this has several advantages:

- It bridges the gap between specifiers and URLs. Rather than having two entirely separate primitives for linking to a resource, it makes specifiers a high-level primitive and URLs the low-level primitive that explains it.
- It allows retrofitting specifiers into parts of the platform that were not designed for them, such as CSS. This is not theoretical: I was at a session at TPAC where bringing specifiers to CSS was discussed. With this, every part of the platform that takes URLs can now utilize specifiers; it would just need to specify the protocol explicitly.

Obviously, this is just a loose strawman at this point, and would need a lot of work to turn into an actual proposal (which I'd be happy to help out with, with funding), but I suspect we need some way to bridge the gap between these two fundamentally different ways to import modules. Too radical? Quite likely. But abstraction is foundational, and you often need radical solutions to fix foundational problems. Even if this is not the right path, I doubt incremental improvements can get us out of this mess for good.

But in the end, this is about the problem. I'm much more confident that the problem needs solving than I am of any particular solution. Hopefully, after reading this, so are you. So this is a call to action for the community. To browser vendors, to standards groups, to individual developers. Let's fix this! 💪🏼

Thanks to Jordan Harband, Wes Todd, and Anne van Kesteren for reviewing earlier versions of this draft.
[1] In fact, when I was in the TAG, Sangwhan Moon and I drafted a Finding on the topic, but the TAG never reached consensus on it. ↩︎


Easy (Horizontal Scrollbar) Fixes for Your Blog CSS

Read on the website: There are narrow-screen CSS problems I often email people about. These three fixes should be enough for most.
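As a rough sketch (not necessarily the three fixes the post has in mind), these are common culprits behind horizontal scrollbars on narrow screens:

    /* Oversized media pushing the page wider than the viewport */
    img, video { max-width: 100%; height: auto; }

    /* Long code lines: let the block scroll instead of the whole page */
    pre { overflow-x: auto; }

    /* Long unbroken words or URLs overflowing their container */
    body { overflow-wrap: break-word; }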

The Jolly Teapot 1 week ago

New year, new me, new web browsing setup?

Since we're at the start of a new year, I will stop fine-tuning everything on this blog and let it live as the receptacle it's supposed to be. With my mind cleared of HTML and CSS concerns, I now have energy to waste on new optimisations of my digital environment, and this time with an old favourite of mine: content blockers. *1

In 2022, I experimented with blocking JavaScript on a per-site basis, which, at the time, allowed me to feel better about my behaviour on the web. You see, I thought that I was not actively refusing adverts. I was just disabling a specific technology on my web browser; not my fault if most ads are enabled via JS, after all. True, ads couldn't reach my house, but not because I actively refused their delivery; simply because the trucks used for their delivery weren't allowed to drive on my pedestrian-only street. Ethically, I preferred this approach to the one blocking all ads blindly on every site, even if the consequences, from the publishers' perspective, were the same. I know it was very hypocritical of me, and I know I was still technically blocking the ads. Nevertheless, I felt less guilty blocking the technology used for ads, and not the ads directly.

This setup was fine, until it wasn't. My web experience was not great. Blocking JavaScript by default breaks too many non-media sites, and leaving it on made me realise how awful browsing the web without a content blocker can be. The only way for this system to work was to have patience and discipline with the per-site settings. Eventually, I gave up and reinstalled the excellent Wipr Safari extension on all my devices a few weeks later.

Last year, on top of Wipr, I also tried services like NextDNS and Mullvad DNS. With these, the browser ad blocker becomes almost superfluous, as all it has to do is remove the empty boxes that were supposed to be ads before being blocked by the DNS. It was an efficient setup, but I was still blocking ads, which kept on bothering me. While I happily support a few publications financially, I can't do the same for all the sites I visit. For the ones I am not paying for, seeing ads seems like a fair deal; blocking ads was making me feel increasingly guilty. *2 Like I wrote in the other post on the topic: "Somehow, I always feel a little bit of shame and guilt when talking about content blockers, especially ad blockers. Obviously ads are too often the only way many publishers manage to make decent money on the internet: every newspaper can't be financially successful with subscriptions, and every media company can't survive only on contributions and grants."

That's why, recently, I stopped using Mullvad as my DNS resolver and switched to Quad9, which focuses on privacy protection and not ad blocking. I also uninstalled Wipr. Today, I rely solely on StopTheScript. What's new this time around is that I will try to be more disciplined than I was three years ago, and do the work to make this system last.

What I do is set the default StopTheScript setting to "Ask". When a site aggressively welcomes me with three or four banners masking the article I came to read, I click on the StopTheScript icon, allow it to block JavaScript on the website, and refresh the page. Two clicks, one keyboard shortcut. In most cases, these steps are easier and faster than the usual series of events.
You know, the one where you need to reload the page with ad blockers disabled, just so you can close the modal window that was blocking scrolling on the page, and then reload the page once again, this time with ad blockers enabled.

With JavaScript turned off, visiting most websites is a breeze: my computer feels like it uses an M4 chip and not an Intel Core i5, the page is clean, the article is there, it works. There are a few media sites that refuse to display anything with JS turned off, but I'd say that 95% of the time it's fine, and I can live my life without a proper ad blocker. *3 For websites where ads are tolerable, I don't bother blocking JavaScript; I let it pass.

In my mind, this is how my first interaction with a website would go if it were a department store:

[opens page at URL]
Website: "Hi dear visitor, I see you're looking at this product, but may I interest you in a free newsletter? Or would you like to share your Google account with us so next time you come back we'll know? Also, could you sign this agreement real quick? Oh, and by the way, have you seen that we have a special offer currently? Would you like a cookie?"
Me: "Hello, yes, oh wow, hum… wait a second…"
[blocks JavaScript]
Me: "Sorry, I don't speak your language and don't understand anything you say."
[Salesperson goes away instantly]
Me: "Ah, this is nice and quiet."

Maybe I'm wrong, but to me, this is a more "polite" default behaviour than using an ad blocker from the get-go, which, in this analogy, would be something like this:

[opens page at URL]
Ad blocker: "Alright, well done team, great job. We arrested all the salespeople, handcuffed them, and brought them all down to the basement. All clear. The boss can come in."
Me: "Ah, this is nice and quiet."

If you have a better analogy, I'm all ears: I really struggled with this one.

I'm not sure how long this JS blocking setup will last this time. I'm not sure if it feels that much better to block JS permanently on some websites rather than blocking ads. All I know is that most websites are much quicker to load without JavaScript, and much easier for my machine to handle, and just for those reasons, StopTheScript may be the best content blocker for Safari. I guess it's not surprising that all the cool new web browsers include a JavaScript toggle natively.

*1 Why are they called content blockers and not ad blockers? Pretty sure it's some sort of diplomatic lingo used to avoid hurting the feelings of ad companies. I don't like the word content, but calling ads and trackers content is just weird.

*2 I know I could use an ad blocker and disable it on some websites, or only activate it on the most annoying sites, but ad blockers tend to disappear into the background, don't they?

*3 I mention media sites because obviously ecommerce sites, video sites, and interactive sites require JavaScript. Interestingly, Mastodon doesn't need it to display posts, whereas Bluesky does.

Rob Zolkos 1 week ago

A Month Exploring Fizzy

In their book Getting Real, 37signals talk about Open Doors — the idea that you should give customers access to their data through RSS feeds and APIs. Let them get their information when they want it, how they want it. Open up and good things happen. Fizzy takes that seriously. When 37signals released Fizzy with its full git history available, they didn't just open-source the code — they shipped a complete API and webhook system too. The doors were wide open baby!

So I dove in — reading the source, building tools, and sharing what I found. Every time curiosity kicked in, there was a direct path from "I wonder if…" to something I could actually try and execute. This post is a catch-all for my very bubbly month of December.

Fizzy Webhooks: What You Need to Know — I set up a local webhook receiver to capture and document every event type Fizzy sends. The post covers the payload structures, signature verification, and ideas for what you could build on top of the webhook system.

The Making of Fizzy, Told by Git — I prompted Claude Code to analyze the entire git history and write a documentary about the development.

Vanilla CSS is all you need — Diving into the no-build CSS architecture across Campfire, Writebook, and Fizzy.

Fizzy Design Evolution: A Flipbook from Git — I went through each day of commits, got the application to a bootable state, seeded the database, and took a screenshot. Then I stitched those screenshots into a flipbook video with a soundtrack made from Fizzy's own audio files.

Fizzy's Pull Requests: Who Built What and How — An analysis of who owned which domains in the Fizzy codebase. The post maps contributors to their expertise areas and curates learning paths through the PRs for topics like Turbo/Hotwire, caching, AI integration, multi-tenancy, and webhooks.

The open API invited experimentation. I spotted gaps that would make integration easier for other developers, so I filled them:

fizzy-api-client — Ruby client for the Fizzy API.

fizzy-client-python — Python client for the Fizzy API.

fizzy-cli — Command-line interface for the Fizzy API, built first in Ruby and then migrated to Go for portability.

fizzy-skill — An AI agent skill for interacting with Fizzy.

n8n-nodes-fizzy — An n8n community node that brings Fizzy into your automation workflows. Create cards, manage assignments, and react to real-time events through webhook triggers.

Migration tools — I built these to make it easier to try Fizzy without starting from scratch:

linear2fizzy — Migrate Linear issues

jira2fizzy — Migrate JIRA issues

asana2fizzy — Migrate Asana tasks

gh2fizzy — Migrate GitHub Issues

prd2fizzy — Convert PRDs to Fizzy cards

Migrating your existing issues and boards gives you an immediate sense of how it could work for you, without having to manually create test cards. You can see your real data running in Fizzy from day one, which I think makes it easier to evaluate and decide if it's useful for you.

I also contributed a few small fixes back to the main repository:

#2114 — Remove unused install.svg and its CSS class

#2111 — Remove unpaired view-transition-name

#2095 — Fix typo: minues → minutes

#2094 — Fix duplicate word: use use → use

#2093 — Add QrCodesController test

#2088 — Fix view-transition-name typo in public card show

Fizzy is released under the O'Saasy License, which is similar in spirit to MIT but includes a restriction on offering the software as a competing hosted or SaaS product. You can modify and self-host it, but you can't repackage it and sell it as your own hosted service. I built O'Saasy Directory to make it easy to find applications released under this license. Beyond Fizzy, the directory includes other submitted projects where the source is available to read and modify. If you have built something under the O'Saasy License, visit the submission page to add yours.
Having built the Fizzy CLI and fizzy-api-client Rubygem, I saw some fun opportunities to build little lab experiments showing what Fizzy could be integrated with: both to power up functionality that isn't there yet, and to create boards in some interesting ways (e.g. a Movie Quiz). I got the idea for this on a flight to Australia with no internet. Just a pad of paper and a pen. I should probably do that more often, as a bunch of ideas for all sorts of products came out.

CarbonationLabs is not a product per se. It's an open source Rails application designed to be run locally, where you can interact with the hosted or self-hosted versions of Fizzy. If anything, I hope it inspires the creation of little problem-solving workflows for Fizzy that wouldn't be built into the main product (the problem is too niche). The API and webhook system is really flexible, and most of your bespoke problems could be solved with some creative thinking.

"Introducing Carbonation Labs - fun ways to add experiments to and extend Fizzy (repo link and demo videos below) 🧵"

I built carbonation.dev to bring together all the tools, libraries, and integrations that I and others in the community have created for Fizzy. It's a directory covering API clients (Ruby, Python, JavaScript), CLI tools with packages for macOS, Arch Linux, Debian, Fedora, and Windows, integrations for Claude Code and other AI agents, n8n, Raycast, Telegram, and MCP servers, plus migration tools for GitHub, Linear, Asana, and Jira. If you've built something for Fizzy, I'd love to feature it. You can submit a pull request to add your tool to the directory.

Building the Fizzy CLI pushed me into some new territory. I created an AUR package for Arch Linux users, set up a Homebrew tap for macOS, published my first Python package to PyPI, and made an n8n plugin — all firsts for me. While I already knew Go, rewriting the CLI in it was a fun exercise, and building TUIs for the setup and skill commands introduced me to terminal UI libraries I hadn't used before. Gosh it was fun!

If you want to get better at Rails, Fizzy is a great place to study real-world code. And in my view, if you want to work at 37signals as a Rails programmer, digging into Fizzy — along with Campfire and Writebook — is a solid way to learn how they approach Rails architecture and design decisions. Submitting PRs is also a good way to contribute back while learning — just be respectful of the contribution policy. The review discussions give you a window into how to reason about problems, spot opportunities, and make trade-offs.

This month pushed parts of my creative thinking that weren't gone, but definitely weren't being stressed. Like any muscle, use it or lose it. The direction of what to explore came from my own curiosity and a habit of poking around under the hood, and AI helped me move a lot faster once I knew where I wanted to go. Most of this information already exists somewhere — Google, Stack Overflow, documentation — but having AI right there alongside me as a partner was thrilling.

All of this was made possible because a team left the doors open. No one asked me to step inside; I decided to invest the time and do the work to see what I could build, learn and share. I do this at work too—when I can—looking for opportunities I can shape, experiment with, and get genuinely excited about. Most importantly I had fun and I hope you enjoyed following along.

マリウス 2 weeks ago

Updates 2025/Q4

This post includes personal updates and some open source project updates. As the year comes to a close, I’d like to begin this update by sharing a famous (and sadly now gone ) tweet . My goal is not only to remind those who have seen it before, but also to introduce it to those who haven’t, along with the thoughts it inevitably sparks. It’s a way to preserve this rare gem of social media for posterity. Below is the original post, with added speaker information for easier reading. Warning: This text is a bit long. If you’d rather skip ahead to the next part of the update, click/tap here . Someday aliens are going to land their saucers in a field somewhere in New Jersey and everything is going to go just fine right up until we try to explain our calendar to them. Humans: “yeah we divide our year into a number of sub units called ‘months’ made up a number of days, and they’re not all the same length” Aliens: “I guess that’s unavoidable, if your rotations-count per orbit is a prime number” Humans: “yeah, our’s isn’t prime” Aliens: “but surely you have most of these ‘months’ the same length and just make the last one shorter or longer?” Humans: “No… They’re different lengths following no logical pattern” Aliens: “what” Humans: “and we further subdivide the months into ‘weeks’, which is 7 days.” Aliens: “ahh, so each month is an integer multiple of weeks?” Humans: “that would make sense, but no. Only one is, sometimes” Aliens: “SOMETIMES?!” Humans: “yeah our orbit around the sun isn’t an integer number of days, so we have to change the number of days to in a year from time to time” Aliens: “oh yes, a similar thing happens on Epsilon Indi 7, where they have to add an extra day every 39 years to keep holidays on track” Humans: “yeah that’s how ours work! Although the ratio doesn’t work out cleanly, so we just do every 4 years, except every 100 years, except except every 400 years” Aliens: “oh, you number your years? What’s the epoch?” Humans: “uh, it’s supposed to be the birth of a religious leader, but they got the math wrong so it’s off by 4 years, if he existed at all.” Aliens: “if? You based your calendar off the birth date of someone you’re not sure exists?” Humans: “yeah. He’s written about in a famous book but historical records are spotty.” Aliens: “interesting. I didn’t realize your planet was one of the ones with a single universal religion, that usually only happens in partial or complete hive minds.” Humans: “uhh, we’re not.” Aliens: “You’re not?!” Humans: “yeah we have multiple religions.” Aliens: “oh but they all have a common ancestor, which agrees on the existence of that leader, right?” Humans: “uh, no. Two of the big ones do, but most of the others don’t believe in him” Aliens: “YOUR CALENDAR IS BASED ON A RELIGIOUS LEADER THAT NOT EVERYONE BELIEVES IN?” Humans: “well, on his birth. And yeah, we got it wrong by a couple years.” Aliens: “OK, fine. So, you have somewhat complicated rules about when you change the length of your years, and I’m scared to ask this, but… You definitely just add or subtract that extra day at the end, right?” Humans: “…. Nope.” Aliens: “At the start of the year? " Humans: “nah. The end of the second month” Aliens: “WHY WOULD IT BE THE SECOND MONTH?” Humans: “I’m not sure, really.” Aliens: “huh. So at this point I’m dreading asking this, but how do you measure time within each day?” Humans: “oh that’s much simpler. Each day is divided into hours, each hour has minutes, and each minute has seconds.” Aliens: “ok. And 10 of each?” Humans: “10 hours? No. 
There’s 24 hours, 60 minutes, 60 seconds” Aliens: “…. I thought you said you used a base-10 counting system” Humans: “we do! Mostly. But our time system came from some long gone civilization that liked base-60 like 5000 years ago” Aliens: “and you haven’t changed it since?” Humans: “No.” Aliens: “huh. Okay, so why 24? That’s not a divisor of 60” Humans: “oh because it’s actually 12!” Aliens: “what” Humans: “yeah each day is 24 hours but they are divided into two sets of 12.” Aliens: “and that’s 5 12s, right, I see the logic here, almost. So like, after hour 12, it becomes the second half, which is 1?” Humans: “No, after 11.” Aliens: “oh, you zero-index them! So it’s hours 0-11 in the first half, then 12-23 in the second half?” Humans: “No. 12 to 11 in the first half, and again in the second half” Aliens: “please explain that before my brain melts out my mouth” Humans: “the first hour is 12. Then the next one is 1, then it goes back up to 11, then 12 again” Aliens: “that is not how numbers work. And how do you tell first 12 apart from second 12?” Humans: “oh we don’t use numbers for that!” Aliens: “you don’t number the two halves of your day?” Humans: “nah, we call them AM and PM” Aliens: “WHAT DOES THAT MEAN” Humans: “I think it’s ante-meridian and post-meridian? But I’m not sure, I dont know much Latin” Aliens: “Latin?” Humans: “yeah it’s an ancient language from an old empire which controlled a lot of the world and we still use some of their terms” Aliens: “oh, and that was the civilization that liked base-60 and set up your time system?” Humans: “that would make sense, but… No, completely different one.” Aliens: “okay, and what do you do to if you want to measure very short times, shorter than a second?” Humans: “oh we use milliseconds and microseconds” Aliens: “ahh, those are a 60th of a second and then 60th of the other?” Humans: “No. Thousandths.” Aliens: “so you switch to base-10 at last, but only for subdivisions of the second?” Humans: “yeah.” Aliens: “but at thousands, ie, ten tens tens” Humans: “yeah. Technically we have deciseconds and centiseconds, which are 1/10 of a second, and 1/100 of a second, but no one really uses them. We just use milli.” Aliens: “that seems more like a base-1000 system than a base-10 system.” Humans: “it kinda is? We do a similar thing with measures of volume and distance and mass.” Aliens: “but you still call it base-10?” Humans: “yeah” Aliens: “so let me see if I get this right: Your years are divided in 10 months, each of which is some variable number of days, the SECOND of which varies based on a complex formula… and each day is divided into two halves of 12 hours, of 60 minutes, 60 seconds, 1000 milliseconds?” Humans: “12 months, actually.” Aliens: “right, because of the ancient civilization that liked base-60, and 12 is a divisor of 60.” Humans: “No, actually, that came from the civilization that used latin. Previously there were 10.” Aliens: “what” Humans: “yeah the Latin guys added two months part of the way through their rule, adding two more months. That’s why some are named after the wrong numbers” Aliens: “you just said two things I am having trouble understanding. 1. Your months are named, not numbered? 2. THE NAMES ARE WRONG?” Humans: “yep! Our 9th month is named after the number 7, and so on for 10, 11, and 12.” Aliens: “your 12th month is named… 10?” Humans: “yeah.” Aliens: “what are the other ones named after?!” Humans: “various things. 
Mainly Gods or rulers” Aliens: “oh, from that same religion that your epoch is from?” Humans: “uh… No. Different one.” Aliens: “so you have an epoch based on one religion, but name your months based on a different one?” Humans: “yeah! Just wait until you hear about days of the week.” Aliens: “WHAT” Humans: “so yeah we group days into 7-day periods-” Aliens: “which aren’t an even divisor of your months lengths or year lengths?” Humans: “right. Don’t interrupt” Aliens: “sorry” Humans: “but we name the days of the week, rather than numbering them. Funny story with that, actually: there’s disagreement about which day starts the week.” Aliens: “you have a period that repeats every 7 days and you don’t agree when it starts?” Humans: “yeah, it’s Monday or Sunday.” Aliens: “and those names come from…” Humans: “celestial bodies and gods! The sun and moon are Sunday and Monday, for example” Aliens: “but… I looked at your planet’s orbit parameters. Doesn’t the sun come up every day?” Humans: “yeah.” Aliens: “oh, do you have one of those odd orbits where your natural satellite is closer or eclipsed every 7 days, like Quagnar 4?” Humans: “no, the sun and moon are the same then as every other day, we just had to name them something.” Aliens: “and the other days, those are named after gods?” Humans: “yep!” Aliens: “from your largest religion, I imagine?” Humans: “nah. That one (and the second largest, actually) only has one god, and he doesn’t really have a name.” Aliens: “huh. So what religion are they from? The Latin one again?” Humans: “nah, they only named one of the God-days” Aliens: “only on… SO THE OTHER DAYS ARE FROM A DIFFERENT RELIGON ENTIRELY?” Humans: “Yep!” Aliens: “the third or forth biggest, I assume?” Humans: “nah, it’s one that… Kinda doesn’t exist anymore? It mostly died out like 800 years ago, though there are some modern small revivals, of course” Aliens: “so, let me get confirm I am understanding this correctly. Your days and hours and seconds and smaller are numbered, in a repeating pattern. But your years are numbered based on a religious epoch, despite it being only one religion amongst several.” Humans: “correct so far” Aliens: “and your months and days of the week are instead named, although some are named after numbers, and it’s the wrong numbers” Humans: “exactly” Aliens: “and the ones that aren’t numbers or rulers or celestial objects are named after gods, right?” Humans: “yup!” Aliens: “but the months and the days of the week are named after gods from different religons from the epoch religion, and indeed, each other?” Humans: “yeah! Except Saturday. That’s the same religion as the month religion” Aliens: “and the month/Saturday religion is also from the same culture who gave you the 12 months system, and the names for the two halves of the day, which are also named?” Humans: “right! Well, kinda.” Aliens: “please explain, slowly and carefully” Humans: “yeah so cultures before then had a 12 month system, because of the moon. But they had been using a 10 month system, before switching to 12 and giving them the modern names” Aliens: “the… Moon? Your celestial body?” Humans: “yeah, it completes an orbit about every 27 days, so which is about 12 times a year, so it is only natural to divide the year into 12 periods, which eventually got called months” Aliens: “ok, that makes sense. Wait, no. Your orbital period is approximately 365.25 days, right?” Humans: “yeah. That’s why we do 365 or 366 based on the formula” Aliens: “but that doesn’t work. 
365 divided by 27 is ~13.5, not 12” Humans: “yeah I’m not sure why 12 was so common then. Maybe it goes back to the base 60 people?” Aliens: “okay so one final check before I file this report: Years are numbered based on a religious leader. Years always have 12 months, but the lengths of those months is not consistent between each other or between years.” Humans: “don’t forget the epoch we number our years from is wrong!” Aliens: “right, yes. And your months are named, some after a different religion, and some after numbers, but not the number the month is in the year.” Humans: “right. And when we change the month lengths, it’s the second one we change” Aliens: “how could I forget? After months you have a repeating ‘week’ of 7 days, which is named after gods from two religons, one of which is the month-naming one, and a nearly extinct one. And you don’t agree when the week starts.” Humans: “nope! My money is on Monday.” Aliens: “that’s the Monday that’s named after your moon, which supposedly influenced the commonality of the 12 months in a year cycle, despite it orbiting 13 times in a year?” Humans: “correct!” Aliens: “and as for your days, they split into two halves, named after a phrase you don’t really understand in the long dead language of the same culture that named the months and Saturday.” Humans: “Yep. I took some in college but all I remember is like, ‘boy’, ‘girl’, ‘stinky’, ‘cocksucker’” Aliens: “charming. And then each half is divided into 12 hours, but you start at 12, then go to 1, and up to 11” Humans: “all I can say is that it makes more sense on analog clocks.” Aliens: “i don’t know what that is and at this point I would prefer you not elaborate. So each of those hours is divided into 60 minutes and then 60 seconds, and this comes from an ancient civilization, but not the one that gave you the month names” Humans: “yep. Different guys. Different part of the world.” Aliens: “ok. And then after seconds, you switch to a ‘base-10’ system, but you only really use multiples of a thousand? Milliseconds and microseconds?” Humans: “right. And there’s smaller ones beyond that, but they all use thousands” Aliens: “right. Got it. All written down here. Now if you’ll excuse me, I just gotta go make sure I didn’t leave my interociter on, I’ll be right back.” The tall alien walks back into their saucer without a wave. The landing ramp closes. The ship gently lifts off as gangly landing legs retract. There’s a beat, then a sudden whooshing sound as air rushes back into the space that previously held the craft, now suddenly vacuum. NORAD alarms go off briefly as an object is detected leaving the earth’s atmosphere at a significant fraction of the speed of light. In the years to come, many technological advances are made from what was left behind, a small tablet shaped object made of some kind of artifical stone/neutrino composite material. The alien message left on screen is eventually translated to read “Untitled Document 1 has not been saved, are you sure you wish to quit? (yes) (no) (cancel)” Many years have passed, and we await the day the aliens return. They have not. As I mentioned in the previous update ( here ), my beloved 9barista coffee brewer started malfunctioning at the end of Q3, likely due to the age of the O-ring sealing the water chamber and the descaling process I performed. However, I was able to fix the machine using the official 9barista repair kit and have been using it daily ever since. 
In recent months, though, I’ve almost entirely switched to decaf coffee in an effort to reduce some recurring headaches I’ve been dealing with for a while. It doesn’t seem to be the constant consumption of caffeine causing the issue; rather, the headaches mostly appeared whenever I skipped a cup, making it seem more like a caffeine withdrawal effect. Although I continued to experience headaches in Q4, those were likely linked to being sick rather than coffee, see below . That said, both the frequency and intensity of the headaches have noticeably decreased. Toward the end of Q4, I also began experimenting with additions to my coffee, specifically Lion’s Mane , a well-known component of traditional Chinese medicine that’s often advertised as an alternative to caffeine. It’s believed to enhance focus without the jitters or cold sweats that usually come with high caffeine consumption. In mid-October, I unfortunately got hit with a heavy dose of COVID-19 , which knocked me out for three weeks and has had (once again) a lasting impact on my overall health. Since I was mostly bedbound during that time, I spent some of it exchanging COVID anecdotes with the friendly folks in the community channel . I was surprised to find that many people there had similar negative experiences, particularly in relation to post-vaccine infections. My first encounter with COVID was back in 2020, and for me, it turned out to be little more than a bad flu, with two days of fever and some headaches. I didn’t lose my sense of smell or taste, nor did I experience any long-term effects. In fact, the most troubling part of the whole COVID experience for me back then wasn’t the sickness itself, but the fear of being picked up by local authorities for having an elevated body temperature. This was especially concerning because I was still traveling the world at the time, enjoying the eerie quiet of empty airports and cities. Due to increasing social pressure, especially from governments imposing heavy travel restrictions, I was eventually pushed into getting vaccinated shortly after that. Unfortunately, my body didn’t handle the two doses very well. I experienced extreme muscle pain and a general sense of being under the weather . While those side effects faded after a few days, in the months that followed, I felt more tired and inflamed than usual, with recurring flu-like symptoms and headaches. At some point, COVID hit me again, but this time it was really bad. I ended up battling a fever around 40°C/104°F for over a week, and I was completely knocked out for almost two months. On top of that, I began experiencing cardiovascular symptoms, which persisted for months and even years afterward. The adverse effects I’d never experienced before didn’t just show up with subsequent COVID infections, but also with regular flu. There was one point when a strain of Influenza B hit me so hard that I had to visit the emergency room, which is something I’d never done before, even though I’d never received the annual flu vaccine. To this day, it feels like ever since I got the Pfizer shots (for which I had to sign a liability waiver), my health has been in a constant decline, especially whenever influenza or COVID strikes. No matter how healthy my diet or activity level, it doesn’t seem to make much of a difference. In fact, the ongoing inflammation and regular flu-like symptoms have made it especially hard to push myself during a workout or a run. 
At some point, I started digging deeper into the issue, with regular bloodwork and visits to specialists, particularly cardiologists. Unfortunately, as is often the case, no medical expert has been able to diagnose the underlying issue(s) or propose meaningful solutions. Society seems quick to ridicule those who seek to improve their health through unconventional methods, yet most people fail to recognize the globally poor state of healthcare, which leaves people stranded, regardless of how much private money they’re willing to spend to solve their problems. Long story short, will I continue to get the battletested shots for Hepatitis , Tetanus , and other dangers humanity faces? Definitely. But will I be significantly more skeptical of vaccines that didn’t undergo year-long trials and were fast-tracked by every government on Earth to curb an allegedly man-made virus that escaped a biological research facility, all while creating shareholder value ? You bet! Note: This is a complex topic, and everyone has their own personal experience. For many, the COVID shots seem to have had no negative side effects. For some, however, they did. This doesn’t mean that COVID doesn’t exist, nor that lizard overlords used it as an excuse to inject us with nanobots . Medicine certainly has its flaws, and financial interests were prioritized over absolute safety, something that’s happened in other areas as well over the past few years (e.g., Boeing ). If, however, you think there’s a pLaNdEmIc or some intentional, eViL gEnEtIc ExPeRiMeNt at play, there’s no need at all to launch your XLibre Xserver to reach out to me with fUrThEr iNfO oN tHiS tOpIc . Thank you. You might have noticed that the main menu at the top of this website has grown, now including a now page , as well as a link to Codeberg, but more on that in a second . The now page is exactly what the name suggests: a now page . Given the failure of social media, I’ve pretty much given up on maintaining a public profile for posting status updates. Up until the end of 2021, I was still actively maintaining a Mastodon account alongside a TUI client , but that eventually fell apart for multiple reasons. After that, I used Nostr for a while, but eventually gave it up too. These days, I’m somewhat active on Bluesky , though my account isn’t publicly available. I don’t have high hopes for Bluesky either, and I’ll probably delete my account there one day, at the latest when Bluesky inevitably becomes enshittified . The now page , however, is here to stay. It will continue to feature short, tweet -like updates about all sorts of things. If you’re interested, feel free to check it every once in a while. I might even activate a dedicated RSS feed for it at some point. For the past few months I’ve been silently moving most private project repositories away from GitHub towards privately hosted instances of Forgejo – a terrible name, btw – as well as many of my public GitHub projects to Codeberg . One reason to do so is… well, let me just quote Andrew Kelley here, who probably put it best: […] the engineering excellence that created GitHub’s success is no longer driving it. Priorities and the engineering culture have rotted, leaving users inflicted with some kind of bloated, buggy JavaScript framework in the name of progress. Stuff that used to be snappy is now sluggish and often entirely broken. Most importantly, Actions has inexcusable bugs while being completely neglected . 
After the CEO of GitHub said to “embrace AI or get out”, it seems the lackeys at Microsoft took the hint, because GitHub Actions started “vibe-scheduling”: choosing jobs to run seemingly at random. Combined with other bugs and the inability to manually intervene, this causes our CI system to get so backed up that not even master branch commits get checked. However, unlike most people who decided to migrate from GitHub to Codeberg, I won’t be deleting my repositories on GitHub just yet. Instead, I’ve updated all my local clones to point toward Codeberg, and I’ve enabled synchronized pushes from Codeberg to GitHub, as I plan to continue using GitHub’s workflows. “But why?!” you might ask. The reason is simple: Because I’m happy to waste Microsoft’s resources on automated tests and build actions. While I could use Codeberg’s Woodpecker CI or even set up my own, I’m more than content to keep using GitHub’s CPU cycles for free to build my silly little projects, while hosting the primary source code repositories on Codeberg. Since there doesn’t seem to be a way to disable Pull Requests on GitHub for my respective projects, I’ve added pull request templates that warn against opening PRs there. I’ve also disabled the Issues tab and updated the short descriptions to link to Codeberg. Additionally, my overview page on GitHub now links to Codeberg, with the GitHub repositories listed explicitly as GitHub mirrors. At the end of October I encountered an issue with ungoogled-chromium on my Gentoo laptop that prevented it from compiling successfully. Upon further investigation I learned that, quote: Using the system libc++ is no longer supported. This change was driven by the Chromium project and affected my Gentoo installation, along with many others’, due to the use of system libraries instead of the in-tree ones provided by Chromium. As mentioned here, this is a security concern, as users will need to trust the Chromium-provided libraries over those from their distribution. In case you’ve ever wondered why anyone in 2025 would still compile from source when tHe PeRfOrMaNcE bEnEfItS aRe NeGlIgIbLe, this is one of the key reasons why compiling from source still makes sense and, in fact, is more important than ever. The same projects that have historically taken a controversial stance on sensible default settings are now the ones seemingly rejecting security-critical system components in favor of their own. Tl;dr: If you’re using Chromium or a Chromium-based browser (other than ungoogled-chromium on Gentoo through PF4Public’s repository), it’s highly likely that your browser is not using your system maintainer’s libraries, but rather Chromium’s in-tree ones with whatever versions and features the Chromium developers deem necessary and sensible. In what to this day remains a mystery, the keyboard switch of my key has decided that it rejects its existence and seemingly removed one of its legs, presumably in an effort to escape and start a new life. I documented the whole incident on Keebtalk for anyone who’s as puzzled by this as I am. I invested quite some time in pursuing my open source projects in the past quarter, hence there are a few updates to share. At the beginning of November I released Zeit v1.0.0, a full rewrite of my command line time tracking tool. In case you missed it, I summed up everything in a dedicated post and have also published a dedicated project website that will soon act as more than just a landing page.
With 📨🚕 (MSG.TAXI) continuing to grow and evolve, Overpush has received a few important updates improving its stability with long-running XMPP connections. One thing that made me very happy throughout the debugging phase was that, despite Overpush’s stability not being perfect, no messages were ever lost; they were always delivered the moment the service was able to reach the target platform again (specifically XMPP in this case). :-) If you haven’t yet tried Overpush yourself, I encourage you to sign up on 📨🚕 and give it a go. If you find the service useful, you’ll be able to easily spin up your own Overpush instance further down the line and won’t have to depend on any closed-source proprietary platform. As those of you idling in the community channel might know, I’ve been actively working on internet forum software for some time now. What kick-started my efforts was the desire to set up a support and discussion forum for 📨🚕, among other things, but I was dissatisfied with the existing options. I was looking for an internet forum that…
- Can use an existing database to authenticate users and/or…
- Supports simple email/username signups.
- Ideally supports notifications and replies via email.
- Is lightweight and doesn’t require a ton of runtime dependencies.
- Does not require users to have JavaScript enabled.
- Does not overwhelm me with administrative features.
- Is somewhat easily themeable.
The first thing that came to mind was phpBB, which has been around for decades and appears to be one of the few options that (unlike Discourse and Lemmy) doesn’t require users to have JavaScript enabled. Sadly, phpBB is a monster. It has too many features, takes a lot of time to properly install and configure, and, more importantly, when looking at its runtime dependencies and extensions, it requires some recurring effort to keep it safe and sound. Don’t get me wrong: unlike Discourse, which is frankly terrible, phpBB is a solid piece of software. However, for my use cases, I wanted something more lightweight that is easy to set up and run. None of the existing solutions, with maybe one or two exceptions like DFeed, came close to what I was looking for. And those that seemed like a good fit sadly lacked some functionality, which would have required me to extend them in ways that would significantly alter core functionality. These changes would likely not have been merged upstream, meaning I’d probably end up maintaining my own fork anyway. The bulletin board I’m working on is built in Go, as a single executable binary (without CGO) for all major platforms (Linux, *BSD, (maybe) Plan 9, macOS, and (maybe) Windows) that doesn’t require a runtime (like Erlang/Elixir, PHP, Ruby, Python, or worse, Node.js) or even assets (e.g., HTML/CSS files). It renders modern HTML on the server side and doesn’t require any user-side JavaScript to be enabled. The forum will support only PostgreSQL (single- and multi-node setups), require a Redis/Valkey instance or cluster, and use S3-compatible storage for user content (e.g., profile pictures, file uploads, etc.). The platform will allow sign-ups via email and XMPP addresses, supporting notifications and replies through both services. But don’t worry: OAuth authentication via popular providers will also be available. Additionally, the forum will feature a dedicated REST API that, unlike Lemmy’s or Discourse’s APIs, will be much easier to work with. One mid-term goal is to integrate this API into Neon Modem Overdrive, which will become its official TUI client. Short story long: I’ve been working on this project for a little while now and expect to release a first live demo around February ‘26. While many basic features are already implemented, there are still details I’d like to perfect before publishing the first version. I’ll set up a live online demo for people to try out first, and only after fine-tuning the code based on feedback will I wrap up the actual source release. The forum will be open-source and available under the SEGV license. If this sounds interesting to you and you’d like to participate in development or testing, reach out to me! With that said, I sincerely hope you’re enjoying a wonderful holiday season and gearing up for a great new year! As we wrap up 2025, I’ll be taking a well-deserved break from posting here on the site. The start of 2026 is shaping up to be quite hectic, and I’m looking forward to diving into some exciting projects, especially focusing on the ▓▓▓▓▓▓▓▓▓▓▓ bulletin board system I’m building. I hope this season brings you moments of joy, relaxation, and time well spent with those who matter most. May the new year be filled with new opportunities, exciting adventures, and personal growth. I look forward to reconnecting with all of you next year! Stay safe, take care of yourselves, and I’ll see you in 2026!

The Jolly Teapot 2 weeks ago

December 2025 blend of links

I almost missed the deadline with this one, didn’t I? At least it gives me a chance to wish every one of you a happy New Year’s Eve, and a happy new year. In 2026, I’ll write less about CSS, fonts, HTML, and text editors, and more about… well, at least I’ll try. Thank you for reading. The Future of Veritasium ▪︎ A precious testimonial on what it really means to depend on the algorithm for revenue, and on how many people actually work in the background of a successful and quality YouTube channel like Veritasium. Mouseless ▪︎ While this app is definitely not for me — I tried — it may appeal to some of you; I found the concept very intriguing, and I can see how effective it could be in apps that require a lot of hovering and clicking. (via Pierre Carrier) Everpen ▪︎ I’ve been intrigued by this for a while now, and 2026 may be the year I try it. I currently love using my fountain pen at my desk, but I prefer to travel with a pencil in my bag, and this may be the perfect companion for me. Predictions for Journalism 2026, Nieman Journalism Lab ▪︎ Every year, I look forward to reading these predictions; I just wish scrolling the page didn’t make my laptop activate its “vacuum cleaner noise” mode (I had to browse the “cards” via my RSS reader: I know, it’s time for me to upgrade). Nick Heer, People and Blogs ▪︎ “there is no better spellchecker than the ‘publish’ button.” If you don’t follow the People and Blogs interview series, you are missing out. Grid Paper ▪︎ An excellent bookmark to add to your collection of utilities, especially interesting if, like me, you waste many high-quality notebook pages trying to do isometric drawings, and failing miserably. The Land of Giants Transmission Towers ▪︎ I love this and have kept thinking about it since I learned about it: Why isn’t it already a thing? Truly mesmerising, and I found that the illustrations used on their website are very tasteful too. (via Kottke) Norm Architects ▪︎ As a fanboy of Norm Architects, I don’t know whether I like their work or the photographs of their work more. For years now, I’ve had one of an older batch of press pictures as a desktop wallpaper (you’ll know it when you see it) and another as my phone wallpaper. The colours, the lights, the shades, the textures: superb. How To Spot Arial ▪︎ Sorry, I’m writing about typefaces once again, but I think this is an important skill to have. (via Gruber) Rubio Orders State Department Braille Signage Switch To ‘Times New Roman’ ▪︎ I promise, this is the last time I’ll be sharing something about typography and fonts until the end of the year. More “Blend of links” posts

Ahead of AI 2 weeks ago

The State Of LLMs 2025: Progress, Progress, and Predictions

As 2025 comes to a close, I want to look back at some of the year’s most important developments in large language models, reflect on the limitations and open problems that remain, and share a few thoughts on what might come next. As I tend to say every year, 2025 was a very eventful year for LLMs and AI, and this year, there was no sign of progress saturating or slowing down. There are many interesting topics I want to cover, but let’s start chronologically in January 2025. ​Scaling still worked, but it didn’t really change how LLMs behaved or felt in practice (the only exception to that was OpenAI’s freshly released o1, which added reasoning traces). So, when DeepSeek released their R1 paper in January 2025, which showed that reasoning-like behavior can be developed with reinforcement learning, it was a really big deal. (Reasoning, in the context of LLMs, means that the model explains its answer, and this explanation itself often leads to improved answer accuracy.) Figure 1: A short response and a longer response including intermediate steps that is typically generated by reasoning models. DeepSeek R1 got a lot of attention for various reasons: First, DeepSeek R1 was released as an open-weight model that performed really well and was comparable to the best proprietary models (ChatGPT, Gemini, etc.) at the time. Second, the DeepSeek R1 paper prompted many people, especially investors and journalists, to revisit the earlier DeepSeek V3 paper from December 2024. This then led to a revised conclusion that while training state-of-the-art models is still expensive, it may be an order of magnitude cheaper than previously assumed, with estimates closer to 5 million dollars rather than 50 or 500 million. Figure 2: Table from the DeepSeek V3 paper estimating the cost of training the 671B parameter DeepSeek V3 model. ​The DeepSeek R1 supplementary materials estimate that training the DeepSeek R1 model on top of DeepSeek V3 costs another $294,000, which is again much lower than everyone believed. Figure 3: Table from the DeepSeek R1 paper’s supplementary materials estimating the cost of training the R1 model on top of DeepSeek V3. Of course, there are many caveats to the 5-million-dollar estimate. For instance, it captures only the compute credit cost for the final model run, but it doesn’t factor in the researchers’ salaries and other development costs associated with hyperparameter tuning and experimentation. Third, and most interestingly, the paper presented Reinforcement Learning with Verifiable Rewards (RLVR) with the GRPO algorithm as a new (or at least modified) algorithmic approach for developing so-called reasoning models and improving LLMs during post-training. Figure 4: Broad overview of how / when reinforcement learning is applied. There are many details that I am skipping in this overview, but interested readers can read more in my The State of Reinforcement Learning for LLM Reasoning article. Up to this point, post-training methods like supervised instruction fine-tuning (SFT) and reinforcement learning with human feedback (RLHF), which still remain an important part of the training pipeline, are bottlenecked by requiring expensive written responses or preference labels. (Sure, one can also generate them synthetically with other LLMs, but that’s a bit of a chicken-egg problem.) 
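To make the “verifiable” part of RLVR concrete before going further, here is a minimal sketch (my own toy illustration, not code from the R1 paper) of a reward function that checks a model’s final numeric answer against a reference answer; correctness is assigned deterministically, with no human preference labels or learned reward model involved:

```python
# Illustrative sketch of a verifiable reward for a math prompt (a toy example,
# not the DeepSeek R1 implementation): the reward is computed by a deterministic
# check against a reference answer instead of a learned reward model.
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in the output matches the reference, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer else 0.0

# The grader only looks at the final answer, not at the reasoning trace.
print(verifiable_reward("Step 1: 12*12 = 144. The answer is 144.", "144"))  # 1.0
print(verifiable_reward("I think the answer is 150.", "144"))               # 0.0
```

In practice the answer extraction is more robust (for example, parsing a boxed final answer, or running unit tests for code), but the principle is the same.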
What’s so important about DeepSeek R1 and RLVR is that they allow us to post-train LLMs on large amounts of data, which makes RLVR a great candidate for improving and unlocking capabilities through scaling compute during post-training (given an available compute budget). The V in RLVR stands for “verifiable,” which means we can use deterministic approaches to assign correctness labels, and these labels are sufficient for the LLM to learn complex problem-solving. (The typical categories are math and code, but it is also possible to expand this idea to other domains.) Figure 5: A simple example of a verifiable reward. I don’t want to get too lost in technical details here, as I want to cover other aspects in this yearly review article. And whole articles or books can be written about reasoning LLMs and RLVR. For instance, if you are interested in learning more, check out my previous articles on the topic. All that being said, the takeaway is that LLM development this year was essentially dominated by reasoning models using RLVR and GRPO. Essentially, every major open-weight or proprietary LLM developer has released a reasoning (often called “thinking”) variant of their model following DeepSeek R1. If I were to summarize the LLM development focus points succinctly for each year, beyond just scaling the architecture and pre-training compute, my list would look like this: 2022: RLHF + PPO; 2023: LoRA SFT; 2024: mid-training; 2025: RLVR + GRPO. Pre-training is still the required foundation for everything. Besides that, RLHF (via the PPO algorithm) was, of course, what brought us the original ChatGPT model in the first place back in 2022. In 2023, there was a lot of focus on LoRA and LoRA-like parameter-efficient fine-tuning techniques to train small custom LLMs. Figure 6: Some of the focus areas of proprietary and open-weight LLM development over the years. Note that this is cumulative, meaning that RLHF + PPO, for example, is still relevant and being used. However, it’s no longer the most hotly discussed topic. Then, in 2024, all major labs began making their (pre-)training pipelines more sophisticated by focusing on synthetic data, optimizing data mixes, using domain-specific data, and adding dedicated long-context training stages. I summarized these different approaches in my 2024 article (where I grouped the techniques under pre-training, because the term “mid-training” hadn’t been coined yet). Back then, I considered these pre-training techniques, since they use the same pre-training algorithm and objective. Today, these slightly more specialized pre-training stages, which follow the regular pre-training on general data, are often called “mid-training” (a bridge between regular pre-training and post-training, which includes SFT, RLHF, and now RLVR). So, you may wonder what’s next? I think we will see (even) more focus on RLVR next year. Right now, RLVR is primarily applied to math and code domains. The next logical step is to not only use the final answer’s correctness as a reward signal but also judge the LLM’s explanations during RLVR training. This has been tried before, for many years, under the research label “process reward models” (PRMs). However, it hasn’t been super successful yet. E.g., to quote from the DeepSeek R1 paper: 4.2. Unsuccessful Attempts [...]
In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments. However, looking at the recent DeepSeekMath-V2 paper, which came out last month and which I discussed in my previous article From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates, I think we will see more of “explanation-scoring” as a training signal in the future. The way the explanations are currently being scored involves a second LLM. This leads to the other direction I am seeing for RLVR: an extension into other domains beyond math and code. So, if you asked me today what I see on the horizon for 2026 and 2027, I’d say the following: 2026: RLVR extensions and more inference-time scaling; 2027: continual learning. Besides the aforementioned RLVR extensions, I think there will be more focus on inference-time scaling in 2026. Inference-time scaling means we spend more time and money after training, when we let the LLM generate the answer, but it can go a long way. Inference scaling is not a new paradigm, and LLM platforms already use certain techniques under the hood. It’s a trade-off between latency, cost, and response accuracy. However, in certain applications, where accuracy matters more than latency and cost, extreme inference-scaling can totally be worth it. For instance, the recent DeepSeekMath-V2 paper showed that it pushed the model to gold-level performance on a challenging math competition benchmark. Figure 7: Combination of two inference-time scaling methods: self-consistency and self-refinement. Additional self-refinement iterations improve accuracy. Annotated figure from the DeepSeekMath-V2 paper. Self-consistency and self-refinement are covered in chapters 4 and 5 of my Build A Reasoning Model (From Scratch) book. There’s also been a lot of talk among colleagues about continual learning this year. In short, continual learning is about training a model on new data or knowledge without retraining it from scratch. It’s not a new idea, and I wonder why it came up so much this year, since there hasn’t been any new or substantial breakthrough in continual learning at this point. The challenge with continual learning is catastrophic forgetting (as experiments with continued pre-training show, learning new knowledge means that the LLM forgets old knowledge to some extent). Still, since this seems like such a hot topic, I do expect more progress towards minimizing catastrophic forgetting, making continual learning an important research direction in the upcoming years. Academic research in the era of expensive LLMs has been a bit challenging. Of course, important discoveries that became mainstream and key pillars of LLM progress and breakthroughs can be made in academia despite (or because of) smaller budgets. In recent years, popular examples include LoRA (LoRA: Low-Rank Adaptation of Large Language Models, 2021) and related methods for parameter-efficient fine-tuning. Figure 8: A code-based introduction to LoRA tutorial. Another one is DPO (Direct Preference Optimization: Your Language Model is Secretly a Reward Model) and related methods for reward-model-free alignment as an alternative to reinforcement learning with human feedback. Figure 9: A code-based introduction to DPO tutorial.
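To make the DPO idea a bit more concrete, here is a minimal sketch (my own simplified illustration, not the reference implementation): the loss only needs the policy’s and a frozen reference model’s log-probabilities for a preferred (“chosen”) and a dispreferred (“rejected”) response, so no separate reward model is involved:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Simplified DPO loss; inputs are summed per-sequence log-probabilities."""
    # How much the policy prefers each response relative to the frozen reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the margin between chosen and rejected responses apart
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -8.0]),
                torch.tensor([-13.0, -9.0]), torch.tensor([-13.5, -8.5]))
print(loss)  # a scalar; backpropagating it updates only the policy model
```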
In my bubble, this year’s research highlight has been GRPO. Although it was introduced in the DeepSeek R1 paper rather than originating from academia, it has still made for an exciting year for researchers: both RLVR and GRPO are conceptually interesting and, depending on scale, not prohibitively expensive to experiment with. Consequently, I saw many mathematical improvements to GRPO in the LLM research literature this year (from both companies and academic researchers), which were later adopted in the training pipelines of state-of-the-art LLMs. For instance, some of the improvements include the following:
- Zero gradient signal filtering (DAPO by Yu et al., 2025)
- Active sampling (DAPO by Yu et al., 2025)
- Token-level loss (DAPO by Yu et al., 2025)
- No KL loss (DAPO by Yu et al., 2025 and Dr. GRPO by Liu et al., 2025)
- Clip higher (DAPO by Yu et al., 2025)
- Truncated importance sampling (Yao et al., 2025)
- No standard deviation normalization (Dr. GRPO by Liu et al., 2025)
- DeepSeek V3.2: KL tuning with domain-specific KL strengths (zero for math), reweighted KL, off-policy sequence masking, keeping the sampling mask for top-p / top-k, and keeping the original GRPO advantage normalization
I can confirm that these GRPO tricks or modifications have a huge impact in practice. For instance, with some of these modifications in place, bad updates no longer corrupt my training runs, and I no longer need to reload checkpoints periodically. And even for very short runs, I observed a big gain when adopting these tricks: Figure 10: Small excerpt of the results from my from-scratch GRPO training code, which is available on GitHub. Anyways, I have a vanilla GRPO script in my “Build A Reasoning Model (From Scratch)” repository if you want to toy around with it. (I will add more ablation studies with the respective modifications soon.)
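To give a flavor of what some of these tweaks touch, here is a minimal sketch (my own simplification, not DeepSeek’s implementation) of the group-relative advantage computation at the core of GRPO, with the standard-deviation normalization that Dr. GRPO drops made optional:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, normalize_std: bool = True) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    `rewards` holds the (verifiable) rewards of a group of responses sampled
    for the same prompt. Each response's advantage is its reward relative to
    the group mean, optionally divided by the group's standard deviation
    (the step that Dr. GRPO-style variants omit).
    """
    advantages = rewards - rewards.mean()
    if normalize_std:
        advantages = advantages / (rewards.std() + 1e-8)
    return advantages

# Example: 4 sampled responses to one prompt, two of which were judged correct
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))                       # standardized advantages
print(grpo_advantages(rewards, normalize_std=False))  # mean-centered only
```

These advantages then weight a clipped, PPO-style policy-gradient update; the modifications listed above mostly change how the advantages, the KL term, and the importance weights are handled.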
When it comes to LLM architectures, state-of-the-art models still use the good old decoder-style transformer. However, this year, open-weight LLMs more or less converged on using mixture-of-experts (MoE) layers, as well as at least one “efficiency-tweaked” attention mechanism: grouped-query attention, sliding-window attention, or multi-head latent attention. Beyond those fairly standard LLM architectures, we have also seen more drastic efficiency tweaks targeting the attention mechanism to scale linearly with sequence length. Examples of this include the Gated DeltaNets in Qwen3-Next and Kimi Linear, as well as the Mamba-2 layers in NVIDIA’s Nemotron 3. Anyways, I don’t want to go into too much detail here because I have a whole 13k-word and recently-updated article dedicated to these architectures here if you want to learn more: The Big LLM Architecture Comparison, https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison Figure 11: The Big LLM Architecture Comparison. My prediction is that we will keep building with the transformer architecture for at least a couple more years, at least when it comes to state-of-the-art modeling performance. At the same time, I do think that we will see more and more of these efficiency and engineering tweaks like Gated DeltaNet and Mamba layers, because at the scale at which LLMs are trained, deployed, and used, it just makes sense from a financial perspective for these companies, which are still burning a lot of money on serving LLMs. This doesn’t mean that there are no other alternatives out there. As I’ve written about in Beyond Standard LLMs, for instance, text diffusion models are an interesting approach. Right now, they fall into the category of experimental research models, but Google shared that they will release a Gemini Diffusion model. It won’t rival their state-of-the-art offerings in modeling quality, but it will be really fast and attractive for tasks with low-latency requirements (e.g., code completion). Also, two weeks ago, the open-weight LLaDA 2.0 models dropped. The largest one, at 100B parameters, is the largest text diffusion model to date and is on par with Qwen3 30B. (Yes, it doesn’t push the state-of-the-art overall, but it’s still a notable release in the diffusion model space.) Improving LLMs by scaling training data and architectures is an established formula that (still) keeps on giving. However, especially this year, it’s no longer the “only” sufficient recipe. We saw this with GPT 4.5 (Feb 2025), which was rumored to be much larger than GPT 4 (and the later-released GPT 5): pure scaling alone is no longer the most sensible way forward. The capabilities of GPT 4.5 may have been better than those of GPT 4, but the increased training budget was considered a “bad bang for the buck.” Instead, better training pipelines (with a greater focus on mid- and post-training) and inference scaling have driven much of the progress this year. For example, as discussed earlier in the context of DeepSeekMath-V2, which achieved gold-level math performance, inference scaling is one of the levers we can pull to get LLMs to solve extremely complex tasks on demand. (GPT Heavy Thinking or Pro are other examples; it doesn’t make sense to use these for everything due to the high latency and cost, but for certain problems, like challenging math or coding tasks, intense inference-scaling makes sense.) Another major improvement came from training LLMs with tool use in mind. As you may know, hallucinations are one of the biggest problems of LLMs. Arguably, hallucination rates keep improving, and I think this is largely due to said tool use. For instance, when asked who won the FIFA soccer World Cup in 1998, instead of answering purely from memorized training data, an LLM can use a traditional search engine via tool use and select and scrape this information from a credible website on this topic (for example, in this case, the official FIFA website itself). The same goes for math problems, using calculator APIs, and so forth. For instance, OpenAI’s gpt-oss models were among the earlier open-weight models released this year that were specifically developed with tool use in mind. Figure 12: Annotated table from the gpt-oss model card paper. Unfortunately, the open-source ecosystem hasn’t fully caught up with that yet, and many, if not most, tools still default to running these LLMs in non-tool-use mode. One reason is that this is a newer, evolving paradigm, for which the tooling needs to be adapted. The other reason is that this is a harder problem to solve due to security (giving an LLM unrestricted tool access could potentially be a security risk or wreak other kinds of havoc on your system. I think the sensible question to always ask is: would you trust a new intern to do this with this amount of access to your system?) I do think that, in the coming years, enabling and allowing tool use will become increasingly common when using LLMs locally.
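To illustrate what tool use means mechanically, here is a toy sketch with a made-up JSON tool-call format (my own illustration, not gpt-oss’s or any particular vendor’s protocol): the orchestration code parses the call the model asks for, runs an ordinary function, and feeds the result back into the conversation so the final answer is grounded in the tool output rather than in memorized training data:

```python
import json

# Each "tool" is just a plain Python function the orchestration code can run.
# (eval is used purely for illustration; never eval untrusted model output.)
TOOLS = {
    "calculator": lambda expression: str(eval(expression, {"__builtins__": {}})),
}

def run_tool_call(model_output: str) -> str:
    """Parse a (made-up) JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    result = TOOLS[call["tool"]](**call["arguments"])
    # In a real loop, this result would be appended to the conversation and the
    # model would be called again to produce its final, grounded answer.
    return result

# Example: instead of recalling the product from memory, the model requests a tool.
model_output = '{"tool": "calculator", "arguments": {"expression": "123456789 * 987654321"}}'
print(run_tool_call(model_output))
```

The security question from above shows up exactly here: whatever ends up in the tool registry is what the model can make your machine do.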
If I had to pick a word or trend that describes LLM development this year, it would be “benchmaxxing”. Here, benchmaxxing means there’s a strong focus on pushing leaderboard numbers, sometimes to the point where benchmark performance becomes a goal in itself rather than a proxy for general capability. A prominent example was Llama 4, which scored extremely well across many established benchmarks. However, once users and developers got their hands on it, they realized that these scores didn’t reflect its real-world capabilities and usefulness. As the popular saying goes, if the test set is public, it isn’t a real test set. And the problem these days is that test set data is not only part of the training corpus (intentionally or unintentionally), but is also often directly optimized for during LLM development. Back in the day, even if benchmark scores on public test sets were inflated, at least the model ranking was still preserved. E.g., see the annotated figure from the 2019 Do ImageNet Classifiers Generalize to ImageNet? paper below. Figure 13: Annotated figure from the 2019 Do ImageNet Classifiers Generalize to ImageNet? paper. In LLM development, this has reached a point where benchmark numbers are no longer trustworthy indicators of LLM performance. However, I do think benchmarks remain necessary thresholds that LLMs must cross. I.e., if I see that an LLM scores below X on benchmark Y, I already know it’s not a good LLM. But if it scores above X on benchmark Y, that doesn’t imply it’s much better than another LLM that scores above X on the same benchmark. Another aspect to consider is that image classifiers have only one job, namely, classifying images. LLMs, however, are used for many different tasks: translating text, summarizing text, writing code, brainstorming, solving math problems, and many more. Evaluating image classifiers, where a clear metric such as classification accuracy is available, is much simpler than evaluating LLMs on both deterministic and free-form tasks. Besides trying out LLMs in practice and constantly generating new benchmarks, there’s unfortunately no solution to this problem. By the way, if you are curious to learn more about the main categories of LLM evaluation, you might like my article Understanding the 4 Main Approaches to LLM Evaluation (From Scratch). Since it comes up so often, I wanted to share my two cents about LLMs replacing humans for certain types of tasks (or even jobs). At a high level, I see LLMs as tools that give people in certain professions “superpowers”. What I mean is that when LLMs are used well, they can make individuals substantially more productive and remove a lot of friction from day-to-day work. This ranges from relatively mundane tasks, such as making sure section headers are title-cased consistently, to finding complex bugs in larger code bases. Today, I still write most of the code I care about myself. By “care about,” I mean contexts where it matters that I understand the code and that the code is correct. For example, if I set up an LLM training script, I would implement and carefully go over the training logic. This is a) to make sure it’s doing what I think it should be doing and b) to preserve my knowledge and expertise in this task. However, I now use LLMs to add the more mundane code around it, such as adding command-line argparse boilerplate so I can use my own code more conveniently from the command line. Figure 14: Example adding command line arguments to a training script using the prompt “Add argparse for all hyperparameter options to training-script.py”.
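For illustration, the kind of boilerplate I mean looks roughly like this (a minimal sketch with made-up hyperparameter names, not the exact output of that prompt):

```python
import argparse

def parse_args():
    # Hypothetical hyperparameters for a small training script
    parser = argparse.ArgumentParser(description="Minimal LLM training script")
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--num_epochs", type=int, default=1)
    parser.add_argument("--seed", type=int, default=123)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"lr={args.learning_rate}, batch_size={args.batch_size}, epochs={args.num_epochs}")
    # The hand-written training logic would be called here, e.g. train(args).
```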
But I also more and more rely on LLMs to spot issues, suggest improvements, or sanity-check ideas. At the same time, I want to understand what I am building, and as a personal goal, I aim to deepen my knowledge and skills and continue growing my expertise. At the same time, LLMs have been extremely valuable for tasks outside my core expertise. They let me automate things I would otherwise not have had the time or energy to tackle. One example is a recent tool I wrote to extract and back up my Substack articles as Markdown. (I draft everything in Markdown, but I often edit and extend articles directly in the Substack editor, so my local drafts are not always up to date). LLMs also helped me clean up the CSS on my website, which had accumulated years of duplication and inconsistencies. And there are many similar cases where I used LLMs this year. Or, in short, I think the trick here is to recognize when and when not to use LLMs. And how to use LLMs in a way that helps you grow your expertise in a way that also feels satisfying. LLMs got better at writing code, but despite what I hear some other people say, I don’t think that code is or will become ephemeral or obsolete. LLMs give people superpowers to generate certain coding projects that would have taken them lots of effort to create themselves. ​However, pure LLM-generated code bases don’t replace expert-crafted code bases. These expert code bases may have even been created by human coders using LLMs themselves. But the key point is that someone with expertise in this area has invested a lot of time and effort in creating, testing, and refining it. It would take someone else a lot of work to replicate it, so why not adopt it if it exists? ​In short, I think that an expert full-stack web developer who has learned about good design patterns and trade-offs and has studied, seen, and built many platforms in their career will be able to build a better platform than a random person prompting an LLM to build one. ​The awesome thing is that a random person can now build a platform, even if it’s not the best one. However, using and prompting LLMs will only get that person so far, and the platform’s quality may plateau. So, if the person really cares about improving the platform, it would be a good idea to go deeper here, learn how others build platforms, and come back with more knowledge to use LLMs more effectively to guide and improve the platform design. Similar to coding, I do not see LLMs making technical writing obsolete. Writing a good technical book takes thousands of hours and deep familiarity with the subject. That process may involve LLMs to improve clarity, check technical correctness, explore alternatives, or run small experiments, but the core work still depends on human judgment and expertise. Figure 15: A non-staged example where an LLM just helped me to find and fix an error in a previous article. Yes, LLMs can make technical books better. They can help authors find errors, expand references, and generally reduce time spent on mundane tasks. This frees up more time for the deep work that actually requires creativity and experience. From the reader’s perspective, I also do not think LLMs replace technical writing. Using an LLM to learn about a topic works well for quick questions and beginner-level explanations. However, this approach quickly becomes messy when you want to build a deeper understanding. 
At that point, instead of potentially wasting hours trying to filter LLM responses about a topic you are trying to learn but are not (yet) an expert in, it often makes sense to follow a structured learning path designed by an expert. (The expert may or may not have used LLMs.) Of course, it still makes perfect sense to use LLMs for clarifying questions or exploring side paths while taking a course or learning from a book. It’s also great to have an LLM design quizzes or exercises to practice the knowledge. Overall, I see LLMs as a net win for both writers and readers. But I also think the trick here is to learn to recognize when and when not to use LLMs. For instance, the main downside is that it can be tempting to immediately use an LLM when a topic gets hard, because struggling through a problem yourself first often leads to much stronger learning. I see research in much the same way. LLMs are very useful for finding related literature, spotting issues in mathematical notation, and suggesting follow-up experiments. But it still makes sense to keep a human researcher in the driver’s seat. Maybe the rules of thumb here are something like this: If a (research) article or book was generated entirely by a human, it could potentially have been improved further. And if a (research) article or book could have been generated by just prompting an LLM, then it’s probably not novel and/or deep enough. LLMs are still fairly new and evolving, and I think there is also a less discussed downside to overusing them. For instance, I think that if the model does all the doing and the human mainly supervises, work can start to feel hollow. Sure, some people genuinely enjoy focusing on managing systems and orchestrating workflows, and that is a perfectly valid preference. But for people who enjoy doing the thing itself, I think this mode of work can accelerate burnout. (This is likely especially true at companies that expect more results faster now that we have LLMs.) There is a special satisfaction in struggling with a hard problem and finally seeing it work. I do not get the same feeling when an LLM one-shots the solution. I guess it’s similar to cooking (this is just something that came to mind, and I’m not a great cook). If you enjoy making pizza, using pre-made dough and only adding toppings likely removes much of the joy, and cooking becomes a means to an end. That’s not necessarily bad, but I think if you are doing this work for many hours every day over a longer stretch (months or years), I can see how it will feel empty and eventually lead to burnout. So, a selfish perspective is that writing code is more enjoyable than reading code. And you may agree that creating pull requests is usually more fun than reviewing them (but of course, this is not true for everyone). Maybe a good, idealized (but not perfect) analogy for how we should use AI in a sustainable way is chess. Chess engines surpassed human players decades ago, yet professional chess played by humans is still active and thriving. I am not a chess expert, but I’d say the game has probably even become richer and more interesting. Based on what I’ve heard (e.g., from Kasparov’s Deep Thinking book and podcasts featuring Magnus Carlsen), modern players have been using AI to explore different ideas, challenge their intuitions, and analyze mistakes with a level of depth that simply was not possible before. I think this is a useful model for how to think about AI in other forms of intellectual work.
Used well, AI can accelerate learning and expand what a single person can reasonably take on. I think we should treat it more as a partner rather than a replacement. But I also think if AI is used to outsource thinking and coding entirely, it risks undermining motivation and long-term skill development. Figure 16: LLMs lower the barrier of entry, and they make coders (beginners and experts) more productive. However, as we are wrapping up the year 2025, I think it's still worth investing in becoming an expert, because then you will get even more out of LLMs and will be able to deliver even better results. The general coding, knowledge-answering, and writing capabilities of LLMs keep improving. This is largely true because scaling still delivers a positive return on investment thanks to improvements in training pipelines and paradigms (e.g., RLVR), as well as in inference scaling and tool use. However, this will begin to plateau at some point (similar to what we have seen for the GPT 4 to GPT 4.5 development), unless we keep on inventing new training methods and/or architectures (at this point, no one knows what these might look like, yet). LLMs are currently able to solve a lot of general tasks and low(er) hanging fruit. But to entrench them in certain industries, it would require more domain specialization. I think LLM providers would love to get their hands on high-quality, domain-specific data. For now, it looks like this will be a challenge. For instance, it appears that most of the companies approached have declined such deals precisely because the data is proprietary and core to their business differentiation. (I’ve heard this from multiple sources, and there was also a The Information article on this topic.) ​In my opinion, it makes total sense. I think that selling valuable and proprietary data, which can give a company an edge one day, to OpenAI or Anthropic could be a bit short-sighted. Figure 17: Example of sectors and types of data that could be useful for training domain-specific LLMs, but where selling the data externally would be concerning. (I am not a legal expert, and this is not legal advice, but I can imagine that if it’s a pure local LLM that doesn’t leave the companies’ secure servers, training the model on patient health data is no different than developing other types of internal software that works with that patient health data.) Right now, LLM development is prohibitively expensive and challenging at scale, which is why only a few major companies develop state-of-the-art LLMs. However, I think LLM development is becoming increasingly commoditized, as LLM developers frequently rotate between employers and will eventually be hired by bigger financial institutions, biotech companies, and others with budgets to develop competitive in-house LLMs that benefit from their private data. ​ These LLMs don’t even have to be entirely trained from scratch; many state-of-the-art LLMs like DeepSeek V3.2, Kimi K2, and GLM 4.7 are being released and could be adapted and further post-trained. You may be wondering what I have been up to this year. My focus has been almost entirely on LLM-related work. Last year, I decided to become independent and start my own company, mainly to have more time to work on my own research, books, Substack writing, and industry collaborations. As an independent researcher, consulting projects are part of what makes this setup sustainable. 
8. Building LLMs and Reasoning Models From Scratch

You may be wondering what I have been up to this year. My focus has been almost entirely on LLM-related work. Last year, I decided to become independent and start my own company, mainly to have more time to work on my own research, books, Substack writing, and industry collaborations. As an independent researcher, consulting projects are part of what makes this setup sustainable. This includes covering the usual everyday expenses (from groceries to health insurance), but also less visible costs such as cloud compute for said experiments. Over time, my goal is to further reduce consulting work and spend more time on long-form research and writing, especially the technical deep dives I share here. I am in the fortunate position that many companies have reached out about full-time roles, which would be a viable option if independence does not work out, but for now, I plan to remain independent. If you find my work useful, and if you can, subscribing to the Substack or picking up one of my books genuinely helps make this kind of work sustainable, and I really appreciate the support.

One of my personal highlights this year has been the positive feedback on my book Build A Large Language Model (From Scratch). I received many thoughtful messages from readers at companies and universities all around the world. The feedback spans a wide range of use cases, from college professors adopting the book as a primary textbook to teach how LLMs work, to former students who used it to prepare for job interviews and land new roles, to engineers who relied on it as a stepping stone for implementing custom LLMs in production. I was also excited to learn that the book has now been translated into at least nine languages.

Figure 18: Build A Large Language Model (From Scratch) translated into different languages.

Many readers also asked whether there would be a second edition covering newer and more advanced topics. While that is something I have thought about, I am cautious about making the book less accessible. For example, replacing standard multi-head attention with more complex variants such as multi-head latent attention, as used in some newer DeepSeek models, would raise the barrier to entry quite a bit. Instead, for now, I prefer to keep the book as is, since it works really well for people who want to get into LLMs. And for readers interested in more advanced material, I added substantial bonus material to the book's GitHub repository over the course of the year as a follow-up. I plan to continue expanding these materials over time.

Figure 19: Excerpt of some of the bonus material I added to the Build A Large Language Model (From Scratch) repository this year.

In addition, as you may know, I am currently working on a sequel, Build A Reasoning Model (From Scratch). The first book, Build A Large Language Model (From Scratch), focuses on the core large language model architecture and the fundamentals of pre-training.

Figure 20: Illustration of how the two from-scratch books relate to each other.

The reasoning model book picks up where the first book leaves off. Starting from a pre-trained base model, it explores inference-time scaling methods and reinforcement learning techniques aimed specifically at improving reasoning capabilities.

Figure 21: Excerpt of Build A Reasoning Model (From Scratch), which is available in early access.

Alongside this Substack, I am working hard on writing the reasoning book, and in many ways, I think it is my most carefully thought-out and most polished book so far. At this point, my estimate is that I spend approximately 75-120 hours on each chapter.
In case you are curious, I estimate that this typically breaks down as follows:

- 3-5 hours: brainstorming and revising the topic selection
- 5-10 hours: structuring the content
- 20 hours: writing the initial code
- 10-20 hours: running additional experiments and reading the latest literature for more insights
- 10-20 hours: making figures
- 10 hours: writing the initial draft text
- 10-20 hours: rewriting and refining the chapter
- 5-10 hours: making the exercises plus running the experiments
- 2-5 hours: incorporating editor and reader suggestions

Currently, I am halfway through chapter 6, which implements the reinforcement learning with verifiable rewards (GRPO) code for training reasoning models.

Figure 22: Early results from experiments for chapters 6 and 7 on reinforcement learning with verifiable rewards.

Build A Reasoning Model (From Scratch) is very hard work, but I am thoroughly enjoying working on it! I hope you and other readers will find it as useful as Build A Large Language Model (From Scratch).
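Since chapter 6 centers on reinforcement learning with verifiable rewards, here is a minimal illustration of what a "verifiable" reward can look like for a math-style prompt. This is just my own sketch of the general idea, not code from the book:

import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    # Deterministic check: reward 1.0 if the last number in the model's output
    # matches the reference answer, else 0.0. No human preference labels needed.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth.strip() else 0.0

print(verifiable_reward("... so 12 * 12 = 144", "144"))               # 1.0
print(verifiable_reward("... therefore the answer is 142", "144"))    # 0.0

Because rewards like this can be computed automatically and at scale, they are what make RLVR-style post-training attractive compared to collecting expensive human-written responses or preference labels.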
9. Surprises in 2025 and Predictions for 2026

I wanted to close this article with some of the main takeaways, focusing on things that were a bit surprising to me and things I predict for 2026.

9.1 Noteworthy and Surprising Things in 2025

Let's start with the surprises of 2025. These are developments I likely would not have expected if you had asked me a year earlier, in 2024:

- Several reasoning models are already achieving gold-level performance in major math competitions (OpenAI with an unnamed model, Gemini Deep Think, and the open-weight DeepSeekMath-V2). I am not surprised that this happened in general, but I am surprised that it already happened in 2025, not 2026.
- Llama 4 (or Llama in general) fell almost completely out of favor in the open-weight community, and Qwen has overtaken Llama in popularity (as measured by the number of downloads and derivatives reported via the ATOM project).
- Mistral AI uses the DeepSeek V3 architecture for its latest flagship Mistral 3 model, announced in December 2025.
- Besides Qwen3 and DeepSeek R1/V3.2, many additional contenders have emerged in the race for open-weight state-of-the-art models, including Kimi, GLM, MiniMax, and Yi.
- Cheaper, efficient hybrid architectures are already becoming a bigger priority in leading labs (Qwen3-Next, Kimi Linear, Nemotron 3), as opposed to being developed only by separate labs.
- OpenAI released an open-weight model (gpt-oss, and I wrote a standalone article about it earlier this year).
- MCP (joining the Linux Foundation) has already become the standard for tool and data access in agent-style LLM systems (for now); I expected the ecosystem to remain more fragmented in 2025 and until at least 2026.

As for 2026, here is what I predict:

- We will likely see an industry-scale, consumer-facing diffusion model for cheap, reliable, low-latency inference, with Gemini Diffusion probably going first.
- The open-weight community will slowly but steadily adopt LLMs with local tool use and increasingly agentic capabilities.
- RLVR will expand more widely into other domains beyond math and coding (for example, chemistry, biology, and others).
- Classical RAG will slowly fade as a default solution for document queries. Instead of using retrieval on every document-related query, developers will rely more on better long-context handling, especially as better "small" open-weight models become available.
- A lot of LLM benchmark and performance progress will come from improved tooling and inference-time scaling rather than from training or the core model itself. It will look like LLMs are getting much better, but this will mainly be because the surrounding applications are improving.
- At the same time, developers will focus more on lowering latency and making reasoning models expend fewer reasoning tokens where they are unnecessary.

Don't get me wrong, 2026 will push the state of the art further, but a larger share of the progress will come from the inference side rather than purely from the training side.

To wrap things up, if there is one meta-lesson from 2025, it is that progress in LLMs is less about a single breakthrough and more about improvements on multiple fronts via multiple independent levers. This includes architecture tweaks, data quality improvements, reasoning training, inference scaling, tool calling, and more. At the same time, evaluation remains hard, benchmarks are imperfect, and good judgment about when and how to use these systems is still essential. My hope for 2026 is that we continue to see interesting improvements, but also that we understand where the improvements are coming from. This requires both better and more consistent benchmarking, and of course transparency.

Thank you for reading, and for all the thoughtful feedback and discussions throughout the year, in the comments and across all the different platforms, from Substack Notes to GitHub. The positive feedback and detailed conversations genuinely keep me motivated to invest the time and energy required for long-form articles and to keep digging deeply into LLM research and implementation details. I learned a lot from these exchanges, and I hope you did too. I am very much looking forward to continuing these conversations as the field keeps evolving in 2026!

Cheers,
Sebastian

10. Bonus: A Curated LLM Research Papers List (July to December 2025)

In June, I shared a bonus article with my curated and bookmarked research paper lists with the paid subscribers who make this Substack possible. In a similar fashion, as a thank-you to all the kind supporters, I prepared a list of all the interesting research articles I bookmarked and categorized from July to December 2025. I skimmed the abstracts of these papers but only read a very small fraction in full. Still, I like to keep collecting these organized lists, as I often go back to them when working on a given project. Given the already enormous length of this article, I am sharing the list in a separate article, which is linked below:

Thanks so much for subscribing to my Ahead of AI blog and for supporting my work this year. I really appreciate it. Your support makes this work feasible in a very real sense and allows me to keep spending the time needed to write, experiment, and think deeply about these topics!

0 views
Susam Pal 3 weeks ago

My Coding Adventures in 2025

In this post, I return with a retrospective on my coding adventures, where I summarise my hobby projects and recreational programming activities from the current year. I did the last such retrospective in 2023 . So I think this is a good time to do another retrospective. At the outset, I should mention that I have done less hobby computing this year than in the past few, largely because I spent a substantial portion of my leisure time studying Galois theory and algebraic graph theory. In case you are wondering where I am learning these subjects from, the books are Galois Theory , 5th ed. by Ian Stewart and Algebraic Graph Theory by Godsil and Royle. Both are absolutely fascinating subjects and the two books I mentioned are quite good as well. I highly recommend them. Now back to the coding adventures. Here they go: MathB : The year began not with the release of a new project but with the opposite: discontinuing a project I had maintained for 13 years. MathB.in, a mathematics pastebin service, was discontinued early this year. This is a project I developed in 2012 for myself and my friends. Although a rather simple project, it was close to my heart, as I have many fond memories of exchanging mathematical puzzles and solutions with my friends using this service. Over time, the project grew quite popular on IRC networks, as well as in some schools and universities, where IRC users, learners, and students used the service to share problems and solutions with one another, much as my friends and I had done in its early days. I shut it down this year because I wanted to move on from the project. Before the shutdown, a kind member of the Archive Team worked with me to archive all posts from the now-defunct website. Although shutting down this service was a bittersweet event for me, I feel relieved that I no longer have to run a live service in my spare time. While this was a good hobby ten years ago, it no longer is. See my blog post MathB.in Is Shutting Down for more details on the reasons behind this decision. The source code of this project remains open source and available at github.com/susam/mathb . QuickQWERTY : This is a touch-typing tutor that runs in a web browser. I originally developed it in 2008 for myself and my friends. While I learned touch typing on an actual typewriter as a child, those lessons did not stick with me. Much later, while I was at university, I came across a Java applet-based touch-typing tutor that finally helped me learn touch typing properly. I disliked installing Java plugins in the web browser, which is why I later developed this project in plain HTML and JavaScript. This year, I carried out a major refactoring to collapse the entire project into a single standalone HTML file with no external dependencies. The source code has been greatly simplified as well. When I was younger and more naive, inspired by the complexity and multiple layers of abstraction I saw in popular open source and professional projects, I tended to introduce similar abstractions and complexity into my personal projects. Over time, however, I began to appreciate simplicity. The new code for this project is smaller and simpler, and I am quite happy with the end result. You can take a look at the code here: quickqwerty.html . If you want to use the typing tutor, go here: QuickQWERTY . Unfortunately, it does not support keyboard layouts other than QWERTY. When I originally developed this project, my view of the computing world was rather limited. 
I was not even aware that other keyboard layouts existed. You are, however, very welcome to fork the project and adapt the lessons for other layouts. CFRS[] : This project was my first contribution to the quirky world of esolangs. CFRS[] is an extremely minimal drawing language consisting of only six simple commands: C, F, R, S, [, and ]. I developed it in 2023 and have since been maintaining it with occasional bug fixes. This year, I fixed an annoying bug that caused the drawing canvas to overflow on some mobile web browsers. A new demo also arrived from the community this year and has now been added to the community demo page. See Glimmering Galaxy for the new demo. If you want to play with CFRS[] now, visit CFRS[]. FXYT : This is another esolang project of mine. This too is a minimal drawing language, though not as minimal as CFRS[]. Instead, it is a stack-based, postfix canvas colouring language with only 36 simple commands. The canvas overflow bug described in the previous entry affected this project as well. That has now been fixed. Further, by popular demand, the maximum allowed code length has been increased from 256 bytes to 1024 bytes. This means there is now more room for writing more complex FXYT programs. Additionally, the maximum code length for distributable demo links has been increased from 64 bytes to 256 bytes. This allows several more impressive demos to have their own distributable links. Visit FXYT to try it out now. See also the Community Demos to view some fascinating artwork created by the community. Nerd Quiz : This is a new project I created a couple of months ago. It is a simple HTML tool that lets you test your nerdiness through short quizzes. Each question is drawn from my everyday moments of reading, writing, thinking, learning, and exploring. The project is meant to serve as a repository of interesting facts I come across in daily life, captured in the form of quiz questions. Go here to try it out: Nerd Quiz. I hope you will enjoy these little bits of knowledge as much as I enjoyed discovering them. Mark V. Shaney Junior : Finally, I have my own Markov gibberish generator. Always wanted to have one. The project is inspired by the legendary Usenet bot named Mark V. Shaney that used to post messages to various newsgroups in the 1980s. My Markov chain program is written in about 30 lines of Python (a short sketch of this kind of generator follows at the end of this list). I ran it on my 24 years of blog posts, consisting of over 200 posts and about 200,000 words, and it generated some pretty interesting gibberish. See my blog post Fed 24 Years of My Posts to Markov Model to see the examples. Elliptical Python Programming : If the previous item was not silly enough, this one surely is. Earlier this year, I wrote a blog post describing the fine art of Python programming using copious amounts of ellipses. I will not discuss it further here to avoid spoilers. I'll just say that any day I'm able to do something pointless, whimsical and fun with computers is a good day for me. And it was a good day when I wrote this post. Please visit the link above to read the post. I hope you find it fun. Fizz Buzz with Cosines : Another silly post in which I explain how to compute the discrete Fourier transform of the Fizz Buzz sequence and derive a closed-form expression that can be used to print the sequence. Fizz Buzz in CSS : Yet another Fizz Buzz implementation, this time using just four lines of CSS.
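For readers curious what a roughly 30-line Markov text generator can look like, here is a minimal word-level sketch in Python. It is my own illustration of the general technique, not Susam's actual code, and the corpus filename is a placeholder:

import random
from collections import defaultdict

def build_chain(text, order=2):
    # Map each tuple of `order` consecutive words to the words that follow it.
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=50):
    # Start from a random key and repeatedly sample one of its observed successors.
    key = random.choice(list(chain))
    output = list(key)
    for _ in range(length):
        followers = chain.get(key)
        if not followers:
            break
        output.append(random.choice(followers))
        key = tuple(output[-len(key):])
    return " ".join(output)

corpus = open("posts.txt", encoding="utf-8").read()  # placeholder: all posts in one text file
print(generate(build_chain(corpus)))

Feeding it a larger corpus (such as two decades of blog posts) tends to produce more fluent, and funnier, gibberish, since each key has more observed successors to sample from.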
That wraps up my coding adventures for this year. There were fewer hobby projects than usual, but I enjoyed spending more time learning new things and revisiting old ones. One long-running project came to an end, another was cleaned up, and a few small new ideas appeared along the way. Looking forward to what the next year brings. Read on website | #programming | #technology | #retrospective

Max Woolf 3 weeks ago

Nano Banana Pro is the best AI image generator, with caveats

A month ago, I posted a very thorough analysis on Nano Banana, Google's then-latest AI image generation model, and how it can be prompt engineered to generate high quality and extremely nuanced images that most other image generation models can't achieve, including ChatGPT at the time. For example, you can give Nano Banana a prompt with a comical amount of constraints: Nano Banana can handle all of these constraints easily:

Exactly one week later, Google announced Nano Banana Pro, another AI image model that in addition to better image quality now touts five new features: high-resolution output, better text rendering, grounding with Google Search, thinking/reasoning, and better utilization of image inputs. Nano Banana Pro can be accessed for free using the Gemini chat app with a visible watermark on each generation, but unlike the base Nano Banana, Google AI Studio requires payment for Nano Banana Pro generations.

After a brief existential crisis worrying that my months of effort researching and developing that blog post were wasted, I relaxed a bit after reading the announcement and documentation more carefully. Nano Banana and Nano Banana Pro are different models (despite some using the terms interchangeably), but Nano Banana Pro is not Nano Banana 2 and does not obsolete the original Nano Banana—far from it. Not only is the cost of generating images with Nano Banana Pro far greater, but the model may not even be the best option depending on your intended style. That said, there are quite a few interesting things Nano Banana Pro can now do, many of which Google did not cover in their announcement and documentation.

I'll start off by answering the immediate question: how does Nano Banana Pro compare to the base Nano Banana? Working on my previous Nano Banana blog post required me to develop many test cases that were specifically oriented to Nano Banana's strengths and weaknesses: most passed, but some of them failed. Does Nano Banana Pro fix the issues I had encountered? Could Nano Banana Pro cause more issues in ways I don't anticipate? Only one way to find out.

We'll start with the test case that should now work: the infamous prompt, as Google's announcement explicitly highlights Nano Banana Pro's ability to style transfer. In Nano Banana, style transfer objectively failed on my own mirror selfie: How does Nano Banana Pro fare? Yeah, that's now a pass. You can nitpick over whether the style is truly Ghibli or just something animesque, but it's clear Nano Banana Pro now understands the intent behind the prompt, and it does a better job of the Ghibli style than ChatGPT ever did.

Next, code generation. Last time I included an example prompt instructing Nano Banana to display a minimal Python implementation of a recursive Fibonacci sequence with proper indentation and syntax highlighting, which should result in something like: Nano Banana failed to indent the code and syntax highlight it correctly: How does Nano Banana Pro fare? Much, much better. In addition to better utilization of the space, the code is properly indented and tries to highlight keywords, functions, variables, and numbers differently, although not perfectly. It even added a test case!

Relatedly, OpenAI just released ChatGPT Images, based on their new image generation model. While it's beating Nano Banana Pro in the Text-To-Image leaderboards on LMArena, it has difficulty with prompt adherence, especially with complex prompts such as this one.
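For reference, the original post showed the expected output as an image; a correct minimal answer to that Fibonacci prompt would look roughly like this (an illustrative sketch, not the exact code from the test image):

```python
def fibonacci(n):
    """Return the n-th Fibonacci number using naive recursion."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


# Quick test of the first ten values.
print([fibonacci(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```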
Syntax highlighting is very bad, the is missing a parameter, and there's a random in front of the return statements. At least it no longer has a piss-yellow hue.

Speaking of code, how well can it handle rendering webpages given a single-page HTML file with about a thousand tokens worth of HTML/CSS/JS? Here's a simple Counter app rendered in a browser. Nano Banana wasn't able to handle the typography and layout correctly, but Nano Banana Pro is supposedly better at typography. That's a significant improvement!

At the end of the Nano Banana post, I illustrated a more comedic example where characters from popular intellectual property such as Mario, Mickey Mouse, and Pikachu are partying hard at a seedy club, primarily to test just how strict Google is with IP. Since the training data is likely similar, I suspect any issues around IP will be the same with Nano Banana Pro—as a side note, Disney has now sued Google over Google's use of Disney's IP in their AI generation products. However, due to post length I cut out an analysis on how it didn't actually handle the image composition perfectly: Here's the Nano Banana Pro image using the full original prompt: Prompt adherence to the composition is much better: the image is more "low quality", the nightclub is darker and seedier, the stall is indeed a corner stall, and the labels on the alcohol are accurate without extreme inspection. There's even a date watermark: one curious trend I've found with Nano Banana Pro is that it likes to use dates in 2023.

The immediate thing that caught my eye from the documentation is that Nano Banana Pro has 2K output (4 megapixels, e.g. 2048x2048) compared to Nano Banana's 1K/1 megapixel output, which is a significant improvement and allows the model to generate images with more detail. What's also curious is the image token count: while Nano Banana generates 1,290 tokens before generating a 1 megapixel image, Nano Banana Pro generates fewer tokens at 1,120 tokens for a 2K output, which implies that Google made advancements in Nano Banana Pro's image token decoder as well. Curiously, Nano Banana Pro also offers 4K output (16 megapixels, e.g. 4096x4096) at 2,000 tokens: a 79% token increase for a 4x increase in resolution. The tradeoff is cost: a 1K/2K image from Nano Banana Pro costs $0.134 per image, about three times the cost of a base Nano Banana generation at $0.039. A 4K image costs $0.24.

If you didn't read my previous blog post, I argued that the secret to Nano Banana's good generation is its text encoder, which not only processes the prompt but also generates the autoregressive image tokens to be fed to the image decoder. Nano Banana is based off of Gemini 2.5 Flash, one of the strongest LLMs at the tier that optimizes for speed. Nano Banana Pro's text encoder, however, is based off Gemini 3 Pro, which is not only an LLM tier that optimizes for accuracy but also a major version increase with a significant performance increase over the Gemini 2.5 line. 1 Therefore, the prompt understanding should be even stronger. However, there's a very big difference: because Gemini 3 Pro is a model that forces "thinking" before returning a result (and that thinking cannot be disabled), Nano Banana Pro also thinks. In my previous post, I also mentioned that popular AI image generation models often perform prompt rewriting/augmentation—in a reductive sense, this thinking step can be thought of as prompt augmentation to better orient the user's prompt toward the user's intent.
The thinking step is a bit unusual, but the thinking trace can be fully viewed when using Google AI Studio: Nano Banana Pro often generates a sample 1K image to prototype a generation, which is new. I'm always a fan of two-pass strategies for getting better quality from LLMs so this is useful, although in my testing the final output 2K image isn't significantly different aside from higher detail. One annoying aspect of the thinking step is that it makes generation time inconsistent: I've had 2K generations take anywhere from 20 seconds to one minute, sometimes even longer during peak hours.

One of the more viral use cases of Nano Banana Pro is its ability to generate legible infographics. However, since infographics require factual information and LLM hallucination remains unsolved, Nano Banana Pro now supports Grounding with Google Search, which allows the model to search Google to find relevant data to input into its context. For example, I asked Nano Banana Pro to generate an infographic for my gemimg Python package with this prompt and Grounding explicitly enabled, with some prompt engineering to ensure it uses the Search tool and also to make it fancy: That's a correct enough summation of the repository intro and the style adheres to the specific constraints, although it's not something that would be interesting to share. It also duplicates the word "interfaces" in the third panel.

In my opinion, these infographics are a gimmick more intended to appeal to business workers and enterprise customers. It's indeed an effective demo of how Nano Banana Pro can generate images with massive amounts of text, but an AI-generated image takes more effort than usual: you have to double-check everything in the image to ensure it's factually correct. And if it isn't correct, it can't be trivially touched up in a photo editing app; it requires another complete generation that may or may not fix the errors. The duplicate "interfaces" in this case could be covered up in Microsoft Paint, but that's just due to luck.

However, there's a second benefit to grounding: it allows the LLM to incorporate information from beyond its knowledge cutoff date. Although Nano Banana Pro's cutoff date is January 2025, there's a certain breakout franchise that sprang up from complete obscurity in the summer of 2025, and one that the younger generations would be very prone to generate AI images about, only to be disappointed and confused when it doesn't work. Grounding with Google Search, in theory, should be able to surface images of the KPop Demon Hunters that Nano Banana Pro can then leverage to generate images featuring Rumi, Mira, and Zoey, or, at the least, if grounding does not support image analysis, it can surface sufficient visual descriptions of the three characters. So I tried the following prompt in Google AI Studio with Grounding with Google Search enabled, keeping it uncharacteristically simple to avoid confounding effects: "Golden" is about Golden Gate Park, right? That, uh, didn't work, even though the reasoning trace identified what I was going for: Of course, you can always pass in reference images of the KPop Demon Hunters, but that's boring.

One "new" feature that Nano Banana Pro supports is system prompts—it is possible to provide a system prompt to the base Nano Banana but it's silently ignored. One way to test is to provide the simple prompt of but also with the system prompt of which makes it wholly unambiguous whether the system prompt works.
And it is indeed in black and white—the message is indeed silly. Normally for text LLMs, I prefer to do my prompt engineering within the system prompt as LLMs tend to adhere to system prompts better than if the same constraints are placed in the user prompt. So I ran a test of two approaches to generation with the following prompt, harkening back to my base skull pancake test prompt, although with new compositional requirements: I did two generations: one with the prompt above, and one that splits the base prompt into the user prompt and the compositional list as the system prompt. Both images are similar and both look very delicious. I prefer the one without the system prompt in this instance, but both fit the compositional requirements as defined. That said, as with LLM chatbot apps, the system prompt is useful if you're trying to enforce the same constraints/styles among arbitrary user inputs which may or may not be good user inputs, such as if you were running an AI generation app based off of Nano Banana Pro. Since I explicitly want to control the constraints/styles per individual image, it's less useful for me personally.

As demoed in the infographic test case, Nano Banana Pro can now render text near perfectly with few typos—substantially better than the base Nano Banana. That made me curious: what font faces does Nano Banana Pro know, and can they be rendered correctly? So I gave Nano Banana Pro a test to generate a sample text with different font faces and weights, mixing native system fonts and freely-accessible fonts from Google Fonts: That's much better than expected: aside from some text clipping on the right edge, all font faces are correctly rendered, which means that specifying specific fonts is now possible in Nano Banana Pro.

Let's talk more about that 5x2 font grid generation. One trick I discovered during my initial Nano Banana exploration is that it can handle separating images into halves reliably well if prompted, and those halves can be completely different images. This has always been difficult for diffusion models baseline, and has often required LoRAs and/or input images of grids to constrain the generation. However, for a 1 megapixel image, that's less useful since any subimages will be too small for most modern applications. Since Nano Banana Pro now offers 4 megapixel images baseline, this grid trick is now more viable: a 2x2 grid of images means that each subimage is now the same 1 megapixel as the base Nano Banana output, with the very significant bonuses that a) it gets Nano Banana Pro's improved generation quality and b) each subimage can be distinct, particularly due to the autoregressive nature of the generation which is aware of the already-generated images. Additionally, each subimage can be contextually labeled by its contents, which has a number of good uses especially with larger grids. It's also slightly cheaper: base Nano Banana costs $0.039/image, but splitting a $0.134/image Nano Banana Pro into 4 images results in ~$0.034/image.

Let's test this out using the mirror selfie of myself: This time, we'll try a more common real-world use case for image generation AI that no one will ever admit to doing publicly but I will do so anyways because I have no shame: I can't use any of these because they're too good. One unexpected nuance in that example is that Nano Banana Pro correctly accounted for the mirror in the input image, and put the gray jacket's Patagonia logo and zipper on my left side.
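To make the grid cost tradeoff above concrete, here's the back-of-the-envelope arithmetic using the prices quoted in this post (illustrative only, not an official pricing calculator):

```python
# Per-image prices quoted above.
base_nano_banana = 0.039   # base Nano Banana, 1K output
pro_2k = 0.134             # Nano Banana Pro, 1K/2K output
pro_4k = 0.24              # Nano Banana Pro, 4K output

# Cost per subimage when a single generation is split into a grid.
print(f"2x2 grid at 2K: ${pro_2k / 4:.4f} per subimage")   # ~$0.0335, below the $0.039 base price
print(f"4x4 grid at 4K: ${pro_4k / 16:.4f} per subimage")  # $0.0150, at the cost of fewer tokens per subimage
```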
A potential concern is quality degradation, since the number of output tokens is the same regardless of how many subimages you create. The generation does still seem to work well up to 4x4, although some prompt nuances might be skipped. It's still great and cost effective for exploration of generations where you're not sure how the end result will look, which can then be further refined via normal full-resolution generations. After 4x4, things start to break in interesting ways. You might think that setting the output to 4K might help, but that only increases the number of output tokens by 79% while the number of output images increases far more than that. To test, I wrote a very fun prompt: This prompt effectively requires reasoning and has many possible points of failure. Generating at 4K resolution: It's funny that both Porygon and Porygon2 are prime; Porygon-Z isn't, though.

The first 64 prime numbers are correct and the Pokémon do indeed correspond to those numbers (I checked manually), but that was the easy part. However, the token scarcity may have incentivised Nano Banana Pro to cheat: the Pokémon images here are similar-if-not-identical to official Pokémon portraits throughout the years. Each style is correctly applied within the specified numeric constraints but as a half-measure in all cases: the pixel style isn't 8-bit but more 32-bit, matching the Game Boy Advance generation (it's not a replication of the GBA-era sprites, however), the charcoal drawing style looks more like a 2000s Photoshop filter that still retains color, and the Ukiyo-e style isn't applied at all aside from an attempt at a background. To sanity check, I also generated normal 2K images of Pokémon in the three styles with Nano Banana Pro: The detail is obviously stronger in all cases (although the Ivysaur still isn't 8-bit), but the Pokémon design is closer to the 8x8 grid output than expected, which implies that Nano Banana Pro may not have fully cheated and that it can adapt to having just 31.25 tokens per subimage. Perhaps the Gemini 3 Pro backbone is too strong.

While I've spent quite a long time talking about the unique aspects of Nano Banana Pro, there are some issues with certain types of generations. The problem with Nano Banana Pro is that it's too good and it tends to push prompts toward realism—an understandable RLHF target for the median user prompt, but it can cause issues with prompts that are inherently surreal. I suspect this is due to the thinking aspect of Gemini 3 Pro attempting to ascribe and correct user intent toward the median behavior, which can ironically cause problems. For example, with the photos of the three cats at the beginning of this post, Nano Banana Pro unsurprisingly has no issues with the prompt constraints, but the output raised an eyebrow: I hate comparing AI-generated images by vibes alone, but this output triggers my uncanny valley sensor while the original one did not. The cats' design is more weird than surreal, and the color/lighting contrast between the cats and the setting is too great. Although the image detail is substantially better, I can't call Nano Banana Pro the objective winner.

Another test case I had issues with is Character JSON.
In my previous post, I created an intentionally absurd giant character JSON prompt featuring a Paladin/Pirate/Starbucks Barista posing for Vanity Fair; here I compare that generation to one from Nano Banana Pro: It's more realistic, but that form of hyperrealism makes the outfit look more like cosplay than a practical design: your mileage may vary.

Lastly, there's one more test case that's everyone's favorite: Ugly Sonic! Nano Banana Pro specifically advertises that it supports better character adherence (up to six input images), so I used my two input images of Ugly Sonic with a Nano Banana Pro prompt that has him shake hands with President Barack Obama: Wait, what? The photo looks nice, but that's normal Sonic the Hedgehog, not Ugly Sonic. The original intent of this test was to see if the model would cheat and just output Sonic the Hedgehog instead, which appears to now be happening. After giving Nano Banana Pro all seventeen of my Ugly Sonic photos and my optimized prompt for improving the output quality, I hoped that Ugly Sonic would finally manifest: That is somehow even less like Ugly Sonic. Is Nano Banana Pro's thinking process trying to correct the "incorrect" Sonic the Hedgehog?

As usual, this blog post just touches the tip of the iceberg with Nano Banana Pro: I'm trying to keep it under 26 minutes this time. There are many more use cases and concerns I'm still investigating, but I do not currently have conclusive results. Despite my praise for Nano Banana Pro, I'm unsure how often I'd use it in practice over the base Nano Banana outside of making blog post header images—even in that case, I'd only use it if I could think of something interesting and unique to generate. The increased cost and generation time are a severe constraint on many fun use cases outside of one-off generations. Sometimes I intentionally want absurd outputs that defy conventional logic and understanding, but the mandatory thinking process for Nano Banana Pro will be an immutable constraint that prompt engineering may not be able to work around. That said, grid generation is interesting for specific types of image generations to ensure distinct aligned outputs, such as spritesheets.

Although some might criticize my research into Nano Banana Pro because it could be used for nefarious purposes, it's become even more important to highlight just what it's capable of, as discourse about AI has only become worse in recent months and the degree to which AI image generation has progressed in mere months is counterintuitive. For example, on Reddit, one megaviral post on the /r/LinkedinLunatics subreddit mocked a LinkedIn post trying to determine whether Nano Banana Pro or ChatGPT Images could create a more realistic woman in gym attire. The top comment on that post is "linkedin shenanigans aside, the [Nano Banana Pro] picture on the left is scarily realistic", with most of the other thousands of comments being along the same lines.

If anything, Nano Banana Pro makes me more excited for the actual Nano Banana 2, which, with Gemini 3 Flash's recent release, will likely arrive sooner rather than later. The gemimg Python package has been updated to support Nano Banana Pro image sizes, system prompts, and grid generations, with the bonus of optionally allowing automatic slicing of the subimages and saving them as their own images.
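As an illustration of what that subimage slicing involves (not gemimg's actual implementation; `grid_2x2.png` is a hypothetical output file), a grid can be cut apart with a few lines of Pillow:

```python
from PIL import Image

def slice_grid(path, rows, cols):
    """Cut a generated grid image into its individual subimages."""
    sheet = Image.open(path)
    tile_w, tile_h = sheet.width // cols, sheet.height // rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            tiles.append(sheet.crop(box))
    return tiles

# Save each quadrant of a hypothetical 2x2 generation as its own file.
for i, tile in enumerate(slice_grid("grid_2x2.png", rows=2, cols=2)):
    tile.save(f"subimage_{i}.png")
```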
Anecdotally, when I was testing the text-generation-only capabilities of Gemini 3 Pro for real-world things such as conversational responses and agentic coding, it's not discernibly better than Gemini 2.5 Pro if at all.  ↩︎

DuckTyped 3 weeks ago

One year of keeping a tada list

A tada list, or to-done list, is where you write out what you accomplished each day. It's supposed to make you focus on things you've completed instead of focusing on how much you still need to do. Here is what my tada lists look like: I have a page for every month. Every day, I write out what I did. At the end of the month, I make a drawing in the header to show what I did that month. Here are a few of the drawings: In January, I started a Substack, made paintings for friends, and wrote up two Substack posts on security. In February, I took a CSS course and created a component library for myself. In March, I read a few books, worked on a writing app, took a trip to New York, and drafted several posts on linear algebra for this Substack. (If you're wondering where these posts are, there's a lag time between draft and publish, where I send the posts out for technical review and do a couple of rounds of rewrites).

I don't really spend much time celebrating my accomplishments. Once I accomplish something, I have a small hit of, "Yay, I did it," before moving on to, "So, what else am I going to do?" For example, when I finished my book (a three-year-long effort), I had a couple of weeks of, "Yay, I wrote a book," before this became part of my normal life, and it turned into, "Yes, I wrote a book, but what else have I done since then?" I thought the tada list would help reinforce "I did something!" but it also turned into "I was able to do this thing, because I did this other thing earlier". I'll explain with an example.

For years I have been wanting to create a set of cards with paintings of Minnesota, for family and friends. The problem: I didn't have many paintings of Minnesota, and didn't like the ones I had. So I spent 2024 learning a lot about watercolor pigments, and color mixing hundreds of greens, to figure out which greens I wanted to use in my landscapes: Then I spent the early part of 2025 doing a bunch of value studies, because my watercolors always looked faded: (Value studies are where you try to make your paintings look good using black and white only, so you're forced to work using value instead of color. It's an old exercise to improve your art). Then in the summer, I did about 50 plein air paintings of Minnesota landscapes: (Plein air = painting on location. Please admire the wide variety of greens I mixed for these paintings). Look at how much better these are: Out of those 50, I picked my top four and had cards made. Thanks to the "tada" list, it wasn't just "I made some cards", it was: Remember when I spent countless hours on color mixing? And value studies? And spent most of my summer painting outside? The payoff for all that work was these lovely cards. Test prints The final four

For a while now, I have wanted a mustache-like templating language, but with static typing. Last year, I created a parser combinator library called `tarsec` for TypeScript, and this year, I used it to write a mustache-like template language called `typestache` for myself that had static typing. I've since used both `tarsec` and `typestache` in personal projects, like this one that adds file-based routing to Express and autogenerates a client for the frontend. Part of the reason I like learning stuff is it lets me do things I couldn't do before. I think acknowledging that you CAN do something new is an important part of the learning process, but I usually skip it. The tada list helps.
Maybe the most obvious con: a tada list forces you to have an accomplishment each day so you can write it down, and this added stress to my day. Also, a year is a long time to keep it going, and I ran out of steam by the end. You can see that my handwriting gets worse as time goes on and for the last couple of months, I stopped doing the pictures. It's fun to see things on the list that I had forgotten about. For example, I had started this massive watercolor painting of the Holiday Inn in Pacifica in February, and I completely forgot about it. Will I do this next year? Maybe. I need to weigh the accomplishment part against the work it takes to keep it going. It's neat to have this artifact to look back on either way. Thanks for reading DuckTyped! A few more of the several color studies I did, including another grid of greens.

The Jolly Teapot 3 weeks ago

The Club Racer Treatment

In 2022, I wrote a post called The Lotus philosophy applied to blog design, in which I was trying to explain how the Lotus philosophy of lighter cars for improved performance could apply to web design, and to my blog in particular. I wrote: For as long as I can remember, I've been a fan of Lotus. From the Esprit featured in The Spy Who Loved Me (1977), the one in the Accolade's Test Drive video game from 1987, to my fascination with the choices made by the engineers with the 900 kg Elise (and later the Elise CR): Lotus is more than a simple car brand, it is a way to think about product design […]

The most acute observers probably noticed my mention of the Lotus Elise CR. This car is, to me at least, a fantastic example of what a company can do when driven by principles and a well laid-out order of priorities. The Elise CR, which stands for Club Racer, was basically a special edition of the regular Lotus Elise, with various modifications aimed at better handling on the track, that was lightened by about 25 kilograms compared to the base car. 1

One may think that a weight reduction of around 3% is nothing, that it doesn't matter, and that it may not influence performance that much. And to be honest with you, I don't really know. I just know that I was always fascinated by the engineering that went into saving those 25 kg out of a roughly 900 kg car. Compared to the regular Elise, the CR had its seats fitted with less padding, its floor mats were removed, it had no radio, no A/C, and even the Lotus badge on the back was a sticker instead of using the usual metal letters. The result was a car marginally faster, slightly better to drive, less comfortable, and less practical. If you planned to drive a Lotus Elise on regular roads, you'd be better off with a regular Elise. The Club Racer was prized among purists, it was a demonstration of what could be done, and I loved that it existed. 2 In its essence, the Club Racer was not about the results on paper or the weight itself, it was about the effort, the craft, and the experience. It was about giving a damn.

For a while now, I've been generally happy with this site's design, which feels very much in line with this Lotus philosophy. But there was always an itch that I couldn't ignore: a Lotus Elise was great, but what I really wanted was a Lotus Elise CR. This is why, in the past couple of… checks notes … weeks, I spent hours and hours giving the Club Racer treatment to this website, for very marginal changes. 3 Now that all of this tedious, frustrating, and abstract work is over, I don't even know how much weight I saved. Probably the equivalent of the Elise CR's 25 kg: meaningless to most, meaningful to a few. Like I said, it wasn't really about the results, but about the effort; it was about getting my hands dirty. Today, I am quite happy with the choices I made and with what I learned in the process.

To make sure my project had structure, I needed to identify my top 3 priorities, and the order they needed to come in. Obviously, weight saving was one of them, but did I really want to put it above all else? The Lotus Elise CR was about performance and driving experience, not weight saving. Weight saving was just a means to an end. For a blog like mine, the driving experience is obviously the readability, but I also wanted my site to pass the W3C validator, and keep its perfect score on PageSpeed Insights (that's the performance bit).
I ended up with priorities ordered like this:

Driving experience / Readability
Performance / W3C validation & PageSpeed Insights scores
Weight saving

I decided to stick to a serif typeface, to make this website as comfortable as possible to read, just like a page of a paperback novel would be. I have been using STIX Two Text for a while now, and I really like it: it feels a lot like Times New Roman, but improved in every way possible. Not only do I think it looks great, but it comes preinstalled on Apple devices, it is open-source, and if a visitor falls back on Times New Roman (via the browser default setting for ), the site maintains enough of the typography to make it just as nice to read: line length, line height, size rendering, etc. Also with readability in mind, I've decided to keep the automatic light/dark mode feature, along with the responsive feature for the font size, as it makes text always nicely proportioned compared to the screen size.

I certainly could have removed even more than I did, but I wanted to keep the 100 score on PageSpeed Insights and pass the W3C validator. This is why I still have a meta description, for example, and why I use a base64 format for the inline SVG used as the favicon. I kept some of the "branding" elements for good measure, even if what I feel is the visual identity of this site mainly revolves around its lightness. Even a Lotus Elise CR has a coat of paint after all. I could shave even more bytes off this site if the default browser stylesheets weren't being needlessly updated. But a Club Racer treatment is only fun when talking about weight saving, so let's get to the good stuff. This is what I removed:

Airbags: The HTML tags, as I learned that they are optional in HTML5, as are the tags: If you look at the Elements tab of the browser Web Inspector panel, both are automatically added by the browser, I think.
Floor mats: The quotation marks in most of the elements in the but also on some of the permanent links (I didn't go as far as reworking the Markdown parser of Eleventy to get rid of them in all attributes, but on the homepage and other pages, each link is now 2 bytes lighter — at least before Brotli compression and other shenanigans).
Power steering: The line height setting for headings.
Foam: The padding left and right for mobile view.
Sound isolation: A lot of unnecessary nodes in the homepage, now leaner and lighter, at the expense of extra CSS: very worth it. This includes the summaries for Blend of links posts that felt very repetitive.
Air conditioning: The little tags around the "by" of the header to make it 16% smaller. I liked you guys, but you had to go.
Radio: The highlight colour, used since 2020 on this site, mostly as the bottom border colour for links: it felt distracting and didn't work well in dark mode.
Metal logo: for headings. This CSS feature makes titles look great, but for most of them it wasn't even needed on desktop.
And a bunch of other little things that I mostly forgot (I should have kept a log). 4

To you dear readers, if you're not reading this in an RSS reader, this site won't feel any faster than before. It won't even look better. If anything, it will look slightly worse and for that, I'm sorry. Well, not really: I'm actually very happy about what has changed, and I think it will make this site easier to maintain, and easier to be proud of. On top of the weight-saving, I also worked on improving my local Eleventy setup, reducing dependencies and the number of node modules.
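If you want to put a number on this kind of weight saving, a rough check of a page's raw and compressed size takes only a few lines (a sketch using the Python standard library; Brotli figures would need the third-party brotli package):

```python
import gzip
import urllib.request

def page_weight(url):
    """Return (raw_bytes, gzip_bytes) for a page's HTML."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    return len(raw), len(gzip.compress(raw, compresslevel=9))

raw, gz = page_weight("https://example.com/")
print(f"raw: {raw} bytes, gzip -9: {gz} bytes")
```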
I've mentioned this on my Now page, but the site now compiles in 1.5 seconds on my Intel-Core-i5-powered MacBook Air, which is roughly 2–3 times faster than before. I guess it's when you have an underpowered engine that weight saving and simplification are the most noticeable. More noticeable than on the website, that's for sure. I hope that when I finally upgrade my computer, probably next March, I won't get fooled by the hugely improved chips on the newer Macs, to the point of forgetting Colin Chapman: Adding power makes you faster on the straights; subtracting weight makes you faster everywhere. Happy holidays everyone.

I found a great review here, in French. ↩︎
Lotus nowadays surely doesn't look like a brand Colin Chapman would recognise. ↩︎
I thought it would only take a couple of days, but here I am, three weeks later; this was a rather enjoyable rabbit hole. ↩︎
To help me in some of the decisions, I asked a lot of questions to ChatGPT. It sometimes gave me very useful answers, but sometimes it felt like I could have just tossed a coin instead. Also, I was starting to get very annoyed at the recurring "ah, your question is the classic dilemma between Y and Z". ↩︎

Andy Bell 3 weeks ago

Wrapping up 2025 (sort of)

I'm doing my annual wrap-up post early this year because I'm really tired and want to completely switch off for a couple of weeks. It's time to spend some quality time with my lovely family. This isn't my usual style of wrap-up. Consider this wrap-up more of a call to action than a retrospective. Let's get stuck in.

We've watched our western legacy media and politicians alike say, "nothing to see here" while Palestinians have been massacred by Israel. In fact, our government in the UK has persecuted supporters of Palestinian rights, some of whom are still on hunger strike while they are held on remand, awaiting trial. This is on top of the government's complicity in genocide of course. We've also seen (only if you look) a horrifying genocide in Sudan that continues to decimate the population. We've watched our pathetic politicians and again, legacy media give ground to the far right and thin-skinned Billionaires. There have been ample opportunities to change that direction and they have not been taken. Personal, party affiliation and corporate interests are prioritised, as always. The mainstream political parties will not get us out of this. Give money and support to actual progressive parties. In the UK, the only valid choice is the Green Party as I see it. If we don't fight back in 2026 — regardless of your political party affiliation — we're fucked. Sports team politics helps only the rich and there are more of us than them. Never forget that and stop being comfortable with the status quo. There are more of us than them.

AI — more accurately Large Language Models (LLMs) — are a disaster. Don't come at me with your mealy-mouthed "but I really enjoy it." Grow up and start being serious. Over a trillion USD has been pumped into this technology that works only some of the time and literally drives people to the point of suicide. Here, I collected some awful things throughout the year. Sorry in advance for making you furious. You'd think that people in the tech industry are smart and can see these problems and I wish I could agree. Instead we see sycophantic celebrations of this technology and continuing false claims that "this is the future" and "this is a game changer". I agree in part about the future — you can't put the LLM toothpaste back in the tube — but the bubble is not going to stay inflated. It can't possibly do that, and you'll see that fact if you just listen to people who know what they are talking about.

We have to act against this technology to reduce the damage in the long term. It is our responsibility. It's easy to call yourself an engineer but now, it's time to actually be an engineer and act on your ethical responsibilities. Here's what I'm asking people to do to take the "shine" off LLMs in the tech industry:

Constantly and consistently post when it goes wrong. This could be on your blog or social media. Post anyway.
When people (especially people who are paid to peddle this technology) post claims: challenge them to provide evidence and prove their claims. It might sound harsh, but it's long overdue that "thought leaders" in our industry are held to account for the effects of their influence.
Create a culture of shame for AI boosting. Never forgive and especially never forget those who have boosted and vocally supported this technology. Unless there are consequences, we'll continue to have hype cycles like crypto, NFTs and now, LLMs. We have the power to break the cycle of cycles in our vast numbers.
Stop paying for AI services.

Right now, it's not a fair fight, especially as the vast majority of tech media appears to be "on side" with these AI companies. We have to change that as a collective unit. Support smaller, independent tech media and above all else, let's organise.

There's been a bit of a culture of "I don't need to bother doing that because of AI" and let me tell you — from someone who has been doing this stuff for nearly 20 years — that is a dangerous position to put yourself in. No single technology has surpassed the need for personal development and genuine human intelligence. You should always be getting incrementally better at what you do. Now, what I am not saying is that you should be doing work-work out of hours.
You are not paid enough and frankly, the industry does not value you enough. Value yourself by investing your time in skills that make you happy and fulfilled. Here are some ideas:

Make yourself, and maintain, a personal website
Make random stuff that makes you happy
Find a creative outlet that you really enjoy
Find other people's creative outlets then celebrate and enjoy with them
Spend less time scrolling timelines and chasing metrics. Spend more time embracing the things that you love
Participate in smaller communities that bring you joy and support. Delete your Twitter account while you're at it

I must be clear here too. When I say improve your skills, I'm not saying you have to be designing and coding. We are humans and we have vast levels of intelligence and creativity. Our purpose is much more than coding. Embrace that in whatever form you want. Embrace art. By doing this, you're bringing back your ability to be curious, your ability to be creative and your ability to improve. It'll do wonders for the understandable feeling of helplessness too. Don't fall into the trap of chasing metrics. Write because you want to write. Paint because you want to paint. Create because you want to create. Let the art fulfil you. Don't let likes, follows and page views ruin that for you. Fight the urge to turn personal projects into a money-making and/or clout-chasing venture. You should definitely do more designing, coding and learning to improve your professional skills, but it is your boss's responsibility to give you the time and resources to improve those. If you are a boss reading this who doesn't do that: you are wrong and your staff will leave unless you change that. "AI" won't save you here.

I hope you have a restful holiday period. I want to thank everyone who has supported Piccalilli, Set Studio and my work this year. It means everything to me and next year, expect to see a lot more. I've written more about that in the Piccalilli year in review. It's rosier, I promise. Thank you to everyone that responded to my post on how hard this year has been too. I'm delighted to say that our Black Friday sales shot up and we've got some really good client work to get stuck into next year. To everyone I spoke to who's also had a really hard year, I truly hope things have picked up for you too. Let's all help and support each other in 2026, onwards.

Please make sure you rest up and spend time with the people you love and the people that make you happy. I know that's what I'm going to be doing. I've got a couple of Professional Obligations™ to do, then I'm clocking off for the year to enjoy the holidays with my family. Although I'm angry at the industry and the global situation, I feel like I'm in a much better headspace than I thought I would be in at this point. There have been lots of positives throughout the year, especially with Piccalilli. Next year will be very different for me. I want to do more making. I design so little and write so little code now and I'm starting to feel really rusty. That changes in 2026. Less spreadsheets and more CSS. I'm also going on a speaking hiatus with minimal conference attendance. The conference circuit won't miss yet another white guy. I've done way too much this year, so a year off will do me good. Anyway, let's all come back in 2026 refreshed and take these motherfuckers down.

Simon Willison 3 weeks ago

Your job is to deliver code you have proven to work

In all of the debates about the value of AI-assistance in software development there's one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers - or open source maintainers - and expects the "code review" process to handle the rest. This is rude, a waste of other people's time, and is honestly a dereliction of duty as a software developer. Your job is to deliver code you have proven to work. As software engineers we don't just crank out code - in fact these days you could argue that's what the LLMs are for. We need to deliver code that works - and we need to include proof that it works as well. Not doing that directly shifts the burden of the actual work to whoever is expected to review our code. There are two steps to proving a piece of code works. Neither is optional. The first is manual testing . If you haven't seen the code do the right thing yourself, that code doesn't work. If it does turn out to work, that's honestly just pure chance. Manual testing skills are genuine skills that you need to develop. You need to be able to get the system into an initial state that demonstrates your change, then exercise the change, then check and demonstrate that it has the desired effect. If possible I like to reduce these steps to a sequence of terminal commands which I can paste, along with their output, into a comment in the code review. Here's a recent example . Some changes are harder to demonstrate. It's still your job to demonstrate them! Record a screen capture video and add that to the PR. Show your reviewers that the change you made actually works. Once you've tested the happy path where everything works you can start trying the edge cases. Manual testing is a skill, and finding the things that break is the next level of that skill that helps define a senior engineer. The second step in proving a change works is automated testing . This is so much easier now that we have LLM tooling, which means there's no excuse at all for skipping this step. Your contribution should bundle the change with an automated test that proves the change works. That test should fail if you revert the implementation. The process for writing a test mirrors that of manual testing: get the system into an initial known state, exercise the change, assert that it worked correctly. Integrating a test harness to productively facilitate this is another key skill worth investing in. Don't be tempted to skip the manual test because you think the automated test has you covered already! Almost every time I've done this myself I've quickly regretted it. The most important trend in LLMs in 2025 has been the explosive growth of coding agents - tools like Claude Code and Codex CLI that can actively execute the code they are working on to check that it works and further iterate on any problems. To master these tools you need to learn how to get them to prove their changes work as well. This looks exactly the same as the process I described above: they need to be able to manually test their changes as they work, and they need to be able to build automated tests that guarantee the change will continue to work in the future. Since they're robots, automated tests and manual tests are effectively the same thing. They do feel a little different though. 
When I'm working on CLI tools I'll usually teach Claude Code how to run them itself so it can do one-off tests, even though the eventual automated tests will use a system like Click's CliRunner. When working on CSS changes I'll often encourage my coding agent to take screenshots when it needs to check if the change it made had the desired effect.

The good news about automated tests is that coding agents need very little encouragement to write them. If your project has tests already most agents will extend that test suite without you even telling them to do so. They'll also reuse patterns from existing tests, so keeping your test code well organized and populated with patterns you like is a great way to help your agent build testing code to your taste. Developing good taste in testing code is another of those skills that differentiates a senior engineer.

A computer can never be held accountable. That's your job as the human in the loop. Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review. That's no longer valuable. What's valuable is contributing code that is proven to work. Next time you submit a PR, make sure you've included your evidence that it works as it should.
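To make the Click example above concrete, here is a minimal sketch of the known-state / exercise / assert pattern using click.testing.CliRunner (the greet command is hypothetical, not from the post, and the test is meant to be run under pytest):

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("name")
@click.option("--shout", is_flag=True, help="Uppercase the greeting.")
def greet(name, shout):
    """Print a greeting for NAME."""
    message = f"Hello, {name}!"
    click.echo(message.upper() if shout else message)


def test_greet_shouts():
    runner = CliRunner()                                 # known initial state
    result = runner.invoke(greet, ["world", "--shout"])  # exercise the change
    assert result.exit_code == 0                         # assert the desired effect
    assert result.output == "HELLO, WORLD!\n"
```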

Martin Fowler 4 weeks ago

Fragments: December 16

Gitanjali Venkatraman does wonderful illustrations of complex subjects (which is why I was so happy to work with her on our Expert Generalists article). She has now published the latest in her series of illustrated guides: tackling the complex topic of Mainframe Modernization. In it she illustrates the history and value of mainframes, why modernization is so tricky, and how to tackle the problem by breaking it down into tractable pieces. I love the clarity of her explanations, and smile frequently at her way of enhancing her words with her quirky pictures.

❄                ❄                ❄                ❄                ❄

Gergely Orosz on social media: Unpopular opinion: Current code review tools just don't make much sense for AI-generated code. When reviewing code I really want to know:

The prompt made by the dev
What corrections the other dev made to the code
Clear marking of code AI-generated not changed by a human

Some people pushed back saying they don't (and shouldn't) care whether it was written by a human, generated by an LLM, or copy-pasted from Stack Overflow. In my view it matters a lot - because of the second vital purpose of code review. When asked why do code reviews, most people will answer the first vital purpose - quality control. We want to ensure bad code gets blocked before it hits mainline. We do this to avoid bugs and to avoid other quality issues, in particular comprehensibility and ease of change. But I hear the second vital purpose less often: code review is a mechanism to communicate and educate. If I'm submitting some sub-standard code, and it gets rejected, I want to know why so that I can improve my programming. Maybe I'm unaware of some library features, or maybe there's some project-specific standards I haven't run into yet, or maybe my naming isn't as clear as I thought it was. Whatever the reasons, I need to know in order to learn. And my employer needs me to learn, so I can be more effective. We need to know the writer of the code we review both so we can communicate our better practice to them, and to know how to improve things. With a human, it's a conversation, and perhaps some documentation if we realize we've needed to explain things repeatedly. But with an LLM it's about how to modify its context, as well as humans learning how to better drive the LLM.

❄                ❄                ❄                ❄                ❄

Wondering why I've been making a lot of posts like this recently? I explain why I've been reviving the link blog.

❄                ❄                ❄                ❄                ❄

Simon Willison describes how he uses LLMs to build disposable but useful web apps. These are the characteristics I have found to be most productive in building tools of this nature:

A single file: inline JavaScript and CSS in a single HTML file means the least hassle in hosting or distributing them, and crucially means you can copy and paste them out of an LLM response.
Avoid React, or anything with a build step. The problem with React is that JSX requires a build step, which makes everything massively less convenient. I prompt "no react" and skip that whole rabbit hole entirely.
Load dependencies from a CDN. The fewer dependencies the better, but if there's a well known library that helps solve a problem I'm happy to load it from CDNjs or jsdelivr or similar.
Keep them small. A few hundred lines means the maintainability of the code doesn't matter too much: any good LLM can read them and understand what they're doing, and rewriting them from scratch with help from an LLM takes just a few minutes.

His repository includes all these tools, together with transcripts of the chats that got the LLMs to build them.

❄                ❄                ❄                ❄                ❄

Obie Fernandez: while many engineers are underwhelmed by AI tools, some senior engineers are finding them really valuable. He feels that senior engineers have an oft-unspoken mindset, which, in conjunction with an LLM, enables the LLM to be much more valuable. Levels of abstraction and generalization problems get talked about a lot because they're easy to name. But they're far from the whole story. Other tools show up just as often in real work:

A sense for blast radius. Knowing which changes are safe to make loudly and which should be quiet and contained.
A feel for sequencing. Knowing when a technically correct change is still wrong because the system or the team isn't ready for it yet.
An instinct for reversibility. Preferring moves that keep options open, even if they look less elegant in the moment.
An awareness of social cost. Recognizing when a clever solution will confuse more people than it helps.
An allergy to false confidence. Spotting places where tests are green but the model is wrong.

❄                ❄                ❄                ❄                ❄

Emil Stenström built an HTML5 parser in Python using coding agents, using GitHub Copilot in Agent mode with Claude Sonnet 3.7. He automatically approved most commands.
It took him "a couple of months on off-hours", including at least one restart from scratch. The parser now passes all the tests in the html5lib test suite. After writing the parser, I still don't know HTML5 properly. The agent wrote it for me. I guided it when it came to API design and corrected bad decisions at the high level, but it did ALL of the gruntwork and wrote all of the code. I handled all git commits myself, reviewing code as it went in. I didn't understand all the algorithmic choices, but I understood when it didn't do the right thing.

Although he gives an overview of what happens, there's not very much information on his workflow and how he interacted with the LLM. There's certainly not enough detail here to try to replicate his approach. This is in contrast to Simon Willison (above) who has detailed links to his chat transcripts - although they are much smaller tools and I haven't looked at them properly to see how useful they are. One thing that is clear, however, is the vital need for a comprehensive test suite. Much of his work is driven by having that suite as a clear guide for him and the LLM agents. JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent. But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.

❄                                  ❄

Then Simon Willison ported the library to JavaScript: Time elapsed from project idea to finished library: about 4 hours, during which I also bought and decorated a Christmas tree with family and watched the latest Knives Out movie. One of his lessons: If you can reduce a problem to a robust test suite you can set a coding agent loop loose on it with a high degree of confidence that it will eventually succeed. I called this designing the agentic loop a few months ago. I think it's the key skill to unlocking the potential of LLMs for complex tasks.

Our experience at Thoughtworks backs this up. We've been doing a fair bit of work recently in legacy modernization (mainframe and otherwise) using AI to migrate substantial software systems. Having a robust test suite is necessary (but not sufficient) to make this work. I hope to share my colleagues' experiences on this in the coming months. But before I leave Willison's post, I should highlight his final open questions on the legalities, ethics, and effectiveness of all this - they are well worth contemplating.
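That "reduce a problem to a robust test suite" idea is easy to picture in code. Here is a heavily simplified sketch of an agentic loop driven by a test suite; `ask_agent` is a placeholder for whatever applies the coding agent's edits, not any specific tool's API:

```python
import subprocess

def run_tests():
    """Run the project's test suite and capture its output."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def agentic_loop(ask_agent, max_iterations=20):
    """Keep feeding test failures back to the agent until the suite passes."""
    for _ in range(max_iterations):
        passed, output = run_tests()
        if passed:
            return True
        ask_agent(f"The test suite is failing:\n{output}\nFix the code.")
    return False
```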
A sense for blast radius. Knowing which changes are safe to make loudly and which should be quiet and contained. A feel for sequencing. Knowing when a technically correct change is still wrong because the system or the team isn’t ready for it yet. An instinct for reversibility. Preferring moves that keep options open, even if they look less elegant in the moment. An awareness of social cost. Recognizing when a clever solution will confuse more people than it helps. An allergy to false confidence. Spotting places where tests are green but the model is wrong.
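To make the single-file characteristics quoted above concrete, here is a minimal sketch of such a tool: one HTML file, inline CSS and JavaScript, no React, no build step, no dependencies. The tool itself (a small character and word counter) is hypothetical, not one of Willison's actual tools.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Character counter</title>
  <style>
    body { font-family: sans-serif; max-width: 30em; margin: 2em auto; }
    textarea { width: 100%; height: 10em; }
  </style>
</head>
<body>
  <h1>Character counter</h1>
  <textarea id="input" placeholder="Paste text here..."></textarea>
  <p id="stats">0 characters, 0 words</p>
  <script>
    // Recount on every keystroke; everything stays in the browser.
    const input = document.getElementById("input");
    const stats = document.getElementById("stats");
    input.addEventListener("input", () => {
      const text = input.value;
      const words = text.trim() === "" ? 0 : text.trim().split(/\s+/).length;
      stats.textContent = `${text.length} characters, ${words} words`;
    });
  </script>
</body>
</html>
```

Save it as a .html file and open it in a browser: that is the entire deployment story, which is exactly why these tools are so cheap to build and to throw away.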

Simon Willison 1 month ago

JustHTML is a fascinating example of vibe engineering in action

I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It's a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming.

I didn't initially know that JustHTML had been written with AI assistance at all. The README caught my eye due to some attractive characteristics:

- It's pure Python. I like libraries that are pure Python (no C extensions or similar) because it makes them easy to use in less conventional Python environments, including Pyodide.
- "Passes all 9,200+ tests in the official html5lib-tests suite (used by browser vendors)" - this instantly caught my attention! HTML5 is a big, complicated but meticulously written specification.
- 100% test coverage. That's not something you see every day.
- CSS selector queries as a feature. I built a Python library for this many years ago and I'm always interested in seeing new implementations of that pattern.
- html5lib has been inconsistently maintained over the last few years, leaving me interested in potential alternatives.
- It's only 3,000 lines of implementation code (and another ~11,000 lines of tests).

I was out and about without a laptop so I decided to put JustHTML through its paces on my phone. I prompted Claude Code for web on my phone and had it build this Pyodide-powered HTML tool for trying it out. This was enough for me to convince myself that the core functionality worked as advertised. It's a neat piece of code!

At this point I went looking for some more background information on the library and found Emil's blog entry about it, How I wrote JustHTML using coding agents:

Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months on off-hours.

Tooling: I used plain VS Code with Github Copilot in Agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction that told it to keep working, and don't stop to ask questions. Worked well!

Emil used several different models - an advantage of working in VS Code Agent mode rather than a provider-locked coding agent like Claude Code or Codex CLI. Claude Sonnet 3.7, Gemini 3 Pro and Claude Opus all get a mention.

What's most interesting about Emil's 17-step account covering those several months of work is how much software engineering was involved, independent of typing out the actual code.

I wrote about vibe engineering a while ago as an alternative to vibe coding. Vibe coding is when you have an LLM knock out code without any semblance of code review - great for prototypes and toy projects, definitely not an approach to use for serious libraries or production code. I proposed "vibe engineering" as the grown-up version of vibe coding, where expert programmers use coding agents in a professional and responsible way to produce high quality, reliable results.

You should absolutely read Emil's account in full. A few highlights:

- He hooked in the 9,200-test html5lib-tests conformance suite almost from the start. There's no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use.
- He picked the core API design himself - a TagHandler base class with handle_start() etc. methods - and told the model to implement that.
- He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers.
- He threw the original code away and started from scratch as a rough port of Servo's excellent html5ever Rust library.
- He built a custom profiler and new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing pure-Python libraries.
- He used coverage to identify and remove unnecessary code.
- He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them.

This represents a lot of sophisticated development practices, tapping into Emil's deep experience as a software engineer. As described, this feels to me more like a lead architect role than a hands-on coder. It perfectly fits what I was thinking about when I described vibe engineering. Setting the coding agent up with the html5lib-tests suite is also a great example of designing an agentic loop.

Emil concluded his article like this:

JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent. But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking. That's probably the right division of labor.

I couldn't agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what's left to be a much more valuable use of my time.

You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.

The Jolly Teapot 1 month ago

Typefaces as clothes

After seeing this news, I have spent an unusual amount of my week thinking and reading about Times New Roman. 1

In an ocean of opinions, mine probably won’t register, but I think the Times New Roman typeface is, by itself, fine. It looks OK. It can even look good in some contexts. The problem with it is not really how it objectively looks, but how we perceive it, due to its misplaced ubiquity. As the web browser de facto default since what feels like forever, Times New Roman has indeed been used, reused, and abused in every imaginable way, to the point where we now see these documents or pages using it and automatically think that the person behind them doesn’t care in the slightest.

I thought of an analogy that works great for Times New Roman, but also for other typefaces generally in use on text-based websites like mine. Please bear with me. If typefaces were clothes, what would they be?

Times New Roman could be any piece of clothing you wear, but on which you forgot to remove the price tag. People will notice it and be embarrassed for you. If it is an accident, people will tell you. But if you always show up with price tags hanging off all of your clothes, people will stop telling you, and they will not take you as seriously as you expect: it won’t matter how good or bad you think the clothes themselves look.

On some minimalistic websites, Times New Roman, or rather the browser default, kind of works. You just have to own it and make it obvious it is a deliberate choice, fitting a bare-bones setup. It may work on some blogs, but used on a more complex or ambitious website, Times New Roman will look a bit odd, as if someone didn’t know there were far better options available.

I love Helvetica, and I think a lot of people love it too. It is also widely used, to the point where type design nerds will try to avoid it as much as possible, even if it objectively looks great. Helvetica is a suit in the 60s. It’s iconic, seemingly everywhere, and everybody who wants to look serious and professional will wear one. Hipsters will of course refuse to wear a suit to be different and edgy, but a suit is the general standard of elegance.

But there are suits, and then there are suits. If you want your suit to look great, you will need a tailor, you will need fine detailing: a quality fabric alone won’t cut it. If you wear a suit that is not adjusted to your body shape and doesn’t pair well with your shirt or your shoes, well, you might as well leave the price tag attached to it. Helvetica needs refinement to look good; it needs attention, care, and good typography; on its own it can quickly look a bit generic. When I look at typical Swiss graphic design works, what makes them look great is not Helvetica, it’s not the fonts in use, it’s how the typography is crafted, detailed, and fine-tuned, so it doesn’t just look OK, it looks fantastic.

If you’re thinking that without that tailor-made design, Helvetica is just as good as Arial, you’d be right. Except that the Arial suit, unlike the Helvetica one, is made of cheap fabric, was bought at a discount on Amazon, and a good tailor would not even want to work on it in the first place. 2

Another CSS value we see a lot on blogs is system-ui. For Apple devices, it will translate to the San Francisco typeface, for Android to Roboto, and for Windows to Segoe UI. For me, these typefaces are like clothes from the popular clothing stores. Everybody shops in them, everybody more or less follows the same fashion trends. It’s easy, affordable, comfortable, unobtrusive, inoffensive, and can even look pretty good if well-thought-out. San Francisco would be clothes from a brand like Uniqlo, and Segoe UI would be something coming from stores like Zara or H&M. Roboto would be something coming from a slightly cheaper brand like Primark, or Amazon Basics (just don’t pay attention to details). These typefaces are OK in terms of how they look, but on their own, they will look very generic, efficient, bland, and will lack personality and identity.

I’ve written about why I like monospaced fonts before. They make me think of drafts, work in progress, creativity, code, unfinished business. To me, they would be clothes like coveralls or chore jackets: functional, robust, rugged, practical, often poorly fitted. Monospaced typefaces each have different qualities, different styles, different purposes, but to the world they all look more or less the same. They will very quickly look neglected when taken out of context. It will certainly look very professional but will severely lack elegance, like that guy at the supermarket wearing a boiler suit to buy groceries.

I have now realised that I’ve opened a Pandora’s box with this topic, so I may split it into two or three posts, to avoid a three-thousand-word post that I will never finish. 3

As a final entry, I wanted to list Verdana. I have always really liked Verdana, and I would use it on all my sites if it wasn’t already so popular, especially for blogs. Verdana is very easy to read and has a nice casual look; it is a practical, all-terrain typeface that is easy to recommend since it comes installed with the most popular operating systems. So what’s not to like? Verdana is great for text, but not so great for titles. It’s good in some cases, but bad in others. That’s why I think that Verdana is like clothes made by the Levi’s brand. Obviously great for jeans and denim, but I wouldn’t wear other Levi’s clothes, and I would certainly not wear only Levi’s clothes. It may look good on you, and if you like that, go for it, but I personally won’t (and I also prefer other brands of jeans).

So that was my fun and rather entertaining train of thought this week. I’ve thought about many brands that I could map to a typeface: brands like Patagonia, or COS. Please let me know if you have any similar typeface analogies, or if you disagree with the ones I made. Full disclaimer, I’m definitely not an expert on any of this, as you can clearly see.

1. Best take on the subject is this one, hands down. ↩︎
2. This is precisely why the sans-serif CSS value is hard for me to use: on Windows it defaults to Arial, while on Apple devices it defaults to Helvetica. This alone tells you a lot about what sets these two companies apart. ↩︎
3. There are many other typefaces I want to talk about in this manner: Georgia, Calibri, Inter, IBM Plex, Futura, Avenir, etc. Am I correct in thinking of Georgia as a university professor’s brown velvet jacket? Or is it Palatino? ↩︎
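For anyone who wants to experiment with the typefaces discussed above, here is a rough CSS sketch of the corresponding font stacks. The class names are made up, and the pairings are just one reasonable reading of the post, not the author's own stylesheet.

```html
<style>
  /* Verdana: the Levi's option, great for body text. */
  body      { font-family: Verdana, Geneva, sans-serif; }

  /* system-ui: San Francisco on Apple, Roboto on Android, Segoe UI on Windows. */
  .system   { font-family: system-ui, sans-serif; }

  /* Helvetica: the suit; falls back to Arial (the cheaper suit) on Windows. */
  .swiss    { font-family: Helvetica, Arial, sans-serif; }

  /* Monospace: the boiler suit; at home in code, neglected-looking elsewhere. */
  code, pre { font-family: ui-monospace, Menlo, Consolas, monospace; }
</style>
```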

Anton Sten 1 month ago

Vibe coding for designers: my actual process

Martijn asked a great question in my community Slack the other day:

> "Have you documented your vibe coding process somewhere? I'm curious about your overall approach to creating a website like yours and what tools you use. Do you use platforms like Lovable or Cursor? How skilled are you with code? What about the backend? Have you run into issues you couldn't solve yourself?"

I haven't documented it—until now. So here's the honest breakdown of how I built and maintain antonsten.com using AI, what actually works, and where I've thrown my hands up and walked away.

## Why vibe code at all?

Let me start with the obvious question: why not just use Ghost or Framer or whatever and focus on the actual work? Fair point. And honestly, for a lot of people that's the right call.

I originally moved to Ghost because I liked that it was indie and I wanted one place for my newsletter and blog. But it turned out that setup only makes sense if your newsletter and blog content are 1:1—which isn't the case for me. I was also paying for something that felt like an ongoing expense I didn't need. So I had a practical problem: I wanted something cheaper and more flexible. And it turned out I could now build it myself.

That's the real story. I didn't set out to learn vibe coding for its own sake. I had a problem, and AI tools had gotten good enough that building my own site became a realistic option. Now I run a static Astro site deployed through Netlify. No monthly fees, full control.

But here's the honest take: if you just want to write and not think about your website, use Ghost. Use whatever gets out of your way. Vibe coding makes sense if you want control and enjoy tinkering. It's a trap if it becomes procrastination dressed up as productivity.

## Start in Figma, not in a prompt

This might be the most important part: I don't start by talking to AI. I start in Figma. I know Figma. I can move fast there. So I sketch out the scaffolding first—general theme, grids, typography, color. Maybe one or two pages. Nothing polished, just enough to know what I'm building.

Why does this matter? Because [AI will happily design the wrong thing for you](/articles/ai-will-happily-design-the-wrong-thing-for-you/). If you open Claude Code with a vague prompt and no direction, you'll get something—but it probably won't be what you needed. AI is a builder, not an architect. You still have to be the architect.

I've never been able to vibe code anything just out of the blue. I need to know the desired outcome before I start prompting. Maybe that's a personal limitation, but I suspect it's actually how this works best.

## The actual tools

Once I have the design direction, I move to VS Code with Claude Code (I used Cursor before; both work well). My first step is defining what we're building. I describe the page structure—what sections it should include—and then work piece by piece on components: newsletter signup box, callout section, button styles, that kind of thing. High-level first, then smaller chunks.

For hosting, I use Astro as a static site generator, connected to GitHub and deployed through Netlify. No database, no CMS. When I want to publish a new blog post, I literally ask Claude to do it and push to GitHub. Yes, that's unconventional. But it works for a site like mine where I'm the only editor.

## My coding skills (or lack thereof)

I did some coding in the late 90s. That's roughly where my skills are today. I understand HTML and CSS well enough to read what's happening, but I couldn't build anything from scratch myself.

What this means in practice: I can make tweaks. Adjusting margins, changing typography, small fixes—I can do that by hand. But the heavy lifting? That's Claude. Understanding code, even at a basic level, helps. You don't need to be a developer, but knowing what you're looking at makes the back-and-forth much smoother.

## Even the writing

Here's something that might surprise you: I use a similar process for writing these posts. I collect my initial thoughts—rough ideas, half-formed points—and then work with Claude by having it ask me clarifying questions. This helps me think through aspects I might have missed, while making sure the final piece reflects my thinking. Because just like AI will happily design the wrong thing for you, it'll happily write the wrong thing too. Once the post is ready, I paste it into VS Code and ask Claude to publish it. Same workflow, different output.

## The iteration never stops

The first version is never good enough. Never. I still make tweaks to my site on a weekly basis. Partly because things break or look off in certain contexts, but mostly because I treat the site as a playground. It's where I experiment, try new things, and occasionally break something at 11pm and have to fix it before bed. This is actually one of the joys of this approach—the site becomes a living thing you can tinker with instead of a finished artifact you're afraid to touch.

## Where I've hit walls

Let's be honest: I run into problems I can't solve all the time. Sometimes I just can't get the prompt right. I'll describe what I want, Claude will build something close but not quite, I'll try to clarify, and we'll go in circles until I give up.

Animations and transitions have been particularly brutal. One example: I spent hours trying to get an image to unmask on scroll. Simple enough concept, right? Instead I got the entire page unmasking. Then the image being static while everything else moved. Then something that technically worked but looked terrible. Eventually I just... didn't do it. Moved on. The site survived.

When I get stuck, I'll sometimes ask Claude (or ChatGPT) to explain what's happening rather than just fix it. That helps me learn. Other times I accept that this particular feature isn't happening today.

## What this means for designers

Any designer can do this. The tools are accessible, the learning curve is manageable, and you don't need to become a developer. But—and this is important—you still need design thinking and systems thinking. AI handles the syntax, but you need to know what you're building, why you're building it, and how the pieces fit together. The hard part was never the code. The hard part is the decisions.

So yes, designers should embrace AI. But let's keep designing. Be explicit about what you're building before you ask AI to build it. If you're curious but haven't tried it, start small. A landing page. A personal project. Something where the stakes are low and you can experiment.
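For readers curious what the Astro-plus-Netlify setup described above looks like in code, here is a hypothetical minimal page. Astro components are essentially HTML with an optional frontmatter block; Astro builds them to plain static HTML, which Netlify then serves. The file name and content are made up for illustration.

```astro
---
// src/pages/index.astro -- Astro maps files in src/pages/ to routes,
// so this becomes the site's home page at build time.
const title = "My site";
---
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>{title}</title>
  </head>
  <body>
    <h1>{title}</h1>
    <p>Articles, notes, and experiments live here.</p>
  </body>
</html>
```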

Simon Willison 1 month ago

Useful patterns for building HTML tools

I've started using the term HTML tools to refer to HTML applications that I've been building which combine HTML, JavaScript, and CSS in a single file and use them to provide useful functionality. I have built over 150 of these in the past two years, almost all of them written by LLMs. This article presents a collection of useful patterns I've discovered along the way. First, some examples to show the kind of thing I'm talking about: These are some of my recent favorites. I have dozens more like this that I use on a regular basis. You can explore my collection on tools.simonwillison.net - the by month view is useful for browsing the entire collection. If you want to see the code and prompts, almost all of the examples in this post include a link in their footer to "view source" on GitHub. The GitHub commits usually contain either the prompt itself or a link to the transcript used to create the tool. These are the characteristics I have found to be most productive in building tools of this nature: The end result is a few hundred lines of code that can be cleanly copied and pasted into a GitHub repository. The easiest way to build one of these tools is to start in ChatGPT or Claude or Gemini. All three have features where they can write a simple HTML+JavaScript application and show it to you directly. Claude calls this "Artifacts", ChatGPT and Gemini both call it "Canvas". Claude has the feature enabled by default, ChatGPT and Gemini may require you to toggle it on in their "tools" menus. Try this prompt in Gemini or ChatGPT: Or this prompt in Claude: I always add "No React" to these prompts, because otherwise they tend to build with React, resulting in a file that is harder to copy and paste out of the LLM and use elsewhere. I find that attempts which use React take longer to display (since they need to run a build step) and are more likely to contain crashing bugs for some reason, especially in ChatGPT. All three tools have "share" links that provide a URL to the finished application. Examples: Coding agents such as Claude Code and Codex CLI have the advantage that they can test the code themselves while they work on it using tools like Playwright. I often upgrade to one of those when I'm working on something more complicated, like my Bluesky thread viewer tool shown above. I also frequently use asynchronous coding agents like Claude Code for web to make changes to existing tools. I shared a video about that in Building a tool to copy-paste share terminal sessions using Claude Code for web . Claude Code for web and Codex Cloud run directly against my simonw/tools repo, which means they can publish or upgrade tools via Pull Requests (here are dozens of examples ) without me needing to copy and paste anything myself. Any time I use an additional JavaScript library as part of my tool I like to load it from a CDN. The three major LLM platforms support specific CDNs as part of their Artifacts or Canvas features, so often if you tell them "Use PDF.js" or similar they'll be able to compose a URL to a CDN that's on their allow-list. Sometimes you'll need to go and look up the URL on cdnjs or jsDelivr and paste it into the chat. CDNs like these have been around for long enough that I've grown to trust them, especially for URLs that include the package version. The alternative to CDNs is to use npm and have a build step for your projects. I find this reduces my productivity at hacking on individual tools and makes it harder to self-host them. 
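As a sketch of what this looks like in practice, here is a hypothetical single-file tool (not one from the collection) that loads one pinned dependency, the marked Markdown library, from cdnjs. The exact URL and version are illustrative; check cdnjs or jsDelivr for the current ones.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Markdown preview</title>
  <!-- One pinned CDN dependency, no build step. -->
  <script src="https://cdnjs.cloudflare.com/ajax/libs/marked/12.0.0/marked.min.js"></script>
</head>
<body>
  <textarea id="src" rows="8" cols="60"># Hello from a single-file tool</textarea>
  <div id="out"></div>
  <script>
    // Re-render the preview on every edit using the CDN-loaded library.
    // Fine for a personal tool; sanitize the output if the input is untrusted.
    const src = document.getElementById("src");
    const out = document.getElementById("out");
    function render() { out.innerHTML = marked.parse(src.value); }
    src.addEventListener("input", render);
    render();
  </script>
</body>
</html>
```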
I don't like leaving my HTML tools hosted by the LLM platforms themselves for a couple of reasons. First, LLM platforms tend to run the tools inside a tight sandbox with a lot of restrictions. They're often unable to load data or images from external URLs, and sometimes even features like linking out to other sites are disabled. The end-user experience often isn't great either. They show warning messages to new users, often take additional time to load and delight in showing promotions for the platform that was used to create the tool. They're also not as reliable as other forms of static hosting. If ChatGPT or Claude are having an outage I'd like to still be able to access the tools I've created in the past. Being able to easily self-host is the main reason I like insisting on "no React" and using CDNs for dependencies - the absence of a build step makes hosting tools elsewhere a simple case of copying and pasting them out to some other provider. My preferred provider here is GitHub Pages because I can paste a block of HTML into a file on github.com and have it hosted on a permanent URL a few seconds later. Most of my tools end up in my simonw/tools repository which is configured to serve static files at tools.simonwillison.net . One of the most useful input/output mechanisms for HTML tools comes in the form of copy and paste . I frequently build tools that accept pasted content, transform it in some way and let the user copy it back to their clipboard to paste somewhere else. Copy and paste on mobile phones is fiddly, so I frequently include "Copy to clipboard" buttons that populate the clipboard with a single touch. Most operating system clipboards can carry multiple formats of the same copied data. That's why you can paste content from a word processor in a way that preserves formatting, but if you paste the same thing into a text editor you'll get the content with formatting stripped. These rich copy operations are available in JavaScript paste events as well, which opens up all sorts of opportunities for HTML tools. The key to building interesting HTML tools is understanding what's possible. Building custom debugging tools is a great way to explore these options. clipboard-viewer is one of my most useful. You can paste anything into it (text, rich text, images, files) and it will loop through and show you every type of paste data that's available on the clipboard. This was key to building many of my other tools, because it showed me the invisible data that I could use to bootstrap other interesting pieces of functionality. More debugging examples: HTML tools may not have access to server-side databases for storage but it turns out you can store a lot of state directly in the URL. I like this for tools I may want to bookmark or share with other people. The localStorage browser API lets HTML tools store data persistently on the user's device, without exposing that data to the server. I use this for larger pieces of state that don't fit comfortably in a URL, or for secrets like API keys which I really don't want anywhere near my server - even static hosts might have server logs that are outside of my influence. CORS stands for Cross-origin resource sharing . It's a relatively low-level detail which controls if JavaScript running on one site is able to fetch data from APIs hosted on other domains. APIs that provide open CORS headers are a goldmine for HTML tools. It's worth building a collection of these over time. 
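Before getting to specific APIs, here is a minimal sketch of the two state patterns above: shareable state lives in the URL hash, while secrets stay in localStorage. The element id and storage key are made up for the example.

```html
<!DOCTYPE html>
<html>
<body>
  <textarea id="notes" rows="6" cols="60"></textarea>
  <script>
    // Restore shareable state from the URL on load, so bookmarks and shared
    // links reproduce the same view.
    const notes = document.getElementById("notes");
    const params = new URLSearchParams(location.hash.slice(1));
    notes.value = params.get("notes") || "";

    // Write state back into the hash on every change; nothing goes to a server.
    notes.addEventListener("input", () => {
      params.set("notes", notes.value);
      history.replaceState(null, "", "#" + params.toString());
    });

    // Secrets such as API keys go in localStorage instead: too long and too
    // sensitive to live in a URL that might be shared or logged.
    function getApiKey() {
      let key = localStorage.getItem("demo-api-key");
      if (!key) {
        key = prompt("Paste your API key (stored only in this browser):");
        if (key) localStorage.setItem("demo-api-key", key);
      }
      return key;
    }
  </script>
</body>
</html>
```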
As for CORS-enabled APIs, here are some I like (the full list appears near the end of this post): GitHub Gists are a personal favorite here, because they let you build apps that can persist state to a permanent Gist through making a cross-origin API call.

All three of OpenAI, Anthropic and Gemini offer JSON APIs that can be accessed via CORS directly from HTML tools. Unfortunately you still need an API key, and if you bake that key into your visible HTML anyone can steal it and use it to rack up charges on your account. I use the secrets pattern to store API keys for these services. This sucks from a user experience perspective - telling users to go and create an API key and paste it into a tool is a lot of friction - but it does work. Some examples are listed at the end of this post.

You don't need to upload a file to a server in order to make use of the <input type="file"> element. JavaScript can access the content of that file directly, which opens up a wealth of opportunities for useful functionality. Some examples are listed at the end of this post.

An HTML tool can generate a file for download without needing help from a server. The JavaScript library ecosystem has a huge range of packages for generating files in all kinds of useful formats.

Pyodide is a distribution of Python that's compiled to WebAssembly and designed to run directly in browsers. It's an engineering marvel and one of the most underrated corners of the Python world. It also cleanly loads from a CDN, which means there's no reason not to use it in HTML tools! Even better, the Pyodide project includes micropip - a mechanism that can load extra pure-Python packages from PyPI via CORS.

Pyodide is possible thanks to WebAssembly. WebAssembly means that a vast collection of software originally written in other languages can now be loaded in HTML tools as well. Squoosh.app was the first example I saw that convinced me of the power of this pattern - it makes several best-in-class image compression libraries available directly in the browser. I've used WebAssembly for a few of my own tools, also listed at the end of this post.
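To make the Pyodide pattern concrete, here is a hypothetical minimal page that loads Pyodide from a CDN and runs a line of Python entirely in the browser. The version number and CDN URL are illustrative; check pyodide.org for the current release.

```html
<!DOCTYPE html>
<html>
<body>
  <pre id="out">Loading Python...</pre>
  <script src="https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.js"></script>
  <script type="module">
    // loadPyodide() fetches the WebAssembly runtime; runPython() returns the
    // value of the last expression, much like a Python REPL.
    const pyodide = await loadPyodide();
    const result = pyodide.runPython(`
import sys
f"Hello from Python {sys.version.split()[0]} running in your browser"
`);
    document.getElementById("out").textContent = result;
  </script>
</body>
</html>
```

From here, micropip can pull additional pure-Python packages from PyPI, which is presumably how a pure-Python library like JustHTML can be tried out without any server-side Python at all.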
I've had so much fun exploring the capabilities of LLMs in this way over the past year and a half, and building tools in this way has been invaluable in helping me understand both the potential for building tools with HTML and the capabilities of the LLMs that I'm building them with. If you're interested in starting your own collection I highly recommend it! All you need to get started is a free GitHub repository with GitHub Pages enabled (Settings -> Pages -> Source -> Deploy from a branch -> main) and you can start copying in pages generated in whatever manner you like. Bonus transcript : Here's how I used Claude Code and shot-scraper to add the screenshots to this post. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . svg-render renders SVG code to downloadable JPEGs or PNGs pypi-changelog lets you generate (and copy to clipboard) diffs between different PyPI package releases. bluesky-thread provides a nested view of a discussion thread on Bluesky. The anatomy of an HTML tool Prototype with Artifacts or Canvas Switch to a coding agent for more complex projects Load dependencies from CDNs Host them somewhere else Take advantage of copy and paste Build debugging tools Persist state in the URL Use localStorage for secrets or larger state Collect CORS-enabled APIs LLMs can be called directly via CORS Don't be afraid of opening files You can offer downloadable files too Pyodide can run Python code in the browser WebAssembly opens more possibilities Remix your previous tools Record the prompt and transcript Go forth and build A single file: inline JavaScript and CSS in a single HTML file means the least hassle in hosting or distributing them, and crucially means you can copy and paste them out of an LLM response. Avoid React, or anything with a build step. The problem with React is that JSX requires a build step, which makes everything massively less convenient. I prompt "no react" and skip that whole rabbit hole entirely. Load dependencies from a CDN. The fewer dependencies the better, but if there's a well known library that helps solve a problem I'm happy to load it from CDNjs or jsdelivr or similar. Keep them small. A few hundred lines means the maintainability of the code doesn't matter too much: any good LLM can read them and understand what they're doing, and rewriting them from scratch with help from an LLM takes just a few minutes. ChatGPT JSON to YAML Canvas made with GPT-5.1 Thinking - here's the full ChatGPT transcript Claude JSON to YAML Artifact made with Claude Opus 4.5 - here's the full Claude transcript Gemini JSON to YAML Canvas made with Gemini 3 Pro - here's the full Gemini transcript hacker-news-thread-export lets you paste in a URL to a Hacker News thread and gives you a copyable condensed version of the entire thread, suitable for pasting into an LLM to get a useful summary. paste-rich-text lets you copy from a page and paste to get the HTML - particularly useful on mobile where view-source isn't available. alt-text-extractor lets you paste in images and then copy out their alt text. keyboard-debug shows the keys (and values) currently being held down. cors-fetch reveals if a URL can be accessed via CORS. exif displays EXIF data for a selected photo. icon-editor is a custom 24x24 icon editor I built to help hack on icons for the GitHub Universe badge . It persists your in-progress icon design in the URL so you can easily bookmark and share it. 
word-counter is a simple tool I built to help me write to specific word counts, for things like conference abstract submissions. It uses localStorage to save as you type, so your work isn't lost if you accidentally close the tab. render-markdown uses the same trick - I sometimes use this one to craft blog posts and I don't want to lose them. haiku is one of a number of LLM demos I've built that request an API key from the user (via the function) and then store that in . This one uses Claude Haiku to write haikus about what it can see through the user's webcam. iNaturalist for fetching sightings of animals, including URLs to photos PyPI for fetching details of Python packages GitHub because anything in a public repository in GitHub has a CORS-enabled anonymous API for fetching that content from the raw.githubusercontent.com domain, which is behind a caching CDN so you don't need to worry too much about rate limits or feel guilty about adding load to their infrastructure. Bluesky for all sorts of operations Mastodon has generous CORS policies too, as used by applications like phanpy.social species-observation-map uses iNaturalist to show a map of recent sightings of a particular species. zip-wheel-explorer fetches a file for a Python package from PyPI, unzips it (in browser memory) and lets you navigate the files. github-issue-to-markdown fetches issue details and comments from the GitHub API (including expanding any permanent code links) and turns them into copyable Markdown. terminal-to-html can optionally save the user's converted terminal session to a Gist. bluesky-quote-finder displays quotes of a specified Bluesky post, which can then be sorted by likes or by time. haiku uses the Claude API to write a haiku about an image from the user's webcam. openai-audio-output generates audio speech using OpenAI's GPT-4o audio API. gemini-bbox demonstrates Gemini 2.5's ability to return complex shaped image masks for objects in images, see Image segmentation using Gemini 2.5 . ocr is the first tool I built for my collection, described in Running OCR against PDFs and images directly in your browser . It uses and to allow users to open a PDF in their browser which it then converts to an image-per-page and runs through OCR. social-media-cropper lets you open (or paste in) an existing image and then crop it to common dimensions needed for different social media platforms - 2:1 for Twitter and LinkedIn, 1.4:1 for Substack etc. ffmpeg-crop lets you open and preview a video file in your browser, drag a crop box within it and then copy out the command needed to produce a cropped copy on your own machine. svg-render lets the user download the PNG or JPEG rendered from an SVG. social-media-cropper does the same for cropped images. open-sauce-2025 is my alternative schedule for a conference that includes a downloadable ICS file for adding the schedule to your calendar. See Vibe scraping and vibe coding a schedule app for Open Sauce 2025 entirely on my phone for more on that project. pyodide-bar-chart demonstrates running Pyodide, Pandas and matplotlib to render a bar chart directly in the browser. numpy-pyodide-lab is an experimental interactive tutorial for Numpy. apsw-query demonstrates the APSW SQLite library running in a browser, using it to show EXPLAIN QUERY plans for SQLite queries. ocr uses the pre-existing Tesseract.js WebAssembly port of the Tesseract OCR engine. sloccount is a port of David Wheeler's Perl and C SLOCCount utility to the browser, using a big ball of WebAssembly duct tape. 
More details here . micropython is my experiment using @micropython/micropython-webassembly-pyscript from NPM to run Python code with a smaller initial download than Pyodide.
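Finally, a small sketch of the CORS pattern from earlier: fetching package metadata straight from PyPI's JSON API, which the list above names as one of the CORS-friendly endpoints. The package name is just an example.

```html
<script type="module">
  // No backend needed: the browser fetches cross-origin JSON directly.
  const pkg = "html5lib";
  const resp = await fetch(`https://pypi.org/pypi/${pkg}/json`);
  if (!resp.ok) throw new Error(`PyPI returned ${resp.status}`);
  const data = await resp.json();
  // The JSON response exposes name, version and summary under the "info" key.
  console.log(`${data.info.name} ${data.info.version}: ${data.info.summary}`);
</script>
```

Drop a snippet like this into any of the single-file tools above and the result is a working API client with no server, no build step and no dependencies.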
