Posts in Programming (20 found)

Porting MiniJinja to Go With an Agent

Turns out you can just port things now. I already attempted this experiment in the summer, but it turned out to be a bit too much for what I had time for. However, things have advanced since. Yesterday I ported MiniJinja (a Rust Jinja2 template engine) to native Go, and I used an agent to do pretty much all of the work. In fact, I barely did anything beyond giving some high-level guidance on how I thought it could be accomplished. In total I probably spent around 45 minutes actively with it. It worked for around 3 hours while I was watching, then another 7 hours alone. This post is a recollection of what happened and what I learned from it. All prompting was done by voice using pi, starting with Opus 4.5 and switching to GPT-5.2 Codex for the long tail of test fixing.

MiniJinja is a re-implementation of Jinja2 for Rust. I originally wrote it because I wanted to do an infrastructure automation project in Rust and Jinja was popular for that. The original project didn't go anywhere, but MiniJinja itself continued being useful for both me and other users. The way MiniJinja is tested is with snapshot tests: inputs and expected outputs, using insta to verify they match. These snapshot tests were what I wanted to use to validate the Go port.

My initial prompt asked the agent to figure out how to validate the port. Through that conversation, the agent and I aligned on a path: reuse the existing Rust snapshot tests and port incrementally (lexer -> parser -> runtime). This meant the agent built Go-side tooling to:

- Parse Rust's test input files (which embed settings as JSON headers).
- Parse the reference insta snapshots and compare output.
- Maintain a skip-list to temporarily opt out of failing tests.

This resulted in a pretty good harness with a tight feedback loop. The agent had a clear goal (make everything pass) and a progression (lexer -> parser -> runtime). The tight feedback loop mattered particularly at the end where it was about getting details right. Every missing behavior had one or more failing snapshots.

I used Pi's branching feature to structure the session into phases. I rewound back to earlier parts of the session and used the branch switch feature to inform the agent automatically what it had already done. This is similar to compaction, but Pi shows me what it puts into the context. When Pi switches branches it does two things:

- It stays in the same session so I can navigate around, but it makes a new branch off an earlier message.
- When switching, it adds a summary of what it did as a priming message into where it branched off. I found this quite helpful to avoid the agent doing vision quests from scratch to figure out how far it had already gotten.

Without switching branches, I would probably just make new sessions and have more plan files lying around, or use something like Amp's handoff feature which also allows the agent to consult earlier conversations if it needs more information.

What was interesting is that the agent went from literal porting to behavioral porting quite quickly. I didn't steer it away from this as long as the behavior aligned. I let it do this for a few reasons. First, the code base isn't that large, so I felt I could make adjustments at the end if needed. Letting the agent continue with what was already working felt like the right strategy. Second, it was aligning to idiomatic Go much better this way. For instance, on the runtime it implemented a tree-walking interpreter (not a bytecode interpreter like Rust) and it decided to use Go's reflection for the value type. I didn't tell it to do either of these things, but they made more sense than replicating my Rust interpreter design, which was partly motivated by not having a garbage collector or runtime type information.

On the other hand, the agent made some changes while making tests pass that I disagreed with. It completely gave up on all the "must fail" tests because the error messages were impossible to replicate perfectly given the runtime differences. So I had to steer it towards fuzzy matching instead. It also wanted to regress behavior I wanted to retain (e.g., exact HTML escaping semantics, or that must return an iterator). I think if I hadn't steered it there, it might not have made it to completion without going down problematic paths, or I would have lost confidence in the result.

Once the major semantic mismatches were fixed, the remaining work was filling in all missing pieces: missing filters and test functions, loop extras, macros, call blocks, etc. Since I wanted to go to bed, I switched to Codex 5.2 and queued up a few "continue making all tests pass if they are not passing yet" prompts, then let it work through compaction. I felt confident enough that the agent could make the rest of the tests pass without guidance once it had the basics covered. This phase ran without supervision overnight.

After functional convergence, I asked the agent to document internal functions and reorganize (like moving filters to a separate file). I also asked it to document all functions and filters like in the Rust code base. This was also when I set up CI, release processes, and talked through what was created to come up with some finalizing touches before merging.

There are a few things I find interesting here.

First: these types of ports are possible now. I know porting was already possible for many months, but it required much more attention. This changes some dynamics. I feel less like technology choices are constrained by ecosystem lock-in. Sure, porting NumPy to Go would be a more involved undertaking, and getting it competitive even more so (years of optimizations in there). But still, it feels like many more libraries can be used now.

Second: for me, the value is shifting from the code to the tests and documentation. A good test suite might actually be worth more than the code. That said, this isn't an argument for keeping tests secret — generating tests with good coverage is also getting easier. However, for keeping code bases in different languages in sync, you need to agree on shared tests, otherwise divergence is inevitable.

Lastly, there's the social dynamic. Once, having people port your code to other languages was something to take pride in. It was a sign of accomplishment — a project was "cool enough" that someone put time into making it available elsewhere. With agents, it doesn't invoke the same feelings. Will McGugan also called out this change.

Lastly, some boring stats for the main session:

- Agent run duration: 10 hours (3 hours supervised)
- Active human time: ~45 minutes
- Total messages: 2,698
- My prompts: 34
- Tool calls: 1,386
- Raw API token cost: $60
- Total tokens: 2.2 million
- Models: Opus 4.5, and GPT-5.2 Codex for the unattended overnight run

This did not count the adding of doc strings and smaller fixups.

Pi session transcript
Narrated video of the porting session
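To make the harness idea concrete, here is a minimal Go sketch of the kind of snapshot loop described above. The file locations, the JSON-header-plus-separator convention, the skip-list format, and the render function are illustrative assumptions, not the actual layout or API of the MiniJinja repositories.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// render stands in for the ported template engine; the real entry point
// of the Go port will look different.
func render(source string, settings map[string]any) (string, error) {
	return source, nil // placeholder
}

// loadSkipList reads one test name per line; names listed here are
// temporarily excluded while the port is incomplete.
func loadSkipList(path string) map[string]bool {
	skip := map[string]bool{}
	data, err := os.ReadFile(path)
	if err != nil {
		return skip
	}
	for _, line := range strings.Split(string(data), "\n") {
		if line = strings.TrimSpace(line); line != "" {
			skip[line] = true
		}
	}
	return skip
}

func main() {
	skip := loadSkipList("skiplist.txt")
	inputs, _ := filepath.Glob("tests/inputs/*.txt")
	for _, in := range inputs {
		name := filepath.Base(in)
		if skip[name] {
			continue
		}
		raw, _ := os.ReadFile(in)
		// Assumed convention: an optional JSON settings header, a separator
		// line, then the template source itself.
		parts := strings.SplitN(string(raw), "\n---\n", 2)
		settings := map[string]any{}
		template := parts[0]
		if len(parts) == 2 {
			_ = json.Unmarshal([]byte(parts[0]), &settings)
			template = parts[1]
		}
		got, err := render(template, settings)
		if err != nil {
			got = "ERROR: " + err.Error()
		}
		want, _ := os.ReadFile("tests/snapshots/" + name + ".snap")
		if got != strings.TrimRight(string(want), "\n") {
			fmt.Printf("FAIL %s\n", name)
		} else {
			fmt.Printf("PASS %s\n", name)
		}
	}
}
```

The important property is the one the post calls out: every missing behavior shows up as a concrete failing snapshot, and the skip-list lets the agent defer failures without losing track of them.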


LoopFrog: In-Core Hint-Based Loop Parallelization

LoopFrog: In-Core Hint-Based Loop Parallelization
Marton Erdos, Utpal Bora, Akshay Bhosale, Bob Lytton, Ali M. Zaidi, Alexandra W. Chadwick, Yuxin Guo, Giacomo Gabrielli, and Timothy M. Jones
MICRO'25

To my Kanagawa pals: I think hardware like this would make a great target for Kanagawa, what do you think?

The message of this paper is that there is plenty of loop-level parallelism available which superscalar cores are not yet harvesting. Fig. 1 illustrates the classic motivation for multi-core processors: scaling the processor width by 4x yields only a 2x IPC improvement. In general, wider cores are heavily underutilized.

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

The main idea behind LoopFrog is to add hints to the ISA which allow a wide core to exploit more loop-level parallelism in sequential code.

Structured Loops

If you understand Fig. 2, then you understand LoopFrog; the rest is just details:

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

The compiler emits instructions which the processor can use to understand the structure of a loop. Processors are free to ignore the hints. A loop which can be optimized by LoopFrog comprises three sections:

- A header, which launches each loop iteration
- A body, which accepts values from the header
- A continuation, which computes values needed for the next loop iteration (e.g., the value of induction variables).

Each execution of the header launches two threadlets. A threadlet is like a thread but is only ever executed on the core which launched it. One threadlet launched by the header executes the body of the loop. The other threadlet launched by the header is the continuation, which computes values needed for the next loop iteration. Register loop-carried dependencies are allowed between the header and continuation, but not between body invocations. That is the key which allows multiple bodies to execute in parallel (see Fig. 2c above).

At any one time, there is one architectural threadlet (the oldest one), which can update architectural state. All other threadlets are speculative. Once the architectural threadlet for loop iteration i completes, it hands the baton over to the threadlet executing iteration i+1, which becomes architectural. Dependencies through memory are handled by the speculative state buffer (SSB). When a speculative threadlet executes a memory store, data is stored in the SSB and actually written to memory later on (i.e., after that threadlet is no longer speculative). Memory loads read from both the L1 cache and the SSB, and then disambiguation hardware determines which data to use and which to ignore. The hardware implementation evaluated by the paper does not support nested parallelization; it simply ignores hints inside of nested loops.

Fig. 6 shows simulated performance results for an 8-wide core. A core which supports 4 threadlets is compared against a baseline which does not implement LoopFrog.

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

LoopFrog can improve performance by about 10%. Fig. 1 at the top shows that an 8-wide core experiences about 25% utilization, so there may be more fruit left to pick.
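As a software analogy for the structure the hints describe, the decomposition looks like this: the body of iteration i depends only on values handed to it by the header, never on a previous body, while the continuation carries the induction variable to the next iteration. LoopFrog itself operates on machine code inside a single core with speculative threadlets, so this Go sketch is purely illustrative.

```go
package main

import "fmt"

// The same loop written in the shape LoopFrog's hints describe:
// the header launches one body per iteration; the continuation
// advances the induction variable for the next iteration.
func processStructured(in, out []int) {
	i := 0 // register state owned by the header/continuation pair
	for i < len(in) {
		// header: capture the values this iteration's body needs
		idx, val := i, in[i]

		// body: depends only on (idx, val), never on a previous body,
		// so several bodies could run speculatively in parallel
		out[idx] = val * val

		// continuation: compute values needed by the *next* iteration
		i = idx + 1
	}
}

func main() {
	in := []int{1, 2, 3, 4}
	out := make([]int, len(in))
	processStructured(in, out)
	fmt.Println(out) // [1 4 9 16]
}
```

Stores such as the write to out[idx] are exactly the memory dependencies the paper hands off to the speculative state buffer.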

Kev Quirk Yesterday

Linux in the Air

Sal talks about how Linux is going through somewhat of a revival at the moment, as well as some of his own thoughts on the whole Mac vs Windows vs Linux debacle. Read Post →

I think a lot of this Linux revival is thanks to a perfect storm going on in the OS space, namely:

- Microsoft forcing many users to buy new hardware because of arbitrary hardware requirements, as well as forcing users to have an online account.
- Apple completely screwing up MacOS Tahoe with their Liquid Glass update.

I've been back on Linux (specifically Ubuntu) since I bought my Framework 13, and I've been very happy. The only issues I've really had are with some apps being blurry under Wayland, but I've been able to easily work around these issues. Sal has had some similar problems with Wayland, but has also managed to work around them. My son also runs Linux on his iMac, and I'm about to replace Windows 10 on my wife's X1 Carbon with Ubuntu too. So we're going to be a Linux household very soon.

And you know what? It's fine. My son doesn't know (or care) that he's running Linux. My wife will be in the same boat - as long as she can check her emails, browse the web, and manage our finances in a spreadsheet, she's good.

Linux based operating systems are great, and I'm thrilled they're going through this revival. If you're thinking about switching, I'd implore you to do so - remember you can always try before you "buy" with a live USB. So there's no commitment required. If you do switch, please remember to donate to your distro of choice. ❤

Simon Willison 2 days ago

First impressions of Claude Cowork, Anthropic's general agent

New from Anthropic today is Claude Cowork, a "research preview" that they describe as "Claude Code for the rest of your work". It's currently available only to Max subscribers ($100 or $200 per month plans) as part of the updated Claude Desktop macOS application.

I've been saying for a while now that Claude Code is a "general agent" disguised as a developer tool. It can help you with any computer task that can be achieved by executing code or running terminal commands... which covers almost anything, provided you know what you're doing with it! What it really needs is a UI that doesn't involve the terminal and a name that doesn't scare away non-developers. "Cowork" is a pretty solid choice on the name front!

The interface for Cowork is a new tab in the Claude desktop app, called Cowork. It sits next to the existing Chat and Code tabs. It looks very similar to the desktop interface for regular Claude Code. You start with a prompt, optionally attaching a folder of files. It then starts work. I tried it out against my perpetually growing "blog-drafts" folder with the following prompt:

Look at my drafts that were started within the last three months and then check that I didn't publish them on simonwillison.net using a search against content on that site and then suggest the ones that are most close to being ready

It started by running this command:

That path instantly caught my eye. Anthropic say that Cowork can only access files you grant it access to - it looks to me like they're mounting those files into a containerized environment, which should mean we can trust Cowork not to be able to access anything outside of that sandbox.

It turns out I have 46 draft files from the past three months. Claude then went to work with its search tool, running 44 individual searches against simonwillison.net to figure out which of my drafts had already been published. Here's the eventual reply:

Based on my analysis, here are your unpublished drafts that appear closest to being ready for publication:

🔥 Most Ready to Publish (substantial content, not yet published)

- "Frequently Argued Questions about LLMs" (22,602 bytes) - This is a meaty piece documenting common arguments about LLMs with your counterpoints. Well-structured with a TL;DR and multiple sections. No matching published article found on your site. Very close to ready - just needs a final review pass.
- "Claude Code Timeline and Codex Timeline" (3,075 bytes) - About viewing JSONL session logs from Claude Code and Codex. You published on Dec 25, but this appears to be a different/earlier piece about timeline viewing tools. Shorter but seems complete.
- Plugin Upgrade Guide (3,147 bytes) - Technical guide for plugin authors. You published the main 1.0a20 announcement but this companion upgrade guide appears unpublished. Would be valuable for plugin maintainers.

That's a good response! It found exactly what I needed to see, although those upgrade instructions are actually published elsewhere now (in the Datasette docs) and weren't actually intended for my blog.

Just for fun, and because I really like artifacts, I asked for a follow-up:

Make me an artifact with exciting animated encouragements to get me to do it

Here's what I got:

I couldn't figure out how to close the right sidebar so the artifact ended up cramped into a thin column but it did work. I expect Anthropic will fix that display bug pretty quickly.

I've seen a few people ask what the difference between this and regular Claude Code is. The answer is not a lot. As far as I can tell Claude Cowork is regular Claude Code wrapped in a less intimidating default interface and with a filesystem sandbox configured for you without you needing to know what a "filesystem sandbox" is.

Update: It's more than just a filesystem sandbox - I had Claude Code reverse engineer the Claude app and it found out that Claude uses VZVirtualMachine - the Apple Virtualization Framework - and downloads and boots a custom Linux root filesystem.

I think that's a really smart product. Claude Code has an enormous amount of value that hasn't yet been unlocked for a general audience, and this seems like a pragmatic approach.

With a feature like this, my first thought always jumps straight to security. How big is the risk that someone using this might be hit by a hidden malicious instruction somewhere that breaks their computer or steals their data? Anthropic touch on that directly in the announcement:

You should also be aware of the risk of "prompt injections": attempts by attackers to alter Claude's plans through content it might encounter on the internet. We've built sophisticated defenses against prompt injections, but agent safety---that is, the task of securing Claude's real-world actions---is still an active area of development in the industry. These risks aren't new with Cowork, but it might be the first time you're using a more advanced tool that moves beyond a simple conversation. We recommend taking precautions, particularly while you learn how it works. We provide more detail in our Help Center.

That help page includes the following tips:

To minimize risks:

- Avoid granting access to local files with sensitive information, like financial documents.
- When using the Claude in Chrome extension, limit access to trusted sites.
- If you chose to extend Claude's default internet access settings, be careful to only extend internet access to sites you trust.
- Monitor Claude for suspicious actions that may indicate prompt injection.

I do not think it is fair to tell regular non-programmer users to watch out for "suspicious actions that may indicate prompt injection"!

I'm sure they have some impressive mitigations going on behind the scenes. I recently learned that the summarization applied by the WebFetch function in Claude Code and now in Cowork is partly intended as a prompt injection protection layer via this tweet from Claude Code creator Boris Cherny:

Summarization is one thing we do to reduce prompt injection risk. Are you running into specific issues with it?

But Anthropic are being honest here with their warnings: they can attempt to filter out potential attacks all they like but the one thing they can't provide is guarantees that no future attack will be found that sneaks through their defenses and steals your data (see the lethal trifecta for more on this.)

The problem with prompt injection remains that until there's a high profile incident it's really hard to get people to take it seriously. I myself have all sorts of Claude Code usage that could cause havoc if a malicious injection got in. Cowork does at least run in a filesystem sandbox by default, which is more than can be said for my habit! I wrote more about this in my 2025 round-up: The year of YOLO and the Normalization of Deviance.

Security worries aside, Cowork represents something really interesting. This is a general agent that looks well positioned to bring the wildly powerful capabilities of Claude Code to a wider audience. I would be very surprised if Gemini and OpenAI don't follow suit with their own offerings in this category. I imagine OpenAI are already regretting burning the name "ChatGPT Agent" on their janky, experimental and mostly forgotten browser automation tool back in August!

bashtoni on Hacker News:

Simple suggestion: logo should be a cow and an orc to match how I originally read the product name.

I couldn't resist throwing that one at Nano Banana:

Playtank 2 days ago

Designing Good Rules

This is the second part of two in a short series on how to design Your Next Systemic Game, and this time, we'll dive into designing rules.

A few years ago, I was demonstrating a system to a friend. Look here, you can do this, and when you activate these things, see what happens! It got some excitement, with the synergies involved, and sparked a great comment: "So it's more like designing board game rules than a computer game?"

Yes! Yes it is! Just like board game rules, a systemic design needs to be clearer and more easily communicated than even the real world is. Self-consistent, as Tom Leonard once wrote. But it's also not. In a board game, players need to understand and internalise all of the rules before they can explore the game's strategies, and the state-space is clearly restricted by the physical components of the game. In a digital game, much of the interaction can be left for the player to discover and can become obfuscated by complexity itself. Learning how to apply the rules is a process of discovery for the player.

Before we can design rules, we need to know what they are applying to. Modern games rely heavily on content. Objects built for specific purposes. A gun. An enemy. A level. You expand your game by adding more of them, or making new types of things that you can then add more of. How a certain piece of content behaves is usually very specific, and even if new content can add new behavior, there is rarely any interconnection.

Systems work differently. If you have the concept of something being flammable in your game, every piece of content in your simulation needs to be adapted to this rule in one way or another. Even objects that are not flammable will usually provide some kind of response, like having the flame fizzle out against it or begin to glow menacingly red and generate a heat haze. What this means is that you can add more systems and all of the objects you already have will then "just work" based on how you tell them to interact with these systems. You can create a lot more variety this way, by relying on robust interconnected systems, and not having to produce things because the players get bored. Look at the Building a Systemic Gun post (one of the most viewed on this blog) for a more practical example of the difference this makes.

There are three key elements to designing rules for emergent effect:

- Design simple systems.
- Provide intuitive rules.
- Apply these rules consistently.

"Combine simple behaviors to give the impression that the monsters are working together," writes Derek Yu about his game Spelunky. "This not only creates challenging situations, but it also makes the world feel more like a living, breathing ecosystem. Wherever possible, I tried to add monsters that attack you from new directions, so that when they were paired with existing monsters the attacks would feel coordinated." (Emphasis mine.)

That's it: have new enemies attack from different directions. A rule that is never communicated to the player, but serves to inform the game's development and make them feel coordinated. A simple rule that, when combined with more simple rules, generates an emergent experience. This is the holy trinity of rules design. They get some more attention in a previous post on designing systemic games. At a high level: Permissions are what you can do. Restrictions are exceptions to permissions. Conditions provide the framework for the other two.

The following could be the rules set up for a simple fire propagation system:

- Permissions: Wood burns. Books burn. Fire spreads.
- Restrictions: Water douses flames. Storms douse flames. Magicwood won't burn.
- Conditions: Burning things break over time.
Each rule here is simple, but it doesn’t have the raw simplicity of “new enemies attack from new directions.” A gestalt is “an organised whole that is perceived as more than the sum of its parts.” Consider a character class in Dungeons & Dragons or the specific role you may have in a hero shooter such as Overwatch . You’re the healer, or the glass cannon, or the damage dealer. These are variations of gestalts. You can rely on gestalts used in other games, or you can come up with your own. What you want is to provide rules for each gestalt that separates it from others and encourage players to actively switch between gestalts as they play in order to keep the game fresh. When we say that something can be internalised, it means that it can be made part of someone’s immediate understanding of the world. Gravity and darkness are two examples of things we have all internalised. You know that things fall if you drop them, and you know that you can’t see anything when it’s dark. If you drop something, you’ll instinctively react to try and catch it. If it gets dark, you squint. Something intuitive can then be defined as something quickly or easily internalised . Game rules are harder to internalise, because we must first describe the game world. But there are some key terms you can consider. Borrowed from Michael Sellers’ excellent book Advanced Game Design A Systems Approach , and elsewhere. For easier adaptation, focus on comprehension, elegance, and notion. “[P]resenting the game in such a way that players can build a mental model of it.” Michael Sellers This one is easy, because you’ve already prepared your Model in the previous post (right?!). Players must understand what they are interacting with. They must be introduced to the rules and be able to decipher the rules. A game with rules that are contradictory or generate too much information in a short time can end up frustrating instead. We may insist on tutorials or intro sections. On illustrative feature videos. But when it comes down to it, the best way to make our players understand the game they’re playing is by letting them play it. Only when you’ve interacted using a rule a number of times will you understand the rule. “Creating a diverse space for players to explore based on only a few rules.” Michael Sellers One reason platformers have such wide appeal is that they are extremely simple to internalise. You need to learn how to jump, and many of the other interactions will follow from there. You usually need very few rules and most of them will make sense simply because of that internalised concept of gravity that was mentioned before. This is also why many shooters will have visible projectiles and simple rules tied to them, like touching a projectile killing you or dealing damage. Similarly, the best way you can achieve elegance is by building your entire game around a single verb. Jump. Shoot. Drive. Then you can let the other elements click in place based on what comes out of that verb. “We have these broken notions of physics and when a video game takes those broken notions of physics and gives them life in a virtual world it doesn’t bother us.” Jamie Fristrom Not only doesn’t it bother us when our childhood ideas of physics are proven right by a game, against realistic fact, it often entertains us and feels just as natural as reality. A notion is an idea or whim, something that just comes to us naturally. 
It’s taking the description of a phenomenon not from its published scientific lore but from a science book from the kids’ section in the library. When we intuitively understand how our momentum is retained through Portal ‘s portals, this is notion at its best. Because there are no such portals in the real world, and there’s really nothing to say that we’d retain our momentum through them if there were. Notion means that things make more sense than they do in real life.

If the Spelunky snake attacks a certain way the first time you encounter it, then it has to work like that in the future as well. If not, the system becomes too unpredictable to internalise. Consistency is important, but it doesn’t mean that everything has to behave the same all the time. It’s the outcome of interactions that need to be consistent, not necessarily the full output of a scenario.

“Game systems should have predictable outputs for given inputs.” Michael Sellers

In Thief , if I get spotted by a guard, they will first become suspicious before they are alerted. This gives me some time to react. How much time depends on lighting and circumstances, but you can quickly learn how this system behaves and play on its predictability. Staying behind guards is better than risking it, for example. One reason many games don’t involve physics simulations in direct interaction, unless it’s done for fun, is because of their inherent lack of predictability. You want your Rocket League ball to bounce the same, so you can improve your skill at taking shots. A stated rule should always behave the same. A system should always provide the same outputs from the same inputs.

“[R]ules and content should function the same in all areas of your game.” Michael Sellers

Many games start from very clear rules but make seemingly arbitrary exceptions. You can shoot and kill characters, but if you shoot and kill this particular character it’s game over or checkpoint reload. Or they are immune to the damage. This is inconsistent and will make it harder to internalise the systems involved. It’s also bad for the sense of immersion. If a player has internalised a tool they can use, it should always behave as expected. Perhaps they have a grappling hook that can let them climb to new vertical locations. But then in the new level, the hook bounces off an invisible wall as the player finds an interesting balcony to reach. Maybe the level designer felt that it would make the level too easy, or there’s a story beat that introduces this balcony. But if you want to be serious with your rules, this lack of coherence is always a bad thing.

“Enabling the system to be used within multiple contexts or to have new parts added within it.” Michael Sellers

Once you have your consistent rules in place, you can start experimenting. A guard in Thief that is blind but reacts faster to sound would combine well with metallic floors causing lots of noise to make an interesting scenario. The more you think of this variability early on, the better. Remember the five areas of maximising iteration : Authoring, Transitioning, Tweaking, Testing, and Updating.

“[C]reate game systems such that content can be reused in new ways or created procedurally” Michael Sellers

Once you have your simple, intuitive, and coherent rules in place, you can extend them. You can use this to change or even to make rules, both through your game itself and by providing tools for players to do so on their own.
With your systems and rules in place, you can easily let other systems alter their outcomes. Have your spawning system spawn more enemies of a specific type or no enemies at all of another type. Or make it turn entire systems on or off to vary the gameplay. Some of these things can be represented as systems in their own right, such as weather that decreases your sight or makes all rock surfaces slippery. Other things can be expressed through the game’s UI or story. If you push this even further, and modularise your systems more, you can let your systems make the rules. This becomes possible for any system that has sufficient variation in its inputs and outputs. In the board game The Awful Green Things From outer Space , there are various weapons that you can use and various effects that those weapons can cause. As you run around on the ship in the game, you can pick weapons up. But you don’t know what effect they’ll have. Only once you hit an awful green thing with it will you draw a chit that determines what the effect actually is. They may be very vulnerable against the weapon, or you may cause them to split into more awful green things. Mutators, modifiers, custom game modes. Players have always been fond of mixing things up. What’s important about these changes is that they will always use the same pool of common content and then change the rules in one way or another. Like setting up a custom game in Civilization . Back when Xbox Live launched with Halo 2 , the community that formed invented many new ways to play. One of the ones I remember was the “zombie” game mode, where you’d arm one side (the humans) with shotguns and then have one player play a zombie using an energy sword. If a human was killed by a zombie, they’d switch to the energy sword on respawn and now become part of the zombie team. The interesting part of this was that it was an entirely verbal agreement, but still worked. Later Halo games introduced the Forge, where these kinds of variations could be created as customisations and be shared across the community. Of course, with PC gaming, there’s always been modding, and modding can push things much farther than what something like Forge can do. But modding is beyond the scope of this post. Back in the Game Balancing Guide , you find Dax Gazaway’s classification for players’ openness to learning new rules. If your game is too different from other games, there will be a segment of gamers that won’t like it or may not even touch it. Games are interactive. Players will understand something that they play much faster and much more intuitively than they will understand something they watch . The connection between button press and feedback will make more synapses fire than just watching through a video. Because of this, it’s always better to let the player do what they are expected to do than it is to tell them how it’s done. On the very first screen of Super Mario Bros. , you’re not told that you have to jump. Rather, if you don’t jump you die and have to start over. Always aim to let the player play before you show them something. Sometimes you can’t rely on interactive gameplay for one reason or another, and you still have information you need to convey to the player. Mandatory information is generally the domain of linear games. Cutscenes and staged sequences are the tools used in those cases. Passive observation techniques . If you can, avoid using dialogue or written text. 
This is where film and television are good inspirations, because they try to minimise how much characters talk and how much is shown. Use the minimum amount of exposition or narration, and have the player discover things in their own time. Avoid forcing their hand. Always aim to show and not tell, and rely on words only if you really have to or don’t have time to find a better solution. As a way to parcel out information, you can refer to the inverted pyramid in journalism. Always get the most important element of a rule across first. Ideally, by letting the player experience it first-hand.

A rule is made simple by clear boundaries. “Move one square” may sound simple enough, but can you move diagonally? Can you move into a square where there’s already another object occupying the same space? Can you move into the blue square, or only the white one? This is where your rules collide with reality and where you will need to really consider what you are making your rules for. Where effects originate from (their sources) and which things are affected by them.

What you will quickly notice is that rules can easily overlap multiple systems. If instead of saying “wood burns,” you’d say “flammable materials burn,” we can’t know what the rule means without first internalising which things are flammable. Saying “wood burns” means that the player can look into the environment of the game, identify something as wood, and then understand that it burns. Wood being flammable, in this case, becomes a perceived affordance once internalised.

The following is effectively a glossary of the terms used in this post. You can combine it with the previous post and you’ll have a sort of manual on how to go about making a systemic game. Send me an e-mail at [email protected] when I can play it!

Design simple systems.
- Permissions, Restrictions, and Conditions are your three main rule frameworks.
- Gestalts can be used as communicable collections of rules and invitations to expand player play styles.

Provide intuitive rules.
- Comprehension means understanding how a rule works.
- Elegance is about making wide use of narrow elements.
- Notion argues that you should lean into what players already expect, even against common sense.

Apply these rules consistently.
- Predictability means that you get the same outputs from the same inputs.
- Coherence makes sure that your game works the same under all circumstances.
- Variability is a strength of consistency, because it lets you mix things up.
- Extensibility can mean systems or players changing or even making up the rules.

Communicating rules includes some considerations:
- Play, Don’t Show: games are interactive first and foremost. Lean into it.
- Show, Don’t Tell: words are the least effective channel you can use; don’t use them if you don’t have to (or want to).
- The Inverted Pyramid is a handy journalistic tool that you can use to parcel out information.
- Boundaries need to be set, so that players understand where a rule begins and ends.
- Perceived Affordances are intuitive visual elements that aid the communication of your interactions (and therefore rules).
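To make the Permissions/Restrictions/Conditions framing concrete, here is a small sketch of the fire-propagation rules from earlier expressed as data plus a single update step. The property names and the tick function are invented for illustration; the post does not prescribe any particular implementation.

```go
package main

import "fmt"

// An object is just a bag of properties that systems can query.
type Object struct {
	Name      string
	Flammable bool // permission: things like wood and books burn
	Wet       bool // restriction: water douses flames
	Magic     bool // restriction: magicwood won't burn
	Burning   bool
	Integrity int // condition: burning things break over time
}

// tick applies the rules once: permissions say what can happen,
// restrictions carve out exceptions, conditions describe ongoing effects.
func tick(world []*Object) {
	anyFire := false
	for _, o := range world {
		// Restrictions win over permissions.
		if o.Wet || o.Magic {
			o.Burning = false
			continue
		}
		// Condition: burning things break over time.
		if o.Burning {
			o.Integrity--
			anyFire = true
		}
	}
	// Permission: fire spreads to flammable things (here: to everything nearby).
	if anyFire {
		for _, o := range world {
			if o.Flammable && !o.Wet && !o.Magic {
				o.Burning = true
			}
		}
	}
}

func main() {
	world := []*Object{
		{Name: "wood", Flammable: true, Burning: true, Integrity: 3},
		{Name: "book", Flammable: true, Integrity: 1},
		{Name: "magicwood", Flammable: true, Magic: true, Integrity: 3},
		{Name: "stone", Integrity: 10},
	}
	for i := 0; i < 2; i++ {
		tick(world)
	}
	for _, o := range world {
		fmt.Printf("%s: burning=%v integrity=%d\n", o.Name, o.Burning, o.Integrity)
	}
}
```

Adding a new system (say, weather) then only means adding another property and another restriction, and every existing object already participates.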

Simon Willison 3 days ago

My answers to the questions I posed about porting open source code with LLMs

Last month I wrote about porting JustHTML from Python to JavaScript using Codex CLI and GPT-5.2 in a few hours while also buying a Christmas tree and watching Knives Out 3. I ended that post with a series of open questions about the ethics and legality of this style of work. Alexander Petros on lobste.rs just challenged me to answer them , which is fair enough! Here's my attempt at that. You can read the original post for background, but the short version is that it's now possible to point a coding agent at some other open source project and effectively tell it "port this to language X and make sure the tests still pass" and have it do exactly that. Here are the questions I posed along with my answers based on my current thinking. Extra context is that I've since tried variations on a similar theme a few more times using Claude Code and Opus 4.5 and found it to be astonishingly effective. I decided that the right thing to do here was to keep the open source license and copyright statement from the Python library author and treat what I had built as a derivative work, which is the entire point of open source. After sitting on this for a while I've come down on yes, provided full credit is given and the license is carefully considered. Open source allows and encourages further derivative works! I never got upset at some university student forking one of my projects on GitHub and hacking in a new feature that they used. I don't think this is materially different, although a port to another language entirely does feel like a slightly different shape. Now this one is complicated! It definitely hurts some projects because there are open source maintainers out there who say things like "I'm not going to release any open source code any more because I don't want it used for training" - I expect some of those would be equally angered by LLM-driven derived works as well. I don't know how serious this problem is - I've seen angry comments from anonymous usernames, but do they represent genuine open source contributions or are they just angry anonymous usernames? If we assume this is real, does the loss of those individuals get balanced out by the increase in individuals who CAN contribute to open source because they can now get work done in a few hours that might previously have taken them a few days that they didn't have to spare? I'll be brutally honest about that question: I think that if "they might train on my code / build a derived version with an LLM" is enough to drive you away from open source, your open source values are distinct enough from mine that I'm not ready to invest significantly in keeping you. I'll put that effort into welcoming the newcomers instead. The much bigger concern for me is the impact of generative AI on demand for open source. The recent Tailwind story is a visible example of this - while Tailwind blamed LLMs for reduced traffic to their documentation resulting in fewer conversions to their paid component library, I'm suspicious that the reduced demand there is because LLMs make building good-enough versions of those components for free easy enough that people do that instead. I've found myself affected by this for open source dependencies too. The other day I wanted to parse a cron expression in some Go code. Usually I'd go looking for an existing library for cron expression parsing - but this time I hardly thought about that for a second before prompting one (complete with extensive tests) into existence instead. 
I expect that this is going to quite radically impact the shape of the open source library world over the next few years. Is that "harmful to open source"? It may well be. I'm hoping that whatever new shape comes out of this has its own merits, but I don't know what those would be.

I'm not a lawyer so I don't feel credible to comment on this one. My loose hunch is that I'm still putting enough creative control in through the way I direct the models for that to count as enough human intervention, at least under US law, but I have no idea.

I've come down on "yes" here, again because I never thought it was irresponsible for some random university student to slap an Apache license on some bad code they just coughed up on GitHub. What's important here is making it very clear to potential users what they should expect from that software. I've started publishing my AI-generated and not 100% reviewed libraries as alphas, which I'm tentatively thinking of as "alpha slop". I'll take the alpha label off once I've used them in production to the point that I'm willing to stake my reputation on them being decent implementations, and I'll ship a 1.0 version when I'm confident that they are a solid bet for other people to depend on. I think that's the responsible way to handle this.

That one was a deliberately provocative question, because for a new HTML5 parsing library that passes 9,200 tests you would need a very good reason to hire an expert team for two months (at a cost of hundreds of thousands of dollars) to write such a thing. And honestly, thanks to the existing conformance suites this kind of library is simple enough that you may find their results weren't notably better than the one written by the coding agent.

Fredrik Meyer 3 days ago

Writing a work log

About a year ago I started writing a “work diary”. The process is simple: at the end of the day, I write a few sentences of what I did that day at work. It has a few benefits:

- Easier to know what I should do the next day if I didn’t finish a task.
- If I’m stuck with something, I can write the problem down, effectively rubber-ducking with myself. I have a belief that writing a problem down will help clarify thoughts.
- I can ask an LLM questions about what I have spent time on. Here’s an example: Java Spring Boot, API Development, Test Automation, Microservices, CI/CD (Continuous Integration/Continuous Deployment). Or this one: “Based on the extensive work logs, collaboration with team members, involvement in complex debugging, API development, feature implementation, and participation in meetings and project management, it can be estimated that the programmer is at a senior level.” Here I use the CLI tool by Simon Willison.
- It can help me realize issues I should focus more or less on. Asking the LLM again, it pointed out that a lot of time is spent fixing bugs or attending meetings. It suggested setting aside dedicated time for deep work, so that complex coding tasks can be handled without interruption.
- It gives me some traceability. I can verify that I did actually work a particular day, or that I worked on a particular thing on a particular day.

I use Emacs for Org Mode and for Magit. To write the log I press 1 to open the “work Org Mode file”. Then I navigate to the work diary (headlines are “Work log”, month, day). In Org Mode, insert the current date. Then I write a few sentences. Here’s an example from last Friday (loosely translated):

Sleepy today. Deployed § 11-4 second part out in the dev environment, fixed a small bug (it didn’t consider manual income). Otherwise spent time on unrelated small fixes. Used Copilot to get ktor-openapi-generator to support -annotations. Made a Grafana-dashboard for error logs per app.

I have this in my Emacs config: ↩

<antirez> 3 days ago

Don't fall into the anti-AI hype

I love writing software, line by line. It could be said that my career was a continuous effort to create software well written, minimal, where the human touch was the fundamental feature. I also hope for a society where the last are not forgotten. Moreover, I don't want AI to economically succeed, I don't care if the current economic system is subverted (I could be very happy, honestly, if it goes in the direction of a massive redistribution of wealth). But, I would not respect myself and my intelligence if my idea of software and society would impair my vision: facts are facts, and AI is going to change programming forever. In 2020 I left my job in order to write a novel about AI, universal basic income, a society that adapted to the automation of work facing many challenges. At the very end of 2024 I opened a YouTube channel focused on AI, its use in coding tasks, its potential social and economical effects. But while I recognized what was going to happen very early, I thought that we had more time before programming would be completely reshaped, at least a few years. I no longer believe this is the case. Recently, state of the art LLMs are able to complete large subtasks or medium size projects alone, almost unassisted, given a good set of hints about what the end result should be. The degree of success you'll get is related to the kind of programming you do (the more isolated, and the more textually representable, the better: system programming is particularly apt), and to your ability to create a mental representation of the problem to communicate to the LLM. But, in general, it is now clear that for most projects, writing the code yourself is no longer sensible, if not to have fun. In the past week, just prompting, and inspecting the code to provide guidance from time to time, in a few hours I did the following four tasks, in hours instead of weeks: 1. I modified my linenoise library to support UTF-8, and created a framework for line editing testing that uses an emulated terminal that is able to report what is getting displayed in each character cell. Something that I always wanted to do, but it was hard to justify the work needed just to test a side project of mine. But if you can just describe your idea, and it materializes in the code, things are very different. 2. I fixed transient failures in the Redis test. This is very annoying work, timing related issues, TCP deadlock conditions, and so forth. Claude Code iterated for all the time needed to reproduce it, inspected the state of the processes to understand what was happening, and fixed the bugs. 3. Yesterday I wanted a pure C library that would be able to do the inference of BERT like embedding models. Claude Code created it in 5 minutes. Same output and same speed (15% slower) than PyTorch. 700 lines of code. A Python tool to convert the GTE-small model. 4. In the past weeks I operated changes to Redis Streams internals. I had a design document for the work I did. I tried to give it to Claude Code and it reproduced my work in, like, 20 minutes or less (mostly because I'm slow at checking and authorizing to run the commands needed). It is simply impossible not to see the reality of what is happening. Writing code is no longer needed for the most part. It is now a lot more interesting to understand what to do, and how to do it (and, about this second part, LLMs are great partners, too). It does not matter if AI companies will not be able to get their money back and the stock market will crash. 
All that is irrelevant, in the long run. It does not matter if this or the other CEO of some unicorn is telling you something that is off putting, or absurd. Programming changed forever, anyway. How do I feel, about all the code I wrote that was ingested by LLMs? I feel great to be part of that, because I see this as a continuation of what I tried to do all my life: democratizing code, systems, knowledge. LLMs are going to help us to write better software, faster, and will allow small teams to have a chance to compete with bigger companies. The same thing open source software did in the 90s. However, this technology is far too important to be in the hands of a few companies. For now, you can do the pre-training better or not, you can do reinforcement learning in a much more effective way than others, but the open models, especially the ones produced in China, continue to compete (even if they are behind) with frontier models of closed labs. There is a sufficient democratization of AI, so far, even if imperfect. But: it is absolutely not obvious that it will be like that forever. I'm scared about the centralization. At the same time, I believe neural networks, at scale, are simply able to do incredible things, and that there is not enough "magic" inside current frontier AI for the other labs and teams not to catch up (otherwise it would be very hard to explain, for instance, why OpenAI, Anthropic and Google are so near in their results, for years now). As a programmer, I want to write more open source than ever, now. I want to improve certain repositories of mine abandoned for time concerns. I want to apply AI to my Redis workflow. Improve the Vector Sets implementation and then other data structures, like I'm doing with Streams now. But I'm worried for the folks that will get fired. It is not clear what the dynamic at play will be: will companies try to have more people, and to build more? Or will they try to cut salary costs, having fewer programmers that are better at prompting? And, there are other sectors where humans will become completely replaceable, I fear. What is the social solution, then? Innovation can't be taken back after all. I believe we should vote for governments that recognize what is happening, and are willing to support those who will remain jobless. And, the more people get fired, the more political pressure there will be to vote for those who will guarantee a certain degree of protection. But I also look forward to the good AI could bring: new progress in science, that could help lower the suffering of the human condition, which is not always happy. Anyway, back to programming. I have a single suggestion for you, my friend. Whatever you believe about what the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools, with care, with weeks of work, not in a five minutes test where you can just reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months. Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you, when you coded till night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouched. Comments


The frustration of a perfect setup

No matter how I look at the list of apps I currently use, whether first-party or third-party, I can’t find anything to change, not a program to replace, not a service to swap for another. I think I am happy with my setup. It feels strange to admit, but somehow, I can’t quite believe it; I must be missing something, something surely can be tweaked. What happens after peak setup?

This frustration comes from the fact that looking at new apps, digging into settings, trying new online services, working on how each of these things operates with the others, is one of my favourite hobbies. I mean, a quick glance at the archive of this site will tell you that, not only do I love writing about apps and digital tools, but I love playing with their configurations; I’m like a kid with Lego bricks, building things, taking them apart, and building them again, with a huge smile, in a slightly different and improved way. Now that my application setup appears to be “final”, it feels as though all my toys and Lego bricks are neatly stored away in their respective drawers, sorted by colour, by type, and by size. It’s perfect, and seeing my beautiful collection all nice and tidy like that is a very satisfying sensation, except I’m looking at it seated on the empty floor of my childhood bedroom, alone and bored. What is there to do when nothing needs to be improved?

I recently wrote about my HTML and CSS “explorations” with this blog. Satisfied with the results, I think this job is done. The same goes for how Eleventy works on my machine: everything has been optimised, refined, future-proofed (especially Node.js): nothing to see here! Even the hosting is something I’m very happy with. My only gripe with xmit is that there is no possibility for me to pay for it.

The other apps on my Mac — the ones that don’t live in the Terminal like Eleventy, Node.js & npm, and xmit — are also perfect at what they do, and I can’t think of anything better to explore, let alone to use. If this is not your first visit, you already know how I feel about BBEdit. Well, I feel just about the same about NetNewsWire, which is as close to perfection as an app can get as far as I’m concerned. It feels part of the OS (even more so than current system apps if I’m being honest), it is stable, it is simple to use, and it runs smoothly on my soon-to-be six-year-old MacBook Air.

Being happy with Safari is by far the strongest proof that my setup is final. Using StopTheScript to block JavaScript on most media sites, along with the performance and privacy benefits of using a DNS resolver like Quad9, has proven to be an efficient way to keep Safari light and responsive, even if my web experience is getting a little more interrupted than I would like, due to all the crap websites throw at first-time visitors these days.

Yesterday, I had a look at apps like Yoink, Karabiner Elements, Hazel, and also got a taste of Mullvad Browser and News Explorer. Some of these apps were tried purely out of curiosity, to see if they would fit right in my “workflow”, others were basically reassurance that my current system and choices were the best I could have made. * 1

Among all the parties involved in this setup, the obvious candidate for a replacement is my Intel-powered MacBook Air. Yet, this old computer is currently in great shape: the recent factory-settings reset I had to do surely helped. But its best feature is not being able to run MacOS Tahoe: stuck to MacOS Sequoia, it’s protecting me from Liquid Glass on the Mac and the “icons in menus everywhere” experience. My personal laptop is a breath of fresh air after spending hours on my work computer running Tahoe. * 2

So, what will be able to make that itch go away? When nothing is broken, don’t fix it, as they say. But surely, there must be something that I’m missing, surely there is a program, somewhere, that would delight me, that would put a smile on my face. I want a new box of Lego bricks, I want to empty my drawers on the floor and see if I can do better.

In case you’re wondering, all of these apps are excellent, but not enough to replace what I already use, or to justify adding a new item to my list. For example, Mullvad Browser, like Firefox, isn’t scriptable; News Explorer has more features than NetNewsWire, but is not as polished; Yoink looks incredibly useful, but I prefer my own ways for now, &c. ^

Its replacement will have to wait until the new generation comes out, probably in March; then I can decide on whether I want to stick to the Air family, keep mine a bit longer, or upgrade for a far nicer screen and go with the Pro. ^

0 views
Ginger Bill 4 days ago

Mitigating the Billion Dollar Mistake

This article is a continuation of Was it really a Billion Dollar Mistake?. After reading a lot of the comments on numerous social media sites about the original article, I think I need to clarify a lot more. The main points I wanted to clarify:

- Null pointer dereferences are empirically the easiest class of invalid memory addresses to catch at runtime, and are the least common kind of invalid memory addresses that happen in memory unsafe languages.
- I do think it was a costly mistake, but the “obvious solutions” to the problem are probably just as costly, if not more so, in very subtle ways which most people neglected to understand in the article 1 .
- I think that even if Tony Hoare didn’t “invent” null pointers, within a couple of years someone else would have. I don’t think it’s a “mistake” the programming world was ever going to avoid.
- I am talking about languages that run on modern systems with virtual memory, not embedded systems where you interact with physical memory directly. Those platforms in my opinion need much different kinds of languages which unfortunately do not exist yet.
- I was also talking about languages akin to C and Odin, not languages that run on a VM or have “everything be a reference”.

A lot of commenters based their complaints on their experience with languages like Java/C#/Python/etc., and the issues with null-pointer-exceptions (NPEs) in them. What I think a lot of people seemed to forget is that in those languages, virtually everything is a pointer, unlike in a language like C/Go/Odin which has explicit pointers. When everything is a pointer, it is exponentially more likely that you will hit a pointer that is invalid. And in the case of a managed (garbage collected) language, that invalid pointer will most definitely be a null pointer. This is why I can understand the problem of having null pointers in such languages. But I think this still missed the point of what I was trying to state: the reason null even exists in those languages is because you can declare a variable without an explicit initialization value (in Java, something like “Foo foo;”). Because you can declare such a thing in a language like Java, there are three options to try and mitigate this design flaw:

- Allow for null pointers (and just deal with it).
- Make all pointers implicitly maybe types (as references effectively are in Java).
- Require explicit initialization of every element everywhere so that null can be assumed never to happen, along with things like maybe types.

Unfortunately existing languages like Java cannot have these problems solved, but newer languages that want to stylize themselves similarly could solve them. One of the issues is that languages like Java added maybe/option/optional types too late AND they are not the default behaviour. The first approach is the current status quo, the second approach keeps the implicit value declarations but adds more checks, whilst the third approach requires doing explicit value declarations. The enforcement of maybe types as the default pointer/reference type leads to two possibilities:

- Version 1: requiring each reference to be checked for null before it is used.
- Version 2: checking whether a value is null and propagating that up the expression tree.

Version 1 is straightforward to describe, but because of the ergonomic pains it can also lead to unwrapping cases, which are practically equivalent to NPEs. At least with an explicit unwrap, it is a little clearer that a panic could happen. But it can also just be an early-out, like with Odin’s or_return. Version 2 is a bit weirder, since it doesn’t remove the concept of null but propagates it further up the expression tree. The first approach is unergonomic to use, especially in a language where virtually everything is a pointer/reference, and with the addition of unwrapping which just panics on null, it has practically reinvented NPEs with more steps. As for the second approach, I’d argue it is very bug prone if it were the default, since you cannot trivially know where the null arose from; it was just passed up the stack 2 . Therefore most people think the third approach to mitigating null pointers is the “obvious” and “trivial” approach: explicit individual initialization of every value/element everywhere. One thing which I commonly saw was people saying that I “missed the point”: that null safety is not about protecting from common invalid memory access but rather about clarifying, in the type system itself, the states that a pointer can be in, whether it cannot be null or whether it might be null. I already knew this, and I find it bizarre 3 that people did not understand that from the article. The point I was trying to get across, which most people seemed to either ignore or not understand, was that the approach of requiring explicit initialization of every element everywhere comes with a cost and trade-offs. Most people who bring this up as “the solution” think there was either no cost or they think the cost is worth it.
The former group are just wrong, and the latter group are the people the article was aimed at in the first place: you don’t actually understand the costs fully if you are answering the way that you do. I understand this sounds “condescending” to some people, but I am not trying to be. The point I am arguing is far from the common view/wisdom, and thus I tried to explain my position. Why would a person listen to someone with a “fringe” view? “Fringe” views are typically wrong in other areas of life, so it makes sense to apply that heuristic to the domain of programming too. I don’t care if people agree with me or not; rather, I wish people would actually understand the argument and then comment. But as a systems programmer, I deal with memory all the time, and null pointers are the least common kind of invalid memory that I have to deal with; the other kinds were not handled by the type system, nor would they be handled by solving the problems of null. No, this is not saying “well just because you cannot solve problem X with Y, therefore don’t solve either”; it’s saying that they are different problems, and empirically they are just different, with different kinds of severity and ways to mitigate them. I am not saying you shouldn’t try to solve either problem if you are designing your own language, but rather that they are both kinds of invalid memory whose mitigations are completely different in kind 4 . For a managed language like Java, the cost of explicit initialization of every element everywhere is so little in comparison to the rest of the language that the approach is honestly fine. But for a language like the one I have designed and created—Odin—non-zero initialization becomes extremely costly as things scale. The simple/naïve approach is just to initialize each field of each value explicitly, pseudo-C style. But if you use a lot of pointers everywhere, the initialization becomes a lot more complex, and non-linear too. People argue the need to express non-nullable pointers, and either version 1 of the previous approach or this explicit approach are effectively the only ways of doing this. You could tell the compiler to assume the pointer is never null (e.g. with a compiler-specific attribute or assumption), but those are not guarantees in the type system, just you telling the compiler to assume it is never null. Non-nullability is not possible outside of those two approaches. This was the entire point I was making with the Individual-Element Mindset versus the Group-Element Mindset: the individual-element mindset lends itself well to thinking about individual elements like this. And as such, it doesn’t really think about the cost of thinking in individual elements as compounding into something expensive. I’ve been in projects where a lot of the time in a program is spent in the destructors/Drop traits of individual elements, when all they are doing is trivial things which could have been trivially done in bulk. Most people don’t consider these as “costs”, nor that there are trade-offs to this approach to programming; rather it’s “just the way it is”. There is the other aspect where, if the explicit initialization is applied to every type, not just ones which contain pointers/references, then it can be less ergonomic to type and adds visual noise 5 . This constant syntactic noise can be tiring and detracts from what is actually going on. With the implicit zero initialization that I had in Odin, it has worked out really well. Many might expect it to be confusing, but it isn’t; you can rely on it, and it becomes very natural to use.
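To make that contrast concrete, here is a rough sketch in Odin (the Entity type is made up purely for illustration):

```odin
package main

// Entity is a hypothetical type used only for illustration.
Entity :: struct {
	name:   string,
	parent: ^Entity,
	flags:  u32,
}

main :: proc() {
	// Implicit zero initialization (Odin's default): the whole array,
	// including the pointers inside every element, starts as the zero value.
	// No per-element ceremony is needed.
	entities: [1024]Entity

	// The "explicit initialization everywhere" style forces something like
	// this instead, and the cost compounds as types nest and collections grow:
	for &e in entities {
		e = Entity{name = "", parent = nil, flags = 0}
	}
}
```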
As the creator and main architect of Odin, a lot of Odin’s design has been about fixing a lot of the problems I and many others faced with C, whilst still not veering too far from the general feel of C. Odin does have nil pointers by default, but in practice they are a very rare problem due to numerous features and constructs of the language. One of the reasons for null pointers in C is the lack of a proper array type. Odin has proper array types and does not implicitly demote arrays to pointers. Odin has slices which replace a lot of the needs for pointers and pointer arithmetic, and because array types (including slices) are bounds checked, that already solves many of the problems that would have occurred in C with treating pointers as arrays, which may or may not have an associated length to check against. Odin also has tagged unions and multiple return values. Tagged unions should be “obvious” to the people who had been complaining about the initial article, but the use of tagged unions isn’t necessarily there to solve the pointer problem. Odin’s Maybe is an example of a maybe/option type, which is just a built-in discriminated union (a rough sketch of its definition and usage appears below). And due to the design of Odin’s unions, if a union only has one variant and that variant is any pointer-like type, no explicit tag is stored. The state of the pointer-like value also represents the state of the union. This means that a maybe pointer costs no more than a plain pointer. Another reason why C has problems with pointers is the lack of a way to state that a parameter to a procedure is optional. C doesn’t have default values for parameters, nor any way in its type system to express this. C’s type system is just too poor and too weak. This is why people unfortunately use pointers as a way to do this, since they can just pass NULL. However, it is rare to see nil used in Odin code to indicate optional pointers except when interfacing with foreign code, or for optional parameters to a procedure. This is because the need for a pointer itself is quite rare. There are multiple reasons why:

- Odin has slice types.
- Odin has multiple return values to allow for out-only parameters, which can be ignored with _.
- Odin isn’t an “everything is a pointer” kind of language: pointers have to be explicitly typed to exist.
- Writing pointer types in value declarations is less common due to type inference (an inferred declaration is much more common than one with an explicitly written pointer type).

However, one of the main reasons why pointers are rarely a problem in Odin is because of multiple return values. Multiple return values, when used in this manner, are akin (but not semantically equivalent) to something like a result type in other languages 6 . When a procedure returns a pointer, it is either assumed to be never nil OR accompanied by another value to indicate its validity, commonly in the form of a boolean or an error value. And coupled with the early-out constructs (or_else, or_return, or_break, or_continue) and named return values, a lot of those issues never arise. Odin is designed around multiple return values rather than result/option constructs, but this approach does in practice solve the same kinds of problems. Before people go “well the assumption is not enforced in the type system”, remember where all of this derives from: Odin allows for implicit declarations of variables without an explicit initialization value. And as the designer of Odin, I think enforcing that is both quite a high cost (see the individual-element vs grouped-elements mindsets) and far from the original approach to programming in C. I know this is not going to convince people, but it’s effectively trying to make someone think like another person, which is never easy, let alone always possible to do in the first place. And it’s not a mere “aesthetic preference” either. This seemingly small design decision has MASSIVE architectural consequences which lead to numerous performance problems and maintenance costs as a project grows.
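A rough sketch of what these patterns look like in practice (the User type and find_user procedure are made up for illustration; Maybe itself is built in):

```odin
package main

import "core:fmt"

User :: struct {
	name: string,
}

// The usual convention: return the pointer together with an ok flag.
find_user :: proc(id: int) -> (user: ^User, ok: bool) {
	_ = id            // a real lookup would use the id
	return nil, false // pretend the lookup failed
}

main :: proc() {
	if user, ok := find_user(42); ok {
		fmt.println(user.name)
	} else {
		fmt.println("no such user")
	}

	// Maybe is built in, roughly: Maybe :: union($T: typeid) {T}
	// For a pointer-like variant no separate tag is stored; nil is the tag.
	maybe_user: Maybe(^User)
	if u, has_user := maybe_user.?; has_user {
		fmt.println(u.name)
	}
}
```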
Null pointer exceptions (NPEs) are in a category of constructs in a language which I class as “panic/trap on failure”. What I find interesting is that there are numerous other things in this category, but many people will normally take a different approach to those constructs compared to NPEs, due to whatever reason or bias they have. The canonical example is integer division by zero. Instinctively, what do you think integer division by zero should result in? I’d argue most people will say “trap”, even if a lot of modern hardware (e.g. ARM64 and RISC-V) does not trap, and only the more common x86-related architectures do trap. Odin does currently 7 define the behaviour of division by zero to “trap” only because of this assumption, but we have considered changing this default behaviour. Odin does allow the programmer to control this behaviour at a global level or on a per-file basis if they want a different behaviour for division by zero (and, consequently, modulo by zero). But some languages such as Pony, Coq, Isabelle, etc. actually define integer division by zero to be 0. This is because it can help a lot of theorem provers. But there is the other question of production code. One of the main arguments against NPEs (especially in languages like Java) is that they cause a crash. So in the case of division by zero, do you want this to happen? Or would you prefer all integer division to be explicitly handled, or to default to a predictable/useful value, like 0, which prevents crashing in the first place? Another common example of “panic on failure” is languages with runtime bounds checking. If an index is out of bounds, most languages just panic. It’s rare to find a language that returns a maybe/option value on every array access to prevent an out-of-bounds panic. Not even languages like OCaml do this. NPEs, division by zero (if it traps), and runtime bounds checking are all examples of this kind of “panic on failure”, but people rarely treat them as being the same, even if they are the same kind of problem. So are null pointers really that big of a problem in practice? Honestly, no. I understand it might be common for beginners to a language like C to have many pointer-related problems, but they will also have loads of other problems too. However, as you get more competent at programming, that kind of problem is honestly the least of your problems. I honestly think a lot of this discussion is fundamentally a misunderstanding of different perspectives rather than anything technical. A lot of what some people think are their “technical opinions” are merely aesthetic judgements. And to be clear, aesthetic judgements are not bad, but they are not necessarily technical. But I’d argue most people are not applying their opinions consistently when it comes to the category of “panic on failure”, and NPEs are no different; they only seem more of a problem to them either because of the existence of the name of the “Billion Dollar Mistake” or because they encounter them more. I know a lot of people view the explicit individual initialization of every element everywhere approach as the “obvious solution”, as it seems like low-hanging fruit. As a kid, I was told not to pick low-hanging fruit, especially anything below my waist. Just because it looks easy to pick, a lot of it might be left unpicked for a reason. That does not mean that you should or should not pick that fruit, but rather that you need to consider the trade-offs.
If you honestly think the costs of explicit individual initialization of every element everywhere are worth it for the language you are working in or developing, then great! But at least know the trade-offs of that approach. For Odin, I thought it was not worth the cost—compared to the alternative ways of mitigating the problem empirically.

Most of the bad criticisms just came from people who didn’t read the article or didn’t read past a couple of paragraphs. That’s why I wanted to state this comment very clearly.  ↩︎
This is partially why I do not like exceptions as error handling in many languages. It is not obvious where things are thrown/raised from, and they encourage the practice of ignoring them until the latest possible place. I discuss that problem in The Value Propagation Experiment Part 2.  ↩︎
I understand what type systems do and their benefits, and it is a little insulting when people assume my knowledge (or lack thereof) without doing a modicum of review.  ↩︎
In the case of the other invalid memory addresses, linear/affine substructural type systems with lifetime semantics can help with this (e.g. Rust), but they come at another cost in terms of language ergonomics and restrictions. Language design is hard.  ↩︎
I know typing is never the bottleneck in programming, but the visual noise aspect is a big one when you are trying to scan (not necessarily read) code. I want to see the pattern and not be swamped with syntactic noise.  ↩︎
I know a result type is a kind of sum type and multiple return values are more akin to a product type, but given how different languages want to be used and expressed, this works out fine in practice for the same kinds of problems. Please don’t give me an FP rant.  ↩︎
At the time of writing, I am not sure which approach is the better one: trap or zero by default, but we allow for all four options in the Odin compiler. Division by zero for floats results in “Inf” and that’s not necessarily as much of a problem in practice, so why would integer division by zero be as bad?  ↩︎

0 views

Most Code is Just Cache

Claude Code has systematically begun to consume many of the SaaS apps I used to (or plan to) pay for. Why pay a subscription when I can "vibe code" a personal MVP in twenty minutes? I don’t worry about maintenance or vendor lock-in because, frankly, the code is disposable. If I need a new feature tomorrow, I don’t refactor—I just rebuild it. 1 Code is becoming just an ephemeral cache of my intent. Cartoon via Nano Banana. In this model, the ‘Source Code’ is the prompt and the context; the actual Python or JavaScript that executes is just the binary. We still run the code because it’s thermodynamically efficient and deterministic, but we treat it as disposable. If the behavior needs to change, we don’t refactor the binary; we re-compile the intent. This shift has made me intolerant of static interfaces. I have stopped caring about software that doesn’t let me dump massive amounts of context into Gemini or Claude to just do the thing. If a product forces me to click buttons to execute a process that an LLM could intuit from a prompt, that product is already legacy. It forces us to question the permanence of the current model. We often make the mistake of assuming software—as we know it today—is a permanent fixture of human productivity. But if you zoom out, the era of SaaS is a blink of an eye in modern history. It is easy to overestimate how core it is to the future. In this post, I want to extrapolate these thoughts a bit and write out what could be the final stages of software. The stages here might not necessarily be chronological or mutually exclusive. Instead, they are ordered from static to dynamic code generation — where more and more the intent of a customer is the software they use. This is the baseline where software is a static artifact sold as a service, built on the assumption that user problems are repetitive and predictable enough to be solved by rigid workflows. To the consumer, this looks like dashboards, CRUD forms, and hardcoded automations. The intelligence here is sourced mainly from the SaaS founder and hired domain experts, hard-coded into business logic years before the user ever logs in.
When: We recognized that distributing software via the cloud was more efficient than on-premise installations.
Value Loop: Customer Problem → Product Manager writes PRD → Engineers write Static Code → Deploy → Customer adapts their workflow to the tool. (Time: Months to Years | Fit: Generic / One-size-fits-none)
We are seeing this now with companies adopting the Forward Deployed Engineering (FDE) model. In this stage, the SaaS company hires humans to manually use AI to build bespoke solutions for the client. For the consumer, this feels like a concierge service; they don’t get a login to a generic tool, they get a custom-built outcome delivered by a human who used AI to write the glue code. The intelligence is hybrid: the human provides the architecture, the AI writes the implementation code in days rather than weeks.
When: Companies realize AI allows their employees to build custom apps for clients faster than the clients can learn or adapt a generic tool.
Value Loop: Customer Problem → SaaS Employee (FDE) Prompts AI → AI generates Custom Script/App → Employee Deploys for Customer. (Time: Days | Fit: High / Tailored to specific customer edge cases)
This is the current “safe space” for most tech companies, where they bolt an LLM onto an existing application to handle unstructured data.
Consumers experience this as a “Draft Email” button in their CRM or a “Chat” sidebar in their UI—the platform is still the main product, but AI is a feature that (hopefully) reduces friction and/or provides some extra functionality customization 2 . The intelligence comes from a constrained model of product design and LLM scaffolding, providing content within a structure still strictly dictated by the SaaS platform’s code.
When: People start to see AI is good at summarizing, generating content, or taking actions within existing workflows.
Value Loop: Customer Problem → Static SaaS Interface AI Feature Text Box → Stochastic Result → Human Review. (Time: Minutes | Fit: Medium / Constrained by the platform’s UI)
This is the tipping point where the software interface starts to disappear, because the “interface” was just a way to collect context that the model can now ingest directly. Consumers move to a “Do this for me” interface where intent maps directly to an outcome rather than a button click, often realized as an agent calling a database or MCP servers 3 . The intelligence is the model and its engineered input context, relegating the SaaS role, in some sense, to providing clean proprietary data via an agent-friendly interface. Software as a Service for Agents.
When: People start to see AI is good at orchestrating complex decisions and using tools—across SaaS platforms—autonomously.
Value Loop: Customer Problem (Prompt as ~PRD) → Runtime Code Generation → Dynamic Outcome. (Time: Real-time | Fit: Very High / Dynamically generated for the specific context)
Critically, this doesn't mean the LLM acts as the CPU for every single user interaction (which would be latency-poor and energy-inefficient). Instead, the model almost acts as a Just-In-Time compiler. It generates the necessary code to execute the user’s intent, runs that code for the session, and then potentially discards it. This is the end game in some cases. If code is just a cache for intent, eventually we bypass the cache and bake the intuition directly into the model. To the consumer, the “tool” is invisible; the expert system simply exists and provides answers or actions without a login or workflow. The intelligence is in the model itself; the software platform exists solely as a distillation mechanism—a gym to train the vertical AI—and once the model learns the domain, the software is no longer needed. A company in this stage is not really even a SaaS company anymore; it is more of an AI-gyms-aaS company.
When: People start to see AI is good at absorbing the entire vertical’s intuition.
Value Loop: Raw Domain Data → Reinforcement Learning / Fine-Tuning → Model Weights. (Time: Instant / Pre-computed | Fit: Very High / Intuitive domain mastery)
This might feel unintuitive as a stage — like how could you bake some proprietary data lake into a model? How can our juicy data not be the moat? My conclusion is that most (but not all) data is a transformation of rawer upstream inputs, and that these transformations (data pipelines, cross-tenant analysis, human research, etc.) are all “cache” that can be distilled into a more general model that operates on its intuition and upstream platform inputs. “But can agents run a bank?” Reliability and safety come down to distinguishing between guardrails (deterministic interfaces and scaffolding) and runtime execution (LLM code). For now, you don’t let the LLM invent the concept of a transaction ledger or rewrite the core banking loop on the fly.
In XX years, maybe we do trust AI to write core transaction logic; after all, fallible humans wrote the code for most mission-critical software that exists today. The line between human-defined determinism and agent symbolic interfaces will gradually move over time. “But enterprise SaaS is actually super complex.” Yes, but that complexity is mostly just unresolved ambiguity. Your “deep enterprise understanding” is often a collection of thousands of edge cases—permissions, policy exceptions, region-specific rules—that humans had to manually hard-code into IF/ELSE statements over a decade. Distilled to the core, this complexity collapses. The model doesn’t need 500 hard-coded features; it needs the raw data and the intent. An app built for one can also make a lot of simplifications compared to one that acts as a platform. “Customers don’t want to prompt features.” I agree. I don’t think the future looks like a chatbot. “Chat” is a skeuomorphic bridge we use because we haven’t figured out the consistent native interface yet. It might be a UI that pre-emptively changes based on your role, or it might feel like hiring a really competent employee who just “takes care of it” without you needing to specify the details. Or, as we see in Stage 2, the user never prompts at all—an FDE does it for them, and the user just gets a bespoke app that works perfectly. “Is traditional SaaS dying, then?” Stage 1, where most companies are stuck today, definitely is. Why? Because the sheer overhead of traditional SaaS—the learning curve, the rigid workflows, the "click tax" to get work done—is becoming unacceptable in a world where intent can be executed directly. It feels increasingly archaic when flexible solutions can be generated on demand. The value is moving away from the workflow logic itself and toward two specific layers that sandwich it:
- The Data Layer: Proprietary data, trust, and the “agentic scaffolding” that allows models to act safely within your domain.
- The Presentation Layer: Brand and UI. While I suspect trying to control the presentation layer long-term is futile (as users will eventually bring their own “interface agents” to interact with your data), for now, it remains a differentiator.

We are going to see companies move through these tiers. The winners IMO will be the ones who realize that the "Service" part of SaaS is being replaced by model intelligence. The SaaS that remains will be the infrastructure of truth and the engine of agency. We are transitioning from a world of static artifacts (code that persists for years) to dynamic generations (code that exists for milliseconds or for a single answer). Of course, I could be wrong. Maybe AI capability plateaus before it can fully integrate into complex verticals. Maybe traditional SaaS holds the line at Stage 2 or 3, protecting its moat through sheer inertia. Maybe the world ends up more decentralized. Some of my open questions:
- Which stage should you work on today? Is there alpha in skipping straight to Stage 4, or do you need to build the Stage 2 “vibe coding” service to bootstrap for now?
- What are the interfaces of the future? Is it MCP, curated compute sandboxes, or a yet-to-be-defined agent-to-agent-to-human protocol? What interface wins out, or does each company or consumer bring their own agentic worker?
- How fast does this happen? Are we looking at a multi-decade-long transition, or do companies today rapidly start dropping lower-stage SaaS tools?
- Does AI have a similar impact beyond software? Does medicine move from “static protocols” to “on-demand, patient-specific treatments”?
Even more so than me, you can see Geoffrey Huntley’s ralph-powered rampage of GitHub and many other tools. I liked this tweet by Harj Taggar, “moved away from the FDE playbook that’s become the default for fast growing AI startups. Instead they’ve built AI to covert plain English from the customer into Python code to make the product work for their use cases”. Similar to Karpathy’s “LLMs not as a chatbot, but the kernel process of a new Operating System” (2023).

12 views
devansh 4 days ago

Is Complexity just an illusion?

Most of what we call “complexity” is not a property of reality. It’s a property of our descriptions of reality. The world is what it is; what changes is the language you have available to carve it up. When someone says “that’s a golden retriever,” they’re not just using two words, they’re using a compressed concept that bundles size, coat, temperament, typical behavior, and a bunch of implied background. If you don’t share that vocabulary, you’re forced into a longer, clumsier description of the same dog. The dog didn’t get more complex. Your map did. This is why expertise feels like magic. A chess novice sees a board with dozens of pieces and a combinatorial explosion of interactions. A grandmaster sees “a fork motif,” “a weak back rank,” “a pinned knight,” and a small set of candidate lines. They’re not seeing less detail. They’re carrying a better compression scheme. They have words for patterns that occur often, and those words collapse chaos into structure. Complexity shrinks when you acquire the right abstractions. Once you internalize this, you stop worshipping “simple explanations” in the naive sense. People don’t actually want explanations that are short. They want explanations that keep working when conditions change, that don’t fall apart on new data, and that don’t assume more than the evidence forces. Word count is not the virtue. Appropriate restraint is. Compare the proverb “Red sky at night, sailor’s delight” to a messier but truer model: weather depends on pressure systems, humidity, wind, and local geography; red skies correlate sometimes, depending on context. The proverb is shorter. The second is less wrong in more places because it commits less. This is also why simplicity often correlates with truth in mature domains. Over time, languages evolve to give short handles to recurring, broadly useful structure. We coin compact terms like “germs,” “incentives,” “feedback loops,” “network effects.” They’re easy to say because the underlying patterns are valuable and frequent, so the culture compresses them into vocabulary. The causality isn’t “short explanations generalize.” It’s “general structure gets named,” and once named it looks simple. Simplicity is often a dashboard indicator, not the engine. Learning anything complex is mostly representation engineering in your own head. You are not trying to stuff facts into memory. You are trying to acquire compression: concepts that turn many details into a small number of stable handles. What follows is a basic mental model: 1) Steal the field’s primitives before you invent your own. Every domain has a small set of basic concepts that do a shocking amount of work. If you skip them, you’ll experience the domain as irreducible complexity. In calculus, “derivative” is not a symbol; it’s “local linear approximation.” Once that clicks, a lot of problems stop being special cases. In economics, “opportunity cost” and “incentives” are compression handles that cut through moralizing narratives. In product work, “retention,” “activation,” and “unit economics” prevent you from drowning in vibes. Early learning should look like building a precise glossary, not collecting trivia. 2) Build a pattern library by grinding examples until the patterns name themselves. Experts aren’t mainly smarter; they’ve seen enough instances to chunk reality. You get there by doing many small reps, not by reading one long explanation. Read one worked example, then do three similar ones from scratch.
In chess, drill forks and pins until you stop counting pieces and start seeing motifs. In programming, you want “race condition,” “off-by-one,” “state leak,” “cache invalidation” to become immediate hypotheses, not postmortem discoveries. Practice isn’t repetition for discipline’s sake; it’s training your brain to compress recurring structure. 3) Learn with falsifiable predictions, not passive recognition. If you can only nod along, you don’t have the abstraction. Force yourself to predict outcomes before checking. If you’re learning statistics, predict how changing sample size affects variance. If you’re learning sales, predict which segment will churn and why. If you’re learning systems, predict the failure mode under load. This converts knowledge from "a story I can repeat" into "a model that constrains reality." 4) Control commitment: go from broad to narrow. When something breaks or surprises you, generate hypotheses ranked by how much they commit. Start with coarse categories (“measurement issue,” “traffic shift,” “pricing edge case,” “product regression”) before picking a single narrative. Then test to eliminate. This is how experts stay accurate, they don’t jump to the cleanest story; they keep the hypothesis space alive until evidence collapses it. The question “what does this rule out?” becomes your guardrail. 5) Upgrade your vocabulary deliberately. When you encounter a recurring cluster of details, name it. Give yourself a handle. The handle can be a formal term from the field or your own shorthand, but it must point to a repeatable pattern you can recognize and use. This is how you compound. Each new concept is a new compression tool; it makes future learning cheaper. If you do this well, "complex topics" start to feel different. Not because the world got simpler, but because you stopped paying unnecessary translation costs. The deepest form of intelligence isn’t producing the shortest answer. It’s finding the abstraction level where the real structure becomes easy to express, and then refusing to overcommit beyond the evidence. So is complexity an illusion? idk you tell me. The kind of complexities people complain about are “hard to describe, hard to predict, hard to compress”, this is often a signal that your vocabulary is misaligned with the structure of the thing. The tax is rarely levied by the territory. It’s paid at the currency exchange between reality and the symbols you’re using. And the highest-leverage move, more often than people admit, is to upgrade the map.

1 views
Brain Baking 4 days ago

Favourites of December (And a Short 2025 Recap)

A late happy new year to everyone! I almost forgot to publish last month’s favourite (blog) posts, and since last month was the last one of 2025, let’s do a short recap as well. Previous month’s recap: November 2025 . Last year was another eventful year. Browse the full 2025 Brain Baking archive for more juicy details. I selected one post per month that for me stands out: Our son also kicked me out of my cosy home office upstairs. Luckily, our renovations were finished in time, so we moved the living room and I took the old space hostage . One of the advantages of directly staring at a larger window is being able to admire the seasonal view: The window at my desk showcases snowy trees. For 2026, I only wish for one thing: stability . Let’s stop the craziness and try to get things settled down. No more kids, renovations, job changes, broken bicycles, and serious sickness please. Just, you know, breathing. Whoosah . Last month I joined the Advent of Code challenge using Clojure, a language I know absolutely nothing about. Since then I’ve been obsessed with Lisp-based dialects. Forgive me if most of the links below are programming-oriented: it’s been invigorating to learn something new and actually enjoy a programming language for a chance. It’s the reason I’m typing this in Emacs now, although I haven’t even installed CIDER yet. All in due time… Ok that was definitely too much Emacs stuff. The lack of other links shows how much I’ve been obsessed with the editor lately. No other random links for this month! Related topics: / metapost / By Wouter Groeneveld on 10 January 2026.  Reply via email . In January, I had the idea to compile your own philosophy . So far, I have collected lots of notes and summarised too many previous ones, but nothing has been published yet. In February, I shared my stationary drawers . I should really clean out all those fountain pens. In March, I dug up a photo of my first console , the SEGA Genesis/MegaDrive. In April, I learned that my sourdough starter has twins somewhere in Switzerland. In May, more thoughts about writing and publishing popped up. In June, I debunked (or confirmed?) the fact that IT freelancers earn more than their employee counterparts . In July, I got influenced by other board game enthusiasts and admitted to having too many games and too little time . In August, we welcomed our second little one and I turned forty —in that order. Yes, that is important to me. In September, I wrote too many articles about trick taking games and local traditions . In October, I fondly looked back at years of downloading warez software . In November, I recovered my late father-in-law’s 1994 IBM PC invoice . In December, I started shaving Emacs yaks . I haven’t stopped ever since. Nick George reports on building static websites with Clojure . Nathan Marz describes how he invented Specter to fill Clojure’s mutability hole. I don’t understand 90% of the technicalities there, but one day, I will. More Clojure stuff. Sorry… Mikko Koski helped me get started: 8 tips for Advent of Code 2022 in Clojure. A more official one, but just as interesting: the State of Clojure 2024 results . 76% of the people using it build web apps, 40% is on Emacs/CIDER, and Babashka is super popular! This Advent of Code GIF archive is crazy. Victor Dorneanu wrote about his Doom Emacs to Vanilla migration. 
I tried Doom/Spacemacs for about one whole day and then started back from scratch, but damn, it’s very challenging, even though you can “do what you want”—if you’re an Emacs/Elisp acolyte, that is. I’m planning to get baptized in the Emacs Church very soon. Alice from The Wallflower Digest shares her thoughts about personal curriculums; a way to get started with deliberate life-long learning. (via Joel, I think?) Karthinks found fifteen ways to use Embark, a wonderful context-aware Emacs package. More “Emacs from scratch” blogs to share: this one’s from Arne and lays out the foundations in case you want to get started. Thanks, Arne. You’re in my RSS feed now. Frank Meeuwsen writes (in Dutch) about AI tooling and how it democratises digital literacy. Or rather, how it should. Gregory J. Stein wrote a guide on email in Emacs using Mu and Mu4e. I have more thoughts on that saved for a separate blog post. If you’d like to know how many Emacs packages you’re currently rocking, Manuel Uberti has an Elisp snippet for you (via Sebastián). Kristoffer Balintona helped me better understand the Vertico completion-at-point-function stack.

3 views
Abhinav Sarkar 5 days ago

How I use Jujutsu

About three months ago I started using Jujutsu (JJ), a new Version Control System , for my personal projects. It took me a while to get used to it after more than a decade of using Git , but now I’m quite comfortable with it. Working with Jujutsu requires a shift from the mental model of Git. However, it is not as daunting as it may seem on the first day. This post was originally published on abhinavsarkar.net . Looking back, I don’t actually use all the fancy things JJ provides, and you may not need to either. In this post I list my most used JJ commands and how I use them. It is not meant to be a tutorial, or even a comprehensive list, but it should be enough to get you started. This post assumes that the reader knows how to use Git. JJ uses Git as a backend. This means that you can still use Git commands in your repo, and push them to Git remotes. Your coworkers can keep using Git with shared repos without ever being aware that you use JJ . initializes a new Jujutsu repository. You do this once, and you’re ready to start working. I usually run it with the option, which allows me to use Git commands as well in the same repo. If you want to work in an existing Git repo, you should run it with in the repo directory, to make JJ aware of it. Afterward, you don’t need to use Git commands. clones a Git repo and initializes it as a JJ repo. You can supply the option if you want. configures user settings. You can edit the user-level JJ config file by running . You can also override settings at repo level. For example, to set a different user email for a repo, run . You can also run to list the current config in effect. This is an area where JJ differs a lot from Git. JJ has no staging area, which means that every change you make is automatically and continuously staged. This came as a big surprise to me when I was getting started. If you are planning to use JJ with an existing Git repo, get rid of the untracked files either by committing them, or deleting them, or adding them to . There is literally no concept of untracked files in JJ ; a file is either committed or tracked or ignored. JJ has the concept of commits, same as Git. However, the workflow is different. Since there is no staging area, you start with creating a commit. That’s right! The first thing you do is create a commit, and then fill it by changing your files. Once you are done, you finalize the commit, and move on to a new fresh commit. JJ prefers to call them “changes” instead of commits to distinguish them from Git commits. creates a new change. If you know what your change is about, you can start with a commit message: , but JJ does not mandate it. You can start making changes without worrying about the message. One useful variation that I use a lot is . This creates a new change after the given change but before all the change’s descendants, effectively inserting a new change in the commit tree while simultaneously rebasing all descendant change. Once you are done, you can add a commit message to the current change by running . You can also provide the message inline: . As I mentioned, you don’t need to add a message to start working on a change, but you do need it before you push the change to a Git remote. You can run it any number of times to change the current change’s message. Alternatively, you can run to describe the current commit and start a new one. It is equivalent to running followed by . I use a mix of , and , depending on the situation. Like the command, tells you the state your current change is in. 
It lists the changed files and their individual statuses (added, modified, etc). This is where JJ really shines compared to Git. Moving commits around or editing them is a massive pain in Git. However, JJ makes it so easy, I do it many times a day. switches you over to the given change so you can modify it further. You use this when you’ve already committed a change but you need to tweak it. By default, you can edit only the changes that haven’t been pushed to the main branch of your repo’s remote. After you edit files, all the descendant changes are automatically rebased if there are no conflicts. simply combines the current change with its parent. It is useful when you commit something, and realize that you forgot to make some small changes. Another use for it is to resolve conflicts: create a new change after the conflicted change, fix the conflict, and squash it to resolve. is the opposite of : you use it to interactively split the current change into two or more changes. Often when I’m working on a feature and I find some unrelated things to fix, such as linter warnings, I go ahead and fix them in the same change. After I’m done with all the work for the feature, I use to split the change into unrelated changes so that the project history stays clean. restores files to how they were before the change, pretty much same as . You can run it in interactive mode by adding the option. You can also restore the files to how they were in a different change by specifying a change ID with the option. moves changes from anywhere to anywhere. You can use it to move individual changes between branches, or rearrange them in the same branch like so: When you move single changes like this, their descendant changes become invalid, but you can move them also in the same way. Or you can move entire branch of changes: It mostly works without any issues, but if there are conflicts, you’ll need to resolve them. I actually use rebase all the time. When I’m working on multiple features, and I find something that is more suited to be done on a different feature branch than I’m currently on, I finish working on the change, and then just move it to the different branch. Another use is to rebase feature branches on the main branch every day, like so: Here, , , , and are shorthand change IDs of the roots of various feature branches. You can also use rebase to splice changes/branches in the middle of other branches using the (after) and (before) options, but I rarely do this. is like except the changes are not moved but copied to the destination. It’s somewhat like . discards a change and rebases all its descendants onto the discarded change’s parent. I use it to get rid of failed experiments or prototypes. is supposed to automatically break the current change and integrate parts of it into ancestor changes at the right places, but I haven’t managed to make it work yet. I need to look more deeply into this. shows the change graph. JJ has a concept of revsets (sets of changes) that has an entire language to specify change IDs. takes an argument that uses the revsets language to choose which changes to show. For example: The revset language is rich and revsets can be used with many JJ commands. You can also create your own aliases for it, as we’ll see in a later section. shows differences between two changes, or in general between two revsets: shows the details of the current change. You can also use to inspect another change without switching to it. 
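To put the commands from the sections above in one place, my day-to-day usage looks roughly like this (illustrative invocations; check jj help for the exact flags):

```sh
# Starting up (colocated with Git so plain git commands keep working)
jj git init --colocate
jj git clone --colocate https://example.com/some-repo.git

# Creating changes
jj new                          # start a fresh change
jj new -m "add feature"         # start it with a message
jj describe -m "add feature"    # (re)write the current change's message
jj commit -m "add feature"      # describe the current change and start a new one
jj status                       # what the working-copy change contains

# Modifying changes
jj edit <change-id>             # go back and tweak an earlier change
jj squash                       # fold the current change into its parent
jj split                        # split the current change interactively
jj rebase -r <change-id> -d <destination>   # move a single change
jj rebase -s <change-id> -d <destination>   # move a change and its descendants
jj abandon <change-id>          # drop a failed experiment

# Viewing changes
jj log -r 'all()'               # the change graph for a revset
jj diff --from <rev> --to <rev> # differences between two revsets
jj show <change-id>             # details of a single change
```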
I’ve been mentioning branches, but actually JJ does not have branches like Git does. Instead, it has bookmarks, which are named pointers to changes. You can bookmark any change by giving it a name, and you can manipulate the bookmarks. Then to have branches, all you need to do is to point a bookmark to the required tip of the change graph. creates a new bookmark pointing to the current change with the given name. You can use bookmarks to mark the root or the tip of a feature branch, or to mark a milestone you want to return to later. When you rebase a change that a bookmark points to, the bookmark moves with it automatically. To list all existing bookmarks, run . To delete a bookmark you no longer need, run . If the deleted bookmark is tracked as a remote Git branch, the deletion is propagated to the remote as well. Alternatively, you can delete a bookmark only locally by running . You can also move, rename, and set bookmarks, as well as associate/disassociate them with Git remote branches. If you push a change with a bookmark to a Git remote, JJ creates a Git branch with the same name on the remote, but locally it remains a JJ bookmark. JJ tracks each operation in the repository in an immutable log, and provides commands to work with this log. shows a history of all operations performed on the repository. Each operation is assigned a unique ID, and you can see what changed with each operation. You can use the op IDs to restore the whole repo to an earlier state by running . undoes the last operation performed on the repository. Unlike , which modifies history, works on the Jujutsu operations themselves. This means it doesn’t lose any information; it just moves you back one step in the operation history. You can run this repeatedly to move backward in the operation history one step at a time. is the opposite of , that is, it moves you forward in the operation history by one step. It can also be run repeatedly. The operation log along with the undo and redo commands provide a safety net that makes it much easier to experiment with JJ without the fear of losing work. JJ uses Git as its backend, and provides commands to interact with remote Git repos. We already learned about and . We can also push and fetch. pushes your JJ changes to a Git remote. By default, it pushes all tracked bookmarks that have new changes. If you want to push a specific bookmark, you can specify it with . You can also push to all or tracked branches with the and options respectively. When you push, JJ converts the changes into Git commits and creates or updates remote Git branches accordingly. One thing to note is that JJ refuses to push changes that have conflicts or are missing commit messages. fetches changes from a Git remote and updates your local repository. It’s equivalent to . After fetching, you can see the remote changes in your change graph, and you can rebase your local changes on top of them if needed. You can fetch from a specific remote by running , and fetch a particular branch by running . manages your Git remotes. You can add a new remote with , or list existing remotes with . This is similar to but integrated with JJ ; it does not update the remotes of your underlying Git repo. creates a new change that undoes the effects of the specified change, pretty much like . The reverted change remains in the history of the repo. marks conflicts as resolved during a merge. 
When JJ can’t automatically merge changes (for example, when two changes modified the same lines), it creates a conflicted state in your working directory. After you manually fix the conflicts in your files, you run the resolve command to tell JJ that the conflicts are resolved and the merge can proceed. JJ then automatically rebases any descendant changes. JJ is highly customizable through its configuration files. You can define custom aliases for commonly used commands and revsets, which can significantly ease your workflow. These are stored in your JJ config file at the user and/or repo level. My configuration boils down to a handful of revset aliases and command aliases. You can compose revsets to create new revsets. These are the ones I use:

- one that finds nonempty leaf changes that are mutable, have descriptions, and can be pushed to a remote;
- one that finds changes from the default branch or the repository root to the current change, plus ancestors of visible leaf changes from the last 5 days, which gives me a good overview of the state of my repo;
- one that finds all changes from the last month.

I use the above defined revsets to create some custom commands:

- one that shows the recent changes from the default branch to the present, combining the revsets above;
- one that moves the bookmark in the current branch to the closest pushable commit.

I have the default command set to the log alias, so running a bare jj only shows me the recent log. My usual workflow is to create a new commit, work on it, describe it, split/squash/rebase as needed, then push. Three months in, JJ has become my primary version control tool. The learning curve was steep, but it was worth it. The ability to freely rearrange changes and experiment without fear has fundamentally changed how I work. I spend less time wrestling with Git and more time actually coding. JJ has plenty of other useful features, such as workspaces and the ability to manipulate multiple changes at once, that I haven’t explored deeply. There’s a lot more to discover as I continue using it. If you use Git for personal projects and find yourself frustrated with rebasing or commit management, JJ might be worth a try. For further learning, I recommend the Jujutsu for Everyone tutorial, Steve Klabnik’s tutorial and Justin Pombrio’s cheat sheet, and of course, the official documentation. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!

Giles's blog 5 days ago

Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI:

- A simple cross entropy loss over a fixed test set.
- The results for an instruction fine-tune test that's covered in the book.

Sorting the results by the loss, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern.

I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now. In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change.

Let's spec out the problem first. The instruction fine-tuning test trains our model on the Alpaca dataset in order to teach it how to follow instructions; that dataset comprises a series of instruction/input/response samples (more details in this post). In the version I've settled on, I fine-tune on a training set of 85% of the samples, epoch by epoch, bailing out when the loss on a separate validation set of 5% of the samples starts rising. I then use the weights from the previous epoch -- that is, before validation loss started rising -- to generate responses to the remaining 10% of the samples. Once that's done, the script hits the OpenAI API, using GPT-5.1 with default parameters for all of the options (eg. no explicit temperature), asking it to score each response. We do that for every model-generated response in the test set, then take the average of the scores and use that as our result.

To see why that's problematic, imagine a simple instruction with no separate input: a question asking who wrote "Pride and Prejudice". One response I've seen from my models was incoherent nonsense to the effect that the book wrote itself. That's obvious garbage, and should get a zero -- and GPT-5.1 consistently does that. Another response, from OpenAI's original weights for their "medium" model (larger than the ones I've been training), was a wordy but correct answer. That deserves 100, or perhaps 95 due to being unnecessarily wordy (the answer "Jane Austen" is the suggested response in the dataset). But now how about a response naming Sarah Palin as the author? One of my models came up with that gem during an earlier eval. It's completely wrong, so it deserves a 0, right? And normally the GPT-5.1 model does that -- but sometimes it's a little more generous, and gives it a low, but non-zero score. When asked for its reason, it makes the logical point that while it's the wrong answer, at least Sarah Palin is a real person. It's better than the "the book wrote itself" complete nonsense of the first response.

The problem is that the different runs against the different models are not consistent, as they're all talking to GPT-5.1 separately. One model might find it in a harsh "mood", and get a lower rating than another model that found it at a more generous moment.
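To make the setup concrete, here is a rough sketch of what one of those per-response judging calls could look like. This is my own illustration, not the actual code from the book or this post: the prompt wording, the 0-to-100 rubric, and the helper name judge_response are assumptions based on the description above.

```python
from openai import OpenAI

client = OpenAI()

def judge_response(instruction: str, expected: str, model_output: str) -> int:
    """Ask GPT-5.1 to score a single model response from 0 to 100."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Reference answer: {expected}\n"
        f"Model response: {model_output}\n\n"
        "Score the model response on a scale of 0 (useless) to 100 (perfect). "
        "Reply with the number only."
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",  # model named in the post; default parameters otherwise
        messages=[{"role": "user", "content": prompt}],
    )
    return int(completion.choices[0].message.content.strip())

# The overall IFT score is the average over the held-out test samples, e.g.:
# scores = [judge_response(s["instruction"], s["output"], s["generated"]) for s in test_set]
# ift_score = sum(scores) / len(scores)
```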
I came to the conclusion that the best way to fix this is to do a "batch" -- that is, fine-tune each model on the Alpaca dataset that Raschka provides, generate responses for the test set, and store them in a file. Then, once we've done that for all models, we can score them all at once, prompting GPT-5.1 with every model's response to the same query in a single request. The theory is that doing it that way will mean that each individual query/response pair is graded consistently between models, even if there might still be inconsistencies between query/response pairs. That hopefully means we'll get more consistent results and can compare the models better.

The code comes in two parts:

- A script to fine-tune a model, generate test responses, and dump them into a JSON file.
- The LLM-as-a-judge code to send a bunch of models' responses to GPT-5.1. It scrambles the order of the models in each query, to try to avoid any preference the model might have for the first one vs the last one, and it stores GPT-5.1's per-response scores and comments in a new "annotated" JSON file.

Running the first against each of our models, and then the second against all of the output files, gives us an updated set of results, with links to the annotated JSON files in case anyone else wants to take a look (still sorted by loss so that you can compare it more easily with the one above).

That's really interesting! The IFT score is still not correlated with the loss. But there does appear to be a pattern. It looks like we have three groups of models:

- The OpenAI weights and the cloud train on the 8x A100 40 GiB machine using FineWeb, which have low loss and high IFT scores.
- The other cloud models and the local train that used FineWeb, which have medium loss and low IFT scores.
- The FineWeb-Edu local trains, which have high loss, but IFT scores that are almost as good as the first group's.

I tried running the LLM-as-a-judge scoring script a few times, just to make sure this wasn't some kind of random weirdness, but the pattern was always the same: the OpenAI weights, the cloud FineWeb 8x A100 40 GiB train, and the two local FineWeb-Edu models always got the best IFT scores, though sometimes they swapped positions (apart from the OpenAI medium model, which was of course always at the top). The other cloud FineWeb models and the local FineWeb one were consistently scored much lower.

A hypothesis: there are two things that contribute to how good a model is at these IFT tests:

- The loss. Models that are better at predicting the next token are inherently better at instruction-following after the fine-tuning.
- The amount of information in the dataset. It doesn't matter how clever a model is: if it never saw "Jane Austen wrote 'Pride and Prejudice'" as part of its training, it will never be able to get a good score on that question.

Or to put it another way -- some of these models are smart but not knowledgeable, while others are knowledgeable but not smart, and some are neither. I think that could explain what we're seeing here.

While OpenAI never published their "WebText" dataset for GPT-2, the paper describes it as "a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma." Now, the FineWeb dataset is quite similar, though I think it's a tad more curated than that. But OpenAI trained their models for quite some time and did lots of tricks to get the loss as low as possible. By contrast, the FineWeb-Edu dataset is a carefully selected subset of FineWeb, with only the most "educational" data. Models trained on it, you might think, would know more facts for a given amount of training.

So we can imagine the OpenAI models are smart but not knowledgeable, as is our cloud FineWeb 8x A100 40 GiB model, which (I believe due to an accidentally-near-optimal batch size) worked out well in terms of loss. They were trained on relatively sloppy datasets but turned out reasonably well; their intelligence makes up for some of their lack of knowledge. Our other cloud trains and the local FineWeb one are dumb and not knowledgeable; they were trained on the low-information FineWeb dataset, but they didn't wind up with a particularly amazing loss, so they get low scores. And finally, our local FineWeb-Edu models are still dumb, but they make up for it by knowing more, because their training data was better.
Well, it sounds plausible ;-) And I'd like to spend some time digging in to see whether there's any indication that it's actually true. But after an afternoon of poking around the results, I can't really get a handle on whether it is, or indeed how you'd test that hypothesis in any real depth. TBH, I think this has zoomed so far past my "no side quests" limit that it's not even visible in the rear view mirror, so it's probably best to shelve it as a "cool idea, bro" for now. Learning how to run sensible evals, and how to work out what they're saying, will have to be a task for another day. I will keep on doing these IFT tests for future models, though, just out of interest.

So: let's get back to our regularly scheduled LLM training. Next up: how to upload our models to Hugging Face quickly and easily so that other people can play with them.

ptrchm 6 days ago

Switching to Linux After 19 Years on macOS

In October, I decided to try to switch my main development machine to Linux. After almost two decades in the Apple ecosystem, the change has been both refreshing and challenging. In the early 2000s, I went through a brief Linux phase: I installed Slackware and Mandrake (Ubuntu hadn’t come out yet), managed to get them up and running, and played with KDE. But that was about it. Every piece of software that interested me at the time ran on Windows.


You’re Always Choosing How You Live

I recently finished reading a book that took me a really long time to get through: The Courage to Be Disliked. I highlight regularly when I read, and this one still stood out for the number of passages I wanted to keep. The reason it took me so long, though, is the way it’s written, as a dialogue between a student and a philosopher. I found that format a bit hard to follow at times, and it slowed me down. It’s probably much longer than it needs to be. Still, the gems of Adlerian psychology (its individual psychology and opposition to any kind of dualistic value system that treats the mind as separate from the body) really shine through, distilled for a wider audience and translated into ideas that are easy to follow and relate to. I kept all my highlights, but I’ve also been condensing them into notes so I can easily come back to the ideas that really landed for me. I already know this is a book I’ll revisit (I read it on Kindle, but I also have the physical copy, which I’m eager to highlight with a pen in hand).

Two ideas in particular stood out for me at this stage of my life. The first is the idea of separation of tasks. The core idea is that most relationship problems come from interfering in other people’s tasks, their thoughts, choices, and responsibilities - or letting them interfere in ours. It sounds simple, but it’s not easy to live out, especially in close relationships. Right now, I’m really noticing this in my relationship with my teenager and how he chooses to spend his free time. It’s uncomfortable to sit with the boundary between what’s genuinely my responsibility and what actually belongs to him. I can feel how quickly concern turns into control, or how easily care becomes interference. This idea keeps nudging me to step back and ask: whose task is this, really? I even wrote it down for myself as a note:

His task → How he chooses to spend his free time.
My task → Creating healthy and clear boundaries, consistency, and values/structure in the home.

The second idea that really resonated is building horizontal relationships - relationships where we relate to each other as equals, rather than through hierarchy, control, or superiority. It’s about moving away from power dynamics and toward mutual respect, responsibility, and trust. Not just with friends or colleagues, but in families too. In relation to my teenager, I am guiding, not controlling.

Together, these two ideas feel quietly radical (even though I think we all intuitively know this is how it should be). They challenge a lot of the ways we’re taught to manage, fix, and influence the people around us. They ask for more personal responsibility, more emotional maturity, and more trust in others - and in ourselves.

My mum is staying with us at the moment, and I definitely have opinions about how she spends her time (and she has opinions about mine too, although she’s become less judgmental as she’s gotten older). We’ve talked about these ideas from the book, and now when I catch myself wanting to criticise, I’ll say something like: “I want to tell you to go for a walk instead of playing so much mahjong on your phone because you need to stay fit, but that’s your task, not mine.” We usually laugh. But honestly, it’s not easy. I notice older, more mature people often seem much better at this, more accepting of other people’s tasks, and more focused on simply doing their own (and gently guiding). I wanted to capture these ideas in my notes so I can come back to them regularly.
I think they’re going to shape the direction I’m taking in 2026. Or at least, that’s the hope. I also sprinkled some quotes from the book throughout.

Your life is not determined by your past, your trauma, or your emotions. You are always choosing your way of living right now. Change is possible at any moment—but it requires courage. Adlerian psychology rejects the idea that the past causes who you are (etiology). Instead, it says we act toward goals (teleology). Emotions like anger, anxiety, or fear are tools we use to achieve goals (e.g., avoiding responsibility, asserting power, not changing). Past experiences don’t define you; the meaning you give them does. Personality is a chosen lifestyle, not something fixed. People say they want to change, but often choose not to because the current way of living is familiar and predictable—even if painful. Change means uncertainty, criticism, and possible failure. Unhappiness comes from a lack of courage, not a lack of ability.

“PHILOSOPHER: Don’t you see? In a word, anger is a tool that can be taken out as needed. It can be put away the moment the phone rings, and pulled out again after one hangs up. The mother isn’t yelling in anger she cannot control. She is simply using the anger to overpower her daughter with a loud voice and thereby assert her opinions.”

Every problem ultimately involves relationships with others. Feelings of inferiority only exist because we compare ourselves to others. Superiority complexes (boasting, victimhood, self-pity) are just inverted inferiority. Life becomes painful when it turns into a competition.

“YOUTH: So I am making up flaws in other people just so that I can avoid my life tasks, and furthermore, so I can avoid interpersonal relationships? And I am running away by thinking of other people as my enemies? PHILOSOPHER: That’s right. Adler called the state of coming up with all manner of pretexts in order to avoid the life tasks the “life-lie.””

Life isn’t about winning or losing. Healthy inferiority is comparing yourself to your ideal self, not to others. True freedom comes from withdrawing from comparison altogether.

“PHILOSOPHER: Look, no matter how much you want to be Y, you cannot be reborn as him. You are not Y. It’s okay for you to be you. However, I am not saying it’s fine to be “just as you are.” If you are unable to really feel happy, then it’s clear that things aren’t right just as they are. You’ve got to put one foot in front of the other, and not stop.”

Most relationship problems come from interfering in other people’s tasks or letting them interfere in yours. You are responsible for your actions, not how others react. Let others judge, approve, or dislike you - that’s their task. Trying to control others (even “for their own good”) is manipulation.

“Separating one’s tasks is not an egocentric thing. Intervening in other people’s tasks is essentially an egocentric way of thinking, however. Parents force their children to study; they meddle in their life and marriage choices. That is nothing other than an egocentric way of thinking.”

Wanting approval makes you unfree. Living to meet others’ expectations means living their life, not yours. Freedom means accepting that some people won’t like you. Being disliked is not failure—it’s proof you’re living authentically.

“Many people think that the interpersonal relationship cards are held by the other person. That is why they wonder, How does that person feel about me? and end up living in such a way as to satisfy the wishes of other people.
But if they can grasp the separation of tasks, they will notice that they are holding all the cards. This is a new way of thinking.”

No praising, no rebuking—both are forms of control. Treat others as equals (“equal but not the same”). You’re not trying to dominate, impress, win approval, or avoid being judged. Encouragement replaces judgment. Gratitude builds connection; praise undermines confidence. Horizontal relationships support:

Self-acceptance (you don’t need to rank yourself)
Healthy boundaries (not over-responsible for others)
Courage (you act based on values, not fear of judgment)
Real connection (less performance, more authenticity)

“It is fine to just let go of it. Living in fear of one’s relationships falling apart is an unfree way to live, in which one is living for other people.”

You don’t need to “love yourself” or affirm yourself. Accept what you can’t change; focus on what you can. Worth comes from feeling useful to others, not from being special. Contribution, not recognition, is the source of confidence and courage.

“PHILOSOPHER: You’re wrong. You notice only your shortcomings because you’ve resolved to not start liking yourself. In order to not like yourself, you don’t see your strong points and focus only on your shortcomings. First, understand this point. YOUTH: I have resolved to not start liking myself? PHILOSOPHER: That’s right. To you, not liking yourself is a virtue.”

You’re not the center of the world; you’re part of a community. A sense of belonging is earned by contributing, not demanded. Shift from self-focus (“How am I seen?”) to social interest (“How can I help?”).

“Do not cling to the small community right in front of you. There will always be more ‘you and I,’ and more ‘everyone,’ and larger communities that exist.”

“It is about having concern for others, building horizontal relationships, and taking the approach of encouragement. All these things connect to the deep life awareness of “I am of use to someone,” and in turn, to your courage to live.”

Life is not a straight line or a story, it’s a series of moments. Past and future are excuses we use to avoid living fully now. The greatest life-lie is postponing life. A life lived earnestly in each moment is already complete.

“PHILOSOPHER: The greatest life-lie of all is to not live here and now. It is to look at the past and the future, cast a dim light on one’s entire life, and believe that one has been able to see something. Until now, you have turned away from the here and now and shone a light only on invented pasts and futures. You have told a great lie to your life, to these irreplaceable moments.”

“As long as we postpone life, we can never go anywhere and will pass our days only one after the next in dull monotony, because we think of here and now as just a preparatory period, as a time for patience. But a “here and now” in which one is studying for an entrance examination in the distant future, for example, is the real thing.”

Life has no inherent meaning. You give it meaning through contribution to others. That contribution is the “guiding star” for a free and happy life.

“life in general has no meaning whatsoever. But you can assign meaning to that life. And you are the only one who can assign meaning to your life.”

Simon Willison 6 days ago

LLM predictions for 2026, shared with Oxide and Friends

I joined a recording of the Oxide and Friends podcast on Tuesday to talk about 1, 3 and 6 year predictions for the tech industry. This is my second appearance on their annual predictions episode; you can see my predictions from January 2025 here. Here's the page for this year's episode, with options to listen in all of your favorite podcast apps or directly on YouTube.

Bryan Cantrill started the episode by declaring that he's never been so unsure about what's coming in the next year. I share that uncertainty - the significant advances in coding agents just in the last two months have left me certain that things will change significantly, but unclear as to what those changes will be. Here are the predictions I shared in the episode.

1 year: It will become undeniable that LLMs write good code

I think that there are still people out there who are convinced that LLMs cannot write good code. Those people are in for a very nasty shock in 2026. I do not think it will be possible to get to the end of even the next three months while still holding on to the idea that the code they write is all junk and that any decent human programmer will likely write better code than they will.

In 2023, saying that LLMs write garbage code was entirely correct. For most of 2024 that stayed true. In 2025 that changed, but you could be forgiven for continuing to hold out. In 2026 the quality of LLM-generated code will become impossible to deny.

I base this on my own experience - I've spent more time exploring AI-assisted programming than most. The key change in 2025 (see my overview for the year) was the introduction of "reasoning models" trained specifically against code using Reinforcement Learning. The major labs spent a full year competing with each other on who could get the best code capabilities from their models, and that problem turns out to be perfectly attuned to RL since code challenges come with built-in verifiable success conditions.

Since Claude Opus 4.5 and GPT-5.2 came out in November and December respectively, the amount of code I've written by hand has dropped to a single digit percentage of my overall output. The same is true for many other expert programmers I know. At this point, if you continue to argue that LLMs write useless code you're damaging your own credibility.

1 year: We're finally going to solve sandboxing

I think this year is the year we're going to solve sandboxing. I want to run code other people have written on my computing devices without it destroying my computing devices if it's malicious or has bugs. [...] It's crazy that it's 2026 and I still download random code and then execute it in a way that it can steal all of my data and delete all my files. [...] I don't want to run a piece of code on any of my devices that somebody else wrote outside of a sandbox ever again.

This isn't just about LLMs, but it becomes even more important now that there are so many more people writing code, often without knowing what they're doing. Sandboxing is also a key part of the battle against prompt injection. We have a lot of promising technologies in play already for this - containers and WebAssembly being the two I'm most optimistic about. There's real commercial value involved in solving this problem. The pieces are there; what's needed is UX work to reduce the friction in using them productively and securely.

1 year: A "Challenger disaster" for coding agent security

I think we're due a Challenger disaster with respect to coding agent security [...] I think so many people, myself included, are running these coding agents practically as root, right? We're letting them do all of this stuff. And every time I do it, my computer doesn't get wiped.
I'm like, "oh, it's fine". I used this as an opportunity to promote my favourite recent essay about AI security, the Normalization of Deviance in AI by Johann Rehberger. The Normalization of Deviance describes the phenomenon where people and organizations get used to operating in an unsafe manner because nothing bad has happened to them yet, which can result in enormous problems (like the 1986 Challenger disaster) when their luck runs out. Every six months I predict that a headline-grabbing prompt injection attack is coming soon, and every six months it doesn't happen. This is my most recent version of that prediction! (I dropped this one to lighten the mood after a discussion of the deep sense of existential dread that many programmers are feeling right now!) I think that Kākāpō parrots in New Zealand are going to have an outstanding breeding season. The reason I think this is that the Rimu trees are in fruit right now. There's only 250 of them, and they only breed if the Rimu trees have a good fruiting. The Rimu trees have been terrible since 2019, but this year the Rimu trees were all blooming. There are researchers saying that all 87 females of breeding age might lay an egg. And for a species with only 250 remaining parrots that's great news. (I just checked Wikipedia and I was right with the parrot numbers but wrong about the last good breeding season, apparently 2022 was a good year too.) In a year with precious little in the form of good news I am utterly delighted to share this story. Here's more: I don't often use AI-generated images on this blog, but the Kākāpō image the Oxide team created for this episode is just perfect : We will find out if the Jevons paradox saves our careers or not. This is a big question that anyone who's a software engineer has right now: we are driving the cost of actually producing working code down to a fraction of what it used to cost. Does that mean that our careers are completely devalued and we all have to learn to live on a tenth of our incomes, or does it mean that the demand for software, for custom software goes up by a factor of 10 and now our skills are even more valuable because you can hire me and I can build you 10 times the software I used to be able to? I think by three years we will know for sure which way that one went. The quote says it all. There are two ways this coding agents thing could go: it could turn out software engineering skills are devalued, or it could turn out we're more valuable and effective than ever before. I'm crossing my fingers for the latter! So far it feels to me like it's working out that way. I think somebody will have built a full web browser mostly using AI assistance, and it won't even be surprising. Rolling a new web browser is one of the most complicated software projects I can imagine[...] the cheat code is the conformance suites. If there are existing tests that it'll get so much easier. A common complaint today from AI coding skeptics is that LLMs are fine for toy projects but can't be used for anything large and serious. I think within 3 years that will be comprehensively proven incorrect, to the point that it won't even be controversial anymore. I picked a web browser here because so much of the work building a browser involves writing code that has to conform to an enormous and daunting selection of both formal tests and informal websites-in-the-wild. Coding agents are really good at tasks where you can define a concrete goal and then set them to work iterating in that direction. 
A web browser is the most ambitious project I can think of that leans into those capabilities.

6 years: Typing code by hand will go the way of punch cards

I think the job of being paid money to type code into a computer will go the same way as punching punch cards [...] in six years' time, I do not think anyone will be paid just to do the thing where you type the code. I think software engineering will still be an enormous career. I just think the software engineers won't be spending multiple hours of their day in a text editor typing out syntax.

The more time I spend on AI-assisted programming the less afraid I am for my job, because it turns out building software - especially at the rate it's now possible to build - still requires enormous skill, experience and depth of understanding. The skills are changing though! Being able to read a detailed specification and transform it into lines of code is the thing that's being automated away. What's left is everything else, and the more time I spend working with coding agents the larger that "everything else" becomes.


Dynamic Load Balancer in Intel Xeon Scalable Processor

Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
Jiaqi Lou, Srikar Vanavasam, Yifan Yuan, Ren Wang, and Nam Sung Kim
ISCA'25

This paper describes the DLB accelerator present in modern Xeon CPUs. The DLB addresses a similar problem to the one discussed in the state-compute replication paper: how to parallelize packet processing when RSS (static NIC-based load balancing) is insufficient. Imagine a 100 Gbps NIC receiving a steady stream of 64B packets and sending them to the host. If RSS is inappropriate for the application, then another parallelization strategy would be for a single CPU core to distribute incoming packets to all of the others. To keep up, that load-distribution core would have to be able to process 200M packets per second, but state-of-the-art results top out at 30M packets per second. The DLB is an accelerator designed to solve this problem.

Fig. 2 illustrates the DLB hardware and software architecture (source: https://dl.acm.org/doi/10.1145/3695053.3731026). A set of producer cores can write 16B queue elements (QEs) into a set of producer ports (PPs). In a networking application, one QE could map to a single packet. A set of consumer cores can read QEs out of consumer queues (CQs). QEs contain metadata which producers can set to enable ordering within a flow/connection, and to control relative priorities. The DLB balances the load at each consumer, while honoring ordering constraints and priorities.

A set of cores can send QEs to the DLB in parallel without suffering too much from skew. For example, imagine a CPU with 128 cores. If DLB is not used, and RSS is instead configured to statically distribute connections among those 128 cores, then skew could be a big problem. If DLB is used, and there are 4 cores which write into the producer ports, then RSS can be configured to statically distribute connections among those 4 cores, and skew is much less likely to be a problem.

Fig. 5 (source: https://dl.acm.org/doi/10.1145/3695053.3731026) compares two software load balancers against a configuration that uses the DLB accelerator, and shows that DLB works pretty well: it offers similar throughput and latency to RSS, but with much more flexibility.

AccDirect

One awkward point in the design above is the large number of CPU cycles consumed by the set of producer cores which write QEs into the DLB. The paper proposes AccDirect to solve this. The idea is that the DLB appears as a PCIe device, and therefore a flexible NIC can use PCIe peer-to-peer writes to send packets directly to the DLB. The authors find that the NVIDIA BlueField-3 has enough programmability to support this. Fig. 9 (source: https://dl.acm.org/doi/10.1145/3695053.3731026) shows that this results in significant power savings, but not much of a latency improvement.

Dangling Pointers

I feel like it is common knowledge that fine-grained parallelism doesn’t work well on multi-core CPUs. In the context of this paper, the implication is that it is infeasible to write a multi-core packet processor that primarily uses pipeline parallelism. Back-of-the-envelope: at 400 Gbps and 64B packets, there is a budget of about 40 8-wide SIMD instructions to process a batch of 8 packets. If there are 128 cores, then maybe the aggregate budget is 4K instructions per batch of 8 packets across all cores. This doesn’t seem implausible to me.
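As a sanity check on those numbers, the arithmetic can be spelled out directly. The snippet below assumes minimum-size 64-byte packets and ignores Ethernet framing overhead, so the packet rates are upper bounds.

```python
PACKET_BITS = 64 * 8  # 64-byte packets

pps_100g = 100e9 / PACKET_BITS   # ~195M packets/s, i.e. the ~200M figure above
pps_400g = 400e9 / PACKET_BITS   # ~781M packets/s

# At 400 Gbps, a batch of 8 packets arrives roughly every ~10 ns.
batch_interval_ns = 8 / pps_400g * 1e9

# On a ~4 GHz core that is ~40 cycles, matching the ~40-instruction budget above.
cycles_per_batch = batch_interval_ns * 4.0

print(f"{pps_100g / 1e6:.0f} Mpps, {pps_400g / 1e6:.0f} Mpps, {cycles_per_batch:.0f} cycles per 8-packet batch")
```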
