Posts in Programming (20 found)
Xe Iaso -2 days ago

Giving your Go apps Tigris superpowers

Tigris is S3-compatible, which means you can point the AWS SDK at it and most things just work. The catch is that the Tigris-exclusive features—bucket forking, snapshots, object renaming, and the like—need verbose workarounds because the AWS SDK doesn't know they exist. So we wrote a Go SDK that does. It comes in two flavors: the package is a drop-in replacement for the standard S3 client with first-class methods for the Tigris-specific operations, and is a higher-level client for the common single-bucket case that infers its configuration from the environment so you stop passing the same parameters over and over. You can adopt the Tigris features incrementally without refactoring your existing S3 code, and the simpler API still works against other S3-compatible providers. I wrote up how it works and why we built it over on the Tigris blog.

0 views
David Bushell Yesterday

Are you standard.site?

Standard.site provides shared AT Protocol lexicons. Atproto is just spicy JSON and asymmetric cryptography. I’ve tried to explain atproto in more detail before. Bluesky has always supported a few open graph meta tags which I use to generate images for blog posts. That’s part of the social media game; get in people’s faces as loudly as possible. Now the game has changed! I return Monday ready to work and suddenly I start seeing a fancy new “View publication” button appear in my Bluesky feed. I’ve never wanted nor needed a button before but now that people are rocking buttons, what am I supposed to be, a buttonless pleb? I got my own button it looks like this: Mat Marquis, fellow button connoisseur, was quick with a guide to “Implementing Standard.Site” which I hastily copied. Mat used an atproto explorer to edit records which is akin to rawdoggin’ SQL in production. Given the weekly GitHub and NPM malware party this is probably a safer play than running yourself. I’m never going to remember to publish manually though. I have a janky build script and some experience with the @atcute libraries . How hard can it be? My script begins by generating a manifest of pages by parsing markdown before rendering the HTML template. I added a new step that fetches all atproto records in the collection. It cross-references the paths in my manifest. Any unknown path has the record deleted. It then iterates the manifest and either updates the atproto record (if title or description has changed), or creates a new record if none existed. Finally it adds the atproto URI to the manifest for the element. Now my blog is standard.site and I have a fancy button to prove it. Thanks for reading! Follow me on Mastodon and Bluesky . Subscribe to my Blog and Notes or Combined feeds.

0 views

Barry Hess

This week on the People and Blogs series we have an interview with Barry Hess, whose blog can be found at bjhess.com . Tired of RSS? Read this in your browser or sign up for the newsletter . People and Blogs is supported by the "One a Month" club members. If you enjoy P&B, consider becoming one for as little as 1 dollar a month. I’m a programmer-type from rural Minnesota. I grew up on a farm near a small town. Now I live in a bustling city of 27,000 people…surrounded by farmland. In other words, I’m still in rural Minnesota. I studied computer science at a small private college, which led to my 26-year career programming computers. First it was at an insurance company, then it was at a SaaS startup, and now it’s for myself at a little company I run with my business partner. My hobbies are mostly typical: reading, watching movies, and the occasional video game (meaning Fortnite). My favorite sport is baseball, though I’ll watch the occasional other sport. I also try to do a little woodworking, cooking, and, well, blogging. Blogging is a hobby, yes? I decided to start a blog in 2004. Personal blogs were popping up all over, and I was enjoying meeting new people through the comments section in these blogs. I also have a couple non-blogging friends that were doing their thing on Xanga . The blogs I followed were either friends, friends of friends, about the Minnesota Twins (baseball), or about U.S. news and politics. Online I generally use the handle bjhess. That was what my college gave me for my first ever user account. Toward the end of college I was looking for a domain name, and unfortunately there was already a techy person with my first and last name who grabbed that obvious option. (They still have the domain to this day!) So bjhess.com it was, and the name stuck. I blogged via b2evolution and WordPress in the early days, probably at Dreamhost. In the early 2010s I switched over to a self-hosted and customized install of Scanty , and I ran that for a long time. In 2022 I switched to an HTML-only site . That lasted about a year before I and my colleagues built Pika. I don’t have a system or process for blogging. My inspiration is generally from interesting things happening in my life. That can be a vacation, a recent discovery, an experiment that I’m trying, or a feeling that I’m feeling. Most of my posts are written in a single session, with a couple rounds of editing for grammar, tone, and flow. There’s only been one occasion where I asked others to read my writing before posting. I’ve recently tried the “weekly update” format of posts, which to this point has been me adding links and notes to a draft leading up to finalizing the post on Friday or Saturday. I’m toying with updating the draft post daily throughout the week before publishing, but then if I’m doing that I wonder if I should…just post those daily updates daily? Inspiration comes and goes, but I generally prefer to have quiet while writing, whether that’s natural or simulated via headphones. Aside from that basic need, I don’t strongly believe that physical spaces influence my creativity. However, I’ve been noticing that my office is in a state of constant clutter…and I’m starting to believe. Now the question is whether that clutter impacts the mind or whether the cluttered mind leads to a physical manifestation? A little of both, I think. Today, and for the rest of my life, my blog is hosted at Pika . I write my posts directly in the web editor. I would start my blog on Pika, naturally! I believe pretty strongly that most bloggers probably would be better off not rolling their own static site generators or CMS installations. For those that want to play in that world, though, there’s nothing like it. For the rest of us there are a number of small, independent blogging platforms that make things quite a bit easier. They all tend to play nice together, offering exporting and importing options if you ever find a different platform to be a better fit for your style. If I were paying for my Pika account, it would be $60 per year, and my domain is $13 per year. Not bad for a favorite hobby! I pay $9/month for Plausible analytics, though I’m not entirely sure why. As a programmer, I think it’s mainly that I want a place to look to see any weird happenings to make sure nothing is amiss. If traffic to my blog disappeared, I’d be curious if I did something wrong technically to cause it. All’s fair for monetizing. I don’t do it, but I know affiliate links and such make sense in some contexts. Let me dial up my feed reader here. Okay, for a selection… I’m not sure how Chris Glass keeps his daily photo journal going, but it’s great. Rafał Pastuszak does fascinating things at Untested . Adam Keys is usually thinking . Since Luke moved away from my area, I like to read what’s going through his mind on recursion . I travel vicariously through MacPsych. Maique gives me all the photo inspiration . Holy cats, Jamie Todd Rubin is an avid reader . Brendon Bigley provides cool video game news . Annie lends me insight . Davey and Jamie share lives well lived. I also like to keep up with Derek Sivers , Hugh Howey , Craig Mod , and Cabel Sasser (I still need to read the 2025 snacks rundown). Oh, and, boy howdy, Mike Monteiro . Any of the above who haven’t been interviewed would be a great option to interview next! I won’t be shy–I’m working on Pika and I would greatly appreciate it if you gave Pika a look. Our biggest project at the moment is The Pika Pulse , which will be a great help to discover Pika blogs. I think that’s a good thing for the readers of People and Blogs! Mostly, though, I’d like more people to blog. I want people of all ages and backgrounds sharing their experiences at their own domain online. Whether you do that via Pika or any other setup or service (yes, even WordPress), I’ll be excited! See you online! Now that you're done reading the interview, go check the blog and subscribe to the RSS feed . If you're looking for more content, go read one of the previous 144 interviews . People and Blogs is possible because kind people support it.

0 views
Xe Iaso Yesterday

IPv6 zones in URLs are a mistake

IPv6 is weird. One of the more strange parts of the standard is that every interface's link local addresses are in . If you have a machine with two network interfaces, both of them will be in , so if you have a packet destined to , how do you disambiguate it? The answer is you use IPv6 scopes/zones . The exact format of what goes into a zone is OS dependent, but on Linux it's the interface name and on Windows it's the interface ID. This lets the kernel's routing table know how to handle an address range conflict. On my tower, this would be represented like this: Where is the name of my tower's ethernet device. When you create a host:port bindhost, you normally separate the hostname and port with a colon. IPv6 uses colons to separate hex groups. In order to disambiguate what's the host and what's the port, you typically format the IPv6 address in square brackets, so on port 80 would look like this: And with the right scope it looks like this: Now let's get URL encoding into the mix. From high orbit, you can imagine a URL's format as being something like this: An IPv6 zone would then be part of the hostname, just like with that port 80 example from earlier. So you'd think the URL would be something like this: But if you try to parse this as a URL in Go, you get an error: This happens because URLs can't represent all Unicode values, so any values that don't fit into the grammar of a URL become percent-encoded . This is why sometimes you'll see a in URLs in the wild; that's encoding the ascii space key, which is invalid in URLs. In order to work around this, you need to percent-encode the percent sign in the IPv6 zone: In theory, there is guidance for how to properly handle IPv6 zones in user interfaces in RFC 9844 , but there's no such guidance for URLs . Go also does not seem to follow this RFC in net/url . EDIT: It seems that this behaviour is compliant with RFC 6874 and that this is in fact how it is meant to be done. Our industry confounds me. So in the meantime in order for Anubis to point to IPv6 zoned addresses, you need to encode the with percent encoding. This is horrible, but it seems that this is an edge case that applies to other frameworks, programming languages, and libraries: Maybe some day in the future there will be a better option here. In the meantime my policy of not forking the Go standard library means that this somewhat terrible UX for an edge case is acceptable. I hate it, but what can you do? TL;DR: computers were a mistake. https://trac.nginx.org/nginx/ticket/623 https://github.com/psf/requests/issues/6808 https://datatracker.ietf.org/doc/html/draft-schinazi-httpbis-link-local-uri-bcp-03 -- Browsers don't currently support IPv6 zones because it breaks the concept of an "origin" which is used for many subtle things, this RFC draft attempts to define an zone origin in IPv6 so that browsers have a leg to stand on

0 views

How Other Link Checkers Do Recursion

After I published Five Years of Trying to Add Recursion to lychee , one reply I got was a very fair question: If recursion is so hard, how do other link checkers do it? Plenty of them already crawl websites! This sent me down a rabbit hole of reading the code of other link checkers. The key takeaway is: they didn’t find a clever trick we missed. They were built as crawlers from the very first commit, and I initially built lychee as a stream. I went and read the source of the recursive checkers we list in lychee’s README : muffet (Go), LinkChecker (Python), linkinator (TypeScript), and broken-link-checker (JavaScript). This post is a teardown of how each one actually handles recursion, what it costs them, and what it means for lychee. If you haven’t read the first post , the summary is that lychee was architected as a one-shot, unidirectional pipeline ( ). Recursion needs a cycle (responses create new inputs), and cycles in an async, channel-based pipeline are where the dragons live . 🐲 Five years and four attempts later, the pieces we’ll need to do it properly only just landed. DAGs vs. cycles Every recursive checker I looked at is built from the same three parts: Diagrammatically, lychee is different from the others: Crawlers have a back-edge baked in. Our pipeline doesn’t, and every one of my failed attempts was an effort to bend that back-edge into a graph that was never designed for it. Let’s look at that graph design more closely: Note that the visited check happens in the enqueue step, atomically with the mark, before the worker ever touches the network. That ordering is the entire fix to the deduplication race that haunted lychee’s attempts 1–4, where the cache was written after checking. Each tool uses a variation on it. muffet (Go): a WaitGroup and a Set muffet is closest in spirit to lychee: a fast, single-binary, concurrent website checker. The dedup + scheduling decision lives in one method ( ): is a (a mutex-guarded ). returns whether the URL was already present, so a page is only scheduled the first time it’s seen. Dedup happens at enqueue, synchronized by the set’s mutex. This is basically a line-by-line translation of the diagram above. Checking a page fetches all of its links concurrently, and feeds qualifying ones back into , the back-edge: How muffet knows it’s done muffet’s answer to termination is a little built around a ( ): Every scheduled page increments the group; every completed page decrements it; returns when the count hits zero. The whole crawl bootstraps with a single before , so the counter is positive before anyone waits on it. This is the same counter I tried (and failed with) in Attempt 1 and Attempt 4 . The difference is the invariant: is only ever called from inside an already-running daemon that holds the count above zero (or from the bootstrap). There is no window where the counter briefly reads zero while work is still pending. Go’s enforces this invariant so naturally that it doesn’t feel like distributed termination detection at all, but that’s exactly what it is. It’s the moral equivalent of the primitive Kait contributed to lychee in 2026 . Where the tradeoffs are Concurrency isn’t bounded by the daemon manager. does for every task, spawning unbounded goroutines. The actual limiting happens downstream in a (a buffered-channel counting semaphore) and a per-host throttler pool. muffet separates “the frontier” from “the rate limiter,” which is exactly the separation lychee lacked when it tried to use one bounded channel as both in the past. Cheap goroutines do a lot of heavy lifting. Spawning a goroutine per link is “fine” in Go. The equivalent in Rust ( per link, each needing state) is what pushed me toward and the ownership pain I wrote about . On extensibility, muffet is a focused CLI, not a library. There’s no plugin surface; you get what the flags give you. lychee deliberately ships as a reusable crate, which raises the bar, since every architectural choice has to uphold the standards of a public API. On scalability, unbounded goroutines plus an in-memory visited set scale comfortably to large sites, but there’s no disk-backed frontier, so a truly enormous crawl is bounded by RAM. Same as lychee. Takeaways: muffet LinkChecker (Python): a joinable unbounded queue LinkChecker has existed since the year 2000. It’s a synchronous, thread-pool crawler. Its frontier is a hand-written ( ), a clone of Python’s with / . Look at the very first design comment: It’s explicit about the exact deadlock that bit me. That comment is our Attempt 4 backpressure deadlock , called out and designed around. lychee tried to push discovered URLs into a bounded channel; when it filled, the response handler blocked, no responses drained, no slots freed. Deadlock. 💥 LinkChecker’s answer is brutalist in nature: the frontier is unbounded . Backpressure is enforced elsewhere (a fixed thread count and per-host throttling), never by blocking a producer that is also a consumer. Termination by counter, done right blocks until hits zero ( ): Again: a counter. But the increment in and the decrement in are both inside the queue’s lock, and a worker calls only after fully processing an item including enqueuing its children . So children are counted before the parent is marked done, with no premature zero. It’s semantics implemented with a mutex and a condition variable. Deduplication, before the request LinkChecker writes the URL into its result cache at enqueue time ( ): That sentinel is a “fix” that’s missing in lychee’s attempts. By the time any worker thread checks the URL, the cache already says “mine,” so concurrent discovery from another page is a no-op. Per-host politeness and termination guards The ( ) throttles per host: and calls so a stuck crawl can’t hang forever. Where the tradeoffs are Blocking threads instead of async. Each of the (default 10–100) threads does blocking I/O via . Simple and battle-tested, but the concurrency ceiling is the thread count, and each thread carries a full stack. lychee’s Tokio model reaches thousands of concurrent in-flight requests on a handful of OS threads; LinkChecker can’t, and doesn’t try. The unbounded frontier trades a deadlock for unbounded memory. The explicit “no max size” decision means RAM growth on huge sites. There’s a cap and a periodic to mitigate it. Extensibility is excellent. LinkChecker has a real plugin system ( : anchor checks, SSL, virus scanning, and more) and many output loggers. This is the most extensible of the bunch, and it pays for that with a large, mature, somewhat old-fashioned codebase. On scalability, it’s GIL-bound and thread-limited, so raw throughput is the lowest here, but correctness and feature coverage are high. Takeaways: LinkChecker linkinator (TypeScript): Single-Threaded linkinator is a Node.js checker, and it benefits from something neither Go nor Rust provides: a single-threaded event loop . Check-and-insert into the visited set is atomic for free , because no two callbacks run simultaneously. The frontier is a concurrency-limited (a p-queue-style structure). Termination is one line in ( ): is the library’s termination detection: it resolves when the queue is empty and no task is in flight. Same idea as muffet’s and LinkChecker’s , just expressed as a promise and backed by a single-threaded runtime, so no Mutex is needed to protect the visited set. The back-edge and the race-free dedup When crawling, GETs the page, extracts links, and for each new URL re-enters the queue ( ): Because JavaScript is single-threaded, the entire thing executes without interruption. In Rust or Go, that’s a critical section you must guard with a mutex (and get the ordering right); in Node it’s just three statements. This is the single biggest reason recursion is easier in Node than in Rust. It’s just a language feature. linkinator also keeps a of keys, and a map so it can wait on an in-flight check and still report a duplicate broken link against every parent that references it. Those reuse-operations are themselves pushed onto the same queue, so correctly waits for them too. HEAD vs GET linkinator uses for leaf links but when it needs to crawl, because recursion needs the response body to find more links : This is precisely lychee’s remaining open problem : you can only recurse into pages you fetched with a body. linkinator just always GETs when crawling; lychee plans to reuse the body it already has in cache from the check it just performed. Where the tradeoffs are Single-threaded is both a blessing and a ceiling. No data races, trivially correct dedup, but HTML parsing is CPU work that blocks the one event loop. For thousands of pages, you’re bound by a single core. lychee’s multi-threaded runtime parses and checks in parallel. It suffers from in-memory result inflation. The source explicitly comments on “massive result inflation for heavily interlinked sites”: the array, , and all grow with the crawl. Fine for a docs site, heavy for a giant one. Rate limiting is reactive, not proactive. There’s a that backs off per host on a with , but no general per-host concurrency cap like lychee’s . linkinator can hammer a host until it complains; lychee now paces before the complaint. For extensibility, it’s an ( , , and so on), so it’s embeddable and scriptable, which is nice. It’s a library first, like lychee. Takeaways: linkinator broken-link-checker (JavaScript): event-driven, using two queues broken-link-checker (BLC) takes the event-driven model furthest. It’s built on , a queue with (concurrency) and , and it nests two of them: a site-level queue feeding a page-level . The frontier and dedup live in ( ). Visited pages are tracked in a , written at enqueue time: Recursion is governed by a filter that decides whether a discovered link becomes a crawled page: Termination by event cascade BLC has no counter and no . It rides the queue’s drain events. When the page-level queue empties it fires , which makes emit and call the site queue’s callback; when the site queue drains, it fires . That’s the public : That’s their termination detection, expressed as “the request queue reported empty.” And in classic Node.js fashion, the callback is what actually tells the site queue to free up a slot for another site. So the termination of one site is what allows another to start, and the termination of the whole crawl is what allows the process to exit. It’s a cascade of events that propagates from the page queue to the site queue to the process. Where the tradeoffs are It’s the best web citizen of the bunch. robots.txt is honored ( , ), is respected, and plus are first-class. This is a crawler that’s polite by default. Event cascades are powerful but fiddly. Termination is spread across half a dozen event handlers and two nested queues. It works, but the control flow is much harder to follow than . This is the JS cousin of the “leaky abstraction” problem I described, where recursion-awareness ends up sprinkled across many handlers. It’s single-threaded, the same ceiling as linkinator, plus the in-memory per site. On maturity versus momentum, it’s very widely used (it powers a lot of tooling), but development has slowed. The architecture is still sound and worth studying. Takeaways: broken-link-checker A note on markdown-link-check and the “industrial” crawlers Our README marks markdown-link-check as supporting recursion, but there’s some nuance there: it recurses over Markdown files , not by spidering a live website. There’s no HTTP frontier and no termination problem in the sense above. Worth a mention so the comparison is honest, not worth a teardown. If you want to see the pattern at full industrial scale, look at Scrapy (Python/Twisted) or Colly (Go). Both use the same approach: a scheduler (frontier) with a pluggable, optionally disk-backed queue, a dupefilter (often a Bloom filter rather than a ), a bounded downloader pool, and explicit “engine idle → close spider” termination. They solve exactly the problems lychee struggled with ( distributed termination detection , backpressure, dedup), just with years of dedicated crawler engineering behind them. The takeaway isn’t “lychee should be Scrapy”: it’s that crawling is a well-trodden architecture, and lychee is simply standing on a different one right now. Side-by-side Tool Lang / runtime Concurrency model Frontier “Done?” signal Dedup point Per-host limiting muffet Go, goroutines goroutine pool + semaphore + host throttler mutex-guarded set + daemon channel visited set at enqueue host throttler pool LinkChecker Python, threads fixed blocking thread pool unbounded joinable-queue counter ( ) result cache at (req/s) linkinator Node, event loop single-thread + p-queue ( ) p-queue at enqueue (race-free) reactive broken-link-checker Node, event loop ( ) nested request queues queue-drain events at enqueue + lychee (2026) Rust, Tokio tasks + channels + per-host pool lychee in 2026 finally has a column-for-column match. The is muffet’s and LinkChecker’s . The is BLC’s / and LinkChecker’s . The per-URI mutex is everyone’s enqueue-time dedup. So Why Couldn’t We Just Copy Them? Three reasons, in increasing order of how much they’re actually lychee’s fault. They started as crawlers; lychee started as a stream. Every tool above has a back-edge in its core data structure. lychee’s core was a DAG optimized for the 99% case (a list of files/URLs, checked once, fast). Retrofitting a cycle onto a pipeline is much harder than having one from the start. The problem is architectural in nature. The frontier and the rate-limiter must be different objects. muffet (set + semaphore), LinkChecker (unbounded queue + thread count), linkinator (p-queue + delayCache), BLC (request queue + maxSockets) all keep “what to do next” separate from “how fast to go.” lychee’s early attempts tried to make one bounded channel serve both roles, and a cycle through a bounded channel deadlocks. The fix (lychee’s plus a over an unbounded work source) is the same separation we’re aiming for now. Single-threaded runtimes get dedup for free. Both Node tools dedup with a plain and zero locking, because the event loop serializes access. Go and Python pay a mutex. Rust pays a mutex and fights the borrow checker about who owns the shared state across . That’s the ~30% “Rust tax” I estimated last time : not the algorithm, but the friction of expressing shared mutable frontier state under . None of this is a knock on lychee’s design. A unidirectional stream is the right call for the common, non-recursive case: it’s why lychee is fast and why the 30% channel regression from Attempt 2 was a dealbreaker. The other tools pay for their back-edge on every run, recursive or not. lychee refused to, and that principle is exactly why recursion took five years and why, when it lands, it won’t slow down the path everyone actually uses. I believe that we can have our cake and eat it too: a crawler architecture that supports recursion without sacrificing the speed of a one-shot pipeline. But it’s a harder problem than just “copy what they do,” because most link checkers didn’t start with uncompromising performance as their top goal. Key takeaways So when someone asks “how do other link checkers do recursion?”, the real answer is: they made it a part of the architecture from the beginning, and they leaned on a runtime (providing conveniences like a , a joinable queue, an idle promise) that solved termination without solving “distributed termination detection.” Thanks to the maintainers of muffet, LinkChecker, linkinator, and broken-link-checker: reading your source is the clearest way to learn about crawler architecture out there and we’re all in this together, just with a different set of tradeoffs. A mutable work queue (let’s call it “frontier”), not a fixed input stream. Discovered URLs go back into the same queue they came from. A visited set that’s updated at enqueue time (before the request completes), so two pages discovering the same link can’t both submit it. A primitive that answers “is everything done?”: a , a joinable-queue counter, an promise, or a queue-drain event. Concurrency isn’t bounded by the daemon manager. does for every task, spawning unbounded goroutines. The actual limiting happens downstream in a (a buffered-channel counting semaphore) and a per-host throttler pool. muffet separates “the frontier” from “the rate limiter,” which is exactly the separation lychee lacked when it tried to use one bounded channel as both in the past. Cheap goroutines do a lot of heavy lifting. Spawning a goroutine per link is “fine” in Go. The equivalent in Rust ( per link, each needing state) is what pushed me toward and the ownership pain I wrote about . On extensibility, muffet is a focused CLI, not a library. There’s no plugin surface; you get what the flags give you. lychee deliberately ships as a reusable crate, which raises the bar, since every architectural choice has to uphold the standards of a public API. On scalability, unbounded goroutines plus an in-memory visited set scale comfortably to large sites, but there’s no disk-backed frontier, so a truly enormous crawl is bounded by RAM. Same as lychee. muffet’s termination is a , full stop. It’s the design lychee converged on after five years; muffet got it for free from Go’s standard library on day one. The frontier and the concurrency limiter are separate things. A mutex-guarded set is the frontier; a semaphore plus host throttler bounds concurrency. Conflating them is what deadlocked lychee. Goroutines hide the cost that Rust makes you pay explicitly. The same per-task model that’s trivial in Go is where Rust’s /ownership friction shows up. Blocking threads instead of async. Each of the (default 10–100) threads does blocking I/O via . Simple and battle-tested, but the concurrency ceiling is the thread count, and each thread carries a full stack. lychee’s Tokio model reaches thousands of concurrent in-flight requests on a handful of OS threads; LinkChecker can’t, and doesn’t try. The unbounded frontier trades a deadlock for unbounded memory. The explicit “no max size” decision means RAM growth on huge sites. There’s a cap and a periodic to mitigate it. Extensibility is excellent. LinkChecker has a real plugin system ( : anchor checks, SSL, virus scanning, and more) and many output loggers. This is the most extensible of the bunch, and it pays for that with a large, mature, somewhat old-fashioned codebase. On scalability, it’s GIL-bound and thread-limited, so raw throughput is the lowest here, but correctness and feature coverage are high. The unbounded frontier is a deliberate anti-deadlock choice, documented in a one-line comment. It describes the exact problem we hit in lychee in attempt 4. Dedup at time (a placeholder in the cache) is their synchronization mechanism. The cache must claim the URL before the request, not after. Threads buy simplicity at the cost of throughput. A blocking thread pool is the easiest correct model… and the slowest one. Single-threaded is both a blessing and a ceiling. No data races, trivially correct dedup, but HTML parsing is CPU work that blocks the one event loop. For thousands of pages, you’re bound by a single core. lychee’s multi-threaded runtime parses and checks in parallel. It suffers from in-memory result inflation. The source explicitly comments on “massive result inflation for heavily interlinked sites”: the array, , and all grow with the crawl. Fine for a docs site, heavy for a giant one. Rate limiting is reactive, not proactive. There’s a that backs off per host on a with , but no general per-host concurrency cap like lychee’s . linkinator can hammer a host until it complains; lychee now paces before the complaint. For extensibility, it’s an ( , , and so on), so it’s embeddable and scriptable, which is nice. It’s a library first, like lychee. is the termination mechanism. Simple and provided by the JS runtime. A single-threaded event loop makes request deduplication pretty much free. This is the biggest structural reason recursion is easier in that case. Reactive 429 backoff is not the same as proactive per-host pacing. lychee’s aims higher, at the cost of more machinery. It’s the best web citizen of the bunch. robots.txt is honored ( , ), is respected, and plus are first-class. This is a crawler that’s polite by default. Event cascades are powerful but fiddly. Termination is spread across half a dozen event handlers and two nested queues. It works, but the control flow is much harder to follow than . This is the JS cousin of the “leaky abstraction” problem I described, where recursion-awareness ends up sprinkled across many handlers. It’s single-threaded, the same ceiling as linkinator, plus the in-memory per site. On maturity versus momentum, it’s very widely used (it powers a lot of tooling), but development has slowed. The architecture is still sound and worth studying. Termination is a cascade of queue-drain events, not a counter. Same idea, different syntax. Politeness is built in. robots.txt, , and make it the most server-friendly recursive checker by default. Event-driven control flow is the cost. Distributing recursion logic across many handlers is exactly the kind of spread-out complexity that makes the feature hard to reason about. There is no secret sauce. Every recursive checker is a worklist plus a visited set plus a quiescence detector. The “trick” is being shaped like a crawler from commit one. Termination is always the same idea wearing different clothes: (muffet), joinable-queue counter (LinkChecker), (linkinator), queue-drain events (BLC), (lychee 2026). All of them are distributed termination detection. Dedup belongs at enqueue, before the request. Marking a URL visited after checking it (what lychee did for four attempts) is the bug. Everyone else claims the URL the moment it enters the frontier. Separate the frontier from the rate limiter. A bounded channel that is both your queue and your backpressure will deadlock the instant you add a cycle. There is no free lunch. Node’s single thread makes dedup trivial at the cost of performance; Go’s goroutines and make termination trivial at the cost of a runtime; Rust gives you neither for free but hands you a compiler that refuses to let the races compile and you can get the network card to glow if you know exactly what you are doing.

0 views
Max Bernstein 3 days ago

A survey of inlining heuristics

Compilers, especially method just-in-time compilers, operate on one function at a time. It is a natural code unit size, especially for a dynamic language JIT: at a given point in time, what more information can you gather about other parts of a running, changing system? I don’t have any data to back this up—maybe I should go gather some—but on average, methods are small. Especially in languages such as Ruby that use method dispatch for everything, even instance variable (attribute, field, …) lookups, they are small . And everywhere. This makes the compiler sad. If we are to continue to anthropomorphize them, compilers like having more context so they can optimize better. Consider the following silly-looking example that is actually representative of a surprising amount of real-world code: Right now, in the method, I count 8 different method calls: (Technically more, but the ivar lookups (including !), addition, and subtraction are generally specialized and don’t push a frame, even in the interpreter.) Furthermore, there are at least two heap allocations: one for each instance. Last, there is a bunch of memory traffic to and from instances. This all is a huge bummer! What should be a simple math operation is now overwhelmed with a bunch of other stuff. is certainly not a zero-cost abstraction. Even if we had a bunch of other optimizations such as load-store elimination or escape analysis, they would not be able to do much: pretty much everything escapes and is effectful. That is, unless we inline . Inlining is the lever that enables a bunch of other optimization passes to kick in. I wrote about the design and implementation of Cinder’s inliner ( FB link , personal blog link ) a couple of years ago. I wrote about arguably the simplest part, which is copying the callee body into the caller. It took me at least a week to get working. Probably closer to months if you consider all the plumbing through the rest of the JIT. In February during a small hackathon, I watched my colleague k0kubun prototype that bit of the inliner inside ZJIT in about 30 minutes. There is more to do when pretty much every part of the VM is observable from the guest language: both Python and Ruby allow inspecting the state of the locals, the call stack, etc from user code. Sampling profilers also expect some amount of breadcrumbs to work with to inspect the stack. So there’s some more machinery still required to pretend like the callee function was not inlined. I talk about this a little bit in the Cinder blog post. Even so, all of that can probably be designed and wired together in a couple of months. Then you will find yourself tuning the inliner for the next 10 years. This is much harder. The thing that makes inlining difficult, especially in a method JIT, is that you are trying to make an entire (dynamic!) system faster but you are only looking through a microscope and only capable of local reasoning 1 . Whereas other optimizations such as strength reduction, inline caches, and value numbering are an un-alloyed good for the generated code, inlining can have negative effects . It is also perhaps the first optimization people add that has non-local impact. If you inline wrong, your code size might blow up. This might thrash your CPU’s caches. Bummer, but happens to the best of us. But also, if you inline wrong, you might get in the way of other helpful optimizations: if you hit some size limit after inlining method A, you might never get to inline B, which is the key to unlocking the performance of the method you are trying to optimize. Last, inlining might hurt compile time. In situations where latency is paramount (think: interactive client JavaScript), adding tons more code into the fray might add noticeable hiccups, even if the long-term throughput improves. As always, in-band compilation is a trade-off because any time you spend compiling, you are not executing code . You have to write your compiler to reason about all of this stuff. So you have heuristics. For example, here is Michael Pollan’s inliner heuristic: Inline methods. Mostly small. Not too many. I did a survey of a bunch of compilers, mostly JIT compilers, to see what their inlining heuristics look like. I also read (skimmed) some papers to see what those folks had to say. I wonder if they agree. This post was a long time coming. I started working on it about five years ago but then when I quit working at Facebook I accidentally left behind all of the inliner research I did for Cinder’s inliner. So then I kind of just thought about it aimlessly for a while before redoing it this year. Anyway, here’s wonderwall. Spoiler alert: all in all, people tend to look at: And also have different interesting ways to pipe in profile information. Last, some newer papers do some wild stuff: Another thing to consider in inlining is how you gather and interpret profiles. When you compile a function, you tend to specialize it based on the input it has historically been given. For a monomorphic input, maybe you guard that the type is still the same and otherwise jump into the interpreter. For a polymorphic input, maybe you check the top K (~4) common cases and otherwise jump into the interpreter. Fine. But sometimes you can be compiling a polymorphic method that is actually monomorphic in its caller . That is, might only ever pass one kind of input to , but other callers pass all kinds of stuff. Here is a bit of a silly example to show what I mean: Just kidding, not so silly at all. It’s a super common pattern in Rails . It makes polymorphic in even though for many of its callers, it may well be monomorphic (or even a constant). In order to plumb this information through to the compiler, you have to figure out this call context relationship. There are a couple of common ways to do it. YJIT, for example, though it does not inline, splits methods based on the types of the arguments going in. This means that it clones the compiled code, generating a new version for each context. This does not give call context (“A calls B”) but gives type context (“B is called with integers, B’ is called with strings”). A compiler could do type-based splitting in the interpreter or a baseline tier. If you don’t fancy duplicating the code, you can instead duplicate the profiles. You could either do this using type context (as above) or using call context. SpiderMonkey, for example, does “trial inlining” that allows callers to pass down a bit of memory for potential inline candidate callees to record their inline caches. Instead of each function holding its own ICScript, the caller allocates a unique ICScript for that potential-inline call-site. This gives each callee function (at least?) one level of call context. Later, when inlining the callee into the caller, we don’t have other callers’ type information polluting the IR builder (or whatever reads the profiles). JavaScriptCore handles this by inlining bytecode into other bytecode. This is a gnarly transformation but gives the interpreter, even (!) access to call context. On tier-up to the compiler, all the inlining decisions have been made already. HotSpot handles this with multiple tiers. The interpreter tiers up to the client compiler, C1. C1 profiles branch and call targets in compiled code. C1 may eventually recompile based on this new information. C1 may eventually tier up to C2, which copies C1 inlining decisions. This way, we get call context in profiles via inlining. One last thing you could do is just trust your type inference and branch folding in the optimizer. You could inline and do polymorphic specialization in the callee when building the IR, then hope that your branch pruning monomorphizes the inlined callee. It’s a little wasteful because the polymorphic code is built “for nothing”, but it might work fine? Okay, onto the collected notes and half-baked commentary. Here’s a survey of a bunch of JIT compilers and how they reason about inlining heuristics. But before we get into that, thanks to Iain Ireland, CF Bolz-Tereick, and Ian Rogers for feedback on this blog post! What follows is mostly a “bits and bobbles” section a la Phil Zucker . We’ll start with Cinder , because when I wrote Cinder’s inliner I added only the simplest heuristics, mostly “don’t inline” signals. Over time, after I left, people tuned it a bit more. The inliner starts from the caller CFG, walking it to find suitable inlining candidates. Inlining candidates are only for call targets that are known—in Cinder’s case, only for monomorphic call targets—and pass some checks. The callee is only known by it’s function object, which includes its bytecode. There is no IR available for the callee until we decide to inline. Most of the “can’t handle this” checks are related to argument handling. Python has a pretty complex calling convention, so if the caller/callee have not agreed on how the arguments should be passed through, the inliner doesn’t care to try and figure it out on its own. That is the responsibility of other parts of the compiler . Things in this function could be considered “TODO”. Failures are logged so they can be analyzed. If the Cinder team determines that there is some very frequent case they should handle, they will find out from the logs. The inliner collects all candidate call instructions in one pass over the CFG. It loads the configurable “cost limit” from the options struct. Then it does one pass over the inlining candidates vector, inlining until it (maybe) hits the cost limit. It does some graph maintenance work after inlining these calls, but that’s it. This approach gets a surprising amount of utility for being so simple: it inlines constants (quite a few methods look like ), small methods, and (at least, as far as I can remember) shrinks the compiled code size. All for very little compile time overhead. There’s one other “standalone” Python JIT out there, PyPy. So we should look at that too. There are two inliners in PyPy. One is inside the RPython to C translation pipeline, which acts more like an ahead-of-time compiler 2 . Then there is the tracing JIT bit, which has its own optimizer and heuristics. We’re going to look at the latter. I talked to CF Bolz-Tereick about the inliner and their comment was that PyPy’s inlining heuristic is “yes”. There are a couple of exceptions, such as not inlining recursive functions or functions with loops. But the basic idea of tracing includes tracing through call instructions, which naturally means that you are “inlining”. PyPy also does this neat thing where they treat frame pushes like normal allocation. Frame pushes, frame reads, and frame writes get written to the trace like normal object memory traffic and can get optimized away like other field reads and writes. This means that they can “just” use DCE to eliminate frame pushes and pops, whereas Cinder has some complicated mechanism to do it (which is my fault). TODO get more details here V8 is a JS engine and it has over the years had many execution approaches. We’ll look at three of them since they all have or had their place in the history: They also each inline at different times in the pipeline, which made for a fun time trying to understand the different codebases. Inlining happens during Hydrogen graph building Don’t store function bytecode of all functions; need to re-parse callee text source to inline Heuristics https://github.com/tekknolagi/v8/blob/a969ab67f8e1e7475d9b26468225c3a772890c64/src/crankshaft/hydrogen.cc#L7807 https://docs.google.com/document/d/1VoYBhpDhJC4VlqMXCKvae-8IGuheBGxy32EOgC2LnT8/edit https://github.com/v8/v8/blob/036842f4841326130a40adfcff38f85a9b4cd30a/src/compiler/js-inlining-heuristic.h#L14 When optimizing, add call instructions to the inline candidates list: https://github.com/v8/v8/blob/1a391f98cc7a9196369f2d6cab7df35ffbe92c08/src/maglev/maglev-graph-optimizer.cc#L1271 https://github.com/v8/v8/blob/036842f4841326130a40adfcff38f85a9b4cd30a/src/maglev/maglev-inlining.h#L36 Unlike for example Cinder, Maglev looks like it does not have a lot of restrictions about what can get inlined into what, so its “can inline” signal is about budget. Actually two budgets: small budget and normal budget. Then its inlining loop is a greedy walk of the to-inline queue checking candidate sizes. It runs this loop (which drains the queue) interleaved with the optimizer (which populates the queue). Confusingly, though, the optimizer also calls another function called which checks if it legally can inline: appears unused? / dead declaration? maybe src/maglev/maglev-graph-builder.cc is just not working on github search also unused / dead declaration same JavaScriptCore is funky! Unlike these other compilers that do inlining in their neat little SSA IRs, JSC inlines at the bytecode level 4 . This is their way of making sure that they get at least one level of call context into their interpreter inline caches, which will eventually give better information to the compiler. JSC only inlines based on bytecode profile information, and only inlines bytecode?? TODO find better sources for bytecode inlining SpiderMonkey has another way of getting that call contet without doing bytecode inlining: they add call context to their inline caches. Methods can pass down an ICScript to their callees where the callee writes its inline cache information. Then, when compiling, the callee is more likely to be monomorphized. https://github.com/mozilla-firefox/firefox/blob/438a3ce10eb77fb50d968463b7741117aec5bb4a/js/src/wasm/WasmHeuristics.h#L213 SpiderMonkey ICScript https://fitzgen.com/2025/11/19/inliner.html Plan: run in interpreter; tier up to C1; profile call targets; inline in C1; profile branch counts; tier up to C2, which copies C1 inlining decisions in bytecode parser https://github.com/openjdk/jdk/blob/a05d5d2514c835f2bfeaf7a8c7df0ac241f0177f/src/hotspot/share/opto/bytecodeInfo.cpp#L116 https://github.com/openjdk/jdk/blob/497dca2549a9829530670576115bf4b8fab386b3/src/hotspot/share/opto/bytecodeInfo.cpp#L197 https://github.com/openjdk/jdk/blob/497dca2549a9829530670576115bf4b8fab386b3/src/hotspot/share/opto/parse.hpp#L42 https://github.com/openjdk/jdk/blob/497dca2549a9829530670576115bf4b8fab386b3/src/hotspot/share/opto/doCall.cpp#L185 Not too small Walk up the call stack to figure out what to compile Handling the right thing to inline: def foo(a) = a.each {|x| x } want to compile , inline each, inline block, not compile block separately (probably) https://bernsteinbear.com/assets/img/design-hotspot-client-compiler.pdf https://github.com/openjdk/jdk/blob/d854a04231a437a6af36ae65780961f40f336343/src/hotspot/share/c1/c1_GraphBuilder.cpp#L755 https://github.com/openjdk/jdk/blob/d854a04231a437a6af36ae65780961f40f336343/src/hotspot/share/c1/c1_GraphBuilder.cpp#L3854 heuristics: TruffleRuby uses weighted compile queue Graal https://ieeexplore.ieee.org/document/8661171 https://github.com/dotnet/runtime/blob/2d638dc1179164a08d9387cbe6354fe2b7e4d823/docs/design/coreclr/jit/inlining-plans.md https://github.com/dotnet/runtime/blob/0b3f3ab1ecf4de06459e5f0e2b7cb3baf70ef981/src/coreclr/jit/inline.def#L94 https://github.com/dotnet/runtime/blob/0b3f3ab1ecf4de06459e5f0e2b7cb3baf70ef981/src/coreclr/jit/inlinepolicy.cpp https://github.com/dotnet/runtime/blob/0b3f3ab1ecf4de06459e5f0e2b7cb3baf70ef981/docs/design/coreclr/jit/inline-size-estimates.md?plain=1#L5 https://github.com/dotnet/runtime/blob/0b3f3ab1ecf4de06459e5f0e2b7cb3baf70ef981/src/coreclr/jit/fginline.cpp https://github.com/dotnet/runtime/issues/10303 https://github.com/AndyAyersMS/PerformanceExplorer/blob/master/notes/notes-aug-2016.md https://github.com/dart-lang/sdk/blob/391212f3da8cc0790fc532d367549042216bd5ca/runtime/vm/compiler/backend/inliner.cc#L49 https://github.com/dart-lang/sdk/blob/391212f3da8cc0790fc532d367549042216bd5ca/runtime/vm/compiler/backend/inliner.cc#L1023 https://web.archive.org/web/20170830093403id_/https://link.springer.com/content/pdf/10.1007/978-3-540-78791-4_5.pdf An adaptive strategy for inline substitution (PDF) tracelet based https://github.com/facebook/hhvm/blob/eeba7ad1ffa372a9b8cc9d1ec7f5295d45627009/hphp/runtime/vm/jit/inlining-decider.h#L89 https://github.com/LineageOS/android_art/blob/8ce603e0c68899bdfbc9cd4c50dcc65bbf777982/compiler/optimizing/inliner.h https://github.com/JikesRVM/JikesRVM/blob/5072f19761115d987b6ee162f49a03522d36c697/rvm/src/org/jikesrvm/compilers/opt/inlining/DefaultInlineOracle.java#L55 Partial inlining Understanding and Exploiting Optimal Function Inlining (PDF) machine learning Automatic construction of inlining heuristics using machine learning Machine-Learning-Based Optimization Heuristics in Dynamic Compilers (PDF) Guiding Inlining Decisions Using Post-Inlining Transformations (PDF) U Can’t Inline This! (PDF) Towards better inlining decisions using inlining trials RhizomeRuby inlining An Optimization-Driven Incremental Inline Substitution Algorithm for Just-in-Time Compilers (PDF) Automatic Tuning of Inlining Heuristics (PDF) Inlining-Benefit Prediction with Interprocedural Partial Escape Analysis (PDF) Inlining of Virtual Methods (PDF) A Study of Type Analysis for Speculative Method Inlining in a JIT Environment (PDF) A Comparative Study of Static and Profile-Based Heuristics for Inlining (PDF) clusters from Custom benefit-driven inliner in Falcon JIT (PDF) https://github.com/oracle/graal/blob/5dde777cba22a99ebe3f19745d03ddfbc35c563c/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/phases/common/inlining/policy/GreedyInliningPolicy.java https://github.com/oracle/graal/blob/5dde777cba22a99ebe3f19745d03ddfbc35c563c/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/phases/common/inlining/InliningPhase.java https://github.com/oracle/graal/blob/5dde777cba22a99ebe3f19745d03ddfbc35c563c/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/phases/common/inlining/info/elem/InlineableGraph.java#L148 There are some newer papers, especially in Java land, that try to do a lot of analysis ahead-of-time and bundle the resulting information in .class files. Then the JIT can read it and see more than local context. Or, if you are an AOT compiler, you can probably do a lot more whole system reasoning—both for time budget reasons and also because you can see more functions at once.  ↩ Check it out if you like. I stumbled across it by accident.  ↩ See also “Turbolev”, which seems to merge Maglev (CFG) with Turbofan (Sea of Nodes)… somehow.  ↩ Potentially a misunderstanding based on a private conversation. I’m working on tracking down the implementation…  ↩ Profiles of call target Cumulative caller size (increasing as callees get inlined) Callee size Inline depth Number of inlined calls at a certain depth If recursion is present Callee/caller call count ratio (if callee only called less than K% of calls to caller, don’t inline callee) Callee stack usage Polymorphism in callee What mode the compiler is in (baseline vs more aggressive) If the callee looks like it always raises/throws Train neural networks to make inlining decisions Let inlining drive the entire optimization pipeline, treating it as a search heuristic over a BFS walk of the call graph Use AOT-gathered information to aid in JIT heuristics Hydrogen was the first real SSA IR and it looks very familiar to me, having worked on Cinder and now ZJIT. It is now defunct. Turbofan was the replacement, going full Sea of Nodes. In the grand scheme of things it is a pretty fast compiler, but it does not hold back from doing some expensive rewrites. This was recently rewritten from Sea of Nodes to a mode traditional CFG and nicknamed Turboshaft. Maglev is meant to coexist alongside Turbofan, preferring to speculate a little more eagerly and do fewer incremental rewrites in the name of compile time. 3 https://github.com/tekknolagi/v8/blob/a969ab67f8e1e7475d9b26468225c3a772890c64/src/crankshaft/hydrogen.cc#L9236 something about native context check callee AST size against configurable limit check inlining depth against configurable limit don’t inline recursive functions check current cumulative method size (as tracked by AST node count) against configurable limit Find candidates https://github.com/v8/v8/blob/036842f4841326130a40adfcff38f85a9b4cd30a/src/compiler/js-inlining-heuristic.cc#L134 Can inline https://github.com/v8/v8/blob/036842f4841326130a40adfcff38f85a9b4cd30a/src/compiler/js-inlining-heuristic.cc#L75 Force inline small functions https://github.com/v8/v8/blob/036842f4841326130a40adfcff38f85a9b4cd30a/src/compiler/js-inlining-heuristic.cc#L309 Loop over sorted (by comparator) list https://github.com/v8/v8/blob/036842f4841326130a40adfcff38f85a9b4cd30a/src/compiler/js-inlining-heuristic.cc#L847 skip recursion https://github.com/v8/v8/blob/1a391f98cc7a9196369f2d6cab7df35ffbe92c08/src/objects/shared-function-info-inl.h#L421 not called enough (min call frequency) bytecode too big Bytecode inlining https://github.com/WebKit/WebKit/blob/709c3895afd71e0836f8c8be7393e44d41fab7e1/Source/JavaScriptCore/bytecode/CodeBlock.cpp#L2453 DFG https://github.com/WebKit/WebKit/blob/709c3895afd71e0836f8c8be7393e44d41fab7e1/Source/JavaScriptCore/dfg/DFGCapabilities.cpp#L76 https://github.com/WebKit/WebKit/blob/917854a9c245b87b333e23ed4b195505d574a333/Source/JavaScriptCore/dfg/DFGByteCodeParser.cpp#L1703 https://github.com/WebKit/WebKit/blob/917854a9c245b87b333e23ed4b195505d574a333/Source/JavaScriptCore/bytecode/CallLinkStatus.cpp#L294 https://github.com/WebKit/WebKit/blob/d919344236c47b610930636d3310f00380624d43/Source/JavaScriptCore/bytecode/InlineCallFrame.h skip callees with exception handlers (unless explicitly allowed with a CLI flag) skip synchronized callees (unless explicitly allowed with a CLI flag) skip classes with unlinked callees skip uninitialized classes max inline level (default 9) max recursive inline level (default 1) callee bytecode size (max for top level is 35 bytecodes, but falls off by 10% per inline level) callee stack usage (max of 10 slots) max total method size (default 8000 bytecodes) There are some newer papers, especially in Java land, that try to do a lot of analysis ahead-of-time and bundle the resulting information in .class files. Then the JIT can read it and see more than local context. Or, if you are an AOT compiler, you can probably do a lot more whole system reasoning—both for time budget reasons and also because you can see more functions at once.  ↩ Check it out if you like. I stumbled across it by accident.  ↩ See also “Turbolev”, which seems to merge Maglev (CFG) with Turbofan (Sea of Nodes)… somehow.  ↩ Potentially a misunderstanding based on a private conversation. I’m working on tracking down the implementation…  ↩

0 views
Jim Nielsen 4 days ago

An Ode to the Exacting Pedantry of Computers

The very first computer programming class I ever took introduced me to the idea of there being different kinds of numbers, like integers, floats, and doubles (it was a C++ course). “You mean, when I assign a variable, I have to say up front what kind of number this is?” It was such an odd concept to me. A number is a number. Why do I have to say it’s this kind of number or that kind of number? I dropped out of that class. A few years later, I decided I wanted to try programming again. So I took another intro class. This time they were teaching with Python instead of C++, so you can imagine my excitement to learn that I didn’t have to think of numbers in this way anymore! It felt like the computer was meeting me partway. Over time, I came to learn how pedantic computers are. They require a kind of exacting precision in saying what you want them to do. And they’ll only ever do exactly what you tell them to do, nothing more, nothing less. If there was a bug in your program, that wasn’t because the computer was doing something you told it not to. The computer was only ever doing exactly what you told it to do. A “bug” was very likely a flaw in your conception of how the program should execute, not the actual execution. It was a failure on your part to be more precise, to imagine a scenario where something happened that you didn’t anticipate — and therefore didn’t tell the program how to handle. “Do what I mean, not what I say!” But now, with LLMs, that kind of exacting precision in language and thought is disappearing. You can have a thought, ask the LLM to build it, and it will fill in all the details you didn’t specify or anticipate. All those pesky details which previously would’ve made you reflect, “Oh, I didn’t think of that. Maybe I should design this differently…” Or, “Oh, well now that I have to think about this some more, I can see that it might not actually be a very good idea…” The pedantic friction, which seemed like such a nuisance, was actually acting as a kind of tool for sharpening and improving your thinking and output. The exacting nature of the computer required you to think more. LLMs, however, have significantly lessened that friction. You can think less and move faster. And yet, that feels like our job as software makers: to think, to anticipate, to explicitly articulate intent. As a software user, I’d rather folks spend more time thinking so that I, in turn, have better experience. This is preferable to giving me more stuff faster that’s only partly conceived. As an industry it feels like we’re headed in a direction where we think it’s better to ship more faster and fix the effects of half-conceived intent later, than to spend more time upfront discovering, sculpting, and specifying intent. That’s one thing writing code by hand has taught me: intent — what you want to build and how you want it to work — is shaped through the act of articulating it. That hard work is not required of us anymore. The LLM will fill in the details. The exacting pedantry of the computer is going away, and in its place are assumptions about intent — many of which we don’t even know about until our users run into their effects. Reply via: Email · Mastodon · Bluesky

0 views

Yield Not Thy Core

Yield Not Thy Core Achilles Benetopoulos, Peter Alvaro, Andi Quinn, and Robert Soule EUROSYS’26 This paper describes a solution to the placement problem in distributed systems. If you model a computation as a directed graph, how do you optimally distribute the graph among a set of cooperating computers? The authors propose a dynamic placement system and implement it in Magpie . One common solution to the placement problem is to ship data over the network. For example, a set of compute nodes could access data via network requests to a separate set of nodes running Redis servers. At the opposite end of the spectrum, code can be shipped over the network. The canonical example is expressing computation as a SQL query which is sent to the node(s) that hold the relevant data. Magpie proposes a more fluid solution, where both code and data can move dynamically. In Magpie, an object represents data that is operated on. What makes Magpie objects unique is that pointers to data stored in an object are encoded as tuples. This allows Magpie to dynamically move objects around the system without invalidating pointers. The downside of this approach is that it prevents traditional libraries (that rely on raw pointers) from being used in user code. Magpie assumes a high degree of inter-object locality, so any given object is stored by exactly one node (i.e., a single object is never split between multiple nodes). User code is expressed in terms of nanotransactions and epics . A nanotransaction runs to completion on a single node and accesses a pre-specified set of objects. The Magpie runtime ensures that all objects accessed by a given nanotransaction are resident on a single node before executing the nanotransaction. The code for a nanotransaction is simple, because there is no need to query data over the network, and there is no need to deal with locking. If a hazard is present between two nanotransactions, they will execute serially. In Magpie, nanotransactions are written in Rust. An epic is a computation graph where each vertex is a nanotransaction and each edge is a data dependency. In contrast to nanotransactions, a single epic can be distributed across multiple nodes. Magpie schedules nanotransactions once all data dependencies are satisfied. Conflicts between concurrently running epics are handled via snapshot isolation . Any particular epic has a consistent view of each object and may abort in the event of a conflict. Scheduling and data movement are implemented hierarchically. A worker node can locally determine if it has ownership of all dependencies required for a nanotransaction. If this is the case, then the worker node executes the transaction immediately. Otherwise, the worker node uses a local ownership cache to try to determine if another node has all required dependencies and communicates with that node if possible. Failing that, scheduling is performed by a global orchestration node. Fig. 9 compares Magpie to memcached executing a workload that involves a user-specified read-modify-write operation: Source: https://dl.acm.org/doi/10.1145/3767295.3803616 Magpie is able to offer a lower latency because it is able to ship the entire read-modify-write operation to the server that holds the relevant data, rather than requiring multiple roundtrips. Some applications may benefit from being able to indicate that an object is rarely changed and thus can be distributed among multiple nodes at the same time. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts.

0 views
Kaushik Gopal 5 days ago

AI model choices 2026-06

My 2026 Jan AI tool stack. Six months since my last post and the whole list has turned over. I constantly try multiple harnesses but I think I’ve firmly settled on OpenCode paired with OpenChamber as my harness of choice. Stay tuned for a future post on power tips for OpenCode, but I put a lot of these harnesses through the ringer and am really happy with this combo atm. I still drop often into TUI land with OpenCode and while like others I had my dalliance with cmux, I’ve found it’s not great on performance and runs into memory issues. So I’m back to naked Ghostty. OpenCode on the other hand is really good at managin sessions, so often I don’t even find myself needing to use tmux. Also, I use Hermes but having discovered OpenChambers, I don’t find myself needing to reach as often. Again, this deserves a longer post, if you’re curious. I’ve just been blow away by Kimi 2.6. I’ve found it often keeps pace with GPT 5.5 with Medium reasoning. There have even been time it’s results matched Opus 4.8 (though Opus typically gets the results one-shot). I’m not sure if I’ve engineered my harness in some way to work better with Kimi, but dang I love the results I’m getting. If you want to give it a shot, I recommend Opencode’s $10 “Go” plan and push Kimi 2.6 hard. I don’t use any of the models as therapits/friends to have conversations and I typically have a heavy handed instruction on how it should communicate with me, so a lot of the complaints around Opus 4.8 being “rough” or GPT 5.5 being soul-less 🙄, don’t bother me as much. If I do go the exec-plans route, I start with GPT 5.5 and hand it over to Kimi 2.6 for execution.  ↩︎ If you’re curious subscribe to my developer podcast’s newsletter where I post on some of the techniques I’m using.  ↩︎ Kimi 2.6 has become my overall workhorse model. By default I start most AI sessions with Kimi 2.6 now. GPT 5.5 remains my coding model of choice. My detailed planning, creating of exec-plans 1 , code review, simplification, and one-shot feature changes, reliably happen with GPT 5.5 (high). Opus 4.8 for deep thinking, writing and overall hard tasks. It’s been four days since the release, so my opinion is still forming but I’ve been fascinated how quickly Opus 4.8 is giving me the right solutions, especially for the slightly more complex problems. - I still only reach out to, when other models are struggling, cause 💸🔥 Gemini for anything image, video, or audio. Nothing else is close. It has collapsed work that used to eat hours 2 of mine. If I do go the exec-plans route, I start with GPT 5.5 and hand it over to Kimi 2.6 for execution.  ↩︎ If you’re curious subscribe to my developer podcast’s newsletter where I post on some of the techniques I’m using.  ↩︎

0 views
Unsung 5 days ago

Writing about fonts

In last week’s post , I made an off-hand comment about Vercel’s Geist Pixel announcement , and I thought it might be interesting to turn this into more of a full-fledged critique. I don’t think it’s a good announcement, but its flaws are pretty universal, so I want to put words to these flaws. This will extend to a lot of other writing about design, not even necessary even just about typography. Here’s my advice that I believe would make announcements like this better: I know the elephant in the room here is “how big companies do things.” A lot of redesign announcements and font unveils exist chiefly to make the execs who started it happy, and perhaps as fodder for future promotion – I bet the whole “Already shaping what’s next” section isn’t really written for external audience – and they get chewed by the big PR machine that often files away whatever personality and quirkiness might have been there. Your job is to fight the machine! But I acknowledge that it might be hard. However, I’ve also seen all this seeping into personal font announcements, which is unfortunate. (I don’t want to link to specific examples, since that’d be punching down.) Also, this is not just about the joy of reading or some general notion of “craft” – although they are important, too. This is also purely informational. I feel I haven’t learned enough from the Geist Pixel announcement for the amount of time I spent with it. I don’t understand “multiple variants for different densities and use cases” or “semi-mono approach” or what stylistic sets are included. (My general goal is to write in a way that people can learn something new from any design announcement, even if they don’t have any prior context, and if they never actually use the font.) It‘s a shame, because the work itself seems thoughtful and excellent, deserves a better intro, and could help others interested in typography as a jumping off point, particularly because this feels like a typeface off the beaten path. Just to round up this post, some recent counterexamples: #craft #david jonathan ross #storytelling #typography Write like a human being would. This is famously hard, and takes practice. Here, we see stuff like “unapologetically digital,” “a functional tool within a broader typographical system,” “the result feels both nostalgic and contemporary,” and “constraints weren’t a limitation, they were the design tool.” No one talks like this. I think people believe font releases have to use these words and phrases, as a way to bring legitimacy to the project. I do not subscribe to that way of thinking. I think it leads to writing that’s optimized only for admiration, which is not as much fun for anyone. Show a specific example of a problem you solved. This page hints at some things – “They don’t scale properly across viewports, their metrics conflict with existing typography, or they’re purely decorative.” – but that feels altogether too vague to be useful or even interesting. These are actually fascinating and hard challenges, yet I know as much at the bottom of the page as I did at the top. Show details you are proud of. Zoom in literally or figuratively. “Each glyph was manually refined to avoid visual noise, uneven weight distribution, and awkward diagonals.” I would love to see a few examples. Show work in progress! Show stuff you discarded. This will be hard, but why not? It’s good practice and I believe this, more than anything else, will have people appreciate what you did. Plus, everybody loves a blooper reel. Related: talk about struggle. Don’t motion in the direction of struggle, or performatively announce that this was the hardest project of your life. Actually talk about something that was hard, and why. Be vulnerable. Be honest. People didn’t care that Rocky lost in the first movie, because people cared about Rocky. Talk about your inspiration or history. What we all do here is part of something bigger. Why a pixel font to begin with? Why is this interesting to you? Is that because Vercel is filled with nerds, or because you got bored with bold and italic, or because it just seems visually interesting in a new way? Let me type! Immediately and everywhere. I don’t think any modern font announcement/​tester can exist without this. This is the easiest way to getting to know the font and explore specific things that matter to you. (To do this here, you have to go to the font page , switch to Geist Pixel at the top, and then scroll all the way to the bottom. This feels entirely too far away.) Show, don’t tell, generally. The Geist Pixel announcement feels rife for an avalanche of this, but has so little. I mentioned above wishing to see examples of manual refinements. There is a visual for “seamless mixing,” but it’s really a marketing photo, not a real-use example – it visualizes what, but you want to visualize what and why at the same time. I would love to see the spread of variants, specific examples of how the font is not “breaking in production” or not “scaling properly across viewports.” I don’t know what is a “semi-mono approach” and I would like to learn. Motion is okay, but it has zero nutritional value. If you have limited resources, don’t spend it on motion. Anything interactive is better. (But again, the best interactive thing is letting you type.) The “Already shaping what’s next” is a narratively unsatisfying section, as it promises stuff that you cannot see yet. Either show those, or skip the tease altogether. Fran Sans announcement post by Emily Sneddon (complements the font page ) – personal, distinctive, talks about the process, shows interesting artifacts. I feel that every small essay from David Jonathan Ross’s Font Of The Month teaches me something new – pick a font you like on that page, then click Notes next to it. Departure Mono font page by Helena Zhang doesn’t use a lot of words, but still tells a lot. Shantell Sans process post by Shantell Martin (complements the font page ) – already talked about it before .

0 views
JSLegendDev 1 weeks ago

My Biggest Gripe With YouTube

3 years ago, I started a YouTube channel called JSLegendDev where I uploaded tutorials teaching the JavaScript programming language through the development of 2D games. The state of the space around the time I started was as follows : Tutorials inferior to an hour in length were not in demand. They made very little views. Tutorials divided into multiple parts where dead on arrival. You were guaranteed dwindling views on every new upload. To adapt, other content creators started uploading longer, multi-hour, often project based tutorials which translated to more views. Seeing the shift, I also decided to follow suit and uploaded tutorials reaching the 4-10 hour mark. I saw some success doing this. Therefore, I kept at it for a while. However, as time passed, I got tired of recording extremely long tutorials and they, in general, started to make less views. There are many hypotheses as to why YouTube’s algorithm started serving tutorial content less. The advent of AI could’ve been the likely cause but also a general shift in YouTube becoming more of an entertainment focused platform to the detriment of educational content. Something you now put on TV to relax. In the programming space, channel producing content that can be watched passively like tech news, tech drama, tech history, high level discussions, etc… continued to thrive. Seeing this new shift and because I was genuinely tired of making YouTube tutorials, I published my first scripted video titled “How do Devs Make Levels Without Game Engines” which was first published as an article. In that piece, I told the story of how I discovered a convenient way to design levels for my games using an external editor called Tiled in conjunction with my editor-less game framework. At the end of that video, I promoted a paid tutorial I made teaching the exact steps needed to achieve what was presented. The video ended up accumulating over 30k views, which was pretty great! It took far less effort to make compared to my multi-hour tutorials and I was able to make a few sales on my paid tutorial I mentioned within. Previously, I was very unsuccessful in selling any paid courses and I didn’t quite understand why. However, the answer now hit me like a truck. Why would anyone still have the appetite for a paid course after having invested the time following a free multi-hour course? Even if the subject of the paid offering was different, they would probably be too tired to commit to another one. Anyway, following in the footsteps of this first breakthrough, I uploaded another scripted video titled “You Can Now Make PS2 Games in JavaScript” which was again first published as an article. In that video, I told the story of how I discovered that you could make PS2 games in JavaScript and provided an overview of how the viewer could get started. Despite including very practical knowledge, the viewer was never expected to follow along and therefore could watch it passively. It was a resounding success, over 100k views! Unfortunately, I didn’t sell any courses in that video because I simply didn’t have the energy to both make the video and a course. The best business decision would have been to wait before uploading. I’ll go into more details later, but my biggest gripe with YouTube is that it’s no longer a great platform to build an audience but rather it’s only good for reach and here, I had wasted a lot of reach. After having made so many game development tutorials, I wanted to try my hand in creating an original game that I would sell on Steam. Once the project was starting to take shape, I had the idea of making a video about it to gauge interest as I wasn’t sure it would find an audience. Therefore, I had the idea of using the same format used in my two previous successful videos. However, rather than focusing on technical details, I instead would tell the story of how I came up with my game’s design covering the various iterations and challenges I faced while working on it. Therefore, I ended up uploading a video titled “Making a Small RPG” which again, was originally an article. It was also a resounding success reaching barely below 100k views! However, it came with a hidden cost. That cost was the tipping point that made me realize that YouTube is no longer a good platform to build an audience on. I naively thought that if the video performed well, this would translate to subscribers and an audience eager to hear more about the project, but this wasn’t the case. I had made a big mistake by not setting up a Steam page to direct viewers to before publishing the video. On my next upload concerning the project, the fall off in terms of views was brutal. I went from 98k views to below 10k. It became clear that YouTube was acting as a gatekeeper between me and the audience I thought I had built. After reflecting on the situation, I came to the following conclusion. The reason my 3 previous videos had performed well was due having certain characteristics that aligned with YouTube’s goal as a platform, which consists in making people watch videos for as long as possible so they can serve more ads. I listed them below : The subject of all three videos were remarkable which lead to people clicking on them. Something is remarkable when it obviously stands out as being interesting/noteworthy. For example, the subject of my video titled “You Can Now Make PS2 Games in JavaScript” is remarkable because the PS2 is a very popular, but now old console and you had to use a hard programming language called C++ to make games for it. Being able to now use JavaScript, a simpler but most importantly, a language originally designed for making websites and not games, makes the subject come across as immediately noteworthy. Therefore, remarkable. The use of storytelling made people eager to watch more of the video. This can be explained by the fact that we instinctively want to know what happens next in a compelling story. Finally, the length of the videos were all above 10 minutes and the 2 more successful ones were in the 15+ min range. This resulted in more absolute watch time compared to shorter content. For example, if 2 videos are both watched fully by the same audience. The shorter one will translate to less total time spent on the platform compared to the longer one. Therefore, YouTube will recommend the longer one instead because there’s an opportunity cost to doing otherwise. To understand the fall off, it’s important to first mention that usually, series on YouTube don’t work. The second video of a series ends up making less views than the first because it requires prior context before clicking. Thus reducing its appeal and limiting its reach. However, I knew this going in. I tried making the second video as independent as possible but in the end, a second video talking about the same subject was bound to be less remarkable. It didn’t help that because I summarized the content of the first video in the second one, a familiar viewer would have found it less engaging making the video further away from hitting criteria 2 and 3 that I outlined above. Consequently, I realized I had wasted my biggest marketing ammunition regarding my small RPG game as I had no way to contact the audience hit by the first video. Like with the one on making PS2 games in JavaScript, I had wasted tremendous reach. At this point, I realized my biggest gripe with YouTube was simply that I could not access my audience reliably. Therefore, was it really my audience? On one hand, YouTube allows someone without a following to reach millions but on the other, the link to those reached is fickle. I thought I was building an audience by gaining subscribers but instead, I was building a sand castle that could easily be carried away by the slightest algorithm waves. YouTube wasn’t always like this. People used to subscribe to channels and seek their content in their subscriptions tab. However, the platform effectively buried this model by conditioning users to seek recommended videos on the home page and deprioritizing the Subscriptions tab to the point that it barely looks like a clickable section. You have to click on the “Subscriptions” text to access your sub feed. Doesn’t look very clickable doesn’t it? I think that we’re now entering an era where YouTube is starting to treat content creators as interchangeable much like TikTok. They saw the success TikTok had, tried to replicate it with Shorts and now YouTube long form is getting affected as well. I fear that in the future, uploading to YouTube will look no different than making posts on Reddit. You might get views, you might get comments, but they’re self contained to a specific post with no following building up and no guarantee of your next posts having the same reach. The conclusion to all of this is that it’s not worth it to be a YouTuber. Relying on YouTube adsense and sponsorships (sponsors use views as a metric to determine how much to pay you) for your livelihood is simply not sustainable due to how fickle getting views on the platform is. Therefore, focusing so much on making YouTube content will most likely lead to your exploitation. That said, is quitting really the answer? Considering that YouTube can give you incredible reach even if you’re a nobody as long as you make content that is remarkable, engaging (for example, through storytelling) and long enough, it would be stupid to completely walk away, at least in my case. Therefore a new strategy appears on the horizon. It consists in building your audience outside of YouTube through a mailling list (Substack conveniently allows you to do so) and to strategically make occasional compelling YouTube content to tap into the platform’s reach potential. However, the key is to always direct viewers to the mailling list. Why is building an audience through email so important? because it allows you to have a direct and long lasting link with your audience. It also gives you independence from social media platforms. Even in the case of Substack, where this article is currently hosted, I can export my email list and move to another platform or email sending service without my subscribers even noticing. This shift implies that I no longer need to worry about pumping frequent content for YouTube because I’m not making money through them or worrying about doing so. By making YouTube content rarely, I get to keep most of my energy to build something compelling outside the platform like an actual game, writing interesting articles, making an in-depth course or other kinds of art/products. This plan seems to me as more sustainable and more healthy long term. That’s about all I’ve got to share. Hope this article was insightful. If you’re curious to see where this journey will lead, I recommend subscribing! I usually write about programming, game development and game design. Subscribe now You can check some of my previous articles below. Tutorials inferior to an hour in length were not in demand. They made very little views. Tutorials divided into multiple parts where dead on arrival. You were guaranteed dwindling views on every new upload. The video ended up accumulating over 30k views, which was pretty great! It took far less effort to make compared to my multi-hour tutorials and I was able to make a few sales on my paid tutorial I mentioned within. Previously, I was very unsuccessful in selling any paid courses and I didn’t quite understand why. However, the answer now hit me like a truck. Why would anyone still have the appetite for a paid course after having invested the time following a free multi-hour course? Even if the subject of the paid offering was different, they would probably be too tired to commit to another one. Anyway, following in the footsteps of this first breakthrough, I uploaded another scripted video titled “You Can Now Make PS2 Games in JavaScript” which was again first published as an article. In that video, I told the story of how I discovered that you could make PS2 games in JavaScript and provided an overview of how the viewer could get started. Despite including very practical knowledge, the viewer was never expected to follow along and therefore could watch it passively. It was a resounding success, over 100k views! Unfortunately, I didn’t sell any courses in that video because I simply didn’t have the energy to both make the video and a course. The best business decision would have been to wait before uploading. I’ll go into more details later, but my biggest gripe with YouTube is that it’s no longer a great platform to build an audience but rather it’s only good for reach and here, I had wasted a lot of reach. After having made so many game development tutorials, I wanted to try my hand in creating an original game that I would sell on Steam. Once the project was starting to take shape, I had the idea of making a video about it to gauge interest as I wasn’t sure it would find an audience. Therefore, I had the idea of using the same format used in my two previous successful videos. However, rather than focusing on technical details, I instead would tell the story of how I came up with my game’s design covering the various iterations and challenges I faced while working on it. Therefore, I ended up uploading a video titled “Making a Small RPG” which again, was originally an article. It was also a resounding success reaching barely below 100k views! However, it came with a hidden cost. That cost was the tipping point that made me realize that YouTube is no longer a good platform to build an audience on. I naively thought that if the video performed well, this would translate to subscribers and an audience eager to hear more about the project, but this wasn’t the case. I had made a big mistake by not setting up a Steam page to direct viewers to before publishing the video. On my next upload concerning the project, the fall off in terms of views was brutal. I went from 98k views to below 10k. It became clear that YouTube was acting as a gatekeeper between me and the audience I thought I had built. After reflecting on the situation, I came to the following conclusion. The reason my 3 previous videos had performed well was due having certain characteristics that aligned with YouTube’s goal as a platform, which consists in making people watch videos for as long as possible so they can serve more ads. I listed them below : The subject of all three videos were remarkable which lead to people clicking on them. Something is remarkable when it obviously stands out as being interesting/noteworthy. For example, the subject of my video titled “You Can Now Make PS2 Games in JavaScript” is remarkable because the PS2 is a very popular, but now old console and you had to use a hard programming language called C++ to make games for it. Being able to now use JavaScript, a simpler but most importantly, a language originally designed for making websites and not games, makes the subject come across as immediately noteworthy. Therefore, remarkable. The use of storytelling made people eager to watch more of the video. This can be explained by the fact that we instinctively want to know what happens next in a compelling story. Finally, the length of the videos were all above 10 minutes and the 2 more successful ones were in the 15+ min range. This resulted in more absolute watch time compared to shorter content. For example, if 2 videos are both watched fully by the same audience. The shorter one will translate to less total time spent on the platform compared to the longer one. Therefore, YouTube will recommend the longer one instead because there’s an opportunity cost to doing otherwise.

0 views
Langur Monkey 1 weeks ago

Langur Agent

Langur Agent is a simple, open, hackable CLI AI agent for Linux and macOS. It connects to any service providing an OpenAI-compatible endpoint. It features: The source is available in this repository . Langur Agent has been tested on Linux and macOS only. Install the agent with: Run the agent with the default session: If you need an API key to access the endpoint, put it in the file. Langur Agent looks for the file in the following locations, in order: Create the file with the API key: The agent uses to load at startup. The package reads from the environment automatically. You can also set in your shell profile. On first run, the configuration is created in . You can configure the agent interactively with the slash command. The agent works with any OpenAI-compatible endpoint, so LM Studio, Ollama, OpenWebUI, or any other service you configure. Here are the default values: Run the agent, and then you can enter your prompt. You can use the following key bindings during input: During inference, you can cancel the turn and return to the input prompt with Ctrl + c . Use to print information about the available commands, and to configure the agent interactively. Internally, Langur Agent uses sessions to separate different memory histories. Sessions are named by the user. By default, the agent uses the session. You can start in a different session (either create a new one, or restore it if it exists) with the argument: The default session’s name is , so the following two commands are equivalent: You can also list the existing sessions with : Sessions contain: For now, the configuration file is the same for all sessions. Sessions are matched by the directory name in the sessions location ( ). You can rename a session by just renaming the directory! You can enable mode for the current session with the command , or permanently in the configuration . External editor —In mode, exit INSERT mode ( Esc ), then press v to edit your prompt in an external editor (uses your or variable). There are a few commands available to use in the agent loop. You can list them with . Also, use (e.g. ) to show additional help for a command. Persistent memory follows XDG Base Directory spec in : In addition to persistent memory, the agent maintains a chat history of recent user input and assistant output pairs. This provides context that survives beyond the LLM’s context window. Here is how it works: Persistence: Configuration: Langur Agent can be easily customized and extended by adding new tools, commands, and skills. If you create a cool new tool, skill, or slash command, consider contributing it via a pull request! Create a file in or use one of the existing ones. To create a tool, create a method and decorate it with : Tools are auto-discovered on startup. The process is very similar to tools. You need to create your method, preferably in , and decorate it with . A slash command must return, in that order, , , , : Decorated commands are automatically registered, and auto-completed in the input prompt. Add a file in with YAML front matter, following the agentskills.io standard: The front matter and are parsed and shown in the skills list. The body is injected into the system prompt. session management memory management visual candy autocompletion interactive configuration Python 3.13+ for dependency management Current directory, Home directory, Alt + Enter : add a new line Enter : submit the prompt Ctrl + q : quit The input history Chat memory (see chat memory ) Notes (see session memory ) User profile (see session memory ) — user information — persistent notes (added via tool) Memory is loaded into the system prompt each turn tool adds notes during a session tool explicitly persists memory to disk Memory is auto-saved when the agent exits (interactive mode) Each user message and assistant response is stored in memory Reasoning is omitted from chat memory Automatically compacted when exceeding the configured character limit The user can trigger the compaction any time with Chat memory is attached to the system prompt on each turn The agent displays the last 10 exchanges, with long messages truncated Chat history is persisted to Automatically loaded on startup Saved after every exchange (user input or assistant response) Compacted history is also persisted to disk : a indicating if the command succeeded or failed. : an optional short status message. It is printed with or . : an optional with the Python Rich-formatted content, it is printed to the output. : an optional formatted in Markdown, it is printed to the output.

0 views
daniel.haxx.se 1 weeks ago

curl up 2026 summary

Getting curl developers and related enthusiasts into a single room to hang out in the real world for a whole weekend once a year is awesome. We find inspiration, we share experiences, we learn from each other and we dream and plan of future endeavors and things to work on. Seeing faces, hearing voices and watching body language help us communicate better virtually and on video calls during the rest of the year. We have gathered curl people like this annually since 2017, even if some years during Covid were “different”. To me, this is one of the best events of the year. I get to hang out and talk curl with good friends a whole weekend! The 2026 edition was held in Prague in late May and kept the general style of past events. About 25 people got into the room. We had five curl maintainers present and quite a lot of local curious minds. The curl up format is easy, casual and friendly. We do topical presentations, followed up with Q&A and discussions around the topics brought up – of course usually with reflections about curl’s role, both past and future. We live-stream and record the presentations to allow our friends who could not attend to keep up both in real-time but also after the fact. Unfortunately the tech is not always on our side so the quality sometimes is a little lacking. This year I brought an HDMI-splitter and an HDMI-to-USB device to allow us to get better recordings, but they were not working as smoothly as intended so we had to use inferior backup solutions for most of the meetup. This presentation above was the “keynote”, the introduction talk to the event. We then also recorded another nine session that are all available in the curl up 2026 playlist on YouTube. To give you all a little glimpse of what curl up is about, here’s a gallery showing some of the speakers and some scenery. Daniel Stenberg Alexandr Nedvedicky Daniel Stenberg Jim Fuller Jim Fuller Carlos Henrique Lima Melara Jim Klimov Moritz Buhl Stanislav Fort Daniel Stenberg Igor Chubin Igor Chubin Daniel Stenberg Daniel Stenberg and Frank Gevaerts All photos taken by and donated to us by an anonymous curl fan present in the room.

0 views
Martin Fowler 1 weeks ago

Fragments: May 27

At the GOTO Conference in Copenhagen in 2025, Kent Beck and I spent some time on stage talking and answering questions from the audience - a format I refer to as “two old geezers on a park bench”. We talk about our experiences with LLM-augmented programming (at that point - October 2025), we show our frustration that things we’ve been saying for thirty years still need to be said, we say how anything like a manifesto reunion needs to be led by a younger generation, and opine on what junior developers should be focusing on in their career. ❄                ❄                ❄                ❄                ❄ Ian Johnson has written a series of posts about restructuring a gnarly codebase The story follows a real Laravel + React codebase over ~3 months and ~258 commits from a legacy monolith with no tests to a well-structured application with automated quality gates, a React SPA migration in progress, and an AI agent that reliably ships production code with minimal supervision. The series covers the steps in decent detail, and his approach follows the kinds of steps I’d use. First get everything under the control of decent characterization tests, add static analysis, introduce the right patterns to make things flow easily. With all of this, is his use of AI, which changed during the exercise: For the first two months of this project, I used Claude Code with auto-approve turned off. Every file edit, every terminal command, every change… I reviewed it before it executed. […] The results were good. The code was clean. But I was doing most of the thinking and half the typing. The agent was a fancy autocomplete with better suggestions. I wasn’t getting the leverage I’d hoped for. I read an article about “on-the-loop” versus “in-the-loop” human-AI collaboration. The framing clicked immediately […] I was micromanaging because I didn’t trust the agent to do the right thing. And I didn’t trust the agent because there was nothing forcing it to do the right thing. His early steps put in tests, static analysis, and the right architectural patterns. With those in place, he could let the agent do more work. My role shifted from writer to curator. I don’t write most of the code anymore. I Define the patterns […] Review the test specs […] Review the output […] Update the harness […] Make strategic decisions […] He finishes the series with conclusions about how he’d generalize his experience to other circumstances. ❄                ❄                ❄                ❄                ❄ Back in the land of my birth, there was some notable groans when the National Health Service decided to close nearly all of their Open Source repositories , supposedly to the security threat of LLMs. Closing repos like this isn’t an effective counter to LLM-augmented attackers. I suspect it’s no coincidence to see GDS (Government Data Services), the highly-regarded IT enablers in the UK government publish their position Moving code from public to private as a substitute for investment in secure-by-design delivery, ownership and remediation is a warning sign because it reduces sharing and scrutiny, can slow coordinated improvement across government and suppliers, and does not remove the underlying weaknesses in a running service. Terence Eden memorably sums up his view on this: Within the UK’s Civil Service you occasionally hear the expression “being invited to a meeting without biscuits”. It implies a rather frosty discussion without any of the polite niceties of a normal meeting. ❄                ❄                ❄                ❄                ❄ I’ve seen a few cases where those developers who are most involved in working with LLMs find they are running into a problem with cognitive endurance, Adam Tornhill has joined this group : One of the big wins with agents is that they let us stay with the higher-level problem for longer. We get less sidetracked by details, dependency cleanup, and similar secondary tasks that used to break concentration. But there is a cost we are still underestimating. Agentic coding is mentally expensive. I can usually sustain the pace for a couple of hours. Then I need a break. The pace is simply too intense. And based on conversations with other engineers, I do not think I am alone in that. He explains that working with The Genie means we are making more decisions in less time, this increase in decision density is hard on the brain. He responds by keeping agent tasks small, automating everything he can, and accepting that he won’t know every line of code as long as he has good verification mechanisms in place. Notably, he has not gone in the direction of doing his work with swarms of agents that he coordinates. Instead has one long-running task that he babysits and one focus task That last point is important given the running-twenty-agents-in-parallel hype. I cannot even think about twenty meaningful things to build, and even less so about the resulting cognitive tax of the likely interruptions. It’s exactly the wrong thing to even consider. At least for humans. (And yes, I understand sub-agents and machine parallelisation. That is not what I’m objecting to. It is the parallelisation of human attention that does not scale). I liked that he included some thoughts about what folks can do in time outside this intense programming time. Not just “have a coffee” (although he includes that) but also about learning about the domain that the software supports. ❄                ❄                ❄                ❄                ❄ A couple of pithy quotes from social media Lorin Hochstein “Metaphor debt” is when all of your metaphors involve the concept of “debt” because you can’t think of any other metaphors anymore. ❄                ❄ Daniel Terhorst-North If a vegan crossfit fan is using Claude to write Rust, which thing do they tell you first? ❄                ❄                ❄                ❄                ❄ Karl Bode reacts to speakers getting booed when mentioning AI during commencement addresses. He points out that younger folks are increasingly unhappy with the tech oligarchy and their fruits . The thing is the kids aren’t stupid. They see the field clearly. They see the difference between what’s being sold to them by tech companies, the press, and commencement speakers, and what they have repeatedly seen with their own eyes. They’ve watched tech oligarchs spend the last decade mired in scandal after scandal, hype cycle after hype cycle, steadily enshittifying everything they touch along the way. The percentage of Gen Z that think AI’s benefits don’t counterbalance the risks now sits around fifty percent, up 11 percentage points in just the last year. Eight out of every ten believe that using AI makes the process of actual learning more difficult. He sees young people saddled with the perception of entering a worsening world - which leads them to rage against this latest fruit of the tech oligarchy. A rage that is easy for folks like me - with a comfortable retirement off-ramp - to properly appreciate. A rage that could have marked political and social consequences. ❄                ❄                ❄                ❄                ❄ Relevant to these concerns are a couple of items in last week’s Economist newspaper. The newspaper argues that historically major technological advances haven’t led to significant unemployment or drops in wages ( paywalled article ). The closest was the original industrial revolution in 19th Century Britain. There was a stagnation in wages during this period, but there was also a massive increase in population, from 4½ million to 12 million. It also points out that we’ll probably only understand the full consequences of all this when a recession hits, as this is when most unproductive jobs tend to be flushed out of the system. A second article ( also paywalled ) indicates that AI is having some effect on graduate hiring. They did an analysis of surveys of recent graduates, looking to see if employment varied depending on a job’s exposure to AI. The least exposed quintile of subjects saw employment rate fall by 1.5% over the last couple of years, while the most exposed quintile’s drop was 6.6%. ❄                ❄                ❄                ❄                ❄ Lawfare isn’t impressed with the latest efforts by the US Government to regulate AI. On [last] Wednesday, the White House invited leaders of OpenAI, Google, Anthropic, Meta, and Microsoft to the Oval Office for a signing ceremony the following afternoon. President Trump was to sign an executive order on AI and cybersecurity—the administration’s most formal effort yet to establish a voluntary process for reviewing frontier models before their release. But roughly three hours before the ceremony, when some company executives were already in the air to Washington, the White House called it off. They see the proposed regulations as mild, and including some valuable measures to harden defenses against cyber threats. But it’s worth underscoring the implications of postponing (if not outright canceling) this order, which, by its own terms, was about as modest a frontier-AI intervention as the federal government could put on paper: voluntary, focused on the government’s own defenses, and explicitly barred from becoming a licensing regime. The objection isn’t so much about government coercion as about the government having any settled role at all. Voluntary, in other words, isn’t the floor of frontier AI policy in this administration; it’s the ceiling. This is a questionable position given that the concerns animating this draft order will likely grow in the near future. It is also self-defeating for those who applauded the order’s delay or demise. Far from resolving the risk of government meddling in AI, killing the order just leaves in place what Ball has described as the “opaque and essentially lawless” alternative: government access happening through back channels, on terms set case by case, with no stable rules at all. One of the problems here is a distinct lack of governmental expertise, either in AI or in software in general. Too much is being decided at the whims of the tech oligarchy, there isn’t any attempt to engage in the broader issues at hand. That’s not entirely a bad thing, trying to regulate something that’s still evolving so fast is usually a fool’s errand - but the problem here is the impact of AI is so big that there’s real danger in being too far behind. ❄                ❄ Which leads me to a rare thing, an endorsement of a candidate for political office. If you are voting in congressional district MA-06 (North Shore of Massachusetts), I’d seriously look at Beth Anders-Beck , who is running for congress in that district. Beth has a long background in software development (including developing the notion of Forest and Desert ), so would introduce expertise that Congress desperately needs. I’ve known Beth for decades, and have a high opinion of their intelligence, judgment, and ability to work with others. Congress doesn’t deserve Beth, but it does need her.

0 views
Heather Burns 1 weeks ago

Born Crotchety

I spoke with The National about the proposed UK social media ban for teenagers.  That’s an archive link due to their unfortunate adwall. There’s nothing I offered in my delightfully crotchety comments that I wasn’t already saying four, five, six, and seven years ago, but if anyone had listened to me four, five, six, and […]

0 views
iDiallo 1 weeks ago

How Many Tokens Did You Burn Today

Early in my career, a manager at one of the big firms where I worked made a request so absurd it remains etched in my memory. I walked back to the team, repeated what he had asked, and couldn't finish the story without laughing. He wanted me to create a pie chart, of lines of code, per developer, per week. We all lost it. Our lead developer asked if, by any chance, the manager's eyes looked glassy. We laughed even harder. Because yes. Yes, they did. He was always high. That was twenty years ago. I've repeated that story countless times, and it always drew chuckles as we discussed the disconnect between software teams and management. Any software engineer could relate. We all knew that lines of code were a meaningless metric. A junior could write a thousand lines of spaghetti. A senior could fix the same problem with forty elegant ones. But then, last week, I found my name at the top of a leaderboard. My employer had been exploring productivity tools and trialed one they thought would be useful. After the trial, they were quoted $500k a year. The tool tracked developer productivity and integrated with Atlassian products, Microsoft, and many other services we used. The price was too steep, so it was dropped. A couple of months later, the same company came back with a discount. The exact same tool for just $50k a year. My employer jumped at the opportunity. How many bytes did you use today? I'm looking at this dashboard right now and I see my name at the top of the leaderboard. I click on the widget, and a pie chart appears. There it is: a breakdown of the total lines of code my team has produced using AI, by individual. This isn't limited to my employer. Every company is putting something together to track AI usage and justify the investment. Instead of tracking project completions, we're tracking how many lines of code each developer generated with AI. And the joke's on me, because nobody is laughing. The whole industry is applauding and encouraging employees to use more of it. I didn't become the champion because I have some neat agentic workflow. It was done by complete accident. While using an LLM, I accidentally selected "planning mode" for a request that had already been planned. The agent ran for several minutes, burning tokens to resolve a problem that didn't exist. Just like that, I made it to the top, without ever writing a single line of code. If this widget is taken at face value, it won't be long before developers start gaming it deliberately. Just let the agent run overnight, and your employer can claim a 10x improvement in productivity. We didn't use line count as a productivity metric in the past because it never made sense. Whenever we refactor code, we often end up with less than we started with. In fact, much of the time I spend modifying AI-generated code is spent deleting unnecessary things it created. Should we track negative lines of code? The better you are at programming, the worse your numbers look. We are assessing developers by the lines of code. I've watched AI evangelists ask "how many tokens did you burn today?" They were trying to convince an audience that productivity is directly proportional to token usage. It reminds me of the transition from paper to computers. A computer evangelist of that era might have asked: "how many bytes did you use today?" Token counts, lines of code, bytes, none of these have anything to do with actual productivity. Metrics are often entirely disconnected from what they're meant to measure. I've seen companies rely on story points only to watch employees point every ticket as high as possible. Choose lines of code as your metric, and lines of code will increase. Reward the highest contributor, and watch everyone double or triple their output by the next performance review. It's a silly metric but it serves a purpose, just not yours. AI companies promote token usage and associate it with productivity because they directly benefit from it. Imagine an internet service provider that charges by the byte. What would their recommendation for productivity be? "Use more bytes!" The best engineers I've ever known wrote less code, not more. They deleted things. They simplified. They understood that the goal was never the code itself. They solved problems, they made the system reliable, and they served the user. Measuring developers by output volume, whether that's lines, commits, or tokens, mistakes the exhaust for the engine. Every era of tooling brings a new class of metric that mistakes activity for value. The spreadsheet didn't make accountants more productive just because they could fill more cells. AI won't make developers more productive just because it can generate more code. We aren't even tracking if the right problems are being solved, and solved well. If the productivity dashboard can't answer that, it's not measuring productivity. It's measuring the subscription.

0 views

Pipeline Parallel Decompression

This isn’t a paper summary, but rather a description of a hobby experiment I’ve been hacking on ("research quality" code). This quote (attributed to either Anonymous or David Clark ) originally referred to networking, but applies to parallel programming as well: There is an old network saying: Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed—you can’t bribe God. Standard "cured with money" parallelization techniques (e.g., shared-nothing architectures, data parallelism) try to minimize cross-core communication. These hammers are great for hitting nails labeled: "improve throughput by throwing more cores at the problem”. Not everything is a nail. Important problems which cannot be solved with this kind of approach include: Parallel network packet processing in cases where load balancing schemes like RSS do not apply Parallel transaction processing when there is high contention between transactions Parallel encryption of a single stream of data Pipeline parallelism has the potential to provide "bribing God” solutions to some of these problems. A potential additional benefit that pipeline parallelism brings to the table is better usage of CPU caches because of a smaller working set. For example, if 8 cores cooperate to process 1 input file, the working set (input data, output data, intermediate data structures) is potentially 8 times smaller than the case where each core processes a separate input file. This caching advantage also applies to instruction caches, as pipeline parallelism distributes the computational steps of an algorithm across cores. Pipeline parallelism has some major drawbacks: Fine-grain synchronization/communication Load imbalance The purpose of this experiment is to put some numbers on the costs and benefits in a real-world application ( DEFLATE decompression). DEFLATE decompression is hard to parallelize because of two tight feedback loops: The position of encoded token in the input stream is not known until token is decoded (because input data is encoded with a variable length code). The output generated by a match (i.e., length & distance tuple) cannot be computed until some amount of previous output has been generated (because a match references previously generated output) A Negative Nancy might view these as problems, but a Positive Pipeliner views them as a guide for how to decompose the algorithm into pipeline stages. The general technique is to dedicate a pipeline stage to each of these feedback loops and whittle them down to be as tight as possible. The design I’ve landed on has three pipeline stages: , , and . The stage computes the length of each encoded token. It simply reads the next 13 bits from the input stream and uses them as an index into a lookup table. The inner loop looks like this: Note that in contrast to non-pipelined implementations, the only thing this code (and the lookup table) are concerned with is finding the length of each token, everything else is dealt with in another pipeline stage. Each iteration of this loop runs in about 8 clock cycles, and the lookup table fits in the L1 cache. The CPU cannot run multiple iterations of this loop in parallel due to the tight dependency chain. The input to the lookup stage is the encoded bits associated with each input token ( in the code above). These bits are used to perform another lookup (in a larger lookup table, stored in the L2 cache) which results in much more information about each token. Optimizing this stage is easy, because it doesn’t contain any tight feedback loops. The CPU can process multiple loop iterations in parallel, which enables it to hide the latency of accessing the L2. If necessary, it would be easy to split this pipeline stage into two. The inner loop looks like this: The structure contains metadata about the input token (literal value and/or information about a match). This data structure does not contain the exact distance associated with the match, the variables named deal with that detail from the DEFLATE spec. The stage writes literals and matches to the output buffer. This code leans on the CPU store-to-load forwarding hardware to deal with match operations which must read data that was recently produced. Each iteration of the inner loop performs a word-sized write of literal data, plus a 32B read and write to read and write match data. Actual store-to-load forwarding is rare, as most match distances are large. The Silesia Corpus contains commonly used files to benchmark compression algorithms. has English text with short matches whereas contains data dumps with longer matches. is an optimized library which can decompress roughly 2-3x faster than the standard . The following chart shows baseline performance on in a shared-nothing architecture where each CPU core decompresses a separate input file. There is one data point for each core count (1, 2, …, 8). As you would expect, throwing more cores at the problem improves throughput, at the cost of slight latency increase. If you want a more interesting tradeoff of throughput vs. latency, you have to bribe God. For example, say you are writing a decompression application. If the user requests a bulk decompression of 100 files, then the optimal choice may assign each file to a CPU core. But if the user requests to decompress a single file, then you would prefer to decompress using multiple CPU cores. And here is the same chart with the 3-stage pipeline implementation added in orange (compare it to the third blue dot from the left for a 3-core vs 3-core comparison): For a 37% cost in throughput, you get a 2x reduction in latency. Here is the chart for , which shows a similar story. Data-parallel throughput saturates at 6 cores. Pipeline parallelism allows a 2.6x latency reduction at the cost of 14% throughput. Dangling Pointers I think there is room for language/runtime support to improve performance of pipeline parallel algorithms on multicore CPUs (by reducing load imbalance). is bound by the chase stage, whereas is bound by the output stage. The programmer could supply multiple implementations of the pipeline (with some compiler help to reduce code duplication), and the runtime could dynamically switch between them depending on which stage is the bottleneck. High level synthesis tools are capable of automatic pipelining. Such techniques could be used to automatically generate many pipeline implementations for the runtime to choose between. The description above leaves out a few implementation details regarding the lookup tables. Because the lookup table data is spread across two cores (i.e., pipeline stages), there is enough room to store data for 2 Huffman tokens (2 literals, or a full match). This provides a large speedup compared to traditional implementations that store all data in the caches of a single core. Because the stage is throughput bound rather than latency bound, it can afford to access the lookup table via a layer of indirection. The 13 input bits are used to lookup a index, and that index is used to access the final data in another lookup table. The second lookup table has fewer entries, but each entry is larger. This reduces the total working set. This design leans heavily on CPU branch prediction. The code snippets shown earlier are for the common cases, with branches used to implement uncommon cases (e.g., a single encoded token that is wider than 13 bits). As long as those cases are rare, branch prediction does a great job of keeping the inner loops humming. An interesting puzzle arose during this experiment. I found that performance could swing widely (~10%) based on where the operating system located stacks of the various threads. The stack address would change from run to run because of ASLR . A little to offset the stack by a small amount would resolve this issue. It seems to be an important consideration when trying to maximize usage of the L1 cache. Subscribe now Parallel network packet processing in cases where load balancing schemes like RSS do not apply Parallel transaction processing when there is high contention between transactions Parallel encryption of a single stream of data Fine-grain synchronization/communication Load imbalance The position of encoded token in the input stream is not known until token is decoded (because input data is encoded with a variable length code). The output generated by a match (i.e., length & distance tuple) cannot be computed until some amount of previous output has been generated (because a match references previously generated output)

0 views
DHH 1 weeks ago

Basecamp Five

I've been working on Basecamp for half my life, and nearly my entire professional career in software. The first code was written in the summer of 2003 when I was just 23. Now I'm 46, and we've just released the fifth major version.  It's an incredible update to a service that continues to help about a million users a day avoid dropping the ball when working with others. It's AI accessible, but not agent hysteric. It's still famously easy to use, still executes the basics beautifully, and still focuses on the small to medium-sized teams we've been serving in the Fortune 5,000,000 for decades. Here are just three of my favorite new features in Basecamp 5: Lexxy editor: Our new text editor finally brings tables, markdown, and live syntax highlighting for code to Basecamp. Oh, and voice notes. It's built on Meta's Lexical editor toolkit, and it's going to ship as the default for Action Text in the next major version of Rails. Keyboard accessible: After moving to Linux, building Omarchy, and acquiring a taste for mechanical keyboards, I've come to love navigating the computer primarily through hotkeys. So with a lot of effort, Basecamp is now a delight to drive through the keys, and you don't have to be a brainiac to remember them all: just hold down SHIFT, and they're revealed in the interface. SHIFT + S opens the sidebar, ESC moves focus between it and the main page, SHIFT + C starts composing a comment/chat line/answer. The permanent sidebar: If you live in Basecamp, like I do, it's to stay on top of all the new things that are constantly happening in a busy account, and that's just gotten so much faster with the new permanent sidebar. Before, we had a Hey! menu in the top bar. You'd get a little dot when something was new, then you'd open it, click, and the menu would close. If you had five things that were new, it'd be open-click-close, open-click-close, five times. Being able to zoom through these now with just the return key, tap, tap, tap, and I've read three new things. So good. And there's so much more. Jason put together a great summary on the new marketing site, which in itself is brand new too. A back-to-basics design in many ways. As our entire industry is getting swept up in agent hysteria (and I love AI as much as anyone!), we thought it better to focus on the human communication that's the cornerstone of Basecamp. The new site just speaks plainly to that mission and shows you the software right at the top. Another thing that's back is color, specifically in the logo. Basecamp's clever but flat paperclip logo has been replaced with a modern take of our original rolling mountains. In full three dimensions, with depth and a gradient. Love it.  Overall, I'm really proud of what we've built with Basecamp Five. We're inching in on a quarter of a century in service! We still have customers who signed up back in early 2004! This is the kind of legacy that makes me beam, and the new version is just ace.  If you've tried Basecamp in the past, it's time to take another look. If you haven't tried it yet, you're in for a treat.

0 views
Unsung 1 weeks ago

FAIL_MAIL_OVER_500_MILES=TRUE

Here’s a 2002 story from a younger internet, by programmer Trey Harris ( link to the original and if you don’t like the classic Usenet formatting – my browser’s reader mode can’t even prettify it! – here’s a nicer-looking format ): “We’re having a problem sending email out of the department.” “What’s the problem?” I asked. “We can’t send mail more than 500 miles,” the chairman explained. I choked on my latte. “Come again?” “We can’t send mail farther than 500 miles from here,” he repeated. “A little bit more, actually. Call it 520 miles. But no farther.” It would be easy to assume this is a classic case of pebkac , “problem exists between keyboard and chair,” the derisive term used (supposedly!) by support people, describing naïve public who had a tenuous grasp of technological reality. But the story goes to an unexpected place. This might be the most widely-shared computer bug story of all time I’ve seen – I just saw a comment from 2008 calling it “oldie but a goodie,“ and it even has a FAQ page that’s actually a really great read. There’s quite a bit of chatter inside about something important to me: the balance between the needs of good storytelling and going deep into technical details: In the story, I make it sound like it took all of ten minutes from being made aware of the 500-mile email limit and determining a 3 ms light-speed issue. In fact, this took several hours, and quite a bit of detective work. The point is, eventually I came up with that figure, ran units, and gagged on my latte. You can sense author’s frustration with every nerd trying to “gotcha” him instead of just enjoying the story. Even a younger internet wasn’t without faults. #bug deep dives #bugs #change management #storytelling #web

0 views
Armin Ronacher 1 weeks ago

Building Pi With Pi

Pi is now part of Earendil, but in the important sense it is still Mario’s project. He has been living with its issue tracker longer than I have, and he has been exposed to the weirdness of the new form of agent traffic in Open Source projects for longer too. This post is mostly a reflection of my own experience after spending more time in the tracker, using Pi to work on Pi, and watching what I have learned about it so far. Unsurprisingly, we are using Pi to build Pi. That sounds like a cute dogfooding thing but it really helps understand what we do. An interesting effect of building with agents is that it changes the role of the issue tracker a tiny bit. The issue descriptions are not just messages from a user to a maintainer because we also use them as inputs for prompts in Pi sessions. It is something I might hand to my clanker 1 and say: “understand this, reproduce it, inspect the code, and propose a fix.” That means the shape of the issue matters in a new way. A bad issue was always annoying, but at least a lot of issues were vague. Now we are also dealing with a class of issues that are 5% human and 95% clanker-generated and largely inaccurate shit. A bad issue that contains a plausible but wrong diagnosis creates extra work. The most frustrating failure mode right now is that people submit issues that are not in their own voice. They contain an observed problem somewhere, but it has been thrown into a clanker and the clanker reworded it and made a huge mess of it. Typically, it was prompted so badly that the conclusions produced are more often than not inaccurate but always full of confidence. The result is complete guesswork on root causes, fake-minimal repros, suggested implementation strategies, analogies to adjacent but often the wrong code, and long lists of error classes that might or might not matter. That is worse than no diagnosis. I don’t want to point to specific issues because I really do not want to bad mouth anyone, but it is frustrating. It is also frustrating because when I give that issue to Pi, Pi sees the wrong diagnosis too. It does not treat the issue body as a rumor. It treats it as evidence. It will happily go down the path that the issue already prepared for it, because the prose is confident and the code references look plausible. We use a custom slash command called , which specifically has this instruction in it: Do not trust analysis written in the issue. Independently verify behavior and derive your own analysis from the code and execution path. Unfortunately, it does not fully work, because when humans first throw their issue through the clanker wringer, their clanker expands scope almost immediately. What was once a very narrow and fact based bug observation, turns into a much expanded surface area full of hypotheses. So at least personally, I increasingly want issue reports to be condensed to what the human actually observed: That is enough. If you used an LLM to understand the problem, great, maybe leave it as a follow-up comment. But the issue and the issue text should be something you own. If you do not know the root cause, say that. I too can operate a clanker, and I would rather do this myself than use your slop. If your repro is a guess, say that. If the only hard fact is one stack trace, give me the stack trace and stop there. That we’re seeing issues full of slop is just a result of the present day quality of these machines. Sadly, their failures in creating good issues extend to a lot of code that is generated. Not all of it, but a lot of code. Over and over I keep running into them over-engineering the hell out of issues and implementations. If you tell them that “this malformed session log crashes the reader,” the clanker will often add a tolerant reader. Then it will add a fallback, then maybe a migration, then more debug output, then a test for all of this. None of this is necessarily wrong in isolation, but it can be the wrong move for the system. At Pi’s core is a rather well-designed session log with invariants that must be upheld. The clanker’s present-day behavior is to just assume that no such invariants exist, and instead to make the system work with all kinds of malformedness, blowing up the complexity in the process. Almost always, the correct fix is not to handle the bad state, but to make the bad state impossible. This matters a lot for persisted data such as Pi session logs. They are opened, branched, compacted, exported, shared, and analyzed. The goal here is to never write bad session data. Yet if you just let the clanker roam freely, it will attempt to handle every case of bad data in the session log with a more permissive reader. I have complained about this plenty, but working on Pi’s code base continues to reinforce the point. This is one of the ways LLM authored code grows so much needless complexity. All these models see a local failure and try to locally defend against it. As maintainers we have to keep pulling the conversation back to the global invariant, which is harder than it should be, and it’s laborious. Then there is the issue of volume. The tracker is receiving a lot of issues and PRs, and a significant fraction of them are clearly LLM-assisted. Some are good, none are excellent, and most are just bad. The total throughput is a maintenance problem by itself. As you might know, Pi’s issue tracker is automated to close all issues and pull requests from new contributors, and there is a manual process by which we might reopen some of them or approve individuals. So auto-close -> reopen -> close again is an interesting statistic for us to look at. I pulled the public GitHub tracker data while writing this over the last 90 days. Excluding Earendil members, that leaves 3,145 external issues and pull requests. Of those, 2,504 were auto-closed because they were from non-approved individuals. 17% were reopened. For pull requests the number is worse: less than 10% were merged. Many of the issues and PRs are complete slop and in some cases the humans did not even realize that they created them. Sources of low-quality spam include OpenClaw instances, as well as some skills that people put into their context that seemingly encourage issue creation. GitHub clearly is not built to deal with this new form of Open Source, but I’m increasingly feeling the need to put the blame less on GitHub than on all the people involved who make that experience painful. If your clanker shits on someone else’s issue tracker then it’s not the fault of GitHub, it’s yours alone. Pi might be built with Pi, but we’re quite far off today from where Bun and OpenClaw already are: fully detached, automated software engineering. Maybe we will reach that point, I don’t know. Today it does not seem like we know how to pull off a dark factory and we also don’t yet have the desire. That said, there is quite a bit of parallelism going on, and it is mostly for reproducing issues. The small setup we use for this is three tiny pieces in Pi’s own committed folder. (for analyze is sue) is a prompt for analyzing GitHub issues: it labels and assigns the issue, reads the full thread and links, then explicitly tells the agent not to trust the analysis in the issue and to derive its own diagnosis from the code. Then an extension adds a which watches the prompt before the agent starts, recognizes the GitHub issue or PR URL that (or the PR equivalent) put into the prompt, fetches the title and author with , renders that in a little UI widget, and renames the session. It also rebuilds that state on session start or session switch, so if we reopen an older investigation the window still tells the developer which issue it belongs to. In practice this means it’s possible to have several Pi windows open, each running against a different issue, and the UI keeps the investigations visually distinct while the agents do their independent reproduction and code reading. Once the investigations are done, one can work through them sequentially. To finish off everything, ( wr ap it up) is the matching wrap-up prompt: it infers the GitHub context from the session, updates the changelog, drafts or posts the final issue comment with a disclaimer, commits only the files changed in that session, adds the appropriate when there is exactly one issue, and pushes from . You will have noticed this already but Open Source in a post-AI world is under a strange new pressure. We are getting more code, more projects, and more issues. Projects appear with no real users, or a temporary audience of one, and even projects with thousands of stars can have a shelf life of weeks. For us, Pi’s harness layer is worth maintaining carefully because it solves hard coordination problems and creates a platform we and others can build on. We also know that coordination and cooperation lifts us all up. Many times the right answer is not to work around a problem locally, but to make the upstream behavior correct. Mario has been very good at refusing to make Pi paper over every misconfigured gateway, and we’re trying to preserve that discipline. When a gateway behaves correctly, everybody benefits. Sadly that type of thinking is quickly disappearing because these machines make local workarounds cheap, so code accumulates local defenses against every misbehavior. Instead of humans talking to humans about where a fix belongs, one human and one machine work around the problem in isolation. Keep in mind that AI has not increased the number of people who need software, or the number of maintainers who can review it. It has mostly increased the amount of code and the number of projects competing for attention. Some of that is healthy, but a lot of it fragments effort that should be shared. We need stronger foundations, not weaker ones. Open Source needs more collaboration, not more isolated work with a machine. Human communication is hard, and it is tempting to avoid it when you can sit alone with your clanker. But isolation is not where Open Source derives its value. The value is in the community and the structure that lets projects outlive their original creators. To me, clanker is a much preferable term for agent. Agency lies with humans, not with machines. Calling these things agents I still believe is a mistake, but alas. ↩ I ran this command. I expected this to happen. This happened instead. Here is the exact error or log. To me, clanker is a much preferable term for agent. Agency lies with humans, not with machines. Calling these things agents I still believe is a mistake, but alas. ↩

0 views