Posts in Backend (20 found)
Giles's blog Yesterday

JAX backends and devices

There's nothing like writing your own code with a framework to clarify how things fit together! Continuing with my port of my PyTorch LLM code to JAX , I wanted to load up a large dataset: the 10,248,871,837 16-bit unsigned integers in the split of . That's just over 19GiB of data. When I ran that, I got a CUDA out-of-memory error: That makes sense! The allocation it was trying to do is exactly the size of the data I was trying to load. I have an RTX 3090 with 24 GiB, but some is already used up by the OS, various apps, and a model that the code creates earlier on. But in PyTorch land, I was used to things being loaded into RAM by default, and only moved over to the GPU when I asked it to do that. JAX was clearly loading to the GPU by default. How could I stop it from doing that for this case? The load into the GPU was happening inside Safetensors, in code I couldn't directly control. Understanding how to do it helped me understand a little bit more about JAX. JAX has a function that looks relevant: . Without reading the docs, let's try running it. In my virtualenv, with the package installed, I get this: That seems a bit weird! I do indeed have a CUDA device, but I also have a CPU, obviously. Why isn't it showing up? Running the same code in another virtualenv, with just installed -- no CUDA -- gets this: OK, so it did recognise it this time. Feels like it might be time to RTFM. The docs explain things a bit: Returns a list of all devices for a given backend. If is , returns all the devices from the default backend. The default backend is generally or if available, otherwise . OK. So JAX has multiple backends -- named that because they're classes of backend hardware that XLA (the compiler behind the JIT) targets. There is a default one, which is essentially going to be the "best" one available given the hardware configuration and the parts of JAX that are installed. When I had the CUDA version installed, it made the backend default, but when I didn't, it defaulted to (and warned me). And because it only shows the devices on the default backend, when that was , I didn't see the CPU. However, you can specify which backend you want to use with that parameter, so let's go back to the virtualenv with CUDA: Great! So is there some way to list which backends are available? Apparently not -- the recommended way appears to be to try loading devices for the different possibilities, and catch to see which ones aren't available. Yuck. But maybe that's not such a big deal. In PyTorch-land I was very much used to putting code like this near the start of my code: ...then moving models to the device: ...and then moving data to the model's device as needed: What I actually wanted was essentially what JAX does -- have everything on the fastest device available at all times -- but with specific exceptions. In particular, the one that started off this investigation: how would I put this huge array of training data on the CPU's RAM rather than the GPU's VRAM? I had a bit of a false start when I spotted that the function in the Safetensors FLAX API has a parameter, but that appears to be more to do with how it loads up the file -- a backend in a different sense. And anyway, backend is not the right concept in JAX-land, as the backend means just something generic like -- for what we're trying to do, we want to load it onto a specific device . After some digging around, I discovered that JAX has a concept of a default device , which is the one used when it doesn't have any indication of where to put something. It makes sense that this will be on the default backend -- indeed, it looks like it's essentially "the first device in the list that returns for the default backend". There is a config option which you can use to set it; you'd normally use or an environment variable to change it. But what if you only want to change it temporarily? I found this documentation for . The docs are more than a little confusing: Context manager for config option. Configure the default device for JAX operations. Set to a Device object (e.g. ) to use that Device as the default device for JAX operations and jit’d function calls (there is no effect on multi-device computations, e.g. pmapped function calls). Set to None to use the system default device. That near the start tripped me up, as I missed the words "Context manager" just below, and the odd type, and tried this: I still got the CUDA OOM, though, so I reread the docs, spotted the "context manager" bit, swore violently, and tried this: ...which works. It looks like the equals sign in the docs is being used to mean something very different to what you'd normally use it for, and they decided not to actually document the signature of the context manager. Heigh ho. I guess documentation is hard . Still, at least now I have a solution. And as I said earlier, doc grumbles aside, the shape of the code might wind up being a little less fiddly than PyTorch. The default location of things I create is the fastest hardware I have, which is what I want. And for the rare exceptions when I don't want to use that, there is a reasonably simple (now that I know it) way to say where I want things to go. I'll call that a win :-) The only thing I'll need to remember is that when, in my training loop, I want to use subsets of that in-RAM tensor, I'll need to move them to the GPU. looks like the right tool for that.

0 views
Xe Iaso Yesterday

IPv6 zones in URLs are a mistake

IPv6 is weird. One of the more strange parts of the standard is that every interface's link local addresses are in . If you have a machine with two network interfaces, both of them will be in , so if you have a packet destined to , how do you disambiguate it? The answer is you use IPv6 scopes/zones . The exact format of what goes into a zone is OS dependent, but on Linux it's the interface name and on Windows it's the interface ID. This lets the kernel's routing table know how to handle an address range conflict. On my tower, this would be represented like this: Where is the name of my tower's ethernet device. When you create a host:port bindhost, you normally separate the hostname and port with a colon. IPv6 uses colons to separate hex groups. In order to disambiguate what's the host and what's the port, you typically format the IPv6 address in square brackets, so on port 80 would look like this: And with the right scope it looks like this: Now let's get URL encoding into the mix. From high orbit, you can imagine a URL's format as being something like this: An IPv6 zone would then be part of the hostname, just like with that port 80 example from earlier. So you'd think the URL would be something like this: But if you try to parse this as a URL in Go, you get an error: This happens because URLs can't represent all Unicode values, so any values that don't fit into the grammar of a URL become percent-encoded . This is why sometimes you'll see a in URLs in the wild; that's encoding the ascii space key, which is invalid in URLs. In order to work around this, you need to percent-encode the percent sign in the IPv6 zone: In theory, there is guidance for how to properly handle IPv6 zones in user interfaces in RFC 9844 , but there's no such guidance for URLs . Go also does not seem to follow this RFC in net/url . EDIT: It seems that this behaviour is compliant with RFC 6874 and that this is in fact how it is meant to be done. Our industry confounds me. So in the meantime in order for Anubis to point to IPv6 zoned addresses, you need to encode the with percent encoding. This is horrible, but it seems that this is an edge case that applies to other frameworks, programming languages, and libraries: Maybe some day in the future there will be a better option here. In the meantime my policy of not forking the Go standard library means that this somewhat terrible UX for an edge case is acceptable. I hate it, but what can you do? TL;DR: computers were a mistake. https://trac.nginx.org/nginx/ticket/623 https://github.com/psf/requests/issues/6808 https://datatracker.ietf.org/doc/html/draft-schinazi-httpbis-link-local-uri-bcp-03 -- Browsers don't currently support IPv6 zones because it breaks the concept of an "origin" which is used for many subtle things, this RFC draft attempts to define an zone origin in IPv6 so that browsers have a leg to stand on

0 views

Broker-Visible vs Client-Local Parallelism

This post is a little side-quest from my “Kafka Share Groups and Parallelizing Consumption” series. My “Kafka Share Groups and Parallelizing Consumption” series ( part 1 , part 2 ) has been laser focused on how different configurations and behaviors affect parallel consumption in share groups (Queues for Kafka). So far I’ve shown that you most definitely can hold share groups wrong . You could quite easily and inadvertently create a work queue and with the right combination of things going against you, see a small number of consumers dominate, leaving most consumers starved of messages. All the while lag builds and builds. You need to know the settings and what they do. Don’t just rely on the defaults. But it’s worth asking the question: is parallelizing consumption what share groups are for? The answer is no. If your only concern is parallel consumption, then there are other options. Chuck Larrieu Casias wrote a good post on LinkedIn pointing out that people shouldn’t be thinking of share groups as THE solution to parallelizing work (without exploding the partition count). Share groups exist to expose queue-like semantics over a log. Unlike a normal consumer group, a share group lets you accept one record and reject another for retry. A consumer group tracks one committed offset per partition. A share group has to track many individual records independently: which records are available, which have been delivered (to whom), which have been acknowledged, and which should become available again. But just because share groups don’t exist primarily to parallelize work doesn’t mean it’s not a tool that can be used for that purpose. If your messages are independent or you are otherwise ok with loose ordering then share groups could be a simple choice for breaking away from partition count as the unit of parallelism. The central theme I took from Chuck’s post is that parallelism has to be accounted for somewhere . The unit of parallelism can be broker-visible and broker-managed, or client-local and client-managed. Broker-visible/managed can only take you so far. When you need to process 1,000 messages in parallel to cope with the producer rate, what represents those 1,000 parallel units of work? Is it partitions, consumers, virtual threads/async tasks? If the unit of parallelism is the consumer itself then we must scale out serial consumers to scale the parallel processing (with a matching partition count with consumers groups). Every parallel unit of work (consumer) becomes visible to the broker as protocol interactions and state plus one or more TCP connections. If parallelism comes in part from the client itself, the unit of parallelism could be a virtual thread, an async task or even an OS thread. This is invisible to the broker. You need fewer consumers, fewer TCP connections, and less broker-visible protocol interaction/state.  This split of where the unit of parallelism is accounted for, broker-side vs client-side, exists across all messaging systems. It’s not specific to Kafka. A simple calculation for aggregate parallelism is easy: 60000 msg/s * 1s = 60000 60000 msg/s * 5s = 300000 100 msg/s * 20s = 2000 10000 msg/s * 0.5s = 5000 50 msg/s * 5s = 250 Once you know how many messages must be processed in parallel, you can figure out your tactics. The formula tells you how much parallelism you need, then it’s up to you to figure out where that parallelism should live. Let’s use our 60,000 messages per second workload from the share group series. If it takes 1 second to process each message, then we need to support 60,000 messages being processed at any given moment. If each unit of parallelism is a serial consumer, then that means 60,000 consumers! That’s a lot of connections, a lot of protocol state, and a really big consumer group. What if it takes 10 seconds on average to process a message, you’d need 600,000 consumers, and well over 1 million TCP connections! If most of the work is I/O, and the CPU spends a lot of time waiting around then can’t we make a single client do more work? What if one client can handle processing 1000 messages in parallel? Then we’d only need 60 consumers for the “60K msg/s + 1 second processing time” example.  Fig 1. Left: Parallel work across N serial consumers. Right: Parallel work across N parallel-capable consumers. If the ultimate unit of parallelism is visible to the broker as something it must manage, it can get really expensive in resources for highly parallel workloads (no matter which messaging system you use). Managing virtual threads, or even OS threads, is much cheaper than managing one or more TCP connections + metadata per unit of parallelism. This is true of all messaging systems I have ever used. The cost is greater complexity on the client, but if you don’t want to roll your own logic, there are libraries to help here (see Chuck’s post for some). Unfortunately, the ParallelConsumer library is no longer being maintained (though a fork might be in the future). This library not only added internal client-side parallel processing but queue semantics as well (on top of consumer groups). Now that we have share groups, perhaps we need a new library that adds client-side parallelism to share groups. I’m going back to writing Part 3 of my parallelism in share groups series. We’ll be comparing broker-managed vs client-managed parallelism with share groups and consumer groups. 60000 msg/s * 1s = 60000 60000 msg/s * 5s = 300000 100 msg/s * 20s = 2000 10000 msg/s * 0.5s = 5000 50 msg/s * 5s = 250

0 views

How Other Link Checkers Do Recursion

After I published Five Years of Trying to Add Recursion to lychee , one reply I got was a very fair question: If recursion is so hard, how do other link checkers do it? Plenty of them already crawl websites! This sent me down a rabbit hole of reading the code of other link checkers. The key takeaway is: they didn’t find a clever trick we missed. They were built as crawlers from the very first commit, and I initially built lychee as a stream. I went and read the source of the recursive checkers we list in lychee’s README : muffet (Go), LinkChecker (Python), linkinator (TypeScript), and broken-link-checker (JavaScript). This post is a teardown of how each one actually handles recursion, what it costs them, and what it means for lychee. If you haven’t read the first post , the summary is that lychee was architected as a one-shot, unidirectional pipeline ( ). Recursion needs a cycle (responses create new inputs), and cycles in an async, channel-based pipeline are where the dragons live . 🐲 Five years and four attempts later, the pieces we’ll need to do it properly only just landed. DAGs vs. cycles Every recursive checker I looked at is built from the same three parts: Diagrammatically, lychee is different from the others: Crawlers have a back-edge baked in. Our pipeline doesn’t, and every one of my failed attempts was an effort to bend that back-edge into a graph that was never designed for it. Let’s look at that graph design more closely: Note that the visited check happens in the enqueue step, atomically with the mark, before the worker ever touches the network. That ordering is the entire fix to the deduplication race that haunted lychee’s attempts 1–4, where the cache was written after checking. Each tool uses a variation on it. muffet (Go): a WaitGroup and a Set muffet is closest in spirit to lychee: a fast, single-binary, concurrent website checker. The dedup + scheduling decision lives in one method ( ): is a (a mutex-guarded ). returns whether the URL was already present, so a page is only scheduled the first time it’s seen. Dedup happens at enqueue, synchronized by the set’s mutex. This is basically a line-by-line translation of the diagram above. Checking a page fetches all of its links concurrently, and feeds qualifying ones back into , the back-edge: How muffet knows it’s done muffet’s answer to termination is a little built around a ( ): Every scheduled page increments the group; every completed page decrements it; returns when the count hits zero. The whole crawl bootstraps with a single before , so the counter is positive before anyone waits on it. This is the same counter I tried (and failed with) in Attempt 1 and Attempt 4 . The difference is the invariant: is only ever called from inside an already-running daemon that holds the count above zero (or from the bootstrap). There is no window where the counter briefly reads zero while work is still pending. Go’s enforces this invariant so naturally that it doesn’t feel like distributed termination detection at all, but that’s exactly what it is. It’s the moral equivalent of the primitive Kait contributed to lychee in 2026 . Where the tradeoffs are Concurrency isn’t bounded by the daemon manager. does for every task, spawning unbounded goroutines. The actual limiting happens downstream in a (a buffered-channel counting semaphore) and a per-host throttler pool. muffet separates “the frontier” from “the rate limiter,” which is exactly the separation lychee lacked when it tried to use one bounded channel as both in the past. Cheap goroutines do a lot of heavy lifting. Spawning a goroutine per link is “fine” in Go. The equivalent in Rust ( per link, each needing state) is what pushed me toward and the ownership pain I wrote about . On extensibility, muffet is a focused CLI, not a library. There’s no plugin surface; you get what the flags give you. lychee deliberately ships as a reusable crate, which raises the bar, since every architectural choice has to uphold the standards of a public API. On scalability, unbounded goroutines plus an in-memory visited set scale comfortably to large sites, but there’s no disk-backed frontier, so a truly enormous crawl is bounded by RAM. Same as lychee. Takeaways: muffet LinkChecker (Python): a joinable unbounded queue LinkChecker has existed since the year 2000. It’s a synchronous, thread-pool crawler. Its frontier is a hand-written ( ), a clone of Python’s with / . Look at the very first design comment: It’s explicit about the exact deadlock that bit me. That comment is our Attempt 4 backpressure deadlock , called out and designed around. lychee tried to push discovered URLs into a bounded channel; when it filled, the response handler blocked, no responses drained, no slots freed. Deadlock. 💥 LinkChecker’s answer is brutalist in nature: the frontier is unbounded . Backpressure is enforced elsewhere (a fixed thread count and per-host throttling), never by blocking a producer that is also a consumer. Termination by counter, done right blocks until hits zero ( ): Again: a counter. But the increment in and the decrement in are both inside the queue’s lock, and a worker calls only after fully processing an item including enqueuing its children . So children are counted before the parent is marked done, with no premature zero. It’s semantics implemented with a mutex and a condition variable. Deduplication, before the request LinkChecker writes the URL into its result cache at enqueue time ( ): That sentinel is a “fix” that’s missing in lychee’s attempts. By the time any worker thread checks the URL, the cache already says “mine,” so concurrent discovery from another page is a no-op. Per-host politeness and termination guards The ( ) throttles per host: and calls so a stuck crawl can’t hang forever. Where the tradeoffs are Blocking threads instead of async. Each of the (default 10–100) threads does blocking I/O via . Simple and battle-tested, but the concurrency ceiling is the thread count, and each thread carries a full stack. lychee’s Tokio model reaches thousands of concurrent in-flight requests on a handful of OS threads; LinkChecker can’t, and doesn’t try. The unbounded frontier trades a deadlock for unbounded memory. The explicit “no max size” decision means RAM growth on huge sites. There’s a cap and a periodic to mitigate it. Extensibility is excellent. LinkChecker has a real plugin system ( : anchor checks, SSL, virus scanning, and more) and many output loggers. This is the most extensible of the bunch, and it pays for that with a large, mature, somewhat old-fashioned codebase. On scalability, it’s GIL-bound and thread-limited, so raw throughput is the lowest here, but correctness and feature coverage are high. Takeaways: LinkChecker linkinator (TypeScript): Single-Threaded linkinator is a Node.js checker, and it benefits from something neither Go nor Rust provides: a single-threaded event loop . Check-and-insert into the visited set is atomic for free , because no two callbacks run simultaneously. The frontier is a concurrency-limited (a p-queue-style structure). Termination is one line in ( ): is the library’s termination detection: it resolves when the queue is empty and no task is in flight. Same idea as muffet’s and LinkChecker’s , just expressed as a promise and backed by a single-threaded runtime, so no Mutex is needed to protect the visited set. The back-edge and the race-free dedup When crawling, GETs the page, extracts links, and for each new URL re-enters the queue ( ): Because JavaScript is single-threaded, the entire thing executes without interruption. In Rust or Go, that’s a critical section you must guard with a mutex (and get the ordering right); in Node it’s just three statements. This is the single biggest reason recursion is easier in Node than in Rust. It’s just a language feature. linkinator also keeps a of keys, and a map so it can wait on an in-flight check and still report a duplicate broken link against every parent that references it. Those reuse-operations are themselves pushed onto the same queue, so correctly waits for them too. HEAD vs GET linkinator uses for leaf links but when it needs to crawl, because recursion needs the response body to find more links : This is precisely lychee’s remaining open problem : you can only recurse into pages you fetched with a body. linkinator just always GETs when crawling; lychee plans to reuse the body it already has in cache from the check it just performed. Where the tradeoffs are Single-threaded is both a blessing and a ceiling. No data races, trivially correct dedup, but HTML parsing is CPU work that blocks the one event loop. For thousands of pages, you’re bound by a single core. lychee’s multi-threaded runtime parses and checks in parallel. It suffers from in-memory result inflation. The source explicitly comments on “massive result inflation for heavily interlinked sites”: the array, , and all grow with the crawl. Fine for a docs site, heavy for a giant one. Rate limiting is reactive, not proactive. There’s a that backs off per host on a with , but no general per-host concurrency cap like lychee’s . linkinator can hammer a host until it complains; lychee now paces before the complaint. For extensibility, it’s an ( , , and so on), so it’s embeddable and scriptable, which is nice. It’s a library first, like lychee. Takeaways: linkinator broken-link-checker (JavaScript): event-driven, using two queues broken-link-checker (BLC) takes the event-driven model furthest. It’s built on , a queue with (concurrency) and , and it nests two of them: a site-level queue feeding a page-level . The frontier and dedup live in ( ). Visited pages are tracked in a , written at enqueue time: Recursion is governed by a filter that decides whether a discovered link becomes a crawled page: Termination by event cascade BLC has no counter and no . It rides the queue’s drain events. When the page-level queue empties it fires , which makes emit and call the site queue’s callback; when the site queue drains, it fires . That’s the public : That’s their termination detection, expressed as “the request queue reported empty.” And in classic Node.js fashion, the callback is what actually tells the site queue to free up a slot for another site. So the termination of one site is what allows another to start, and the termination of the whole crawl is what allows the process to exit. It’s a cascade of events that propagates from the page queue to the site queue to the process. Where the tradeoffs are It’s the best web citizen of the bunch. robots.txt is honored ( , ), is respected, and plus are first-class. This is a crawler that’s polite by default. Event cascades are powerful but fiddly. Termination is spread across half a dozen event handlers and two nested queues. It works, but the control flow is much harder to follow than . This is the JS cousin of the “leaky abstraction” problem I described, where recursion-awareness ends up sprinkled across many handlers. It’s single-threaded, the same ceiling as linkinator, plus the in-memory per site. On maturity versus momentum, it’s very widely used (it powers a lot of tooling), but development has slowed. The architecture is still sound and worth studying. Takeaways: broken-link-checker A note on markdown-link-check and the “industrial” crawlers Our README marks markdown-link-check as supporting recursion, but there’s some nuance there: it recurses over Markdown files , not by spidering a live website. There’s no HTTP frontier and no termination problem in the sense above. Worth a mention so the comparison is honest, not worth a teardown. If you want to see the pattern at full industrial scale, look at Scrapy (Python/Twisted) or Colly (Go). Both use the same approach: a scheduler (frontier) with a pluggable, optionally disk-backed queue, a dupefilter (often a Bloom filter rather than a ), a bounded downloader pool, and explicit “engine idle → close spider” termination. They solve exactly the problems lychee struggled with ( distributed termination detection , backpressure, dedup), just with years of dedicated crawler engineering behind them. The takeaway isn’t “lychee should be Scrapy”: it’s that crawling is a well-trodden architecture, and lychee is simply standing on a different one right now. Side-by-side Tool Lang / runtime Concurrency model Frontier “Done?” signal Dedup point Per-host limiting muffet Go, goroutines goroutine pool + semaphore + host throttler mutex-guarded set + daemon channel visited set at enqueue host throttler pool LinkChecker Python, threads fixed blocking thread pool unbounded joinable-queue counter ( ) result cache at (req/s) linkinator Node, event loop single-thread + p-queue ( ) p-queue at enqueue (race-free) reactive broken-link-checker Node, event loop ( ) nested request queues queue-drain events at enqueue + lychee (2026) Rust, Tokio tasks + channels + per-host pool lychee in 2026 finally has a column-for-column match. The is muffet’s and LinkChecker’s . The is BLC’s / and LinkChecker’s . The per-URI mutex is everyone’s enqueue-time dedup. So Why Couldn’t We Just Copy Them? Three reasons, in increasing order of how much they’re actually lychee’s fault. They started as crawlers; lychee started as a stream. Every tool above has a back-edge in its core data structure. lychee’s core was a DAG optimized for the 99% case (a list of files/URLs, checked once, fast). Retrofitting a cycle onto a pipeline is much harder than having one from the start. The problem is architectural in nature. The frontier and the rate-limiter must be different objects. muffet (set + semaphore), LinkChecker (unbounded queue + thread count), linkinator (p-queue + delayCache), BLC (request queue + maxSockets) all keep “what to do next” separate from “how fast to go.” lychee’s early attempts tried to make one bounded channel serve both roles, and a cycle through a bounded channel deadlocks. The fix (lychee’s plus a over an unbounded work source) is the same separation we’re aiming for now. Single-threaded runtimes get dedup for free. Both Node tools dedup with a plain and zero locking, because the event loop serializes access. Go and Python pay a mutex. Rust pays a mutex and fights the borrow checker about who owns the shared state across . That’s the ~30% “Rust tax” I estimated last time : not the algorithm, but the friction of expressing shared mutable frontier state under . None of this is a knock on lychee’s design. A unidirectional stream is the right call for the common, non-recursive case: it’s why lychee is fast and why the 30% channel regression from Attempt 2 was a dealbreaker. The other tools pay for their back-edge on every run, recursive or not. lychee refused to, and that principle is exactly why recursion took five years and why, when it lands, it won’t slow down the path everyone actually uses. I believe that we can have our cake and eat it too: a crawler architecture that supports recursion without sacrificing the speed of a one-shot pipeline. But it’s a harder problem than just “copy what they do,” because most link checkers didn’t start with uncompromising performance as their top goal. Key takeaways So when someone asks “how do other link checkers do recursion?”, the real answer is: they made it a part of the architecture from the beginning, and they leaned on a runtime (providing conveniences like a , a joinable queue, an idle promise) that solved termination without solving “distributed termination detection.” Thanks to the maintainers of muffet, LinkChecker, linkinator, and broken-link-checker: reading your source is the clearest way to learn about crawler architecture out there and we’re all in this together, just with a different set of tradeoffs. A mutable work queue (let’s call it “frontier”), not a fixed input stream. Discovered URLs go back into the same queue they came from. A visited set that’s updated at enqueue time (before the request completes), so two pages discovering the same link can’t both submit it. A primitive that answers “is everything done?”: a , a joinable-queue counter, an promise, or a queue-drain event. Concurrency isn’t bounded by the daemon manager. does for every task, spawning unbounded goroutines. The actual limiting happens downstream in a (a buffered-channel counting semaphore) and a per-host throttler pool. muffet separates “the frontier” from “the rate limiter,” which is exactly the separation lychee lacked when it tried to use one bounded channel as both in the past. Cheap goroutines do a lot of heavy lifting. Spawning a goroutine per link is “fine” in Go. The equivalent in Rust ( per link, each needing state) is what pushed me toward and the ownership pain I wrote about . On extensibility, muffet is a focused CLI, not a library. There’s no plugin surface; you get what the flags give you. lychee deliberately ships as a reusable crate, which raises the bar, since every architectural choice has to uphold the standards of a public API. On scalability, unbounded goroutines plus an in-memory visited set scale comfortably to large sites, but there’s no disk-backed frontier, so a truly enormous crawl is bounded by RAM. Same as lychee. muffet’s termination is a , full stop. It’s the design lychee converged on after five years; muffet got it for free from Go’s standard library on day one. The frontier and the concurrency limiter are separate things. A mutex-guarded set is the frontier; a semaphore plus host throttler bounds concurrency. Conflating them is what deadlocked lychee. Goroutines hide the cost that Rust makes you pay explicitly. The same per-task model that’s trivial in Go is where Rust’s /ownership friction shows up. Blocking threads instead of async. Each of the (default 10–100) threads does blocking I/O via . Simple and battle-tested, but the concurrency ceiling is the thread count, and each thread carries a full stack. lychee’s Tokio model reaches thousands of concurrent in-flight requests on a handful of OS threads; LinkChecker can’t, and doesn’t try. The unbounded frontier trades a deadlock for unbounded memory. The explicit “no max size” decision means RAM growth on huge sites. There’s a cap and a periodic to mitigate it. Extensibility is excellent. LinkChecker has a real plugin system ( : anchor checks, SSL, virus scanning, and more) and many output loggers. This is the most extensible of the bunch, and it pays for that with a large, mature, somewhat old-fashioned codebase. On scalability, it’s GIL-bound and thread-limited, so raw throughput is the lowest here, but correctness and feature coverage are high. The unbounded frontier is a deliberate anti-deadlock choice, documented in a one-line comment. It describes the exact problem we hit in lychee in attempt 4. Dedup at time (a placeholder in the cache) is their synchronization mechanism. The cache must claim the URL before the request, not after. Threads buy simplicity at the cost of throughput. A blocking thread pool is the easiest correct model… and the slowest one. Single-threaded is both a blessing and a ceiling. No data races, trivially correct dedup, but HTML parsing is CPU work that blocks the one event loop. For thousands of pages, you’re bound by a single core. lychee’s multi-threaded runtime parses and checks in parallel. It suffers from in-memory result inflation. The source explicitly comments on “massive result inflation for heavily interlinked sites”: the array, , and all grow with the crawl. Fine for a docs site, heavy for a giant one. Rate limiting is reactive, not proactive. There’s a that backs off per host on a with , but no general per-host concurrency cap like lychee’s . linkinator can hammer a host until it complains; lychee now paces before the complaint. For extensibility, it’s an ( , , and so on), so it’s embeddable and scriptable, which is nice. It’s a library first, like lychee. is the termination mechanism. Simple and provided by the JS runtime. A single-threaded event loop makes request deduplication pretty much free. This is the biggest structural reason recursion is easier in that case. Reactive 429 backoff is not the same as proactive per-host pacing. lychee’s aims higher, at the cost of more machinery. It’s the best web citizen of the bunch. robots.txt is honored ( , ), is respected, and plus are first-class. This is a crawler that’s polite by default. Event cascades are powerful but fiddly. Termination is spread across half a dozen event handlers and two nested queues. It works, but the control flow is much harder to follow than . This is the JS cousin of the “leaky abstraction” problem I described, where recursion-awareness ends up sprinkled across many handlers. It’s single-threaded, the same ceiling as linkinator, plus the in-memory per site. On maturity versus momentum, it’s very widely used (it powers a lot of tooling), but development has slowed. The architecture is still sound and worth studying. Termination is a cascade of queue-drain events, not a counter. Same idea, different syntax. Politeness is built in. robots.txt, , and make it the most server-friendly recursive checker by default. Event-driven control flow is the cost. Distributing recursion logic across many handlers is exactly the kind of spread-out complexity that makes the feature hard to reason about. There is no secret sauce. Every recursive checker is a worklist plus a visited set plus a quiescence detector. The “trick” is being shaped like a crawler from commit one. Termination is always the same idea wearing different clothes: (muffet), joinable-queue counter (LinkChecker), (linkinator), queue-drain events (BLC), (lychee 2026). All of them are distributed termination detection. Dedup belongs at enqueue, before the request. Marking a URL visited after checking it (what lychee did for four attempts) is the bug. Everyone else claims the URL the moment it enters the frontier. Separate the frontier from the rate limiter. A bounded channel that is both your queue and your backpressure will deadlock the instant you add a cycle. There is no free lunch. Node’s single thread makes dedup trivial at the cost of performance; Go’s goroutines and make termination trivial at the cost of a runtime; Rust gives you neither for free but hands you a compiler that refuses to let the races compile and you can get the network card to glow if you know exactly what you are doing.

0 views
Frederik Braun 6 days ago

The S in interoperability

This is a blog post about standards, their proliferation and the issues that may arise. My first involvement with standards was just as a reader. To better understand complicated code or unexpected behavior in a protocol. After a while, I also got involved and helped clarify certain things to ensure …

0 views
The Coder Cafe 1 weeks ago

Metastable Failures in Distributed Systems

☕ Welcome to The Coder Cafe! Today, we explore one of the nastiest failure patterns in distributed systems: metastable failures. Based on the Metastable Failures in Distributed Systems whitepaper, we break down why these failures happen, why they persist, what we can do about them, and why our instinct to fix them is probably wrong. Get cozy, grab a coffee, and let’s begin! Stable, Vulnerable, Metastable Metastable failures borrow their name from physics, where metastable means something that looks stable but isn’t . To understand how a distributed system can end up in such a state, we need to look at three distinct states it can be in: Stable: The system recovers on its own after any disruption. This is what we call resilience in Resilient, Fault-tolerant, Robust, or Reliable . Vulnerable : The system looks perfectly healthy, but it's operating above its hidden capacity : the load level below which it can self-heal from any disruption. It responds fast, metrics are green, and nothing is alarming. Many production systems deliberately operate here because it's more efficient: resources are used closer to their limit. But there's no slack left . And the deeper the system operates in a vulnerable state, the smaller the trigger needed to push it over the edge. Indeed, a system just above its hidden capacity can survive large disruptions; a system near its advertised capacity can be tipped by almost anything. Metastable failure : A trigger (e.g., a network blip, a deployment, a traffic spike) pushes the system over its hidden capacity. The system is not fully broken: processes are alive, and it’s still running. But goodput collapses: it’s no longer doing any useful work. Technically up, effectively down . And unlike a regular outage, removing the trigger doesn’t fix it. Getting out requires a strong corrective push: a restart, a dramatic load reduction, a manual intervention. NOTE : If you’re not familiar with the concept of goodput, it’s the throughput of useful work completed successfully. For example, in a web application receiving 1000 requests per second but returning errors for 800 of them, the goodput is only 200 RPS. The three states of a metastable failure. A system can drift into the vulnerable state unnoticed, and a single trigger is enough to push it into the metastable state it cannot escape on its own. The most disorienting property of a metastable failure: stopping the trigger doesn’t stop the failure. To understand why, we need to talk about feedback loops. In a previous post on Systems Thinking Explained , we defined a feedback loop as: If causes , then influences . A feedback loop is exactly the mechanism that keeps a system stuck in the metastable state . There is always a sustaining effect, a feedback loop, that prevents recovery. The trigger is just what pushes the system over the edge. The loop is what keeps it there. Blaming the trigger is the natural instinct, and almost always the wrong diagnosis. Let’s discuss a concrete example to make this clear. Imagine a web application that queries a database. The database comfortably handles up to 300 QPS. The application retries any query that doesn’t respond within 1 second. The system is running at 280 QPS, healthy and fast, within the database’s capacity. Then, a transient network issue occurs for 10 seconds. When the issue is over, all the queued requests flood in at once. The database gets hit with a surge it can’t absorb: latency spikes and queries start timing out. So the application retries them. This doubles the effective load to 560 QPS. The database, already struggling, falls further behind. More timeouts. More retries. The loop is now self-sustaining: High load → Timeouts → Retries → Higher load → More timeouts → More retries The transient network issue was fixed minutes ago. Yet, the system is still completely broken. The trigger is gone; the feedback loop is not . The only way out is to dramatically cut the load or disable retries entirely. This is a metastable failure . The system was vulnerable because it was operating close to its hidden capacity . A minor, transient trigger pushed it over the edge and into a self-sustaining failure state it couldn’t escape on its own. The retry mechanism, a feature designed to improve reliability, became the very thing that prevented recovery. This is one example, but the same pattern appears with caches, connection pools, failover logic, and more. The shape is always the same: a feedback loop that turns a temporary problem into a permanent one . Two things make metastable failures particularly nasty. We can be tempted to blame the wrong thing . When an outage happens, the trigger is what’s visible and recent: a spike, a deployment, a hardware fault. It’s the obvious culprit. But the trigger only exposed the problem; it didn’t create it. The sustaining feedback loop was already there, structural and invisible. When analyzing the problem in retrospect, teams focus on the trigger; fixes address the trigger; and the system remains vulnerable to the next one. The authors of the paper observed teams declare a metastable failure “resolved” multiple times before realizing the real cause had never been touched. The feedback loop grows stronger with scale . Small-scale tests won’t reveal it. A staging environment running at 10% capacity may handle the same trigger without falling into a metastable state, because the loop isn’t strong enough at that scale to be self-sustaining. This means these failures can slip past even rigorous testing regimes and only manifest in production at full load. We defined hidden capacity earlier as the load level below which the system can self-heal from any disruption. It’s different, and always lower, than the advertised capacity. In our example, the numbers make it concrete: the advertised capacity is 300 QPS, but the hidden capacity is only 150 QPS, because retries double the load under failure. The gap between those two numbers is where vulnerability lives . Measuring the hidden capacity is not straightforward, though. One possible approach is to apply a trigger at a given load level and observe whether the system recovers on its own: If it does, we are below the hidden capacity. If it doesn’t, we are above it. We can also estimate it indirectly: in the retry example, retries double the load under failure, so the hidden capacity is roughly half the advertised capacity. Metastable failures are not bugs . We can’t write a unit test that catches them. They are emergent behaviors: properties that arise from the interaction of a system’s components under specific conditions, not logic errors in any individual component. No single piece of code is buggy, no single configuration is wrong. The failure is a consequence of how everything fits together under load. This changes how we need to think about them. The right question after an outage is not “ What failed? ” but “ What loop sustained it? ” And before an outage, the danger is not having bugs; it’s optimizing so aggressively for efficiency that we push the system deeper into the vulnerable state without realizing it . Retries, caches, failover logic, connection pools: these are all features that improve reliability in the common case. They are also, under the right conditions, the sustaining mechanisms of metastable failures. The same design decision that makes a system more resilient in normal operation can also prevent it from recovering when things go wrong. The paper describes several approaches to reduce the risk of metastable failures: Retry budgets and circuit breakers : Instead of retrying indefinitely, cap the total number of retries in flight at any given time. This directly weakens the feedback loop by limiting work amplification. LIFO scheduling under overload : Counterintuitively, switching from FIFO to LIFO when the system is overloaded allows some requests to complete within their deadline, preserving goodput instead of letting every request time out. NOTE : I already wrote a post about that approach in Adaptive LIFO . Fast error paths : Success paths are heavily optimized, but error paths often aren’t. An expensive error path (stack traces, DNS lookups, disk writes) under high failure rates can itself become a sustaining mechanism. Optimizing error paths reduces this risk. Read-through caches over look-aside caches : A read-through cache (where the cache itself fetches missing data from the database) can continue filling itself even when the application has given up on a request, steadily increasing the hit rate and helping the system recover. A look-aside cache (where the application is responsible for populating the cache) can’t. Production stress testing : Small-scale tests won’t reveal metastable failures. Testing against a portion of production traffic, with engineers ready to intervene, is the most reliable way to surface them. A note of humility from the paper: there is no systematic solution yet. These are ad-hoc mitigations developed in response to known failures. Detecting vulnerable states before they collapse remains an open problem. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. A distributed system can pass through three states: stable, vulnerable, and metastable. The vulnerable state looks healthy, but it isn’t. The threshold between stable and vulnerable is invisible. Systems can operate in the vulnerable state for months without any sign of trouble. When a trigger pushes a vulnerable system into a metastable failure, a feedback loop sustains the failure even after the trigger is gone. The trigger is not the root cause. The feedback loop is. Fixing the trigger leaves the system vulnerable to the next one. Reliability features like retries and caches can become the sustaining mechanism of a metastable failure under the right conditions. Metastable failures are emergent behaviors, not bugs. We can’t unit test for them, and optimizing for efficiency makes them more likely. Mitigations exist (retry budgets, circuit breakers, LIFO scheduling, fast error paths), but they are all ad-hoc responses to known failures. Detecting vulnerable states before they collapse remains an open problem. Resilient, Fault-tolerant, Robust, or Reliable? Adaptive LIFO Fail Open vs. Fail Closed Metastable Failures in Distributed Systems Metastability and Distributed Systems Stable, Vulnerable, Metastable Metastable failures borrow their name from physics, where metastable means something that looks stable but isn’t . To understand how a distributed system can end up in such a state, we need to look at three distinct states it can be in: Stable: The system recovers on its own after any disruption. This is what we call resilience in Resilient, Fault-tolerant, Robust, or Reliable . Vulnerable : The system looks perfectly healthy, but it's operating above its hidden capacity : the load level below which it can self-heal from any disruption. It responds fast, metrics are green, and nothing is alarming. Many production systems deliberately operate here because it's more efficient: resources are used closer to their limit. But there's no slack left . And the deeper the system operates in a vulnerable state, the smaller the trigger needed to push it over the edge. Indeed, a system just above its hidden capacity can survive large disruptions; a system near its advertised capacity can be tipped by almost anything. Metastable failure : A trigger (e.g., a network blip, a deployment, a traffic spike) pushes the system over its hidden capacity. The system is not fully broken: processes are alive, and it’s still running. But goodput collapses: it’s no longer doing any useful work. Technically up, effectively down . And unlike a regular outage, removing the trigger doesn’t fix it. Getting out requires a strong corrective push: a restart, a dramatic load reduction, a manual intervention. NOTE : If you’re not familiar with the concept of goodput, it’s the throughput of useful work completed successfully. For example, in a web application receiving 1000 requests per second but returning errors for 800 of them, the goodput is only 200 RPS. We can be tempted to blame the wrong thing . When an outage happens, the trigger is what’s visible and recent: a spike, a deployment, a hardware fault. It’s the obvious culprit. But the trigger only exposed the problem; it didn’t create it. The sustaining feedback loop was already there, structural and invisible. When analyzing the problem in retrospect, teams focus on the trigger; fixes address the trigger; and the system remains vulnerable to the next one. The authors of the paper observed teams declare a metastable failure “resolved” multiple times before realizing the real cause had never been touched. The feedback loop grows stronger with scale . Small-scale tests won’t reveal it. A staging environment running at 10% capacity may handle the same trigger without falling into a metastable state, because the loop isn’t strong enough at that scale to be self-sustaining. This means these failures can slip past even rigorous testing regimes and only manifest in production at full load. If it does, we are below the hidden capacity. If it doesn’t, we are above it. Retry budgets and circuit breakers : Instead of retrying indefinitely, cap the total number of retries in flight at any given time. This directly weakens the feedback loop by limiting work amplification. LIFO scheduling under overload : Counterintuitively, switching from FIFO to LIFO when the system is overloaded allows some requests to complete within their deadline, preserving goodput instead of letting every request time out. NOTE : I already wrote a post about that approach in Adaptive LIFO . Fast error paths : Success paths are heavily optimized, but error paths often aren’t. An expensive error path (stack traces, DNS lookups, disk writes) under high failure rates can itself become a sustaining mechanism. Optimizing error paths reduces this risk. Read-through caches over look-aside caches : A read-through cache (where the cache itself fetches missing data from the database) can continue filling itself even when the application has given up on a request, steadily increasing the hit rate and helping the system recover. A look-aside cache (where the application is responsible for populating the cache) can’t. Production stress testing : Small-scale tests won’t reveal metastable failures. Testing against a portion of production traffic, with engineers ready to intervene, is the most reliable way to surface them. A distributed system can pass through three states: stable, vulnerable, and metastable. The vulnerable state looks healthy, but it isn’t. The threshold between stable and vulnerable is invisible. Systems can operate in the vulnerable state for months without any sign of trouble. When a trigger pushes a vulnerable system into a metastable failure, a feedback loop sustains the failure even after the trigger is gone. The trigger is not the root cause. The feedback loop is. Fixing the trigger leaves the system vulnerable to the next one. Reliability features like retries and caches can become the sustaining mechanism of a metastable failure under the right conditions. Metastable failures are emergent behaviors, not bugs. We can’t unit test for them, and optimizing for efficiency makes them more likely. Mitigations exist (retry budgets, circuit breakers, LIFO scheduling, fast error paths), but they are all ad-hoc responses to known failures. Detecting vulnerable states before they collapse remains an open problem. Resilient, Fault-tolerant, Robust, or Reliable? Adaptive LIFO Fail Open vs. Fail Closed Metastable Failures in Distributed Systems Metastability and Distributed Systems

0 views
Jack Vanlightly 1 weeks ago

Kafka Share Groups and Parallelizing Consumption - Part 2: Producer Batches and share.acquire.mode

All tests were executed against Kafka 4.3.0 using Dimster .  In the last post we used simulated consumer processing time to reveal how important it is to set an appropriate value for to ensure the consumer parallelism that we expect. With a uniform distribution of messages over partitions, the rule of thumb was a value somewhat lower than: But there’s more to parallel consumption than . The size of producer batches also plays a role when using the default ( ). Share group members are assigned to partitions like consumer group members are, except that share group assignment allows multiple consumers to be assigned to the same partition. If the number of share consumers is less than the partition count, then each consumer will be assigned multiple partitions. If the consumer count matches or exceeds the partition count, then each consumer will be assigned one partition. Fig 1. Share consumer assignments. Left: consumer count < partition count. Right: consumer count > partition count. When a consumer is assigned only one partition, it will always be fetching from one broker. If a consumer is assigned multiple partitions, it may fetch from multiple brokers concurrently. There are two values for : The Javadoc says the following: The application chooses between the two modes using the consumer share.acquire.mode configuration property. If the application sets the property to batch_optimized or does not set it at all, the share consumer fetches records based on batch boundaries which may mean that the number of records returned may exceed the max.poll.records configuration property. The share consumer may also prefetch records and buffer them temporarily awaiting the application's next call to poll(Duration). If the application sets the property to record_limit, the share consumer fetches no more than records at a time and does not prefetch. This is slower but gives the application tighter control on how many records are fetched and when the acquisition locks begin. So why two modes?  It comes down to efficiency ( ) and consumer control ( ). First of all the sentence “ the share consumer fetches records based on batch boundaries” is correct but a little misleading. No matter what mode is used, whole batches are returned to the consumer over the wire . In other words, the data sent over the network is always based on batch boundaries as the record batch is the unit of data delivery.  What that sentence refers to is what records are acquired by the consumer and returned to the application: With , the config is a soft cap. The consumer acquires any batches (in their entirety) that are covered by the offset range determined by . These acquired batches are returned to the consumer, and the consumer returns the records of those batches to the calling application (that invoked ). With , the config is a strict cap. The consumer only acquires the records that are covered by the offset range determined by (though less if less records are available). However, the unit of data delivery is the record batch, so the consumer receives whole batches but only returns a specific offset range to the calling application. For example, in the figure below we have three consumers sending fetch requests with and . Fig 2. Three consumers fetching with batch_optimized Despite asking for only one record, each consumer acquires and receives records along batch boundaries. The result of consumer.poll(Duration) for c1 is three records, not one. If we rerun this scenario with record_limit: c1 acquires record 0 c2 acquires record 1 c3 acquires record 2 However, the batch is the unit of data delivery, so batch 1 is sent in its entirety to each consumer (the consumer internals only returns the acquired records of the batch to the application). Fig 3. Three consumers fetching with record_limit This is obviously less efficient… We just sent the same batch three times! Nonetheless, exists because sometimes that inefficiency over the wire is countered by other concerns (one of which is covered in this post). Another efficiency gain that has is that because each batch is only sent to one consumer, Kafka only needs to do share group housekeeping of the batch as a whole, not each record individually. This reduces CPU and makes metadata more compact. If we get mixed acknowledgments of the batch records (2 success, 1 reject) only then does the record tracking explode the metadata to be per-record. With , the housekeeping always tracks state per record, which is more expensive. The final difference between the modes is that in mode, a consumer can send concurrent fetches to all the brokers of its assigned partitions. This further increases the number of records that a consumer might receive as is a soft cap per broker. With , the consumer sends one fetch at a time, round-robin between the brokers of its partitions. This difference only manifests when the consumer count is less than the partition count. We’ll cover this aspect more in the next post. The main implications are that: With , the effective consumer parallelism can be impacted by the average number of records per record batch. With , the network throughput will increase in most scenarios as offset ranges are unlikely to align with batch boundaries. If the is larger than the average number of records per batch, then each batch may only be delivered twice. The network throughput can a lot if the is much smaller than the average number of records per record batch. Don’t worry if that isn’t clear yet, we’ll gather some empirical results next which should make it clearer. Let’s test this out with Dimster’s interactive mode, using the same workload as the last post. In the last post, we calculated that the maximum theoretical consumption rate for 300 consumers with a processing time of 5 ms per message would be 60,000 msg/s. By setting to 30 we reached 55,000 msg/s and then finally reached 60,000 with low end-to-end latency by adding an additional 12 consumers (2 per partition). So we use the following workload file (no dimensional stuff in this one as we’re going to use live-interaction): In this test we’re going to make the record batches bigger and see what happens to the consumption rate. First we start Dimster and ensure it’s handling the 60k msg/s. Once it has started and settled in, we see it’s coping well. If I look at the metrics, the current record batch size is around 5KB with 10 records per batch. The average fetch size is 7KB with 14 records. This means some consumers get 1 record batch per fetch and some get 2 record batches per fetch. Let’s increase the batch size. To do this we’ll drop to 1 producer, and set the linger.ms to 10 to reach the default batch.size of 16KB batches. We see that the batch size has risen to the default of 16KB, or 32 records per batch. The consumers should now, on average, receive 32 records per fetch (2 above the max.poll.records). Fig 4. The record batches sent by the producers increase from 5.5 KB to 16 KB The coordinator output shows that the consumers are still coping, as expected. With 500b records, the number of records returned per fetch will be 32 which is close enough to the max.poll.records of 30 to not impact consumption. Now let’s double batch.size to 32786. From a separate terminal window to the coordinator output, we’ll run the following: We see the batch size increase again in the dashboard. Fig 5. The record batches sent by the producers increase from 5.5 KB to 16 KB to 32 KB The coordinator output shows that the consumers are no longer keeping up! Only managing 37K msg/s with a fast growing backlog. The problem is that each partition has an inflight budget of 2000 records and each record batch contains 64 records. That allows up to 31 effective consumers per partition (2000 / 64), leaving 21 consumers starved at any point in time. This explains the 37K msgs/s: We can fix this problem in three ways: Set in the producer. Increase to create a larger inflight budget We already know the default 16KB batch size is ok. Let’s first increase the inflight budget. We’ll double the budget and see what happens. First we’ll stop the producers and remove the processing time on the consumers to drain the backlog. Next we need to update the broker config and restart the brokers. In we add: Then we’ll redeploy Kafka (again from a separate terminal window). Now we’ll start the producers again and apply the 5 ms processing time to the consumers. We’re in business! The consumers are now coping with the larger batch sizes with this increased inflight budget. This time we’ll try . First let’s walk back that inflight budget change by  1) stopping the producers, 2) commenting out the added line to our broker config, 3) redeploying Kafka. While the producers are still stopped, I’ll change the consumers to use : Then start the producers again: In the coordinator, we see that the consumers are now coping with the 60K msg/s. The reason that allows the consumers to keep up, despite the larger record batches, is that each consumer is only allocated a max of 30 records per fetch, even though each batch contains 64 records. However, each batch is now being delivered three times as 30 doesn’t align well with 64. We can see this in the Kafka client metrics. Fig 6. On the left, with the larger inflight budget and batch_optimized. The middle was when we stopped the producers to restart Kafka with the original inflight budget. The right is with record_limit and each batch being sent three times. We could make this more efficient if we increase to 32 to align with the 64 record batches. If I simply change the to 32, we don’t see much of an improvement as most offset ranges of 32 records will touch two batches. But if we stop the producers, ensure there is no backlog at all then set , the fetches will be perfectly aligned. Fig 6. On the left, with unaligned fetches with max.poll.records=32 (each batch delivered 3 times). Right: aligned fetches with max.poll.records=32 (each batch delivered 2 times). Let’s not over-index on this one case. The purpose of this post was to explain the underlying mechanics and back that up with some empirical benchmarks, sticking with the same workload example as the last post. What we’ve learned: Consumer parallelism is impacted by more than just consumer count and . It is also impacted by: Record batch sizes (determined by the producers) The inflight budget ( ) The share consumer config Record acquisition is along batch boundaries with , and record ranges with . Record batches are the unit of delivery, so can cause consumer network bandwidth to increase because fetches likely will not align on batch boundaries causing batches to be delivered at least twice (more if is much smaller than the average number of records per batch). In the next post we’re going to look a bit closer at . ps: you can run this whole scenario with two terminal windows: Window 1 - kick off the benchmark (using the workload yaml described in the post) Window 2 - wait a few minutes then run the following bash script: Happy testing! If the application sets the property to batch_optimized or does not set it at all, the share consumer fetches records based on batch boundaries which may mean that the number of records returned may exceed the max.poll.records configuration property. The share consumer may also prefetch records and buffer them temporarily awaiting the application's next call to poll(Duration). If the application sets the property to record_limit, the share consumer fetches no more than records at a time and does not prefetch. This is slower but gives the application tighter control on how many records are fetched and when the acquisition locks begin. With , the config is a soft cap. The consumer acquires any batches (in their entirety) that are covered by the offset range determined by . These acquired batches are returned to the consumer, and the consumer returns the records of those batches to the calling application (that invoked ). With , the config is a strict cap. The consumer only acquires the records that are covered by the offset range determined by (though less if less records are available). However, the unit of data delivery is the record batch, so the consumer receives whole batches but only returns a specific offset range to the calling application. c1 acquires record 0 c2 acquires record 1 c3 acquires record 2 With , the effective consumer parallelism can be impacted by the average number of records per record batch. With , the network throughput will increase in most scenarios as offset ranges are unlikely to align with batch boundaries. If the is larger than the average number of records per batch, then each batch may only be delivered twice. The network throughput can a lot if the is much smaller than the average number of records per record batch. Don’t worry if that isn’t clear yet, we’ll gather some empirical results next which should make it clearer. Set in the producer. Increase to create a larger inflight budget Consumer parallelism is impacted by more than just consumer count and . It is also impacted by: Record batch sizes (determined by the producers) The inflight budget ( ) The share consumer config Record acquisition is along batch boundaries with , and record ranges with . Record batches are the unit of delivery, so can cause consumer network bandwidth to increase because fetches likely will not align on batch boundaries causing batches to be delivered at least twice (more if is much smaller than the average number of records per batch).

0 views
Jack Vanlightly 1 weeks ago

Kafka Share Groups and Parallelizing Consumption — Part 1: Tuning max.poll.records

All tests were executed against Kafka 4.2.0 using Dimster (and also validated against 4.3.0).  In the last post we measured the overhead that the mechanics of share groups adds, and saw that it is pretty small. Likewise we saw that raw throughput was also comparable to consumer groups and even saw it exceed consumer group throughput on one test. In this post we’re going to simulate processing time in the consumers to make these benchmarks more realistic and show the utility of share groups (namely the ability to parallelize processing beyond the partition count). We’ll see how the following two configurations play an important role in parallelizing consumption with share groups: max.poll.records (consumer config) group.share.partition.max.record.locks (broker-side config) If we know the average processing time and the number of consumers, we can calculate the theoretical max throughput of a topic: For example: If we have 100 consumers, with an average processing time of 5 ms, then the maximum throughput will be 20,000 message/s. If we have a topic which peaks at 60K message/s and our average processing time is 5 ms. We’ll need 300 consumers to handle that. If we use consumer groups we’ll also need 300 partitions.  > Of course we could do some fancy concurrent work in the consumer to parallelize the consumer work but that comes with some downsides, principally that consumer groups track a position in the log which doesn’t map well to concurrently processing multiple positions in the log simultaneously, should the consumer encounter problems, abruptly terminate or get reassigned partitions. The ParallelConsumer does some clever tricks to handle this. With a share group we don’t need 300 partitions, we could have a handful of partitions with a group of 300 consumers and it should handle the load.  Let’s test it out with share group config defaults . Like last time, I’m going to ensure that load is even across the partitions so that load skew doesn’t pollute the results (I’ll be looking at load skew in a future post). See the last post for how I did that. We’ll use Dimster’s live interaction feature to model the workload on the fly, to see the impact of changing configurations and consumer counts. Fig 1. Dimster supports mutating the running workload. The max theoretical throughput of 300 consumers with 5 ms processing time is 60K msg/s so we’ll start with see what happens. Remember, Dimster uses named environments where commands take the format: , my environment is called localBeefy (detailed in the last post). Note: I prepend workload files with to so they don’t appear as untracked files in Git. The coordinator output shows: Straight away we see that consumption is really low at 4800 msg/s, nowhere near 60K msg/s. Let’s stop the producers and set processing time to 0 ms to allow the consumers to catch up so we can try again. From a separate terminal window: Viewing the coordinator output in the other terminal window, we wait until the backlog is drained, then slowly ramp up the producer rate to 50K msg/s. Then from the live interaction terminal: The coordinator output shows the consumers are now managing 50K msg/s. Much better than 4800 msg/s: However, it soon degrades, dropping to 20K msg/s and eventually back down to 4800 msg/s. Fig 2. The ramp up to 50k msg/s, followed by a reduction back down to 4800 msg/s What on earth is going on? The metrics tell a story. The max poll size (the number of records the consumer is returned after calling poll) is 500. Most calls to poll return few records, with p50 at 8. But the max tells us some return 500. Fig 3. The number of records returned by consumer.poll() The Kafka client metrics show that the average records per fetch response batch climbs toward 450. Fig 4. The records per fetch start low and grow larger and larger What we’re seeing is that most consumers aren’t getting very many records, but a tiny number are getting a lot. When the throughput was low and growing, the records per fetch were low (up to 10). But then the average records per batch started creeping up (while the consumption kept dropping) until the average fetch size was around 450 records. The average is high despite only few being full because most fetch requests sit idle until they can be serviced (by default up to 500 ms). Fig 5. The average fetch latency creeps up and almost reaches the default fetch.max.wait.ms of 500. It’s clear that the default of 500 is at play here. There is an interesting phenomenon here: at low producer rates, the broker does not have enough available records to fill each consumer’s , so each poll tends to return a small batch. Since many consumers are polling, processing, and acknowledging at roughly the same cadence, the available records get spread across the group. The result is an accidental fair-sharing regime : lots of consumers are active, each processing small batches, and aggregate throughput can approach the theoretical maximum. But this regime is fragile. It is not guaranteed by the broker-side allocation policy. Once enough records are available to fill large polls, the greedy allocation behavior takes over. A small number of consumers can acquire large batches, occupying the partition’s inflight record budget while the rest of the consumers sit idle. We’ll call this the greedy-capture regime , as a few consumers greedily capture the inflight window. This regime works as follows. The broker config determines how many records can be locked/inflight per partition, and defaults to 2000. With the default of 500, a single consumer can acquire 25% of that budget. At 5 ms per record, that one batch takes 2.5 seconds to process. While those records are locked, other fetches may sit idle even as new records arrive. This creates a feedback loop: larger batches consume more of the inflight budget, queued fetches wait longer, lag builds, and future fetches are more likely to be filled with large batches. Eventually the group collapses into the greedy-capture regime. We see this in the behavior above . With an inflight budget of 2000 messages and large fetches of 500 records, we only have 4 effective consumers per partition at a time. Across 6 partitions we only have 24 effective consumers each able to process 200 messages a second, resulting in an aggregate 4800 msg/s (exactly the number we’ve seen). To test whether the fair-sharing state was actually stable, I tried ramping only to 30K msg/s and holding it there. I left it for ten minutes and it remained stable. Then I restarted the consumers. Sure enough, throughput dropped back down to 4800 msg/s again. Fig 6. Fair-sharing regime collapses after a consumer restart Why go on about this accidental fair-sharing? Because the system can appear healthy under a slowly changing throughput and a moderate load because it has entered accidental fair-sharing, despite a bad choice of max.poll.records . I imagine this could trip some people up. Consumption may look fine for a long time, but suddenly degrade causing some head scratching and stress! The solution here is simple: reduce max.poll.records .  In theory we should carve up the inflight window between all consumers. So let’s take the configured and divide it by the number of consumers per partition. In our case, we are using the default of 2000. With 50 consumers per partition, we should set to . First, let’s drain the backlog. Once the backlog is drained, let’s set . This causes all 300 consumers to restart with the new config. Now we’ll attempt 60K msg/s with 5 ms processing time abruptly, no ramp up. The coordinator output shows: We’re close, about 55K msg/s consumption, however this soon drops to 45K and remains stable there. It seems that 40 was still too high as it did not account for all the overhead of the fetch/response time, the timing of commits, etc. After dropping to 30 and finally we hit 60K msg/s consumption! But the coordinator output shows that end-to-end latency is growing, little by little, it still isn’t quite keeping up. Let’s add 2 more consumers per partition (300 -> 312 consumers). The coordinator output now shows that end-to-end latency has dropped and stabilized. At this point, the benchmark has a few minutes left. We can discard all the cumulative latency histograms so we record the last minutes with this stable configuration. The final e2e latency distribution for 10 or so minutes with 312 consumers and is: Fig 7. End-to-end latency distribution of 60K msg/s, 5 ms processing time and 312 consumers Rules of thumb:  Set by taking and dividing it by the number of consumers per partition . Then set it somewhat lower to leave room for timing variance, uneven fetch timing, partition skew, and transient backlog. If you have very long processing time (over 1 second) you can even drop max.poll.records to 1 as the cost of a fetch is dwarfed by the processing time. You can also try increasing the group.share.partition.max.record.locks ( max of 10000) which will allow for a larger inflight budget and be more forgiving of a suboptimal max.poll.records. Now armed with a good rule-of-thumb, we’ll run two scenarios with Dimster’s explore limits mode , a benchmark mode for finding the highest sustainable throughput (see the last post for how it works): Fig 8. All test points achieved 57,000 msg/s while staying under p75, 100 ms end-to-end latency. All 5 workloads achieved 57K msg/s, just short of the max theoretical throughput (likely due to the latency constraint of explore mode). Adding some more consumers would be enough to reach 60K msg/s. Next, with 1 ms processing time. Fig 9. Share groups with 12+ partitions reached 95% of the theoretical max consumption throughput. Share groups with 12, 30 and 60 partitions did best, reaching 95% of the max theoretical throughput. The reason 6 partitions fared a little worse is likely due to contention over the inflight window (6 * 2000 records). The higher partition tests had a larger window across the same number of consumers. I expect the consumer groups could have gotten higher throughput, just not within the latency target of the test (100 ms, p75, based on the worst partition). First of all, I hope you see how useful live interaction using Dimster is! You can mutate a live workload to explore the impact of changing client configurations, producer rate, the number of producers and consumers, all on the fly. Once you have a topology you want to record stats for, clear the stats, set a new running time and get all the usual Dimster results. You can download the results from this blog post: tarball for the interactive session tarball of the explore limits run Regarding Kafka, the important lesson is that share groups change the parallelism bottleneck. With consumer groups, it’s the partition count. With share groups, it’s easy to think it simply comes down to the number of consumers, but it’s a little more complicated than that. Parallelism is determined by the inflight budget and the size of fetch requests .  Setting carefully might seem obvious, but I think it could trip people up for a few reasons: Defaults can come into play easily, especially for people without a lot of Kafka experience. The greedy behavior is not necessarily obvious (especially in terms of message queues in general). Synthetic benchmarks with 0 processing time will miss this (4 consumers per partition can handle whatever you throw at the partition). Only once you add processing time to a benchmark does the relationship between the inflight budget and fetch size become apparent. This greedy algorithm makes a very important configuration for share groups and the default of 500 is arguably the wrong value for share groups. It would be nice for a future version of Kafka to offer an alternative which enforces fair-sharing broker-side. I’ve posted this sentiment to the dev mailing list. Next : isn’t the only config that determines the size of consumer fetches ! In the next post we’ll look at the role of the following (in terms of how they can affect consumer fetch sizes and therefore the parallelism of share groups): Producer batch sizes. Share group consumer config :  (default, used in this post) max.poll.records (consumer config) group.share.partition.max.record.locks (broker-side config) If we have 100 consumers, with an average processing time of 5 ms, then the maximum throughput will be 20,000 message/s. If we have a topic which peaks at 60K message/s and our average processing time is 5 ms. We’ll need 300 consumers to handle that. If we use consumer groups we’ll also need 300 partitions.  Set by taking and dividing it by the number of consumers per partition . Then set it somewhat lower to leave room for timing variance, uneven fetch timing, partition skew, and transient backlog. If you have very long processing time (over 1 second) you can even drop max.poll.records to 1 as the cost of a fetch is dwarfed by the processing time. You can also try increasing the group.share.partition.max.record.locks ( max of 10000) which will allow for a larger inflight budget and be more forgiving of a suboptimal max.poll.records. tarball for the interactive session tarball of the explore limits run Defaults can come into play easily, especially for people without a lot of Kafka experience. The greedy behavior is not necessarily obvious (especially in terms of message queues in general). Synthetic benchmarks with 0 processing time will miss this (4 consumers per partition can handle whatever you throw at the partition). Only once you add processing time to a benchmark does the relationship between the inflight budget and fetch size become apparent. Producer batch sizes. Share group consumer config :  (default, used in this post)

0 views
Corrode 2 weeks ago

Migrating from Go to Rust

Out of all the migrations I help teams with, Go to Rust is a bit of an outlier. It’s not a question of “is Rust faster?” or “does Rust have types?”, Go already gets you most of the way there. The discussion is mostly about correctness guarantees , runtime tradeoffs , and developer ergonomics . A quick disclaimer before we start: this guide is heavily backend-focused . Backend services are where Go is strongest, small static binaries, a standard library focused on networking, and an ecosystem of libraries for HTTP servers, gRPC, databases, etc. That’s also where most teams considering Rust are coming from (at least the ones who reach out to me), so I think that’s the comparison that’s actually useful in practice. If you’re writing CLI tools, embedded firmware, or game engines, some of this still applies, but to be honest, I’m afraid this is not the best resource for you. For context, I’ve written about Go and Rust before: “Go vs Rust? Choose Go.” back in 2017, and later the “Rust vs Go: A Hands-On Comparison” with the Shuttle team, which walks through a small backend service in both languages. What you will learn in this article I’ll be upfront: I’m not a fan of Go. I think it’s a badly designed language, even if a very successful one. It confuses easiness with simplicity , and several of its core design tradeoffs ( everywhere, error handling as a discipline rule rather than a type, the long absence of generics) point in a direction I disagree with. That said, success matters! Go has captured a real and persistent share of working developers, hovering around 17–19% in the JetBrains Developer Ecosystem Survey. Rust is growing steadily but is still a smaller slice: Go is clearly working for a lot of people, and a guide that pretends otherwise isn’t helpful. So I’ll do my very best to be objective in this guide rather than relitigate old arguments. But you should know my priors so you can calibrate. The other prior worth disclosing: I run a Rust consultancy; of course I’m biased! More people using Rust is good for my business. But I’ve also worked in both languages professionally and shipped Go services to production. This guide is for Go developers who want an honest, side-by-side look at what changes when you move to Rust. For a deliberately opposite take, I recommend reading “Just Fucking Use Go” by Blain Smith. Holding both views in your head at once is more useful than either one alone. If you prefer to watch rather than read, here’s a video from the Shuttle article above, read and commented by the Primeagen: Go developers already have one of the cleanest toolchains in the industry. Back in the day, it started off a trend of “batteries included” toolchains that give you a single, consistent interface for building, testing, formatting, linting, and managing dependencies. I’m glad that Rust followed suit, because it’s a great model. It’s one of my favorite parts about both ecosystems. has even more built-in: The big difference is that in Go you typically reach for third-party tools ( , , , ) to fill gaps. In Rust, the first-party ecosystem covers more out of the box. Things that do require external crates (e.g. , ) install with one command and feel native, e.g. gives you right away. Both communities have converged on the same insight about formatters: a single canonical style, even an imperfect one, is worth more than the bikeshedding it eliminates. Gofmt’s style is no one’s favorite, yet gofmt is everyone’s favorite. — Rob Pike, Go Proverbs The same is true of : not everyone likes every detail, but the absence of style debates in code review is worth far more than the occasional formatting preference you’d have made differently. The headline is that Go and Rust are both compiled, statically typed, single-binary-deploy languages with strong concurrency stories. The differences are about what guarantees you get from the compiler and how much control you have over runtime behaviour . Go developers don’t usually come to Rust because Go is “too slow.” For most backend workloads, Go is plenty fast. People are generally a bit frustrated with Go’s verbose error handling, the danger of segmentation faults from pointers, and the lack of generics (for a long time) or any sophisticated type system features, such as enums or traits. Interfaces are not a worthy replacement for traits, and the Go standard library has some weird gaps, such as the lack of a type. I call it my billion-dollar mistake. It was the invention of the null reference in 1965 … This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years. — Tony Hoare, inventor of , QCon London 2009 This is the one I hear most often. You ship a Go service, it runs fine for months, and then one Tuesday at 3 a.m. a code path runs where someone forgot to check whether a pointer was , and the goroutine panics. Go’s compiler does not force you to consider the absence case. Rust’s does: You literally cannot dereference an without acknowledging the case. Whole categories of pager-duty incidents disappear. is a great tool, but it’s a runtime detector, it only finds races that actually execute during your tests. Mutating a map from two goroutines without a lock compiles fine in Go and only blows up in production under load. In Rust, sharing mutable state across threads requires types that implement and . Try to share a plain between threads and the program does not compile . You’re forced to wrap it in an , an , or use a channel. That race condition becomes a type error. 1 is fine for a while. After a few years, you notice three things: It’s worth being honest about the counter-argument here, since it came up in the Lobste.rs thread on my Shuttle article: experienced Go developers point out that and catch most of the “forgot to handle the error” cases in practice, and that explicit is easier to read than dense chains. Both points are fair, and the explicit style is a deliberate cultural value, not an accident: I think that error handling should be explicit, this should be a core value of the language. — Peter Bourgon, GoTime #91 , quoted in Dave Cheney’s Zen of Go My take is that lints are an opt-in safety net you have to remember to set up, while Rust’s is the type signature itself, there’s no way to forget. The boilerplate-vs-readability tradeoff is more genuinely subjective. The operator handles propagation; handles wrapping; and a on is exhaustively checked . Add a new variant tomorrow and the compiler shows you every place that needs updating. Go got generics in 1.18, and they’re useful, but the implementation has constraints (no methods with type parameters, GC shape stenciling, occasional surprising performance characteristics). Rust generics monomorphize, each instantiation produces specialized code with zero runtime cost. Combined with traits, this gives you real zero-cost abstractions. This matters less in handler code and more in shared infrastructure (middleware, generic repositories, decoders, parsers), where Go often pushes you back to / plus type assertions. Go’s GC is excellent, concurrent, low-pause, well-tuned for typical service workloads. But “low-pause” is not “no-pause.” Under heavy allocation, P99 latency tails are noticeably worse than a Rust equivalent that simply doesn’t allocate on the hot path. I won’t oversell this, for the vast majority of services, Go’s GC is a non-issue. But for latency-sensitive systems (trading, real-time bidding, network proxies, high-throughput ingestion), the lack of GC pauses is a genuine selling point. Go is death by a thousand paper cuts. It is a very pragmatic language and if you are willing to glance over the above issues, you can be very productive in it. But at a certain codebase size, the problems start to compound. There is no single moment when Go loses its appeal, but teams find themselves wishing for more (more safety, more control, more expressiveness) and that’s when they start looking around for alternatives. The fastest way to feel comfortable in Rust is to map patterns you already know. For a longer, fully-worked example of building the same backend service in both languages, see the Shuttle comparison , the section below focuses on the patterns that come up most often. The operator does the dance for you, including type conversion if is implemented (idiomatic with ’s ). There is no in safe Rust. References can’t be null. Pointers can be, but you almost never use raw pointers in application code. Go’s interfaces are structural, a type satisfies an interface implicitly: Rust’s traits are nominal, you implement them explicitly: The Go style is great for ad-hoc duck typing. The Rust style is great for refactoring and discoverability, you can grep for every implementer of a trait. The closest equivalent of / in Rust is , but you almost never want it. The Go community knows the cost of reaching for too: interface{} says nothing. — Rob Pike, Go Proverbs Generic functions with trait bounds ( ) cover the vast majority of cases and give you monomorphization with no runtime dispatch. Where Go pre-1.18 would have forced you back to plus a type assertion, Rust’s traits + generics let you stay specific. When you do want runtime dispatch (e.g. heterogeneous storage of different implementers), reach for or . That’s the direct Rust analog of holding an value in Go. Go’s concurrency model is famously simple: Goroutines are cheap, the runtime schedules them across OS threads, and channels ( ) are the primary coordination primitive. The Go proverb captures the philosophy: Don’t communicate by sharing memory; share memory by communicating. — Rob Pike, Go Proverbs This is the area where Go genuinely shines, several commenters in the Lobste.rs discussion made the point that goroutines “just disappear” into normal-looking blocking code, and that’s worth giving Go credit for. Rust async is more powerful, but it’s also more visible in your code. Rust uses / on top of an executor (almost always for backend services): The shape is similar. The differences: For most backend code, the day-to-day feel is similar: spawn a task, communicate via channels, use timeouts liberally. In Go, you plumb a through every blocking call: Rust has no built-in . The closest equivalent for cancellation is : For timeouts, wraps any future. For deadlines/values, you typically pass them as explicit arguments or via spans rather than a single context object. Some Go developers miss the implicit-feel of . In practice, the explicit Rust style is easier to reason about, you always know exactly what’s cancellable and what isn’t. The deeper point is that neither language gives you cancellation for free, the discipline just shows up at different layers: Go doesn’t have a way to tell a goroutine to exit. There is no stop or kill function, for good reason. If we cannot command a goroutine to stop, we must instead ask it, politely. — Dave Cheney, The Zen of Go In Go that “asking politely” is a plumbed through every call site by convention. In Rust it’s a (or a channel) plumbed through every call site, but the compiler can actually tell you when you forgot. Both languages have channels. The translation is direct: Rust’s channels distinguish sender and receiver as separate types, which makes ownership and -ness explicit at the type level. Rust’s is the equivalent of a Go value receiver; is a pointer receiver with mutation. Owned (consuming the value) has no Go analog and is occasionally very useful (typestate, builders). Go’s is a UTF-8 byte slice with copy-on-assign semantics (the header is copied, the underlying bytes are shared and immutable). Rust splits this into two types: As a rule of thumb, take in arguments, return when you produce new data. This is mostly painless once you internalize it. The vs split is a microcosm of Rust’s broader “borrow vs own” model. Go got generics in 1.18 (March 2022), thirteen years after the language shipped. They are useful, but they feel tacked on, and in practice they have most of the downsides of a generic type system without delivering the upsides you’d expect coming from Rust, Haskell, or even modern C++. This is a strong claim, so let me back it up. The most telling signal is that three years after generics landed, Go’s own standard library still mostly avoids them. still takes a closure instead of a constraint. is still typed as / . The generic helpers that do exist live in a small handful of packages: , , , and a few entries under . Compare that to Rust, where generics permeate the standard library from day one: , , , , , / , , , every collection, every smart pointer. You cannot write idiomatic Rust without using generics, because the standard library is generic. In Go, generics are an opt-in feature for library authors who really need them. In Rust, they’re the substrate everything else is built on. Rust’s generics are tied to traits, which double as the language’s mechanism for ad-hoc polymorphism, supertraits, associated types, blanket impls, and coherence. Go’s constraints are just interfaces with an extra operator for type-set membership. There are no: The practical consequence is that the moment your abstraction needs more than “a function that works for any with these few operations,” Go pushes you back to plus type assertions, code generation, or runtime reflection. Rust uses a Hindley-Milner-style inference engine that propagates type information through entire expressions, including across closures, iterator chains, and operators. You routinely write: and the compiler figures out is from the range, and is from the target. Go’s inference is much shallower. It can usually infer type parameters from function arguments, but it cannot infer from return-position context , cannot chain inference through generic builders the way Rust does, and frequently forces explicit type arguments at call sites: In Rust this is the exception; in Go it’s still common. Rust monomorphizes: every and produces specialized machine code with zero runtime dispatch. Go uses GCShape stenciling with dictionaries , where types that share a “GC shape” share the same compiled function and dispatch through a dictionary at runtime. The result is a compile-time/runtime tradeoff that often surprises people: generic Go code can be measurably slower than the equivalent hand-written non-generic version, because every method call on a type parameter goes through an indirection. There’s a well-known PlanetScale post showing exactly this. In Rust, generic code is the fast path. Reaching for (the equivalent of Go’s interface dispatch) is a deliberate choice you make when you want runtime polymorphism. This is the part that bothers me most. A good generics system removes reasons to fall back to escape hatches. In Rust, generics + traits eliminate most of what you’d otherwise need or runtime reflection for. The type system gets stronger. In Go, generics did not remove , did not remove , did not remove code generation as the dominant pattern for things like ORMs, decoders, and mocks. still uses reflection. still uses . still generates code. The places where a real generics system would shine are the same places Go reaches for runtime mechanisms it had before 1.18. Generics in Go feel additive, a new tool in the box that’s useful in narrow cases. Generics in Rust feel foundational; remove them and the language collapses. That’s the difference, and it’s why generic Go code, in my experience, doesn’t read better than the -based code it replaced; it just reads differently, with more punctuation. If you’re already opinionated in Go, the Rust ecosystem has converged to a similar level of “default picks.” For a typical backend service: + + + + + covers 90% of what you need. I want to be straightforward here. Coming from Go, you will hit a wall . The wall has a name. Go’s runtime handles memory and aliasing for you. Rust pushes that decision into the type system. The first few weeks you’ll write code that “should obviously work” and the compiler will refuse it. The patterns that bite Go developers most often: With all of these rules, the borrow checker truly sounds like a “gatekeeper” of sorts, which keeps getting in the way and is just overall frustrating to deal with. That is not the mental mindset you should have when learning Rust. The borrow checker truly uncovers real and very existing bugs in your code, and if you don’t address them, your program will deal with safety issues. So whenever you get a compiler error from , take a step back and think how your code could break. A few questions you can ask yourself: That is the mindset you need to understand the borrow checker. Humans are genuinely bad at reasoning about memory. We forget that pointers can be null, that old references can outlive the data they point to, and that multiple threads can touch the same data at the same time. We tend to have a “linear” mental model of how data flows through a program, but in reality it’s closer to a complex graph with many paths and interactions. Every condition forces you to consider what happens in both branches. Every loop forces you to consider what happens on every iteration. That is exactly the kind of reasoning the borrow checker is designed to do for you! It enforces best practices at compile time, and it can feel annoying when your own mental model disagrees with the borrow checker’s (which is the more accurate one 99% of the time). There are cases where the borrow checker is genuinely too strict, but they are rare, and as a beginner you’ll almost never run into them. I got memory management wrong plenty of times in my early days, but I approached it with a learner’s mindset , which helped me ask “what’s wrong with my code?” instead of “what’s wrong with the compiler?”, a reaction I see a lot in trainings. The good news is that once you internalize borrowing, it stops fighting you. Most experienced Rust developers will tell you the borrow checker became an ally somewhere between weeks 4 and 12. The first month is the hardest. Be honest with your team, Rust compile times are a real downgrade from Go’s. A clean release build of a medium service can take minutes in comparison to Go’s near-instantaneous compiles. Incremental builds and are reasonable and compile times have gotten much better over the years, but you’ll feel the difference. To mitigate, use in your edit loop, split into a workspace once it pays off, and keep proc-macro-heavy crates in their own crate so they only recompile when they change. See tips for faster Rust compile times for a deeper dive. Go’s “one type of function, sync everywhere, the runtime handles concurrency” is genuinely simpler than Rust’s split between and . You’ll need to think about which of your functions are async, where you , and how that interacts with traits. Async traits (stable since Rust 1.75) help a lot, but there are still rough edges (especially around with async methods). Rust’s crate ecosystem is growing and libraries are high-quality across the board, but Go has a head start in some backend-adjacent domains: Kubernetes operators, cloud-provider SDKs, database drivers for certain niche stores. Before you commit, spend a day checking that the libraries you depend on have Rust equivalents you’re willing to use. Teams I help often have to hand-roll at least one or two core libraries themselves. For example, they might have to update an abandoned crate for XML schema validation, or write their own client for a lesser-known protocol. You don’t have to rewrite everything in one go. The strategies that work best, in order of how I usually recommend them: If one specific service in your fleet is the perpetual problem child (high CPU, latency-sensitive, or constantly hit with reliability issues), rewrite just that one in Rust, behind the same API contract. This is the lowest-risk migration. Other Go services keep talking to it via HTTP/gRPC, oblivious to the underlying language. Background workers, queue consumers, ingestion pipelines, and CPU-bound batch jobs are excellent first targets. They typically have a clear input/output boundary (a queue, a topic) and no shared in-process state with the rest of the system. You can call Rust from Go via cgo, and there are good guides on how to do it . (Reach out if you’d be interested in a guide on this from me.) In practice, I rarely recommend it for backend services. The build complexity and FFI overhead usually outweigh the benefits compared to “just stand up a Rust service and put it behind a network call.” For libraries and CLI tools, it’s more viable. If you have an API gateway or reverse proxy, you can route specific endpoints to a new Rust service while the rest stays in Go. This works particularly well when one bounded context (auth, search, billing) is the right unit to migrate. The pattern is often called “strangler fig,” because the new service grows around the old one until it eventually replaces it entirely. Start with a service that has a clear boundary. Don’t pick the most central, most-deployed service in your fleet. Pick the one where the contract with the rest of the system is well-defined and the blast radius is small. Keep the same API contract. If your Go service exposes a REST API, your Rust service should too: same paths, same JSON shapes, same error envelope. The migration is invisible to clients, and you can swap traffic incrementally with a gateway. Don’t translate idioms verbatim. Resist the urge to write Go-flavoured Rust. becomes . Goroutine-per-request becomes only when you actually need it (axum already concurrently handles requests). Interfaces with one method usually become trait bounds on a generic, not . Use the compiler as a pair programmer. Rust’s compiler errors are usually pretty good. Read them slowly. They almost always tell you the right answer. The team members who struggle longest are the ones who fight the compiler instead of treating it as a collaborator. Invest in training early. I’ve seen teams try to do a Rust migration “on the side,” learning as they go. It rarely ends well. It’s a bit like training for a marathon by signing up for the race and then trying to run it without any prior training. You can do it, but it’s going to be painful and you might not finish. Block off real time for learning: a workshop, an online course , paired sessions on real code. The upfront investment pays back many times over once the team is fluent. (Hey, if you want to talk about training options, I’m happy to chat .) Not everything should be migrated. Go is excellent for: A hybrid strategy is fine and common. Many of the teams I work with end up with a polyglot backend: Go for the “boring” services, Rust for the ones where reliability and performance pay back the extra effort. Numbers vary wildly by workload, so take these as rough guidance. Not promises! But here are some ballpark numbers, based on Go-to-Rust migrations I’ve helped with: Honestly, you’re unlikely to get a 10x throughput improvement going from Go to Rust the way you might from Python. What you get is fewer “silly errors” and flatter latency tails, plus the ability to expand into other domains like embedded development or systems programming while still using the same language. That’s often the most surprising side-effect of a migration: there’s a lot of opportunity for code-sharing across teams that previously had to use different stacks. You can use Rust for everything. Going from Go to Rust is a different kind of migration than coming from Python or TypeScript . Coming from Go, you know the benefits of a statically-typed, compiled language. So you’re not trading away dynamic typing or a slow runtime, you’re trading away in exchange for a more robust codebase with fewer footguns, and a stricter compiler that catches more mistakes at compile time. There is a steeper learning curve, however. For foundational services (services that your organization relies on, that have high uptime requirements, that are critical to your business), that trade is obviously worth it. For others, Go remains the right answer. The point of a migration is to put each problem in the language that solves it best. Ready to Make the Move to Rust? I help backend teams evaluate, plan, and execute Go-to-Rust migrations. Whether you need an architecture review, training, or hands-on help porting a critical service, let’s talk about your needs . Rust’s type system doesn’t catch all data races, but types that truly can’t be shared between threads without synchronization won’t compile. You can still have logic bugs in your synchronization, but you won’t have the kind of “oh no, I forgot to lock this” that often leads to silent data corruption. ↩ Where Go and Rust overlap, and where they diverge. How Go patterns map to Rust. What you gain from the borrow checker. Where I tell people to keep Go and where Rust is worth the migration cost. How to migrate Go services incrementally. The boilerplate dilutes the actual logic of your function. Wrapping with is a discipline rule, not a compiler rule. It’s easy to drop context on the floor. Sentinel errors via / work, but the compiler doesn’t tell you when you forgot to handle a new variant. Rust async functions return s. They don’t run until awaited or spawned. The compiler tracks / across points. If you hold a non- value across an await, you get a compile error explaining exactly why. There’s no built-in goroutine-style preemption. Long CPU-bound work in an async task starves the executor; you offload to or instead. Channels ( , , ) are first-class but live in libraries, not the language. , owned, heap-allocated, growable. Equivalent to you intend to mutate. , a borrowed view into someone else’s string data. Equivalent to a Go parameter most of the time. Supertraits / constraint hierarchies. In Rust you write , and any automatically satisfies and . Go has no equivalent; you stack interface embeddings, but the constraint solver doesn’t reason about hierarchies the way Rust’s trait system does. Associated types. Rust’s has , so is a first-class thing you can name in bounds. Go’s closest equivalent is a second type parameter, which leaks into every signature. Blanket impls. In Rust, automatically gives every type a method. Go has no way to add methods to a type from outside its defining package, generic or not. Methods with their own type parameters. This is an explicit, documented non-feature in Go. You cannot write . In Rust, generic methods on generic types are routine. Long-lived references. In Go, you’d happily hold a from a map for as long as you want. In Rust, that borrow blocks mutation of the map for its whole lifetime. The fix is usually to clone, or to scope the borrow tighter. Self-referential structs. Common in Go (a struct holding both data and an iterator over it). In Rust, this requires , , or a redesign. Almost always: redesign. Sharing mutable state across goroutines. What you’d write as becomes . Slightly more verbose, much more checked. Returning references from functions. Lifetime annotations show up. They’re not as bad as their reputation, but they’re new. If a value got moved from one place to another, what would happen if the original place tried to use it again? If a value is shared across threads, what would happen if one thread modified it while another thread is using it? If a pointer is dereferenced , what would happen if it was null or dangling? When a value goes out of scope , what would happen if it was still being used somewhere else? Kubernetes-native tooling : operators, controllers, CRDs. The ecosystem is overwhelmingly in Go. CLI utilities and dev tooling : fast compiles, easy cross-compilation, simple deployment. Glue services : thin API layers, proxies, format converters. The boilerplate ratio in Rust isn’t worth it here. Anywhere your team velocity matters more than absolute correctness guarantees . CPU usage: 20–60% reduction. Less dramatic than Python-to-Rust, because Go is already efficient. The wins come from no GC and tighter loops. Memory: 30–50% reduction, mostly from the absence of GC overhead and a smaller runtime. P99 latency: significantly more consistent. Rust services tend to flatline where Go services have visible GC-induced jitter. (This has gotten much better on the Go-side ever since they introduced their low-latency GC, but the difference is still there under heavy load.) Production incidents: this is the one teams report most enthusiastically. The classes of bugs that survive and reach production (data races, nil dereferences, missed error paths) just don’t compile in Rust. Oncall rotations are typically very boring after a Rust migration. Rust’s type system doesn’t catch all data races, but types that truly can’t be shared between threads without synchronization won’t compile. You can still have logic bugs in your synchronization, but you won’t have the kind of “oh no, I forgot to lock this” that often leads to silent data corruption. ↩

0 views
Phil Eaton 2 weeks ago

Serving files over HTTP three ways: synchronous, epoll, and io_uring

This is an external post of mine. Click here if you are not redirected.

0 views
Karboosx 2 weeks ago

How to actually handle database transactions (and why your ORM fails at it)

Wrapping database calls in a simple try-catch block is just asking for trouble. Here is a practical look at how to handle transactions properly, avoid the dreaded ORM lost update, and keep your data actually safe.

0 views

Flat/Non-higher-order construction of ANF via mutable holes

This post will describe an algorithm for constructing an ANF/similar IR from a functional-like base language with no closure usage, and with no related excess space requirements. This algorithm is as far as I know not new, and has apparently been known in imperative compilation for a while. However, I admittedly have no source for this, nor have I seen it described for functional-like languages. Thanks to Ryan Brewer and rpjohnst for their inspiration on this. If you are only interested in code, please skip to Implementation. First, let us describe our input and output languages. Our input language will look like this: This is meant to echo some simple functional programming language, and contains all the elements needed to demonstrate the algorithm. Our output language will look like a lexically-scoped flat ANF-style construction, with all lets at the toplevel. The goal is to convert from our input to this language. The notable addition is the constructor in , the purpose of which will be described now. Imagine we are trying to convert the term When we arrive at the body of , we have a conundrum. We need to put the conversion of before , so that using is well-scoped. Hence, we must convert before we can continue. How do we use this conversion, though? 's content needs to be put in the field of 's , so it's not clear how to both convert before and to use 's finished conversion in . One "traditional" method of solving this is via continuations, where is passed a closure representing "what to do next", after it converts its body. We can then put 's content in that closure, which will be invoked later. This has two main flaws: This method resolves these. Recall that we left a field in our type. I will denote a hole like this: . The central idea is that our conversion of any given fragment will return: Consider this simplified example: When we convert , this signals for the conversion of . There are two steps to converting a - we must first convert the "head" ( ), and then the "body" ( ). Converting the will result in something of form Note that the hole is left to be filled later. This way, the conversion of 's head need know nothing about its environment. We represent this in code via the constructor, which contains a reference to a fragment that either may be empty ( ) or filled ( ). The conversion of the body can then proceed by first generating the expected addition, binding it to so it may be referenced later: We now need to get this fragment into the proper lexical scope. We can do this by plugging it into the hole left by the conversion of the head: Then, returning this fragment, the name bound ( ), and the reference to the inner hole, we can continue converting ; First, convert the head, using the name returned by the conversion of : Then, the body: And finally plug it into the hole returned by . We are left with a final hole we can plug with to be explicit about the end of a control flow path. One disadvantage of this strategy is that it does generate many unneeded names. Most of these can be avoided by carefully special casing certain , and the rest can be eliminated via a simple folding pass. Below is a sample implementation of this algorithm, in OCaml. I recommend toying with the conversion yourself, using or similar. In the following, I will "clean up" the output slightly, as to not make it unreadable. If we use our first example term: We can then plug the returned hole, as described. Via some inspection, we can verify this is indeed correct, and conforms to our ANF-like form, as expected. As mentioned earlier, the large amount of duplicate names can be mitigated through strategies such as special casing and folding passes. Clearly, the algorithm does not use closures, or any indirect jumps. It does not allocate any more than needed for the representation of the output program; there is no excessive closure allocation. I do not currently have a good way to test performance on real-world large inputs, but I have no reason to believe it would not perform better than an equivalent higher-order version. I believe this method should be extendable to CPS as well, although I have not tried as of the moment. The "usual" higher-order CPS conversion described in e.g. "Compiling with continuations, continued" looks quite similar to the higher-order presentation of ANF conversion, so I would be interested in whether it can be similarly massaged into a mutable form. It uses excess space. A conversion of a term will allocate ~2x the memory "needed"; it must allocate for the ANF construction itself, and for the closures needed during the construction. It is slower, due to the aforementioned closure allocation and application. The fragment itself. A name we can use to refer to what was bound in the fragment (In the conversion of addition/similar with nested content, names must be conjured.) The reference created by the hole, so that it may be filled.

0 views
Max Bernstein 3 weeks ago

Partial static single information form

In compilers, static single information form (SSI) is a common extension to static single assignment form (SSA). It was introduced by C. Scott Ananian in 1999 in his MS thesis (PDF) 1 . SSI extends your existing SSA intermediate representation by discovering facts from your existing program and reifying them as path-dependent/flow-sensitive IR nodes. That might sound complicated, but at least the basic idea is pretty natural. I talk a little bit about it in What I talk about when I talk about IRs and I’ll rehash here in more depth, starting with some motivating examples. Consider this admittedly contrived example: We should be able to learn from the comparison that in some branches in the IR, is positive. In that region, we can add a new IR instruction that attaches that knowledge right in the instruction’s type field (yay, sparseness!) and then rewrite uses of to now use . Because we’ve done that, our (imaginary) optimization rule that gets rid of on known-positive integers can kick in, and we can delete the invocation of . Yay, optimization! But a couple of questions remain, at least for me: We’ll go through them, starting with the compiler pipeline. The original SSI paper starts with (I think?) SSA form and places some number of new refinement nodes based on conditionals. I have admittedly not tried very hard, but the into-SSI algorithms look complicated and kind of heavyweight. As a reward, you get “linear” into-SSI time complexity. But I am a humble compiler engineer, and I don’t have the time to go through and load all of this into my head. Instead what I have seen done and have been doing is to take a shortcut: build partial SSI during SSA construction 2 . Most of the time this is from bytecode, but it could also be from some other non-SSA IR. In any case, this is an excellent shortcut for two reasons: This is pretty compelling. We can learn from the bytecode with a very small amount of marginal new complexity. See my implementation in ZJIT , for example. All it really does is modify the abstract interpreter state when building SSA out of , , and bytecode instructions to take into account the new refined values. This is fine for branches that are already in the user’s source program but sometimes optimization, especially of dynamic languages, adds new branches that were not there before. And sometimes these branches get added much later, long after SSA construction. What then? Can we do something similar and rely on existing infrastructure? Implicit in this “can we do it” is the assumption that your IR tracks data dependencies from use to corresponding def, but not from def to uses. Sea of Nodes (at least the Simple implementation), is an IR that tracks both directions all the time for easier rewriting. Many IRs do not do this, so we will continue assuming that there’s no “easy way out”. JIT optimization of dynamic language compilers often adds synthetic instructions to the IR that enforce pre-conditions. These guards allow optimizing happy/fast path cases in JIT code while leaving the interpreter as a fallback. For example, we might be able to optimize two back-to-back instructions (a very dynamic operation in the world of ideas, but fast when concretely implemented using object shapes) from: which is very generic and involves calling into C code that might raise an exception, to something more like: which is much faster (assuming shape stability at run-time). There’s an irritating problem, though, which is that we have a bunch of duplicate instructions littered around the IR now because our optimizer worked on each instruction individually. Kind of a “template optimizer” situation. Now we need some pass to clean up the detritus. Global value numbering (GVN) will do a good job of de-duplicating instructions. It should notice that we already have an instruction that looks like called and rewrite into . That’s great because we have de-duplicated the guard. GVN may not get everything, though; if some instructions later use , they will not get rewritten to instead use the output of these new guard instructions. To do that, we need to add some kind of pass or augment GVN with some canonicalization feature. That canonicalization would handle rewriting operands to use the “latest version” of some value, so to speak. See the canonicalization section of Chris Fallin’s excellent aegraphs blog post for more (and of course the (currently block-local) implementation in ZJIT ). Where I’m going with all of this, though, is that you may already have some dominance-based instruction rewriting mechanism in your compiler, either as part of GVN or separately! And you can use this to do a very low code into-partial-SSI in the middle of your optimizer. This means you could very well get away with inserting instructions in successor blocks of conditionals and get the into-SSI “for free”. That’s up to you. There’s a trade-off between compile-time and run-time, especially in JITs. Inserting more instructions and rewriting more times may slow down your compiler. It’s a cheap lunch, not a free one. I don’t know. I don’t have a good grasp of how this “partial SSI” compares to the “full SSI”. I don’t plan on implementing full SSI in the near future. I will note that this partial SSI approach doesn’t do two things: I can’t tell what impact this has. Like Simple, TruffleRuby is built on a Sea of Nodes IR (Graal). Chris Seaton has an excellent blog post about TruffleRuby’s use of “stamp nodes” (“Pi nodes” 3 ). The function does a lot of heavy lifting, I think because Graal tracks uses. Cinder mostly inserts instructions in the HIR builder, before into-SSA, and then lets the SSA construction take care of things. That’s where I learned this trick, actually. Here is one example of refining the type of the matched operand when building IR for pattern matching. Luau is working on something like this, but for their type checker. Chatting with someone on their team is actually part of the reason I got motivated to write this post. Android ART looks like it has HBoundType and inserts them in reference type propagation . This handles class checks, null checks, and instanceof checks. Last, I want to talk a little bit about some interesting reasoning you can do when you have two implementations of something that you can switch between. For example, JIT (+ interpreter), or aliasing and non-aliasing cases in C code, or the weirdo NULL-UB reasoning LLVM can do to C code, things like that. In ZJIT, we currently insert s opportunistically in “easy” cases when building our HIR from the interpreter bytecode. For example, if in the bytecode there is a branch that compares some value with , it will have two outgoing control-flow edges: one block where is definitely , and one block where is definitely not . In each of these control-flow edges, we can insert corresponding type refinement hints. That’s pretty standard. But we can also do weirder stuff. CRuby has a notion of heap objects vs immediate objects. Many (most?) objects are heap objects. However, integer , for example is not allocated on the heap but instead represented by a tagged bit pattern that pretends to be an address: the whole value is encoded in the pointer itself. We encode this knowledge in the HIR’s type system: “heapness” and “immediateness” each get a bit in the type lattice . We use this in the optimizer to reason about effects , among other things. We can’t know a lot of the time what type a thing is, so we pessimistically type most objects flowing through bytecode as . This type encapsulates the entire world of possible values that could go on the stack or in a local variable. On most heap objects, with only a few exceptions, you can write instance variables (fields, attributes, whatever you want to call them). You can never write an instance variable to an immediate. This means that if we observe the following pattern in the bytecode: Then after building and emitting HIR for the opcode, we can upgrade the type of from a to a . We can do this because if it weren’t a heap-allocated object, we would have left the compiled code and entered the interpreter. This is another SSI-type thing you can do in your compiler. Uhh I guess the conclusion is that you don’t have to do full SSI and partial SSI is available and not too scary? Does your compiler do this? Reader, please write in. …and optimized in 2002 (PDF), revisited in 2009 (PDF), implemented in LLVM in 2010 (PDF), investigated in 2017 for abstract compilation (PDF), and probably more. The 2009 paper by Boissinot, Brisk, Darte, and Rastello even shows that both Ananian and Singer’s papers have bugs, while perhaps unintentionally also making an excellent pun about the literature being “sparse”.  ↩ This blog post is different than the what the LLVM paper (PDF) calls partial SSI. Partial for different reasons. Maybe it’s not even single information anymore.  ↩ Today I learned that this terminology comes from the ABCD paper (PDF).  ↩ Where/when in the compiler pipeline do we insert and remove these type refinements? Do we need to refine after every conditional? Do we need to implement the whole into-SSI and out-of-SSI algorithms from all the complicated-looking papers? It lets me cleanly separate adding the type refinements (pretty straightforward) from the hard part of doing all of the operand rewriting and phi placement and marking and all manner of other nonsense. In addition to separating the concerns, the hard part is already done by SSA construction. We can actually just skip it! SSA construction handles phi placement, operand rewriting, all of it. It probably fits neatly into a naive or a Braun-style (PDF) construction. It doesn’t split variables with a new sigma node, and it generally inserts the refine node within the target block rather than above the branch (For only) It doesn’t insert new phi nodes; it just leaves both IR nodes available and, instead of re-merging, drops them …and optimized in 2002 (PDF), revisited in 2009 (PDF), implemented in LLVM in 2010 (PDF), investigated in 2017 for abstract compilation (PDF), and probably more. The 2009 paper by Boissinot, Brisk, Darte, and Rastello even shows that both Ananian and Singer’s papers have bugs, while perhaps unintentionally also making an excellent pun about the literature being “sparse”.  ↩ This blog post is different than the what the LLVM paper (PDF) calls partial SSI. Partial for different reasons. Maybe it’s not even single information anymore.  ↩ Today I learned that this terminology comes from the ABCD paper (PDF).  ↩

0 views
The Coder Cafe 1 months ago

Cache Use Cases Explained

☕ Welcome to The Coder Cafe! Today, we discuss cache use cases. When we think about caching, it’s pretty frequent to focus on where it happens; for example, client-side, server-side, or in a CDN. Yet, there’s a more important question that should be answered first: What’s the use case? In this post, we will break down two common cache use cases: reducing latency and improving capacity. And we will see why the line between the two is blurrier than it seems. Get cozy, grab a coffee, and let’s begin! A Cache for Latency Latency is the time between when a request is sent and when a response is received. A cache for latency exists to reduce the average latency of a service . The classic access pattern looks like this 1 : We check the cache first. On a cache hit, we return the data directly without touching the backend. On a miss, we go to the backend, return the result, and store it in the cache for future requests. Why does this reduce latency? The cache keeps data in memory, which is significantly faster to read from than a remote database that may involve network round-trips, disk I/O, and query execution. On a hit, all of that work is skipped. In Soft vs. Hard Dependency , we introduced two kinds of dependencies: A soft dependency is a non-critical dependency for the service to operate properly. A hard dependency is a critical dependency for the service to operate properly. A cache for latency is a soft dependency . If the cache becomes unavailable, requests fall through to the backend. The system keeps working, just at a higher latency. Keep this in mind, because it’s the key difference we’ll come back to. A cache for capacity exists to serve higher throughput than the backend can handle on its own. The access pattern is identical to the latency case: cache first, then backend on a miss. So what actually makes these two different? The difference is not in the code; it’s in what the backend can absorb. In a capacity scenario, the backend would be overwhelmed if it received all the traffic directly. The cache absorbs a large portion of the requests, keeping the backend load manageable. This changes the nature of the dependency . If the cache goes down, the backend is suddenly hit with all the traffic it was previously shielded from. Whether the system survives depends on the backend’s own capacity. If the backend can scale fast enough, the cache is still a soft dependency: there will be a rough period, but the system recovers. If the backend can’t cope with the load, the cache becomes a hard dependency . Without it, the system fails . Here’s a question worth asking: if the access pattern for both types is identical, how do we know which one we have? In most cases, caches are introduced to reduce latency. But here’s what can happen over time: Our system is stable. Cache hit rates are high, backend load is low. Traffic grows. The backend load stays low because the cache is absorbing most of it. Nothing breaks. No alerts fire. Six months pass. Nothing has changed, no code, no configuration, no architecture decision. And yet the cache is no longer reducing latency. It’s keeping the backend alive. The cache didn’t change. The code didn’t change. The system grew around the cache, and the cache quietly became load-bearing . The same risk appears when a cache goes cold. For example: A migration to a new cache instance A data format change that requires purging existing entries A cache restart after maintenance Any of these can produce a large wave of cache misses in a short window. If we were running a latency cache, we would see higher latency for a while. If we were running a capacity cache, we would see a traffic spike that the backend can’t absorb. The unsettling part is that the code is identical in both cases. The difference only becomes visible at failure time . The root problem is that teams often don’t know which type of cache they’re running . They built it for latency, and that’s still how they think about it, even as the system outgrows that assumption. A few approaches help here: Periodically ask: could the backend handle the current traffic if the cache were completely removed ? Load testing without the cache, or estimating backend capacity against current traffic levels, gives you a concrete answer. Treat cache hit rate as a meaningful operational signal , not just a performance metric. A sustained drop in hit rate means the backend is absorbing more traffic than usual. If that trend continues, it’s an early warning that you may be drifting toward a capacity problem. When migrating a cache or invalidating a large portion of its data, warm the new cache before routing live traffic to it. This prevents a cold-start burst from hitting the backend all at once. Finally, once we recognize that a cache is operating as a capacity cache , we should treat it accordingly. It’s no longer optional infrastructure and it deserves proper alerting and a clear plan for what happens if it goes down. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. A cache for latency serves data from memory to reduce average response time. It is a soft dependency: if unavailable, the system degrades in latency but continues to work. A cache for capacity absorbs traffic that the backend couldn’t handle on its own. It can be a soft or a hard dependency, depending on whether the backend can absorb the load without it. Both types share the same access pattern, which makes them easy to confuse. A latency cache can silently become a capacity cache as traffic grows, without any code change. When a capacity cache goes cold or fails, the backend can be overwhelmed. Hit rate monitoring, periodic load testing, and cache warming are practical ways to manage this risk. Availability Models Safety and Liveness The PACELC Theorem The Three Types of Cache Cache stampede Even though variations exist. A Cache for Latency Latency is the time between when a request is sent and when a response is received. A cache for latency exists to reduce the average latency of a service . The classic access pattern looks like this 1 : We check the cache first. On a cache hit, we return the data directly without touching the backend. On a miss, we go to the backend, return the result, and store it in the cache for future requests. A soft dependency is a non-critical dependency for the service to operate properly. A hard dependency is a critical dependency for the service to operate properly. Our system is stable. Cache hit rates are high, backend load is low. Traffic grows. The backend load stays low because the cache is absorbing most of it. Nothing breaks. No alerts fire. Six months pass. Nothing has changed, no code, no configuration, no architecture decision. And yet the cache is no longer reducing latency. It’s keeping the backend alive. A migration to a new cache instance A data format change that requires purging existing entries A cache restart after maintenance Periodically ask: could the backend handle the current traffic if the cache were completely removed ? Load testing without the cache, or estimating backend capacity against current traffic levels, gives you a concrete answer. Treat cache hit rate as a meaningful operational signal , not just a performance metric. A sustained drop in hit rate means the backend is absorbing more traffic than usual. If that trend continues, it’s an early warning that you may be drifting toward a capacity problem. When migrating a cache or invalidating a large portion of its data, warm the new cache before routing live traffic to it. This prevents a cold-start burst from hitting the backend all at once. Finally, once we recognize that a cache is operating as a capacity cache , we should treat it accordingly. It’s no longer optional infrastructure and it deserves proper alerting and a clear plan for what happens if it goes down. A cache for latency serves data from memory to reduce average response time. It is a soft dependency: if unavailable, the system degrades in latency but continues to work. A cache for capacity absorbs traffic that the backend couldn’t handle on its own. It can be a soft or a hard dependency, depending on whether the backend can absorb the load without it. Both types share the same access pattern, which makes them easy to confuse. A latency cache can silently become a capacity cache as traffic grows, without any code change. When a capacity cache goes cold or fails, the backend can be overwhelmed. Hit rate monitoring, periodic load testing, and cache warming are practical ways to manage this risk. Availability Models Safety and Liveness The PACELC Theorem The Three Types of Cache Cache stampede

0 views
daniel.haxx.se 1 months ago

Inspired

The picture was taken by mr Nasser and shared on social media In appendix A of the book Root cause: Stories and lessons from two decades of Backend Engineering Bugs , author Hussein Nasser has these wonderful words to say about me: Daniel Stenberg is a Swedish engineer and the creator of curl (cURL), one of the most widely used tools and libraries for fetching content over various protocols. I’ve always admired Daniel’s work, reading his blogs and watching his talks on YouTube. He is one of the engineers who inspired me to start my own YouTube channel and teach backend engineering. It warms my heart to read this. Words like this give me energy and motivation. My work has meaning.

0 views
Phil Eaton 1 months ago

Branimir Lambov from IBM on Cassandra

This is an external post of mine. Click here if you are not redirected.

0 views
Zak Knill 1 months ago

All your agents are going async

Agents used to be a thing you talked to synchronously. Now they’re a thing that runs in the background while you work. When you make that change, the transport breaks. For most of the time LLMs have been around, you use them by opening a chat-style window and typing a prompt. The LLM streams the response back token-by-token. It’s how ChatGPT, claude.ai, and Claude Code work. It’s also how the demos work for basically every AI SDK or AI Library. It’s easy to think that LLM chatbots are the ‘art of the possible’ for AI right now. But that’s not the case.

0 views
Ankur Sethi 1 months ago

A broken 404 template in Django can swallow your backtraces

I recently migrated this website from Astro to Wagtail . The reason why I did it is a story for another day. In this post, I want to talk about a bug that took me far too long to figure out. In his (verifiably incorrect) post about making chai , Abhigyan linked to my own (verifiably correct) post on the topic . While linking to my post, he accidentally omitted the trailing slash from the URL. This shouldn't have been a problem. By default, Django automatically redirects a URL without a trailing slash to the same URL with the trailing slash appended, provided the original URL returns a . For example, if you try to access the following URL on my website: Django automatically performs a redirect to: This is the default behavior, controlled by the setting . However, when Abhigyan linked to my (verifiably correct) post about making chai, my server returned a error instead. I'd never have discovered this error myself, but Shubh pointed it out to me on the IndieWebClub chat last week. Thanks Shubh! I started investigating the issue by checking the Gunicorn logs on my VPS. I was hoping they would contain a backtrace that would help me pinpoint the exact problem, but the logs only printed the string whenever the broken URL was accessed. I ran my app with production settings inside a Docker container to see if I could trigger the same behavior. And sure enough, the Dockerized app produced the same error with the same mysterious in the Gunicorn logs. My first instinct was that I had somehow messed up my logging configuration. I'd surely introduced a bug in some Python code somewhere, and my logging configuration was failing to log the backtrace because of a misconfiguration. But tweaking Django's setting didn't change anything. I could see backtraces from the exceptions I inserted at random points in my code, but accessing a URL without a trailing slash would still only produce the string in the logs. After a lot of head scratching, reading the docs, and yelling at Claude, I wondered if something in my template could be responsible for the error. My template was fairly complex, loading and calling several template tags, inheriting from a chain of templates, rendering a few s, and concatenating assets using django-compressor . I started by deleting everything from and reducing it to a single tag. Sure enough, this fixed the issue! Then I slowly added some of the code back until I found the one custom template tag that was throwing an exception, but only when called in the context of a 404 page. Fixing the tag and redeploying fixed the issue for good. But what about the logs? An error in my 404 template not only caused my server to return a 500, but also suppressed any backtraces that might have helped me diagnose the issue. That's weird, right? I might be wrong, but I believe the sequence of events that can lead to this issue is as follows: The lessons I learned from this frustrating scenario were: Somebody accesses a URL without a trailing slash. Django tries to find that URL in its . Since this is a Wagtail installation, it also tries to find a page in the URLs known to Wagtail. All the URLs in my have trailing slashes. Wagtail also appends trailing slashes to all its URLs when is true. So trying to access a page without a trailing slash returns a 404. You would expect Django's redirect logic to kick in at this point, trying to append a trailing slash to the original URL and performing a redirect. But that's not what happens! The redirect logic lives in , which can only perform the redirect after the entire handling chain has finished running. This means regardless of what happens, Django will always render your template when an unknown URL is accessed. Yes, even if redirecting to the same URL with a trailing slash produces a known, correct URL! This means if your template errors out, doesn't even get a chance to run. Django encounters an unknown URL, tries to render the template, fails, and turns the into a . When this happens, Django only logs the , not the template failing to render. This happens even if you're logging template rendering errors in your logging configuration . From what I can tell, there is no way to get Django to log an error in without creating a custom view, manually catching errors, logging the caught errors, and re-raising them so that Django can turn them into s. Always render your and pages in unit tests to make sure they can never error out. Keep your error pages as simple as possible. Ideally, they should only contain HTML and inlined CSS, nothing more.

0 views
Max Bernstein 2 months ago

Value numbering

Welcome back to compiler land. Today we’re going to talk about value numbering , which is like SSA, but more. Static single assignment (SSA) gives names to values: every expression has a name, and each name corresponds to exactly one expression. It transforms programs like this: where the variable is assigned more than once in the program text, into programs like this: where each assignment to has been replaced with an assignment to a new fresh name. It’s great because it makes clear the differences between the two expressions. Though they textually look similar, they compute different values. The first computes 1 and the second computes 2. In this example, it is not possible to substitute in a variable and re-use the value of , because the s are different. But what if we see two “textually” identical instructions in SSA? That sounds much more promising than non-SSA because the transformation into SSA form has removed (much of) the statefulness of it all. When can we re-use the result? Identifying instructions that are known at compile-time to always produce the same value at run-time is called value numbering . To understand value numbering, let’s extend the above IR snippet with two more instructions, v3 and v4. In this new snippet, v3 looks the same as v1: adding v0 and 1. Assuming our addition operation is some ideal mathematical addition, we can absolutely re-use v1; no need to compute the addition again. We can rewrite the IR to something like: This is kind of similar to the destructive union-find representation that JavaScriptCore and a couple other compilers use, where the optimizer doesn’t eagerly re-write all uses but instead leaves a little breadcrumb / instruction 1 . We could then run our copy propagation pass (“union-find cleanup”?) and get: Great. But how does this happen? How does an optimizer identify reusable instruction candidates that are “textually identical”? Generally, there is no actual text in the IR . One popular solution is to compute a hash of each instruction. Then any instructions with the same hash (that also compare equal, in case of collisions) are considered equivalent. This is called hash-consing . When trying to figure all this out, I read through a couple of different implementations. I particularly like the Maxine VM implementation. For example, here is the (hashing) and functions for most binary operations, slightly modified for clarity: The rest of the value numbering implementation assumes that if a function returns 0, it does not wish to be considered for value numbering. Why might an instruction opt-out of value numbering? An instruction might opt out of value numbering if it is not “pure”. Some instructions are not pure. Purity is in the eye of the beholder, but in general it means that an instruction does not interact with the state of the outside world, except for trivial computation on its operands. (What does it mean to de-duplicate/cache/reuse ?) A load from an array object is also not a pure operation 2 . The load operation implicitly relies on the state of the memory. Also, even if the array was known-constant, in some runtime systems, the load might raise an exception. Changing the source location where an exception is raised is generally frowned upon. Languages such as Java often have requirements about where exceptions are raised codified in their specifications. We’ll work only on pure operations for now, but we’ll come back to this later. We do often want to optimize impure operations as well! We’ll start off with the simplest form of value numbering, which operates only on linear sequences of instructions, like basic blocks or traces. Let’s build a small implementation of local value numbering (LVN). We’ll start with straight-line code—no branches or anything tricky. Most compiler optimizations on control-flow graphs (CFGs) iterate over the instructions “top to bottom” 3 and it seems like we can do the same thing here too. From what we’ve seen so far optimizing our made-up IR snippet, we can do something like this: The find-and-replace, remember, is not a literal find-and-replace, but instead something like: (if you have been following along with the toy optimizer series) This several-line function (as long as you already have a hash map and a union-find available to you) is enough to build local value numbering! And real compilers are built this way, too. If you don’t believe me, take a look at this slightly edited snippet from Maxine’s value numbering implementation. It has all of the components we just talked about: iterating over instructions, map lookup, and some substitution. This alone will get you pretty far. Code generators of all shapes tend to leave messy repeated computations all over their generated code and this will make short work of them. Sometimes, though, your computations are spread across control flow—over multiple basic blocks. What do you do then? Computing value numbers for an entire function is called global value numbering (GVN) and it requires dealing with control flow (if, loops, etc). I don’t just mean that for an entire function, we run local value numbering block-by-block. Global value numbering implies that expressions can be de-duplicated and shared across blocks. Let’s tackle control flow case by case. First is the simple case from above: one block. In this case, we can go top to bottom with our value numbering and do alright. The second case is also reasonable to handle: one block flowing into another. In this case, we can still go top to bottom. We just have to find a way to iterate over the blocks. If we’re not going to share value maps between blocks, the order doesn’t matter. But since the point of global value numbering is to share values, we have to iterate them in topological order (reverse post order (RPO)). This ensures that predecessors get visited before successors. If you have , we have to visit first and then . Because of how SSA works and how CFGs work, the second block can “look up” into the first block and use the values from it. To get global value numbering working, we have to copy ’s value map before we start processing so we can re-use the instructions. Maybe something like: Then the expressions can accrue across blocks. can re-use the already-computed from because it is still in the map. …but this breaks as soon as you have control-flow splits. Consider the following shape graph: We’re going to iterate over that graph in one of two orders: A B C or A C B. In either case, we’re going to be adding all this stuff into the value map from one block (say, B) that is not actually available to its sibling block (say, C). When I say “not available”, I mean “would not have been computed before”. This is because we execute either A then B or A then C. There’s no world in which we execute B then C. But alright, look at a third case where there is such a world: a control-flow join. In this diagram, we have two predecessor blocks B and C each flowing into D. In this diagram, B always flows into D and also C always flows into D. So the iterator order is fine, right? Well, still no. We have the same sibling problem as before. B and C still can’t share value maps. We also have a weird question when we enter D: where did we come from? If we came from B, we can re-use expressions from B. If we came from C, we can re-use expressions from C. But we cannot in general know which predecessor block we came from. The only block we know for sure that we executed before D is A. This means we can re-use A’s value map in D because we can guarantee that all execution paths that enter D have previously gone through A. This relationship is called a dominator relationship and this is the key to one style of global value numbering that we’re going to talk about in this post. A block can always use the value map from any other block that dominates it. For completeness’ sake, in the diamond diagram, A dominates each of B and C, too. We can compute dominators a couple of ways 4 , but that’s a little bit out of scope for this blog post. If we assume that we have dominator information available in our CFG, we can use that for global value numbering. And that’s just what—you guessed it—Maxine VM does. It iterates over all blocks in reverse post-order, doing local value numbering, threading through value maps from dominator blocks. In this case, their method gets the immediate dominator : the “closest” dominator block of all the blocks that dominate the current one. And that’s it! That’s the core of Maxine’s GVN implementation . I love how short it is. For not very much code, you can remove a lot of duplicate pure SSA instructions. This does still work with loops, but with some caveats. From p7 of Briggs GVN : The φ-functions require special treatment. Before the compiler can analyze the φ-functions in a block, it must previously have assigned value numbers to all of the inputs. This is not possible in all cases; specifically, any φ-function input whose value flows along a back edge (with respect to the dominator tree) cannot have a value number. If any of the parameters of a φ-function have not been assigned a value number, then the compiler cannot analyze the φ-function, and it must assign a unique, new value number to the result. It also talks about eliminating useless phis, which is optional, but would the strengthen global value numbering pass: it makes more information transparent. But what if we want to handle impure instructions? Languages such as Java allow for reading fields from the / object within methods as if the field were a variable name. This makes code like the following common: Each of these reference to and is an implicit reference to or , which is semantically a field load off an object. You can see it in the bytecode (thanks, Matt Godbolt): When straightforwardly building an SSA IR from the JVM bytecode for this method, you will end up with a bunch of IR that looks like this: Pretty much the same as the bytecode. Even though no code in the middle could modify the field (which would require a re-load), we still have a duplicate load. Bummer. I don’t want to re-hash this too much but it’s possible to fold Load and store forwarding into your GVN implementation by either: See, there’s nothing fundamentally stopping you from tracking the state of your heap at compile-time across blocks. You just have to do a little more bookkeeping. In our dominator-based GVN implementation, for example, you can: Not so bad. Maxine doesn’t do global memory tracking, but they do a limited form of load-store forwarding while building their HIR from bytecode: see GraphBuilder which uses the MemoryMap to help track this stuff. At least they would not have the same duplicate instructions in the example above! We’ve now looked at one kind of value numbering and one implementation of it. What else is out there? Apparently, you can get better results by having a unified hash table (p9 of Briggs GVN ) of expressions, not limiting the value map to dominator-available expressions. Not 100% on how this works yet. They note: Using a unified hash-table has one important algorithmic consequence. Replacements cannot be performed on-line because the table no longer reflects availability. Which is the first time that it occurred to me that hash-based value numbering with dominators was an approximation of available expression analysis. There’s also a totally different kind of value numbering called value partitioning (p12 of Briggs GVN ). See also a nice blog post about this by Allen Wang from the Cornell compiler course . I think this mostly replaces the hashing bit, and you still need some other thing for the available expressions bit. Ben Titzer and Seth Goldstein have some good slides from CMU . Where they talk about the worklist dataflow approach. Apparently this is slower but gets you more available expressions than just looking to dominator blocks. I wonder how much it differs from dominator+unified hash table. While Maxine uses hash table cloning to copy value maps from dominator blocks, there are also compilers such as Cranelift that use scoped hash maps to track this information more efficiently. (Though Amanieu notes that you may not need a scoped hash map and instead can tag values in your value map with the block they came from, ignoring non-dominating values with a quick check. The dominance check makes sense but I haven’t internalized how this affects the set of available expressions yet.) You may be wondering if this kind of algorithm even helps at all in a dynamic language JIT context. Surely everything is too dynamic, right? Actually, no! The JIT hopes to eliminate a lot of method calls and dynamic behaviors, replacing them with guards, assumptions, and simpler operations. These strength reductions often leave behind a lot of repeated instructions. Just the other day, Kokubun filed a value-numbering-like PR to clean up some of the waste. ART has a recent blog post about speeding up GVN. Go forth and give your values more numbers. There’s been an ongoing discussion with Phil Zucker on SSI, GVN, acyclic egraphs, and scoped union-find. TODO summarize Commutativity; canonicalization Seeding alternative representations into the GVN Aegraphs and union-find during GVN https://github.com/bytecodealliance/rfcs/blob/main/accepted/cranelift-egraph.md https://github.com/bytecodealliance/wasmtime/issues/9049 https://github.com/bytecodealliance/wasmtime/issues/4371 Writing this post is roughly the time when I realized that the whole time I was wondering why Cinder did not use union-find for rewriting, it actually did! Optimizing instruction by replacing with followed by copy propagation is equivalent to union-find.  ↩ In some forms of SSA, like heap-array SSA or sea of nodes, it’s possible to more easily de-duplicate loads because the memory representation has been folded into (modeled in) the IR.  ↩ The order is a little more complicated than that: reverse post-order (RPO). And there’s a paper called “A Simple Algorithm for Global Data Flow Analysis Problems” that I don’t yet have a PDF for that claims that RPO is optimal for solving dataflow problems.  ↩ There’s the iterative dataflow way (described in the Cooper paper (PDF)), Lengauer-Tarjan (PDF), the Engineered Algorithm (PDF), hybrid/Semi-NCA approach (PDF), …  ↩ initialize a map from instruction numbers to instruction pointers for each instruction if wants to participate in value numbering if ’s value number is already in the map, replace all pointers to in the rest of the program with the corresponding value from the map otherwise, add to the map doing load-store forwarding as part of local value numbering and clearing memory information from the value map at the end of each block, or keeping track of effects across blocks track heap write effects for each block at the start of each block B, union all of the “kill” sets for every block back to its immediate dominator finally, remove the stuff that got killed from the dominator’s value map V8 Hydrogen Writing this post is roughly the time when I realized that the whole time I was wondering why Cinder did not use union-find for rewriting, it actually did! Optimizing instruction by replacing with followed by copy propagation is equivalent to union-find.  ↩ In some forms of SSA, like heap-array SSA or sea of nodes, it’s possible to more easily de-duplicate loads because the memory representation has been folded into (modeled in) the IR.  ↩ The order is a little more complicated than that: reverse post-order (RPO). And there’s a paper called “A Simple Algorithm for Global Data Flow Analysis Problems” that I don’t yet have a PDF for that claims that RPO is optimal for solving dataflow problems.  ↩ There’s the iterative dataflow way (described in the Cooper paper (PDF)), Lengauer-Tarjan (PDF), the Engineered Algorithm (PDF), hybrid/Semi-NCA approach (PDF), …  ↩

0 views