Posts in Backend (20 found)
Uros Popovic 3 days ago

How to use Linux vsock for fast VM communication

Discover how to bypass the network stack for Host-to-VM communication using Linux Virtual Sockets (AF_VSOCK). This article details how to use these sockets to build a high-performance gRPC service in C++ that communicates directly over the hypervisor bus, avoiding TCP/IP overhead entirely.
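The summary above can be sketched at the socket level. Below is a minimal, hypothetical Python echo server over AF_VSOCK; the article's actual implementation is a gRPC service in C++, and the port number and fallback constants here are assumptions for illustration only.

```python
import socket

# Fallback values for platforms whose socket module lacks vsock support
# (AF_VSOCK is Linux-only; 40 is its numeric value on Linux).
AF_VSOCK = getattr(socket, "AF_VSOCK", 40)
VMADDR_CID_ANY = getattr(socket, "VMADDR_CID_ANY", 0xFFFFFFFF)

VSOCK_PORT = 5000  # hypothetical port; any unused 32-bit port works


def vsock_listen_address(port):
    """vsock addresses are (CID, port) pairs rather than (host, port)."""
    return (VMADDR_CID_ANY, port)


def serve_echo():
    # Inside the guest, bind to our own CID on the chosen port; the host side
    # then connects with an AF_VSOCK socket to (guest_cid, VSOCK_PORT),
    # bypassing the TCP/IP stack entirely.
    with socket.socket(AF_VSOCK, socket.SOCK_STREAM) as srv:
        srv.bind(vsock_listen_address(VSOCK_PORT))
        srv.listen(1)
        conn, (peer_cid, peer_port) = srv.accept()
        with conn:
            data = conn.recv(4096)
            conn.sendall(data)  # echo straight back over the hypervisor bus
```

Calling `serve_echo()` inside a VM and connecting from the host demonstrates the same transport the article builds its gRPC channel on.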

0 views
iDiallo 1 week ago

How Do You Send an Email?

It's been over a year, and I haven't received a single notification email from my web server. That could mean my $6 VPS is amazing and hasn't gone down once this past year. Or it could mean my health check service has gone down. Well, this year I did receive emails from readers telling me my website was down. After some digging, I discovered that my health checker works just fine, but every email it sends is being rejected by Gmail. Unless you use a third-party service, you have little to no chance of sending an email that gets delivered. Every year, email services seem to become a tad more expensive. When I first started this website, sending emails to my subscribers was free on Mailchimp. Now it costs $45 a month. On Buttondown, as of this writing, it costs $29 a month. What are they doing that costs so much? It seems like sending emails is impossibly hard, something you can almost never do yourself. You have to rely on established services if you want any guarantee that your email will be delivered. But is it really that complicated? Emails, just like websites, use a basic communication protocol to function. For you to land on this website, your browser somehow communicated with my web server, did some negotiating, and then my server sent HTML data that your browser rendered on the page. But what about email? Is the process any different? The short answer is no. Email and the web work in remarkably similar fashion. Here's the short version: in order to send me an email, your email client takes the email address you provide, connects to my server, does some negotiating, and then my server accepts the email content you intended to send and saves it. My email client will then take that saved content and notify me that I have a new message from you. That's it. That's how email works. So what's the big fuss about? Why are email services charging $45 just to send ~1,500 emails? Why is it so expensive, when I can serve millions of requests a day on my web server for a fraction of the cost? The short answer is spam. But before we get to spam, let's get into the details I've omitted from the examples above: the negotiations, and how similar email and web traffic really are.

When you type a URL into your browser and hit enter, here's what happens:

- DNS Lookup: Your browser asks a DNS server, "What's the IP address for this domain?" The DNS server responds with something like .
- Connection: Your browser establishes a TCP connection with that IP address on port 80 (HTTP) or port 443 (HTTPS).
- Request: Your browser sends an HTTP request: "GET /blog-post HTTP/1.1"
- Response: My web server processes the request and sends back the HTML, CSS, and JavaScript that make up the page.
- Rendering: Your browser receives this data and renders it on your screen.

The entire exchange is direct, simple, and happens in milliseconds. Now let's look at email. The process is similar:

- DNS Lookup: Your email client takes my email address ( ) and asks a DNS server, "What's the mail server for example.com?" The DNS server responds with an MX (Mail Exchange) record pointing to my mail server's address.
- Connection: Your email client (or your email provider's server) establishes a TCP connection with my mail server on port 25 (SMTP) or port 587 (for authenticated SMTP).
- Negotiation (SMTP): Your server says "HELO, I have a message for [email protected]." My server responds: "OK, send it."
- Transfer: Your server sends the email content (headers, body, attachments) using the Simple Mail Transfer Protocol (SMTP).
- Storage: My mail server accepts the message and stores it in my mailbox, which can be a simple text file on the server.
- Retrieval: Later, when I open my email client, it connects to my server using IMAP (port 993) or POP3 (port 110) and asks, "Any new messages?" My server responds with your email, and my client displays it.

Both HTTP and email use DNS to find servers, establish TCP connections, exchange data using text-based protocols, and deliver content to the end user. They're built on the same fundamental internet technologies. So if email is just as simple as serving a website, why does it cost so much more? The answer lies in a problem that both systems share but handle very differently: unwanted third-party writes. Both web servers and email servers allow outside parties to send them data. Web servers accept form submissions, comments, API requests, and user-generated content. Email servers accept messages from any other email server on the internet. In both cases, this openness creates an opportunity for abuse. Spam isn't unique to email; it's everywhere. My blog used to get around 6,000 spam comments daily. Across the internet, you'll see spam comments on blogs, spam account registrations, spam API calls, spam form submissions, and yes, spam emails. The main difference is visibility. When spam protection works well, it's invisible. You visit websites every day without realizing that, behind the scenes, CAPTCHAs are blocking bot submissions, rate limiters are rejecting suspicious traffic, and content filters are catching spam comments before they're published. You don't get to see the thousands of spam attempts that happen every day on my blog, because of the filtering I've implemented. On a well-run web server, the work is invisible. The same is true for email. A well-run email server silently:

- Checks sender reputation against blacklists
- Validates SPF, DKIM, and DMARC records
- Scans message content for spam signatures
- Filters out malicious attachments
- Quarantines suspicious senders

There is a massive amount of spam. In fact, spam accounts for roughly 45-50% of all email traffic globally. But when the system works, you simply don't see it. If we can combat spam on the web without charging exorbitant fees, email spam shouldn't be that different. The technical challenges are very similar:

- Both require reputation systems
- Both need content filtering
- Both face distributed abuse
- Both require infrastructure to handle high volume

Yet a basic web server on a $5/month VPS can handle millions of requests with minimal spam-fighting overhead. Meanwhile, sending 1,500 emails costs $29-45 per month through commercial services. The difference isn't purely technical. It's about reputation, deliverability networks, and the ecosystem that has evolved around email. Email providers have created a cartel-like system where your ability to reach inboxes depends on your server's reputation, which is nearly impossible to establish as a newcomer. They've turned a technical problem (spam) into a business moat. And we're all paying for it. Email isn't inherently more complex or expensive than web hosting. Both the protocols and the infrastructure are similar, and the spam problem exists in both domains. The cost difference is mostly artificial. It's the result of an ecosystem that has consolidated around a few major providers who control deliverability. It doesn't help that Intuit owns Mailchimp now. Understanding this doesn't change the fact that you'll probably still need to pay for email services if you want reliable delivery. But it should make you question whether that $45 monthly bill is really justified by the technical costs involved, or whether it's just the price of admission to a gatekept system.
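The SMTP flow described above can be sketched with nothing but the Python standard library. The addresses and mail host below are placeholders, not anything from the article:

```python
from email.message import EmailMessage
import smtplib  # used only by the commented-out delivery step below

# Build the message that the client transfers during the SMTP "Transfer" step.
msg = EmailMessage()
msg["From"] = "you@example.org"   # placeholder addresses
msg["To"] = "me@example.com"
msg["Subject"] = "Hello"
msg.set_content("Just a plain-text message.")

# Delivery itself is the negotiation described above: open a TCP connection
# to the receiving MX host on port 25 (or 587 with authentication), then
# EHLO/HELO, MAIL FROM, RCPT TO, DATA. Uncomment with a real mail server:
# with smtplib.SMTP("mail.example.com", 25) as smtp:
#     smtp.send_message(msg)
```

Whether that message then lands in an inbox or a spam folder is exactly the reputation problem the rest of the post is about.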

0 views
Binary Igor 2 weeks ago

Modular Monolith and Microservices: Data ownership, boundaries, consistency and synchronization

Virtually every module - a folder or versioned package in a modular monolith, a separately deployed microservice - must own or at least read some data to provide its functionality. As we shall see, the degree to which module A needs data from module B is often the degree to which it depends on that module; functionality is another important dimension of dependence. This leads us to the following principles...

0 views
Karboosx 2 weeks ago

Improving Train Meal Ordering Systems Without Internet

Train meal ordering apps require internet, but trains often don't have it. Here's how to make them work offline using local servers and JWT authentication.
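As a rough illustration of the JWT half of that summary, here is a minimal HS256 signer/verifier using only the Python standard library. The secret and claims are hypothetical; a real deployment would use a vetted JWT library and add expiry/issuer claims.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"shared-onboard-secret"  # hypothetical key provisioned before departure


def b64url(data):
    """Base64url without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_token(payload):
    """Sign a claims dict offline; no network access needed."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"


def verify_token(token):
    """Return the claims if the signature checks out, else None."""
    header, body, sig = token.split(".")
    signing_input = f"{header}.{body}".encode()
    expected = b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # tampered, or signed with a different key
    padded = body + "=" * (-len(body) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

Because verification needs only the shared secret, a local server on the train can validate orders with no internet connection at all.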

0 views
maxdeviant.com 2 weeks ago

Head in the Zed Cloud

For the past five months I've been leading the efforts to rebuild Zed 's cloud infrastructure. Our current backend—known as Collab—has been chugging along since basically the beginning of the company. We use Collab every day to work together on Zed in Zed. However, as Zed continues to grow and attract more users, we knew that we needed a full reboot of our backend infrastructure to set us up for success in our future endeavors. Enter Zed Cloud. Like Zed itself, Zed Cloud is built in Rust 1 . This time around there is a slight twist: all of this is running on Cloudflare Workers , with our Rust code being compiled down to WebAssembly (Wasm). One of our goals with this rebuild was to reduce the amount of operational effort it takes to maintain our hosted services, so that we can focus more of our time and energy on building Zed itself. Cloudflare Workers allow us to easily scale up to meet demand without having to fuss over it too much. Additionally, Cloudflare offers an ever-growing set of managed services that cover anything you might need for a production web service. Here are some of the Cloudflare services we're using today:

- Hyperdrive for talking to Postgres
- Workers KV for ephemeral storage
- Cloudflare Queues for asynchronous job processing

Another one of our goals with this rebuild was to build a platform that was easy to test. To achieve this, we built our own platform framework on top of the Cloudflare Workers runtime APIs. At the heart of this framework is the trait: This trait allows us to write our code in a platform-agnostic way while still leveraging all of the functionality that Cloudflare Workers has to offer. Each one of these associated types corresponds to some aspect of the platform that we'll want to have control over in a test environment. For instance, if we have a service that needs to interact with the system clock and a Workers KV store, we would define it like this: There are two implementors of the trait: and . —as the name might suggest—is an implementation of the platform on top of the Cloudflare Workers runtime. This implementation targets Wasm and is what we run when developing locally (using Wrangler ) and in production. We have a crate 2 that contains bindings to the Cloudflare Workers JS runtime. You can think of as the glue between those bindings and the idiomatic Rust APIs exposed by the trait. The is used when running tests, and allows for simulating almost every part of the system in order to effectively test our code. Here's an example of a test for ingesting a webhook from Orb : In this test we're able to exercise the full end-to-end flow of:

- Receiving and validating an incoming webhook event at our webhook ingestion endpoint
- Putting the webhook event into a queue
- Consuming the webhook event in a background worker and processing it

The call to advances the test simulator, in this case running the pending queue consumers. At the center of the is the , a crate that powers our in-house async runtime. The scheduler is shared between GPUI —Zed's UI framework—and the used in tests. This shared scheduler enables us to write tests that span the client and the server. So we can have a test that starts in a piece of Zed code, flows through Zed Cloud, and then asserts on the state of something in Zed after it receives the response from the backend. The work being done on Zed Cloud now is laying the foundation to support our future work around collaborative coding with DeltaDB . If you want to work with me on building out Zed Cloud, we are currently hiring for this role. We're looking for engineers with experience building and maintaining web APIs and platforms, solid web fundamentals, and who are excited about Rust. If you end up applying, you can mention this blog post in your application. I look forward to hearing from you! The codebase is currently 70k lines of Rust code and 5.7k lines of TypeScript. This is essentially our own version of . I'd like to switch to using directly, at some point.
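The platform trait the post describes is Rust, but the testability idea translates to any language: service code depends on abstract clock and key-value interfaces, and tests swap in controllable fakes. Here is a hypothetical Python sketch of the same pattern; all names are invented, not Zed's.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Clock(Protocol):
    def now(self) -> float: ...


class KvStore(Protocol):
    def get(self, key): ...
    def put(self, key, value): ...


@dataclass
class TestClock:
    """Deterministic clock the test environment fully controls."""
    current: float = 0.0

    def now(self):
        return self.current

    def advance(self, seconds):
        self.current += seconds


@dataclass
class TestKv:
    """In-memory stand-in for a managed KV service like Workers KV."""
    data: dict = field(default_factory=dict)

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value


def record_webhook(clock: Clock, kv: KvStore, event_id: str):
    # Service code depends only on the abstract platform, never on the
    # concrete runtime, so the same function runs in production and in tests.
    kv.put(event_id, f"received at {clock.now()}")
```

In a test, `TestClock.advance` plays the role the post's simulator-advancing call plays: the test drives time and I/O instead of the real platform.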

1 view
Karboosx 3 weeks ago

Building a Simple Search Engine That Actually Works

You don't need Elasticsearch for most projects. I built a simple search engine from scratch that tokenizes everything, stores it in your existing database, and scores results by relevance. Dead simple to understand and maintain.
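The approach in that summary can be sketched in a few lines. This is a hypothetical Python version using SQLite as the "existing database": tokenize on insert, store tokens in an indexed table, and score results by how many query tokens each document contains.

```python
import re
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for your existing database
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("CREATE TABLE tokens (token TEXT, doc_id INTEGER)")
db.execute("CREATE INDEX idx_token ON tokens (token)")


def tokenize(text):
    """Lowercase and split into alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_doc(doc_id, body):
    """Store the document and its distinct tokens."""
    db.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, body))
    db.executemany(
        "INSERT INTO tokens VALUES (?, ?)",
        [(tok, doc_id) for tok in set(tokenize(body))],
    )


def search(query):
    """Return (doc_id, score) pairs, best match first.

    Score = number of distinct query tokens the document contains.
    """
    toks = tokenize(query)
    if not toks:
        return []
    placeholders = ",".join("?" for _ in toks)
    return db.execute(
        f"SELECT doc_id, COUNT(*) AS score FROM tokens "
        f"WHERE token IN ({placeholders}) "
        f"GROUP BY doc_id ORDER BY score DESC",
        toks,
    ).fetchall()
```

A real version would add stemming and TF-IDF-style weighting, but this already demonstrates the tokenize-store-score shape the post describes.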

25 views
Abhinav Sarkar 3 weeks ago

A Short Survey of Compiler Targets

As an amateur compiler developer, one of the decisions I struggle with is choosing the right compiler target. Unlike the ’80s, when people had to target various machine architectures directly, there are now many mature options available. This is a short and very incomplete survey of some of the popular and interesting ones. A compiler can always directly output machine code or assembly targeted at one or more architectures. A well-known example is the Tiny C Compiler . It’s known for its speed and small size, and it can compile and run C code on the fly. Another such example is Turbo Pascal . You could do this with your compiler too, but you’ll have to figure out the intricacies of the instruction set (ISA) of each architecture you want to target, as well as concepts like Register allocation . Most modern compilers don’t emit machine code or assembly directly. They first lower the source code to a language-agnostic Intermediate representation (IR), and then generate machine code for major architectures (x86-64, ARM64, etc.) from it. The most prominent tool in this space is LLVM . It’s a large, open-source compiler-as-a-library. Compilers for many languages such as Rust , Swift , C/C++ (via Clang ), and Julia lower to LLVM IR and let LLVM emit machine code. An alternative is the GNU C compiler (GCC), via its GIMPLE IR, though no compilers seem to use it directly. GCC can be used as a library to compile code, much like LLVM, via libgccjit , which Emacs uses to Just-in-time (JIT) compile Elisp . Cranelift is another new option in this space, though it supports only a few ISAs. For those who find LLVM or GCC too large or slow, minimalist alternatives exist. QBE is a small backend focused on simplicity, targeting “70% of the performance in 10% of the code”. It’s used by the Hare language, which prioritizes fast compile times. Another option is libFIRM , which uses a graph-based SSA representation instead of a linear IR.
Sometimes you are okay with letting other compilers/runtimes take care of the heavy lifting. You can transpile your code to another established high-level language and leverage that language’s existing compiler/runtime and toolchain. A common target in such cases is C. Since C compilers exist for nearly all platforms, generating C code makes your language highly portable. This is the strategy used by Chicken Scheme and Vala . Or you could compile to C++ instead, like Jank , if that’s your thing. There is also C-- , a C-like language targeted by GHC and OCaml . Another ubiquitous target is JavaScript (JS), which is one of the two options (the other being WebAssembly ) for running code natively in a web browser or in one of the JS runtimes ( Node , Deno , Bun ). Multiple languages such as TypeScript , PureScript , Reason , ClojureScript , Dart and Elm transpile to JS. Nim , interestingly, can transpile to C, C++ or JS. Another target similar to JS is Lua , a lightweight and embeddable scripting language, which languages such as MoonScript and Fennel transpile to. A more niche approach is to target a Lisp dialect. Compiling to Chez Scheme , for example, allows you to leverage its macro system, runtime, and compiler. Idris 2 and Racket use Chez Scheme as their primary backend target. Virtual machines are a common choice for application languages: you compile to portable bytecode for a Virtual machine (VM). VMs generally come with features like Garbage collection , JIT compilation , and security sandboxing. The Java Virtual Machine (JVM) is probably the most popular one. It’s the target for many languages including Java , Kotlin , Scala , Groovy , and Clojure . Its main competitor is the Common Language Runtime , originally developed by Microsoft , which is targeted by languages such as C# , F# , and Visual Basic.NET . Another notable VM is the BEAM , originally built for Erlang .
The BEAM VM isn’t built for raw computation speed but for high concurrency, fault tolerance, and reliability. Recently, new languages such as Elixir and Gleam have been created to target it. Finally, this category also includes MoarVM —the spiritual successor to the Parrot VM —built for the Raku (formerly Perl 6) language. WebAssembly (Wasm) is a relatively new target. It’s a portable binary instruction format focused on security and efficiency. Wasm is supported by all major browsers, but not limited to them: the WebAssembly System Interface (WASI) standard provides APIs for running Wasm in non-browser and non-JS environments. Wasm is now targeted by many languages such as Rust , C/C++ , Go , Kotlin , Scala , Zig , and Haskell . Meta-tracing frameworks are a more complex category. These are not targets for your compiler backend; instead, you use them to build a custom JIT compiler for your language by specifying an interpreter for it. The most well-known example is PyPy , an implementation of Python created using the RPython framework. Another such framework is GraalVM/Truffle , a polyglot VM and language implementation framework from Oracle . Its main feature is zero-cost interoperability: code from GraalJS , TruffleRuby , and GraalPy can all run on the same VM, and can call each other directly. Move past the mainstream, and you’ll discover a world of unconventional and esoteric compiler targets. Developers pick them out of academic curiosity, for artistic expression, or to test the boundaries of viable compilation targets. Brainfuck: An esoteric language with only eight commands, Brainfuck is Turing-complete and has been a target for compilers as a challenge. People have written compilers for C , Haskell and Lambda calculus . Lambda calculus: Lambda calculus is a minimal programming language that expresses computation solely as functions and their applications.
It is often used as the target of educational compilers because of its simplicity, and its link to the fundamental nature of computation. Hell , a subset of Haskell, compiles to Simply typed lambda calculus . SKI combinators: The SKI combinator calculus is even more minimal than lambda calculus. All programs in SKI calculus can be composed of only three combinators: S, K and I. MicroHs compiles a subset of Haskell to SKI calculus. JSFuck: Did you know that you can write all possible JavaScript programs using only six characters ? Well, now you know . Postscript: Postscript is also a Turing-complete programming language. Your next compiler could target it! Regular Expressions ? Lego ? Cellular automata ? I’m going to write a compiler from C++ to JSFuck. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! This post was originally published on abhinavsarkar.net .
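To make the SKI idea concrete, here is a toy Python reducer for the three combinators (I a → a, K a b → a, S a b c → a c (b c)). Terms are strings ("S", "K", "I", or variable names) and nested application pairs; this is purely illustrative, not how MicroHs works internally.

```python
def step(term):
    """Perform one leftmost reduction step; return (new_term, changed)."""
    if not isinstance(term, tuple):
        return term, False  # atoms (combinators, variables) are normal forms
    f, x = term
    if f == "I":
        return x, True                     # I a        -> a
    if isinstance(f, tuple) and f[0] == "K":
        return f[1], True                  # (K a) b    -> a
    if isinstance(f, tuple) and isinstance(f[0], tuple) and f[0][0] == "S":
        a, b, c = f[0][1], f[1], x
        return ((a, c), (b, c)), True      # ((S a) b) c -> (a c) (b c)
    nf, changed = step(f)                  # otherwise reduce inside
    if changed:
        return (nf, x), True
    nx, changed = step(x)
    return (f, nx), changed


def reduce_term(term, limit=100):
    """Reduce until normal form or until the step limit is hit."""
    for _ in range(limit):
        term, changed = step(term)
        if not changed:
            return term
    return term
```

Reducing `((S K) K) x` yields `x`, the classic demonstration that S and K alone can express I.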

0 views
Lukáš Lalinský 1 month ago

How I turned Zig into my favorite language to write network programs in

I’ve been watching the Zig language for a while now, given that it was created for writing audio software (low-level, no allocations, real time). I never paid too much attention, though; it seemed a little weird to me and I didn’t see a real need for it. Then I saw a post from Andrew Kelley (the creator of the language) on Hacker News, about how he reimplemented my Chromaprint algorithm in Zig, and that got me really interested. I’ve been planning to rewrite AcoustID’s inverted index for a long time. I had a couple of prototypes, but none of the approaches felt right. I was going through some rough times and wanted to learn something new, so I decided to use the project as an opportunity to learn Zig. And it was great; writing Zig is a joy. The new version was faster and more scalable than the previous C++ one. I was happy, until I wanted to add a server interface. In the previous C++ version, I used Qt , which might seem very strange for server software, but I wanted a nice way of doing asynchronous I/O and Qt allowed me to do that. It was callback-based, but Qt has a lot of support for making callbacks usable. In the newer prototypes, I used Go, specifically for the ease of networking and concurrency. With Zig, I was stuck. There are some Zig HTTP servers, so I could use those. But I wanted to implement my legacy TCP server as well, and that’s a lot harder, unless I want to spawn a lot of threads. Then I made a crazy decision: to also use Zig for implementing a clustered layer on top of my server, using NATS as a messaging system. So I wrote a Zig NATS client , and that gave me a lot of experience with Zig’s networking capabilities. Fast forward to today: I’m happy to introduce Zio, an asynchronous I/O and concurrency library for Zig . If you look at the examples, you will not really see where the asynchronous I/O is, but it’s there in the background, and that’s the point. Writing asynchronous code with callbacks is a pain.
Not only that, it requires a lot of allocations, because you need state to survive across callbacks. Zio is an implementation of Go-style concurrency, but limited to what’s possible in Zig. Zio tasks are stackful coroutines with fixed-size stacks. When you run , this will initiate the I/O operation in the background and then suspend the current task until the I/O operation is done. When it’s done, the task will be resumed and the result returned. That gives you the illusion of synchronous code, allowing for much simpler state management. Zio supports fully asynchronous network and file I/O, has synchronization primitives (mutexes, condition variables, etc.) that work with the cooperative runtime, Go-style channels, OS signal watchers, and more. Tasks can run in single-threaded mode, or multi-threaded, in which case they can migrate from thread to thread for lower latency and better load balancing. And it’s FAST. I don’t want to be posting benchmarks here, maybe later when I have more complex ones, but the single-threaded mode is beating any framework I’ve tried so far. It’s much faster than both Go and Rust’s Tokio. Context switching is virtually free, comparable to a function call. The multi-threaded mode, while still not as robust as Go/Tokio, has comparable performance. It’s still a bit faster than either of them, but that performance might go down as I add more fairness features. Because it implements the standard reader/writer interfaces, you can actually use external libraries that are unaware they are running within Zio. Here is an example of an HTTP server: When I started working with Zig, I really thought it was going to be a niche language to write the fast code in, and that I’d need a layer on top of it in a different language. With Zio, that changed. The next step for me is to update my NATS client to use Zio internally. And after that, I’m going to work on an HTTP client/server library based on Zio.
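The suspend/resume behavior described above can be illustrated with a toy cooperative scheduler in Python. Unlike Zio's stackful coroutines, Python generators are stackless, but the shape is the same: a task runs until it starts an I/O operation, suspends, and is resumed when the operation completes. Everything here is invented for illustration, and the simulated "I/O" completes instantly.

```python
from collections import deque


def scheduler(tasks):
    """Round-robin cooperative scheduler over generator-based tasks."""
    ready = deque(tasks)
    results = []
    while ready:
        task = ready.popleft()
        try:
            request = next(task)  # run until the task "starts an I/O op"
            # A real runtime would submit `request` to epoll/io_uring and
            # resume the task only on completion; here we complete at once.
            ready.append(task)
        except StopIteration as done:
            results.append(done.value)  # the task's return value
    return results


def fetch(name):
    # Reads like synchronous code, but suspends at every yield.
    yield f"connect {name}"
    yield f"read {name}"
    return f"{name}: ok"
```

Running `scheduler([fetch("a"), fetch("b")])` interleaves the two tasks' steps, which is the "illusion of synchronous code" the post describes, minus the real event loop underneath.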

0 views
Ahmad Alfy 1 months ago

The Hidden Cost of URL Design

When we architected an e-commerce platform for one of our clients, we made what seemed like a simple, user-friendly decision: use clean, flat URLs. Products would live at , categories at , pages at . No prefixes, no or clutter. Minimalist paths that felt simple. This decision, which was made hastily and without proper discussion, would later cost us hours spent on optimization. The problem wasn’t the URLs themselves. It was that we treated URL design as a UX decision when it’s fundamentally an architectural decision with cascading technical implications. Every request to the application triggered two backend API calls. Every bot crawling a malformed URL hit the database twice. Every 404 was expensive. This article isn’t about URL best practices you’ve read a hundred times (keeping URLs short, avoiding special characters, or using hyphens instead of underscores). This is about something rarely discussed: how your URL structure shapes your entire application architecture, performance characteristics, and operational costs. Flat URLs like or feel right. They’re intuitive, readable, and align with how users think about content. No technical jargon, no hierarchy to remember. This design philosophy emerged from the SEO community’s consensus that simpler URLs perform better in search rankings. But here’s what the SEO guides don’t tell you: flat URLs trade determinism for aesthetics . When your URL is , your application cannot know whether you’re requesting: This ambiguity means your application must ask rather than know . And asking is expensive. Many traditional CMSs solved this problem decades ago. Systems like Joomla, WordPress (to some extent), and Drupal maintain dedicated SEF (Search Engine Friendly) URL tables which are essentially lookup dictionaries that map clean URLs to their corresponding entity types and IDs. When you request , these systems do a single, fast database lookup: One query, instant resolution, minimal overhead. 
The URL ambiguity is resolved at the database layer with an indexed lookup rather than through sequential API calls. But not every system works this way. In our case, Magento’s API architecture didn’t expose a unified URL resolution endpoint. The frontend had to query separate endpoints for products and categories, which brings us to our problem. In a structured URL system ( ), the routing decision is instant: In a flat URL system, you need a resolver: This might seem like a minor difference, a few extra database queries. But let’s see what this actually costs at scale. Our stack for this particular client consisted of a Nuxt.js frontend (running in SSR mode) and a Magento backend. Every URL that hit the application went through this flow: User requests: . Then, Nuxt SSR Server would query Magento API twice to confirm what the slug represented. If neither existed, it returned a 404. The diagram makes the inefficiency obvious: Let’s say we have: This will be translated to: Now scale this during a traffic spike, say Black Friday or a major product launch. Our backend autoscaling would kick in, spinning up new instances to handle what was essentially artificial load created by our URL design . We observed: The kicker? This wasn’t a bug. This was the architecture working exactly as designed. There’s another subtle issue: in systems without slug uniqueness constraints, a product and category could both use the same slug. Now your resolver doesn’t just need to check what exists. It needs to decide which one to serve. Do you prioritize products? Categories? First-created wins? This ambiguity isn’t just a performance problem; it’s a business logic problem. If a user comes from a marketing email expecting a product but lands on a category page instead, that’s a conversion lost. You might be experiencing this issue if: If you checked 3 or more of these, keep reading. Your URLs might be costing you more than you think. 
Faced with this problem on a platform with 100k+ SKUs, we had to choose a solution that balanced performance gains with implementation reality. URL restructuring with 301 redirects would be a massive undertaking. Maintaining redirect maps for that many products, ensuring no SEO disruption, and coordinating the migration was simply too risky and resource-intensive. Instead, we implemented a two-part solution that leveraged what we already had and made smart optimizations where it mattered most. We realized the Nuxt server already cached the category tree for building navigation menus. Categories don’t change frequently, so this cache was stable and reliable. We modified the URL resolver to: Result: We went from 2 backend calls per request to just 1 for product pages, and 0 additional calls for category pages (they already hit the cache). Categories resolved instantly, products required only one API call instead of two. Here’s the key insight: when users navigate within the application, we already know what they clicked on. A product card knows it’s linking to a product. A category menu knows it’s linking to a category. We updated all internal links to include a simple query parameter: Then in the route middleware: Result: Internal navigation happens purely on the client side with direct API calls. The server-side resolver is only used for: Since most traffic after the initial landing is internal navigation, this reduced our server-side resolution load by approximately 70-80%. Before optimization: After optimization: This solution had several advantages for our specific context: The query parameter approach might seem inelegant, but remember: these parameters only appear during internal navigation within the SPA. When users share links or search engines crawl, they see clean URLs like . The only exists in the client-side routing context. 
Here’s how requests are handled in our optimized system: If you’re starting fresh, you have the luxury of making informed decisions before the first line of code. Here’s what we wish we’d known and what we now recommend to clients. Default to deterministic URLs unless you have a compelling reason not to. Good starting point: It’s far easier to remove structure later (with redirects) than to add it. Going from to is a simple 301. Going the other direction means updating every existing URL, redirecting old URLs, potential SEO turbulence, and user confusion with bookmarks. Before finalizing your URL scheme, ask: Questions to answer: Decision matrix: When choosing flat URLs, explicitly budget for: Infrastructure: Development time: These aren’t failures. They’re the actual cost of flat URLs. If the business value (SEO, UX, brand consistency) exceeds these costs, great. But make the trade-off explicit rather than discovering it six months in. If you do use flat URLs, make slug uniqueness a hard constraint at the database or application level. This prevents the slug collision problem entirely. Yes, it means occasionally appending numbers, using prefixes or rejecting slugs. It’s far better than ambiguous runtime behavior. Whatever you choose, document why using an Architecture Decision Record (ADR): Future developers (including future you) will thank you. We should treat URL structure as a public API contract. Like any API, URLs: URL structure decisions are cheapest and most flexible at the start of a project (ideally in week one) when a quick discussion can define the architecture with little cost. Once development begins, changes require some rework but are still manageable. After launch, however, even minor adjustments trigger a chain reaction involving redirects, SEO checks, and cache updates. By the time scaling issues appear, restructuring URLs becomes a complex, time-consuming, and costly endeavor. 
Before finalizing URLs, bring together the relevant stakeholders:
- Frontend team: What’s cleanest for users?
- Backend team: What can we resolve efficiently?
- DevOps: What are the caching implications?
- SEO/Marketing: What’s the measurable impact?

The takeaway: invest early in thoughtful URL design to avoid expensive fixes later.

The flat-URL ambiguity: a single slug could be any of:
- A product named “Leather Jacket”
- A category called “Leather Jacket”
- A blog post titled “Leather Jacket”
- A landing page for a “Leather Jacket” campaign
- Nothing at all (a 404)

The cost of resolving that ambiguity at runtime:
- Every valid page required 2 backend lookups (one fails, one succeeds)
- Every invalid URL triggered 2 backend lookups (both fail)
- Every crawler generated 2 database queries per attempt

Traffic assumptions:
- 100,000 page views per day
- 30% of traffic is bots/crawlers hitting invalid URLs
- Average API latency: 50ms per call

Resulting load:
- Daily backend calls: 100,000 × 2 = 200,000 requests
- Bot overhead: 30,000 × 2 = 60,000 wasted requests
- Added latency per request: 100ms (2 × 50ms)

Impact:
- Latency: P95 response times during peak traffic reached 800ms-1.2s
- Compute costs: Backend infrastructure costs were 40% higher than projected
- Vulnerability: During bot attacks or crawler storms, the 2x request multiplier meant we were essentially DDoSing ourselves

Warning signs that you have the same problem:
- Your application serves multiple entity types (products, categories, pages, posts) from flat URLs
- You’re using a headless/API-first architecture without unified URL resolution
- Your APM/monitoring shows 2+ similar backend queries per page request
- 404 pages have similar latency to valid pages (they shouldn’t)
- Bot traffic causes disproportionate backend load
- Backend autoscaling triggers don’t correlate with actual user traffic patterns

The modified URL resolver:
- Check cached categories first - If the slug matches a cached category, route directly to the category handler (in-memory lookup, ~1ms)
- Query products if not a category - Only make one API call to check if it’s a product
- Return 404 if neither - No entity found, render the 404 page

The server-side resolver is still used for:
- Direct URL access (user types URL or bookmarks)
- External links (social media, search engines, emails)
- First page load

Before optimization:
- Every request: 2 backend API calls
- Average response time: 800ms-1.2s (P95)
- Backend costs: 40% over projection

After optimization:
- Category pages (initial load): 0 additional backend calls (cached)
- Product pages (initial load): 1 backend call (50% reduction)
- Internal navigation: 0 server-side resolution (pure client-side)
- Average response time: 200-400ms (P95)
- Backend costs: reduced by ~35%

Why this solution worked for us:
- No URL changes - No redirects, no SEO impact, no user confusion
- Leveraged existing infrastructure - The category cache was already there
- Progressive enhancement - External/shared URLs still work perfectly (clean, no query params visible)
- Low implementation effort - Mostly frontend changes, minimal backend work
- Immediate impact - Deployed in one day. We didn’t have to wait for any changes to backend APIs.

Questions to answer about your backend:
- Does our backend provide unified URL resolution? (SEF tables, single endpoint)
- Can we query by slug across all entity types efficiently?
- What’s the database query cost for “does this slug exist?”

Decision matrix:
- Backend has SEF tables → Flat URLs viable
- Backend has separate endpoints → Structured URLs recommended
- Backend has no resolution → Build it or use structure

Infrastructure to budget for:
- Cache layer to store resolved slugs
- Additional backend capacity for resolution queries

Development time to budget for:
- Building resolution logic
- Maintaining slug uniqueness
- Implementing cache invalidation
- Debugging cache consistency issues

URLs, like any API:
- Define how clients (users, bots, search engines) interact with your system
- Create expectations that are expensive to break
- Have performance characteristics that affect the entire stack
- Must be versioned carefully (via redirects) if changed
- Should be designed with both current and future capabilities in mind

0 views

What Dynamic Typing Is For

Unplanned Obsolescence is a blog about writing maintainable, long-lasting software. It also frequently touts—or is, at the very least, not inherently hostile to—writing software in dynamically-typed programming languages. These two positions are somewhat at odds. Dynamically-typed languages encode less information. That’s a problem for the person reading the code and trying to figure out what it does. This is a simplified version of an authentication middleware that I include in most of my web services: it checks an HTTP request to see if it corresponds to a logged-in user’s session. Pretty straightforward stuff. The function gets a cookie from the HTTP request, checks the database to see if that token corresponds to a user, and then returns the user if it does. Line 2 fetches the cookie from the request, line 3 gets the user from the database, and the rest either returns the user or throws an error. There are, however, some problems with this. What happens if there’s no cookie included in the HTTP request? Will it return or an empty string? Will even exist if there are no cookies at all? There’s no way to know without looking at the implementation (or, less reliably, the documentation). That doesn’t mean there isn’t an answer! A request with no cookie will return . That results in a call, which returns (the function checks for that). is a falsy value in JavaScript, so the conditional evaluates to false and throws an . The code works and it’s very readable, but you have to do a fair amount of digging to ensure that it works reliably. That’s a cost that gets paid in the future, anytime the “missing token” code path needs to be understood or modified. That cost reduces the maintainability of the service. Unsurprisingly, the equivalent Rust code is much more explicit. In Rust, the tooling can answer a lot more questions for me. What type is ? A simple hover in any code editor with an LSP tells me, definitively, that it’s .
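Since the original listing isn’t reproduced in this excerpt, here is a hedged reconstruction of the JavaScript middleware being described. Names like `getUserByToken` and the cookie key are my guesses; only the undefined → null → falsy → throw chain follows the text.

```javascript
// Reconstruction, not the author's exact code.
async function getLoggedInUser(req, db) {
  const token = req.cookies["session-token"]; // undefined if the cookie is missing
  const user = await db.getUserByToken(token); // returns null for unknown/missing tokens
  if (!user) {
    // null and undefined are both falsy, so both paths land here.
    throw new Error("AuthError: no logged-in user");
  }
  return user;
}
```

Nothing in the signature tells you that a missing cookie is survivable; you have to trace the chain by hand.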
Because it’s Rust, you have to explicitly check if the token exists; ditto for whether the user exists. That’s better for the reader too: they don’t have to wonder whether certain edge cases are handled. Rust is not the only language with strict, static typing. At every place I’ve ever worked, the longest-running web services have all been written in Java. Java is not as good as Rust at forcing you to show your work and handle edge cases, but it’s much better than JavaScript. Putting aside the question of which one I prefer to write, if I find myself in charge of a production web service that someone else wrote, I would much prefer it to be in Java or Rust than JavaScript or Python. Conceding that, ceteris paribus , static typing is good for software maintainability, one of the reasons that I like dynamically-typed languages is that they encourage a style I find important for web services in particular: writing to the DSL. A DSL (domain-specific language) is a programming language designed for a specific problem area. This is in contrast to what we typically call “general-purpose programming languages” (e.g. Java, JavaScript, Python, Rust), which can reasonably be applied to most programming tasks. Most web services have to contend with at least three DSLs: HTML, CSS, and SQL. A web service with a JavaScript backend has to interface with, at a minimum , four programming languages: one general-purpose and three DSLs. If you have the audacity to use something other than JavaScript on the server, then that number goes up to five, because you still need JavaScript to augment HTML. That’s a lot of languages! How are we supposed to find developers who can do all this stuff ? The answer that a big chunk of the industry settled on is to build APIs so that the domains of the DSLs can be described in the general-purpose programming language. Instead of writing HTML… …you can write JSX, a JavaScript syntax extension that supports tags.
This has the important advantage of allowing you to include dynamic JavaScript expressions in your markup. And now we don’t have to kick out to another DSL to write web pages. Can we start abstracting away CSS too? Sure can! This example uses styled-components . This is a tactic I call “expanding the bounds” of the programming language. In an effort to reduce complexity, you try to make one language express everything about the project. In theory, this reduces the number of languages that one needs to learn to work on it. The problem is that it usually doesn’t work. Expressing DSLs in general-purpose programming syntax does not free you from having to understand the DSL—you can’t actually use styled-components without understanding CSS. So now a prospective developer has to both understand CSS and a new CSS syntax that only applies to the styled-components library. Not to mention, it is almost always a worse syntax. CSS is designed to make expressing declarative styles very easy, because that’s the only thing CSS has to do. Expressing this in JavaScript is naturally way clunkier. Plus, you’ve also tossed the web’s backwards compatibility guarantees. I picked styled-components because it’s very popular. If you built a website with styled-components in 2019 , didn’t think about the styles for a couple years, and then tried to upgrade it in 2023 , you would be two major versions behind. Good luck with the migration guide . CSS files, on the other hand, are evergreen . Of course, one of the reasons for introducing JSX or CSS-in-JS is that they add functionality, like dynamic population of values. That’s an important problem, but I prefer a different solution. Instead of expanding the bounds of the general-purpose language so that it can express everything, another strategy is to build strong and simple API boundaries between the DSLs. Some benefits of this approach include: The following example uses a JavaScript backend. 
A lot of enthusiasm for htmx (the software library I co-maintain) is driven by communities like Django and Spring Boot developers, who are thrilled to no longer be bolting on a JavaScript frontend to their website; that’s a core value proposition for hypermedia-driven development . I happen to like JavaScript though, and sometimes write services in NodeJS, so, at least in theory, I could still use JSX if I wanted to. What I prefer, and what I encourage hypermedia-curious NodeJS developers to do, is use a template engine . This bit of production code I wrote for an events company uses Nunjucks , a template engine I once (fondly!) called “abandonware” on stage . Other libraries that support Jinja -like syntax are available in pretty much any programming language. This is just HTML with basic loops ( ) and data access ( ). I get very frustrated when something that is easy in HTML is hard to do because I’m using some wrapper with inferior semantics; with templates, I can dynamically build content for HTML without abstracting it away. Populating this template in JavaScript is so easy . You just give it a JavaScript object with an field. That’s not particularly special on its own—many languages support serialized key-value pairs. This strategy really shines when you start stringing it together with SQL. Let’s replace that database function call with an actual query, using an interface similar to . I know the above code is not everybody’s taste, but I think it’s marvelous. You get to write all parts of the application in the language best suited to each: HTML for the frontend and SQL for the queries. And if you need to do any additional logic between the database and the template, JavaScript is still right there. One result of this style is that it increases the percentage of your service that is specified declaratively. The database schema and query are declarative, as is the HTML template. 
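The shape of that style can be sketched with stand-in helpers. These are illustrative stubs, not the production Nunjucks/SQLite code: in the real service `db.query` would be a prepared SQL statement and `renderEvents` a Nunjucks template render.

```javascript
// Stand-in for a prepared SQL statement's .all(): returns rows for the query.
const db = {
  query: (sql) => [{ name: "Spring Gala" }, { name: "Summer Social" }],
};

// Stand-in for rendering an HTML template with a context object.
function renderEvents(events) {
  return `<ul>${events.map((e) => `<li>${e.name}</li>`).join("")}</ul>`;
}

function getEventsPage() {
  // The only imperative glue: run the query, feed the rows to the template.
  const events = db.query("SELECT name FROM events ORDER BY date");
  return renderEvents(events);
}
```

The frontend stays in HTML, the query stays in SQL, and the general-purpose language shrinks to two statements of glue.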
The only imperative code in the function is the glue that moves that query result into the template: two statements in total. Debugging is also dramatically easier. I typically do two quick things to narrow down the location of the bug: Those two steps are easy, can be done in production with no deployments, and provide excellent signal on the location of the error. Fundamentally, what’s happening here is a quick check at the two hard boundaries of the system: the one between the server and the client, and the one between the client and the database. Similar tools are available to you if you abstract over those layers, but their usefulness is diminished. Every web service has network requests that can be inspected, but putting most frontend logic in the template means that the HTTP response’s data (“does the date ever get sent to the frontend?”) and functionality (“does the date get displayed in the right HTML element?”) can be inspected in one place, with one keystroke. Every database can be queried, but using the database’s native query language in your server means you can validate both the stored data (“did the value get saved?”) and the query (“does the code ask for the right value?”) independent of the application. By pushing so much of the business logic outside the general-purpose programming language, you reduce the likelihood that a bug will exist in the place where it is hardest to track down—runtime server logic. You’d rather the bug be a malformatted SQL query or HTML template, because those are easy to find and easy to fix. When combined with the router-driven style described in Building The Hundred-Year Web Service , you get simple and debuggable web systems. Each HTTP request is a relatively isolated function call: it takes some parameters, runs an SQL query, and returns some HTML.
In essence, dynamically-typed languages help you write the least amount of server code possible, leaning heavily on the DSLs that define web programming while validating small amounts of server code via means other than static type checking. To finish, let’s take a look at the equivalent code in Rust, using rusqlite , minijinja , and a quasi-hypothetical server implementation: I am again obfuscating some implementation details (Are we storing human-readable dates in the database? What’s that universal result type?). The important part is that this blows. Most of the complexity comes from the need to tell Rust exactly how to unpack that SQL result into a typed data structure, and then into an HTML template. The struct is declared so that Rust knows to expect a for . The derive macros create a representation that minijinja knows how to serialize. It’s tedious. Worse, after all that work, the compiler still doesn’t do the most useful thing: check whether is the correct type for . If it turns out that can’t be represented as a (maybe it’s a blob ), the query will compile correctly and then fail at runtime. From a safety standpoint, we’re not really in a much better spot than we were with JavaScript: we don’t know if it works until we run the code. Speaking of JavaScript, remember that code? That was great! Now we have no idea what any of these types are, but if we run the code and we see some output, it’s probably fine. By writing the JavaScript version, you are banking that you’ve made the code so highly auditable by hand that the compile-time checks become less necessary. In the long run, this is always a bad bet, but at least I’m not writing 150% more code for 10% more compile-time safety. The “expand the bounds” solution to this is to pull everything into the language’s type system: the database schema, the template engine, everything. Many have trod that path; I believe it leads to madness (and toolchain lock-in). Is there a better one? I believe there is.
The compiler should understand the DSLs I’m writing and automatically map them to types it understands. If it needs more information—like a database schema—to figure that out, that information can be provided. Queries correspond to columns with known types—the programming language can infer that is of type . HTML has context-dependent escaping rules —the programming language can validate that is being used in a valid element and escape it correctly. With this functionality in the compiler, if I make a database migration that would render my usage of a dependent variable in my HTML template invalid, the compiler will show an error. All without losing the advantages of writing the expressive, interoperable, and backwards-compatible DSLs that comprise web development. Dynamically-typed languages show us how easy web development can be when we ditch the unnecessary abstractions. Now we need tooling to make it just as easy in statically-typed languages too. Thanks to Meghan Denny for her feedback on a draft of this blog.

The benefits of this approach, listed earlier:
- DSLs are better at expressing their domain, resulting in simpler code
- It aids debugging by segmenting bugs into natural categories
- The skills gained by writing DSLs are more transferable

The two quick debugging steps:
- CMD+U to View Source - If the missing data is in the HTML, it’s a frontend problem
- Run the query in the database - If the missing data is in the SQL, it’s a problem with the GET route

Language extensions that just translate the syntax are alright by me, like generating HTML with s-expressions , ocaml functions , or zig comptime functions . I tend to end up just using templates, but language-native HTML syntax can be done tastefully, and they are probably helpful on the road to achieving the DX I’m describing; I’ve never seen them done well for SQL. Sqlx and sqlc seem to have the right idea, but I haven’t used either because I want to stick to SQLite-specific libraries to avoid async database calls.
I don’t know as much about compilers as I’d like to, so I have no idea what kind of infrastructure would be required to make this work with existing languages in an extensible way. I assume it would be hard.

0 views
Justin Duke 1 months ago

Adding imports to the Django shell

I was excited to finally remove from my file when 5.2 dropped because they added support for automatic model import. However, I found myself missing one other little escape hatch that exposed, which was the ability to import other arbitrary modules into the namespace. Django explains how to bring in modules without a namespace , but I wanted to be able to inoculate my shell, since most of my modules follow a similar structure (exposing a single function). It took the bare minimum of sleuthing to figure out how to hack this in for myself, and now here I am to share that sleuthing with you. Behold, a code snippet that is hopefully self-explanatory:

0 views
NorikiTech 1 months ago

A Rust web service

Part of the “ Rustober ” series. A web service in Rust is much more than it seems. Whereas in Go you can just and , in Rust you have to bring your own async runtime. I remember when async was very new in Rust, and apparently since then the project leadership never settled on a specific runtime. Now you can pick and choose (good), but also you have to pick and choose (not so good). In practical, real-world terms it means you use Tokio . And . And . And . And . I think that’s it? My default intention is to use as few dependencies as possible — preferably none. For Typingvania I wrote the engine from scratch in C and the only library I ship with the game is SDL . DDPub that runs this website only really needs Markdown support. However, with Rust I want to try something different and commit to the existing ecosystem. I expect there will be plenty of challenges as it is. So how does all of the above relate to each other and why is it necessary? In short, as far as I understand it now: Because all of the libraries are so integrated with each other, it enables (hopefully) a fairly ergonomic experience where I don’t have to painstakingly plug things together. They are all mature libraries tested in production, so even though it’s a lot of API surface, I consider it a worthwhile time investment: my project will work, and more likely than not I’ll have to work with them again if/when I get to write Rust in a commercial setting — or just work on more side projects. , being the source (all three of the others are subprojects) provides the runtime I mentioned above and defines how everything else on top of it works. provides the trait that defines how data flows around, with abstractions and utilities around it, but is protocol-agnostic. Here’s the trait: implements HTTP on top of ’s trait, but as a building block, not as a framework. is a framework for building web applications — an abstraction over and . 
Effectively, I will be using mostly facilities implemented using the other three libraries. Finally, is an asynchronous SQL toolkit using directly.

0 views

Functional Threading “Macros”

Read on the website: Threading macros make Lisp-family languages much more readable. Other languages too, potentially! Except… other languages don’t have macros. How do we go about enabling threading “macros” there?
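One common macro-free answer, sketched here in JavaScript rather than any particular library: a plain function that folds a value through a pipeline of unary functions, emulating Clojure’s thread-first `->` at runtime instead of at macro-expansion time.

```javascript
// Feed `value` through each function in turn, left to right.
const thread = (value, ...fns) => fns.reduce((acc, fn) => fn(acc), value);

// Roughly (-> 5 inc double str) in Lisp terms:
const result = thread(
  5,
  (n) => n + 1,      // 6
  (n) => n * 2,      // 12
  (n) => String(n)   // "12"
);
```

The trade-off versus a real macro is that every stage must be a unary function; there is no implicit insertion into an arbitrary argument position.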

0 views

Consistent hashing

As a concrete experiment, I ran a similar simulation to the one above, but with 10 virtual nodes per node. We'll consider the total portion of the circle mapped to a node when it maps to any of its virtual nodes. While the average remains 18 degrees, the variance is reduced drastically - the smallest one is 11 degrees and the largest 26. You can find the code for these experiments in the demo.go file of the source code repository. This is close to the original motivation for the development of consistent hashing by researchers at MIT. Their work, described in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web" became the foundation of Akamai Technologies . There are other use cases of consistent hashing in distributed systems. AWS popularized one common application in their paper "Dynamo: Amazon’s Highly Available Key-value Store" , where the algorithm is used to distribute storage keys across servers.
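The article’s experiments live in demo.go; as a language-neutral illustration, here is a toy hash ring with virtual nodes in JavaScript. The FNV-1a hash and the node names are stand-ins, not what the article uses.

```javascript
// Toy 32-bit FNV-1a hash; a real ring would use a stronger hash.
function fnv1a(s) {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class HashRing {
  constructor(nodes, vnodes = 10) {
    // Each physical node appears `vnodes` times on the ring,
    // which is what smooths out the variance described above.
    this.ring = [];
    for (const node of nodes) {
      for (let v = 0; v < vnodes; v++) {
        this.ring.push({ point: fnv1a(`${node}#${v}`), node });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // A key maps to the first virtual node clockwise from its hash.
  lookup(key) {
    const h = fnv1a(key);
    const entry = this.ring.find((e) => e.point >= h) ?? this.ring[0];
    return entry.node;
  }
}
```

Adding or removing a physical node only moves the keys adjacent to its virtual nodes, which is the property the MIT work and Dynamo both rely on.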

1 views
David Bushell 2 months ago

I Shut The Emails Out

Last week I let the emails in by self-hosting an unencrypted SMTP server on port 25. That was a “fun” challenge but left me feeling exposed. I could host elsewhere but most reputable hosting providers block SMTP ports. I found another answer. I know. Bleh! Sellout to Big Tech why don’t I? I don’t see you opening ports into your home. Anyway, Cloudflare has this thing called Email Routing and Email Workers . Cloudflare handles the SMTP and verification of incoming emails. Emails can be forwarded elsewhere or handled by a worker. Catch-all can be configured so it’s wise to do your own address validation. My worker takes the following actions: I’m locked in to Cloudflare now for this side project. Might as well use the full suite. The basic worker script is simple. There is one issue I found. claims to be type: And claims to take type: — amongst others. In reality the email API does not play nicely with the storage API. My worker timed out after 5 minutes with an error. “Provided readable stream must have a known length (request/response body or readable half of FixedLengthStream)” That is why I’m using which I yoinked from @std/streams . I’d rather stream directly but this was a quick fix. Since I have a one megabyte limit it’s not a problem. My original idea was to generate an RSS feed for my Croissant web app à la Kill the Newsletter (which I could just use instead of reinventing…) HTML emails though are a special class of disaster. Semantics and accessibility, anyone? No? Table layouts from the ’90s? Oh, okay… that’s another blog post. Actually hold up. Do we just lose all respect for accessibility when coding HTML emails? Apparently so. I built an HTML email once early in my career and I refused to do it ever again. Here’s a screenshot of the web UI I was already working on to read emails. I’m employing similar tricks I learnt when sanitising RSS feeds . This time I allow inline styles. I remove scripts and images (no thanks). 
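The worker’s gatekeeping can be sketched as a pure function. This is illustrative only: the real code runs inside Cloudflare’s email worker entry point, and the address list here is a stand-in. The one-megabyte limit is the one mentioned in the post.

```javascript
// Stand-in allow-list; catch-all routing is on, so we must validate ourselves.
const MAX_SIZE = 1024 * 1024; // one megabyte
const knownAddresses = new Set(["newsletters@example.com"]);

function checkEmail(to, rawSize) {
  if (!knownAddresses.has(to)) return { ok: false, reason: "unknown address" };
  if (rawSize > MAX_SIZE) return { ok: false, reason: "too large" };
  // Accepted: the worker would now buffer the body and write it to R2
  // along with metadata (sender, subject, received date).
  return { ok: true };
}
```

Keeping the checks as a pure function also makes them trivially testable outside the worker runtime.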
Content security policy is very effective at blocking any tracking attempts that might sneak through. I have a second proxy worker that receives a URL and resolves any HTTP redirects to return the real URL. For good measure, tracking params like are removed at every step. Email truly is the cesspit of the internet. Dreadful code. Ruthless tracking. Why am I doing this again? Most of these newsletters have an RSS and web version available. I can’t believe I let this stuff into my home! Come to think of it, maybe I can pass the plain text versions through a Markdown parser? Thanks for reading! Follow me on Mastodon and Bluesky . Subscribe to my Blog and Notes or Combined feeds.

The worker’s actions:
- validate the address
- reject emails larger than one megabyte
- save to an R2 Storage bucket with metadata

0 views
David Bushell 2 months ago

I Let The Emails In

They say never operate your own email server. It’s all fine and dandy until Google et al. arbitrarily ban your IP address. Doesn’t matter if you configure DMARC, DKIM, and SPF — straight to jail. But that only applies to sending email . I think. I’m testing that theory. How easy is it to receive emails? Turns out it’s almost too easy. I coded an SMTP server in TypeScript. I added a couple of DNS records on a spare domain (redacted for obvious reasons). I rawdogged port 25 on my public IP address. I sent myself a test email from Gmail and it worked! Join me on the adventure of how I got this far. I’m using Deno flavoured TypeScript and below is the basic wrapper. I’ve simplified the example code below to illustrate specific concepts. Open the TCP server and pass off connections to an async function. The handler immediately responds in plain text. Wikipedia has a good example of a full message exchange. Only the number codes really matter. Then it reads buffered data until the connection closes. Commands are ASCII ending with the carriage return, line feed combo. I get a little fancy with the so that it throws an error on malformed text. Later I decided that giving unbridled to a 3rd-party was not a smart move. I added a couple of protections: By the way, this exact code never throws because the main thread is blocked. The abort signal task is never executed. Replacing the placeholder comment with an to read data unblocks the event loop. Handling commands is easy if you’re careless. I don’t even bother to parse commands properly. (Note to self: do a proper job.) If there is any command I don’t recognise I close the connection immediately. It was at this stage in my journey that I learnt of the command. The STARTTLS keyword is used to tell the SMTP client that the SMTP server is currently able to negotiate the use of TLS. It takes no parameters. This is supposed to be included as part of the response to . 
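A deliberately careless command dispatcher in the spirit the post describes can look like this. This is my sketch, not the author’s code: match on the verb, return a numeric reply, and signal a close on anything unrecognised. The real server also handles STARTTLS negotiation, DATA payloads, timeouts, and size limits.

```javascript
// Returns the reply line and whether the connection should be closed.
function handleCommand(line) {
  const verb = line.trim().split(" ")[0].toUpperCase();
  switch (verb) {
    case "HELO":
    case "EHLO":
      return { reply: "250 Hello", close: false };
    case "MAIL":
    case "RCPT":
      return { reply: "250 OK", close: false };
    case "DATA":
      return { reply: "354 Start mail input; end with <CRLF>.<CRLF>", close: false };
    case "QUIT":
      return { reply: "221 Bye", close: true };
    default:
      // Unknown command: close the connection immediately.
      return { reply: "500 Command unrecognised", close: true };
  }
}
```

Only the numeric codes really matter to the peer; the text after them is free-form.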
It’s worth noting at this point I’ve tested nothing in the wild. Had I tested I would have saved myself days of work. I found Deno’s function which looked ideal. But no, this only works from the client’s perspective ( issue #18451 ). One does not simply code the TLS handshake. (Some time later I found Mat’s @typemail/smtp — this looks much easier in Node!) It’s possible for an SMTP server to listen securely on port 465 with TLS by default. Deno has to replace . Say no more! Side quest: code an ACME library Side quest status: success! So after that 48 hour side quest I now have a TLS certificate. Which is useless because mail servers deliver to each other on port 25 unencrypted before upgrading with and I’m still blocked there. It’s confusing. Clients can connect directly over TLS to post emails (I think). Whatever, the only way to know for sure is to test in production. And this brings me back to the screenshot above. I opened the firewall on my router and let the emails in. And guess what? Google et al. don’t give a hoot about privacy! Even my beloved Proton will happily send unencrypted plain text emails. Barely compliant and poorly configured server held together by statements and a dream? Take the email! My server is suspect af and yet they hand off emails no sweat. Not their problem. If I tried to send email that’d be another story. For my project I’m just collecting email newsletters; did I mention that? We’ll see if they continue to deliver. If you have a port open on a public IP address you will be found . Especially if it’s a known port like 25. There are bots that literally scan every port of every IP. I log all messages in and out of my SMTP server. I use the free IPinfo.io service to do my own snooping. Here is an example of Google stopping by for a cup of tea. I decided it was best to block all connections from outside NA and EU. For my purposes those would be very unlikely.
This one looked interesting: Sorry for the lack of hospitality :( When it’s running, my SMTP server is inside a container on a dedicated machine that is fire-walled off from the LAN. I won’t provide exact schematics because that would only highlight weaknesses in my setup. I’d prefer not to be hacked into oblivion. My server validates SPF and DKIM signatures of any email it receives. RFC 6376 was a formidable foe that had me close to tears. I know I don’t need to code all this myself, but where’s the fun in that? I’m throwing away emails that have malformed encoding. In this case parsing Quoted-Printable and MIME Words formats myself did not look fun. I found Mat’s lettercoder package that does a perfect job. I added a concurrent connection limit and rate limiter too. @ me if I’m missing another trick. The plan is to keep the SMTP server live and collect sample data. I want to know how feasible it is to run. I’m collecting email newsletters with the idea of designing a dedicated reader. I dislike newsletters in my inbox. This may be integrated into my Croissant RSS app. Of course, Kill the Newsletter! can do that job already. If it proves to be too much hassle I’ll slam the door on port 25. Does anybody know a hosting provider that allows port 25? I was going to use a Digital Ocean droplet for this task but that’s blocked. Update: one week later… I shut the emails out! Thanks for reading! Follow me on Mastodon and Bluesky . Subscribe to my Blog and Notes or Combined feeds.

The two protections mentioned above:
- A generous 30 second timeout
- A maximum 1 MB message size

0 views
Max Bernstein 3 months ago

Linear scan register allocation on SSA

Much of the code and education that resulted in this post happened with Aaron Patterson . The fundamental problem in register allocation is to take an IR that uses virtual registers (as many as you like) and rewrite it to use a finite number of physical registers and stack space 1 . This is an example of a code snippet using virtual registers: And here is the same example after it has been passed through a register allocator (note that Rs changed to Ps): Each virtual register was assigned a physical place: R1 to the stack, R2 to P0, R3 to P1, and R4 also to P0 (since we weren’t using R2 anymore). People use register allocators like they use garbage collectors: it’s an abstraction that can manage your resources for you, maybe with some cost. When writing the back-end of a compiler, it’s probably much easier to have a separate register-allocator-in-a-box than manually managing variable lifetimes while also considering all of your different target architectures. How do JIT compilers do register allocation? Well, “everyone knows” that “every JIT does its own variant of linear scan” 2 . This bothered me for some time because I’ve worked on a couple of JITs and still didn’t understand the backend bits. There are a couple different approaches to register allocation, but in this post we’ll focus on linear scan of SSA . I started reading Linear Scan Register Allocation on SSA Form (PDF, 2010) by Wimmer and Franz after writing A catalog of ways to generate SSA . Reading alone didn’t make a ton of sense—I ended up with a lot of very frustrated margin notes. I started trying to implement it alongside the paper. As it turns out, though, there is a rich history of papers in this area that it leans on really heavily. I needed to follow the chain of references! For example, here is a lovely explanation of the process, start to finish, from Christian Wimmer’s Master’s thesis (PDF, 2004). There it is, all laid out at once.
It’s very refreshing when compared to all of the compact research papers. I didn’t realize that there were more than one or two papers on linear scan. So this post will also incidentally serve as a bit of a survey or a history of linear scan—as best as I can figure it out, anyway. If you were in or near the room where it happened, please feel free to reach out and correct some parts. Throughout this post, we’ll use an example SSA code snippet from Wimmer2010, adapted from phi-SSA to block-argument-SSA. Wimmer2010’s code snippet is between the arrows and we add some filler (as alluded to in the paper): Virtual registers start with R and are defined either with an arrow or by a block parameter. Because it takes a moment to untangle the unfamiliar syntax and draw the control-flow graph by hand, I’ve also provided the same code in graphical form. Block names (and block parameters) are shaded with grey. We have one entry block, , that is implied in Wimmer2010. Its only job is to define and for the rest of the CFG. Then we have a loop between and with an implicit fallthrough. Instead of doing that, we generate a conditional branch with explicit jump targets. This makes it possible to re-order blocks as much as we like. The contents of are also just to fill in the blanks from Wimmer2010 and add some variable uses. Our goal for the post is to analyze this CFG, assign physical locations (registers or stack slots) to each virtual register, and then rewrite the code appropriately. For now, let’s rewind the clock and look at how linear scan came about. Linear scan register allocation (LSRA) has been around for a while. It’s neat because it does the actual register assignment part of register allocation in one pass over your low-level IR. (We’ll talk more about what that means in a minute.) It first appeared in the literature in tcc: A System for Fast, Flexible, and High-level Dynamic Code Generation (PDF, 1997) by Poletto, Engler, and Kaashoek.
(Until writing this post, I had never seen this paper. It was only on a re-read of the 1999 paper (below) that I noticed it.) In this paper, they mostly describe a staged variant of C called ‘C (TickC), for which a fast register allocator is quite useful. Then came a paper called Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. It adds some optimizations (lifetime holes, binpacking) to the algorithm presented in Poletto1997. Then came the first paper I read, and I think the paper everyone refers to when they talk about linear scan: Linear Scan Register Allocation (PDF, 1999) by Poletto and Sarkar. In this paper, they give a fast alternative to graph coloring register allocation, especially motivated by just-in-time compilers. In retrospect, it seems to be a bit of a rehash of the previous two papers. Linear scan (1997, 1999) operates on live ranges instead of virtual registers. A live range is a pair of integers [start, end) (end is exclusive) that begins when the register is defined and ends when it is last used. This means that there is an assumption that the order for instructions in your program has already been fixed into a single linear sequence! It also means that you have given each instruction a number that represents its position in that order. This may or may not be a surprising requirement depending on your compiler background. It was surprising to me because I often live in control flow graph fantasy land where blocks are unordered and instructions sometimes float around. But if you live in a land of basic blocks that are already in reverse post order, then it may be less surprising. In non-SSA-land, these live ranges are different from the virtual registers: they represent the lifetime of each version of a virtual register. For example, consider the following code snippet: There are two definitions of and they each live for different amounts of time: In fact, the ranges are completely disjoint.
It wouldn’t make sense for the register allocator to consider variables, because there’s no reason the two s should necessarily live in the same physical register. In SSA land, it’s a little different: since each virtual register only has one definition (by, uh, definition), live ranges are an exact 1:1 mapping with virtual registers. We’ll focus on SSA for the remainder of the post because this is what I am currently interested in. The research community seems to have decided that allocating directly on SSA gives more information to the register allocator 3 . Linear scan starts at the point in your compiler process where you already know these live ranges—that is, you have already done some kind of analysis to build a mapping. In this blog post, we’re going to back up to the point where we’ve just built our SSA low-level IR and have yet to do any work on it. We’ll do all of the analysis from scratch. Part of this analysis is called liveness analysis. The result of liveness analysis is a mapping of that tells you which virtual registers (remember, since we’re in SSA, instruction==vreg) are alive (used later) at the beginning of the basic block. This is called a live-in set. For example: We compute liveness by working backwards: a variable is live from the moment it is backwardly-first used until its definition. In this case, at the end of B2, nothing is live. If we step backwards to the , we see a use: R16 becomes live. If we step once more, we see its definition—R16 is no longer live—but now we see a use of R14 and R15, which become live. This leaves us with R14 and R15 being live-in to B2. This live-in set becomes B1’s live-out set because B1 is B2’s predecessor. We continue in B1. We could continue backwards linearly through the blocks. In fact, I encourage you to do it as an exercise. You should have a (potentially empty) set of registers per basic block.
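The single-block backward walk just described can be sketched in a few lines of Ruby. The `[defs, uses]` instruction shape is made up for the sketch; the real IR has proper instruction objects.

```ruby
require "set"

# A backward walk over one block: start from the block's live-out set
# and step over each instruction in reverse.
def block_live_in(instructions, live_out)
  live = Set.new(live_out)
  instructions.reverse_each do |defs, uses|
    live -= defs  # the definition kills liveness above this point
    live |= uses  # a use makes the register live above this point
  end
  live
end

# B2 from the example: R16 is defined from R14 and R15, then used.
b2 = [
  [["R16"], ["R14", "R15"]],
  [[],      ["R16"]],
]
block_live_in(b2, [])  # R14 and R15 end up live-in to B2
```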
It gets more interesting, though, when we have branches: what does it mean when two blocks’ live-in results merge into their shared predecessor? If we have two blocks A and B that are successors of a block C, the live-in sets get unioned together. That is, if there were some register R0 live-in to B and some register R1 live-in to A, both R0 and R1 would be live-out of C. They may also be live-in to C, but that entirely depends on the contents of C. Since the total number of virtual registers is nonnegative and is finite for a given program, it seems like a good lattice for an abstract interpreter. That’s right, we’re doing AI. In this liveness analysis, we’ll: We store gen, kill, and live-in sets as bitsets, using some APIs conveniently available on Ruby’s Integer class. Like most abstract interpretations, it doesn’t matter what order we iterate over the collection of basic blocks for correctness, but it does matter for performance. In this case, iterating backwards ( ) converges much faster than forwards ( ): We could also use a worklist here, and it would be faster, but eh. Repeatedly iterating over all blocks is fine for now. The Wimmer2010 paper skips this liveness analysis entirely by assuming some computed information about your CFG: where loops start and end. It also requires all loop blocks be contiguous. Then it makes variables defined before a loop and used at any point inside the loop live for the whole loop. By having this information available, it folds the liveness analysis into the live range building, which we’ll instead do separately in a moment. The Wimmer approach sounded complicated and finicky. Maybe it is, maybe it isn’t. So I went with a dataflow liveness analysis instead. If it turns out to be the slow part, maybe it will matter enough to learn about this loop tagging method. For now, we will pick a schedule for the control-flow graph.
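As a sketch of the fixpoint dataflow, here is the gen/kill iteration with Integer bitsets (bit i set meaning virtual register i is live). The `Block` shape and the tiny two-block example are illustrative, not the post’s real IR.

```ruby
# Liveness as a backward dataflow fixpoint over Integer bitsets.
Block = Struct.new(:name, :succs, :gen, :kill)

def compute_live_in(blocks)
  live_in = Hash.new(0)
  changed = true
  while changed
    changed = false
    blocks.reverse_each do |b|  # iterating backwards converges faster
      live_out = b.succs.reduce(0) { |acc, s| acc | live_in[s] }
      new_in = b.gen | (live_out & ~b.kill)
      next if new_in == live_in[b.name]
      live_in[b.name] = new_in
      changed = true
    end
  end
  live_in
end

blocks = [
  Block.new(:b0, [:b1], 0b00, 0b01),  # b0 defines vreg 0
  Block.new(:b1, [],    0b01, 0b00),  # b1 uses vreg 0
]
compute_live_in(blocks)  # vreg 0 is live-in to b1 but not to b0
```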
In order to build live ranges, you have to have some kind of numbering system for your instructions, otherwise a live range’s start and end are meaningless. We can write a function that fixes a particular block order (in this case, reverse post-order) and then assigns each block and instruction a number in a linear sequence. You can think of this as flattening or projecting the graph: A couple of interesting things to note: Even though we have extra instructions, it looks very similar to the example in the Wimmer2010 paper. Since we’re not going to be messing with the order of the instructions within a block anymore, all we have to do going forward is make sure that we iterate through the blocks in . Finally, we have all that we need to compute live ranges. We’ll more or less copy the algorithm to compute live ranges from the Wimmer2010 paper. We’ll have two main differences: I know I said we were going to be computing live ranges. So why am I presenting you with a function called ? That’s because early in the history of linear scan (Traub1998!), people moved from having a single range for a particular virtual register to having multiple disjoint ranges. This collection of multiple ranges is called an interval and it exists to free up registers in the context of branches. For example, in our IR snippet (above), R12 is defined in B2 as a block parameter, used in B3, and then not used again until some indeterminate point in B4. (Our example uses it immediately in an add instruction to keep things short, but pretend the second use is some time away.) The Wimmer2010 paper creates a lifetime hole between 28 and 34, meaning that the interval for R12 (called i12) is . Interval holes are not strictly necessary—they exist to generate better code. So for this post, we’re going to start simple and assume 1 interval == 1 range. We may come back later and add additional ranges, but that will require some fixes to our later implementation.
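Going back to the numbering step for a moment, a sketch of it might look like this. Stepping by two follows Wimmer2010’s convention (which leaves odd slots free for moves inserted later); the nested-array block shape is made up.

```ruby
# Assign each instruction a number in a fixed linear order. Blocks are
# assumed to already be in reverse post-order.
def number_instructions(blocks)
  numbers = {}
  n = 0
  blocks.each do |block|
    block.each do |insn|
      numbers[insn] = n
      n += 2  # even numbers only, leaving room for inserted moves
    end
  end
  numbers
end

number_instructions([["a", "b"], ["c"]])
# => {"a"=>0, "b"=>2, "c"=>4}
```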
We’ll note where we think those fixes should happen. Anyway, here is the mostly-copied annotated implementation of BuildIntervals from the Wimmer2010 paper: Another difference is that since we’re using block parameters, we don’t really have this thing. That’s just the block argument. The last difference is that since we’re skipping the loop liveness hack, we don’t modify a block’s set as we iterate through instructions. I know we said we’re building live ranges, so our class only has one on it. This is Ruby’s built-in range, but it’s really just being used as a tuple of integers here. Note that there’s some implicit behavior happening here: For example, if we have and someone calls , we end up with . There’s no gap in the middle. And if we have and someone calls , we end up with . After figuring out from scratch some of these assumptions about what the interval/range API should and should not do, Aaron and I realized that there was some actual code for in a different, earlier paper: Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer. Unfortunately, many other versions of this PDF look absolutely horrible (like bad OCR) and I had to do some digging to find the version linked above. Finally, we can start thinking about doing some actual register assignment. Let’s return to the 90s. Because we have faithfully kept 1 interval == 1 range, we can re-use the linear scan algorithm from Poletto1999 (which looks, at a glance, to be the same as 1997). I recommend looking at the PDF side by side with the code. We have tried to keep the structure very similar. Note that unlike in many programming languages these days, in the algorithm description represents a set, not a (hash-)map. In our Ruby code, we represent as an array: Internalizing this took us a bit. It is mostly a three-state machine: We would like to come back to this and incrementally modify it as we add lifetime holes to intervals.
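To recap that implicit merge behavior, here is a minimal single-Range Interval sketch. The method names follow the paper’s addRange/setFrom pseudocode; the merge-instead-of-hole behavior (widening to cover, never leaving a gap) is the point, and the example numbers are made up.

```ruby
# A single-Range Interval matching the "1 interval == 1 range"
# simplification: add_range only ever widens the one Range.
class Interval
  attr_reader :range

  def initialize
    @range = nil
  end

  def add_range(from, to)
    @range = if @range.nil?
               from...to
             else
               # Would-be holes get swallowed: widen to cover both.
               [@range.begin, from].min...[@range.end, to].max
             end
  end

  def set_from(from)
    # Trim the start once we reach the definition while walking backwards.
    @range = from...(@range&.end || from)
  end
end

i = Interval.new
i.add_range(30, 40)
i.add_range(10, 20)  # would leave a hole; we widen instead
i.range  # => 10...40
```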
I finally understood, very late in the game, that Poletto1999 linear scan assigns only one location per virtual register. Ever. It’s not that every virtual register gets a shot in a register and then gets moved to a stack slot—that would be interval splitting and hopefully we get to that later—if a register gets spilled, it’s in a stack slot from beginning to end. I only found this out accidentally after trying to figure out a bug (that wasn’t a bug) due to a lovely sentence in Optimized Interval Splitting in a Linear Scan Register Allocator (PDF, 2005) by Wimmer and Mössenböck: However, it cannot deal with lifetime holes and does not split intervals, so an interval has either a register assigned for the whole lifetime, or it is spilled completely. In particular, it is not possible to implement the algorithm without reserving a scratch register: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately. Let’s take a look at the code snippet again. Here it is before register allocation, using virtual registers: Let’s run it through register allocation with incrementally decreasing numbers of physical registers available. We get the following assignments: Some other things to note: If you have a register free, choosing which register to allocate is a heuristic! It is tunable. There is probably some research out there that explores the space. In fact, you might even consider not allocating a register greedily. What might that look like? I have no idea. Spilling the interval with the furthest endpoint is a heuristic! You can pick any active interval you want. In Register Spilling and Live-Range Splitting for SSA-Form Programs (PDF, 2009) by Braun and Hack, for example, they present the MIN algorithm, which spills the interval with the furthest next use.
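The whole Poletto-style loop, including the furthest-endpoint spill heuristic, can be sketched like so. The interval shape (a hash), the register names, and the stack-slot names are all made up for the sketch.

```ruby
# A sketch of the Poletto1999 LinearScanRegisterAllocation loop.
def linear_scan(intervals, num_regs)
  free = (0...num_regs).map { |i| "P#{i}" }
  active = []   # the "active" set, kept sorted by increasing endpoint
  location = {}
  intervals.sort_by { |ivl| ivl[:start] }.each do |interval|
    # ExpireOldIntervals: free registers whose intervals have ended.
    active.reject! do |old|
      next false if old[:stop] > interval[:start]
      free.push(location[old[:vreg]])
      true
    end
    if active.size == num_regs
      # SpillAtInterval: heuristically spill the furthest endpoint.
      spill = active.last
      if spill[:stop] > interval[:stop]
        location[interval[:vreg]] = location[spill[:vreg]]
        location[spill[:vreg]] = "stack_#{spill[:vreg]}"
        active.delete(spill)
        active << interval
        active.sort_by! { |ivl| ivl[:stop] }
      else
        location[interval[:vreg]] = "stack_#{interval[:vreg]}"
      end
    else
      location[interval[:vreg]] = free.pop
      active << interval
      active.sort_by! { |ivl| ivl[:stop] }
    end
  end
  location
end

intervals = [
  { vreg: :a, start: 0, stop: 10 },
  { vreg: :b, start: 2, stop: 4 },
  { vreg: :c, start: 6, stop: 8 },
]
linear_scan(intervals, 1)  # :a spills for its whole lifetime; :b and :c share P0
```

Note how this exhibits the one-location-ever behavior: once `:a` is spilled, it lives in its stack slot from beginning to end.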
This requires slightly more information and takes slightly more time than the default heuristic but apparently generates much better code. Also, block ordering? You guessed it. Heuristic. Here is an example “slideshow” I generated by running linear scan with 2 registers. Use the arrow keys to navigate forward and backward in time 4 . At this point we have register assignments: we have a hash table mapping intervals to physical locations. That’s great but we’re still in SSA form: labelled code regions don’t have block arguments in hardware. We need to write some code to take us out of SSA and into the real world. We can use a modified Wimmer2010 as a great starting point here. It handles more than we need to right now—interval splitting—but we can simplify. Because we don’t split intervals, we know that every interval live at the beginning of a block is either: For this reason, we only handle the second case in our SSA resolution. If we added lifetime holes or interval splitting, we would have to go back to the full Wimmer SSA resolution. This means that we’re going to iterate over every outbound edge from every block. For each edge, we’re going to insert some parallel moves. This already looks very similar to the RESOLVE function from Wimmer2010. Unfortunately, Wimmer2010 basically shrugs it off with an eh, it’s already in the literature comment. What’s not made clear, though, is that this particular subroutine has been the source of a significant number of bugs in the literature. Only recently did some folks roll through and suggest (proven!) fixes: This sent us down a deep rabbit hole of trying to understand what bugs occur, when, and how to fix them. We implemented both the Leroy and the Boissinot algorithms. We found differences between Boissinot2009, Boissinot2010, and the SSA book implementation following those algorithms. We found Paul Sokolovsky’s implementation with bugfixes. We found Dmitry Stogov’s unmerged pull request to the same repository to fix another bug.
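To make the problem these fixes address concrete, here is a deliberately naive quadratic sketch of sequentializing one parallel move. This is neither Leroy’s nor Boissinot’s algorithm; it assumes a free scratch location `tmp` and moves represented as `[src, dst]` pairs.

```ruby
# Emit any move whose destination is not read by a pending move; if
# none exists, every pending move is in a cycle, so break one cycle by
# saving a source to the scratch location.
def sequentialize(moves, tmp)
  pending = moves.reject { |src, dst| src == dst }  # drop no-op moves
  out = []
  until pending.empty?
    ready = pending.find { |_, dst| pending.none? { |src, _| src == dst } }
    if ready
      out << ready
      pending.delete(ready)
    else
      src, _dst = pending.first
      out << [src, tmp]                               # save one source
      pending.map! { |s, d| [s == src ? tmp : s, d] }  # re-point its readers
    end
  end
  out
end

# The classic swap (a, b) := (b, a) needs a temporary.
sequentialize([["a", "b"], ["b", "a"]], "t")
# => [["a", "t"], ["b", "a"], ["t", "b"]]
```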
We looked at Benoit Boissinot’s thesis again and emailed him some questions. He responded! And then he even put up an amended version of his algorithm in Rust with tests and fuzzing. All this is to say that this is still causing people grief and, though I understand page limits, I wish parallel moves were not handwaved away. We ended up with this implementation, which passes all of the tests from Sokolovsky’s repository as well as the example from Boissinot’s thesis (though, as we discussed in the email, the example solution in the thesis is incorrect 5 ). Leroy’s algorithm, which is shorter, passes almost all the tests—in one test case, it uses one more temporary variable than Boissinot’s does. We haven’t spent much time looking at why. Whatever algorithm you choose, you now have a way to parallel move some registers to some other registers. You have avoided the “swap problem”. That’s great. You can generate an ordered list of instructions from a tangled graph. But where do you put them? What about the “lost copy” problem? As it turns out, we still need to handle critical edge splitting. Let’s consider what it means to insert moves at an edge between blocks when the surrounding CFG looks a couple of different ways. These are the four (really, three) cases we may come across. In Case 1, if we only have two neighboring blocks A and B, we can insert the moves into either block. It doesn’t matter: at the end of A or at the beginning of B are both fine. In Case 2, if A has two successors, then we should insert the moves at the beginning of B. That way we won’t be mucking things up for the edge . In Case 3, if B has two predecessors, then we should insert the moves at the end of A. That way we won’t be mucking things up for the edge . Case 4 is the most complicated. There is no extant place in the graph we can insert moves. If we insert in A, we mess things up for . If we insert in , we mess things up for . Inserting in or doesn’t make any sense. What is there to do?
As it turns out, Case 4 is called a critical edge. And we have to split it. We can insert a new block E along the edge and put the moves in E! That way they still happen along the edge without affecting any other blocks. Neat. In Ruby code, that looks like: Adding a new block invalidates the cached , so we also need to recompute that. We could also avoid that by splitting critical edges earlier, before numbering. Then, when we arrive in , we can clean up branches to empty blocks! (See also Nick’s post on critical edge splitting, which also links to Faddegon’s thesis, which I should at least skim.) And that’s it, folks. We have gone from virtual registers in SSA form to physical locations. Everything’s all hunky-dory. We can just turn these LIR instructions into their very similar looking machine equivalents, right? Not so fast… You may have noticed that the original linear scan paper does not mention calls or other register constraints. I didn’t really think about it until I wanted to make a function call. The authors of later linear scan papers definitely noticed, though; Wimmer2005 writes the following about Poletto1999: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately. Fun. We will start off by handling calls and method parameters separately, we will note that it’s not amazing code, and then we will eventually implement the later papers, which handle register constraints more naturally. We’ll call this new function after register allocation but before SSA resolution. We do it after register allocation so we know where each virtual register goes but before resolution so we can still inspect the virtual register operands.
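For the argument-placement part of that pass, here is a sketch of pairing call arguments with fixed ABI registers as one more parallel move. The register order is the x86-64 System V integer-argument convention; the function name and operand spelling are otherwise made up.

```ruby
# Pair each call argument's current location with the ABI register it
# must end up in, producing a [src, dst] parallel move to sequentialize.
ARG_REGS = %w[rdi rsi rdx rcx r8 r9].freeze

def call_argument_moves(args)
  raise "stack-passed arguments not handled" if args.size > ARG_REGS.size
  args.zip(ARG_REGS.first(args.size))
end

call_argument_moves(["P3", "P0"])
# => [["P3", "rdi"], ["P0", "rsi"]]
```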
Its goal is to do a couple of things: We’ll also remove the operands since we’re placing them in special registers explicitly now. (Unfortunately, this sidesteps handling the less-fun bit of calls in ABIs where, after the 6th parameter, arguments are expected on the stack. It also completely ignores ABI size constraints.) Now, you may have noticed that we don’t do anything special for the incoming params of the function we’re compiling! That’s another thing we have to handle. Thankfully, we can handle it with yet another parallel move (wow!) at the end of . Again, this is yet another kind of thing where some of the later papers have much better ergonomics and also much better generated code. But this is really cool! If you have arrived at this point with me, we have successfully made it to 1997 and that is nothing to sneeze at. We have even adapted research from 1997 to work with SSA, avoiding several significant classes of bugs along the way. We have just built an enormously complex machine. Even out of the gate, with the original linear scan, there is a lot of machinery. It’s possible to write tests that spot check sample programs of all shapes and sizes but it’s very difficult to anticipate every possible edge case that will appear in the real world. Even if the original algorithm you’re using has been proven correct, your implementation may have subtle bugs due to (for example) having slightly different invariants or even transcription errors. We have all these proof tools at our disposal: we can write an abstract interpreter that verifies properties of one graph, but it’s very hard (impossible?) to scale that to sets of graphs. Maybe that’s enough, though. In one of my favorite blog posts, Chris Fallin writes about writing a register allocation verifier based on abstract interpretation. It can verify one concrete LIR function at a time. It’s fast enough that it can be left on in debug builds.
This means that a decent chunk of the time (tests, CI, maybe a production cluster) we can get a very clear signal that every register assignment that passes through the verifier satisfies some invariants. Furthermore, we are not limited to Real World Code. With the advent of fuzzing, one can imagine an always-on fuzzer that tries to break the register allocator. A verifier can then catch bugs that come from exploring this huge search space. Some time after finding Chris’s blog post, I also stumbled across the very same thing in V8! I find this stuff so cool. I’ll also mention Boissinot’s Rust code again because it does something similar for parallel moves. It’s possible to do linear scan allocation in reverse, at least on traces without control-flow. See for example The Solid-State Register Allocator, the LuaJIT register allocator, and Reverse Linear Scan Allocation is probably a good idea. By doing linear scan this way, it is also possible to avoid computing liveness and intervals. I am not sure if this works on programs with control-flow, though. We built a register allocator that works on SSA. Hopefully next time we will add features such as lifetime holes, interval splitting, and register hints. The full Ruby code listing is not (yet?) publicly available under the Apache 2 license. UPDATE: See the post on lifetime holes. Thanks to Waleed Khan and Iain Ireland for giving feedback on this post. It’s not just about registers, either. In 2016, Facebook engineer Dave legendarily used linear-scan register allocation to book meeting rooms.  ↩ Well. As I said on one of the social media sites earlier this year, “All AOT compilers are alike; each JIT compiler is fucked up in its own way.” JavaScript: Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer notes: Our allocator relies on static single assignment form, which simplifies data flow analysis and tends to produce short live intervals.
Register allocation for programs in SSA-form (PDF, 2006) by Hack, Grund, and Goos notes that interference graphs for SSA programs are chordal and can be optimally colored in quadratic time. SSA Elimination after Register Allocation (PDF, 2008) by Pereira and Palsberg notes: One of the main advantages of SSA based register allocation is the separation of phases between spilling and register assignment. Cliff Click (private communication, 2025) notes: It’s easier. Got it already, why lose it […] spilling always uses use/def and def/use edges. This is inspired by Rasmus Andersson’s graph coloring visualization that I saw some years ago.  ↩ The example in the thesis is to sequentialize the following parallel copy: The solution in the thesis is: but we think this is incorrect. Solving manually, Aaron and I got: which is what the code gives us, too.  ↩
