Posts in Performance (20 found)

Flexible I/O for Database Management Systems with xNVMe

Flexible I/O for Database Management Systems with xNVMe
Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier González, Vivek Shah, Pınar Tözün
CIDR’26

This paper describes xNVMe, a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB. Section 2 contains the hard sell for xNVMe. The “x” prefix serves a similar role to the “X” in DirectX: it is fast, while also being portable across operating systems and storage devices. The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry). There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously). Listing 3 has an example, which feels like “Mantle for NVMe”.

Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf

The API works on Linux, FreeBSD, Windows, and macOS. Some operating systems have multiple backends available (e.g., io_uring and libaio on Linux). The point of this paper is that xNVMe is easy to drop into an existing application. The paper describes an xNVMe-backed implementation of the DuckDB filesystem interface. It creates dedicated queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel). The paper also describes how xNVMe supports shiny new NVMe features like Flexible Data Placement (FDP). This allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance).

Most of the results in the paper show comparable performance for the xNVMe filesystem vs the baseline DuckDB filesystem. Fig. 5 shows one benchmark where it yields a significant improvement:

Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf

Dangling Pointers

I think the long-term success of xNVMe will depend on governance. Potential members of the ecosystem could be scared off by Samsung’s potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?)
There is a delicate balancing act between an API driven by a sluggish bureaucratic committee and an API dominated by one vendor.


A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation

A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation
Christian Lanius, Florian Freye, and Tobias Gemmeke
GLVLSI'25

This paper ostensibly describes an ASIC accelerator for NFA evaluation (e.g., regex matching), but it also describes two orthogonal techniques for optimizing NFA evaluation which are applicable to more than just this ASIC.

Any regular expression can be converted to a non-deterministic finite automaton (NFA). Think of an NFA like a state machine where some inputs can trigger multiple transitions. The state machine is defined by a set of transitions. A transition is a (current state, input symbol, next state) tuple. The non-deterministic naming comes from the fact that multiple tuples may exist with identical (current state, input symbol) values; they only differ in their next state values. This means that an NFA can be in multiple states at once.

One way to evaluate an NFA is to use a bitmap to track the set of active states. For each new input symbol, the set of active states in the bitmap is used to determine which transitions apply. Each activated transition sets one bit in the bitmap used to represent the active states for the next input symbol.

The hardware described in this paper uses a compute-in-memory (CIM) microarchitecture. A set of columns stores the state machine, with each column storing one transition. This assumes that the transition function is sparse (i.e., the number of transitions used is much lower than the maximum possible). During initialization, the transitions are written into the CIM hardware. An input symbol is processed by broadcasting it and the current state bitmap to all columns. All columns evaluate whether their transition should be activated. The hardware then iterates (over multiple clock cycles) over all activated transitions and updates the state bitmap for the next input symbol. The left side of Fig. 5 illustrates the hardware in each column which compares the input symbol and current state against the stored tuple:

Source: https://dl.acm.org/doi/10.1145/3716368.3735157

Cool Trick #1 - Multiple Symbols per Cycle

The algorithm described above processes at most one input symbol per cycle (and it is slower for inputs that activate multiple transitions). The paper contains two tricks for overcoming this limitation. Fig. 4 illustrates how an NFA that accepts one symbol per cycle can be converted into an NFA which accepts two symbols per cycle. For example, rather than consider two consecutive symbols to be separate inputs, put them together into one mega-symbol. This is feasible as long as your NFA implementation isn’t too sensitive to the number of bits per symbol.

Source: https://dl.acm.org/doi/10.1145/3716368.3735157

Cool Trick #2 - Bloom Filter

The target application for this hardware is monitoring network traffic for threats (e.g., Snort). A key observation is that most inputs (network packets) do not produce a match, so it is reasonable to assume that most of the time the NFA will be in the initial state, and most input symbols will not trigger any transitions. If that assumption holds, then a bloom filter can be used to quickly skip many input symbols before they even reach the core NFA evaluation hardware.

The bloom filter is built when the NFA transition function changes. To build the bloom filter, iterate over each transition whose current state is the initial state. For each such transition, compute a hash of the input symbol, decompose the hashed value into indices, and set the corresponding bits in the bloom filter. To test an input symbol against the bloom filter, hash the input symbol, decompose the hashed value into indices, and check to see if all of the corresponding bits are set in the bloom filter. If any bit is not set, then the input symbol does not trigger a transition from the initial state. When that symbol finally arrives at the NFA hardware, it can be dropped if the NFA is in the initial state.
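The bitmap evaluation loop and the bloom-filter pre-screen can be sketched in software. This is an illustrative Python model of the strategy, not the paper's hardware; the state numbering, filter size, and hash count are assumptions made for the example.

```python
# Illustrative software model of the evaluation strategy (not the ASIC):
# an NFA stored as a sparse list of (current_state, symbol, next_state)
# transitions, evaluated with a bitmap of active states, plus a Bloom
# filter that skips symbols which cannot leave the initial state.

INITIAL = 0
BLOOM_BITS, BLOOM_K = 64, 2

def step(transitions, active, symbol):
    """One evaluation step: every transition whose (state, symbol) pair
    matches sets one bit in the next active-state bitmap."""
    nxt = 0
    for cur, sym, dst in transitions:
        if sym == symbol and (active >> cur) & 1:
            nxt |= 1 << dst
    return nxt

def bloom_indices(symbol):
    # Decompose a hash of the symbol into BLOOM_K bit indices.
    return [hash((symbol, i)) % BLOOM_BITS for i in range(BLOOM_K)]

def build_bloom(transitions):
    """Rebuilt whenever the transition function changes: set bits for
    every symbol that triggers a transition out of the initial state."""
    bloom = 0
    for cur, sym, _ in transitions:
        if cur == INITIAL:
            for idx in bloom_indices(sym):
                bloom |= 1 << idx
    return bloom

def maybe_triggers(bloom, symbol):
    """False means the symbol definitely cannot leave the initial state
    (false positives are possible, false negatives are not)."""
    return all((bloom >> idx) & 1 for idx in bloom_indices(symbol))

def scan(transitions, bloom, text, accept):
    """Scan text, re-arming the initial state each step (Snort-style
    matching at every offset). Returns True if `accept` ever activates."""
    active = 1 << INITIAL
    for symbol in text:
        # Fast path: drop the symbol when only the initial state is
        # active and the Bloom filter rules it out.
        if active == 1 << INITIAL and not maybe_triggers(bloom, symbol):
            continue
        active = (1 << INITIAL) | step(transitions, active, symbol)
        if (active >> accept) & 1:
            return True
    return False

# Tiny NFA for the pattern "ab": 0 -(a)-> 1 -(b)-> 2 (accepting).
transitions = [(0, "a", 1), (1, "b", 2)]
bloom = build_bloom(transitions)
print(scan(transitions, bloom, "xxab", accept=2))  # True
```

Note that a bloom-filter false positive is harmless here: the symbol just takes the slow path and activates nothing, which mirrors why the hardware can use a small, cheap filter.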
Table 1 compares PPA results against other published NFA accelerators. It is a bit apples-to-oranges as the various designs target different technology nodes. The metric that stands out is the low power consumption of this design.

Source: https://dl.acm.org/doi/10.1145/3716368.3735157

Dangling Pointers

I wonder if the bloom filter trick can be extended. For example, rather than assuming the NFA will always be in the initial state, the hardware could dynamically compute which states are the most frequent and then use bloom filters to drop input symbols which cannot trigger any transitions from those states.


Fixing qBittorrent in Docker, swallowing RAM and locking up the host

Intro

I’ve had qBittorrent running happily in Docker for over a year now, using the linuxserver/qbittorrent image. I run that container and others on an Ubuntu Server host. Recently, the host began regularly becoming unresponsive. SSH didn’t work, and the only way to recover was to power cycle the server (a small NUC). The symptoms were: RAM usage would climb and climb, the server would become sluggish, and then it would completely lock up.

Martin Alderson 1 week ago

Which web frameworks are most token-efficient for AI agents?

I benchmarked 19 web frameworks on how efficiently an AI coding agent can build and extend the same app. Minimal frameworks use up to 2.9x fewer tokens than full-featured ones.

Den Odell 1 week ago

Constraints and the Lost Art of Optimization

In 1984, Steve Jobs walked over to a bag standing on stage and pulled out a computer that would change the world. The Macintosh had an operating system, a graphical user interface, a window manager, a font renderer, and a complete graphics engine called QuickDraw, one of the most elegant pieces of software ever written. The whole thing fit inside the machine’s 64KB ROM.

Sixty. Four. Kilobytes.

A typical webpage hero image is larger. A "Hello World" React application is larger. The entire intellectual and creative output of a team that reinvented personal computing fits in a space that, today, we wouldn’t think twice about wasting on a single font file. They did it because there just wasn’t any other choice. A larger ROM would have been too expensive for a mass-market consumer device, so efficiency and hyper-optimization were the only route. Somewhere in the years that followed we’ve lost the creative solutions, the art of optimization, that being constrained in that way produces.

The Atari 2600 game console had 4KB of ROM and 128 bytes of RAM. Not kilobytes… bytes. And because there was no display buffer, programmers had to manually synchronize their code with the connected TV’s electron beam as it swept across the screen, line by line, sixty times a second. If your code ran too slowly, the image tore. If it ran too fast, you’d corrupt the next line. They called it "racing the beam." It was brutal, unforgiving, and it forced some of the most inventive programming in computing history.

Super Mario Bros shipped on a 40KB NES cartridge. Tetris for the Nintendo Game Boy, the most successful handheld game ever made at the time, was 32KB. These weren’t compromised experiences, they were masterpieces. The constraints didn’t diminish the work, they actually defined it. The programmers who built these things weren’t just efficient, they thought differently. They knew their medium the way a sculptor knows stone.
They understood every byte, every clock cycle, and every trade-off. The machine had no secrets from them because it couldn’t afford to.

Modern development is defined by abundance. Cheap storage, fast networks, and powerful hardware. The practical consequences of inefficiency have largely disappeared. A few extra megabytes, a few wasted milliseconds, an unnecessary UI re-render. For the most part, nobody notices and nothing breaks. And so we’ve stopped noticing ourselves. This isn’t out of laziness, it’s just how rational people work. When the hard limits disappear, the thinking they demanded tends to disappear with them. There’s no forcing function, no electron beam to race, no 128 bytes standing between your idea and disaster. We can afford not to understand, and so increasingly, we simply don’t.

Here’s what I think is worth recovering: not the constraints themselves, but the relationship with the medium that having those constraints produced. The engineers who wrote the Mac ROM didn’t just know how to be efficient, they understood their problem at a level that made elegance possible. Bill Atkinson’s QuickDraw wasn’t just small, it was a beautiful piece of code. The 64KB forced him to find the right solution, not just a working one. That instinct, to understand deeply before you build, to ask whether this is the right structure and not just a functional one, to treat your medium as something to be understood rather than merely used, that’s the transferable thing. Not the bit-twiddling, the thinking.

The best engineers I’ve worked with carry this instinct even when others might think it crazy. They impose their own constraints. They ask what this would look like if it had to be half the size, or run twice as fast, or use a tenth of the memory. Not because anyone demanded it, but because just by thinking there could be a better, more efficient solution, one often emerges.
If you want to start developing this instinct today, the most valuable thing you can do is learn how your runtime actually works. Not at the API level, but internally. How your platform parses, allocates, renders, and executes. For web developers that means understanding the browser pipeline: parsing, style resolution, layout, paint, and compositing. For mobile developers it means understanding how iOS or Android manages memory, handles drawing, and schedules work. Understanding your platform changes what you notice, what makes you wince, and what you reach for instinctively. The engineers who built the Mac knew their domain completely, and you can know yours too.

There’s a principle I keep coming back to in engineering apps for performance: fast by default. Not fast because you optimized after the fact, but fast because the thinking that produces fast software is simply better thinking. It catches unnecessary complexity early, and it produces systems that are easier to understand, easier to change, and easier to reason about under pressure. The Atari programmers were fast by default; they had no choice. But the discipline they practiced, that intimate, demanding relationship with their constraints, that’s a choice we can still make.

The 64KB Mac ROM isn’t just a remarkable footnote, it’s a provocation. It asks: if they could do that with that, then what’s our excuse? Not to shame us, but to remind us that constraints aren’t the enemy of great work. They’re often the source of it.


Better Memory Tiering, Right from the First Placement

Better Memory Tiering, Right from the First Placement
João Póvoas, João Barreto, Bartosz Chomiński, André Gonçalves, Fedar Karabeinikau, Maciej Maciejewski, Jakub Schmiegel, and Kostiantyn Storozhuk
ICPE'25

This paper addresses the first placement problem in systems with multiple tiers of memory (e.g., DRAM paired with HBM, or local DRAM paired with remote DRAM accessed over CXL). The paper cites plenty of prior work which dynamically migrates pages/allocations out of suboptimal memory tiers. What is different about this paper is that it attempts to avoid placing data in a suboptimal tier in the first place. The key insight is: statistics from one allocation can be used to generate better placements for similar allocations which will occur in the future.

Fig. 3 offers insight into how much waste there is in a policy which initially places all pages into a fast tier and then migrates them to a slower tier if they are accessed infrequently. The figure shows results from one migration policy, applied to three benchmarks.

Source: https://dl.acm.org/doi/10.1145/3676151.3719378

Allocation Contexts

This paper proposes gathering statistics for each allocation context. An allocation context is defined by the source code location of the allocation, the call stack at the moment of allocation, and the size of the allocation. If two allocations match on these attributes, then they are considered part of the same context. The system hooks heap allocation functions (e.g., malloc, calloc) to track all outstanding allocations associated with each allocation context. An x86 PMU event is used to determine how frequently each allocation context is accessed. A tidbit I learned from this paper is that some x86 performance monitoring features do more than just count events. For example, precise event sampling (PEBS) can randomly sample load operations and emit the accessed (virtual) address. Given the accessed address, it is straightforward to map back to the associated allocation context.
The hotness of an allocation context is the frequency of these access events divided by the total size of all allocations in the context. Time is divided into epochs. During an epoch, the hotness of each allocation context is recalculated. When a new allocation occurs, the hotness of the allocation context (from the previous epoch) is used to determine which memory tier to place the allocation into. The paper only tracks large allocations (at least 64 bytes). For smaller allocations, the juice is not worth the squeeze. These allocations are assumed to be short-lived and frequently accessed.

This paper also describes a kernel component which complements the user space policy described so far. Whereas the user space code deals with allocations, the kernel code deals with pages. This is useful for allocations which do not access all pages uniformly. It is also useful for detecting and correcting suboptimal initial placements. All PTEs associated with all allocations are continually scanned. The accessed bit determines if a page has been read since the last scan. The dirty bit determines if a page has been written since the last scan. After 10 scans, the system has a pretty good idea of how frequently a page is accessed. These statistics are used to migrate pages between fast and slow tiers.

Fig. 8 shows execution time for three benchmarks; one configuration represents the combined user space and kernel solutions described by this paper.

Source: https://dl.acm.org/doi/10.1145/3676151.3719378

Dangling Pointers

I wasn’t able to find details in the paper about how PTE scanning works without interfering with other parts of the OS. For example, doesn’t the OS use the dirty bit to determine if it needs to write pages back to disk? I assume the PTE scanning described in this paper must reset the dirty bit on each scan. The definition of an allocation context seems ripe for optimization. I suspect that allowing some variability in call stack or allocation size would allow for better statistics.
Maybe this is a good use case for machine learning?
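The user space policy described above — contexts keyed by call stack and size, hotness recomputed per epoch, placement decided by the previous epoch's hotness — can be sketched as follows. The tier names, the threshold, and the cold-start default (unseen contexts go to the slow tier) are illustrative assumptions, not the paper's parameters.

```python
from collections import defaultdict

FAST_TIER, SLOW_TIER = "DRAM", "CXL"
HOTNESS_THRESHOLD = 0.01  # sampled accesses per byte per epoch; illustrative

class ContextStats:
    def __init__(self):
        self.total_bytes = 0  # live bytes allocated under this context
        self.samples = 0      # PMU access samples mapped to this context
        self.hotness = 0.0    # computed at the end of the previous epoch

stats = defaultdict(ContextStats)

def context_key(call_stack, size):
    """An allocation context: call stack at allocation + allocation size."""
    return (tuple(call_stack), size)

def on_alloc(call_stack, size):
    """Pick a tier using last epoch's hotness for this context."""
    ctx = stats[context_key(call_stack, size)]
    ctx.total_bytes += size
    return FAST_TIER if ctx.hotness >= HOTNESS_THRESHOLD else SLOW_TIER

def on_access_sample(call_stack, size):
    """Called when a sampled load address maps back to this context."""
    stats[context_key(call_stack, size)].samples += 1

def end_epoch():
    """Recompute hotness = sampled accesses / total live bytes."""
    for ctx in stats.values():
        ctx.hotness = ctx.samples / ctx.total_bytes if ctx.total_bytes else 0.0
        ctx.samples = 0

# Demo: a context starts cold; once an epoch of samples shows it is hot,
# the next allocation from the same context lands in the fast tier.
print(on_alloc(["main", "worker"], 4096))  # CXL (no history yet)
for _ in range(100):
    on_access_sample(["main", "worker"], 4096)
end_epoch()
print(on_alloc(["main", "worker"], 4096))  # DRAM
```

The real system does the equivalent with hooked allocators and PMU samples rather than explicit calls, but the epoch/hotness bookkeeping is the same shape.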

Farid Zakaria 1 week ago

Linker Pessimization

In a previous post, I wrote about linker relaxation: the linker’s ability to replace a slower, larger instruction with a faster, smaller one when it has enough information at link time. For instance, an indirect call through the GOT can be relaxed into a direct call plus a nop. This is a well-known technique to optimize the instructions for performance. Does it ever make sense to go the other direction? 🤔

We’ve been working on linking some massive binaries that include Intel’s Math Kernel Library (MKL), a prebuilt static archive. MKL ships as object files compiled with the small code model (-mcmodel=small), meaning its instructions assume everything is reachable within ±2 GiB. The included object files also have some odd relocations where the addend is a very large negative number (>1 GiB). The calculation for the relocation value is S + A - P: the symbol address plus the addend minus the instruction address. With a sufficiently large negative addend, the relocation value can easily exceed the 2 GiB limit and the linker fails with relocation overflows. We can’t recompile MKL (it’s a prebuilt proprietary archive), and we can’t simply switch everything to the large code model. What can we do? 🤔

I am calling this technique linker pessimization: the reverse of relaxation. Instead of shrinking an instruction, we expand one to tolerate a larger address space. 😈

The specific instructions that overflow in our case are lea (Load Effective Address) instructions. In x86_64, lea performs pure arithmetic: it computes an address and stores the result in a register without accessing memory. The displacement is a 32-bit signed integer embedded directly into the instruction encoding, and the linker fills it in via an R_X86_64_PC32 relocation. The relocation formula is S + A - P. Let’s look at an example with a large addend. A 32-bit signed integer can only represent ±2,048 MB (±2 GiB).
Our value of −2,062 MB exceeds that range and the linker rightfully complains 💥.

Note: These instructions appear in MKL because the library uses them as a way to compute an address of a data table relative to the instruction pointer. The large negative addend is intentional; it’s an offset within a large lookup table.

The core idea is delightful because as engineers we are usually trained to optimize systems, but in this case we want the opposite. We swap the lea for a mov that reads through a nearby pointer. Recall from the relaxation post: relaxation shrinks instructions (e.g. indirect call -> direct call). Here we do the opposite: we make the instruction do more work (pure arithmetic -> memory load) in exchange for a reachable displacement. That’s why I consider it a pessimization or reverse-relaxation. Both instructions use the same encoding length (7 bytes with a REX prefix), so the patch is a single byte change in the opcode. 🤓

The difference in behavior is critical. Original: the lea must reach across the entire binary. Pessimized: the mov reads a nearby pointer slot that holds the full address. We’ve traded one direct computation for an indirect load through a pointer, and we make sure the displacement is now tiny. The 64-bit pointer slot can reach any address in the virtual address space. 👌

For each problematic relocation, three changes are needed in the object file:

1. Opcode Patch — Change the opcode byte 0x8D to 0x8B (1 byte). This converts the lea (compute address) into a mov (load from address). The rest of the instruction encoding (ModR/M byte, REX prefix) stays identical because both instructions use the same operand format.

2. New Pointer Slot — Create a new section containing 8 zero bytes per patch site, plus a new R_X86_64_64 relocation pointing to the original symbol with the original addend. R_X86_64_64 is a 64-bit absolute relocation. Its formula is simply S + A, with no subtraction of P. There is no 32-bit range limitation; it can address the entire 64-bit address space. This is the key insight that makes the fix work.

3. Retarget the Original Relocation — In the relocation entry for the patched instruction, change the symbol to point at the new pointer slot. The addend becomes a small offset (the distance from the instruction to the fixup slot), which is guaranteed to fit.

Note: Because both lea and mov with a RIP-relative operand are exactly the same length (7 bytes with a REX prefix), we don’t shift any code, don’t invalidate any other relocations, and don’t need to rewrite any other parts of the object file. It’s truly a surgical patch.

The pessimized mov now performs a memory load where the original lea did pure register arithmetic. That’s an extra cache line fetch and a data dependency. If this instruction is in a tight loop, it could be a performance hit. Optimization is the root of all evil, what does that make pessimization? 🧌

To summarize:

LEA — pure arithmetic, no memory access. The 32-bit displacement must encode the entire distance to the far-away data. This overflows.

MOV — memory load. The displacement points to a nearby 8-byte pointer slot, which holds the full 64-bit address. This never overflows.
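To make the arithmetic concrete, here is a small Python sketch of the overflow check and the one-byte patch. The addresses and the addend are invented for illustration; the opcodes (0x8D for lea, 0x8B for mov) and the relocation formulas are the real x86-64/ELF ones.

```python
# Sketch of the relocation arithmetic behind the patch. Addresses and the
# addend are illustrative; the opcodes and formulas are real x86-64/ELF.

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def pc32_value(S, A, P):
    """R_X86_64_PC32: value = S + A - P, must fit a signed 32-bit field."""
    return S + A - P

def fits_in_32(v):
    return INT32_MIN <= v <= INT32_MAX

# Original situation: the symbol is nearby, but a huge negative addend
# (an offset into a large lookup table) pushes the value out of range.
S = 0x40_0000            # symbol address (illustrative)
A = -2062 * 1024 * 1024  # ~-2,062 MB addend, as in the MKL objects
P = 0x40_1000            # address of the instruction's displacement field

assert not fits_in_32(pc32_value(S, A, P))  # this is where the linker fails

# The pessimization: flip one opcode byte (0x8D -> 0x8B) to turn the
# 7-byte RIP-relative lea into a same-length mov, then aim its
# displacement at a nearby 8-byte slot holding the full 64-bit address.
lea = bytes([0x48, 0x8D, 0x05, 0, 0, 0, 0])  # lea rax, [rip+disp32]
mov = bytes([lea[0], 0x8B]) + lea[2:]        # mov rax, [rip+disp32]

slot_addr = P + 0x100  # nearby pointer slot (illustrative)
# The slot gets an R_X86_64_64 relocation: value = S + A, no P, no limit.
slot_value = (S + A) % 2**64  # full 64-bit address stored in the slot

assert fits_in_32(pc32_value(slot_addr, -4, P))  # new displacement fits
print(mov.hex())
```

The two asserts are the whole story: the old PC-relative value overflows a signed 32-bit field, while the retargeted displacement to the nearby slot is tiny.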

Evan Schwartz 1 week ago

PSA: Your SQLite Connection Pool Might Be Ruining Your Write Performance

Update (Feb 18, 2026): After a productive discussion on Reddit and additional benchmarking, I found that the solutions I originally proposed (batched writes or using a synchronous connection) don't actually help. The real issue is simpler and more fundamental than I described: SQLite is single-writer, so any amount of contention at the SQLite level will severely hurt write performance. The fix is to use a single writer connection with writes queued at the application level, and a separate connection pool for concurrent reads. The original blog post text is preserved below, with retractions and updates marked accordingly. My apologies to the SQLx maintainers for suggesting that this behavior was unique to SQLx.

Write transactions can lead to lock starvation and serious performance degradation when using SQLite with SQLx, the popular async Rust SQL library. In retrospect, I feel like this should have been obvious, but it took a little more staring at suspiciously consistent "slow statement" logs than I'd like to admit, so I'm writing it up in case it helps others avoid this footgun.

SQLite is single-writer. In WAL mode, it can support concurrent reads and writes (or, technically, "write" singular), but no matter the mode there is only ever one writer at a time. Before writing, a process needs to obtain an EXCLUSIVE lock on the database. If you start a read transaction with a SELECT and then perform a write in the same transaction, the transaction will need to be upgraded to a write transaction with an exclusive lock:

A read transaction is used for reading only. A write transaction allows both reading and writing. A read transaction is started by a SELECT statement, and a write transaction is started by statements like CREATE, DELETE, DROP, INSERT, or UPDATE (collectively "write statements"). If a write statement occurs while a read transaction is active, then the read transaction is upgraded to a write transaction if possible.
(source)

Transactions started with BEGIN IMMEDIATE or BEGIN EXCLUSIVE also take the exclusive write lock as soon as they are started. A typical SQLx transaction reads and then writes, and that by itself is completely fine: the transaction starts as a read transaction and is upgraded to a write transaction at the write statement.

Update: This section incorrectly attributes the performance degradation to the interaction between async Rust and SQLite. The problem is actually that any contention for the EXCLUSIVE lock at the SQLite level, whether from single statements or batches, will hurt write performance.

The problem arises when you .await within a write transaction. For example, this could happen if you execute multiple write statements within a transaction, awaiting each one in turn. This will cause serious performance degradation if you have multiple concurrent tasks that might be trying this operation, or any other write, at the same time. When the program reaches the first write statement, the transaction is upgraded to a write transaction with an exclusive lock. However, when you .await, the task yields control back to the async runtime. The runtime may schedule another task before returning to this one. The problem is that this task is now holding an exclusive lock on the database. All other writers must wait for this one to finish. If the newly scheduled task tries to write, it will simply wait until it hits the busy timeout and returns an error. The original task might be able to make progress if no other concurrent writers are scheduled before it, but under higher load you might continuously have new tasks that block the original writer from progressing.

Starting a transaction with BEGIN IMMEDIATE will also cause this problem, because you will immediately take the exclusive lock and then yield control at the next .await.

In practice, you can spot this issue in your production logs if you see a lot of SQLx "slow statement" warnings where the reported time is very close to your busy_timeout (which is 5 seconds by default).
This is the result of other tasks being scheduled by the runtime and then trying and failing to obtain the exclusive lock they need to write to the database while being blocked by a parked task.

SQLite's concurrency model (in WAL mode) is many concurrent readers with exactly one writer. Mirroring this architecture at the application level provides the best performance. Instead of a single connection pool, where connections may be upgraded to write at any time, use two separate pools: one with a single writer connection, and one with multiple reader connections. With this setup, write transactions serialize within the application. Tasks will queue waiting for the single writer connection, rather than all trying to obtain SQLite's EXCLUSIVE lock. In my benchmarks, this approach was ~20x faster than using a single pool with multiple connections.

An alternative to separate pools is wrapping writes in a Mutex, which achieves similar performance (95ms in the benchmarks). However, separate pools make the intent clearer and, if the reader pool is configured as read-only, prevent accidentally issuing a write on a reader connection.

Having separate pools works when reads and writes are independent, but sometimes you need to atomically read and then write based on what you read. Sending such a transaction to the single write connection is fine if the read is extremely fast, such as a single lookup by primary key. However, if your application requires expensive reads that must precede writes in a single atomic transaction, the shared connection pool with moderate concurrency might outperform a single writer.

Retraction: Benchmarking showed that batched writes perform no better than the naive loop under concurrency, because 50 connections still contend for the write lock regardless of whether each connection issues 100 small INSERTs or one large one. QueryBuilder is still useful for reducing per-statement overhead, but it does not fix the contention problem.
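The single-writer-plus-reader-pool architecture can be sketched with Python's stdlib sqlite3 module and a queue. This is an illustration of the pattern rather than SQLx; the file name and helper names are invented for the example.

```python
# Single-writer architecture: one thread owns the only write connection
# and drains an application-level queue; readers get their own read-only
# connections. Illustrative sketch, not SQLx.
import os
import queue
import sqlite3
import tempfile
import threading

DB = os.path.join(tempfile.gettempdir(), "single_writer_demo.db")

write_q = queue.Queue()

def writer_loop():
    conn = sqlite3.connect(DB)
    conn.execute("PRAGMA journal_mode=WAL")
    while True:
        sql, params, reply = write_q.get()
        if sql is None:  # shutdown sentinel
            break
        with conn:       # one transaction per queued write
            conn.execute(sql, params)
        reply.put(True)
    conn.close()

def execute_write(sql, params=()):
    """Enqueue a write and block until the writer thread applies it.
    Writers queue here instead of contending for SQLite's EXCLUSIVE lock."""
    reply = queue.Queue()
    write_q.put((sql, params, reply))
    return reply.get()

def open_reader():
    """Readers use separate read-only connections, so an accidental
    write fails fast instead of grabbing the write lock."""
    return sqlite3.connect(f"file:{DB}?mode=ro", uri=True)

threading.Thread(target=writer_loop, daemon=True).start()
execute_write("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
execute_write("INSERT INTO t (v) VALUES (?)", ("hello",))

reader = open_reader()
print(reader.execute("SELECT v FROM t ORDER BY id").fetchone()[0])  # hello
reader.close()
```

The queue plays the role of the single-connection writer pool: concurrent tasks line up at the application level, and the sole writer connection never blocks on SQLite's lock.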
We could safely replace the example code above with a bulk insert to avoid the lock starvation problem. Note that if you do this with different numbers of values, you should disable SQLx's prepared statement caching for those queries. By default, SQLx caches prepared statements, and each version of the query with a different number of arguments will be cached separately, which may thrash the cache.

Retraction: Benchmarking showed that this did not actually improve performance.

Unfortunately, the fix for atomic writes to multiple tables is uglier and potentially very dangerous. To avoid holding an exclusive lock across an .await, you need to use the raw SQL interface to execute a transaction in one shot. However, this can lead to catastrophic SQL injection attacks if you use it for user input, because raw SQL execution does not support binding and sanitizing query parameters. Note that you can technically run a transaction with multiple statements in a single call, but the docs say:

The query string may only contain a single DML statement: SELECT, INSERT, UPDATE, DELETE and variants. The SQLite driver does not currently follow this restriction, but that behavior is deprecated.

If you find yourself needing atomic writes to multiple tables with SQLite and Rust, you might be better off rethinking your schema to combine those tables or switching to a synchronous library with a single writer connection that starts its transactions with BEGIN IMMEDIATE.

Update: the most useful change would actually be making a distinction between read transactions and write transactions. Libraries like SQLx could enforce the distinction at compile time or runtime by inspecting the queries for the presence of write statements, or the reader pool could be configured as read-only.

Could the library prevent this footgun? Maybe, but it probably won't. If SQLx offered both a sync and async API (definitely out of scope) and differentiated between read and write statements, a write could be synchronous, which would prevent it from being held across an .await point.
However, SQLx is not an ORM and it probably isn't worth it for the library to have different methods for read versus write statements. Without that, there isn't a way to prevent write transaction locks from being held across .await points while still allowing safe read transactions to be used across them. So, in lieu of type safety to prevent this footgun, I wrote up this blog post and this pull request to include a warning about this in the docs.

Discuss on r/rust and Hacker News.

David Bushell 1 week ago

Web font choice and loading strategy

When I rebuilt my website I took great care to optimise fonts for both performance and aesthetics. Fonts account for around 50% of my website (bytes downloaded on an empty cache). I designed and set a performance budget around my font usage. I use three distinct font families and three different methods to load them.

Web fonts are usually defined by the CSS @font-face rule. The font-display property allows us some control over how fonts are loaded. The swap value has become somewhat of a best practice — at least the most common default. The CSS spec says: Gives the font face an extremely small block period (100ms or less is recommended in most cases) and an infinite swap period. In other words, the browser draws the text immediately with a fallback if the font face isn’t loaded, but swaps the font face in as soon as it loads. (CSS Fonts Module Level 4 - W3C)

That small “block period”, if implemented by the browser, renders an invisible font temporarily to minimise FOUC. Personally I default to swap and don’t change it unless there are noticeable or measurable issues. Most of the time you’ll use swap. If you don’t know which option to use, go with swap. It allows you to use custom fonts and tip your hand to accessibility. (font-display for the Masses - Jeremy Wagner)

Google Fonts defaults to swap, which has performance gains. In effect, this makes the font files themselves asynchronous—the browser immediately displays our fallback text before swapping to the web font whenever it arrives. This means we’re not going to leave users looking at any invisible text (FOIT), which makes for both a faster and more pleasant experience. (Speed Up Google Fonts - Harry Roberts) Harry further notes that a suitable fallback is important, as I’ll discover below.

My three fonts in order of importance are:

Ahkio for headings. Its soft brush stroke style has a unique hand-drawn quality that remains open and legible. As of writing, I load three Ahkio weights at a combined 150 KB. That is outright greed!
Ahkio is core to my brand so it takes priority in my performance budget (and financial budget, for that matter!) Testing revealed the 100ms † block period was not enough to avoid FOUC, despite optimisation techniques like preload. Ahkio's design is more condensed so any fallback can wrap headings over additional lines. This adds significant layout shift. † The Chrome blog mentions a zero-second block period. Firefox has a config preference default of 100ms. My solution was to use block instead of swap, which extends the block period from a recommended 0–100ms up to a much longer 3000ms. The spec says: Gives the font face a short block period (3s is recommended in most cases) and an infinite swap period. In other words, the browser draws "invisible" text at first if it's not loaded, but swaps the font face in as soon as it loads. CSS Fonts Module Level 4 - W3C This change was enough to avoid ugly FOUC under most conditions. Worst case scenario is three seconds of invisible headings. With my website's core web vitals a "slow 4G" network can beat that by half. For my audience an extended block period is an acceptable trade-off. Hosting on an edge CDN with good cache headers helps minimise the cost. Update: Richard Rutter suggested a technique which gives more fallback control than I knew of. I shall experiment and report back! Atkinson Hyperlegible Next for body copy. It's classed as a grotesque sans-serif with interesting quirks such as a serif on the lowercase 'i'. I chose this font for both its accessible design and technical implementation as a variable font. One file at 78 KB provides both weight and italic variable axes. This allows me to give links a subtle weight boost. For italics I just go full-lean. I currently load Atkinson Hyperlegible with swap out of habit, but I'm strongly considering why I don't use fallback. The spec says: Gives the font face an extremely small block period (100ms or less is recommended in most cases) and a short swap period (3s is recommended in most cases).
In other words, the font face is rendered with a fallback at first if it's not loaded, but it's swapped in as soon as it loads. However, if too much time passes, the fallback will be used for the rest of the page's lifetime instead. CSS Fonts Module Level 4 - W3C The browser can give up and presumably stop downloading the font. The spec actually says that fallback and optional "[must/should] only be used for small pieces of text." Although it notes that most browsers implement the auto default with strategies similar to block. 0xProto for code snippets. If my use of Ahkio was greedy, this is gluttonous! A system default would be acceptable. My justification is that controlling presentation of code on a web development site is reasonable. 0xProto is designed for legibility with a personality that complements my design. I don't specify 0xProto with a @font-face CSS rule. Instead I use the JavaScript font loading API to conditionally load it when a code block is present on the page. Note the font-family name change, because some browsers aren't happy with a numeric first character. Not shown is the event-listener wrapper around this code. I also load the script with both async and defer attributes. This tells the browser the script is non-critical and avoids render blocking. I could probably defer loading even later without readers noticing the font pop in. Update: for clarity, browsers will conditionally load fonts, but JavaScript can purposefully delay the loading further to avoid fighting for bandwidth. When JavaScript is not available the system default is fine. There we have it, three fonts, three strategies, and a few open questions and decisions to make. Those may be answered when CrUX data catches up. My new website is a little chunkier than before but it's well within reasonable limits. I'll monitor performance and keep turning the dials. Web performance is about priorities. In isolation it's impossible to say exactly how an individual asset should be loaded. There are upper limits, of course. How do you load a one megabyte font? You don't.
Unless you’re a font studio providing a complete type specimen. But even then you could split the font and progressively load different Unicode ranges. I wonder if anyone does that? Anyway I’m rambling now, bye. Thanks for reading! Follow me on Mastodon and Bluesky. Subscribe to my Blog and Notes or Combined feeds.


Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters

Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters Kaiyang Zhao, Kaiwen Xue, Ziqi Wang, Dan Schatzberg, Leon Yang, Antonis Manousis, Johannes Weiner, Rik Van Riel, Bikash Sharma, Chunqiang Tang, and Dimitrios Skarlatos ISCA'23 This paper has a lot of great statistics from the Meta fleet. Memory capacity per server is growing over time, but TLB size is not, thus TLB coverage (the fraction of main memory that can be referenced by the TLB at any one time) is trending downward: Source: https://dl.acm.org/doi/10.1145/3579371.3589079 As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles! Source: https://dl.acm.org/doi/10.1145/3579371.3589079 A larger page size could increase TLB coverage; however, large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world: We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page. The biggest culprit is unmovable (i.e., pinned) pages, such as buffers pinned for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault (here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages. Segmentation The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions.
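The segmentation scheme can be sketched as a toy model (my own simplification for illustration, not the paper's actual resizing algorithm):

```python
class SegmentedMemory:
    """Physical frames [0, boundary) are the unmovable region,
    [boundary, num_frames) is the movable region."""

    def __init__(self, num_frames, boundary):
        self.num_frames = num_frames
        self.boundary = boundary
        self.unmovable = set()  # frame numbers holding pinned allocations

    def alloc_unmovable(self):
        # Prefer frames far from the boundary (low frame numbers) so the
        # boundary is less likely to get stuck behind a pinned page later.
        for frame in range(self.boundary):
            if frame not in self.unmovable:
                self.unmovable.add(frame)
                return frame
        raise MemoryError("unmovable region is full")

    def shrink_boundary(self):
        # The boundary can only move down past frames holding no unmovable
        # allocation; one pinned page near it blocks any further shrinking.
        while self.boundary > 0 and (self.boundary - 1) not in self.unmovable:
            self.boundary -= 1


mem = SegmentedMemory(num_frames=10, boundary=5)
mem.alloc_unmovable()   # pins frame 0
mem.shrink_boundary()   # boundary slides down until it hits frame 0
```

Here alloc_unmovable prefers frames far from the boundary precisely so that shrink_boundary can later reclaim as much of the unmovable region as possible for movable (and thus defragmentable) memory.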
One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn’t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Hardware Page Migration The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving “unmovable” pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration. The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When the access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page. Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation. Fig. 10 has results for memory segmentation. “Partial” represents a case where physical memory is partially fragmented, whereas “full” represents full fragmentation. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Dangling Pointers If the NIC is the primary culprit, some vertical integration might be called for here. 
For example, allocations used to send packets cycle through three states: Empty, CPU writing to it, NIC reading from it. It seems like it would be fine for the OS to migrate a packet when it is not in state #3.

Max Bernstein 2 weeks ago

Type-based alias analysis in the Toy Optimizer

Another entry in the Toy Optimizer series . Last time, we did load-store forwarding in the context of our Toy Optimizer. We managed to cache the results of both reads from and writes to the heap—at compile-time! We were careful to mind object aliasing: we separated our heap information into alias classes based on what offset the reads/writes referenced. This way, if we didn’t know whether two objects aliased, we could at least know that different offsets would never alias (assuming our objects don’t overlap and memory accesses are on word-sized slots). This is a coarse-grained heuristic. Fortunately, we often have much more information available at compile-time than just the offset, so we should use it. I mentioned in a footnote that we could use type information, for example, to improve our alias analysis. We’ll add a lightweight form of type-based alias analysis (TBAA) (PDF) in this post. We return once again to Fil Pizlo land, specifically How I implement SSA form . We’re going to be using the hierarchical heap effect representation from the post in our implementation, but you can use your own type representation if you have one already. This representation divides the heap into disjoint regions by type. Consider, for example, that objects of two unrelated types do not overlap. A pointer to one type is never going to alias a pointer to the other. They can therefore be reasoned about separately. But sometimes you don’t have perfect type information available. If your language has a base class of all objects, then its heap overlaps with, say, the heap of any of its subclasses. So you need some way to represent that too—just having an enum doesn’t work cleanly. Imagine an example simplified type hierarchy where some nodes represent different parts of the runtime’s data structures, and others are segmented further into subclasses. Fil’s idea is that we can represent each node in that hierarchy with a tuple of integers (inclusive, exclusive) that represent the pre- and post-order traversals of the tree.
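A minimal sketch of this numbering scheme (the class and method names here are my own, not from the post):

```python
class AbstractHeap:
    """A node in the abstract heap hierarchy, numbered by tree position."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.begin = self.end = None

    def number(self, counter=0):
        # Pre-order: this node takes the next index...
        self.begin = counter
        counter += 1
        for child in self.children:
            counter = child.number(counter)
        # ...and its half-open range [begin, end) covers all descendants.
        self.end = counter
        return counter

    def overlaps(self, other):
        # Two heaps may alias iff their ranges intersect, i.e. one is an
        # ancestor of (or equal to) the other.
        return self.begin < other.end and other.begin < self.end


# Hypothetical hierarchy: Top > {Heap > {Int, String}, Stack}
h_int, h_string = AbstractHeap("Int"), AbstractHeap("String")
h_heap = AbstractHeap("Heap", [h_int, h_string])
h_stack = AbstractHeap("Stack")
top = AbstractHeap("Top", [h_heap, h_stack])
top.number()
```

An Int access overlaps the Heap and Top heaps (its ancestors) but not String or Stack, so a write to one sibling heap never invalidates cached loads from the other.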
Or, if tree traversals are not engraved into your bones, they represent the range of all the nested objects within them. Then the “does this write interfere with this read” check—the aliasing check—is a range overlap query. I wrote a perhaps over-engineered Python implementation of the range and heap hierarchy, based on the Ruby generator and C++ runtime code from JavaScriptCore, with a top-level call that kicks off the tree-numbering scheme. Fil’s implementation also covers a bunch of abstract heaps such as SSAState and Control because his is used for code motion and whatnot. That can be added on later, but we will not do so in this post. So there you have it: a type representation. Now we need to use it in our load-store forwarding. Recall that our load-store optimization pass, at its core, iterates over the instructions, keeping a representation of the heap at compile-time. Reads get cached, writes get cached, and writes also invalidate the compile-time information about fields that may alias. In this case, our may-alias check asks only whether the offsets overlap. This means that a unit test involving two objects of unrelated annotated types will fail: the test expects a cached write to one object to remain cached even after a write to the same offset of the other, because the two types can never alias. If we account for type information in our alias analysis, we can get this test to pass. After doing a bunch of fussing around with the load-store forwarding (many rewrites), I eventually got it down to a very short diff: if we don’t have any type/alias information, we default to “I know nothing” (the root of the heap hierarchy) for each object; then we check range overlap. The boolean logic looks a little weird, maybe, but we can also rewrite it (via De Morgan’s law) to read: keep all the cached field state about fields that are known, by offset and by type, not to alias. Maybe that is clearer (but not as nice a diff). Note that the type representation is not so important here!
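The combined offset-plus-type check can be sketched independently of any particular heap class (function names and the hard-coded ranges are mine, for illustration):

```python
def ranges_overlap(a, b):
    # a and b are (begin, end) pre/post-order numbers from the type hierarchy.
    return a[0] < b[1] and b[0] < a[1]

def may_alias(offset_a, heap_a, offset_b, heap_b):
    # Distinct word-sized slots never alias; at the same offset, two accesses
    # may alias only if their abstract heap ranges intersect.
    return offset_a == offset_b and ranges_overlap(heap_a, heap_b)


# Example ranges for a hierarchy Top > {Heap > {Int, String}, Stack}:
TOP, HEAP, INT, STRING = (0, 5), (1, 4), (2, 3), (3, 4)
```

With these ranges, an Int store at offset 0 must invalidate a cached Heap load at offset 0 (ancestor relation), but never a String load at the same offset, and never anything at a different offset.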
You could use a bitset version of the type information if you want. The important things are that you can cheaply construct types and check overlap between them. Nice, now our test passes! We can differentiate between memory accesses on objects of different types. But what if we knew more? Sometimes we know where an object came from. For example, we may have seen it get allocated in the trace. If we saw an object’s allocation, we know that it does not alias (for example) any object that was passed in via a parameter. We can use this kind of information to our advantage. For example, in a made-up IR snippet where an object is allocated mid-trace, we know that (among other facts) the new object doesn’t alias either of the incoming parameters, because we have seen its allocation site. I saw this in the old V8 IR Hydrogen’s lightweight alias analysis 1 . There is plenty of other useful information such as: if we know at compile-time that object A has 5 at offset 0 and object B has 7 at offset 0, then A and B don’t alias (thanks, CF); in the RPython JIT in PyPy, this is used to determine if two user (Python) objects don’t alias because we know the contents of the user (Python) class field; object size (though perhaps that is a special case of the above); field size/type; and deferring alias checks to run-time by emitting a branch. If you have other fun ones, please write in. We only handle loads and stores in our optimizer. Unfortunately, this means we may accidentally cache stale information. Consider: what happens if a function call (or any other opaque instruction) writes into an object we are tracking? The conservative approach is to invalidate all cached information on a function call. This is definitely correct, but it’s a bummer for the optimizer. Can we do anything? Well, perhaps we are calling a well-known function or a specific IR instruction. In that case, we can annotate it with effects in the same abstract heap model: if the instruction does not write, or only writes to some heaps, we can at least only partially invalidate our heap. However, if the function is unknown or otherwise opaque, we need at least more advanced alias information and perhaps even (partial) escape analysis. Consider: even if an instruction takes no operands, we have no idea what state it has access to. If it writes to any object A, we cannot safely cache information about any other object B unless we know for sure that A and B do not alias. And we don’t know what the instruction writes to. So we may only know we can cache information about B because it was allocated locally and has not escaped. Some runtimes such as ART pre-compute all of their alias information in a bit matrix. This makes more sense if you are using alias information in a full control-flow graph, where you might need to iterate over the graph a few times. In a trace context, you can do a lot in one single pass—no need to make a matrix. As usual, this is a toy IR and a toy optimizer, so it’s hard to say how much faster it makes its toy programs. In general, though, there is a dial for analysis and optimization that goes between precision and speed. This is a happy point on that dial, only a tiny incremental analysis cost bump above offset-only invalidation, but for higher precision. I like that tradeoff. Also, it is very useful in JIT compilers where generally the managed language is a little better-behaved than a C-like language. Somewhere in your IR there will be a lot of duplicate loads and stores from a strength reduction pass, and this can clean up the mess. Thanks for joining as I work through a small use of type-based alias analysis for myself. I hope you enjoyed. Thank you to Chris Gregory for helpful feedback. I made a fork of V8 to go spelunk around the Hydrogen IR. I reset the V8 repo to the last commit before they deleted it in favor of their new Sea of Nodes based IR called TurboFan. ↩

Sean Goedecke 2 weeks ago

Two different tricks for fast LLM inference

Anthropic and OpenAI both recently announced “fast mode”: a way to interact with their best coding model at significantly higher speeds. These two versions of fast mode are very different. Anthropic’s offers up to 2.5x tokens per second (so around 170, up from Opus 4.6’s 65). OpenAI’s offers more than 1000 tokens per second (up from GPT-5.3-Codex’s 65 tokens per second, so 15x). So OpenAI’s fast mode is six times faster than Anthropic’s 1 . However, Anthropic’s big advantage is that they’re serving their actual model. When you use their fast mode, you get real Opus 4.6, while when you use OpenAI’s fast mode you get GPT-5.3-Codex-Spark, not the real GPT-5.3-Codex. Spark is indeed much faster, but is a notably less capable model: good enough for many tasks, but it gets confused and messes up tool calls in ways that vanilla GPT-5.3-Codex would never do. Why the differences? The AI labs aren’t advertising the details of how their fast modes work, but I’m pretty confident it’s something like this: Anthropic’s fast mode is backed by low-batch-size inference, while OpenAI’s fast mode is backed by special monster Cerebras chips . Let me unpack that a bit. The tradeoff at the heart of AI inference economics is batching , because the main bottleneck is memory . GPUs are very fast, but moving data onto a GPU is not. Every inference operation requires copying all the tokens of the user’s prompt 2 onto the GPU before inference can start. Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full. A good analogy is a bus system. If you had zero batching for passengers - if, whenever someone got on a bus, the bus departed immediately - commutes would be much faster for the people who managed to get on a bus . But obviously overall throughput would be much lower, because people would be waiting at the bus stop for hours until they managed to actually get on one. 
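The tradeoff can be sketched with a toy cost model (all numbers invented for illustration): each decoding step pays a fixed cost to stream weights and cached state onto the GPU, and that cost is amortized across everyone in the batch.

```python
# Toy model: one decoding step produces one token per user in the batch.
WEIGHT_STREAM_MS = 50.0   # fixed memory-traffic cost per step (invented)
COMPUTE_MS_PER_SEQ = 1.0  # marginal compute per sequence in the batch (invented)

def step_time_ms(batch_size):
    return WEIGHT_STREAM_MS + COMPUTE_MS_PER_SEQ * batch_size

def per_user_tokens_per_sec(batch_size):
    # Each user gets exactly one token per step.
    return 1000.0 / step_time_ms(batch_size)

def aggregate_tokens_per_sec(batch_size):
    # The GPU as a whole produces batch_size tokens per step.
    return batch_size * per_user_tokens_per_sec(batch_size)
```

Shrinking the batch makes each individual user faster (shorter steps) while collapsing the GPU's aggregate throughput, which is exactly why low-batch "fast mode" has to cost several times more per token.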
Anthropic’s fast mode offering is basically a bus pass that guarantees that the bus immediately leaves as soon as you get on. It’s six times the cost, because you’re effectively paying for all the other people who could have got on the bus with you, but it’s way faster 3 because you spend zero time waiting for the bus to leave. Obviously I can’t be fully certain this is right. Maybe they have access to some new ultra-fast compute that they’re running this on, or they’re doing some algorithmic trick nobody else has thought of. But I’m pretty sure this is it. Brand new compute or algorithmic tricks would likely require changes to the model (see below for OpenAI’s system), and “six times more expensive for 2.5x faster” is right in the ballpark for the kind of improvement you’d expect when switching to a low-batch-size regime. OpenAI’s fast mode does not work anything like this. You can tell that simply because they’re introducing a new, worse model for it. There would be absolutely no reason to do that if they were simply tweaking batch sizes. Also, they told us in the announcement blog post exactly what’s backing their fast mode: Cerebras. OpenAI announced their Cerebras partnership a month ago in January. What’s Cerebras? They build “ultra low-latency compute”. What this means in practice is that they build giant chips. An H100 chip (fairly close to the frontier of inference chips) is just over a square inch in size. A Cerebras chip is 70 square inches. You can see from pictures that the Cerebras chip has a grid-and-holes pattern all over it. That’s because silicon wafers this big are supposed to be broken into dozens of chips. Instead, Cerebras etches a giant chip over the entire thing. The larger the chip, the more internal memory it can have. The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. Typically GPU SRAM is measured in the tens of megabytes.
That means that a lot of inference time is spent streaming portions of the model weights from outside of SRAM into the GPU compute 4 . If you could stream all of that from the (much faster) SRAM, inference would get a big speedup: fifteen times faster, as it turns out! So how much internal memory does the latest Cerebras chip have? 44GB . This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. That’s why they’re offering a brand new model, and why the Spark model has a bit of “small model smell” to it: it’s a smaller distil of the much larger GPT-5.3-Codex model 5 . It’s interesting that the two major labs have two very different approaches to building fast AI inference. If I had to guess at a conspiracy theory, it would go something like this: OpenAI partner with Cerebras in mid-January, obviously to work on putting an OpenAI model on a fast Cerebras chip. Anthropic have no similar play available, but they know OpenAI will announce some kind of blazing-fast inference in February, and they want to have something in the news cycle to compete with that. Anthropic thus hustle to put together the kind of fast inference they can provide: simply lowering the batch size on their existing inference stack. Anthropic (probably) wait until a few days before OpenAI are done with their much more complex Cerebras implementation to announce it, so it looks like OpenAI copied them. Obviously OpenAI’s achievement here is more technically impressive. Getting a model running on Cerebras chips is not trivial, because they’re so weird. Training a 20B or 40B param distil of GPT-5.3-Codex that is still kind-of-good-enough is not trivial. But I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI’s mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens . Seeing the two major labs put out this feature might make you think that fast AI inference is the new major goal they’re chasing. I don’t think it is. If my theory above is right, Anthropic don’t care that much about fast inference, they just didn’t want to appear behind OpenAI. And OpenAI are mainly just exploring the capabilities of their new Cerebras partnership. It’s still largely an open question what kind of models can fit on these giant chips, how useful those models will be, and if the economics will make any sense. I personally don’t find “fast, less-capable inference” particularly useful.
I’ve been playing around with it in Codex and I don’t like it. The usefulness of AI agents is dominated by how few mistakes they make , not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model 6 . However, it’s certainly possible that fast, less-capable inference becomes a core lower-level primitive in AI systems. Claude Code already uses Haiku for some operations. Maybe OpenAI will end up using Spark in a similar way. This isn’t even factoring in latency. Anthropic explicitly warns that time to first token might still be slow (or even slower), while OpenAI thinks the Spark latency is fast enough to warrant switching to a persistent websocket (i.e. they think the 50-200ms round trip time for the handshake is a significant chunk of time to first token). ↩ Either in the form of the KV-cache for previous tokens, or as some big tensor of intermediate activations if inference is being pipelined through multiple GPUs. I write a lot more about this in Why DeepSeek is cheap at scale but expensive to run locally , since it explains why DeepSeek can be offered at such cheap prices (massive batches allow an economy of scale on giant expensive GPUs, but individual consumers can’t access that at all). ↩ Is it a contradiction that low-batch-size means low throughput, but this fast pass system gives users much greater throughput? No. The overall throughput of the GPU is much lower when some users are using “fast mode”, but those users’ throughput is much higher. ↩ Remember, GPUs are fast, but copying data onto them is not. Each “copy these weights to GPU” step is a meaningful part of the overall inference time. ↩ Or a smaller distil of whatever more powerful base model GPT-5.3-Codex was itself distilled from. I don’t know how AI labs do it exactly, and they keep it very secret. More on that here . ↩ On this note, it’s interesting to point out that Cursor’s hype dropped away basically at the same time they released their own “much faster, a little less-capable” agent model. Of course, much of this is due to Claude Code sucking up all the oxygen in the room, but having a very fast model certainly didn’t help . ↩


Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs

Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs Noushin Azami, Alex Fallin, and Martin Burtscher ASPLOS'25 This paper describes four (creatively named) lossless compression algorithms: SPspeed: 32-bit floating-point, optimized for speed DPspeed: 64-bit floating-point, optimized for speed SPratio: 32-bit floating-point, optimized for compression ratio DPratio: 64-bit floating-point, optimized for compression ratio The claim to fame here is excellent performance on both CPUs and GPUs. Each compressor is implemented as a pipeline of transformations, illustrated in Fig. 1: Source: https://dl.acm.org/doi/10.1145/3669940.3707280 Decompression is similar, but the order of stages is reversed. The goal of DIFFMS is to transform the data such that most of the upper bits of each element to be compressed are 0. DIFFMS interprets inputs as integers (32-bit or 64-bit) and replaces element i with the difference between element i and element i-1. Differences are stored in sign-magnitude format. Neighboring values typically have small differences, so fewer bits are needed to store differences than raw values. Converting to sign-magnitude causes the upper bits of negative differences (which are close to zero) to be zero. The sign bit is stored in the least significant position, to ensure that it doesn't contaminate the most-significant bit with a one. The range of representable values in sign-magnitude format is one less than the range of values representable in two's complement, but I suppose that is OK because that situation can only arise if an input float is NaN. Next, MPLG (introduced in a previous paper) operates on chunks of elements. For each chunk, MPLG finds the element with the highest-order non-zero bit and uses that to determine the number of leading bits which can safely be removed from each element in the chunk. This number of leading bits is stored in per-chunk metadata, and the input stream is bit-packed after removing those leading bits. Fig. 3 illustrates the bit packing: Source: https://dl.acm.org/doi/10.1145/3669940.3707280 BIT The MPLG stage has the "one bad apple spoils the bunch" property. The BIT stage addresses that with a transpose. Each chunk is interpreted as a 2D array of bits, and that 2D array is transposed. Say most elements in a chunk only require the least significant 8 bits, but one element requires 10 bits. Then after the MPLG stage, most elements will have 2 leading zeros in them. After the BIT stage, these zeros will all be grouped together rather than spaced apart. After BIT has arranged the data such that there are many long runs of zero bits in it, Repeated Zero Elimination (RZE) finds and removes bytes which are equal to zero. RZE produces both the output stream (with "zero bytes" removed) and a bitmap indicating which bytes were removed. The authors found that RZE doesn't work well for the low-order bits of double-precision data. They address this with RAZE and RARE, which do not try to eliminate ranges of zeros from the low-order bits. Each stage in the pipeline operates on chunks of data. The only interaction between chunks is that the size of the output produced by chunk N is needed before the output offset of chunk N+1 is known. Efficient parallelization is possible in spite of this hazard in both CPU and GPU implementations. As far as I can tell, there is no explicit attempt to ensure that the data passed between stages is kept on-chip. It could be that CPU and GPU caches work well enough for this purpose. These algorithms extend the Pareto frontier for both CPU and GPU.
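As a quick sketch of the DIFFMS idea (my own reconstruction of the single-precision path for illustration, not the paper's code): reinterpret each float's bits as an integer, take the delta from the previous element, and store the delta in sign-magnitude form with the sign bit moved into the least significant position.

```python
import struct

def diffms(floats):
    """Sketch of a DIFFMS-style transform: bitwise deltas between
    neighbors, stored sign-magnitude with the sign bit in the LSB."""
    ints = [struct.unpack("<I", struct.pack("<f", x))[0] for x in floats]
    out, prev = [], 0
    for v in ints:
        diff = (v - prev) & 0xFFFFFFFF       # two's-complement delta
        if diff & 0x80000000:                # negative delta
            mag = (-diff) & 0xFFFFFFFF       # magnitude of the delta
            out.append(((mag << 1) | 1) & 0xFFFFFFFF)
        else:
            out.append((diff << 1) & 0xFFFFFFFF)
        prev = v
    return out

print([hex(x) for x in diffms([1.0, 1.0, 1.5])])
```

Runs of equal or slowly varying values collapse to small integers with many leading zero bits, which is exactly what the MPLG stage then strips.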
Source: https://dl.acm.org/doi/10.1145/3669940.3707280 Dangling Pointers The state of the art here feels very hand-crafted, somewhat analogous to image classification before AlexNet moved the field from handcrafted feature engineering to more generalized models. In the text compression space, LZ-based compressors leave the same aftertaste.

Farid Zakaria 2 weeks ago

Creating massively huge fake files and binaries

I was writing a test case to support "thunks" [ llvm#180266 ], which uses a linker script to place two sections very far apart (8 GiB) in the virtual address space. After linking a trivially small assembly file, I looked at the resulting binary and was confused: 8 GiB. For what amounts to a handful of instructions. 😲 What's going on? And where did all that space come from? Turns out the size reported is the logical (apparent) size of the file, which is simply an integer stored in the inode metadata. It represents the offset of the last byte written. Since one of the sections lives at a very high address (~8 GiB), the file's logical size extends out that far even though the actual code is tiny. The real story is the on-disk usage: 12 KiB on disk. The file is sparse. 🤓 A sparse file is one where the filesystem doesn't bother allocating blocks for regions that are all zeros. The filesystem (ext4, btrfs, etc.) stores a mapping of logical file offsets to physical disk blocks in the inode's extent tree. For a sparse file, there are simply no extents for the hole regions. For our 8 GiB binary, the extent tree contains only the handful of small extents that actually hold data; the same information can also be viewed in a slightly more condensed form. When something reads the file: the virtual filesystem (VFS) receives a read at some offset; the filesystem looks up the extent tree for that offset; if an extent is found, it reads from the physical disk block; if there is no extent (a hole), the kernel fills the buffer with zeros, with no disk I/O. You don't need a linker to create sparse files. A single command will do it: a 1 PiB file that takes zero bytes on disk. Writing a byte after seeking past the end works too. Both produce the same result: a file whose logical size is 1 PiB but whose on-disk footprint is effectively nothing.
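The same experiment is easy to reproduce without a linker. Here is a Python sketch (standard library only): seek past the end, write one byte, then compare the logical size (st_size) against the physical footprint (st_blocks, which POSIX counts in 512-byte units).

```python
import os
import tempfile

def make_sparse(path, logical_size):
    # Seeking past EOF and writing one byte sets the logical size;
    # the filesystem allocates no extents for the hole before it.
    with open(path, "wb") as f:
        f.seek(logical_size - 1)
        f.write(b"\0")

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "huge.bin")
    make_sparse(path, 1 << 30)       # 1 GiB logical size
    st = os.stat(path)
    print(st.st_size)                # the logical (apparent) size
    print(st.st_blocks * 512)        # bytes actually allocated on disk
```

On a filesystem without sparse-file support the two numbers would match; on ext4 or btrfs the second is a few KiB at most.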

Jim Nielsen 2 weeks ago

Unresponsive Buttons on My Fastest Hardware Ever

This is one of those small things that drives me nuts. Why? I don’t know. I think it has something to do with the fact that I have a computer that is faster than any computer I’ve ever used in my entire life — and yet, clicking on buttons results in slight but perceptible delays. Let me explain. Imagine a simple “Upgrade” button. For SPA apps, when the user clicks that button it takes a split second (even on a fast connection) for anything to happen because: the browser makes a request to the server, the server talks to Stripe to get a session, the server responds with the session data to the client, and the client redirects. When clicking on that button, even on a fast connection, my brain glitches for a second, my thought process going something like: [nothing happens] I think “Did that work?” Just as I’m about to click again, I see the URL bar change. I think, “Oh, ok, it’s doing something.” I stop myself from clicking again while I wait for the UI to redraw. Granted those thoughts occur in my brain in under a second, but I hate that pause of indetermination. I clicked, I want (perceptibly) instant feedback. If something is happening, tell me! For SPA apps, you could put some state in there, like swapping the button’s text while the request is in flight. This would provide more immediate feedback. But it also raises a whole set of other questions: Is that actually the interaction you want, where the text changes? That’s probably gonna shift layout. Maybe you want something different, like a spinner in place of the text. How do you handle that? What if you have multiple places to upgrade? Do you have to implement state in all those places too? What if the trigger in each place is slightly different? A button here, some text there, an icon over yonder? How do you handle all of those different interactions in a standard, immediate way? And errors. What if it fails? Well, we already weren’t handling that in the first code example, were we? But maybe we should… Oh boy, this is getting complicated, isn’t it? This is why, I assume, lots of apps just don’t deal with it. They accept there will be a slight delay in the responsiveness of the UI (and that it might error, but the user can just click again) and justify that it’s really not that big of a deal if there’s a slight, almost imperceptible delay between clicking a button and seeing the UI respond. “We’ve got bigger fish to fry.” And it makes sense. I mean, a slight delay in UI responsiveness, is that why people will or won’t buy your thing? Seems like a small detail. Who’s got the time to spend on details like this? Who cares? I care. That’s why I’m writing this post. To my original point, every piece of hardware I currently own is the fastest version of that device I’ve ever had in my life. And yet, everywhere I go I encounter lag. Lag everywhere. And I’m grumpy about it, hence this post.


The LLM Context Tax: Best Tips for Tax Avoidance

Every token you send to an LLM costs money. Every token increases latency. And past a certain point, every additional token makes your agent dumber. This is the triple penalty of context bloat: higher costs, slower responses, and degraded performance through context rot, where the agent gets lost in its own accumulated noise. Context engineering matters: the difference between a $0.50 query and a $5.00 query is often just how thoughtfully you manage context. Here’s what I’ll cover: Stable Prefixes for KV Cache Hits - The single most important optimization for production agents Append-Only Context - Why mutating context destroys your cache hit rate Store Tool Outputs in the Filesystem - Cursor’s approach to avoiding context bloat Design Precise Tools - How smart tool design reduces token consumption by 10x Clean Your Data First (Maximize Your Deductions) - Strip the garbage before it enters context Delegate to Cheaper Subagents (Offshore to Tax Havens) - Route token-heavy operations to smaller models Reusable Templates Over Regeneration (Standard Deductions) - Stop regenerating the same code The Lost-in-the-Middle Problem - Strategic placement of critical information Server-Side Compaction (Depreciation) - Let the API handle context decay automatically Output Token Budgeting (Withholding Tax) - The most expensive tokens are the ones you generate The 200K Pricing Cliff (The Tax Bracket) - The tax bracket that doubles your bill overnight Parallel Tool Calls (Filing Jointly) - Fewer round trips, less context accumulation Application-Level Response Caching (Tax-Exempt Status) - The cheapest token is the one you never send With Claude Opus 4.6, the math is brutal: there is a 10x difference between cached and uncached input pricing, and output tokens cost 5x more than uncached inputs. Most agent builders focus on prompt engineering while hemorrhaging money on context inefficiency.
In most agent workflows, context grows substantially with each step while outputs remain compact. This makes input token optimization critical: a typical agent task might involve 50 tool calls, each accumulating context. The performance penalty is equally severe. Research shows that past 32K tokens, most models show sharp performance degradation. Your agent isn’t just getting expensive. It’s getting confused. This is the single most important metric for production agents: KV cache hit rate. The Manus team considers this the most important optimization for their agent infrastructure, and I agree completely. The principle is simple: LLMs process prompts autoregressively, token by token. If your prompt starts identically to a previous request, the model can reuse cached key-value computations for that prefix. The killer of cache hit rates? Timestamps. A common mistake is including a timestamp at the beginning of the system prompt. It’s a simple mistake but the impact is massive. The key is granularity: including the date is fine. Including the hour is acceptable since cache durations are typically 5 minutes (Anthropic default) to 10 minutes (OpenAI default), with longer options available. But never include seconds or milliseconds. A timestamp precise to the second guarantees every single request has a unique prefix. Zero cache hits. Maximum cost. Move all dynamic content (including timestamps) to the END of your prompt. System instructions, tool definitions, few-shot examples, all of these should come first and remain identical across requests. For distributed systems, ensure consistent request routing. Use session IDs to route requests to the same worker, maximizing the chance of hitting warm caches. Context should be append-only. Any modification to earlier content invalidates the KV cache from that point forward. This seems obvious, but the violations are subtle. The tool definition problem is particularly insidious.
If you dynamically add or remove tools based on context, you invalidate the cache for everything after the tool definitions. Manus solved this elegantly: instead of removing tools, they mask token logits during decoding to constrain which actions the model can select. The tool definitions stay constant (cache preserved), but the model is guided toward valid choices through output constraints. For simpler implementations, keep your tool definitions static and handle invalid tool calls gracefully in your orchestration layer. Deterministic serialization matters too. Python dicts serialize in insertion order, so two logically identical payloads built in different orders produce different JSON. If you’re serializing tool definitions or context as JSON, use sort_keys=True or a library that guarantees deterministic output. A different key order = different tokens = cache miss. Cursor’s approach to context management changed how I think about agent architecture. Instead of stuffing tool outputs into the conversation, write them to files. In their A/B testing, this reduced total agent tokens by 46.9% for runs using MCP tools. The insight: agents don’t need complete information upfront. They need the ability to access information on demand. Files are the perfect abstraction for this. We apply this pattern everywhere: Shell command outputs : Write to files, let agent tail or grep as needed Search results : Return file paths, not full document contents API responses : Store raw responses, let agent extract what matters Intermediate computations : Persist to disk, reference by path When context windows fill up, Cursor triggers a summarization step but exposes chat history as files. The agent can search through past conversations to recover details lost in the lossy compression. Clever. A vague tool returns everything. A precise tool returns exactly what the agent needs. Consider an email search tool. The two-phase pattern: search returns metadata; a separate tool returns full content. The agent decides which items deserve full retrieval.
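A minimal sketch of that two-phase shape (the mailbox, function names, and fields here are hypothetical, not any real API):

```python
# Hypothetical mailbox: the point is that search returns metadata only,
# and the agent pays for a full body only when it asks for one.
EMAILS = {
    "e1": {"sender": "ir@example.com", "subject": "Q3 earnings",
           "has_attachment": True, "body": "Full 5,000-token transcript..."},
    "e2": {"sender": "news@example.com", "subject": "Daily digest",
           "has_attachment": False, "body": "Another long body..."},
}

def search_emails(sender=None, has_attachment=None, limit=100):
    """Phase 1: return IDs and metadata only, never bodies."""
    hits = []
    for eid, e in EMAILS.items():
        if sender is not None and e["sender"] != sender:
            continue
        if has_attachment is not None and e["has_attachment"] != has_attachment:
            continue
        hits.append({"id": eid, "sender": e["sender"], "subject": e["subject"]})
    return hits[:limit]

def read_email(email_id):
    """Phase 2: full content, only for the IDs the agent selects."""
    return EMAILS[email_id]["body"]
```

The agent calls search_emails first, inspects the metadata, and only pays the token cost of read_email for the handful of IDs that matter.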
This is exactly how our conversation history tool works at Fintool. It takes date ranges or search terms and returns up to 100-200 results with only user messages and metadata. The agent then reads specific conversations by passing the conversation ID. Filter parameters like has_attachment, time_range, and sender let the agent narrow results before reading anything. The same pattern applies everywhere: Document search : Return titles and snippets, not full documents Database queries : Return row counts and sample rows, not full result sets File listings : Return paths and metadata, not contents API integrations : Return summaries, let agent drill down Each parameter you add to a tool is a chance to reduce returned tokens by an order of magnitude. Garbage tokens are still tokens. Clean your data before it enters context. This applies to emails as much as anything else, and for HTML content the gains are even larger. A typical webpage might be 100KB of HTML but only 5KB of actual content. CSS selectors that extract semantic regions (article, main, section) and discard navigation, ads, and tracking can reduce token counts by 90%+. Markdown uses significantly fewer tokens than HTML, making conversion valuable for any web content entering your pipeline. For financial data specifically: Strip SEC filing boilerplate (every 10-K has the same legal disclaimers) Collapse repeated table headers across pages Remove watermarks and page numbers from extracted text Normalize whitespace (multiple spaces, tabs, excessive newlines) Convert HTML tables to markdown tables The principle: remove noise at the earliest possible stage, not after tokenization. Every preprocessing step that runs before the LLM call saves money and improves quality. Not every task needs your most expensive model. The Claude Code subagent pattern processes 67% fewer tokens overall due to context isolation.
Instead of stuffing every intermediate search result into a single global context, workers keep only what’s relevant inside their own window and return distilled outputs. Tasks perfect for cheaper subagents: Data extraction : Pull specific fields from documents Classification : Categorize emails, documents, or intents Summarization : Compress long documents before the main agent sees them Validation : Check outputs against criteria Formatting : Convert between data formats The orchestrator sees condensed results, not raw context. This prevents hitting context limits and reduces the risk of the main agent getting confused by irrelevant details. Scope subagent tasks tightly. The more iterations a subagent requires, the more context it accumulates and the more tokens it consumes. Design for single-turn completion when possible. Every time an agent generates code from scratch, you’re paying for output tokens. Output tokens cost 5x input tokens with Claude. Stop regenerating the same patterns. Our document generation workflow used to be painfully inefficient.

OLD APPROACH: User: “Create a DCF model for Apple” Agent: *generates 2,000 lines of Excel formulas from scratch* Cost: ~$0.50 in output tokens alone

NEW APPROACH: User: “Create a DCF model for Apple” Agent: *loads DCF template, fills in Apple-specific values* Cost: ~$0.05

The template approach: Skill references template : dcf_template.xlsx in /public/skills/dcf/ Agent reads template once : Understands structure and placeholders Agent fills parameters : Company-specific values, assumptions WriteFile with minimal changes : Only modified cells, not full regeneration For code generation, the same principle applies. If your agent frequently generates similar Python scripts, data processing pipelines, or analysis frameworks, create reusable functions:

```python
# Instead of regenerating this every time:
def process_earnings_transcript(path):
    ...  # 50 lines of parsing code

# Reference a skill with reusable utilities:
from skills.earnings import parse_transcript, extract_guidance
```

The agent imports and calls rather than regenerates. Fewer output tokens, faster responses, more consistent results. LLMs don’t process context uniformly. Research shows a consistent U-shaped attention pattern: models attend strongly to the beginning and end of prompts while “losing” information in the middle. Strategic placement matters: System instructions : Beginning (highest attention) Current user request : End (recency bias) Critical context : Beginning or end, never middle Lower-priority background : Middle (acceptable loss) For retrieval-augmented generation, this means reordering retrieved documents. The most relevant chunks should go at the beginning and end. Lower-ranked chunks fill the middle. Manus uses an elegant hack: they maintain a todo.md file that gets updated throughout task execution. This “recites” current objectives at the end of context, combating the lost-in-the-middle effect across their typical 50-tool-call trajectories. We use a similar architecture at Fintool. As agents run, context grows until it hits the window limit. You used to have two options: build your own summarization pipeline, or implement observation masking (replacing old tool outputs with placeholders). Both require significant engineering. Now you can let the API handle it. Anthropic’s server-side compaction automatically summarizes your conversation when it approaches a configurable token threshold. Claude Code uses this internally, and it’s the reason you can run 50+ tool call sessions without the agent losing track of what it’s doing. The key design decisions: Trigger threshold : Default is 150K tokens. Set it lower if you want to stay under the 200K pricing cliff, or higher if you need more raw context before summarizing. Custom instructions : You can replace the default summarization prompt entirely.
For financial workflows, something like “Preserve all numerical data, company names, and analytical conclusions” prevents the summary from losing critical details. Pause after compaction : The API can pause after generating the summary, letting you inject additional context (like preserving the last few messages verbatim) before continuing. This gives you control over what survives the compression. Compaction also stacks well with prompt caching. Add a cache breakpoint on your system prompt so it stays cached separately. When compaction occurs, only the summary needs to be written as a new cache entry. Your system prompt cache stays warm. The beauty of this approach: context depreciates in value over time, and the API handles the depreciation schedule for you. Output tokens are the most expensive tokens. With Claude Sonnet, outputs cost 5x inputs. With Opus, they cost 5x inputs that are already expensive. Yet most developers leave max_tokens unlimited and hope for the best.

```python
# BAD: Unlimited output
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,  # Model might use all of this
    messages=[...],
)

# GOOD: Task-appropriate limits
TASK_LIMITS = {
    "classification": 50,
    "extraction": 200,
    "short_answer": 500,
    "analysis": 2000,
    "code_generation": 4000,
}
```

Structured outputs reduce verbosity. JSON responses use fewer tokens than natural language explanations of the same information. Natural language: “The company’s revenue was 94.5 billion dollars, which represents a year-over-year increase of 12.3 percent compared to the previous fiscal year’s revenue of 84.2 billion dollars.” Structured: {"revenue": 94.5, "unit": "B", "yoy_change": 12.3} For agents specifically, consider response chunking.
Instead of generating a 10,000-token analysis in one shot, break it into phases: Outline phase : Generate structure (500 tokens) Section phases : Generate each section on demand (1000 tokens each) Review phase : Check and refine (500 tokens) This gives you control points to stop early if the user has what they need, rather than always generating the maximum possible output. With Claude Opus 4.6 and Sonnet 4.5, crossing 200K input tokens triggers premium pricing: Opus input goes from $5 to $10 per million tokens (double), and output jumps from $25 to $37.50 (1.5x). This isn’t gradual. It’s a cliff. This is the LLM equivalent of a tax bracket. And just like tax planning, the right strategy is to stay under the threshold when you can. For agent workflows that risk crossing 200K, implement a context budget. Track cumulative input tokens across tool calls. When you approach the cliff, trigger aggressive compression: observation masking, summarization of older turns, or pruning low-value context. The cost of a compression step is far less than paying the premium per-token rate for the rest of the conversation. Every sequential tool call is a round trip. Each round trip re-sends the full conversation context. If your agent makes 20 tool calls sequentially, that’s 20 times the context gets transmitted and billed. The Anthropic API supports parallel tool calls: the model can request multiple independent tool calls in a single response, and you execute them simultaneously. This means fewer round trips for the same amount of work. The savings compound. With fewer round trips, you accumulate less intermediate context, which means each subsequent round trip is also cheaper. Design your tools so that independent operations can be identified and batched by the model. The cheapest token is the one you never send to the API. Before any LLM call, check if you’ve already answered this question.
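A minimal sketch of application-level response caching (the key scheme, TTL, and stub LLM call are mine, not any particular library's API):

```python
import time

_CACHE = {}  # key -> (expires_at, response)

def cached_llm_call(key, ttl_seconds, compute):
    """Application-level cache: short-circuit the API call entirely
    when a still-valid response exists for this key."""
    now = time.time()
    hit = _CACHE.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]                    # no tokens sent at all
    response = compute()                 # the actual (expensive) LLM call
    _CACHE[key] = (now + ttl_seconds, response)
    return response

# Usage: match the TTL to the volatility of the underlying data.
calls = []
def fake_llm():
    calls.append(1)
    return "AAPL earnings summary"

a = cached_llm_call("summary:AAPL:2025Q4", ttl_seconds=86400, compute=fake_llm)
b = cached_llm_call("summary:AAPL:2025Q4", ttl_seconds=86400, compute=fake_llm)
# Only the first request invoked fake_llm; the second was served from cache.
```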
At Fintool, we cache aggressively for earnings call summarizations and common queries. When a user asks for Apple’s latest earnings summary, we don’t regenerate it from scratch for every request. The first request pays the full cost. Every subsequent request is essentially free. This operates above the LLM layer entirely. It’s not prompt caching or KV cache. It’s your application deciding that this query has a valid cached response and short-circuiting the API call. Good candidates for application-level caching: Factual lookups : Company financials, earnings summaries, SEC filings Common queries : Questions that many users ask about the same data Deterministic transformations : Data formatting, unit conversions Stable analysis : Any output that won’t change until the underlying data changes The cache invalidation strategy matters. For financial data, earnings call summaries are stable once generated. Real-time price data obviously isn’t. Match your cache TTL to the volatility of the underlying data. Even partial caching helps. If an agent task involves five tool calls and you can cache two of them, you’ve cut 40% of your tool-related token costs without touching the LLM. The Meta Lesson Context engineering isn’t glamorous. It’s not the exciting part of building agents. But it’s the difference between a demo that impresses and a product that scales with decent gross margin. The best teams building sustainable agent products are obsessing over token efficiency the same way database engineers obsess over query optimization. Because at scale, every wasted token is money on fire. The context tax is real. But with the right architecture, it’s largely avoidable.
This is the triple penalty of context bloat: higher costs, slower responses, and degraded performance through context rot, where the agent gets lost in its own accumulated noise. Context engineering is very important. The difference between a $0.50 query and a $5.00 query is often just how thoughtfully you manage context. Here’s what I’ll cover: Stable Prefixes for KV Cache Hits - The single most important optimization for production agents Append-Only Context - Why mutating context destroys your cache hit rate Store Tool Outputs in the Filesystem - Cursor’s approach to avoiding context bloat Design Precise Tools - How smart tool design reduces token consumption by 10x Clean Your Data First (Maximize Your Deductions) - Strip the garbage before it enters context Delegate to Cheaper Subagents (Offshore to Tax Havens) - Route token-heavy operations to smaller models Reusable Templates Over Regeneration (Standard Deductions) - Stop regenerating the same code The Lost-in-the-Middle Problem - Strategic placement of critical information Server-Side Compaction (Depreciation) - Let the API handle context decay automatically Output Token Budgeting (Withholding Tax) - The most expensive tokens are the ones you generate The 200K Pricing Cliff (The Tax Bracket) - The tax bracket that doubles your bill overnight Parallel Tool Calls (Filing Jointly) - Fewer round trips, less context accumulation Application-Level Response Caching (Tax-Exempt Status) - The cheapest token is the one you never send That’s a 10x difference between cached and uncached inputs. Output tokens cost 5x more than uncached inputs. Most agent builders focus on prompt engineering while hemorrhaging money on context inefficiency. In most agent workflows, context grows substantially with each step while outputs remain compact. This makes input token optimization critical: a typical agent task might involve 50 tool calls, each accumulating context. The performance penalty is equally severe. 
Research shows that past 32K tokens, most models show sharp performance degradation. Your agent isn’t just getting expensive. It’s getting confused. Stable Prefixes for KV Cache Hits This is the single most important metric for production agents: KV cache hit rate. The Manus team considers this the most important optimization for their agent infrastructure, and I agree completely. The principle is simple: LLMs process prompts autoregressively, token by token. If your prompt starts identically to a previous request, the model can reuse cached key-value computations for that prefix. The killer of cache hit rates? Timestamps. A common mistake is including a timestamp at the beginning of the system prompt. It’s a simple mistake but the impact is massive. The key is granularity: including the date is fine. Including the hour is acceptable since cache durations are typically 5 minutes (Anthropic default) to 10 minutes (OpenAI default), with longer options available. But never include seconds or milliseconds. A timestamp precise to the second guarantees every single request has a unique prefix. Zero cache hits. Maximum cost. Move all dynamic content (including timestamps) to the END of your prompt. System instructions, tool definitions, few-shot examples, all of these should come first and remain identical across requests. For distributed systems, ensure consistent request routing. Use session IDs to route requests to the same worker, maximizing the chance of hitting warm caches. Append-Only Context Context should be append-only. Any modification to earlier content invalidates the KV cache from that point forward. This seems obvious but the violations are subtle: The tool definition problem is particularly insidious. If you dynamically add or remove tools based on context, you invalidate the cache for everything after the tool definitions. 
Manus solved this elegantly: instead of removing tools, they mask token logits during decoding to constrain which actions the model can select. The tool definitions stay constant (cache preserved), but the model is guided toward valid choices through output constraints. For simpler implementations, keep your tool definitions static and handle invalid tool calls gracefully in your orchestration layer. Deterministic serialization matters too. Python dicts don’t guarantee order. If you’re serializing tool definitions or context as JSON, use sort_keys=True or a library that guarantees deterministic output. A different key order = different tokens = cache miss. Store Tool Outputs in the Filesystem Cursor’s approach to context management changed how I think about agent architecture. Instead of stuffing tool outputs into the conversation, write them to files. In their A/B testing, this reduced total agent tokens by 46.9% for runs using MCP tools. The insight: agents don’t need complete information upfront. They need the ability to access information on demand. Files are the perfect abstraction for this. We apply this pattern everywhere: Shell command outputs : Write to files, let agent tail or grep as needed Search results : Return file paths, not full document contents API responses : Store raw responses, let agent extract what matters Intermediate computations : Persist to disk, reference by path The two-phase pattern: search returns metadata, separate tool returns full content. The agent decides which items deserve full retrieval. This is exactly how our conversation history tool works at Fintool. It passes date ranges or search terms and returns up to 100-200 results with only user messages and metadata. The agent then reads specific conversations by passing the conversation ID. Filter parameters like has_attachment, time_range, and sender let the agent narrow results before reading anything. 
The same pattern applies everywhere: Document search : Return titles and snippets, not full documents Database queries : Return row counts and sample rows, not full result sets File listings : Return paths and metadata, not contents API integrations : Return summaries, let agent drill down For HTML content, the gains are even larger. A typical webpage might be 100KB of HTML but only 5KB of actual content. CSS selectors that extract semantic regions (article, main, section) and discard navigation, ads, and tracking can reduce token counts by 90%+. Markdown uses significantly fewer tokens than HTML , making conversion valuable for any web content entering your pipeline. For financial data specifically: Strip SEC filing boilerplate (every 10-K has the same legal disclaimers) Collapse repeated table headers across pages Remove watermarks and page numbers from extracted text Normalize whitespace (multiple spaces, tabs, excessive newlines) Convert HTML tables to markdown tables The Claude Code subagent pattern processes 67% fewer tokens overall due to context isolation. Instead of stuffing every intermediate search result into a single global context, workers keep only what’s relevant inside their own window and return distilled outputs. Tasks perfect for cheaper subagents: Data extraction : Pull specific fields from documents Classification : Categorize emails, documents, or intents Summarization : Compress long documents before main agent sees them Validation : Check outputs against criteria Formatting : Convert between data formats Scope subagent tasks tightly. The more iterations a subagent requires, the more context it accumulates and the more tokens it consumes. Design for single-turn completion when possible. Reusable Templates Over Regeneration (Standard Deductions) Every time an agent generates code from scratch, you’re paying for output tokens. Output tokens cost 5x input tokens with Claude. Stop regenerating the same patterns. 
Our document generation workflow used to be painfully inefficient:

OLD APPROACH:
User: "Create a DCF model for Apple"
Agent: *generates 2,000 lines of Excel formulas from scratch*
Cost: ~$0.50 in output tokens alone

NEW APPROACH:
User: "Create a DCF model for Apple"
Agent: *loads DCF template, fills in Apple-specific values*
Cost: ~$0.05

The template approach:

- Skill references template: dcf_template.xlsx in /public/skills/dcf/
- Agent reads template once: understands structure and placeholders
- Agent fills parameters: company-specific values, assumptions
- WriteFile with minimal changes: only modified cells, not full regeneration

Strategic placement matters:

- System instructions: beginning (highest attention)
- Current user request: end (recency bias)
- Critical context: beginning or end, never the middle
- Lower-priority background: middle (acceptable loss)

The key design decisions:

- Trigger threshold: the default is 150K tokens. Set it lower if you want to stay under the 200K pricing cliff, or higher if you need more raw context before summarizing.
- Custom instructions: you can replace the default summarization prompt entirely. For financial workflows, something like "Preserve all numerical data, company names, and analytical conclusions" prevents the summary from losing critical details.
- Pause after compaction: the API can pause after generating the summary, letting you inject additional context (like preserving the last few messages verbatim) before continuing. This gives you control over what survives the compression.

For long generations, work in phases rather than emitting everything at once:

- Outline phase: generate structure (500 tokens)
- Section phases: generate each section on demand (1000 tokens each)
- Review phase: check and refine (500 tokens)

This is the LLM equivalent of a tax bracket. And just like tax planning, the right strategy is to stay under the threshold when you can. For agent workflows that risk crossing 200K, implement a context budget. Track cumulative input tokens across tool calls.
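The context budget just described can be a very small piece of bookkeeping. A sketch, with the 150K trigger and hypothetical token counts:

```python
TRIGGER = 150_000   # default compaction trigger from the text; tune to taste

class ContextBudget:
    """Track cumulative input tokens across tool calls and flag when to compact."""
    def __init__(self, trigger=TRIGGER):
        self.trigger = trigger
        self.total = 0

    def add(self, tokens):
        self.total += tokens

    def should_compact(self):
        return self.total >= self.trigger

budget = ContextBudget()
for tool_output_tokens in [40_000, 60_000, 55_000]:  # e.g., three large tool results
    budget.add(tool_output_tokens)
    if budget.should_compact():
        # Here you'd trigger aggressive compression: observation masking,
        # summarization of older turns, pruning low-value context.
        break
```

In a real agent loop the token counts would come from the API's usage metadata rather than hard-coded numbers.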
When you approach the cliff, trigger aggressive compression: observation masking, summarization of older turns, or pruning low-value context. The cost of a compression step is far less than doubling your per-token rate for the rest of the conversation.

Parallel Tool Calls (Filing Jointly)

Every sequential tool call is a round trip. Each round trip re-sends the full conversation context. If your agent makes 20 tool calls sequentially, that's 20 times the context gets transmitted and billed. The Anthropic API supports parallel tool calls: the model can request multiple independent tool calls in a single response, and you execute them simultaneously. This means fewer round trips for the same amount of work. The savings compound. With fewer round trips, you accumulate less intermediate context, which means each subsequent round trip is also cheaper. Design your tools so that independent operations can be identified and batched by the model.

Application-Level Response Caching (Tax-Exempt Status)

The cheapest token is the one you never send to the API. Before any LLM call, check if you've already answered this question. At Fintool, we cache aggressively for earnings call summarizations and common queries. When a user asks for Apple's latest earnings summary, we don't regenerate it from scratch for every request. The first request pays the full cost. Every subsequent request is essentially free. This operates above the LLM layer entirely. It's not prompt caching or KV cache. It's your application deciding that this query has a valid cached response and short-circuiting the API call.

Good candidates for application-level caching:

- Factual lookups: company financials, earnings summaries, SEC filings
- Common queries: questions that many users ask about the same data
- Deterministic transformations: data formatting, unit conversions
- Stable analysis: any output that won't change until the underlying data changes
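The parallel tool execution described above boils down to: when one model response contains several independent tool calls, run them concurrently and return all results in a single follow-up turn. A sketch with hypothetical tools:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for real tools the model requested in ONE response.
def get_filing(ticker):
    return f"{ticker}: 10-K"

def get_price(ticker):
    return f"{ticker}: $123"

def run_tool_calls(calls):
    """Execute independent tool calls from a single model turn concurrently,
    so all their results go back to the model in one round trip."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, *args) for fn, args in calls]
        return [f.result() for f in futures]  # preserves request order

results = run_tool_calls([(get_filing, ("AAPL",)), (get_price, ("AAPL",))])
```

Threads suit I/O-bound tool calls (HTTP requests, database queries); the same shape works with asyncio if your tools are already async.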
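The application-level cache just described needs only a stable key and an invalidation story. A minimal in-memory sketch (in production this would be Redis or similar; the normalization here is deliberately crude):

```python
import hashlib
import json

class ResponseCache:
    """Short-circuit the LLM call when the same query over the same
    underlying data has already been answered."""
    def __init__(self):
        self._store = {}

    def _key(self, query, data_version):
        # Include a data version so entries invalidate when the underlying
        # data (e.g., a new filing or earnings call) changes.
        payload = json.dumps({"q": query.strip().lower(), "v": data_version}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_compute(self, query, data_version, compute):
        k = self._key(query, data_version)
        if k not in self._store:
            self._store[k] = compute()   # the only path that pays API cost
        return self._store[k]

calls = []
def expensive_llm_call():
    calls.append(1)
    return "AAPL Q4 summary: revenue up ..."

cache = ResponseCache()
a = cache.get_or_compute("Apple latest earnings summary", "2025-Q4", expensive_llm_call)
b = cache.get_or_compute("  apple LATEST earnings summary", "2025-Q4", expensive_llm_call)
# Same normalized query + same data version -> second request never hits the API.
```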


EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation

EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation Weigao Su and Vishal Shrivastav ASPLOS'25

This paper describes incremental changes to Ethernet NICs and switches to enable efficient disaggregation of memory without the need for a separate network (e.g., CXL) for memory traffic. Fig. 1 shows the north star:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Servers are partitioned into Compute Nodes and Memory Nodes. When a compute node wants to access remote memory, it issues a request to its local NIC, which sends the request to the correct memory node (via a switch). The key problem this paper addresses is Ethernet fabric latency (i.e., the time taken for requests/responses to flow between NICs and switches). The paper assumes that the latency between the processor and the NIC is low (and cites other papers which describe techniques for reducing this latency to below 100ns). Typical Ethernet fabric latency is measured in microseconds, which is much higher than a local memory access.

The Ethernet hardware stack can be decomposed into MAC and PHY layers. The MAC is higher level and sits on top of the PHY. The paper proposes implementing EDM (Ethernet Disaggregated Memory) with modifications to the PHY layer in both the NIC and the switch. Normal network packets flow through the MAC and PHY as they usually would, but a side channel exists which allows remote memory accesses to be handled directly by the enhanced PHY layer. Fig. 3 illustrates the hardware changes in Ethernet NICs and switches.

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Remote memory access requests and responses are smaller than typical Ethernet packets. Additionally, end-to-end application performance is more sensitive to remote memory access latency than to the latency of regular network traffic. The bulk of the paper describes how EDM achieves low latency for remote memory traffic. The EDM PHY modifications allow a memory request to preempt a non-memory packet.
Say the MAC sends a 1KiB packet to the PHY, which begins to send the packet over the wire in 66-bit blocks. If a memory request shows up in the middle of transmitting the network packet, the PHY can sneak the memory request onto the wire between 66-bit blocks, rather than waiting for the whole 1KiB to be sent.

Standard Ethernet requires 96 bits of zeros to be sent on the wire between each packet. This overhead is small for large packets, but it is non-trivial for small packets (like remote memory access requests). The EDM PHY modifications allow these idle bits to be used for remote memory accesses. The MAC still sees the gaps, but the PHY does not. If you ask an LLM what could possibly go wrong by trying to use the inter-frame gap to send useful data, it will spit out a long list. I can't find much detail in the paper about how to ensure that this enhancement is robust. The possible problems are limited to the PHY layer, however, as the MAC still sees the zeros it expects.

To avoid congestion and dropping of memory requests, EDM uses an in-network scheduling algorithm somewhat like PFC (Priority Flow Control). The EDM scheduler sits in the PHY layer of the switch. Senders notify the switch when they have memory traffic to send, and the switch later responds with a grant, allowing a certain amount of data to be sent.

The authors implemented EDM on FPGAs (acting as both NIC and switch). Table 1 compares latencies for TCP/IP, RDMA, raw Ethernet packets, and EDM, breaking down the latency at each step:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Fig. 7 throws CXL into the mix:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Dangling Pointers

Section 3.3 "Practical Concerns" discusses what could go wrong (e.g., fault tolerance and data corruption). It is hard to judge how much work is needed to make this into something that industry could rely on.
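As a back-of-envelope aside on the inter-frame gap discussion above, the 96 idle bits are a rounding error for a 1KiB packet but a meaningful tax on a small memory request (the link rate below is my assumption, not a number from the paper):

```python
IPG_BITS = 96    # standard Ethernet inter-frame gap, in bits of idle
LINK_GBPS = 100  # assumed link rate; the paper's FPGA setup may differ

def gap_overhead(frame_bytes):
    """Fraction of wire time the mandatory gap adds for this frame size."""
    frame_bits = frame_bytes * 8
    return IPG_BITS / (frame_bits + IPG_BITS)

gap_ns = IPG_BITS / LINK_GBPS   # bits / (Gbit/s) = nanoseconds
print(f"gap on the wire: {gap_ns:.2f} ns")                     # 0.96 ns
print(f"1 KiB packet: {gap_overhead(1024):.1%} of wire time")  # 1.2%
print(f"64 B request: {gap_overhead(64):.1%} of wire time")    # 15.8%
```

This is why reclaiming the gap matters mostly for the small remote-memory messages, not for bulk traffic.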

Den Odell 3 weeks ago

Fast by Default

After 25 years building sites for global brands, I kept seeing the same pattern appear. A team ships new features, users quietly begin to struggle, and only later do the bug reports start trickling in. Someone finally checks the metrics, panic spreads, and feature development is put on hold so the team can patch problems already affecting thousands of people. The fixes help for a while, but a month later another slowdown appears and the cycle begins again. The team spends much of its time firefighting instead of building.

I call this repeating sequence of ship, complain, panic, patch the Performance Decay Cycle. Sadly, it's the default state for many teams and it drains morale fast. There has to be a better way.

When I stepped into tech-lead roles, I started experimenting. What if performance was something we protected from the start rather than something we cleaned up afterward? What if the entire team shared responsibility instead of relying on a single performance-minded engineer to swoop in and fix things? And what if the system itself made performance visible early, long before issues hit production? Across several teams and many iterations, a different pattern began to emerge. I now call it Fast by Default.

Fast by Default is the practice of embedding performance into every stage of development so speed becomes the natural outcome, not a late rescue mission. It involves everyone in the team, not just engineers.

Most organizations treat performance as something to address when it hurts, or they schedule a bug-fix sprint every few months. Both approaches are expensive, unreliable, and almost always too late. By the time a slowdown is noticeable, the causes are already baked into the rendering strategy, the data-fetching sequence, and the component boundaries. These decisions define a ceiling on how fast your system can ever be. You can tune within that ceiling, but without a rebuild, you can't break through it. Meanwhile, the baseline slowly drifts.
Slow builds and sluggish interactions become expected. What felt unacceptable in week 1 feels normal by month 6. And once a feature ships, the attention shifts. Performance work competes with new ideas and roadmap pressure. Most teams never return to clean things up.

Performance regressions rarely announce themselves through one dramatic failure. They accumulate quietly, through dozens of reasonable decisions. A feature adds a little more JavaScript, a new dependency brings a hidden transitive load, and a design tweak introduces layout movement. A single page load still feels fine, but interactions begin to feel heavier. More features are added, more code ships, and slowly the slow path becomes the normal path.

It shows up most clearly at the dependency level: each import made sense in isolation and passed through code review. No single decision broke the experience; the combination did. This is why prevention always beats the cure. If you want to avoid returning to a culture of whack-a-mole fixes, you need to change the incentives so fast outcomes happen naturally.

The core idea is simple: make the fast path easier than the slow path. Once you do that, performance stops depending on vigilance or heroics. You create systems and workflows that quietly pull the team toward fast decisions without friction. Here's what this looks like day-to-day:

If your starting point is a client-rendered SPA, you're already fighting uphill. Server-first rendering with selective hydration (often called the Islands Architecture) gives you a performance margin that doesn't require constant micro-optimization to maintain. It also helps clarify how much of your SPA truly needs to be a SPA.

When dependency size appears directly in your IDE, bundle size and budget checks run automatically in CI, and hydration warnings surface in local development, developers see the cost of their changes immediately and fix issues while the context is still fresh.
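An automated budget check in CI can be as small as a script that fails the build when an artifact exceeds its budget. A sketch (the paths and byte budgets are invented for illustration):

```python
import os

# Hypothetical per-bundle budgets in bytes; tune these to your own app.
BUDGETS = {
    "dist/main.js": 170_000,
    "dist/vendor.js": 250_000,
}

def measure(paths):
    """Measure artifact sizes on disk (missing files count as 0)."""
    return {p: os.path.getsize(p) if os.path.exists(p) else 0 for p in paths}

def check_budgets(sizes):
    """Return human-readable violations; an empty list means the build passes."""
    violations = []
    for path, budget in BUDGETS.items():
        size = sizes.get(path, 0)
        if size > budget:
            violations.append(f"{path}: {size} bytes > budget {budget}")
    return violations

# In CI you'd run measure(BUDGETS) on the real build output and fail the job
# (non-zero exit) whenever check_budgets(...) is non-empty. Simulated here:
problems = check_budgets({"dist/main.js": 200_000, "dist/vendor.js": 100_000})
```

The point is less the script than the placement: the check runs on every pull request, so a budget-busting dependency is caught while the author still has the context to swap it out.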
Reaching for utility-first libraries, choosing smaller dependencies, and cultivating a culture where the first question is "do we need this?" rather than "why not?" keeps complexity from compounding. When reviewers consistently ask how a change affects render time or memory pressure, the entire team levels up. The question becomes part of the craft rather than an afterthought, and eventually it appears in every pull request.

Teams that stay fast don't succeed because they have more performance experts; they succeed because they distribute ownership. Designers think about layout stability, product managers scope work with speed in mind, and engineers treat performance budgets as part of correctness rather than a separate concern. Everyone understands that shipping fast code is as important as shipping correct code. For this to work, regressions need to surface early. That requires continuous measurement, clear ownership, and tooling that highlights problems before users do.

Once the system pulls in the right direction with minimal resistance, performance becomes self-sustaining. A team with fast defaults ships fast software in month 1, and they're still shipping fast software in month 12 and month 36 because small advantages accumulate in their favor. A team living in the Performance Decay Cycle may start with acceptable performance, but by month 12 they find themselves planning a dedicated performance sprint, and by month 36 they're discussing a rewrite. The difference isn't expertise or effort; it's the approach they started from.

Speed is leverage because it builds trust, sharpens design, and accelerates development. Once you lose it, you lose more than seconds: you lose users, revenue, and confidence in your own system. Fast by Default is how teams break this cycle and build systems that stay fast as they grow. For more on this model, see https://fastbydefault.com.
<small>This article was first published on 4 December 2025 at https://calendar.perfplanet.com/2025/fast-by-default/</small>


An Analysis of User-space Idle State Instructions on x86 Processors

An Analysis of User-space Idle State Instructions on x86 Processors Malte-Christian Kuns, Hannes Tröpgen, and Robert Schöne ICPE'25

I've long believed that busy waiting is poor form. The closest you should ever come to busy waiting is to lock a mutex, which will busy wait for a short while on your behalf. If your primary concern is power consumption, then busy waiting may be less offensive on a modern processor. This paper describes newly added x86 instructions that enable low-power busy waiting from user space, and has a ton of data to help you sleep better at night.

TPAUSE puts the processor into a low power state for a user-specified amount of time. TPAUSE supports two low power states (C0.1 and C0.2), which trade power consumption for wake-up latency. TPAUSE can be called in user space but doesn't wrest control of the core away from the OS. The trick is that the OS can set a maximum timeout value, which gives the OS a chance to switch away from the busy-waiting thread.

The UMONITOR and UMWAIT instructions are similar to TPAUSE but allow the processor to be woken up when a write occurs in a specified memory range. UMONITOR sets up the memory range to be monitored, and UMWAIT causes the processor to enter a low power state. UMWAIT accepts a timeout value and a target power state (just like TPAUSE). AMD supports similar functionality via the MONITORX and MWAITX instructions.

A key question the paper investigates is how closely the user-specified timeout is honored. Fig. 1 shows results for three Intel cores:

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

Times are measured in timestamp counter cycles (roughly 3 GHz for Alder Lake). The plateau at the top right is caused by the OS-specified maximum timeout. The authors find that timeout values are quantized (e.g., 83 cycles on an Alder Lake P-core). Additionally, for short timeouts the processor may ignore the user-requested power state (presumably because it doesn't make sense to enter a deep sleep for a short amount of time).
On Alder Lake P-cores, the threshold below which the processor will not enter the lowest power state is around 23,000 TSC cycles. Alder Lake E-cores seem to support only one low power state.

Fig. 3 measures how much the processor can "oversleep" (wake up later than requested) depending on processor frequency and requested power state:

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

And finally, Table 2 shows measured power consumption for these new instructions vs. old-fashioned busy-wait loops that use the PAUSE instruction (which does not support a user-specified timeout):

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

I'm shocked by the advantage that AMD has here. If CPU core power during busy waiting is your primary concern, then you should choose your chip carefully.

Condition variables are a general and useful abstraction. It would be nice if code that used condition variables could automatically benefit from these instructions. Maybe some compiler and/or hardware assistance is necessary to enable that.
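For intuition, it helps to convert the paper's headline numbers out of TSC cycles using the roughly 3 GHz Alder Lake rate quoted above (the conversion is mine, not from the paper):

```python
TSC_HZ = 3e9   # roughly 3 GHz on Alder Lake, per the text

def cycles_to_ns(cycles):
    """Convert timestamp-counter cycles to nanoseconds at TSC_HZ."""
    return cycles / TSC_HZ * 1e9

# Timeout quantization on an Alder Lake P-core: 83 TSC cycles.
print(f"quantum: {cycles_to_ns(83):.0f} ns")                        # 28 ns
# Threshold below which the deepest state is not entered: ~23,000 cycles.
print(f"deep-sleep threshold: {cycles_to_ns(23_000) / 1000:.1f} us")  # 7.7 us
```

So the quantization is far below anything a caller would notice, while the deep-sleep threshold means TPAUSE only reaches its deepest state for waits in the multi-microsecond range.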
