Posts in Performance (20 found)

Binary Compatible Critical Section Delegation

Junyao Zhang, Zhuo Wang, and Zhe Zhou PPoPP'26 The futex design works great when contention is low but leaves much to be desired when contention is high. I generally think that algorithms should be crafted to avoid high lock contention, but this paper offers a contrarian approach that improves performance without code changes. Acquiring a futex involves atomic operations on the cache lines that contain the futex state. Under high contention, these cache lines violently bounce between cores. Also, user space code will eventually give up trying to acquire a lock the easy way and will call into the kernel, which has its own synchronization to protect the shared data structures that manage the queue of threads waiting to acquire the lock. The problems don't end when a lock is finally acquired. A typical futex guards some specific application data. The cache lines containing that data will also uncomfortably bounce between cores. The idea behind delegation is to replace the queue of pending threads with a queue of pending operations. An operation comprises the code that will be executed under the lock and the associated data. Here is how I think of delegation: in the uncontended case, the critical-section function executes directly. In the contended case, the function is placed into a queue to be executed later. After any thread finishes executing such a function, it checks the queue; if the queue is not empty, that thread goes ahead and executes all of the functions it contains. If a particular thread calls 10 functions from the queue, then the data guarded by the lock remains local to the core that thread is running on, and the system avoids moving that data between cores 10 times. The magic of this paper is that it shows how to change the OS kernel to automatically implement delegation for any application that uses futexes. 
When the futex code gives up trying to acquire a futex in user space, it calls the OS to wait on the futex. The implementation of this system call is changed to implement automatic delegation. Automatic delegation can fail (as illustrated by Fig. 2), in which case the traditional futex waiting algorithm is used. Source: https://dl.acm.org/doi/10.1145/3774934.3786439 This paper makes heavy use of the Userspace Bypass library (a.k.a. UB; paper here). This library allows the kernel to safely execute user-mode code. It was originally designed to optimize syscall-heavy applications by allowing the kernel to execute the small tidbits of user space code in between system calls. UB uses binary translation to translate instructions that were meant to run in user space into instructions that can securely be executed by the kernel. Binary compatible critical section delegation uses UB to translate the code inside the critical section (i.e., the code between the futex lock and unlock calls) into code that can be safely executed by the kernel. A pointer to this translated code is placed into a queue of delegated calls (the vw queue). The set of threads which are trying to acquire a lock cooperatively execute the functions in the vw queue. At any one time, at most one thread is elected to be the delegate thread. It drains the vw queue by executing (in kernel space) all the delegated functions in the queue. This works great in cases where the code inside the critical section accesses a lot of shared state, because that shared state can happily reside in the cache of the core that is running the delegate thread, rather than bouncing between cores. The paper has impressive results from microbenchmarks, but I think real applications are more relevant. Table 2 shows performance results for a few applications and a few locking strategies. BCD is the work in this paper. TCS and TCB are prior work with the drawback of not being compatible with existing binaries. 
Source: https://dl.acm.org/doi/10.1145/3774934.3786439 Dangling Pointers There is a hint here at another advantage of pipeline parallelism over data parallelism: allowing persistent data structures to remain local to a core.


Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds

I learned a lot about the LLC configuration and monitoring capabilities of modern CPUs from this paper; I bet you will too. The problem this paper addresses is: how to avoid performance variability in cloud applications due to cross-VM contention for the last-level cache (e.g., the L3 cache on a Xeon)? In a typical CPU, the L1 and L2 caches are private to a core, but the L3 is shared. In a cloud environment, the L3 is shared by multiple tenants and is an avenue for a "noisy neighbor" to annoy its neighbors. The work described by this paper builds upon Intel CMT and CAT. Cache Monitoring Technology allows the hypervisor to track how much of the L3 cache is occupied by each VM. Cache Allocation Technology allows the hypervisor to restrict a VM to only use a subset of the L3. CAT allows a VM to be assigned to a cache class of service (CLOS), which defines the set of L3 ways accessible to the VM (this page defines the term "ways" if you are unfamiliar). A typical CPU used by a cloud service provider has more CPU cores than L3 ways. If a cloud server hosts many small VMs, then L3 ways must be shared amongst VMs. The key problem solved by this paper is how to reduce performance variability given this constraint. Fig. 1 illustrates the assignments of CLOS levels to LLC ways advocated by this paper. Each row is a class of service, and each column is a way of the LLC cache. CLOS[0] can access all ways, CLOS[1] can access all LLC ways except for one. CLOS[7] can only access a single way of the LLC. Source: https://dl.acm.org/doi/10.1145/3774934.3786415 The hypervisor uses Intel CMT to monitor how much of the LLC is occupied by each VM. Every 6 seconds, the hypervisor uses this information to change the CLOS that each VM is assigned to. The hypervisor computes a target LLC occupancy for each VM based on the number of cores assigned to the VM. 
This target is compared against the measured LLC occupancy to classify each VM into one of three categories:

Poor (the VM is starved for space)
Adequate (the VM is using just the right amount of cache)
Excess (the VM is hogging too much)

VMs in the poor category are de-suppressed (i.e., assigned to a CLOS with access to more LLC ways). Additionally, VMs in the excess category are suppressed (i.e., assigned to a CLOS with access to fewer ways), but this suppression only occurs when there are VMs in the poor category. This policy means that cache-hungry VMs can use more than their fair share of the L3 during periods of low server utilization. This can lead to higher mean performance, at the cost of a wider standard deviation. The paper describes a 4th state (overflow), which is only applied to VMs that wish to be held back even if there is plenty of L3 space available. These VMs are suppressed when they are found to be using too much L3, even if all other VMs on the system are getting enough cache space. Fig. 5 shows a case where this strategy works well compared to static allocation. The server in question is running 5 VMs, each running a different application:

VM1 - 32 cores
VM2 - 16 cores (but doesn't fully utilize those cores)
VM3 - 8 cores
VM4 - 4 cores
VM5 - 4 cores

The top of figure 5 shows a simple static partitioning of LLC ways. VM1 is assigned to 6 ways, VM2 is assigned to 3 ways, VM3 is assigned to 2 ways, and VMs 4 and 5 must share 1 way. They have to share because sharing based on the number of ways in the LLC is inherently coarse-grained. The two charts show measured LLC utilization over 10 minutes. Notice the Y-axis. The technique described in this paper (Cacheman) allows VM4 and VM5 to use far more aggregate LLC capacity than the static partitioning. Also notice that in the static partitioning, VM5 always uses more LLC than VM4 (because they are running different applications), whereas Cacheman allows for a more even balance between them. 
Source: https://dl.acm.org/doi/10.1145/3774934.3786415 Dangling Pointers While the L3 cache is logically a monolithic shared resource, it is physically partitioned across the chip (with a separate slice near each core). It seems like it could be more efficient if VMs could be assigned to nearby L3 slices rather than L3 ways.


Hapax Locks: Scalable Value-Based Mutual Exclusion

Dave Dice and Alex Kogan PPoPP'26 This paper describes a locking algorithm intended for cases where spinning is acceptable (e.g., one-thread-per-core systems). It is similar to a ticket lock but generates less coherence traffic. Each lock/unlock operation causes a constant number of cache lines to move between cores, regardless of the number of cores involved or how long they spin. As we've seen in a previous paper, polling a value in memory is cheap if the cache line is already local to the core which is polling. A Hapax lock comprises two 64-bit fields; call them Arrive and Depart. Additionally, there is a global (shared among all Hapax locks) 64-bit sequence number. Each time a thread attempts to lock a Hapax lock, it generates a Hapax value which uniquely identifies the locking episode. A locking episode is a single lock/unlock sequence performed by a specific thread. A Hapax value is generated by atomically incrementing the sequence number. It is assumed that the 64-bit counter will never overflow. Next, the locking thread atomically exchanges the value of Arrive with the Hapax value it just generated. This exchange operation generates a total ordering among Hapax values. It is a way for threads to cooperatively decide the order in which they will acquire the lock. Say thread A generates a Hapax value (call it V_A) and stores it into Arrive (via an atomic exchange operation). Next, thread B generates Hapax value V_B and atomically exchanges the value of Arrive with V_B. The result of the exchange operation performed by B will be V_A. At this point, thread B knows that it is directly behind thread A in the queue and must wait for thread A to release the lock. To finish acquiring the lock, a thread continually polls Depart, waiting for it to equal the Hapax value of the preceding locking episode. In the example above, thread B polls until it sees the value V_A. At this point, the lock has been acquired. Unlocking is implemented by storing the Hapax value used by the unlocking thread into Depart. 
In the running example, the second thread would unlock the lock by storing its own Hapax value into Depart. This algorithm generates a lot of coherence traffic. In particular, the cache line which holds the sequence number moves between cores each time a new Hapax value is generated. Also, each store to Depart sends coherence traffic to each core which had recently polled Depart. The paper has two techniques to address these issues. While the sequence number monotonically increases, the values stored in the lock's two fields do not. There are two reasons for this. First, a single sequence number is shared among all Hapax locks. The second reason is that multiple threads can generate Hapax values and then race to perform the atomic exchange operation. For example, one thread could generate a smaller Hapax value while another thread generates a larger one, and the second thread could win the race to perform the exchange. In that case, the field first takes on the larger value and only later the smaller one. Once you realize that the values of the two fields are not monotonically increasing, it is straightforward to see how the generation of Hapax values can be made cheap. A thread can hoard a batch of Hapax values with a single atomic add operation. For example, a thread could atomically increase the value of the sequence number by 1024. At this point, the thread has allocated 1024 Hapax values for itself that it can use in the future without accessing the cache line which holds the shared sequence number. The paper proposes allocating Hapax values in blocks of 64K. The paper proposes adding an additional array which serves a similar role as Depart. The number of elements in the array should be greater than the number of cores (the paper uses an array of 4096 values). Like the sequence number, this array is shared among all Hapax locks. When a thread unlocks (storing its Hapax value into Depart), the thread also stores its Hapax value into one of the 4096 elements. 
The array index is determined by the Hapax value. Many potential hash functions could be used. The paper proposes hashing bits [27:16] of the Hapax value. The 16 is related to the allocator block size. In the locking sequence, a thread loads the value of Depart once. If the value of Depart does not match the expected Hapax value, then the locking thread polls the appropriate element of the shared array. The thread polls this element until its value changes. If the new value is the expected Hapax value, then the lock has been acquired. If not, then a hash collision has occurred (e.g., a locking episode associated with a different Hapax lock caused the value to be updated). In this case, the thread starts over by checking Depart and then polling the array element if necessary. This scheme minimizes coherence traffic associated with polling. When an unlocking core stores a value into an array element, the associated cache line will typically be present only in the cache of the next core in line. Coherence traffic is generated only between the locking and unlocking cores. Other threads (which are further back in the line) will be polling other array elements, loading from other cache lines, so the cores those threads are running on won't see the coherence messages. Fig. 3 has results from a microbenchmark. Hapax locks scale much better than ticket locks and go head-to-head with other state-of-the-art locking algorithms. The Hapax implementation is so concise (about 100 lines) that the authors included C++ source code in the paper. Source: https://dl.acm.org/doi/10.1145/3774934.3786443 Dangling Pointers The big downside of spinning is that it wastes cycles in the case where there are other threads that the OS could schedule. I wonder if there is a lightweight coordination mechanism available. For example, the OS could write scheduling information into memory that is mapped read-only into user space. 
This could be used to communicate to the spinning code whether or not there are other threads ready to run.

Evan Schwartz 1 week ago

Scour - February Update

Hi friends, In February, Scour scoured 647,139 posts from 17,766 feeds (1,211 were newly added). Also, 917 new users signed up, so welcome everyone who just joined! Here's what's new in the product: If you subscribe to specific feeds (as opposed to scouring all of them), Scour can now infer topics you might be interested in from them. You can click the link that says "Suggest from my feeds" on the Interests page. Thank you to the anonymous user who requested this! The onboarding experience is simpler. Instead of typing out three interests, you can now describe yourself and your interests in free-form text. Scour extracts a set of interests from what you write. Thank you to everyone who let me know that they were a little confused by the onboarding process. I made two subtle changes to the ranking algorithm. First, the scoring algorithm ranks posts by how well they match your closest interest and gives a slight boost if the post matches multiple interests. That was the intended design from earlier, but I realized that multiple weaker matches were pulling down the scores rather than boosting them. The second change was that I finally retired the machine learning text quality classifier model that Scour had been using. The final straw was when a blog post I had written (and worked hard on!) wasn't showing up on Scour. The model had classified it as low quality 😤. I knew for a while that what the model was optimizing for was somewhat orthogonal to my idea of text quality, but that was it. For the moment, Scour relies on a large domain blocklist (of just under 1 million domains) to prevent low-quality content and spam from getting into your feed. I'm also investigating other ways of assessing quality without relying on social signals, but more on that to come in the future. I've always been striving to make Scour fast and it got much faster this past month. My feed, which compares about 35,000 posts against 575 interests, now loads in around 50 milliseconds. 
Even comparing all the 600,000+ posts from the last month across all feeds takes only 180 milliseconds. This graph shows the 99th percentile latency (the slowest requests) dropping from the occasional 10 seconds down to under 400 milliseconds (lower is better): For those interested in the technical details, this speedup came from two changes: First, I switched from scanning through post embeddings streamed from SQLite, which was already quite fast because the data is local, to keeping all the relevant details in memory. The in-memory snapshot is rebuilt every 15 minutes when the scraper finishes polling all of the feeds for new content. This change resulted in the very nice combination of much higher performance and lower memory usage, because SQLite connections have independent caches. The second change came from another round of optimization on the library I use to compute the Hamming distance between each post's embedding and the embeddings of each of your interests. You can read more about this in the upcoming blog post, but I was able to speed up the comparisons by around another 40x, so Scour can now do around 1.6 billion comparisons per second. Together, these changes make loading the feed feel instantaneous, even though your whole feed is ranked on the fly when you load the page. Here were some of my favorite posts that I found on Scour in February: Happy Scouring! Scour is built on vector embeddings, so I'm especially excited when someone releases a new and promising-sounding embedding model. I get particularly excited by those that are explicitly trained to support binary quantization like this one from Perplexity: pplx-embed: State-of-the-Art Embedding Models for Web-Scale Retrieval. I also spend a fair amount of time thinking about optimizing Rust code, especially using SIMD, so this was an interesting write-up from TurboPuffer: Rust zero-cost abstractions vs. SIMD. 
This was an interesting write-up comparing what different coding agents do under the hood: I Intercepted 3,177 API Calls Across 4 AI Coding Tools. Here's What's Actually Filling Your Context Window. And finally, this one is on a very different topic but has some nice animations that demonstrate why boarding airplanes is slow and shows The Fastest Way to Board an Airplane.


Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures

Reza Ghanbari, Henry Kao, João P. L. De Carvalho, Ehsan Amiri, and J. Nelson Amaral CGO'25 This paper serves as a warning: don't go overboard with vector instructions. There is a non-trivial amount of performance to be had by balancing compute between scalar and vector instructions. Even if you fear that automatic vectorization is fragile, this paper has some interesting lessons. Listing 1 contains a vectorizable loop and listing 2 shows a vectorized implementation: Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Source: https://dl.acm.org/doi/10.1145/3696443.3708950 After achieving this result, one may be tempted to pat oneself on the back and call it a day. If you were a workaholic, you might profile the optimized code. If you did, you would see something like the data in table 1: Source: https://dl.acm.org/doi/10.1145/3696443.3708950 And you could conclude that this algorithm is compute-bound. But what do we really mean by "compute-bound"? A processor contains many execution ports, each with a unique set of capabilities. In the running example, the execution ports capable of vector multiplication and addition are fully booked, but the other ports are sitting mostly idle! Listing 3 shows a modified loop which tries to balance the load between the vector and scalar execution ports. Each loop iteration processes 9 elements (8 via vector instructions, and 1 via scalar instructions). This assumes that the processor supports fast unaligned vector loads and stores. Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Section 3 has details on how to change LLVM to get it to do this transformation. Fig. 3 shows benchmark results. By my calculations, the geometric mean of the speedups is 8%. Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Dangling Pointers This paper builds on top of automatic vectorization. 
In other words, the input source code is scalar and the compiler vectorizes loops while balancing the workload. An alternative would be to have the source code in a vectorized form and then let the compiler "devectorize" where it makes sense.

Rik Huijzer 1 week ago

More Accurate Speech Recognition with whisper.cpp

I have been using OpenAI's whisper for a while to convert audio files to text. For example, to generate subtitles for a file, I used

```bash
whisper "$INPUT_FILE" -f srt --model turbo --language en
```

Especially on long files, this would sometimes change its behavior over time, leading to either extremely long or extremely short sentences (runaway). Also, `whisper` took a long time to run. Luckily, there is whisper-cpp. On my system with an M2 Pro chip, this can now run speech recognition on a 40-minute audio file in a few minutes instead of half an hour. Also, thanks to a tip from whisp...

Binary Igor 2 weeks ago

JSON Documents Performance, Storage and Search: MongoDB vs PostgreSQL

Does MongoDB still have an edge as a document-oriented database for JSON in particular? Or is Postgres better? Or at least good enough to stick with it, since it is a more universal database, offering a richer feature set and wider applicability?


Flexible I/O for Database Management Systems with xNVMe

Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier González, Vivek Shah, Pınar Tözün CIDR'26 This paper describes xNVMe, a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB. Section 2 contains the hard sell for xNVMe. The "x" prefix serves a similar role to the "X" in DirectX. It is fast, while also being portable across operating systems and storage devices. The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry). There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously). Listing 3 has an example, which feels like "Mantle for NVMe": Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf The API works on Linux, FreeBSD, Windows, and macOS. Some operating systems have multiple backends available. The point of this paper is that xNVMe is easy to drop into an existing application. The paper describes an xNVMe-based implementation of the DuckDB filesystem interface. It creates dedicated queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel). The paper also describes how xNVMe supports shiny new NVMe features like Flexible Data Placement (FDP). This allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance). Most of the results in the paper show comparable performance for xNVMe vs the baseline DuckDB filesystem. Fig. 5 shows one benchmark where xNVMe yields a significant improvement: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf Dangling Pointers I think the long-term success of xNVMe will depend on governance. Potential members of the ecosystem could be scared off by Samsung's potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?) 
There is a delicate balancing act between an API driven by a sluggish bureaucratic committee and an API which is dominated by one vendor.


A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation

Christian Lanius, Florian Freye, and Tobias Gemmeke GLVLSI'25 This paper ostensibly describes an ASIC accelerator for NFA evaluation (e.g., regex matching), but it also describes two orthogonal techniques for optimizing NFA evaluation which are applicable to more than just this ASIC. Any regular expression can be converted to a non-deterministic finite automaton (NFA). Think of an NFA as a state machine where some inputs can trigger multiple transitions. The state machine is defined by a set of transitions. A transition is a (current state, input symbol, next state) tuple. The "non-deterministic" naming comes from the fact that multiple tuples may exist with identical (current state, input symbol) values; they differ only in their next state. This means that an NFA can be in multiple states at once. One way to evaluate an NFA is to use a bitmap to track the set of active states. For each new input symbol, the set of active states in the bitmap is used to determine which transitions apply. Each activated transition sets one bit in the bitmap used to represent the active states for the next input symbol. The hardware described in this paper uses a compute-in-memory (CIM) microarchitecture. A set of columns stores the state machine, with each column storing one transition. This assumes that the transition function is sparse (i.e., the number of transitions used is much lower than the maximum possible). During initialization, the transitions are written into the CIM hardware. An input symbol is processed by broadcasting it and the current state bitmap to all columns. All columns evaluate whether their transition should be activated. The hardware then iterates (over multiple clock cycles) over all activated transitions and updates the state bitmap for the next input symbol. The left side of Fig. 
5 illustrates the hardware in each column, which compares the input symbol and current state against the stored tuple: Source: https://dl.acm.org/doi/10.1145/3716368.3735157 The algorithm described above processes at most one input symbol per cycle (and it is slower for inputs that activate multiple transitions). The paper contains two tricks for overcoming this limitation. Fig. 4 illustrates how an NFA that accepts one symbol per cycle can be converted into an NFA which accepts two symbols per cycle. For example, rather than considering two adjacent input symbols separately, put them together into one mega-symbol. This is feasible as long as your NFA implementation isn't too sensitive to the number of bits per symbol. Source: https://dl.acm.org/doi/10.1145/3716368.3735157 Cool Trick #2 - Bloom Filter The target application for this hardware is monitoring network traffic for threats (e.g., Snort). A key observation is that most inputs (network packets) do not produce a match, so it is reasonable to assume that most of the time the NFA will be in the initial state, and most input symbols will not trigger any transitions. If that assumption holds, then a bloom filter can be used to quickly skip many input symbols before they even reach the core NFA evaluation hardware. The bloom filter is rebuilt whenever the NFA transition function changes. To build the bloom filter, iterate over each transition whose current state is the initial state. For each such transition, compute a hash of the input symbol, decompose the hashed value into indices, and set the corresponding bits in the bloom filter. To test an input symbol against the bloom filter, hash the input symbol, decompose the hashed value into indices, and check whether all of the corresponding bits are set in the bloom filter. If any bit is not set, then the input symbol does not trigger a transition from the initial state. When that symbol finally arrives at the NFA hardware, it can be dropped if the NFA is in the initial state. 
Table 1 compares PPA results against other published NFA accelerators. It is a bit apples-to-oranges as the various designs target different technology nodes. The metric that stands out is the low power consumption of this design. Source: https://dl.acm.org/doi/10.1145/3716368.3735157 Dangling Pointers I wonder if the bloom filter trick can be extended. For example, rather than assuming the NFA will always be in the initial state, the hardware could dynamically compute which states are the most frequent and then use bloom filters to drop input symbols which cannot trigger any transitions from those states.


Fixing qBittorrent in Docker, swallowing RAM and locking up the host

Intro

I've had qBittorrent running happily in Docker for over a year now, using the linuxserver/qbittorrent image. I run the docker container and others on an Ubuntu Server host. Recently, the host regularly becomes unresponsive. SSH doesn't work, and the only way to recover is to power cycle the server (running on a small NUC). The symptoms were: RAM usage would climb and climb, the server would become sluggish, and then it would completely lock up.

Martin Alderson 2 weeks ago

Which web frameworks are most token-efficient for AI agents?

I benchmarked 19 web frameworks on how efficiently an AI coding agent can build and extend the same app. Minimal frameworks cost up to 2.9x fewer tokens than full-featured ones.

Den Odell 2 weeks ago

Constraints and the Lost Art of Optimization

In 1984, Steve Jobs walked over to a bag standing on stage and pulled out a computer that would change the world. The Macintosh had an operating system, a graphical user interface, window manager, font renderer, and a complete graphics engine called QuickDraw, one of the most elegant pieces of software ever written. The whole thing fit inside the machine’s 64KB ROM. Sixty . Four . Kilobytes . A typical webpage hero image is larger. A "Hello World" React application is larger. The entire intellectual and creative output of a team that reinvented personal computing fits in a space that, today, we wouldn’t think twice about wasting on a single font file. They did it because there just wasn’t any other choice. A larger ROM would have been too expensive for a mass-market consumer device, so efficiency and hyper-optimization was the only route. Somewhere in the years that followed we’ve lost the creative solutions, the art of optimization, that being constrained in that way produces. The Atari 2600 game console had 4KB of ROM and 128 bytes of RAM. Not kilobytes… bytes . And because there was no display buffer, programmers had to manually synchronize their code with the connected TV’s electron beam as it swept across the screen, line by line, sixty times a second. If your code ran too slow, the image tore. If it ran too fast, you’d corrupt the next line. They called it " racing the beam ." It was brutal, unforgiving, and it forced some of the most inventive programming in computing history. Super Mario Bros shipped on a 40KB NES cartridge. Tetris for the Nintendo Game Boy, the most successful handheld game ever made at the time, was 32KB . These weren’t compromised experiences, they were masterpieces. The constraints didn’t diminish the work, they actually defined it. The programmers who built these things weren’t just efficient, they thought differently. They knew their medium the way a sculptor knows stone. 
They understood every byte, every clock cycle, and every trade-off. The machine had no secrets from them because it couldn’t afford to. Modern development is defined by abundance. Cheap storage, fast networks, and powerful hardware. The practical consequences of inefficiency have largely disappeared. A few extra megabytes, a few wasted milliseconds, an unnecessary UI re-render. For the most part, nobody notices and nothing breaks. And so we’ve stopped noticing ourselves. This isn’t out of laziness, it’s just how rational people work. When the hard limits disappear, the thinking they demanded tends to disappear with them. There’s no forcing function, no electron beam to race, no 128 bytes standing between your idea and disaster. We can afford not to understand, and so increasingly, we simply don’t. Here’s what I think is worth recovering: not the constraints themselves, but the relationship with the medium that having those constraints produced. The engineers who wrote the Mac ROM didn’t just know how to be efficient, they understood their problem at a level that made elegance possible. Bill Atkinson’s QuickDraw wasn’t just small, it was a beautiful piece of code. The 64KB forced him to find the right solution, not just a working one. That instinct, to understand deeply before you build, to ask whether this is the right structure and not just a functional one, to treat your medium as something to be understood rather than merely used, that’s the transferable thing. Not the bit-twiddling, the thinking . The best engineers I’ve worked with carry this instinct even when others might think it crazy. They impose their own constraints. They ask what this would look like if it had to be half the size, or run twice as fast, or use a tenth of the memory. Not because anyone demanded it, but because just by thinking there could be a better, more efficient solution, one often emerges. 
If you want to start developing this instinct today, the most valuable thing you can do is learn how your runtime actually works. Not at the API level, but internally. How your platform parses, allocates, renders, and executes. For web developers that means understanding the browser pipeline: parsing, style resolution, layout, paint, and compositing. For mobile developers it means understanding how iOS or Android manages memory, handles drawing, and schedules work. Understanding your platform changes what you notice, what makes you wince, and what you reach for instinctively. The engineers who built the Mac knew their domain completely, and you can know yours too. There’s a principle I keep coming back to in engineering apps for performance: fast by default. Not fast because you optimized after the fact, but fast because the thinking that produces fast software is simply better thinking. It catches unnecessary complexity early, and it produces systems that are easier to understand, easier to change, and easier to reason about under pressure. The Atari programmers were fast by default; they had no choice. But the discipline they practiced, that intimate, demanding relationship with their constraints, that’s a choice we can still make. The 64KB Mac ROM isn’t just a remarkable footnote, it’s a provocation. It asks: if they could do that with that, then what’s our excuse? Not to shame us, but to remind us that constraints aren’t the enemy of great work. They’re often the source of it.


Better Memory Tiering, Right from the First Placement

Better Memory Tiering, Right from the First Placement João Póvoas, João Barreto, Bartosz Chomiński, André Gonçalves, Fedar Karabeinikau, Maciej Maciejewski, Jakub Schmiegel, and Kostiantyn Storozhuk ICPE'25 This paper addresses the first placement problem in systems with multiple tiers of memory (e.g., DRAM paired with HBM, or local DRAM paired with remote DRAM accessed over CXL). The paper cites plenty of prior work which dynamically migrates pages/allocations out of suboptimal memory tiers. What is different about this paper is that it attempts to avoid placing data in a suboptimal tier in the first place. The key insight is: statistics from one allocation can be used to generate better placements for similar allocations which will occur in the future. Fig. 3 offers insight into how much waste there is in a policy which initially places all pages into a fast tier and then migrates them to a slower tier if they are accessed infrequently. The figure shows results from one migration policy, applied to three benchmarks. Source: https://dl.acm.org/doi/10.1145/3676151.3719378 Allocation Contexts This paper proposes gathering statistics for each allocation context . An allocation context is defined by the source code location of the allocation, the call stack at the moment of allocation, and the size of the allocation. If two allocations match on these attributes, then they are considered part of the same context. The system hooks heap allocation functions (e.g., , ) to track all outstanding allocations associated with each allocation context. The x86 PMU event is used to determine how frequently each allocation context is accessed. A tidbit I learned from this paper is that some x86 performance monitoring features do more than just count events. For example, randomly samples load operations and emits the accessed (virtual) address. Given the accessed address, it is straightforward to map back to the associated allocation context. 
The hotness of an allocation context is the frequency of these access events divided by the total size of all allocations in the context. Time is divided into epochs. During an epoch, the hotness of each allocation context is recalculated. When a new allocation occurs, the hotness of the allocation context (from the previous epoch) is used to determine which memory tier to place the allocation into. The paper only tracks large allocations (at least 64 bytes). For smaller allocations, the juice is not worth the squeeze. These allocations are assumed to be short-lived and frequently accessed. This paper also describes a kernel component which complements the user space policy described so far. Whereas the user space code deals with allocations, the kernel code deals with pages. This is useful for allocations which do not access all pages uniformly. It is also useful for detecting and correcting suboptimal initial placements. All PTEs associated with all allocations are continually scanned. The accessed bit determines if a page has been read since the last scan. The dirty bit determines if a page has been written since the last scan. After 10 scans, the system has a pretty good idea of how frequently a page is accessed. These statistics are used to migrate pages between fast and slow tiers. Fig. 8 shows execution time for three benchmarks. represents the user and kernel solutions described by this paper. Source: https://dl.acm.org/doi/10.1145/3676151.3719378 Dangling Pointers I wasn’t able to find details in the paper about how PTE scanning works without interfering with other parts of the OS. For example, doesn’t the OS use the dirty bit to determine if it needs to write pages back to disk? I assume the PTE scanning described in this paper must reset the dirty bit on each scan. The definition of an allocation context seems ripe for optimization. I suspect that allowing some variability in call stack or allocation size would allow for better statistics. 
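As a simplified sketch of the epoch mechanism described above — the class names and the placement threshold here are my own invention, not the paper's — the hotness bookkeeping might look like:

```python
class AllocationContext:
    """One context, identified in the paper by (source location, call stack,
    allocation size). Tracks sampled accesses and live bytes per epoch."""

    def __init__(self):
        self.total_bytes = 0      # sum of live allocation sizes in this context
        self.access_samples = 0   # PMU-sampled load events observed this epoch
        self.hotness = 0.0        # hotness frozen at the previous epoch boundary

    def end_epoch(self):
        # Hotness = sampled access frequency divided by total allocated size.
        if self.total_bytes:
            self.hotness = self.access_samples / self.total_bytes
        else:
            self.hotness = 0.0
        self.access_samples = 0   # start counting afresh for the next epoch


def choose_tier(ctx, threshold=0.01):
    """First placement for a new allocation: use the context's hotness from
    the previous epoch. The threshold is an invented knob for illustration."""
    return "fast" if ctx.hotness >= threshold else "slow"
```

At each epoch boundary the context's hotness is frozen, and new allocations matching that context are placed using the frozen value rather than migrated after the fact.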
Maybe this is a good use case for machine learning?

Farid Zakaria 3 weeks ago

Linker Pessimization

In a previous post, I wrote about linker relaxation: the linker’s ability to replace a slower, larger instruction with a faster, smaller one when it has enough information at link time. For instance, an indirect through the GOT can be relaxed into a direct plus a . This is a well-known technique to optimize the instructions for performance. Does it ever make sense to go the other direction? 🤔 We’ve been working on linking some massive binaries that include Intel’s Math Kernel Library (MKL), a prebuilt static archive. MKL ships as object files compiled with the small code-model ( ), meaning its instructions assume everything is reachable within ±2 GiB. The included object files also have some odd relocations where the addend is a very large negative number (>1GiB). The calculation for the relocation value is S + A - P: the symbol address plus the addend minus the instruction address. With a sufficiently large negative addend, the relocation value can easily exceed the 2 GiB limit and the linker fails with relocation overflows. We can’t recompile MKL (it’s a prebuilt proprietary archive), and we can’t simply switch everything to the large code model. What can we do? 🤔 I am calling this technique linker pessimization: the reverse of relaxation. Instead of shrinking an instruction, we expand one to tolerate a larger address space. 😈 The specific instructions that overflow in our case are (Load Effective Address) instructions. In x86_64, performs pure arithmetic: it computes and stores the result in without accessing memory. The is a 32-bit signed integer embedded directly into the instruction encoding, and the linker fills it in via a relocation. The relocation formula is S + A - P. Let’s look at an example with a large addend. A 32-bit signed integer can only represent ±2,048 MB (±2 GiB).
Our value of −2,062 MB exceeds that range and the linker rightfully complains 💥: Note These instructions appear in MKL because the library uses them as a way to compute an address of a data table relative to the instruction pointer. The large negative addend ( ) is intentional; it’s an offset within a large lookup table. The core idea is delightful because, as engineers, we are so often trained to optimize systems, but in this case we want the opposite. We swap the for a that reads through a nearby pointer. Recall from the relaxation post: relaxation shrinks instructions (e.g. indirect -> direct). Here we do the opposite: we make the instruction do more work (pure arithmetic -> memory load) in exchange for a reachable displacement. That’s why I consider it a pessimization or reverse-relaxation. Both instructions use the same encoding length (7 bytes with a REX prefix), so the patch is a single-byte change in the opcode. 🤓 The difference in behavior is critical: Original — the must reach across the entire binary: Pessimized — the reads a nearby pointer that holds the full address: We’ve traded one direct computation for an indirect through a pointer, and we make sure the displacement is now tiny. The 64-bit pointer slot can reach any address in the virtual address space. 👌 For each problematic relocation, three changes are needed in the object file: 1. Opcode Patch — In , change byte to (1 byte). This converts the (compute address) into a (load from address). The rest of the instruction encoding (ModR/M byte, REX prefix) stays identical because both instructions use the same operand format. 2. New Pointer Slot — Create a new section ( ) containing 8 zero bytes per patch site, plus a new relocation pointing to the original symbol with the original addend. is a 64-bit absolute relocation. Its formula is simply , no subtraction of . There is no 32-bit range limitation; it can address the entire 64-bit address space. This is the key insight that makes the fix work. 3.
Retarget the Original Relocation — In the entry for the patched instruction, change the symbol to point at the new pointer slot in and update the type to . The addend becomes a small offset (the distance from the instruction to the fixup slot), which is guaranteed to fit. Note Because both and with a operand are exactly the same length (7 bytes with a REX prefix), we don’t shift any code, don’t invalidate any other relocations, and don’t need to rewrite any other parts of the object file. It’s truly a surgical patch. The pessimized now performs a memory load where the original did pure register arithmetic. That’s an extra cache line fetch and a data dependency. If this instruction is in a tight loop, it could be a performance hit. If premature optimization is the root of all evil, what does that make pessimization? 🧌 LEA: (arithmetic, no memory access). must encode the entire distance to the far-away data. This overflows. MOV: (memory load). points to a nearby 8-byte pointer slot. The pointer slot holds the full 64-bit address. This never overflows.
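To make the mechanics concrete, here is a small Python sketch — my own illustration, not the actual tooling, which would operate on real ELF relocation entries. It checks whether an S + A - P value fits the signed 32-bit field, and flips the one opcode byte that turns `lea r64, [rip+disp32]` (48 8D /r) into `mov r64, [rip+disp32]` (48 8B /r):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def pcrel32_fits(S, A, P):
    """A PC-relative 32-bit relocation computes S + A - P; that value must
    fit in the instruction's signed 32-bit displacement field."""
    return INT32_MIN <= S + A - P <= INT32_MAX

def pessimize_lea(insn: bytes) -> bytes:
    """Patch `lea r64, [rip+disp32]` (REX.W 8D /r) into
    `mov r64, [rip+disp32]` (REX.W 8B /r).

    Both encodings are 7 bytes, so the only change is the opcode byte at
    offset 1; the REX prefix, ModR/M byte, and disp32 stay untouched."""
    assert len(insn) == 7 and insn[0] == 0x48 and insn[1] == 0x8D
    return insn[:1] + bytes([0x8B]) + insn[2:]
```

For example, `lea rax, [rip+0]` encodes as `48 8D 05 00 00 00 00`; after the patch it is `48 8B 05 00 00 00 00`, the same length, so no other offsets in the object file move.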

Evan Schwartz 3 weeks ago

PSA: Your SQLite Connection Pool Might Be Ruining Your Write Performance

Update (Feb 18, 2026): After a productive discussion on Reddit and additional benchmarking, I found that the solutions I originally proposed (batched writes or using a synchronous connection) don't actually help. The real issue is simpler and more fundamental than I described: SQLite is single-writer, so any amount of contention at the SQLite level will severely hurt write performance. The fix is to use a single writer connection with writes queued at the application level, and a separate connection pool for concurrent reads. The original blog post text is preserved below, with retractions and updates marked accordingly. My apologies to the SQLx maintainers for suggesting that this behavior was unique to SQLx. Write transactions can lead to lock starvation and serious performance degradation when using SQLite with SQLx, the popular async Rust SQL library. In retrospect, I feel like this should have been obvious, but it took a little more staring at suspiciously consistent "slow statement" logs than I'd like to admit, so I'm writing it up in case it helps others avoid this footgun. SQLite is single-writer. In WAL mode, it can support concurrent reads and writes (or, technically "write" singular), but no matter the mode there is only ever one writer at a time. Before writing, a process needs to obtain an EXCLUSIVE lock on the database. If you start a read transaction with a SELECT and then perform a write in the same transaction, the transaction will need to be upgraded to a write transaction with an exclusive lock: A read transaction is used for reading only. A write transaction allows both reading and writing. A read transaction is started by a SELECT statement, and a write transaction is started by statements like CREATE, DELETE, DROP, INSERT, or UPDATE (collectively "write statements"). If a write statement occurs while a read transaction is active, then the read transaction is upgraded to a write transaction if possible.
( source ) Transactions started with or also take the exclusive write lock as soon as they are started. Transactions in SQLx look like this: This type of transaction where you read and then write is completely fine. The transaction starts as a read transaction and then is upgraded to a write transaction for the . Update: This section incorrectly attributes the performance degradation to the interaction between async Rust and SQLite. The problem is actually that any contention for the EXCLUSIVE lock at the SQLite level, whether from single statements or batches, will hurt write performance. The problem arises when you call within a write transaction. For example, this could happen if you call multiple write statements within a transaction: This code will cause serious performance degradation if you have multiple concurrent tasks that might be trying this operation, or any other write, at the same time. When the program reaches the first statement, the transaction is upgraded to a write transaction with an exclusive lock. However, when you call , the task yields control back to the async runtime. The runtime may schedule another task before returning to this one. The problem is that this task is now holding an exclusive lock on the database. All other writers must wait for this one to finish. If the newly scheduled task tries to write, it will simply wait until it hits the and returns a busy timeout error. The original task might be able to make progress if no other concurrent writers are scheduled before it, but under higher load you might continuously have new tasks that block the original writer from progressing. Starting a transaction with will also cause this problem, because you will immediately take the exclusive lock and then yield control with . In practice, you can spot this issue in your production logs if you see a lot of SQLx warnings that say where the time is very close to your (which is 5 seconds by default). 
This is the result of other tasks being scheduled by the runtime and then trying and failing to obtain the exclusive lock they need to write to the database while being blocked by a parked task. SQLite's concurrency model (in WAL mode) is many concurrent readers with exactly one writer. Mirroring this architecture at the application level provides the best performance. Instead of a single connection pool, where connections may be upgraded to write at any time, use two separate pools: With this setup, write transactions serialize within the application. Tasks will queue waiting for the single writer connection, rather than all trying to obtain SQLite's EXCLUSIVE lock. In my benchmarks, this approach was ~20x faster than using a single pool with multiple connections: An alternative to separate pools is wrapping writes in a Mutex, which achieves similar performance (95ms in the benchmarks). However, separate pools make the intent clearer and, if the reader pool is configured as read-only, prevent accidentally issuing a write on a reader connection. Having separate pools works when reads and writes are independent, but sometimes you need to atomically read and then write based on the result: Sending this transaction to the single write connection is fine if the read is extremely fast, such as a single lookup by primary key. However, if your application requires expensive reads that must precede writes in a single atomic transaction, the shared connection pool with moderate concurrency might outperform a single writer. Retraction: Benchmarking showed that batched writes perform no better than the naive loop under concurrency, because 50 connections still contend for the write lock regardless of whether each connection issues 100 small s or one large . QueryBuilder is still useful for reducing per-statement overhead, but it does not fix the contention problem.
We could safely replace the example code above with this snippet that uses a bulk insert to avoid the lock starvation problem: Note that if you do this with different numbers of values, you should call . By default, SQLx caches prepared statements. However, each version of the query with a different number of arguments will be cached separately, which may thrash the cache. Retraction: Benchmarking showed that this did not actually improve performance. Unfortunately, the fix for atomic writes to multiple tables is uglier and potentially very dangerous. To avoid holding an exclusive lock across an , you need to use the interface to execute a transaction in one shot: However, this can lead to catastrophic SQL injection attacks if you use this for user input, because does not support binding and sanitizing query parameters. Note that you can technically run a transaction with multiple statements in a call but the docs say: The query string may only contain a single DML statement: SELECT, INSERT, UPDATE, DELETE and variants. The SQLite driver does not currently follow this restriction, but that behavior is deprecated. If you find yourself needing atomic writes to multiple tables with SQLite and Rust, you might be better off rethinking your schema to combine those tables or switching to a synchronous library like with a single writer started with . Update: the most useful change would actually be making a distinction between a and a . Libraries like SQLx could enforce the distinction at compile time or runtime by inspecting the queries for the presence of write statements, or the could be configured as read-only. Maybe, but it probably won't. If SQLx offered both a sync and async API (definitely out of scope) and differentiated between read and write statements, a write could be like , which would prevent it from being held across an point. 
However, SQLx is not an ORM and it probably isn't worth it for the library to have different methods for read versus write statements. Without that, there isn't a way to prevent write transaction locks from being held across s while allowing safe read transactions to be used across s. So, in lieu of type safety to prevent this footgun, I wrote up this blog post and this pull request to include a warning about this in the docs. Discuss on r/rust and Hacker News .
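For readers outside the Rust ecosystem, the architecture the update recommends — one dedicated writer connection with writes queued at the application level, separate connections for reads — can be sketched with Python's built-in sqlite3 module. The SingleWriter class and all of its details are my own illustration (error handling omitted), not SQLx's API:

```python
import queue
import sqlite3
import threading

class SingleWriter:
    """One dedicated writer connection owned by one thread. Writes queue at
    the application level, so they never contend for SQLite's EXCLUSIVE
    lock; reads can use separate (ideally read-only) connections."""

    def __init__(self, path):
        self.jobs = queue.Queue()
        self.thread = threading.Thread(target=self._run, args=(path,), daemon=True)
        self.thread.start()

    def _run(self, path):
        # The connection is created and used only on this thread.
        conn = sqlite3.connect(path)
        conn.execute("PRAGMA journal_mode=WAL")
        while True:
            job = self.jobs.get()
            if job is None:
                conn.close()
                return
            fn, done = job
            fn(conn)        # run the whole write transaction on this one connection
            conn.commit()
            done.set()      # wake the caller

    def write(self, fn):
        """Queue a write and block until the writer thread has committed it."""
        done = threading.Event()
        self.jobs.put((fn, done))
        done.wait()

    def close(self):
        self.jobs.put(None)
        self.thread.join()
```

Concurrent callers of `write` queue in the application rather than spinning on SQLite's lock, which is exactly the serialization the update argues for.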

David Bushell 3 weeks ago

Web font choice and loading strategy

When I rebuilt my website I took great care to optimise fonts for both performance and aesthetics. Fonts account for around 50% of my website (bytes downloaded on an empty cache). I designed and set a performance budget around my font usage. I use three distinct font families and three different methods to load them. Web fonts are usually defined by the CSS rule. The property allows us some control over how fonts are loaded. The value has become somewhat of a best practice — at least the most common default. The CSS spec says: Gives the font face an extremely small block period (100ms or less is recommended in most cases) and an infinite swap period. In other words, the browser draws the text immediately with a fallback if the font face isn’t loaded, but swaps the font face in as soon as it loads. CSS Fonts Module Level 4 - W3C That small “block period”, if implemented by the browser, renders an invisible font temporarily to minimise FOUC. Personally I default to and don’t change unless there are noticeable or measurable issues. Most of the time you’ll use swap. If you don’t know which option to use, go with swap. It allows you to use custom fonts and tip your hand to accessibility. font-display for the Masses - Jeremy Wagner Google Fonts defaults to the same value, which has performance gains. In effect, this makes the font files themselves asynchronous—the browser immediately displays our fallback text before swapping to the web font whenever it arrives. This means we’re not going to leave users looking at any invisible text (FOIT), which makes for both a faster and more pleasant experience. Speed Up Google Fonts - Harry Roberts Harry further notes that a suitable fallback is important, as I’ll discover below. My three fonts in order of importance are: Ahkio for headings. Its soft brush stroke style has a unique hand-drawn quality that remains open and legible. As of writing, I load three Ahkio weights at a combined 150 KB. That is outright greed!
Ahkio is core to my brand so it takes priority in my performance budget (and financial budget, for that matter!) Testing revealed the 100ms † block period was not enough to avoid FOUC, despite optimisation techniques like preload. Ahkio’s design is more condensed so any fallback can wrap headings over additional lines. This adds significant layout shift. † The Chrome blog mentions a zero-second block period. Firefox has a config preference default of 100ms. My solution was to use instead of , which extends the block period from a recommended 0–100ms up to a much longer 3000ms. Gives the font face a short block period (3s is recommended in most cases) and an infinite swap period. In other words, the browser draws “invisible” text at first if it’s not loaded, but swaps the font face in as soon as it loads. CSS Fonts Module Level 4 - W3C This change was enough to avoid ugly FOUC under most conditions. The worst-case scenario is three seconds of invisible headings. With my website’s core web vitals a “slow 4G” network can beat that by half. For my audience an extended block period is an acceptable trade-off. Hosting on an edge CDN with good cache headers helps minimise the cost. Update: Richard Rutter suggested which gives more fallback control than I knew. I shall experiment and report back! Atkinson Hyperlegible Next for body copy. It’s classed as a grotesque sans-serif with interesting quirks such as a serif on the lowercase ‘i’. I chose this font for both its accessible design and technical implementation as a variable font. One file at 78 KB provides both weight and italic variable axes. This allows me to give links a subtle weight boost. For italics I just go full-lean. I currently load Atkinson Hyperlegible with out of habit but I’m strongly considering why I don’t use . Gives the font face an extremely small block period (100ms or less is recommended in most cases) and a short swap period (3s is recommended in most cases).
In other words, the font face is rendered with a fallback at first if it’s not loaded, but it’s swapped in as soon as it loads. However, if too much time passes, the fallback will be used for the rest of the page’s lifetime instead. CSS Fonts Module Level 4 - W3C The browser can give up and presumably stop downloading the font. The spec actually says that and “[must/should] only be used for small pieces of text.” Although it notes that most browsers implement the default with similar strategies to . 0xProto for code snippets. If my use of Ahkio was greedy, this is gluttonous! A default would be acceptable. My justification is that controlling presentation of code on a web development site is reasonable. 0xProto is designed for legibility with a personality that complements my design. I don’t specify 0xProto with the CSS rule. Instead I use the JavaScript font loading API to conditionally load when a element is present. Note the name change because some browsers aren’t happy with a numeric first character. Not shown is the event wrapper around this code. I also load the script with both and attributes. This tells the browser the script is non-critical and avoids render blocking. I could probably defer loading even later without readers noticing the font pop in. Update: for clarity, browsers will conditionally load but JavaScript can purposefully delay the loading further to avoid fighting for bandwidth. When JavaScript is not available the system default is fine. There we have it, three fonts, three strategies, and a few open questions and decisions to make. Those may be answered when CrUX data catches up. My new website is a little chunkier than before but it’s well within reasonable limits. I’ll monitor performance and keep turning the dials. Web performance is about priorities. In isolation it’s impossible to say exactly how an individual asset should be loaded. There are upper limits, of course. How do you load a one megabyte font? You don’t.
Unless you’re a font studio providing a complete type specimen. But even then you could split the font and progressively load different Unicode ranges. I wonder if anyone does that? Anyway I’m rambling now, bye. Thanks for reading! Follow me on Mastodon and Bluesky. Subscribe to my Blog and Notes or Combined feeds.


Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters

Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters Kaiyang Zhao, Kaiwen Xue, Ziqi Wang, Dan Schatzberg, Leon Yang, Antonis Manousis, Johannes Weiner, Rik Van Riel, Bikash Sharma, Chunqiang Tang, and Dimitrios Skarlatos ISCA'23 This paper has a lot of great statistics from the Meta fleet. Memory capacity per server is growing over time, but TLB size is not, thus TLB coverage (the fraction of main memory that can be referenced by the TLB at any one time) is trending downward: Source: https://dl.acm.org/doi/10.1145/3579371.3589079 As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles! Source: https://dl.acm.org/doi/10.1145/3579371.3589079 A larger page size could increase TLB coverage, however large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world: We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page. The biggest culprit is unmovable (i.e., pinned) pages, such as pages pinned for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault (here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages. Segmentation The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions.
One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn’t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Hardware Page Migration The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving “unmovable” pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration. The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When the access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page. Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation. Results Fig. 10 has results for memory segmentation. “Partial” represents a case where physical memory is partially fragmented, whereas “full” represents full fragmentation. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Dangling Pointers If the NIC is the primary culprit, some vertical integration might be called for here.
For example, allocations used to send packets cycle through three states: CPU writing to it NIC reading from it It seems like it would be fine for the OS to migrate a packet when it is not in state #3. Subscribe now Source: https://dl.acm.org/doi/10.1145/3579371.3589079 As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles! Source: https://dl.acm.org/doi/10.1145/3579371.3589079 A larger page size could increase TLB coverage, however large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world: We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page. The biggest culprit is unmovable (i.e., pinned) pages, for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault ( here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages. Segmentation The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions. One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn’t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. 
The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Hardware Page Migration The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving “unmovable” pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration. The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When the access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page. Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation. Results Fig. 10 has results for memory segmentation. “Partial” represents a case where physical memory is partially fragmented, whereas “full” represents full fragmentation. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Dangling Pointers If the NIC is the primary culprit, some vertical integration might be called for here. For example, allocations used to send packets cycle through three states: Empty CPU writing to it NIC reading from it
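The per-cache-line redirect can be modeled in a few lines of Python (a purely illustrative software sketch; in the paper this logic lives in the LLC hardware):

```python
# Toy software model of the LLC's redirect during a page migration:
# the page is copied one cache line at a time, and each access is
# steered to whichever copy holds the up-to-date line.
LINE_SIZE = 64
PAGE_SIZE = 4096

class Migration:
    def __init__(self, src, dst):
        self.src = src       # source physical page (modeled as a bytearray)
        self.dst = dst       # destination physical page
        self.copied = 0      # cache lines [0, copied) have been moved

    def copy_next_line(self):
        lo = self.copied * LINE_SIZE
        self.dst[lo:lo + LINE_SIZE] = self.src[lo:lo + LINE_SIZE]
        self.copied += 1

    def _page_for(self, offset):
        # Serve from the destination iff this line was already copied.
        return self.dst if offset // LINE_SIZE < self.copied else self.src

    def read(self, offset):
        return self._page_for(offset)[offset]

    def write(self, offset, value):
        self._page_for(offset)[offset] = value
```

Once `copied` reaches `PAGE_SIZE // LINE_SIZE`, every access lands on the destination page and the OS can begin its asynchronous TLB invalidations.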

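The three-state packet lifecycle suggests a simple migration-safety check; a hypothetical sketch (none of these names come from the paper):

```python
from enum import Enum, auto

class PacketState(Enum):
    EMPTY = auto()        # state #1: buffer unused
    CPU_WRITING = auto()  # state #2: CPU filling in the packet
    NIC_READING = auto()  # state #3: NIC holds a live pointer to it

def safe_to_migrate(state: PacketState) -> bool:
    # The OS may move the buffer whenever the NIC is not reading it,
    # since only state #3 implies a dangling device-side pointer.
    return state is not PacketState.NIC_READING
```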
Max Bernstein 3 weeks ago

Type-based alias analysis in the Toy Optimizer

Another entry in the Toy Optimizer series. Last time, we did load-store forwarding in the context of our Toy Optimizer. We managed to cache the results of both reads from and writes to the heap, at compile-time! We were careful to mind object aliasing: we separated our heap information into alias classes based on what offset the reads/writes referenced. This way, even if we didn't know whether two objects aliased, we could at least know that accesses at different offsets would never alias (assuming our objects don't overlap and memory accesses are on word-sized slots). This is a coarse-grained heuristic. Fortunately, we often have much more information available at compile-time than just the offset, so we should use it. I mentioned in a footnote that we could use type information, for example, to improve our alias analysis. We'll add a lightweight form of type-based alias analysis (TBAA) (PDF) in this post. We return once again to Fil Pizlo land, specifically How I implement SSA form. We're going to be using the hierarchical heap effect representation from that post in our implementation, but you can use your own type representation if you have one already. This representation divides the heap into disjoint regions by type. Consider, for example, that objects of two unrelated types do not overlap: a pointer to one type is never going to alias a pointer to the other. They can therefore be reasoned about separately. But sometimes you don't have perfect type information available. If your language has a base class of all objects, then the base class's heap overlaps with, say, every subclass's heap. So you need some way to represent that too; just having a flat enum doesn't work cleanly. Consider an example simplified type hierarchy, where one subtree might represent different parts of the runtime's data structures and another could be further segmented into more specific object types. Fil's idea is that we can represent each node in that hierarchy with a tuple of integers (inclusive, exclusive) that represent the pre- and post-order traversals of the tree.
Or, if tree traversals are not engraved into your bones, they represent the range of all the nested objects within them. Then the "does this write interfere with this read" check, the aliasing check, is a range overlap query. The post's version is a perhaps over-engineered Python implementation of the range and heap hierarchy, based on the Ruby generator and C++ runtime code from JavaScriptCore, with a single top-level call that kicks off the tree-numbering scheme. Fil's implementation also covers a bunch of abstract heaps such as SSAState and Control, because his implementation is also used for code motion and whatnot. That can be added on later, but we will not do so in this post. So there you have it: a type representation. Now we need to use it in our load-store forwarding. Recall the shape of our load-store optimization pass: at its core, it iterates over the instructions, keeping a representation of the heap at compile-time. Reads get cached, writes get cached, and writes also invalidate the state of compile-time information about fields that may alias. So far, our may-alias check asks only whether the offsets overlap. That means a unit test which writes to the same offset in two objects of provably different types will fail: the test expects the cached write to the first object to survive a write to the same offset in the second, because the two objects are annotated with disjoint types. If we account for type information in our alias analysis, we can get this test to pass. After doing a bunch of fussing around with the load-store forwarding (many rewrites), I eventually got it down to a very short diff: if we don't have any type/alias information, we default to "I know nothing" for each object; then we check range overlap. The boolean logic looks a little weird, maybe. But we can also rewrite it (via De Morgan's law) to read as: keep all the cached field state for fields that are known, by offset and by type, not to alias. Maybe that is clearer (but not as nice a diff). Note that the type representation is not so important here!
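For concreteness, here is a minimal Python sketch of the range-numbering idea (illustrative names; the post's actual implementation, modeled on JavaScriptCore's, is more elaborate):

```python
# Assign each heap a half-open [begin, end) interval via a depth-first
# traversal, so a parent's interval contains all of its descendants'
# intervals. Aliasing then reduces to an interval-intersection test.
class AbstractHeap:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.begin = self.end = None
        if parent is not None:
            parent.children.append(self)

    def number(self, start=0):
        # Pre-order: take an index for this node, then number children.
        self.begin = start
        next_index = start + 1
        for child in self.children:
            next_index = child.number(next_index)
        # Post-order: the interval ends after the last descendant.
        self.end = next_index
        return next_index

    def overlaps(self, other):
        # Two heaps may alias iff their intervals intersect; for nested
        # intervals that means one is an ancestor of the other (or they
        # are the same heap).
        return self.begin < other.end and other.begin < self.end
```

With a hierarchy like World > Heap > {Int, Str}, the `Int` and `Str` heaps get disjoint intervals (so they never alias), while `Heap` and `World` overlap both.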
You could use a bitset version of the type information if you want. The important things are that you can cheaply construct types and check overlap between them. Nice, now our test passes! We can differentiate between memory accesses on objects of different types.

But what if we knew more? Sometimes we know where an object came from. For example, we may have seen it get allocated in the trace. If we saw an object's allocation, we know that it does not alias (for example) any object that was passed in via a parameter. We can use this kind of information to our advantage: in a made-up IR snippet, we would know that (among other facts) a freshly allocated object doesn't alias any of the incoming parameters, because we have seen its allocation site. I saw this in the old V8 IR Hydrogen's lightweight alias analysis 1 . There is plenty of other useful information, such as:

- If we know at compile-time that object A has 5 at offset 0 and object B has 7 at offset 0, then A and B don't alias (thanks, CF)
- In the RPython JIT in PyPy, this is used to determine that two user (Python) objects don't alias because we know the contents of the user (Python) class field
- Object size (though perhaps that is a special case of the above bullet)
- Field size/type
- Deferring alias checks to run-time (have a branch)

If you have other fun ones, please write in.

We only handle loads and stores in our optimizer. Unfortunately, this means we may accidentally cache stale information. Consider: what happens if a function call (or any other opaque instruction) writes into an object we are tracking? The conservative approach is to invalidate all cached information on a function call. This is definitely correct, but it's a bummer for the optimizer. Can we do anything? Well, perhaps we are calling a well-known function or a specific IR instruction. In that case, we can annotate it with effects in the same abstract heap model: if the instruction does not write, or only writes to some heaps, we can at least only partially invalidate our heap. However, if the function is unknown or otherwise opaque, we need at least more advanced alias information and perhaps even (partial) escape analysis. Consider: even if an instruction takes no operands, we have no idea what state it has access to. If it writes to any object A, we cannot safely cache information about any other object B unless we know for sure that A and B do not alias. And we don't know what the instruction writes to. So we may only know we can cache information about B because it was allocated locally and has not escaped.

Some runtimes such as ART pre-compute all of their alias information in a bit matrix. This makes more sense if you are using alias information in a full control-flow graph, where you might need to iterate over the graph a few times. In a trace context, you can do a lot in one single pass, so there is no need to make a matrix.

As usual, this is a toy IR and a toy optimizer, so it's hard to say how much faster it makes its toy programs. In general, though, there is a dial for analysis and optimization that goes between precision and speed. This is a happy point on that dial: only a tiny incremental analysis cost above offset-only invalidation, but higher precision. I like that tradeoff. Also, it is very useful in JIT compilers, where the managed language is generally a little better-behaved than a C-like language. Somewhere in your IR there will be a lot of duplicate loads and stores from a strength reduction pass, and this can clean up the mess.

Thanks for joining as I work through a small use of type-based alias analysis for myself. I hope you enjoyed. Thank you to Chris Gregory for helpful feedback.

1. I made a fork of V8 to go spelunk around the Hydrogen IR. I reset the V8 repo to the last commit before they deleted it in favor of their new Sea of Nodes based IR called TurboFan. ↩

Sean Goedecke 4 weeks ago

Two different tricks for fast LLM inference

Anthropic and OpenAI both recently announced "fast mode": a way to interact with their best coding model at significantly higher speeds. These two versions of fast mode are very different. Anthropic's offers up to 2.5x the tokens per second (so around 170, up from Opus 4.6's 65). OpenAI's offers more than 1000 tokens per second (up from GPT-5.3-Codex's 65 tokens per second, so 15x). So OpenAI's fast mode is six times faster than Anthropic's 1 . However, Anthropic's big advantage is that they're serving their actual model. When you use their fast mode, you get real Opus 4.6, while when you use OpenAI's fast mode you get GPT-5.3-Codex-Spark, not the real GPT-5.3-Codex. Spark is indeed much faster, but is a notably less capable model: good enough for many tasks, but it gets confused and messes up tool calls in ways that vanilla GPT-5.3-Codex would never do. Why the differences? The AI labs aren't advertising the details of how their fast modes work, but I'm pretty confident it's something like this: Anthropic's fast mode is backed by low-batch-size inference, while OpenAI's fast mode is backed by special monster Cerebras chips. Let me unpack that a bit.

The tradeoff at the heart of AI inference economics is batching, because the main bottleneck is memory. GPUs are very fast, but moving data onto a GPU is not. Every inference operation requires copying all the tokens of the user's prompt 2 onto the GPU before inference can start. Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full. A good analogy is a bus system. If you had zero batching for passengers - if, whenever someone got on a bus, the bus departed immediately - commutes would be much faster for the people who managed to get on a bus. But obviously overall throughput would be much lower, because people would be waiting at the bus stop for hours until they managed to actually get on one.
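The bus analogy can be put in rough numbers with a toy model (every constant below is made up for illustration):

```python
# Toy model of the batching tradeoff. Loading a batch onto the GPU has
# a fixed cost, so larger batches amortize it, but each rider has to
# wait for the bus to fill before it departs.
LOAD_MS = 50.0      # fixed per-batch cost of moving data onto the GPU
COMPUTE_MS = 5.0    # marginal compute per request once data is resident
ARRIVAL_MS = 20.0   # one new request arrives this often

def stats(batch_size):
    fill_wait = ARRIVAL_MS * (batch_size - 1)       # first rider waits longest
    service = LOAD_MS + COMPUTE_MS * batch_size     # one load, shared by all
    latency = fill_wait + service                   # worst-case request latency
    throughput = batch_size / service               # requests per ms of GPU time
    return latency, throughput
```

Under these made-up numbers, `stats(1)` is "fast mode" (the bus departs immediately, but the fixed load cost is paid per request), while `stats(16)` makes the worst-case request wait roughly 8x longer in exchange for almost 7x the requests per unit of GPU time.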
Anthropic's fast mode offering is basically a bus pass that guarantees that the bus immediately leaves as soon as you get on. It's six times the cost, because you're effectively paying for all the other people who could have got on the bus with you, but it's way faster 3 because you spend zero time waiting for the bus to leave. Obviously I can't be fully certain this is right. Maybe they have access to some new ultra-fast compute that they're running this on, or they're doing some algorithmic trick nobody else has thought of. But I'm pretty sure this is it. Brand new compute or algorithmic tricks would likely require changes to the model (see below for OpenAI's system), and "six times more expensive for 2.5x faster" is right in the ballpark for the kind of improvement you'd expect when switching to a low-batch-size regime.

OpenAI's fast mode does not work anything like this. You can tell that simply because they're introducing a new, worse model for it. There would be absolutely no reason to do that if they were simply tweaking batch sizes. Also, they told us in the announcement blog post exactly what's backing their fast mode: Cerebras. OpenAI announced their Cerebras partnership a month ago in January. What's Cerebras? They build "ultra low-latency compute". What this means in practice is that they build giant chips. An H100 chip (fairly close to the frontier of inference chips) is just over a square inch in size. A Cerebras chip is 70 square inches. You can see from pictures that the Cerebras chip has a grid-and-holes pattern all over it. That's because silicon wafers this big are supposed to be broken into dozens of chips. Instead, Cerebras etches a giant chip over the entire thing. The larger the chip, the more internal memory it can have. The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. Typically GPU SRAM is measured in the tens of megabytes.
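Why weight streaming bounds token rate can be sketched with back-of-the-envelope arithmetic (the bandwidth figures below are illustrative, not vendor specs):

```python
# During decode, a dense model reads every parameter once per generated
# token, so single-stream token rate is bounded by memory bandwidth
# divided by the size of the weights:
#   tokens/sec <= bandwidth_bytes_per_sec / bytes_of_weights
def max_tokens_per_sec(n_params, bytes_per_param, bandwidth_gb_per_s):
    bytes_of_weights = n_params * bytes_per_param
    return bandwidth_gb_per_s * 1e9 / bytes_of_weights

# A hypothetical 40B-parameter model at int8 (1 byte per parameter):
hbm_rate = max_tokens_per_sec(40e9, 1, 3_350)    # HBM3-class bandwidth
sram_rate = max_tokens_per_sec(40e9, 1, 50_000)  # much faster on-chip SRAM
```

The absolute numbers are illustrative; the point is that the achievable token rate scales directly with the bandwidth gap between off-chip memory and on-chip SRAM.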
That means that a lot of inference time is spent streaming portions of the model weights from outside of SRAM into the GPU compute 4 . If you could stream all of that from the (much faster) SRAM, inference would get a big speedup: fifteen times faster, as it turns out! So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. That's why they're offering a brand new model, and why the Spark model has a bit of "small model smell" to it: it's a smaller distil of the much larger GPT-5.3-Codex model 5 .

It's interesting that the two major labs have two very different approaches to building fast AI inference. If I had to guess at a conspiracy theory, it would go something like this:

1. OpenAI partner with Cerebras in mid-January, obviously to work on putting an OpenAI model on a fast Cerebras chip
2. Anthropic have no similar play available, but they know OpenAI will announce some kind of blazing-fast inference in February, and they want to have something in the news cycle to compete with that
3. Anthropic thus hustles to put together the kind of fast inference they can provide: simply lowering the batch size on their existing inference stack
4. Anthropic (probably) waits until a few days before OpenAI are done with their much more complex Cerebras implementation to announce it, so it looks like OpenAI copied them

Obviously OpenAI's achievement here is more technically impressive. Getting a model running on Cerebras chips is not trivial, because they're so weird. Training a 20B or 40B param distil of GPT-5.3-Codex that is still kind-of-good-enough is not trivial. But I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI's mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens.

Seeing the two major labs put out this feature might make you think that fast AI inference is the new major goal they're chasing. I don't think it is. If my theory above is right, Anthropic don't care that much about fast inference, they just didn't want to appear behind OpenAI. And OpenAI are mainly just exploring the capabilities of their new Cerebras partnership. It's still largely an open question what kind of models can fit on these giant chips, how useful those models will be, and whether the economics will make any sense.

I personally don't find "fast, less-capable inference" particularly useful. I've been playing around with it in Codex and I don't like it. The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user's time is spent handling mistakes instead of waiting for the model 6 . However, it's certainly possible that fast, less-capable inference becomes a core lower-level primitive in AI systems. Claude Code already uses Haiku for some operations. Maybe OpenAI will end up using Spark in a similar way.

1. This isn't even factoring in latency. Anthropic explicitly warns that time to first token might still be slow (or even slower), while OpenAI thinks the Spark latency is fast enough to warrant switching to a persistent websocket (i.e. they think the 50-200ms round trip time for the handshake is a significant chunk of time to first token). ↩
2. Either in the form of the KV-cache for previous tokens, or as some big tensor of intermediate activations if inference is being pipelined through multiple GPUs. I write a lot more about this in Why DeepSeek is cheap at scale but expensive to run locally, since it explains why DeepSeek can be offered at such cheap prices (massive batches allow an economy of scale on giant expensive GPUs, but individual consumers can't access that at all). ↩
3. Is it a contradiction that low-batch-size means low throughput, but this fast pass system gives users much greater throughput? No. The overall throughput of the GPU is much lower when some users are using "fast mode", but those users' throughput is much higher. ↩
4. Remember, GPUs are fast, but copying data onto them is not. Each "copy these weights to GPU" step is a meaningful part of the overall inference time. ↩
5. Or a smaller distil of whatever more powerful base model GPT-5.3-Codex was itself distilled from. I don't know how AI labs do it exactly, and they keep it very secret. More on that here . ↩
6. On this note, it's interesting to point out that Cursor's hype dropped away basically at the same time they released their own "much faster, a little less-capable" agent model. Of course, much of this is due to Claude Code sucking up all the oxygen in the room, but having a very fast model certainly didn't help. ↩
