Latest Posts (20 found)

Efficient Remote Memory Ordering for Non-Coherent Systems

Efficient Remote Memory Ordering for Non-Coherent Systems Wei Siew Liew, Md Ashfaqur Rahaman, Adarsh Patil, Ryan Stutsman, and Vijay Nagarajan ASPLOS’26 It seems like every year there is a new PCIe standard which doubles bandwidth. The key takeaway from this paper is that these improvements are for the fast path , but there exist use cases which are crippled by details of the PCIe protocol. This paper describes two inefficiencies caused by the current protocol and suggests improvements to address them. From my experience, here is the canonical way for the host X64 CPU to send a request to a PCIe device and receive a response. First, the request payload is stored in host memory (which can be mapped as CPU cacheable). During this time, the device does not read any of the payload. Next, a handful of MMIO writes (to uncacheable memory) are used to point the device to the payload and kick off the work. The device stores results via posted DMA writes to host memory. After producing all results, the device uses more posted DMA writes to update control information in host memory. Finally (and optionally), the device signals an interrupt. From the perspective of the device, the interrupt is another posted DMA write to the host. Posted DMA writes are visible to the host in the order they are issued. In other words, the host is guaranteed that it will only observe the control information update after the response payload has been written. Similarly, the host will receive the interrupt after the control information is written. The problems identified by this paper are caused by a lack of ordering guarantees. Table 1 illustrates where ordering is enforced today: Source: https://dl.acm.org/doi/10.1145/3779212.3790156 W→W ordering means two writes from a device will appear to land in host memory in the order they were issued. W→R means that if a device writes to an address in host memory, and then issues a read, the read response will contain the updated data. For more details there is a great explainer on a LinkedIn post here . PCIe has provisions that allow devices to relax these ordering guarantees, but there is no way to enforce more ordering. “R→R No” means that if a device issues two DMA read requests, the read response data could appear as if the reads occurred in the wrong order. This matters for scenarios where the host CPU is actively writing to the same data structures that the device is reading from. Imagine an application that involves two data structures: An array of data: An array of flags: is only valid if is set. Carefully written software could update these data structures, ensuring that is set only after the associated data is written. The trouble is there is no efficient way for a PCIe device to pipeline reads of and . For a given , the device has no choice but to read , wait for the response, and then issue the read of . Some systems work around this by ensuring that data and metadata are stored in the same cache line. A more realistic application is a key-value store that is concurrently updated by the host CPU and read by a device (i.e., RDMA reads from a NIC). It is hard to develop such a system in a way that offers strong consistency and high performance. The solution proposed by the authors is to add semantics similar to and to PCIe. A read TLP with the bit set would not be reordered past subsequent reads. A write TLP with the bit set would not be reordered before prior writes. The other inefficient scenario described by this paper is caused by the lack of W→W MMIO ordering. The problem here is in the CPU architecture. On x86, an MMIO region can be mapped as write-combined, but the application must issue expensive instructions to ensure that writes appear in the correct order. This restricts MMIO writes to low-bandwidth scenarios. The architectural solution described by this paper is to add four explicit MMIO instructions to the ISA: MMIO-Release MMIO-Acquire MMIO-Store and MMIO-Release are store instructions. MMIO-Release has release semantics (it will not be reordered before prior stores). MMIO-Load and MMIO-Acquire are load instructions. MMIO-Acquire instructions are not reordered after subsequent loads. The authors note that RISC-V has similar instructions, but they involve the CPU stalling to implement the desired memory ordering. The solution offered by this paper instead involves a re-order buffer in the PCIe root complex. The CPU assigns sequence numbers to MMIO operations, and the root complex uses those sequence numbers to restore operations to their correct order. Fig. 6 has simulation numbers projecting speedups an RDMA-based key-value store could see if it could properly pipeline DMA reads. and are the work described by this paper. assumes additional speculative optimizations in the root complex. Source: https://dl.acm.org/doi/10.1145/3779212.3790156 Fig. 10 shows how fast a single core can perform unordered MMIO writes. The idea is that if the CPU architecture is enhanced to allow an application to express just the right amount of ordering, it could be possible for a single core to write packet data as fast as a NIC can consume it. Source: https://dl.acm.org/doi/10.1145/3779212.3790156 Dangling Pointers The paper ends with this food for thought: By establishing a high-performance baseline for non-coherent I/O, this work raises the question of whether the complexity of coherent interconnects (like CXL) is truly necessary for future host-device communication. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts. An array of data: An array of flags: MMIO-Release MMIO-Acquire

0 views

Performance Predictability in Heterogeneous Memory

Performance Predictability in Heterogeneous Memory Jinshu Liu, Hanchen Xu, Daniel S. Berger, Marcos K. Aguilera, and Huaicheng Li ASPLOS'26 This paper presents a system named CAMP, which can be used to predict how the performance of a particular workload will be affected by moving the workload from local DRAM to remote DRAM accessed over CXL. Just collect some performance counters when running your workload on local DRAM, and you can predict how your workload will run on remote DRAM. In machine learning vernacular: CAMP derives a set of features from the values of Intel PMU counters collected while running an application out of local DRAM. Plug those feature values into a pre-trained model, and you can predict how much slower the application will run on CXL. The model is somewhat the opposite of a DNN; it has 5 weights! The values of those weights are learned by running a suite of microbenchmarks on the test hardware. The high-level model is adopted from previous work . The slowdown percentage associated with moving a workload to CXL memory is the sum of three factors: Slowdowns due to store buffer stalls Slowdowns due to demand read stalls Slowdowns due to line fill buffer (LFB) utilization A typical processor can commit a store instruction when the store data & address are placed in the store buffer (a local queue). This allows the processor to keep humming before the store lands in memory. The processor will stall however if the store buffer fills up. On CXL systems, stores that require a read-for-ownership (RFO) are a frequent cause of backpressure. If the hardware which drains the store buffer detects a store to a cache line which is not in the cache, then the cache line must be read into the cache before storing the updated data into the cache. These RFO reads have a longer latency when accessing remote DRAM over CXL. The model for CXL slowdown due to store buffer stalls is: k store is one of the pre-trained (platform-specific) model weights. is the value of a performance counter on Intel CPUs which counts the number of cycles where the store buffer is full. is simply the total number of cycles that the application ran for. Demand read stalls occur when a memory load instruction misses in the cache. The term “demand” refers to explicit memory load instructions, not prefetching. Cache misses due to loads are more complicated to model than cache misses due to stores. Processors have hardware to exploit memory level parallelism (multiple load misses outstanding at once), but the effectiveness of that hardware depends on the particular application code being run. Pointer chasing is the classic example of a workload with low memory level parallelism. The model for CXL slowdown due to demand read stalls is: k drd , , and are pre-trained weights. is a PMU counter which counts the number of cycles that the processor stalled due to a demand load that caused an L3 miss. is a PMU counter which counts the number of demand read requests sent to the uncore (L3 or off-chip memory). is a PMU counter which counts the total number of cycles where a demand read sent to the uncore was pending. The first term models what percentage of time the application could possibly be bound by memory. The second term models how much memory level parallelism is available in the application, and how much of that parallelism the hardware can exploit. The line fill buffer (LFB) is a hardware structure that tracks outstanding L1 cache misses (due to explicit loads or prefetch operations). When an L1 cache miss or prefetch occurs, an entry in the LFB tracks the pending load. When a subsequent load operation occurs that would normally trigger an L1 miss, the processor first checks the LFB to see if there is already a pending request to load the associated cache line. When the LFB fills up, then cache misses cause the processor to stall. The paper has separate models for CXL slowdowns due to LFB overutilization for different Intel chips. Here is the model for Skylake: I abbreviated the PMU counter names in the equation above: As usual, k cache is a platform specific constant, and measures total clock cycles. The first term models the percentage of time the workload is likely to be affected by increased memory latency. It is computed as the percentage of clock cycles when there was an L1 data miss but no L2 miss. The second term measures average utilization of the LFB. It is computed as a ratio of hits in the LFB divided by total operations that do not hit in the L1. The final term measures the percentage of LFB utilization that is attributable to prefetch operations. It is computed as the percentage of L1 prefetch operations which were satisfied by the L3. Fig. 1 shows how well various metrics predict the slowdown associated with moving a workload to CXL memory. The beautifully straight line in the rightmost chart shows that CAMP is a very accurate predictor. Source: https://dl.acm.org/doi/10.1145/3779212.3790201 Dangling Pointers This paper is a marvel of feature engineering, but do we not live in the era of deep learning? Subscribe now Slowdowns due to store buffer stalls Slowdowns due to demand read stalls Slowdowns due to line fill buffer (LFB) utilization

0 views

Enabling Fast Networking in the Public Cloud

Enabling Fast Networking in the Public Cloud Alireza Sanaee, Vahab Jabrayilov, Ilias Marinos, Farbod Shahinfar, Divyanshu Saxena, Gianni Antichi, and Kostis Kaffes ASPLOS'26 Networking applications that care about latency don’t bother with the Linux networking stack. Instead they use a kernel-bypass library for fast networking (e.g., mTCP , libvma , TAS ). This paper points out two problems with using this approach on cloud VMs: Many of these libraries are not widely supported on cloud VMs, because cloud service providers want to expose a uniform feature set across a diverse set of server configurations The options that do work in the cloud (e.g., DPDK) assume a single process owns the NIC, which doesn’t work if there are multiple processes per VM that want to use low-latency networking This paper proposes Machnet, which is built around a user space sidecar process. As shown in Fig. 2, applications that want low-latency networking communicate with the sidecar process, and the sidecar uses DPDK to communicate with the virtual NIC exposed by the cloud service provider: Source: https://dl.acm.org/doi/10.1145/3779212.3790158 Each application uses multiple receive/transmit queue pairs in shared memory to communicate with the sidecar process. The sidecar comprises multiple CPU threads, each of which uses a dedicated receive/transmit queue pair to communicate with the NIC. Machnet is all about optimizing latency, not throughput. This justifies Machnet’s lack of zero-copy support. The really interesting bit is the mapping between application queues and sidecar queues. Say there is an HTTP server powered by 8 threads, each with its own queue pair and set of connections. Also assume the sidecar process uses 4 threads to communicate with the NIC. To avoid packet reordering within a connection, all packets from a particular HTTP server thread map to a single sidecar thread. This mapping is configured by applications. When a queue is created to communicate with the sidecar, the application specifies which sidecar thread (i.e., NIC queue) it should be associated with. What happens when the NIC receives a packet? Ideally that packet would make its way to the desired queue without expensive synchronization between the sidecar threads. If the sidecar had bare-metal access to a particular NIC, then it could configure the RSS settings of the NIC to map packets for a specific connection to the correct NIC queue. However, this level of RSS configuration is not broadly available to software running in cloud VMs. To work around this problem, the authors came up with RSS--. The authors found that opaque RSS is broadly supported. This means that the NIC can hash various fields from a packet header to determine which receive queue to route the packet to, but the hashing function is undocumented and unconfigurable. The authors leverage this support by giving the Machnet sidecar process flexibility in which ports are used for a given connection. When a new connection is set up, Machnet tries out a bunch of different source/destination ports, hoping that the NIC RSS hashing function will map one of them to the desired NIC receive queue. Section 4.2.1 has back-of-the-envelope math to say that trying out 47 different combinations is likely enough. Note that this scheme requires that Machnet is running on both sides of the connection. Once Machnet finds the magic combination of source/destination ports to use, it sticks with that combination for the connection and assumes the NIC will consistently route received packets to the correct NIC receive queue. With this magic in place, the sidecar threads can run at full speed without the need to synchronize with each other. Fig. 7 plots latency vs load for the standard Linux networking stack (labeled “HTTP…”) and Machnet. Lines labeled “Azure” were generated from runs in Azure VMs, lines labeled Cloudlab were generated on bare-metal servers. Machnet seems like a clear win. Source: https://dl.acm.org/doi/10.1145/3779212.3790158 Dangling Pointers The underlying issue here is a collective action problem. Once a standard like NVMe is widely adopted, then it becomes “table stakes” for cloud service providers. Clearly there is value in an industry standard for low-latency networking between cloud VMs, but how to achieve that standardization? Subscribe now Many of these libraries are not widely supported on cloud VMs, because cloud service providers want to expose a uniform feature set across a diverse set of server configurations The options that do work in the cloud (e.g., DPDK) assume a single process owns the NIC, which doesn’t work if there are multiple processes per VM that want to use low-latency networking

0 views

Cicada: Dependably Fast Multi-Core In-Memory Transactions

Cicada: Dependably Fast Multi-Core In-Memory Transactions Hyeontaek Lim, Michael Kaminsky, and David G. Andersen SIGMOD'17 This paper is nine years old, but Gemini tells me it is still the state-of-the-art for single server, in-memory OLTP. Leave a comment if you know of a newer result. Section 3 of the paper describes three design principles for the Cicada database: Optimistic concurrency control Multi-versioning Loosely synchronized clocks Cicada stores multiple versions of each tuple in the database (each write produces a new version). As shown in Fig. 2, each tuple is stored as a linked list of versions: Source: https://dl.acm.org/doi/10.1145/3035918.3064015 Inlining is an important optimization described by the paper. The head node of each linked list can optionally store the data for the most recently generated version, which avoids pointer chasing on the common path. Each list element contains two timestamps. is the write timestamp, is a read timestamp. Each list is sorted such that the first list element is the most-recently-committed version. Cicada transactions do not use locks. Instead, Cicada optimistically assumes that no conflicts will occur and validates that assumption using the timestamps associated with each tuple accessed by a transaction. Per-version timestamps are also used to garbage collect old versions. The authors clearly put a lot of effort into scalable timestamp generation. A naive approach would be a single shared variable that is atomically incremented on each transaction. Contention for the cache line holding that variable would be nasty. Instead, Cicada uses loosely synchronized per-core clocks. Before starting a new transaction, a core uses the instruction to measure how much time has elapsed since the last transaction. This elapsed time is used to increment a local clock variable, which is used to generate the transaction timestamp. The least significant bits of each timestamp contain the ID of the thread which generated the timestamp (to ensure that all timestamps are unique). is not architecturally guaranteed to generate perfectly synchronized results across cores. Cicada compensates for unsynchronized clocks with periodic one-sided synchronization. Periodically, a core will compare the value of its local clock variable with the clock variable of another core. If the remote core has a larger value, then the local core will realize that it has a slow clock and will adjust accordingly to catch up. Cicada does not require perfect synchronization for correctness. If one clock is running slowly, its transactions will simply be more likely to abort. Fig. 3 shows TPC-C results. While Cicada supports durability with redo logging, these results were generated with logging disabled: Source: https://dl.acm.org/doi/10.1145/3035918.3064015 I find the 1 warehouse case to be the most interesting, as it has the most contention. Write contention clearly limits scalability even with all of the careful tuning to Cicada. Fig. 3(a) saturates at about 300K transactions per second for 8 cores. Back-of-the-envelope, that gives each core roughly 50,000 clock cycles to execute each transaction. TPC-C transactions aren’t that meaty. What is the speed of light? Subscribe now Optimistic concurrency control Multi-versioning Loosely synchronized clocks

0 views

Beyond Page Migration: Enhancing Tiered Memory Performance via Integrated Last-Level Cache Management and Page Migration

Beyond Page Migration: Enhancing Tiered Memory Performance via Integrated Last-Level Cache Management and Page Migration Hwanjun Lee, Minho Kim, Yeji Jung, Seonmu Oh, Ki-Dong Kang, Seunghak Lee, and Daehoon Kim MICRO'25 This paper describes a complement to page migration to optimize performance on systems with tiered memory (a.k.a. a far pool of memory accessible over CXL ). State-of-the-art techniques for optimizing tiered memory systems involve classifying all pages as “hot” or “cold”. Hot pages are placed in the low-latency memory tier while cold pages are placed in a tier with higher access latency. Some systems dynamically measure hotness others try to predict it up front. The main point of this paper is that this bimodal hot/cold assumption may not always hold. What if no page is particularly hot or cold, but all are warm? Fig. 1 has some data to back this up. The bars labeled represent a system where pages are spread across both memory tiers. The bars labeled represent a system where pages are dynamically migrated based on their hotness. Interleaving utilizes more of the aggregate memory bandwidth. Source: https://dl.acm.org/doi/10.1145/3725843.3756063 LLC Partitioning This paper describes a system called . At its core, doesn’t migrate pages at all. Instead, it adjusts the fraction of the LLC that is dedicated to far cache lines. Fig. 4 illustrates the design of (hardware and software): Source: https://dl.acm.org/doi/10.1145/3725843.3756063 The hardware module measures average L1 miss latency for accesses to near and far memory. The adjusts the fractions of LLC ways that are available for caching near and far memory, in an attempt to balance the L1 miss latency for near and far accesses. For example, if the average L1 miss latency for far pages is higher than for near pages, then more LLC ways will be dedicated for caching far memory accesses. If the hardware is able to achieve balanced near and far L1 miss latencies, then the job is done. If not, then the hardware signals to the kernel that it should migrate pages based on measured hotness. Fig. 6 has simulated performance results. represents the core LLC allocation scheme while allows page migration as well. It is impressive that without page migration can beat other page migration schemes. This isn’t just a cherry on top, but a whole new way of looking at the problem. Source: https://dl.acm.org/doi/10.1145/3725843.3756063 Dangling Pointers Section 3 of the paper goes into some detail about how data cache prefetching helps hide far memory latency. It would be interesting if the system could decide which pages are likely to contain data for which the latency can be hidden. Subscribe now

0 views
Dangling Pointers 1 months ago

MagiCache: A Virtual In-Cache Computing Engine

MagiCache: A Virtual In-Cache Computing Engine Renhao Fan, Yikai Cui, Weike Li, Mingyu Wang, and Zhaolin Li ISCA'25 This paper presents an implementation of RISC-V vector extensions where all vector computation occurs in the cache (i.e., SRAM-based in-memory computation). It contains an accessible description of in-SRAM computation, and some novel extensions. Recall that SRAM is organized as a 2D array of bits. Each row represents a word, and each column represents a single bit location in many words. A traditional read operation occurs by activating a single row. Analog values are read out from each bit and placed onto shared bit lines. There are two bit lines per column (one holding the value, one holding the complement). Values flow down to sense amplifiers that output digital values. Prior work has shown that this basic structure can be augmented to perform computation. Rather than activating a single row, two rows are activated simultaneously (let’s call the values of these rows and ). The shared bit lines perform computation in the analog domain, which results in two expressions appearing on the output of the sense amplifiers: ( AND ) and ( NOR ). Fig. 1(a) shows a diagram of such an SRAM array: Source: https://dl.acm.org/doi/10.1145/3695053.3731113 If you slap some digital logic at the end of the sense amplifiers, then you can generate other functions like OR, XOR, XNOR, NAND, shift, add. Shift and add involve horizontal connections. Fig. 4(c) shows a hardware diagram of this additional logic at the end of the sense amplifiers. Note that the resulting value can be written back into the SRAM array for future use. Multiplication is not directly supported but can be implemented with a sequence of shift and add operations. Source: https://dl.acm.org/doi/10.1145/3695053.3731113 Virtual Engine The innovation in this paper is to dynamically share a fixed amount of on-chip SRAM for two separate purposes: caching and a vector register file. The logical vector register file capacity required for a particular algorithm depends on the number of architectural registers used, and the width of each architectural register (RISC-V vector extensions allow software to configure a logical vector width). Note that this hardware does not have separate vector ALUs, the computation is performed directly in the SRAM arrays. Fig. 6 illustrates how the hardware dynamically allocates SRAM space between generic cache storage and vector registers (with in-memory compute). The unit of allocation is a segment . The width of a vector register determines how many segments it requires. Source: https://dl.acm.org/doi/10.1145/3695053.3731113 Initially, all SRAM space is dedicated to caching. When the hardware processes an instruction that writes to an uninitialized vector register, then the hardware allocates segments to hold data for that register (evicting cached data if necessary). This system assumes an enlightened compiler which will emit a instruction to hint to the hardware when it has reached a point in the instruction stream where no vector register has valid content. The hardware can use this hint to reallocate all memory back to being used for caching. Fig. 8 shows performance results normalized against prior work (labeled here). This shows a 20%-60% performance improvement, which is pretty good considering that the baseline offers an order-of-magnitude improvement over a standard in-order vector processor. Source: https://dl.acm.org/doi/10.1145/3695053.3731113 Dangling Pointers I wonder how this would compare to hardware that did not have a cache, but rather a scratchpad with support for in-memory computing. Subscribe now

0 views
Dangling Pointers 1 months ago

FlexGuard: Fast Mutual Exclusion Independent of Subscription

FlexGuard: Fast Mutual Exclusion Independent of Subscription Victor Laforet, Sanidhya Kashyap, Călin Iorgulescu, Julia Lawall, and Jean-Pierre Lozi SOSP'25 This paper presents an interesting use of eBPF to effectively add an OS feature: coordination between user space locking code and the kernel thread scheduler to improve locking performance. The paper describes most lock implementations as spin-then-park locks (e.g., busy wait in user space for some time, then give up and call the OS to block the waiting thread). A big problem with busy waiting is the performance cliff under oversubscription . Oversubscription occurs when there are more active threads than cores. In this case, busy waiting can be harmful, because it wastes CPU cycles when there is other useful work to do. The worst case occurs when a thread acquires a lock and then is preempted by the OS scheduler while many other threads are busy waiting. If the OS thread scheduler were smart, it would preempt one of the busy waiters and let the lock holder keep running. But alas, that level of coordination isn’t available … until now. In the good old days, researchers would have modified Linux scheduling code and tested their modified kernel. The modern (easier) way to achieve this is to use eBPF. The authors wrote an eBPF program that runs (in kernel space) each time a context switch occurs. This program is called the Preemption Monitor . The Preemption Monitor works in conjunction with a custom user space lock implementation. The net result is that the Preemption Monitor can reliably detect when the OS scheduler preempts a thread that is holding a lock. When this occurs the eBPF program writes information to a variable that user space code can read. The locking algorithm is as follows: First, try to acquire the lock with a simple atomic compare-and-swap. If that fails, then busy wait. Similar to Hapax locks , this busy waiting avoids contention on one cache line by forcing all threads to agree on the order they will acquire the lock and letting each thread spin on per-thread variables. During busy waiting, the variable written by the Preemption Monitor is checked. If this variable indicates that there currently exists a thread which has acquired a lock and has been preempted by the OS, then threads stop busy waiting and instead call the OS to block until the lock is released (using the same system call that a futex would use). Fig. 2 has performance results. The x-axis shows thread count (which varies over time). The green line is FlexGuard. The idea is that it gives great performance when there is no oversubscription (i.e., fewer than 150 threads) and offers performance similar to a purely blocking lock (the dark blue line) when there is oversubscription. Source: https://dl.acm.org/doi/10.1145/3731569.3764852 Dangling Pointers This problem seems ripe for overengineering. In some sick world, the compiler, OS, and hardware could all coordinate to support a “true critical section”. All pages accessed inside this critical section would be pinned into main memory (or even closer to the CPU), and the OS would try extremely hard not to preempt threads inside of the critical section. This would require some upper bound on the critical section working set and running time. Subscribe now First, try to acquire the lock with a simple atomic compare-and-swap. If that fails, then busy wait. Similar to Hapax locks , this busy waiting avoids contention on one cache line by forcing all threads to agree on the order they will acquire the lock and letting each thread spin on per-thread variables. During busy waiting, the variable written by the Preemption Monitor is checked. If this variable indicates that there currently exists a thread which has acquired a lock and has been preempted by the OS, then threads stop busy waiting and instead call the OS to block until the lock is released (using the same system call that a futex would use).

0 views
Dangling Pointers 1 months ago

RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations

RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations Hongrui Zhang, Yunan Zhang, and Hung-Wei Tseng ISCA'25 I recall a couple of decades ago when Pat Hanrahan said something like “all hardware wants to be programmable”. You can find a similar sentiment here : With most SGI machines, if you opened one up and looked at what was actually in there—processing vertexes in particular, but for some machines, processing the fragments—it was a programmable engine. It’s just that it was not programmable by you; it was programmable by me. And now, twenty years later, GPU companies have bucked the programmability trend and added dedicated ray tracing hardware to their chips. Little did they know, users would find a way to utilize this hardware for applications that have nothing to do with graphics. The task at hand is multiplying two (very) sparse matrices ( and ). Each matrix can be partitioned into a 2D grid, where most cells in the grid contain all 0’s. Cells in with non-zero entries must be multiplied by specific cells in with non-zero entries (using a dense matrix multiplication for each product of two cells). The core idea is elegantly simple, and is illustrated in Fig. 5: Source: https://dl.acm.org/doi/full/10.1145/3695053.3731072 The steps are: Build a ray tracing acceleration structure corresponding to the non-zero cells in For each non-zero cell in Trace a ray through to determine if there are any non-zero cells in that need to be multiplied by the current cell in In fig. 5 the coordinates of the non-zero cells in matrix are: [(2, 1) (2, 3) (3, 3) (7, 1)]. The figure shows rays overlaid on top of the result matrix, but I find it easier to think of the rays traced through matrix . The ray corresponding to the cell in at (2, 1) has a column index of 1, so the algorithm traces a ray horizontally through B at row 1. The ray tracing hardware will find that this ray intersects with the cell from at coordinate (1, 4). So, these cells are multiplied together to determine their contribution to the result. Fig. 7 has benchmark results. All results are normalized to the performance of the library (i.e., values greater than one represent a speedup). corresponds to the Intel MKL library running on a Core i7 14700K processor. The “w/o RT cores” bars show results from the same algorithm with ray tracing implemented in general CUDA code rather than using the ray tracing accelerators. It is amazing that this beats across the board. Source: https://dl.acm.org/doi/full/10.1145/3695053.3731072 Dangling Pointers It seems like the core problem to be solved here is pointer-chasing. I wonder if a more general-purpose processor that is located closer to off-chip memory could provide similar benefits. Subscribe now Build a ray tracing acceleration structure corresponding to the non-zero cells in For each non-zero cell in Trace a ray through to determine if there are any non-zero cells in that need to be multiplied by the current cell in

0 views
Dangling Pointers 1 months ago

Dissecting and Modeling the Architecture of Modern GPU Cores

Dissecting and Modeling the Architecture of Modern GPU Cores Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, and Antonio Gonzalez MICRO'25 The purpose of this paper is to understand the microarchitecture of recent NVIDIA GPUs, to be able to update architectural simulators that are used for research purposes. The authors uncovered lots of interesting tidbits. Take this information with a grain of salt; it is derived from careful experimentation rather than NVIDIA documentation. The paper uses the term sub-core to represent the hardware module which can execute warp-wide instructions. Each SM comprises four sub-cores. Fig. 3 illustrates the components within a sub-core and shows how 4 sub-cores share instruction and data caches: Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Instruction Issue The responsibility of resolving inter-instruction hazards (within a given warp) is split between the compiler and the hardware. There are two mechanisms the compiler can use to inform the hardware how it should avoid hazards: The instruction encoding allows any instruction to set the value of a per-warp stall counter. When the hardware issues such an instruction, it sets the stall counter to the specified value. On each clock cycle thereafter, the counter is decremented by one. The hardware will not issue more instructions for the warp until the counter reaches zero. This is useful for handling hazards with a fixed latency. Variable-latency hazards are resolved with dependence counters . The hardware tracks the value of six dependence counters per warp. The instruction encoding allows the compiler to specify up to two counters which should be incremented when an instruction is issued. One of these counters is decremented when the instruction writes to the register file, and the other is decremented when the instruction reads from the register file (to resolve WAR hazards). Additionally, the compiler can specify that a given instruction cannot issue until the value of specific dependence counters are zero. In fig. 2 above, the values of these counters are checked in the block, and the counters are incremented in the block. The warp scheduler prefers to pick a warp and stick with it (e.g., it is not a round-robin scheduler). If the current warp cannot be scheduled (e.g., the stall counter is greater than zero, or there was a cache miss), then the scheduler switches to another warp. The warp scheduler issues instructions in program order (within a warp). There is no out-of-order execution support. The register file has a limited number of ports, and instructions must be controlled to avoid attempting too many reads or writes in parallel. Register file port contention is not handled by the warp scheduler, instead it is handled further down the pipe. For example, the stage in fig. 2 will stall fixed-latency instructions until register file read ports are available. The register file cache (RFC) is a hardware component that reduces contention on the register file read ports. The RFC has storage for 6 vectors (and tags). The compiler can mark a source operand of an instruction such that the hardware will store the source operand in the cache for a subsequent operation to use. Note that the RFC does not store per-warp values and is only useful for caching data within one warp. This plays nicely with the “pick a warp and stick to it” scheduling policy. Listing 4 has some example code sequences demonstrating how the compiler can direct the operation of the RFC (e.g., ): Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Memory Access Most of the resources that are shared between sub-cores are shared for efficiency reasons. A single sub-core will not generate memory requests at a high throughput, and there is locality of reference between the memory accesses in multiple sub-cores. The block in fig. 3 is shared in order to properly support thread group shared memory (as a thread group is spread across all sub-cores in a SM). The shared memory access modules can handle one request every two cycles. That means if all 4 sub-cores are contending on memory, each one can make a request every 8 cycles. There is a FIFO of depth ~4 between each sub-core and the shared memory structures. Typical read-after-write latency in shared memory is between 20-40 cycles. The authors built a simulation model based on their experiments. Mean percentage absolute error (MAPE) is one metric for measuring how accurate a simulation model is compared to real hardware. Table 4 shows that the model derived from the findings in this paper are a better performance model for recent NVIDIA GPUs than the baseline ( ): Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Subscribe now Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Instruction Issue The responsibility of resolving inter-instruction hazards (within a given warp) is split between the compiler and the hardware. There are two mechanisms the compiler can use to inform the hardware how it should avoid hazards: The instruction encoding allows any instruction to set the value of a per-warp stall counter. When the hardware issues such an instruction, it sets the stall counter to the specified value. On each clock cycle thereafter, the counter is decremented by one. The hardware will not issue more instructions for the warp until the counter reaches zero. This is useful for handling hazards with a fixed latency. Variable-latency hazards are resolved with dependence counters . The hardware tracks the value of six dependence counters per warp. The instruction encoding allows the compiler to specify up to two counters which should be incremented when an instruction is issued. One of these counters is decremented when the instruction writes to the register file, and the other is decremented when the instruction reads from the register file (to resolve WAR hazards). Additionally, the compiler can specify that a given instruction cannot issue until the value of specific dependence counters are zero.

0 views
Dangling Pointers 1 months ago

Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture

Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture Rohan Juneja, Pranav Dangi, Thilini Kaushalya Bandara, Tulika Mitra, and Li-Shiuan Peh MICRO'25 This paper presents an implementation of the Active Message (AM) architecture, as an alternative to FPGA/CGRA architectures. AM architectures have been studied for a while; this was my first exposure. An accelerator implemented on an FPGA or CGRA typically uses a spatial computing paradigm. Each “instruction” in the algorithm is pinned to a physical location on the chip, and data flows between the instructions. I prefer to think of the data in motion as the local variables associated with threads that also move (using a specialized memory consistency model ). The active message architecture flips that script around. Data structures are pinned, while instructions move to the relevant data . Fig. 5 shows two processing elements (PEs), each of which contain two active messages (AMs). An active message looks a lot like an instruction: it contains an opcode, source operands, and a result operand. Throughout the computation, AMs move between PEs. PEs have a local ALU and local memory. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 The AM at the top of the figure has and . Here, is an operand that is being carried around for future use. The AM with a opcode will make its way through the chip until it arrives at the PE which contains the data to be loaded. At this point, the load operation will execute, and a new AM will be created. In the figure above, the new AM is the one at the bottom of PE0. It has , , and . Op1 is forwarded unchanged from the predecessor AM. The value of was the value of the data loaded from memory. The new opcode was obtained from the config memory , which contains a description of the program that is being executed. The next step to be performed is to multiply . One might expect PE0 to perform the multiplication, but in the figure above the AM is routed to , which performs the multiplication. A reason why you would want to do this is in a situation where there are many AMs queued to access the data memory associated with PE0, but few AMs queued to access the data memory associated with PE1. In this situation, it is better to let PE0 perform loads for other AMs (because PE0 is the only PE that can fulfill that task) and find a PE that is currently idle to perform the multiplication (any PE can perform the multiplication). Now the question you should be asking is: what real-world applications exhibit load imbalances between PEs like this? If a data structure were split between all PEs evenly, you would think that load will be spread nicely across the PEs. The answer is: irregular workloads like sparse matrix-vector multiplication. Fig. 6 shows how a source matrix, source vector, and result vector could be partitioned across 4 PEs. You can imagine how the sparsity of the tensors being operated on would cause load imbalance between the PEs. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 Fig. 11 compares the Nexus Machine against other architectures (each design has the same number of ALUs). Fig. 12 shows performance-per-watt. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 Dangling Pointers I imagine that AM architectures work best for algorithms that are insensitive to the order in which AMs are executed. That would be the case for matrix/vector multiplication (assuming addition is associative). It seems like there is a large design space here related to PE capabilities. Data structures could be replicated across PEs to enable memory access AMs to be serviced by multiple PEs, or the ALUs inside of each PE could be heterogeneous (e.g., some PEs can do division, others cannot). Subscribe now

0 views
Dangling Pointers 1 months ago

TiNA: Tiered Network Buffer Architecture for Fast Networking in Chiplet-based CPUs

TiNA: Tiered Network Buffer Architecture for Fast Networking in Chiplet-based CPUs Siddharth Agarwal, Tianchen Wang, Jinghan Huang, Saksham Agarwal, and Nam Sung Kim ASPLOS'26 Here we go again , another paper in a top-tier conference on the classic CS problem: how to DMA received packets from NIC to host. It would be interesting to understand why this is such a hot topic these days. This paper deals with the case where the host CPU comprises multiple chiplets. If you get nothing else from this, I hope you will learn something about SNC mode (I had not heard of it before). Recent Intel CPUs can be placed into Sub-NUMA Clustering mode (via a BIOS setting). This causes each chiplet to appear as a separate NUMA node. It is like a single socket CPU is transformed into a 4 socket CPU. The DRAM memory space is divided into four regions (one per chiplet), and the LLC slices within a chiplet only cache data from one memory space. This can be advantageous for some applications, because it can lower average LLC and DRAM access latency (by avoiding inter-chiplet communication). The downside is that the peak LLC capacity available to a single core is reduced. Fig. 3 illustrates these tradeoffs: Source: https://dl.acm.org/doi/10.1145/3760250.3762224 SNC and DDIO Recall that DDIO is a feature of Intel CPUs that allows a NIC to write received packets directly into the LLC, which the host CPU can then read. PCIe lanes are distributed among chiplets. This means that the NIC is directly connected to one chiplet. One way to support DDIO with SNC is to allocate buffers for received packets in the memory region associated with the chiplet that the NIC is connected to. This improves LLC bandwidth (for both the NIC and CPU cores) but decreases the LLC capacity available for network packets. In practice, this means that longer bursts of network packets degrade performance more when SNC is enabled (i.e., leaky DMA is a larger problem in SNC mode). Fig. 6 has data from a microbenchmark to back this up: Source: https://dl.acm.org/doi/10.1145/3760250.3762224 TiNA The solution proposed by this paper requires a change to the NIC/driver interface. Each ring buffer of received network packets is replaced by ring buffers (where is the number of chiplets). Ring buffer is placed in the memory region associated with chiplet . The NIC knows about all of these ring buffers and dynamically decides which one to use. The NIC prefers to use the ring buffer associated with the chiplet that it is directly connected to. However, if a burst of traffic causes high utilization of the LLC capacity of that chiplet, then the NIC will fall back to using the other ring buffers. The NIC estimates LLC utilization based on two competing rates: The rate that received network packets are produced by the NIC The rate that received network packets are consumed by the host The first rate is easy for the NIC to compute as it knows how fast it is sending bytes to the host. The second rate is computed by networking software running on the host, and periodically sent to the NIC. The overall approach reminds me of CEIO . The key difference is the set of memory segments available. CEIO uses NIC-local DRAM as the fallback path. One complication of splitting a single ring buffer into multiple is ensuring that the host processes received packets in order. This paper proposes using sequence numbers associated with each packet. Most protocols already use per-packet sequence numbers. For other protocols (e.g., UDP), the NIC adds a sequence number based on the order in which packets were received. When the host reads a packet from a logical ring buffer, it examines the sequence numbers from the packets at the head of each of the physical ring buffers and chooses the packet with the lowest sequence number. Fig. 9 has benchmark results: lower latency than SNC and non-SNC across a range of microbenchmarks. Source: https://dl.acm.org/doi/10.1145/3760250.3762224 Dangling Pointers It would be nice if SNC allowed more fine-grained configuration. For example, there may be applications where ideal performance is achieved if each CPU core only has access to the L3 slice that is directly connected to it. Subscribe now Source: https://dl.acm.org/doi/10.1145/3760250.3762224 SNC and DDIO Recall that DDIO is a feature of Intel CPUs that allows a NIC to write received packets directly into the LLC, which the host CPU can then read. PCIe lanes are distributed among chiplets. This means that the NIC is directly connected to one chiplet. One way to support DDIO with SNC is to allocate buffers for received packets in the memory region associated with the chiplet that the NIC is connected to. This improves LLC bandwidth (for both the NIC and CPU cores) but decreases the LLC capacity available for network packets. In practice, this means that longer bursts of network packets degrade performance more when SNC is enabled (i.e., leaky DMA is a larger problem in SNC mode). Fig. 6 has data from a microbenchmark to back this up: Source: https://dl.acm.org/doi/10.1145/3760250.3762224 TiNA The solution proposed by this paper requires a change to the NIC/driver interface. Each ring buffer of received network packets is replaced by ring buffers (where is the number of chiplets). Ring buffer is placed in the memory region associated with chiplet . The NIC knows about all of these ring buffers and dynamically decides which one to use. The NIC prefers to use the ring buffer associated with the chiplet that it is directly connected to. However, if a burst of traffic causes high utilization of the LLC capacity of that chiplet, then the NIC will fall back to using the other ring buffers. The NIC estimates LLC utilization based on two competing rates: The rate that received network packets are produced by the NIC The rate that received network packets are consumed by the host

0 views
Dangling Pointers 2 months ago

Binary Compatible Critical Section Delegation

Binary Compatible Critical Section Delegation Junyao Zhang, Zhuo Wang, and Zhe Zhou PPoPP'26 The futex design works great when contention is low but leaves much to be desired when contention is high. I generally think that algorithms should be crafted to avoid high lock contention, but this paper offers a contrarian approach that improves performance without code changes . Acquiring a futex involves atomic operations on the cache lines that contain the futex state. In the case of high contention, these cache lines violently bounce between cores. Also, user space code will eventually give up trying to acquire a lock the easy way and will call into the kernel, which has its own synchronization to protect the shared data structures that manage the queue of threads waiting to acquire the lock. The problems don’t end when a lock is finally acquired. A typical futex guards some specific application data. The cache lines containing that data will also uncomfortably bounce between cores. The idea behind delegation is to replace the queue of pending threads with a queue of pending operations . An operation comprises the code that will be executed under the lock, and the associated data. This C++ code snippet shows how I think of delegation: In the uncontended case, will execute directly. In the contended case, will be placed into a queue to be executed later. After any thread finishes executing a function like , that thread will check the queue. If the queue is not empty, that thread will go ahead and execute all of the functions contained in the queue. In the example above, the data guarded by the lock is . If a particular thread calls 10 functions from the queue, then remains local to the core that thread is running on, and the system will avoid moving between cores 10 times. The magic of this paper is that it shows how to change the OS kernel to automatically implement delegation for any application that uses futexes. When the futex code gives up trying to acquire a futex in user space, it calls the OS to wait on the futex. The implementation of this system call is changed to implement automatic delegation. Automatic delegation can fail (as illustrated by Fig. 2), in which case the traditional futex waiting algorithm is used. Source: https://dl.acm.org/doi/10.1145/3774934.3786439 This paper makes heavy use of the Userspace Bypass library (a.k.a. UB; paper here ). This library allows the kernel to safely execute user-mode code. It was originally designed to optimize syscall heavy applications, by allowing the kernel to execute the small tidbits of user space code in-between system calls. UB uses binary translation to translate instructions that were meant to run in user space into instructions that can securely be executed by the kernel. Binary compatible critical section delegation uses UB to translate the code inside of the critical section (i.e., the code between the futex lock and unlock calls) into code that can be safely executed by the kernel. A pointer to this translated code is placed into a queue of delegated calls (the vw queue ). The set of threads which are trying to acquire a lock cooperatively execute the functions in the vw queue. At any one time, at most one thread is elected to be the delegate thread. It drains the vw queue by executing (in kernel space) all the delegated functions in the queue. This works great in cases where the code inside of the critical section accesses a lot of shared state, because that shared state can happily reside in the cache of the core that is running the delegate thread, rather than bouncing between cores. The paper has impressive results from microbenchmarks, but I think real applications are more relevant. Table 2 shows performance results for a few applications and a few locking strategies. BCD is the work in this paper. TCS and TCB are prior work which have the drawback of not being compatible with existing binaries. Source: https://dl.acm.org/doi/10.1145/3774934.3786439 Dangling Pointers There is a hint here at another advantage of pipeline parallelism over data parallelism: allowing persistent data structures to remain local to a core. Subscribe now

0 views
Dangling Pointers 2 months ago

Radshield: Software Radiation Protection for Commodity Hardware in Space

Radshield: Software Radiation Protection for Commodity Hardware in Space Haoda Wang, Steven Myint, Vandi Verma, Yonatan Winetraub, Junfeng Yang, and Asaf Cidon ASPLOS'25 If you read no further, here are two interesting factoids about outer space from this paper: Launch costs have fallen 60x, with the current cost to launch 1kg to space clocking in at $1,400 (see Fig. 1 below). Many satellites orbiting the Earth and devices sent to Mars use Snapdragon CPUs! I assumed that all chips leaving planet Earth would be specialized for space, apparently not. Source: https://dl.acm.org/doi/10.1145/3760250.3762218 This paper describes software solutions to deal with two common problems that occur in outer space: Single-Event Latchups and Single-Event Upsets , both of which are caused by radiation interfering with the normal operation of a circuit. A single-event latchup (SEL) causes one portion of the chip to heat up. If left unmitigated, this can damage the chip. The solution to this is to detect the problem and reboot. The trick is in the detection. The classic detection method monitors chip current draw. However, this technique fails with a modern off-the-shelf CPU which is designed to have a wide variability in current draw. When compute load increases, clock frequencies and voltages change, cores come out of sleep states, and power consumption naturally increases. The point of this design is to save power during idle periods, which is especially important for satellites which must get their power from the sun. The solution proposed by this paper is called ILD. The idea is to predict the expected current draw based on a simple model that uses CPU performance counters (e.g., cache hit rate, instruction execution rate) as input. If the measured current draw is much larger than predicted, then the system is rebooted. The model is not perfect, and the authors noticed that this scheme only works well when the CPU load is not too high. This “predict, check, reboot if necessary” cycle only occurs during relatively calm periods of time. The system is modified to force 3-second idle periods every 3 minutes to ensure that reliable measurements can be taken. An SEL takes about 5 minutes to damage the chip, the 3-minute period is chosen to be below that threshold. A single-event upset causes the value of a bit to flip (in memory, cache, the register file, etc). There are two common solutions to SEUs: Use ECC on stored data Perform computations with triple modular redundancy (3-MR), which requires computing each result 3 times and choosing the most popular result if there is disagreement about the correct result This paper deals with mitigating SEUs that affect user “space” code. The authors define the term reliability frontier to represent the interface between hardware components that support ECC and those that do not. For example, if flash storage has ECC but DRAM does not, then flash is considered part of the reliability frontier. A typical smartphone CPU advanced satellite chip has multiple CPU cores. One way to alleviate the compute cost of 3-MR is to compute all 3 results on 3 separate cores in parallel. A problem with this approach is that the CPU cores may share unreliable hardware. For example, the last level cache could be shared by all cores but not support ECC. If a bit flips in the LLC, then all cores will see the corrupted value, and parallel 3-MR will not detect a problem. The paper proposes an algorithm called EMR. The idea is to break a computation into multiple tasks and associate metadata with each task that describes the subset of input data accessed by the task. Fig. 6 shows a motivating example. The task of analyzing an image may be decomposed into many tasks, where each task processes a subset of the input image. Source: https://dl.acm.org/doi/10.1145/3760250.3762218 In EMR, there is an API to explicitly create tasks and specify the set of input data that each task reads from. EMR then runs tasks in multiple epochs. Within an epoch, no two tasks read the same input data. EMR invalidates caches up to the reliability frontier between epochs. If there are many tasks, and few epochs, then this system works great (i.e., it has high CPU utilization and does not spend too much time invalidating caches). Table 2 compares ILD performance in detecting SELs against a random forest model and a model that simply compares current draw against a fixed value: Source: https://dl.acm.org/doi/10.1145/3760250.3762218 Fig. 11 shows the performance impact of EMR. Each result is normalized against a parallel version of 3-MR which ignores the problems associated with shared hardware. The red bars represent 3-MR run on a single core; the blue bars represent EMR. Source: https://dl.acm.org/doi/10.1145/3760250.3762218 Dangling Pointers EMR would benefit from a system that detects when a programmer misspecifies the set of inputs that will be read. Maybe hardware or software support could be added to detect this kind of bug. Subscribe now Source: https://dl.acm.org/doi/10.1145/3760250.3762218 This paper describes software solutions to deal with two common problems that occur in outer space: Single-Event Latchups and Single-Event Upsets , both of which are caused by radiation interfering with the normal operation of a circuit. Single-Event Latchups A single-event latchup (SEL) causes one portion of the chip to heat up. If left unmitigated, this can damage the chip. The solution to this is to detect the problem and reboot. The trick is in the detection. The classic detection method monitors chip current draw. However, this technique fails with a modern off-the-shelf CPU which is designed to have a wide variability in current draw. When compute load increases, clock frequencies and voltages change, cores come out of sleep states, and power consumption naturally increases. The point of this design is to save power during idle periods, which is especially important for satellites which must get their power from the sun. The solution proposed by this paper is called ILD. The idea is to predict the expected current draw based on a simple model that uses CPU performance counters (e.g., cache hit rate, instruction execution rate) as input. If the measured current draw is much larger than predicted, then the system is rebooted. The model is not perfect, and the authors noticed that this scheme only works well when the CPU load is not too high. This “predict, check, reboot if necessary” cycle only occurs during relatively calm periods of time. The system is modified to force 3-second idle periods every 3 minutes to ensure that reliable measurements can be taken. An SEL takes about 5 minutes to damage the chip, the 3-minute period is chosen to be below that threshold. Single-Event Upsets A single-event upset causes the value of a bit to flip (in memory, cache, the register file, etc). There are two common solutions to SEUs: Use ECC on stored data Perform computations with triple modular redundancy (3-MR), which requires computing each result 3 times and choosing the most popular result if there is disagreement about the correct result

0 views
Dangling Pointers 2 months ago

Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds

Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds I learned a lot about the LLC configuration and monitoring capabilities of modern CPUs from this paper, I bet you will too. The problem this paper addresses is: how to avoid performance variability in cloud applications due to cross-VM contention for the last level cache (e.g., the L3 cache on a Xeon)? In a typical CPU, the L1 and L2 caches are private to a core, but the L3 is shared. In a cloud environment, the L3 is shared by multiple tenants, and is an avenue for a “noisy neighbor” to annoy its neighbors. The work described by this paper builds upon Intel CMT and CAT. Cache Monitoring Technology allows the hypervisor to track how much of the L3 cache is occupied by each VM. Cache Allocation Technology allows the hypervisor to restrict a VM to only use a subset of the L3. CAT allows a VM to be assigned to a cache level of service (CLOS), which defines the set of L3 ways accessible to the VM ( this page defines the term “ways” if you are unfamiliar). A typical CPU used by a cloud service provider has more CPU cores than L3 ways. If a cloud server hosts many small VMs, then L3 ways must be shared amongst VMs. The key problem solved by this paper is how to reduce performance variability given this constraint. Fig. 1 illustrates the assignments of CLOS levels to LLC ways advocated by this paper. Each row is a level of service, and each column is a way of the LLC cache. CLOS[0] can access all ways, CLOS[1] can access all LLC ways except for one. CLOS[7] can only access a single way of the LLC. Source: https://dl.acm.org/doi/10.1145/3774934.3786415 The hypervisor uses Intel CMT to monitor how much of the LLC is occupied by each VM. Every 6 seconds the hypervisor uses this information to change the CLOS that each VM is assigned to. The hypervisor computes a target LLC occupancy for each VM based on the number of cores assigned to the VM. This target is compared against the measured LLC occupancy to classify each VM into one of three categories: Poor (the VM is starved for space) Adequate (the VM is using just the right amount of cache) Excess (the VM is hogging too much) VMs in the poor category are de-suppressed (i.e., assigned to a CLOS with access to more LLC ways). Additionally, VMs in the excess category are suppressed (i.e., assigned to a CLOS with access to fewer ways), but this suppression only occurs when there are VMs in the poor category. This policy means that cache-hungry VMs can use more than their fair share of the L3 during periods of low server utilization. This can lead to higher mean performance, at the cost of a wider standard deviation. The paper describes a 4th state (overflow), which is only applied to VMs that wish to be held back even if there is plenty of L3 space available. These VMs are suppressed when they are found to be using too much L3, even if all other VMs on the system are getting enough cache space. Fig. 5 shows a case where this strategy works well compared to static allocation. The server in question is running 5 VMs, each running a different application: VM1 - 32 cores VM2 - 16 cores (but doesn’t fully utilize those cores) VM3 - 8 cores VM4 - 4 cores VM5 - 4 cores The top of figure 5 shows a simple static partitioning of LLC ways. VM1 is assigned to 6 ways, VM2 is assigned to 3 ways, VM3 is assigned to 2 ways, and VMs 4 and 5 must share 1 way. They have to share because sharing based on the number of ways in the LLC is inherently coarse-grained. The two charts show measured LLC utilization over 10 minutes. Notice the Y-axis. The technique described in this paper (Cacheman) allows VM4 and VM5 to use far more aggregate LLC capacity than the static partitioning. Also notice that in the static partitioning, VM5 always uses more LLC than VM4 (because they are running different applications), whereas Cacheman allows for a more even balance between them. Source: https://dl.acm.org/doi/10.1145/3774934.3786415 Dangling Pointers While the L3 cache is logically a monolithic shared resource, it is physically partitioned across the chip (with a separate slice near each core). It seems like it could be more efficient if VMs could be assigned to nearby L3 slices rather than L3 ways. Subscribe now Poor (the VM is starved for space) Adequate (the VM is using just the right amount of cache) Excess (the VM is hogging too much) VM1 - 32 cores VM2 - 16 cores (but doesn’t fully utilize those cores) VM3 - 8 cores VM4 - 4 cores VM5 - 4 cores

0 views
Dangling Pointers 2 months ago

Hapax Locks: Scalable Value-Based Mutual Exclusion

Hapax Locks: Scalable Value-Based Mutual Exclusion Dave Dice and Alex Kogan PPoPP'26 This paper describes a locking algorithm intended for cases where spinning is acceptable (e.g., one thread per core systems). It is similar to a ticket lock but generates less coherence traffic. Each lock/unlock operation causes a constant number of cache lines to move between cores, regardless of the number of cores involved or how long they spin for. As we’ve seen in a previous paper , polling a value in memory is cheap if the cache line is already local to the core which is polling. A Hapax lock comprises two 64-bit fields: Additionally, there is a global (shared among all Hapax locks) 64-bit sequence number. Each time a thread attempts to lock a Hapax lock, it generates a Hapax value which uniquely identifies the locking episode . A locking episode is a single lock/unlock sequence performed by a specific thread. A Hapax value is generated by atomically incrementing the sequence number. It is assumed that the 64-bit counter additions will never overflow. Next, the locking thread atomically exchange the value of with the Hapax value it just generated. This exchange operation generates a total ordering among Hapax values. It is a way for threads to cooperatively decide the order in which they will acquire the lock. Say thread generates Hapax value and stores it into (via an atomic exchange operation). Next, thread generates Hapax value and atomically exchanges the value of with . The result of the exchange operation performed by will be . At this point, thread knows that it is directly behind thread A in the queue and must wait for thread to release the lock. To finish acquiring the lock, threads continually poll , waiting for to equal the Hapax value of the preceding locking episode. In the example above, thread polls until it sees the value . At this point, the lock has been acquired. Unlocking is implemented by storing the Hapax value used by the unlocking thread into . In the running example, thread would unlock the lock by storing the value into . This algorithm generates a lot of coherence traffic. In particular, the cache line which holds the sequence number would move between cores each time a new Hapax value is generated. Also, each store to would send coherence traffic to each core which had recently polled the value of . The paper has two techniques to address these issues. While the sequence number monotonically increases, the values stored in and do not. There are two reasons for this. First, a single sequence number is shared among all Hapax locks. The second reason is that multiple threads can generate Hapax values and then race to perform the atomic exchange operation. For example, thread could generate a Hapax value of while thread generates a Hapax value of . They then race each other to atomically exchange their Hapax value with the value of . If thread wins the race, then will first take on the value , and then later it will have the value . Once you realize that the values of and are not monotonically increasing, it is straightforward to see how the generation of Hapax values can be made cheap. A thread can hoard a batch of Hapax values with a single atomic add operation. For example, a thread could atomically increase the value of the sequence number by 1024. At this point, the thread has allocated 1024 Hapax values for itself that it can use in the future without accessing the cache line which holds the shared sequence number. The paper proposes allocating Hapax values in blocks of 64K. The paper proposes adding an additional array which serves a similar role as . The number of elements in the array should be greater than the number of cores (the paper uses an array of 4096 values). Like the sequence number, this array is shared among all Hapax locks. When a thread writes its Hapax value into , the thread also stores its Hapax value into one of the 4096 elements. The array index is determined by the Hapax value. Many potential hash functions could be used. The paper proposes hashing bits [27:16] of the Hapax value. The 16 is related to the allocator block size. In the locking sequence, a thread loads the value of once. If the value of Depart does not match the expected Hapax value, then the locking thread polls the appropriate element of the shared array. The thread polls this element until its value changes. If the new value is the expected value of , then the lock has been acquired. If not, then a hash collision has occurred (e.g., a locking episode associated with a different Hapax lock caused the value to be updated). In this case, the thread starts over by checking and then polling the array element if necessary. This scheme minimizes coherence traffic associated with polling. When an unlocking core stores a value into an array element, the associated cache line will typically be present only in the cache of the next core in line. Coherence traffic is only generated related to the locking and unlocking cores. Other threads (which are further back in the line) will be polling other array elements and thus be loading from other cache lines and so the cores those threads are running on won’t see the coherence messages. Fig. 3 has results from a microbenchmark. Hapax locks scale much better than ticket locks and go head-to-head with other state-of-the-art locking algorithms. The Hapax implementation is so concise (about 100 lines) that the authors included C++ source code in the paper. Source: https://dl.acm.org/doi/10.1145/3774934.3786443 Dangling Pointers The big downside of spinning is that it wastes cycles in the case where there are other threads that the OS could schedule. I wonder if there is a lightweight coordination mechanism available. For example, the OS could write scheduling information into memory that is mapped read-only into user space. This could be used to communicate to the spinning code whether or not there are other threads ready to run. Subscribe now

0 views
Dangling Pointers 2 months ago

Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures

Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures Reza Ghanbari, Henry Kao, João P. L. De Carvalho, Ehsan Amiri, and J. Nelson Amaral CGO'25 This paper serves as a warning: don’t go overboard with vector instructions. There is a non-trivial amount of performance to be had by balancing compute between scalar and vector instructions. Even if you fear that automatic vectorization is fragile, this paper has some interesting lessons. Listing 1 contains a vectorizable loop and listing 2 shows a vectorized implementation: Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Source: https://dl.acm.org/doi/10.1145/3696443.3708950 After achieving this result, one may be tempted to pat oneself on the back and call it a day. If you were a workaholic, you might profile the optimized code. If you did, you would see something like the data in table 1: Source: https://dl.acm.org/doi/10.1145/3696443.3708950 And you could conclude that this algorithm is compute-bound. But what do we really mean by “compute-bound”? A processor contains many execution ports, each with a unique set of capabilities. In the running example, the execution ports capable of vector multiplication and addition are fully booked, but the other ports are sitting mostly idle! Listing 3 shows a modified loop which tries to balance the load between the vector and scalar execution ports. Each loop iteration processes 9 elements (8 via vector instructions, and 1 via scalar instructions). This assumes that the processor supports fast unaligned vector loads and stores. Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Section 3 has details on how to change LLVM to get it to do this transformation. Fig. 3 shows benchmark results. By my calculations, the geometric mean of the speedups is 8%. Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Dangling Pointers This paper builds on top of automatic vectorization. In other words, the input source code is scalar and the compiler vectorizes loops while balancing the workload. An alternative would be to have the source code in a vectorized form and then let the compiler “devectorize” where it makes sense. Subscribe now

0 views
Dangling Pointers 2 months ago

Flexible I/O for Database Management Systems with xNVMe

Flexible I/O for Database Management Systems with xNVMe Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier González, Vivek Shah, Pınar Tözün CIDR’26 This paper describes xNVMe , a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB. Section 2 contains the hard sell for . The “x” prefix serves a similar role to the “X” in DirectX. It is fast, while also being portable across operating systems and storage devices. The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry). There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously). Listing 3 has an example, which feels like “Mantle for NVMe”: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf The API works on Linux, FreeBSD, Windows, and macOS. Some operating systems have multiple backends available (e.g., , ). The point of this paper is that it is easy to drop into an existing application. The paper describes , which is an implementation of the DuckDB interface and uses . creates dedicated queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel). The paper also describes how supports shiny new NVMe features like Flexible Data Placement (FDP). This allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance). Most of the results in the paper show comparable performance for vs the baseline DuckDB filesystem. Fig. 5 shows one benchmark where yields a significant improvement: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf Dangling Pointers I think the long-term success of will depend on governance. Potential members of the ecosystem could be scared off by Samsung’s potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?) There is a delicate balancing act between an API driven by a sluggish bureaucratic committee, and an API which is dominated by one vendor. Subscribe now

0 views
Dangling Pointers 2 months ago

A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation

A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation Christian Lanius, Florian Freye, and Tobias Gemmeke GLVLSI'25 This paper ostensibly describes an ASIC accelerator for NFA evaluation (e.g., regex matching), but this paper also describes two orthogonal techniques for optimizing NFA evaluation which are applicable to more than just this ASIC. Any regular expression can be converted to a non-deterministic finite automaton (NFA) . Think of an NFA like a state machine where some inputs can trigger multiple transitions. The state machine is defined by a set of transitions . A transition is an ( , , ) tuple. The non-deterministic naming comes from the fact that multiple tuples may exist with identical ( , ) values; they only differ in their values. This means that an NFA can be in multiple states at once. One way to evaluate an NFA is to use a bitmap to track the set of active states. For each new input symbol, the set of active states in the bitmap is used to determine which transitions apply. Each activated transition sets one bit in the bitmap used to represent the active states for the next input symbol. The hardware described in this paper uses a compute-in-memory (CIM) microarchitecture. A set of columns stores the state machine, with each column storing one transition. This assumes that the transition function is sparse (i.e., the number of transitions used is much lower than the maximum possible). During initialization, the transitions are written into the CIM hardware. An input symbol is processed by broadcasting it and the current state bitmap to all columns. All columns evaluate whether their transition should be activated. The hardware then iterates (over multiple clock cycles) over all activated transitions and updates the state bitmap for the next input symbol. The left side of Fig. 5 illustrates the hardware in each column which compares the input symbol, current state, against the stored tuple: Source: https://dl.acm.org/doi/10.1145/3716368.3735157 The algorithm described above processes at most one input symbol per cycle (and it is slower for inputs that activate multiple transitions). The paper contains two tricks for overcoming this limitation. Fig. 4 illustrates how an NFA that accepts one symbol per cycle can be converted into an NFA which accepts two symbols per cycle. For example, rather than consider and to be separate symbols, put them together into one mega-symbol: . This is feasible as long as your NFA implementation isn’t too sensitive to the number of bits per symbol. Source: https://dl.acm.org/doi/10.1145/3716368.3735157 Cool Trick #2 - Bloom Filter The target application for this hardware is monitoring network traffic for threats (e.g., Snort ). A key observation is that most inputs (network packets) do not produce a match, so it is reasonable to assume that most of the time the NFA will be in the initial state, and most input symbols will not trigger any transitions. If that assumption holds, then a bloom filter can be used to quickly skip many input symbols before they even reach the core NFA evaluation hardware. The bloom filter is built when the NFA transition function changes. To build the bloom filter, iterate over each transition for which holds. For each such transition, compute a hash of the input symbol, decompose the hashed value into indices, and set the corresponding bits in the bloom filter. To test an input symbol against the bloom filter, hash the input symbol, decompose the hashed value into indices, and check to see if all of the corresponding bits are set in the bloom filter. If any bit is not set, then the input symbol does not trigger a transition from the initial state. When that symbol finally arrives at the NFA hardware, it can be dropped if the NFA is in the initial state. Table 1 compares PPA results against other published NFA accelerators. It is a bit apples-to-oranges as the various designs target different technology nodes. The metric that stands out is the low power consumption of this design. Source: https://dl.acm.org/doi/10.1145/3716368.3735157 Dangling Pointers I wonder if the bloom filter trick can be extended. For example, rather than assuming the NFA will always be in the initial state, the hardware could dynamically compute which states are the most frequent and then use bloom filters to drop input symbols which cannot trigger any transitions from those states. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts and support my work.

0 views
Dangling Pointers 2 months ago

Better Memory Tiering, Right from the First Placement

Better Memory Tiering, Right from the First Placement João Póvoas, João Barreto, Bartosz Chomiński, André Gonçalves, Fedar Karabeinikau, Maciej Maciejewski, Jakub Schmiegel, and Kostiantyn Storozhuk ICPE'25 This paper addresses the first placement problem in systems with multiple tiers of memory (e.g., DRAM paired with HBM, or local DRAM paired with remote DRAM accessed over CXL). The paper cites plenty of prior work which dynamically migrates pages/allocations out of suboptimal memory tiers. What is different about this paper is that it attempts to avoid placing data in a suboptimal tier in the first place. The key insight is: statistics from one allocation can be used to generate better placements for similar allocations which will occur in the future. Fig. 3 offers insight into how much waste there is in a policy which initially places all pages into a fast tier and then migrates them to a slower tier if they are accessed infrequently. The figure shows results from one migration policy, applied to three benchmarks. Source: https://dl.acm.org/doi/10.1145/3676151.3719378 Allocation Contexts This paper proposes gathering statistics for each allocation context . An allocation context is defined by the source code location of the allocation, the call stack at the moment of allocation, and the size of the allocation. If two allocations match on these attributes, then they are considered part of the same context. The system hooks heap allocation functions (e.g., , ) to track all outstanding allocations associated with each allocation context. The x86 PMU event is used to determine how frequently each allocation context is accessed. A tidbit I learned from this paper is that some x86 performance monitoring features do more than just count events. For example, randomly samples load operations and emits the accessed (virtual) address. Given the accessed address, it is straightforward to map back to the associated allocation context. The hotness of an allocation context is the frequency of these access events divided by the total size of all allocations in the context. Time is divided into epochs. During an epoch, the hotness of each allocation context is recalculated. When a new allocation occurs, the hotness of the allocation context (from the previous epoch) is used to determine which memory tier to place the allocation into. The paper only tracks large allocations (at least 64 bytes). For smaller allocations, the juice is not worth the squeeze. These allocations are assumed to be short-lived and frequently accessed. This paper also describes a kernel component which complements the user space policy described so far. Whereas the user space code deals with allocations, the kernel code deals with pages. This is useful for allocations which do not access all pages uniformly. It is also useful for detecting and correcting suboptimal initial placements. All PTEs associated with all allocations are continually scanned. The accessed bit determines if a page has been read since the last scan. The dirty bit determines if a page has been written since the last scan. After 10 scans, the system has a pretty good idea of how frequently a page is accessed. These statistics are used to migrate pages between fast and slow tiers. Fig. 8 shows execution time for three benchmarks. represents the user and kernel solutions described by this paper. Source: https://dl.acm.org/doi/10.1145/3676151.3719378 Dangling Pointers I wasn’t able to find details in the paper about how PTE scanning works without interfering with other parts of the OS. For example, doesn’t the OS use the dirty bit to determine if it needs to write pages back to disk? I assume the PTE scanning described in this paper must reset the dirty bit on each scan. The definition of an allocation context seems ripe for optimization. I suspect that allowing some variability in call stack or allocation size would allow for better statistics. Maybe this is a good use case for machine learning? Subscribe now

0 views
Dangling Pointers 2 months ago

Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters

Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters Kaiyang Zhao, Kaiwen Xue, Ziqi Wang, Dan Schatzberg, Leon Yang, Antonis Manousis, Johannes Weiner, Rik Van Riel, Bikash Sharma, Chunqiang Tang, and Dimitrios Skarlatos ISCA'23 This paper has a lot of great statistics from the Meta fleet. Memory capacity per server is growing over time, but TLB size is not, thus TLB coverage (the fraction of main memory that can be referenced by the TLB at any one time) is trending downward: Source: https://dl.acm.org/doi/10.1145/3579371.3589079 As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles! Source: https://dl.acm.org/doi/10.1145/3579371.3589079 A larger page size could increase TLB coverage, however large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world: We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page. The biggest culprit is unmovable (i.e., pinned) pages, for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault ( here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages. The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions. One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn’t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Hardware Page Migration The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving “unmovable” pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration. The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When the access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page. Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation. Fig. 10 has results for memory segmentation. “Partial” represents a case where physical memory is partially fragmented, whereas “full” represents full fragmentation. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Dangling Pointers If the NIC is the primary culprit, some vertical integration might be called for here. For example, allocations used to send packets cycle through three states: CPU writing to it NIC reading from it It seems like it would be fine for the OS to migrate a packet when it is not in state #3. Subscribe now Source: https://dl.acm.org/doi/10.1145/3579371.3589079 As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles! Source: https://dl.acm.org/doi/10.1145/3579371.3589079 A larger page size could increase TLB coverage, however large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world: We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page. The biggest culprit is unmovable (i.e., pinned) pages, for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault ( here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages. Segmentation The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions. One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn’t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Hardware Page Migration The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving “unmovable” pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration. The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When the access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page. Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation. Results Fig. 10 has results for memory segmentation. “Partial” represents a case where physical memory is partially fragmented, whereas “full” represents full fragmentation. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Dangling Pointers If the NIC is the primary culprit, some vertical integration might be called for here. For example, allocations used to send packets cycle through three states: Empty CPU writing to it NIC reading from it

0 views