Latest Posts (20 found)

MagiCache: A Virtual In-Cache Computing Engine

MagiCache: A Virtual In-Cache Computing Engine Renhao Fan, Yikai Cui, Weike Li, Mingyu Wang, and Zhaolin Li ISCA'25 This paper presents an implementation of the RISC-V vector extensions where all vector computation occurs in the cache (i.e., SRAM-based in-memory computation). It contains an accessible description of in-SRAM computation, plus some novel extensions. Recall that SRAM is organized as a 2D array of bits. Each row represents a word, and each column represents a single bit position across many words. A traditional read operation occurs by activating a single row. Analog values are read out from each bit and placed onto shared bit lines. There are two bit lines per column (one holding the value, one holding the complement). Values flow down to sense amplifiers that output digital values. Prior work has shown that this basic structure can be augmented to perform computation. Rather than activating a single row, two rows are activated simultaneously (let’s call the values of these rows A and B). The shared bit lines perform computation in the analog domain, which results in two expressions appearing on the outputs of the sense amplifiers: (A AND B) and (A NOR B). Fig. 1(a) shows a diagram of such an SRAM array: Source: https://dl.acm.org/doi/10.1145/3695053.3731113 If you slap some digital logic at the end of the sense amplifiers, then you can generate other functions like OR, XOR, XNOR, NAND, shift, and add. Shift and add involve horizontal connections. Fig. 4(c) shows a hardware diagram of this additional logic at the end of the sense amplifiers. Note that the resulting value can be written back into the SRAM array for future use. Multiplication is not directly supported but can be implemented with a sequence of shift and add operations. Source: https://dl.acm.org/doi/10.1145/3695053.3731113 Virtual Engine The innovation in this paper is to dynamically share a fixed amount of on-chip SRAM between two separate purposes: caching and a vector register file.
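Before going further, the bitline trick described earlier (activating two rows yields AND and NOR at the sense amplifiers; peripheral logic derives the rest) can be sketched functionally. This is a minimal behavioral model of mine, not the paper's circuit:

```cpp
#include <cstdint>

// Activating rows A and B places (A AND B) and (A NOR B) on the
// sense-amplifier outputs, one bit per column.
struct SenseAmpOut {
    uint32_t and_ab; // A AND B
    uint32_t nor_ab; // A NOR B
};

SenseAmpOut activate_two_rows(uint32_t a, uint32_t b) {
    return { a & b, ~(a | b) };
}

// Peripheral digital logic after the sense amplifiers derives the
// remaining bitwise functions from just these two outputs.
uint32_t f_or  (SenseAmpOut s) { return ~s.nor_ab; }             // OR  = NOT(NOR)
uint32_t f_nand(SenseAmpOut s) { return ~s.and_ab; }             // NAND = NOT(AND)
uint32_t f_xor (SenseAmpOut s) { return ~s.nor_ab & ~s.and_ab; } // XOR = OR AND NAND
uint32_t f_xnor(SenseAmpOut s) { return ~f_xor(s); }             // XNOR = NOT(XOR)
```

Shift and add need the horizontal connections mentioned above, so they are not derivable from one column's sense-amplifier outputs alone.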
The logical vector register file capacity required for a particular algorithm depends on the number of architectural registers used and the width of each architectural register (the RISC-V vector extensions allow software to configure a logical vector width). Note that this hardware does not have separate vector ALUs; the computation is performed directly in the SRAM arrays. Fig. 6 illustrates how the hardware dynamically allocates SRAM space between generic cache storage and vector registers (with in-memory compute). The unit of allocation is a segment. The width of a vector register determines how many segments it requires. Source: https://dl.acm.org/doi/10.1145/3695053.3731113 Initially, all SRAM space is dedicated to caching. When the hardware processes an instruction that writes to an uninitialized vector register, the hardware allocates segments to hold data for that register (evicting cached data if necessary). This system assumes an enlightened compiler, which emits an instruction hinting to the hardware when it has reached a point in the instruction stream where no vector register has valid content. The hardware can use this hint to reallocate all memory back to caching. Fig. 8 shows performance results normalized against prior work. This shows a 20%-60% performance improvement, which is pretty good considering that the baseline offers an order-of-magnitude improvement over a standard in-order vector processor. Source: https://dl.acm.org/doi/10.1145/3695053.3731113 Dangling Pointers I wonder how this would compare to hardware that did not have a cache, but rather a scratchpad with support for in-memory computing.
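The allocation policy described above can be sketched as follows. This is my simplification (the segment size, the eviction policy, and the names are assumptions, not the paper's exact design):

```cpp
#include <vector>

// All segments start as cache. The first write to a vector register claims
// ceil(reg_bits / SEG_BITS) segments, evicting cached data; the compiler's
// "no live vector registers" hint returns everything to the cache.
constexpr int SEG_BITS = 4096; // assumed segment size

enum class Use { Cache, VectorReg };

struct SegmentPool {
    std::vector<Use> seg;
    explicit SegmentPool(int n) : seg(n, Use::Cache) {}

    // Claim segments for a register of reg_bits bits.
    // Returns the number of segments claimed, or -1 if the pool is too small.
    int alloc_register(int reg_bits) {
        int need = (reg_bits + SEG_BITS - 1) / SEG_BITS;
        int got = 0;
        for (auto& s : seg) {
            if (got == need) break;
            if (s == Use::Cache) { s = Use::VectorReg; ++got; } // evict + claim
        }
        return got == need ? need : -1;
    }

    // Compiler hint: no vector register holds valid content anymore.
    void all_registers_dead() {
        for (auto& s : seg) s = Use::Cache;
    }
};
```

The interesting part of the real design is that "evict + claim" is not free: cached data may be dirty, which is presumably why the hint instruction (letting the hardware reclaim everything at once) matters.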


FlexGuard: Fast Mutual Exclusion Independent of Subscription

FlexGuard: Fast Mutual Exclusion Independent of Subscription Victor Laforet, Sanidhya Kashyap, Călin Iorgulescu, Julia Lawall, and Jean-Pierre Lozi SOSP'25 This paper presents an interesting use of eBPF to effectively add an OS feature: coordination between user space locking code and the kernel thread scheduler to improve locking performance. The paper describes most lock implementations as spin-then-park locks (e.g., busy wait in user space for some time, then give up and call the OS to block the waiting thread). A big problem with busy waiting is the performance cliff under oversubscription . Oversubscription occurs when there are more active threads than cores. In this case, busy waiting can be harmful, because it wastes CPU cycles when there is other useful work to do. The worst case occurs when a thread acquires a lock and then is preempted by the OS scheduler while many other threads are busy waiting. If the OS thread scheduler were smart, it would preempt one of the busy waiters and let the lock holder keep running. But alas, that level of coordination isn’t available … until now. In the good old days, researchers would have modified Linux scheduling code and tested their modified kernel. The modern (easier) way to achieve this is to use eBPF. The authors wrote an eBPF program that runs (in kernel space) each time a context switch occurs. This program is called the Preemption Monitor . The Preemption Monitor works in conjunction with a custom user space lock implementation. The net result is that the Preemption Monitor can reliably detect when the OS scheduler preempts a thread that is holding a lock. When this occurs the eBPF program writes information to a variable that user space code can read. The locking algorithm is as follows: First, try to acquire the lock with a simple atomic compare-and-swap. If that fails, then busy wait. 
Similar to Hapax locks, this busy waiting avoids contention on one cache line by forcing all threads to agree on the order in which they will acquire the lock and letting each thread spin on per-thread variables. During busy waiting, the variable written by the Preemption Monitor is checked. If this variable indicates that there currently exists a thread which has acquired a lock and has been preempted by the OS, then threads stop busy waiting and instead call the OS to block until the lock is released (using the same system call that a futex would use). Fig. 2 has performance results. The x-axis shows thread count (which varies over time). The green line is FlexGuard. The idea is that it gives great performance when there is no oversubscription (i.e., fewer than 150 threads) and offers performance similar to a purely blocking lock (the dark blue line) when there is oversubscription. Source: https://dl.acm.org/doi/10.1145/3731569.3764852 Dangling Pointers This problem seems ripe for overengineering. In some sick world, the compiler, OS, and hardware could all coordinate to support a “true critical section”. All pages accessed inside this critical section would be pinned into main memory (or even closer to the CPU), and the OS would try extremely hard not to preempt threads inside the critical section. This would require some upper bound on the critical section’s working set and running time.
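The locking algorithm above can be sketched in a few lines. This is a deliberately simplified model of mine: a plain test-and-set spin stands in for the paper's queue-based per-thread spin, and the futex_wait fallback is modeled as a flag rather than a system call:

```cpp
#include <atomic>

// Written by the in-kernel Preemption Monitor (here just a plain flag that a
// test can set): "some lock holder has been preempted".
std::atomic<bool> holder_preempted{false};

struct FlexGuardLock {
    std::atomic<int> state{0};     // 0 = free, 1 = held
    bool blocked_path_used = false;

    void lock() {
        int expected = 0;
        // Fast path: a single compare-and-swap.
        if (state.compare_exchange_strong(expected, 1)) return;
        // Slow path: busy wait, but watch the Preemption Monitor's flag.
        for (;;) {
            if (holder_preempted.load()) {
                blocked_path_used = true; // real code: futex_wait() here
            }
            expected = 0;
            if (state.compare_exchange_strong(expected, 1)) return;
        }
    }

    void unlock() { state.store(0); }
};
```

The key property is that the decision to stop spinning is driven by actual scheduler state (published by the eBPF program) rather than by a fixed spin budget.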


RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations

RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations Hongrui Zhang, Yunan Zhang, and Hung-Wei Tseng ISCA'25 I recall a couple of decades ago when Pat Hanrahan said something like “all hardware wants to be programmable”. You can find a similar sentiment here: With most SGI machines, if you opened one up and looked at what was actually in there—processing vertexes in particular, but for some machines, processing the fragments—it was a programmable engine. It’s just that it was not programmable by you; it was programmable by me. And now, twenty years later, GPU companies have bucked the programmability trend and added dedicated ray tracing hardware to their chips. Little did they know, users would find a way to utilize this hardware for applications that have nothing to do with graphics. The task at hand is multiplying two (very) sparse matrices (A and B). Each matrix can be partitioned into a 2D grid, where most cells in the grid contain all 0’s. Cells in A with non-zero entries must be multiplied by specific cells in B with non-zero entries (using a dense matrix multiplication for each product of two cells). The core idea is elegantly simple, and is illustrated in Fig. 5: Source: https://dl.acm.org/doi/full/10.1145/3695053.3731072 The steps are: Build a ray tracing acceleration structure corresponding to the non-zero cells in B. For each non-zero cell in A, trace a ray to determine if there are any non-zero cells in B that need to be multiplied by the current cell in A. In fig. 5 the coordinates of the non-zero cells in matrix A are: [(2, 1) (2, 3) (3, 3) (7, 1)]. The figure shows rays overlaid on top of the result matrix, but I find it easier to think of the rays as traced through matrix B. The ray corresponding to the cell in A at (2, 1) has a column index of 1, so the algorithm traces a ray horizontally through B at row 1. The ray tracing hardware will find that this ray intersects with the cell from B at coordinate (1, 4).
So, these cells are multiplied together to determine their contribution to the result. Fig. 7 has benchmark results. All results are normalized to the performance of the baseline library (i.e., values greater than one represent a speedup); the baseline is the Intel MKL library running on a Core i7 14700K processor. The “w/o RT cores” bars show results from the same algorithm with ray tracing implemented in general CUDA code rather than using the ray tracing accelerators. It is amazing that this beats the CPU baseline across the board. Source: https://dl.acm.org/doi/full/10.1145/3695053.3731072 Dangling Pointers It seems like the core problem to be solved here is pointer-chasing. I wonder if a more general-purpose processor that is located closer to off-chip memory could provide similar benefits.
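The cell-matching step can be written down directly, without RT cores, as a plain search over B's non-zero cells. This CPU sketch (names and layout mine) is exactly the scan that the ray tracing hardware accelerates:

```cpp
#include <utility>
#include <vector>

using Cell = std::pair<int, int>; // (row, col) of a non-zero grid cell

// One "ray" per non-zero cell (r, c) of A, traced horizontally through
// row c of B. Every hit is a pair of cells whose dense product
// contributes to the result matrix.
std::vector<std::pair<Cell, Cell>>
find_cell_products(const std::vector<Cell>& a_cells,
                   const std::vector<Cell>& b_cells) {
    std::vector<std::pair<Cell, Cell>> work;
    for (Cell a : a_cells)
        for (Cell b : b_cells)           // the part RT hardware accelerates
            if (b.first == a.second)     // ray along row a.col hits cell b
                work.push_back({a, b});
    return work;
}
```

With the paper's fig. 5 example, the ray for A's cell (2, 1) and the ray for (7, 1) both hit B's cell (1, 4); the cells at columns 3 produce no hits.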


Dissecting and Modeling the Architecture of Modern GPU Cores

Dissecting and Modeling the Architecture of Modern GPU Cores Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, and Antonio Gonzalez MICRO'25 The purpose of this paper is to understand the microarchitecture of recent NVIDIA GPUs, to be able to update architectural simulators that are used for research purposes. The authors uncovered lots of interesting tidbits. Take this information with a grain of salt; it is derived from careful experimentation rather than NVIDIA documentation. The paper uses the term sub-core to represent the hardware module which can execute warp-wide instructions. Each SM comprises four sub-cores. Fig. 3 illustrates the components within a sub-core and shows how 4 sub-cores share instruction and data caches: Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Instruction Issue The responsibility of resolving inter-instruction hazards (within a given warp) is split between the compiler and the hardware. There are two mechanisms the compiler can use to inform the hardware how it should avoid hazards: The instruction encoding allows any instruction to set the value of a per-warp stall counter. When the hardware issues such an instruction, it sets the stall counter to the specified value. On each clock cycle thereafter, the counter is decremented by one. The hardware will not issue more instructions for the warp until the counter reaches zero. This is useful for handling hazards with a fixed latency. Variable-latency hazards are resolved with dependence counters . The hardware tracks the value of six dependence counters per warp. The instruction encoding allows the compiler to specify up to two counters which should be incremented when an instruction is issued. One of these counters is decremented when the instruction writes to the register file, and the other is decremented when the instruction reads from the register file (to resolve WAR hazards). 
Additionally, the compiler can specify that a given instruction cannot issue until the values of specific dependence counters are zero. In fig. 2, the values of these counters are checked in one pipeline block, and the counters are incremented in another. The warp scheduler prefers to pick a warp and stick with it (i.e., it is not a round-robin scheduler). If the current warp cannot be scheduled (e.g., the stall counter is greater than zero, or there was a cache miss), then the scheduler switches to another warp. The warp scheduler issues instructions in program order (within a warp); there is no out-of-order execution support. The register file has a limited number of ports, and instructions must be controlled to avoid attempting too many reads or writes in parallel. Register file port contention is not handled by the warp scheduler; instead it is handled further down the pipe. For example, a later stage in fig. 2 will stall fixed-latency instructions until register file read ports are available. The register file cache (RFC) is a hardware component that reduces contention on the register file read ports. The RFC has storage for 6 vectors (and tags). The compiler can mark a source operand of an instruction such that the hardware will store the source operand in the cache for a subsequent operation to use. Note that the RFC does not store values per warp and is only useful for caching data within one warp. This plays nicely with the “pick a warp and stick with it” scheduling policy. Listing 4 has some example code sequences demonstrating how the compiler can direct the operation of the RFC: Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Memory Access Most of the resources that are shared between sub-cores are shared for efficiency reasons. A single sub-core will not generate memory requests at a high throughput, and there is locality of reference between the memory accesses of multiple sub-cores. The shared memory block in fig. 3 is shared in order to properly support thread-group shared memory (as a thread group is spread across all sub-cores in an SM). The shared memory access modules can handle one request every two cycles. That means that if all 4 sub-cores are contending on memory, each one can make a request every 8 cycles. There is a FIFO of depth ~4 between each sub-core and the shared memory structures. Typical read-after-write latency in shared memory is between 20-40 cycles. The authors built a simulation model based on their experiments. Mean absolute percentage error (MAPE) is one metric for measuring how closely a simulation model matches real hardware. Table 4 shows that the model derived from the findings in this paper is a better performance model for recent NVIDIA GPUs than the baseline simulator: Source: https://dl.acm.org/doi/10.1145/3725843.3756041
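The stall-counter and dependence-counter issue rules described above can be modeled with a toy per-warp scoreboard. This is my reconstruction of the mechanism as the paper describes it, not NVIDIA's implementation:

```cpp
#include <array>

// Per-warp issue state: one stall counter for fixed-latency hazards and six
// dependence counters for variable-latency hazards.
struct WarpState {
    int stall = 0;            // set by the instruction encoding, -1 per cycle
    std::array<int, 6> dep{}; // six dependence counters

    // An instruction may carry a mask of counters that must be zero
    // before it is allowed to issue.
    bool can_issue(unsigned wait_mask) const {
        if (stall > 0) return false;
        for (int i = 0; i < 6; ++i)
            if (((wait_mask >> i) & 1u) != 0 && dep[i] != 0) return false;
        return true;
    }

    // On issue: set the stall counter and increment up to two dependence
    // counters named by the encoding (-1 means "none").
    void issue(int set_stall, int inc_a, int inc_b) {
        stall = set_stall;
        if (inc_a >= 0) ++dep[inc_a];
        if (inc_b >= 0) ++dep[inc_b];
    }

    void tick() { if (stall > 0) --stall; }

    // Decremented when the in-flight instruction reads or writes the
    // register file (the WAR vs. RAW distinction from the text).
    void writeback(int ctr) { --dep[ctr]; }
};
```

A long-latency load would increment a counter at issue and decrement it at register-file write; a dependent instruction carries that counter in its wait mask.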


Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture

Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture Rohan Juneja, Pranav Dangi, Thilini Kaushalya Bandara, Tulika Mitra, and Li-Shiuan Peh MICRO'25 This paper presents an implementation of the Active Message (AM) architecture, as an alternative to FPGA/CGRA architectures. AM architectures have been studied for a while; this was my first exposure. An accelerator implemented on an FPGA or CGRA typically uses a spatial computing paradigm. Each “instruction” in the algorithm is pinned to a physical location on the chip, and data flows between the instructions. I prefer to think of the data in motion as the local variables associated with threads that also move (using a specialized memory consistency model). The active message architecture flips that script around. Data structures are pinned, while instructions move to the relevant data. Fig. 5 shows two processing elements (PEs), each of which contains two active messages (AMs). An active message looks a lot like an instruction: it contains an opcode, source operands, and a result operand. Throughout the computation, AMs move between PEs. PEs have a local ALU and local memory. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 The AM at the top of the figure has a Load opcode and an operand Op1. Here, Op1 is an operand that is being carried around for future use. The AM with a Load opcode will make its way through the chip until it arrives at the PE which contains the data to be loaded. At this point, the load operation will execute, and a new AM will be created. In the figure above, the new AM is the one at the bottom of PE0. It has Op1, Op2, and a Mul opcode. Op1 is forwarded unchanged from the predecessor AM. The value of Op2 is the value of the data loaded from memory. The new opcode was obtained from the config memory, which contains a description of the program that is being executed. The next step to be performed is to multiply Op1 by Op2.
One might expect PE0 to perform the multiplication, but in the figure above the AM is routed to PE1, which performs the multiplication. A reason why you would want to do this is a situation where there are many AMs queued to access the data memory associated with PE0, but few AMs queued to access the data memory associated with PE1. In this situation, it is better to let PE0 perform loads for other AMs (because PE0 is the only PE that can fulfill that task) and find a PE that is currently idle to perform the multiplication (any PE can perform the multiplication). Now the question you should be asking is: what real-world applications exhibit load imbalances between PEs like this? If a data structure were split evenly between all PEs, you would think that load would be spread nicely across the PEs. The answer is: irregular workloads like sparse matrix-vector multiplication. Fig. 6 shows how a source matrix, source vector, and result vector could be partitioned across 4 PEs. You can imagine how the sparsity of the tensors being operated on would cause load imbalance between the PEs. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 Fig. 11 compares the Nexus Machine against other architectures (each design has the same number of ALUs). Fig. 12 shows performance-per-watt. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 Dangling Pointers I imagine that AM architectures work best for algorithms that are insensitive to the order in which AMs are executed. That would be the case for matrix/vector multiplication (assuming addition is associative). It seems like there is a large design space here related to PE capabilities. Data structures could be replicated across PEs to enable memory-access AMs to be serviced by multiple PEs, or the ALUs inside each PE could be heterogeneous (e.g., some PEs can do division, others cannot).
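The Load-then-spawn step from fig. 5 can be sketched as follows. This is my toy rendering of one hop of the flow (struct layout, field names, and the config-table indexing are assumptions):

```cpp
#include <cassert>
#include <map>

// An active message looks a lot like an instruction: opcode, operands,
// and (for Load) the address of the pinned data it is traveling toward.
enum class Op { Load, Mul };

struct AM {
    Op  op;
    int op1;   // operand carried along for future use
    int op2;   // filled in by the load
    int addr;  // target address (Load only)
};

struct PE {
    std::map<int, int> data_mem; // addresses this PE owns
    std::map<int, Op>  config;   // config memory: next opcode per program step

    // Execute a Load AM that has arrived at its home PE and spawn the
    // successor AM: Op1 forwarded, Op2 = loaded value, opcode from config.
    AM execute_load(const AM& am, int step) {
        assert(am.op == Op::Load && data_mem.count(am.addr));
        return { config.at(step), am.op1, data_mem.at(am.addr), /*addr=*/-1 };
    }
};
```

The successor AM (here a Mul) is then free to execute on any PE, which is what enables the load balancing discussed next.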


TiNA: Tiered Network Buffer Architecture for Fast Networking in Chiplet-based CPUs

TiNA: Tiered Network Buffer Architecture for Fast Networking in Chiplet-based CPUs Siddharth Agarwal, Tianchen Wang, Jinghan Huang, Saksham Agarwal, and Nam Sung Kim ASPLOS'26 Here we go again: another paper in a top-tier conference on the classic CS problem of how to DMA received packets from the NIC to the host. It would be interesting to understand why this is such a hot topic these days. This paper deals with the case where the host CPU comprises multiple chiplets. If you get nothing else from this, I hope you will learn something about SNC mode (I had not heard of it before). Recent Intel CPUs can be placed into Sub-NUMA Clustering mode (via a BIOS setting). This causes each chiplet to appear as a separate NUMA node. It is as if a single-socket CPU were transformed into a 4-socket CPU. The DRAM memory space is divided into four regions (one per chiplet), and the LLC slices within a chiplet only cache data from one memory region. This can be advantageous for some applications, because it can lower average LLC and DRAM access latency (by avoiding inter-chiplet communication). The downside is that the peak LLC capacity available to a single core is reduced. Fig. 3 illustrates these tradeoffs: Source: https://dl.acm.org/doi/10.1145/3760250.3762224 SNC and DDIO Recall that DDIO is a feature of Intel CPUs that allows a NIC to write received packets directly into the LLC, which the host CPU can then read. PCIe lanes are distributed among chiplets. This means that the NIC is directly connected to one chiplet. One way to support DDIO with SNC is to allocate buffers for received packets in the memory region associated with the chiplet that the NIC is connected to. This improves LLC bandwidth (for both the NIC and CPU cores) but decreases the LLC capacity available for network packets. In practice, this means that longer bursts of network packets degrade performance more when SNC is enabled (i.e., leaky DMA is a larger problem in SNC mode). Fig. 6 has data from a microbenchmark to back this up: Source: https://dl.acm.org/doi/10.1145/3760250.3762224 TiNA The solution proposed by this paper requires a change to the NIC/driver interface. Each ring buffer of received network packets is replaced by N ring buffers (where N is the number of chiplets). Ring buffer i is placed in the memory region associated with chiplet i. The NIC knows about all of these ring buffers and dynamically decides which one to use. The NIC prefers to use the ring buffer associated with the chiplet that it is directly connected to. However, if a burst of traffic causes high utilization of the LLC capacity of that chiplet, then the NIC falls back to using the other ring buffers. The NIC estimates LLC utilization based on two competing rates: the rate at which received network packets are produced by the NIC, and the rate at which received network packets are consumed by the host. The first rate is easy for the NIC to compute, as it knows how fast it is sending bytes to the host. The second rate is computed by networking software running on the host and periodically sent to the NIC. The overall approach reminds me of CEIO. The key difference is the set of memory segments available: CEIO uses NIC-local DRAM as the fallback path. One complication of splitting a single ring buffer into multiple is ensuring that the host processes received packets in order. This paper proposes using sequence numbers associated with each packet. Most protocols already use per-packet sequence numbers. For other protocols (e.g., UDP), the NIC adds a sequence number based on the order in which packets were received. When the host reads a packet from a logical ring buffer, it examines the sequence numbers of the packets at the head of each of the physical ring buffers and chooses the packet with the lowest sequence number. Fig. 9 has benchmark results: lower latency than SNC and non-SNC across a range of microbenchmarks.
Source: https://dl.acm.org/doi/10.1145/3760250.3762224 Dangling Pointers It would be nice if SNC allowed more fine-grained configuration. For example, there may be applications where ideal performance is achieved if each CPU core only has access to the L3 slice that is directly connected to it.
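The in-order delivery scheme across per-chiplet ring buffers can be sketched as a sequence-number merge. This is my illustration of the idea, not TiNA's driver code:

```cpp
#include <cassert>
#include <deque>
#include <vector>

// The NIC stamps each received packet with a sequence number (taken from the
// protocol, or assigned in arrival order for protocols like UDP).
struct Packet { unsigned seq; };

// The host's view of the logical ring buffer: always pop the physical ring
// whose head packet carries the lowest sequence number.
unsigned pop_next_in_order(std::vector<std::deque<Packet>>& rings) {
    int best = -1;
    for (int i = 0; i < (int)rings.size(); ++i)
        if (!rings[i].empty() &&
            (best < 0 || rings[i].front().seq < rings[best].front().seq))
            best = i;
    assert(best >= 0); // caller must ensure at least one packet is pending
    unsigned seq = rings[best].front().seq;
    rings[best].pop_front();
    return seq;
}
```

Because the NIC writes each ring in order, comparing only the heads is enough to reconstruct the global arrival order.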

Dangling Pointers 1 month ago

Binary Compatible Critical Section Delegation

Binary Compatible Critical Section Delegation Junyao Zhang, Zhuo Wang, and Zhe Zhou PPoPP'26 The futex design works great when contention is low but leaves much to be desired when contention is high. I generally think that algorithms should be crafted to avoid high lock contention, but this paper offers a contrarian approach that improves performance without code changes. Acquiring a futex involves atomic operations on the cache lines that contain the futex state. In the case of high contention, these cache lines violently bounce between cores. Also, user space code will eventually give up trying to acquire a lock the easy way and will call into the kernel, which has its own synchronization to protect the shared data structures that manage the queue of threads waiting to acquire the lock. The problems don’t end when a lock is finally acquired. A typical futex guards some specific application data. The cache lines containing that data will also uncomfortably bounce between cores. The idea behind delegation is to replace the queue of pending threads with a queue of pending operations. An operation comprises the code that will be executed under the lock and the associated data. In the uncontended case, the operation’s function executes directly. In the contended case, the function is placed into a queue to be executed later. After any thread finishes executing such a function, that thread checks the queue. If the queue is not empty, that thread goes ahead and executes all of the functions contained in the queue. Suppose the data guarded by the lock is a single shared structure. If a particular thread calls 10 functions from the queue, then that structure remains local to the core that thread is running on, and the system avoids moving it between cores 10 times. The magic of this paper is that it shows how to change the OS kernel to automatically implement delegation for any application that uses futexes.
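Delegation can be sketched in C++ along these lines. This is a minimal, lock-based illustration of the concept only; the names (run_or_delegate, pending) are mine, and the paper's actual mechanism is in-kernel, binary-translated, and far more careful about concurrency:

```cpp
#include <functional>
#include <mutex>
#include <queue>

struct DelegationLock {
    std::mutex m;                              // stands in for the futex
    std::queue<std::function<void()>> pending; // queued critical sections
    std::mutex qm;                             // protects `pending`

    // Instead of queueing *threads*, queue the *operations* they wanted
    // to run under the lock.
    void run_or_delegate(std::function<void()> fn) {
        if (m.try_lock()) {       // uncontended: run directly
            fn();
            drain();              // then help by draining delegated ops
            m.unlock();
        } else {                  // contended: delegate and return
            std::lock_guard<std::mutex> g(qm);
            pending.push(std::move(fn));
        }
    }

    // Executed by the lock holder: all delegated operations run on this
    // core, so the guarded data stays in this core's cache.
    void drain() {
        for (;;) {
            std::function<void()> fn;
            {
                std::lock_guard<std::mutex> g(qm);
                if (pending.empty()) return;
                fn = std::move(pending.front());
                pending.pop();
            }
            fn();
        }
    }
};
```

The point of the sketch is the data-locality argument: whichever thread drains the queue executes many critical sections back to back on one core.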
When the futex code gives up trying to acquire a futex in user space, it calls the OS to wait on the futex. The implementation of this system call is changed to implement automatic delegation. Automatic delegation can fail (as illustrated by Fig. 2), in which case the traditional futex waiting algorithm is used. Source: https://dl.acm.org/doi/10.1145/3774934.3786439 This paper makes heavy use of the Userspace Bypass library (a.k.a. UB; paper here). This library allows the kernel to safely execute user-mode code. It was originally designed to optimize syscall-heavy applications by allowing the kernel to execute the small tidbits of user space code in between system calls. UB uses binary translation to translate instructions that were meant to run in user space into instructions that can securely be executed by the kernel. Binary compatible critical section delegation uses UB to translate the code inside the critical section (i.e., the code between the futex lock and unlock calls) into code that can be safely executed by the kernel. A pointer to this translated code is placed into a queue of delegated calls (the vw queue). The set of threads which are trying to acquire a lock cooperatively execute the functions in the vw queue. At any one time, at most one thread is elected to be the delegate thread. It drains the vw queue by executing (in kernel space) all the delegated functions in the queue. This works great in cases where the code inside the critical section accesses a lot of shared state, because that shared state can happily reside in the cache of the core that is running the delegate thread, rather than bouncing between cores. The paper has impressive results from microbenchmarks, but I think real applications are more relevant. Table 2 shows performance results for a few applications and a few locking strategies. BCD is the work in this paper. TCS and TCB are prior work with the drawback of not being compatible with existing binaries.
Source: https://dl.acm.org/doi/10.1145/3774934.3786439 Dangling Pointers There is a hint here at another advantage of pipeline parallelism over data parallelism: allowing persistent data structures to remain local to a core.


Radshield: Software Radiation Protection for Commodity Hardware in Space

Radshield: Software Radiation Protection for Commodity Hardware in Space Haoda Wang, Steven Myint, Vandi Verma, Yonatan Winetraub, Junfeng Yang, and Asaf Cidon ASPLOS'25 If you read no further, here are two interesting factoids about outer space from this paper: Launch costs have fallen 60x, with the current cost to launch 1kg to space clocking in at $1,400 (see Fig. 1 below). Many satellites orbiting the Earth and devices sent to Mars use Snapdragon CPUs! I assumed that all chips leaving planet Earth would be specialized for space; apparently not. Source: https://dl.acm.org/doi/10.1145/3760250.3762218 This paper describes software solutions to deal with two common problems that occur in outer space: Single-Event Latchups and Single-Event Upsets, both of which are caused by radiation interfering with the normal operation of a circuit. Single-Event Latchups A single-event latchup (SEL) causes one portion of the chip to heat up. If left unmitigated, this can damage the chip. The solution to this is to detect the problem and reboot. The trick is in the detection. The classic detection method monitors chip current draw. However, this technique fails with a modern off-the-shelf CPU which is designed to have a wide variability in current draw. When compute load increases, clock frequencies and voltages change, cores come out of sleep states, and power consumption naturally increases. The point of this design is to save power during idle periods, which is especially important for satellites which must get their power from the sun. The solution proposed by this paper is called ILD. The idea is to predict the expected current draw based on a simple model that uses CPU performance counters (e.g., cache hit rate, instruction execution rate) as input. If the measured current draw is much larger than predicted, then the system is rebooted. The model is not perfect, and the authors noticed that this scheme only works well when the CPU load is not too high.
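The ILD check can be sketched as a predicted-versus-measured comparison. This is my reconstruction of the shape of the idea; the model form, the coefficients, and the margin are entirely made up for illustration:

```cpp
// Hypothetical inputs: a couple of CPU performance counters.
struct PerfCounters {
    double inst_per_cycle;  // instruction execution rate
    double cache_hit_rate;  // 0.0 - 1.0
};

// Hypothetical fitted linear model: baseline draw plus activity terms.
// (Coefficients here are invented, not from the paper.)
double predict_current_ma(const PerfCounters& c) {
    return 200.0 + 400.0 * c.inst_per_cycle + 150.0 * (1.0 - c.cache_hit_rate);
}

// Flag a suspected single-event latchup when the measured draw exceeds the
// prediction by a margin; the mitigation is then a reboot.
bool latchup_suspected(const PerfCounters& c, double measured_ma,
                       double margin_ma = 100.0) {
    return measured_ma > predict_current_ma(c) + margin_ma;
}
```

The hard part in practice is exactly what the authors observed: at high CPU load the prediction error grows, which is why the check is restricted to calm periods.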
This “predict, check, reboot if necessary” cycle only occurs during relatively calm periods of time. The system is modified to force 3-second idle periods every 3 minutes to ensure that reliable measurements can be taken. An SEL takes about 5 minutes to damage the chip; the 3-minute period is chosen to be below that threshold. A single-event upset (SEU) causes the value of a bit to flip (in memory, cache, the register file, etc.). There are two common solutions to SEUs: Use ECC on stored data. Perform computations with triple modular redundancy (3-MR), which requires computing each result 3 times and choosing the most popular result if there is disagreement. This paper deals with mitigating SEUs that affect user-space code. The authors define the term reliability frontier to represent the interface between hardware components that support ECC and those that do not. For example, if flash storage has ECC but DRAM does not, then flash is considered part of the reliability frontier. A typical smartphone-class CPU, like those used on advanced satellites, has multiple CPU cores. One way to alleviate the compute cost of 3-MR is to compute all 3 results on 3 separate cores in parallel. A problem with this approach is that the CPU cores may share unreliable hardware. For example, the last-level cache could be shared by all cores but not support ECC. If a bit flips in the LLC, then all cores will see the corrupted value, and parallel 3-MR will not detect a problem. The paper proposes an algorithm called EMR. The idea is to break a computation into multiple tasks and associate metadata with each task that describes the subset of input data accessed by the task. Fig. 6 shows a motivating example. The job of analyzing an image may be decomposed into many tasks, where each task processes a subset of the input image. Source: https://dl.acm.org/doi/10.1145/3760250.3762218 In EMR, there is an API to explicitly create tasks and specify the set of input data that each task reads from.
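A hypothetical shape of such a task-creation API (the names `emr_create_task` and `Task`, and the use of integer block ids for input data, are my assumptions, not the paper's):

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <vector>

// Each task declares the set of input blocks it will read, so the
// runtime can later group non-overlapping tasks into epochs.
struct Task {
    std::set<int> inputs;      // ids of input blocks this task reads
    std::function<int()> run;  // the computation itself
};

std::vector<Task> g_tasks;  // tasks registered so far

void emr_create_task(std::set<int> inputs, std::function<int()> fn) {
    g_tasks.push_back({std::move(inputs), std::move(fn)});
}
```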
EMR then runs tasks in multiple epochs. Within an epoch, no two tasks read the same input data. EMR invalidates caches up to the reliability frontier between epochs. If there are many tasks and few epochs, then this system works great (i.e., it has high CPU utilization and does not spend too much time invalidating caches). Table 2 compares ILD performance in detecting SELs against a random forest model and a model that simply compares current draw against a fixed value: Source: https://dl.acm.org/doi/10.1145/3760250.3762218 Fig. 11 shows the performance impact of EMR. Each result is normalized against a parallel version of 3-MR which ignores the problems associated with shared hardware. The red bars represent 3-MR run on a single core; the blue bars represent EMR. Source: https://dl.acm.org/doi/10.1145/3760250.3762218 Dangling Pointers EMR would benefit from a system that detects when a programmer misspecifies the set of inputs that will be read. Maybe hardware or software support could be added to detect this kind of bug.

Dangling Pointers · 1 month ago

Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds

Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds I learned a lot about the LLC configuration and monitoring capabilities of modern CPUs from this paper; I bet you will too. The problem this paper addresses: how can we avoid performance variability in cloud applications due to cross-VM contention for the last-level cache (e.g., the L3 cache on a Xeon)? In a typical CPU, the L1 and L2 caches are private to a core, but the L3 is shared. In a cloud environment, the L3 is shared by multiple tenants and is an avenue for a “noisy neighbor” to annoy its neighbors. The work described by this paper builds upon Intel CMT and CAT. Cache Monitoring Technology allows the hypervisor to track how much of the L3 cache is occupied by each VM. Cache Allocation Technology allows the hypervisor to restrict a VM to only use a subset of the L3. CAT allows a VM to be assigned to a class of service (CLOS), which defines the set of L3 ways accessible to the VM (this page defines the term “ways” if you are unfamiliar). A typical CPU used by a cloud service provider has more CPU cores than L3 ways. If a cloud server hosts many small VMs, then L3 ways must be shared amongst VMs. The key problem solved by this paper is how to reduce performance variability given this constraint. Fig. 1 illustrates the assignments of CLOS levels to LLC ways advocated by this paper. Each row is a class of service, and each column is a way of the LLC. CLOS[0] can access all ways, CLOS[1] can access all LLC ways except for one, and so on down to CLOS[7], which can only access a single way. Source: https://dl.acm.org/doi/10.1145/3774934.3786415 The hypervisor uses Intel CMT to monitor how much of the LLC is occupied by each VM. Every 6 seconds, the hypervisor uses this information to change the CLOS that each VM is assigned to. The hypervisor computes a target LLC occupancy for each VM based on the number of cores assigned to the VM.
This target is compared against the measured LLC occupancy to classify each VM into one of three categories: Poor (the VM is starved for space), Adequate (the VM is using just the right amount of cache), or Excess (the VM is hogging too much). VMs in the poor category are de-suppressed (i.e., assigned to a CLOS with access to more LLC ways). Additionally, VMs in the excess category are suppressed (i.e., assigned to a CLOS with access to fewer ways), but this suppression only occurs when there are VMs in the poor category. This policy means that cache-hungry VMs can use more than their fair share of the L3 during periods of low server utilization. This can lead to higher mean performance, at the cost of a wider standard deviation. The paper describes a fourth category (overflow), which is only applied to VMs that wish to be held back even if there is plenty of L3 space available. These VMs are suppressed when they are found to be using too much L3, even if all other VMs on the system are getting enough cache space. Fig. 5 shows a case where this strategy works well compared to static allocation. The server in question is running 5 VMs, each running a different application: VM1 - 32 cores, VM2 - 16 cores (but doesn’t fully utilize those cores), VM3 - 8 cores, VM4 - 4 cores, VM5 - 4 cores. The top of figure 5 shows a simple static partitioning of LLC ways. VM1 is assigned to 6 ways, VM2 is assigned to 3 ways, VM3 is assigned to 2 ways, and VMs 4 and 5 must share 1 way. They have to share because way-based partitioning is inherently coarse-grained. The two charts show measured LLC utilization over 10 minutes. Notice the Y-axis. The technique described in this paper (Cacheman) allows VM4 and VM5 to use far more aggregate LLC capacity than the static partitioning. Also notice that in the static partitioning, VM5 always uses more LLC than VM4 (because they are running different applications), whereas Cacheman allows for a more even balance between them.
Source: https://dl.acm.org/doi/10.1145/3774934.3786415 Dangling Pointers While the L3 cache is logically a monolithic shared resource, it is physically partitioned across the chip (with a separate slice near each core). It seems like it could be more efficient if VMs could be assigned to nearby L3 slices rather than to L3 ways.
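The cascaded CLOS way masks from Fig. 1 and the poor/adequate/excess classification can be sketched as follows; the 20% classification band is an illustrative assumption, not the paper's exact threshold:

```cpp
#include <cassert>
#include <cstdint>

// Cascaded way masks: CLOS[0] may use all n ways, CLOS[1] all but one,
// ..., CLOS[k] uses n-k ways (the low ways, so every mask overlaps).
uint32_t clos_way_mask(int clos, int n_ways) {
    return (1u << (n_ways - clos)) - 1;  // low (n_ways - clos) ways set
}

enum class OccClass { Poor, Adequate, Excess };

// Classify a VM by comparing its measured LLC occupancy against the
// target derived from its core count.
OccClass classify(double measured_mb, double target_mb) {
    if (measured_mb < 0.8 * target_mb) return OccClass::Poor;
    if (measured_mb > 1.2 * target_mb) return OccClass::Excess;
    return OccClass::Adequate;
}
```

Moving a poor VM to a lower-numbered CLOS (more ways) de-suppresses it; moving an excess VM to a higher-numbered CLOS suppresses it.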


Hapax Locks: Scalable Value-Based Mutual Exclusion

Hapax Locks: Scalable Value-Based Mutual Exclusion Dave Dice and Alex Kogan PPoPP'26 This paper describes a locking algorithm intended for cases where spinning is acceptable (e.g., one-thread-per-core systems). It is similar to a ticket lock but generates less coherence traffic. Each lock/unlock operation causes a constant number of cache lines to move between cores, regardless of the number of cores involved or how long they spin for. As we’ve seen in a previous paper , polling a value in memory is cheap if the cache line is already local to the core doing the polling. A Hapax lock comprises two 64-bit fields: Arrive and Depart. Additionally, there is a global (shared among all Hapax locks) 64-bit sequence number. Each time a thread attempts to lock a Hapax lock, it generates a Hapax value which uniquely identifies the locking episode . A locking episode is a single lock/unlock sequence performed by a specific thread. A Hapax value is generated by atomically incrementing the sequence number. It is assumed that the 64-bit counter will never overflow. Next, the locking thread atomically exchanges the value of Arrive with the Hapax value it just generated. This exchange operation generates a total ordering among Hapax values. It is a way for threads to cooperatively decide the order in which they will acquire the lock. Say thread A generates Hapax value V_A and stores it into Arrive (via an atomic exchange operation). Next, thread B generates Hapax value V_B and atomically exchanges the value of Arrive with V_B. The result of the exchange operation performed by B will be V_A. At this point, thread B knows that it is directly behind thread A in the queue and must wait for thread A to release the lock. To finish acquiring the lock, threads continually poll Depart, waiting for it to equal the Hapax value of the preceding locking episode. In the example above, thread B polls Depart until it sees the value V_A. At this point, the lock has been acquired. Unlocking is implemented by storing the Hapax value used by the unlocking thread into Depart.
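Putting the basic protocol together, here is a sketch of an unoptimized Hapax lock (the field names `arrive`/`depart` follow the description above; the memory-ordering choices are mine, and the paper's batching and hashed-array optimizations are omitted):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// A global sequence number makes every locking episode's Hapax value
// unique; an atomic exchange on `arrive` orders the waiters; each
// waiter spins on `depart` until it holds the Hapax value of the
// episode directly ahead of it.
std::atomic<uint64_t> g_sequence{0};  // shared by all Hapax locks

struct HapaxLock {
    std::atomic<uint64_t> arrive{0};
    std::atomic<uint64_t> depart{0};

    // Returns this episode's Hapax value; pass it back to unlock().
    uint64_t lock() {
        uint64_t h = g_sequence.fetch_add(1) + 1;  // fresh, nonzero value
        uint64_t prev = arrive.exchange(h);        // queue up behind prev
        while (depart.load(std::memory_order_acquire) != prev) {
            // spin: the preceding episode has not released yet
        }
        return h;
    }

    void unlock(uint64_t h) {
        depart.store(h, std::memory_order_release);
    }
};
```

The initial zero in `arrive`/`depart` acts as the Hapax value of a fictitious already-completed episode, so the first locker acquires immediately.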
In the running example, thread A would unlock the lock by storing the value V_A into Depart. This algorithm generates a lot of coherence traffic. In particular, the cache line which holds the sequence number would move between cores each time a new Hapax value is generated. Also, each store to Depart would send coherence traffic to every core which had recently polled Depart. The paper has two techniques to address these issues. While the sequence number monotonically increases, the values stored in Arrive and Depart do not. There are two reasons for this. First, a single sequence number is shared among all Hapax locks. Second, multiple threads can generate Hapax values and then race to perform the atomic exchange operation. For example, thread A could generate Hapax value V_A just before thread B generates Hapax value V_B (so V_A < V_B). The two threads then race to atomically exchange their Hapax values into Arrive. If thread B wins the race, then Arrive will first take on the value V_B, and only later the smaller value V_A. Once you realize that the values of Arrive and Depart are not monotonically increasing, it is straightforward to see how the generation of Hapax values can be made cheap. A thread can hoard a batch of Hapax values with a single atomic add operation. For example, a thread could atomically increase the value of the sequence number by 1024. At this point, the thread has allocated 1024 Hapax values for itself that it can use in the future without accessing the cache line which holds the shared sequence number. The paper proposes allocating Hapax values in blocks of 64K. The paper proposes adding an additional array which serves a similar role as Depart. The number of elements in the array should be greater than the number of cores (the paper uses an array of 4096 values). Like the sequence number, this array is shared among all Hapax locks. When an unlocking thread writes its Hapax value into Depart, it also stores its Hapax value into one of the 4096 elements.
The array index is determined by the Hapax value. Many potential hash functions could be used. The paper proposes hashing bits [27:16] of the Hapax value. The 16 is related to the allocator block size. In the locking sequence, a thread loads the value of Depart once. If the value of Depart does not match the expected Hapax value, then the locking thread polls the appropriate element of the shared array. The thread polls this element until its value changes. If the new value is the expected Hapax value, then the lock has been acquired. If not, then a hash collision has occurred (e.g., a locking episode associated with a different Hapax lock caused the value to be updated). In this case, the thread starts over by checking Depart and then polling the array element if necessary. This scheme minimizes coherence traffic associated with polling. When an unlocking core stores a value into an array element, the associated cache line will typically be present only in the cache of the next core in line. Coherence traffic only involves the locking and unlocking cores. Other threads (which are further back in the line) will be polling other array elements, and thus loading from other cache lines, so the cores those threads run on won’t see the coherence messages. Fig. 3 has results from a microbenchmark. Hapax locks scale much better than ticket locks and go head-to-head with other state-of-the-art locking algorithms. The Hapax implementation is so concise (about 100 lines) that the authors included C++ source code in the paper. Source: https://dl.acm.org/doi/10.1145/3774934.3786443 Dangling Pointers The big downside of spinning is that it wastes cycles when there are other threads that the OS could schedule. I wonder if there is a lightweight coordination mechanism available. For example, the OS could write scheduling information into memory that is mapped read-only into user space.
This could be used to communicate to the spinning code whether or not there are other threads ready to run.
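The polling-array index computation (bits [27:16] of the Hapax value) can be sketched as follows; using those bits directly as the index is my simplification of the paper's hash:

```cpp
#include <cassert>
#include <cstdint>

// Bits [27:16] give 12 bits -> 4096 slots, matching the array size in
// the paper. Because a thread allocates Hapax values in 64K blocks,
// the low 16 bits vary within one block while bits [27:16] stay fixed,
// so all episodes drawn from one block map to the same slot.
uint32_t poll_slot(uint64_t hapax_value) {
    return (hapax_value >> 16) & 0xFFF;
}
```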


Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures

Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures Reza Ghanbari, Henry Kao, João P. L. De Carvalho, Ehsan Amiri, and J. Nelson Amaral CGO'25 This paper serves as a warning: don’t go overboard with vector instructions. There is a non-trivial amount of performance to be had by balancing compute between scalar and vector instructions. Even if you fear that automatic vectorization is fragile, this paper has some interesting lessons. Listing 1 contains a vectorizable loop and listing 2 shows a vectorized implementation: Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Source: https://dl.acm.org/doi/10.1145/3696443.3708950 After achieving this result, one may be tempted to pat oneself on the back and call it a day. If you were a workaholic, you might profile the optimized code. If you did, you would see something like the data in table 1: Source: https://dl.acm.org/doi/10.1145/3696443.3708950 And you could conclude that this algorithm is compute-bound. But what do we really mean by “compute-bound”? A processor contains many execution ports, each with a unique set of capabilities. In the running example, the execution ports capable of vector multiplication and addition are fully booked, but the other ports are sitting mostly idle! Listing 3 shows a modified loop which tries to balance the load between the vector and scalar execution ports. Each loop iteration processes 9 elements (8 via vector instructions, and 1 via scalar instructions). This assumes that the processor supports fast unaligned vector loads and stores. Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Section 3 has details on how to change LLVM to get it to do this transformation. Fig. 3 shows benchmark results. By my calculations, the geometric mean of the speedups is 8%. Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Dangling Pointers This paper builds on top of automatic vectorization. 
In other words, the input source code is scalar and the compiler vectorizes loops while balancing the workload. An alternative would be to have the source code in a vectorized form and then let the compiler “devectorize” where it makes sense.
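The shape of the Listing 3 transformation can be sketched in plain C++ (a reconstruction of the idea, not the paper's LLVM-generated code): each iteration handles 9 elements, 8 in a block the compiler can vectorize plus 1 scalar, so both the vector and scalar execution ports stay busy.

```cpp
#include <cassert>
#include <vector>

// SAXPY-style loop with scalar interpolation: 8 vectorizable lanes
// plus one interleaved scalar element per iteration.
void saxpy_interpolated(float a, const float* x, float* y, int n) {
    int i = 0;
    for (; i + 9 <= n; i += 9) {
        for (int j = 0; j < 8; ++j)    // vectorizable 8-wide body
            y[i + j] += a * x[i + j];
        y[i + 8] += a * x[i + 8];      // interleaved scalar element
    }
    for (; i < n; ++i)                 // scalar remainder
        y[i] += a * x[i];
}
```

As the paper notes, this pays off only when unaligned vector loads and stores are fast, since the vector block's starting offset advances by 9, not 8.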


Flexible I/O for Database Management Systems with xNVMe

Flexible I/O for Database Management Systems with xNVMe Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier González, Vivek Shah, Pınar Tözün CIDR’26 This paper describes xNVMe , a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB. Section 2 contains the hard sell for xNVMe. The “x” prefix serves a similar role to the “X” in DirectX. It is fast, while also being portable across operating systems and storage devices. The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry). There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously). Listing 3 has an example, which feels like “Mantle for NVMe”: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf The API works on Linux, FreeBSD, Windows, and macOS. Some operating systems have multiple backends available (e.g., io_uring and libaio on Linux). The point of this paper is that xNVMe is easy to drop into an existing application. The paper describes an implementation of the DuckDB file system interface that uses xNVMe underneath. It creates dedicated queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel). The paper also describes how xNVMe supports shiny new NVMe features like Flexible Data Placement (FDP). This allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance). Most of the results in the paper show comparable performance for the xNVMe-based file system vs. the baseline DuckDB file system. Fig. 5 shows one benchmark where it yields a significant improvement: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf Dangling Pointers I think the long-term success of xNVMe will depend on governance. Potential members of the ecosystem could be scared off by Samsung’s potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?)
There is a delicate balancing act between an API driven by a sluggish bureaucratic committee and an API dominated by one vendor.


A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation

A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation Christian Lanius, Florian Freye, and Tobias Gemmeke GLVLSI'25 This paper ostensibly describes an ASIC accelerator for NFA evaluation (e.g., regex matching), but it also describes two orthogonal techniques for optimizing NFA evaluation which are applicable to more than just this ASIC. Any regular expression can be converted to a non-deterministic finite automaton (NFA). Think of an NFA like a state machine where some inputs can trigger multiple transitions. The state machine is defined by a set of transitions. A transition is a (current state, input symbol, next state) tuple. The non-deterministic naming comes from the fact that multiple tuples may exist with identical (current state, input symbol) values; they differ only in their next states. This means that an NFA can be in multiple states at once. One way to evaluate an NFA is to use a bitmap to track the set of active states. For each new input symbol, the set of active states in the bitmap is used to determine which transitions apply. Each activated transition sets one bit in the bitmap used to represent the active states for the next input symbol. The hardware described in this paper uses a compute-in-memory (CIM) microarchitecture. A set of columns stores the state machine, with each column storing one transition. This assumes that the transition function is sparse (i.e., the number of transitions used is much lower than the maximum possible). During initialization, the transitions are written into the CIM hardware. An input symbol is processed by broadcasting it and the current state bitmap to all columns. All columns evaluate whether their transition should be activated. The hardware then iterates (over multiple clock cycles) over all activated transitions and updates the state bitmap for the next input symbol. The left side of Fig. 5 illustrates the hardware in each column which compares the input symbol and current state against the stored tuple: Source: https://dl.acm.org/doi/10.1145/3716368.3735157 The algorithm described above processes at most one input symbol per cycle (and it is slower for inputs that activate multiple transitions). The paper contains two tricks for overcoming this limitation. Cool Trick #1 - Symbol Packing Fig. 4 illustrates how an NFA that accepts one symbol per cycle can be converted into an NFA which accepts two symbols per cycle. For example, rather than considering two consecutive symbols (say, a and b) to be separate inputs, put them together into one mega-symbol: ab. This is feasible as long as your NFA implementation isn’t too sensitive to the number of bits per symbol. Source: https://dl.acm.org/doi/10.1145/3716368.3735157 Cool Trick #2 - Bloom Filter The target application for this hardware is monitoring network traffic for threats (e.g., Snort ). A key observation is that most inputs (network packets) do not produce a match, so it is reasonable to assume that most of the time the NFA will be in the initial state, and most input symbols will not trigger any transitions. If that assumption holds, then a bloom filter can be used to quickly skip many input symbols before they even reach the core NFA evaluation hardware. The bloom filter is built when the NFA transition function changes. To build the bloom filter, iterate over each transition out of the initial state. For each such transition, compute a hash of the input symbol, decompose the hashed value into multiple indices, and set the corresponding bits in the bloom filter. To test an input symbol against the bloom filter, hash the input symbol, decompose the hashed value into indices, and check to see if all of the corresponding bits are set in the bloom filter. If any bit is not set, then the input symbol does not trigger a transition from the initial state. When that symbol finally arrives at the NFA hardware, it can be dropped if the NFA is in the initial state.
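The bitmap-based NFA evaluation described above can be sketched in software (a minimal model of the algorithm, not the CIM hardware):

```cpp
#include <bitset>
#include <cassert>
#include <vector>

constexpr int kMaxStates = 64;

// One (current state, input symbol, next state) tuple.
struct Transition { int state; unsigned char symbol; int next; };

// Process one input symbol: every transition whose (state, symbol)
// matches an active state sets its next-state bit for the next step.
// The hardware does this broadcast across all columns in parallel.
std::bitset<kMaxStates> step(const std::vector<Transition>& nfa,
                             const std::bitset<kMaxStates>& active,
                             unsigned char symbol) {
    std::bitset<kMaxStates> next;
    for (const auto& t : nfa)
        if (t.symbol == symbol && active[t.state]) next.set(t.next);
    return next;
}
```

A real matcher would also keep the initial state permanently active so that matches can start at any input position; this sketch tracks only a single pass.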
Table 1 compares PPA results against other published NFA accelerators. It is a bit apples-to-oranges, as the various designs target different technology nodes. The metric that stands out is the low power consumption of this design. Source: https://dl.acm.org/doi/10.1145/3716368.3735157 Dangling Pointers I wonder if the bloom filter trick can be extended. For example, rather than assuming the NFA will always be in the initial state, the hardware could dynamically compute which states are the most frequent and then use bloom filters to drop input symbols which cannot trigger any transitions from those states.


Better Memory Tiering, Right from the First Placement

Better Memory Tiering, Right from the First Placement João Póvoas, João Barreto, Bartosz Chomiński, André Gonçalves, Fedar Karabeinikau, Maciej Maciejewski, Jakub Schmiegel, and Kostiantyn Storozhuk ICPE'25 This paper addresses the first placement problem in systems with multiple tiers of memory (e.g., DRAM paired with HBM, or local DRAM paired with remote DRAM accessed over CXL). The paper cites plenty of prior work which dynamically migrates pages/allocations out of suboptimal memory tiers. What is different about this paper is that it attempts to avoid placing data in a suboptimal tier in the first place. The key insight is: statistics from one allocation can be used to generate better placements for similar allocations which will occur in the future. Fig. 3 offers insight into how much waste there is in a policy which initially places all pages into a fast tier and then migrates them to a slower tier if they are accessed infrequently. The figure shows results from one migration policy, applied to three benchmarks. Source: https://dl.acm.org/doi/10.1145/3676151.3719378 Allocation Contexts This paper proposes gathering statistics for each allocation context . An allocation context is defined by the source code location of the allocation, the call stack at the moment of allocation, and the size of the allocation. If two allocations match on these attributes, then they are considered part of the same context. The system hooks heap allocation functions (e.g., malloc and free) to track all outstanding allocations associated with each allocation context. An x86 PMU event is used to determine how frequently each allocation context is accessed. A tidbit I learned from this paper is that some x86 performance monitoring features do more than just count events. For example, the event used here randomly samples load operations and emits the accessed (virtual) address. Given the accessed address, it is straightforward to map back to the associated allocation context.
The hotness of an allocation context is the frequency of these access events divided by the total size of all allocations in the context. Time is divided into epochs. During an epoch, the hotness of each allocation context is recalculated. When a new allocation occurs, the hotness of the allocation context (from the previous epoch) is used to determine which memory tier to place the allocation into. The paper only tracks large allocations (at least 64 bytes). For smaller allocations, the juice is not worth the squeeze. These allocations are assumed to be short-lived and frequently accessed. This paper also describes a kernel component which complements the user-space policy described so far. Whereas the user-space code deals with allocations, the kernel code deals with pages. This is useful for allocations which do not access all pages uniformly. It is also useful for detecting and correcting suboptimal initial placements. All PTEs associated with all allocations are continually scanned. The accessed bit indicates whether a page has been touched since the last scan; the dirty bit indicates whether it has been written since the last scan. After 10 scans, the system has a pretty good idea of how frequently a page is accessed. These statistics are used to migrate pages between fast and slow tiers. Fig. 8 shows execution time for three benchmarks; one of the plotted configurations represents the combined user-space and kernel solutions described by this paper. Source: https://dl.acm.org/doi/10.1145/3676151.3719378 Dangling Pointers I wasn’t able to find details in the paper about how PTE scanning works without interfering with other parts of the OS. For example, doesn’t the OS use the dirty bit to determine if it needs to write pages back to disk? I assume the PTE scanning described in this paper must reset the dirty bit on each scan. The definition of an allocation context seems ripe for optimization. I suspect that allowing some variability in call stack or allocation size would allow for better statistics.
Maybe this is a good use case for machine learning?
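The per-context hotness bookkeeping described above can be sketched as follows; the placement threshold value is an illustrative assumption, not the paper's:

```cpp
#include <cassert>
#include <cstdint>

// Per-allocation-context statistics gathered during one epoch:
// sampled load events divided by the total outstanding bytes.
struct Context {
    uint64_t sampled_loads = 0;  // PMU load samples hitting this context
    uint64_t total_bytes = 0;    // sum of outstanding allocation sizes

    double hotness() const {
        return total_bytes ? (double)sampled_loads / (double)total_bytes
                           : 0.0;
    }
};

// First placement: a new allocation goes to the fast tier if its
// context was hot enough in the previous epoch.
bool place_in_fast_tier(const Context& prev_epoch,
                        double threshold = 0.001) {
    return prev_epoch.hotness() >= threshold;
}
```

Dividing by total size (rather than using raw sample counts) keeps one huge, rarely-touched allocation from looking as hot as a small, hammered one.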


Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters

Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters Kaiyang Zhao, Kaiwen Xue, Ziqi Wang, Dan Schatzberg, Leon Yang, Antonis Manousis, Johannes Weiner, Rik Van Riel, Bikash Sharma, Chunqiang Tang, and Dimitrios Skarlatos ISCA'23 This paper has a lot of great statistics from the Meta fleet. Memory capacity per server is growing over time, but TLB size is not, thus TLB coverage (the fraction of main memory that can be referenced by the TLB at any one time) is trending downward: Source: https://dl.acm.org/doi/10.1145/3579371.3589079 As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles! Source: https://dl.acm.org/doi/10.1145/3579371.3589079 A larger page size could increase TLB coverage; however, large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world: We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page. The biggest culprit is unmovable (i.e., pinned) pages, such as buffers pinned for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault (here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages. The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions.
One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn’t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Hardware Page Migration The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving “unmovable” pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration. The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When the access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page. Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation. Fig. 10 has results for memory segmentation. “Partial” represents a case where physical memory is partially fragmented, whereas “full” represents full fragmentation. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Dangling Pointers If the NIC is the primary culprit, some vertical integration might be called for here. 
For example, allocations used to send packets cycle through three states:
1. Empty
2. CPU writing to it
3. NIC reading from it
It seems like it would be fine for the OS to migrate a packet when it is not in state #3.
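Returning to the hardware page migration described earlier, here is a toy model (my own sketch, with invented names) of the per-cache-line redirect the LLC performs while a page is being copied:

```python
# During migration, a progress counter tracks how many cache lines have
# been copied; accesses to already-copied lines go to the destination
# page, everything else still goes to the source page.
LINE_SIZE = 64
PAGE_SIZE = 4096

class MigratingPage:
    def __init__(self, src, dst):
        self.src = src            # source physical page number
        self.dst = dst            # destination physical page number
        self.lines_copied = 0     # copy progress, in cache lines

    def copy_next_line(self):
        if self.lines_copied < PAGE_SIZE // LINE_SIZE:
            self.lines_copied += 1

    def route(self, offset):
        """Return which physical page services an access at this offset."""
        line = offset // LINE_SIZE
        return self.dst if line < self.lines_copied else self.src
```

Once `lines_copied` reaches the end of the page, every access routes to the destination, and the OS can then asynchronously invalidate stale TLB entries before reusing the source page.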

Dangling Pointers 2 months ago

Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs

Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs Noushin Azami, Alex Fallin, and Martin Burtscher ASPLOS'25 This paper describes four (creatively named) lossless compression algorithms:
SPspeed: 32-bit floating-point, optimized for speed
DPspeed: 64-bit floating-point, optimized for speed
SPratio: 32-bit floating-point, optimized for compression ratio
DPratio: 64-bit floating-point, optimized for compression ratio
The claim to fame here is excellent performance on both CPUs and GPUs. Each compressor is implemented as a pipeline of transformations, illustrated in Fig. 1: Source: https://dl.acm.org/doi/10.1145/3669940.3707280 Decompression is similar, but the order of stages is reversed. The goal of DIFFMS is to transform the data such that most of the upper bits of each element to be compressed are 0. DIFFMS interprets inputs as integers (32-bit or 64-bit) and replaces element i with the difference between elements i and i−1. Differences are stored in sign-magnitude format. Neighboring values typically have small differences, so fewer bits are needed to store differences rather than raw values. Converting to sign-magnitude causes the upper bits of negative differences (which are close to zero) to be zero. The sign bit is stored in the least significant position, to ensure that it doesn't contaminate the most-significant bit with a one. The range of representable values in sign-magnitude format is one less than the range of values representable in two's complement, but I suppose that is OK because that situation can only arise if an input float is NaN. Next, MPLG (introduced in a previous paper) operates on chunks of elements. For each chunk, MPLG finds the element with the highest-order non-zero bit and uses that to determine the number of leading bits which can safely be removed from each element in the chunk. This number of leading bits is stored in per-chunk metadata, and the input stream is bit-packed after removing those leading bits. Fig. 3 illustrates the bit packing: Source: https://dl.acm.org/doi/10.1145/3669940.3707280 BIT The MPLG stage has the "one bad apple spoils the bunch" property. The BIT stage addresses that with a transpose. Each chunk is interpreted as a 2D array of bits, and that 2D array is transposed. Say most elements in a chunk only require the least significant 8 bits, but one element requires 10 bits. Then after the MPLG stage, most elements will have 2 leading zeros in them. After the BIT stage, these zeros will all be grouped together rather than spaced apart. After BIT has arranged the data such that there are many long ranges of zero bits in it, Repeated Zero Elimination (RZE) finds and removes bytes which are equal to zero. RZE produces both the output stream (with "zero bytes" removed) and a bitmap indicating which bytes were removed. The authors found that RZE doesn't work well for the low-order bits of double-precision data. They address this with RAZE and RARE, which do not try to eliminate ranges of zeros from the low-order bits. Each stage in the pipeline operates on chunks of data. The only interaction between chunks is that the size of the output data produced by chunk N is needed before the output offset of chunk N+1 is known. Efficient parallelization is possible in both CPU and GPU implementations in spite of this hazard. As far as I can tell, there is no explicit attempt to ensure that the data passed between stages is kept on-chip. It could be that CPU and GPU caches work well enough for this purpose. These algorithms extend the Pareto frontier for both CPU and GPU.
Source: https://dl.acm.org/doi/10.1145/3669940.3707280 Dangling Pointers The state of the art feels very hand-crafted, somewhat analogous to the state of the art in image classification before AlexNet moved the field from handcrafted feature engineering to more generalized models. In the text compression space, LZ-based compressors leave the same aftertaste.
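The core transforms described above (DIFFMS, MPLG's leading-bit count, and RZE) can be sketched in Python. This is my own illustrative reconstruction from the paper's description, not the authors' code, and it omits the actual bit-packing and per-chunk bookkeeping:

```python
import struct

def diffms(values):
    """DIFFMS sketch: reinterpret each 32-bit float as an integer,
    difference-encode, and store each difference in sign-magnitude
    form with the sign bit in the least significant position."""
    ints = [struct.unpack('<I', struct.pack('<f', v))[0] for v in values]
    out, prev = [], 0
    for x in ints:
        d = x - prev
        prev = x
        out.append((abs(d) << 1) | (1 if d < 0 else 0))
    return out

def mplg_leading_bits(chunk, width):
    """MPLG sketch: how many leading bits can be dropped from every
    element of a chunk, based on the highest-order non-zero bit."""
    highest = max((x.bit_length() for x in chunk), default=0)
    return width - highest

def rze(data):
    """RZE sketch: remove zero bytes, emitting a bitmap of removals."""
    bitmap, out = [], bytearray()
    for b in data:
        bitmap.append(1 if b == 0 else 0)
        if b != 0:
            out.append(b)
    return bytes(out), bitmap
```

Decompression would invert each step using the stored metadata (the per-chunk leading-bit counts and the RZE bitmap).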


EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation

EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation Weigao Su and Vishal Shrivastav ASPLOS'25 This paper describes incremental changes to Ethernet NICs and switches to enable efficient disaggregation of memory without the need for a separate network (e.g., CXL) for memory traffic. Fig. 1 shows the north star: Source: https://dl.acm.org/doi/10.1145/3669940.3707221 Servers are partitioned into Compute Nodes and Memory Nodes. When a compute node wants to access remote memory, it issues a request to its local NIC, which sends the request to the correct memory node (via a switch). The key problem this paper addresses is Ethernet fabric latency (i.e., the time taken for requests/responses to flow between NICs and switches). The paper assumes that the latency between the processor and the NIC is low (and cites other papers which describe techniques for reducing this latency to below 100ns). Typical Ethernet fabric latency is measured in microseconds, which is much higher than a local memory access. The Ethernet hardware stack can be decomposed into MAC and PHY layers. The MAC is higher level and sits on top of the PHY. The paper proposes implementing EDM (Ethernet Disaggregated Memory) with modifications to the PHY layer in both the NIC and the switch. Normal network packets flow through the MAC and PHY as they usually would, but a side channel exists which allows remote memory accesses to be handled directly by the enhanced PHY layer. Fig. 3 illustrates the hardware changes in Ethernet NICs and switches. Source: https://dl.acm.org/doi/10.1145/3669940.3707221 Remote memory access requests and responses are smaller than typical Ethernet packets. Additionally, end-to-end application performance is more sensitive to remote memory access latency than the latency of regular network traffic. The bulk of the paper describes how EDM achieves low latency for remote memory traffic. The EDM PHY modifications allow a memory request to preempt a non-memory packet.
Say the MAC sends a 1KiB packet to the PHY, which begins to send the packet over the wire in 66-bit blocks. If a memory request shows up in the middle of transmitting the network packet, the PHY can sneak the memory request onto the wire between 66-bit blocks, rather than waiting for the whole 1KiB to be sent. Standard Ethernet requires 96 bits of zeros to be sent on the wire between each packet. This overhead is small for large packets, but it is non-trivial for small packets (like remote memory access requests). The EDM PHY modifications allow these idle bits to be used for remote memory accesses. The MAC still sees the gaps, but the PHY does not. If you ask an LLM what could possibly go wrong by trying to use the inter-frame gap to send useful data, it will spit out a long list. I can't find much detail in the paper about how to ensure that this enhancement is robust. The possible problems are limited to the PHY layer, however, as the MAC still sees the zeros it expects. To avoid congestion and dropping of memory requests, EDM uses an in-network scheduling algorithm somewhat like PFC. The EDM scheduler is in the PHY layer of the switch. Senders notify the switch when they have memory traffic to send, and the switch responds later with a grant, allowing a certain amount of data to be sent. The authors implemented EDM on FPGAs (acting as both NIC and switch). Table 1 compares latencies for TCP/IP, RDMA, raw Ethernet packets, and EDM, breaking down latencies at each step: Source: https://dl.acm.org/doi/10.1145/3669940.3707221 Fig. 7 throws CXL into the mix: Source: https://dl.acm.org/doi/10.1145/3669940.3707221 Dangling Pointers Section 3.3 "Practical Concerns" has a discussion of what could go wrong (e.g., fault tolerance and data corruption). It is hard to judge how much work is needed to make this into something that industry could rely on.
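The block-level preemption can be sketched as follows. This is my own toy serializer (the names and framing details are invented, not the paper's PHY design): a frame is cut into 66-bit blocks, and a pending memory request is slotted onto the wire between two blocks instead of waiting for the whole frame.

```python
BLOCK_BITS = 66  # 64b/66b line coding block size

def serialize(frame_bits, memory_requests):
    """Yield wire units for one MAC frame; memory requests preempt
    between 66-bit blocks. `memory_requests` is a sorted list of
    (arrival_block, payload) pairs."""
    wire = []
    pending = list(memory_requests)
    n_blocks = (frame_bits + BLOCK_BITS - 1) // BLOCK_BITS
    for i in range(n_blocks):
        # Any memory request that has arrived by block i jumps the queue.
        while pending and pending[0][0] <= i:
            wire.append(('mem', pending.pop(0)[1]))
        wire.append(('pkt', i))
    # Requests arriving after the frame finished go out afterwards.
    wire.extend(('mem', p) for _, p in pending)
    return wire
```

In this model a memory request waits at most one 66-bit block time, rather than the full transmission time of a 1KiB frame.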


An Analysis of User-space Idle State Instructions on x86 Processors

An Analysis of User-space Idle State Instructions on x86 Processors Malte-Christian Kuns, Hannes Tröpgen, and Robert Schöne ICPE'25 I've long believed that busy waiting is poor form. The closest thing you should ever come to busy waiting is to lock a mutex, which will busy wait for a short while on your behalf. If your primary concern is power consumption, then busy waiting may be less offensive on a modern processor. This paper describes newly added x86 instructions to enable low power busy waiting from user space, and has a ton of data to help you sleep better at night. TPAUSE puts the processor into a low power state for a user-specified amount of time. TPAUSE supports two low power states (C0.1 and C0.2), which trade power consumption for wake-up latency. TPAUSE can be called in user space but doesn't wrest control of the core away from the OS. The trick is the OS can set a maximum timeout value, which gives the OS a chance to switch away from the busy waiting thread. The UMONITOR and UMWAIT instructions are similar to TPAUSE but allow the processor to be woken up when a write occurs in a specified memory range. UMONITOR sets up the memory range to be monitored, and UMWAIT causes the processor to enter a low power state. UMWAIT accepts a timeout value and a target power state (just like TPAUSE). AMD supports similar functionality via the MONITORX and MWAITX instructions. A key question the paper investigates is how closely the user-specified timeout is honored. Fig. 1 shows results for three Intel cores: Source: https://dl.acm.org/doi/10.1145/3676151.3719370 Times are measured in timestamp counter cycles (roughly 3 GHz for Alder Lake). The plateau at the top right is caused by the OS-specified maximum timeout. The authors find that timeout values are quantized (e.g., 83 cycles on an Alder Lake P-core). Additionally, for short timeouts the processor may ignore the user-requested power state (presumably because it doesn't make sense to enter a deep sleep for a short amount of time).
On Alder Lake P-cores, the threshold below which the processor will not enter the lowest power state is around 23,000 TSC cycles. Alder Lake E-cores seem to only support one low power state. Fig. 3 measures how much the processor can "oversleep" (wake up later than requested) depending on processor frequency and requested power state: Source: https://dl.acm.org/doi/10.1145/3676151.3719370 And finally, Table 2 shows measured power consumption for these new instructions vs. old-fashioned busy wait loops that use the PAUSE instruction (which does not support a user-specified timeout): Source: https://dl.acm.org/doi/10.1145/3676151.3719370 I'm shocked by the advantage that AMD has here. If CPU core power during busy waiting is your primary concern, then you should choose your chip carefully. Condition variables are a general and useful abstraction. It would be nice if code that used condition variables could automatically benefit from these instructions. Maybe some compiler and/or hardware assistance is necessary to enable that.
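To illustrate the timeout behavior the paper measures, here is a toy model. The 83-cycle quantum comes from the paper's Alder Lake P-core measurement; the rounding direction and the OS maximum used here are assumptions for illustration only.

```python
def effective_timeout(requested, quantum=83, os_max=2_000_000):
    """Toy model of the effective TPAUSE/UMWAIT timeout: the request is
    clamped to the OS-programmed maximum (the plateau in Fig. 1) and then
    quantized (assumed here to round up to the next quantum)."""
    clamped = min(requested, os_max)
    return ((clamped + quantum - 1) // quantum) * quantum
```

This reproduces the two effects visible in Fig. 1: a staircase at small timeouts and a flat plateau once the OS maximum kicks in.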


Re-architecting End-host Networking with CXL: Coherence, Memory, and Offloading

Re-architecting End-host Networking with CXL: Coherence, Memory, and Offloading Houxiang Ji, Yifan Yuan, Yang Zhou, Ipoom Jeong, Ren Wang, Saksham Agarwal, and Nam Sung Kim MICRO'25 This paper is the third one that I've posted about which deals with the subtleties of interfacing a NIC and a host CPU. Here are links to the previous posts on this subject:
Disentangling the Dual Role of NIC Receive Rings
CEIO vs rxBisect: Fixing DDIO's Leaky DMA Problem
The authors bring a new hammer to the construction site: CXL, which offers some interesting efficiencies and simplifications. This paper shows how CXL can address two specific problems with the HW/SW interface of a typical PCIe NIC:
1. After the host prepares a packet to be transmitted, it notifies the NIC with an MMIO write. This MMIO write is expensive because it introduces serialization into the host processor pipeline.
2. When the NIC sends a received packet to the host, ideally it would write data to the LLC rather than host DRAM. However, if the host CPU cannot keep up, then the NIC should have a graceful fallback.
CXL Type-1 devices are asymmetric: the device has coherent access to host memory, but the host does not have coherent access to device memory. Practically speaking, both packet descriptors and packet payloads must still be stored in host memory (no change from PCIe-based NICs). Because the NIC has coherent access to host memory, it can safely prefetch receive descriptors (RxDs) into an on-NIC cache. When a packet arrives, the NIC can grab a descriptor from the cache and thus avoid an expensive host memory read to determine where to write packet data. If the host CPU updates an RxD after the NIC has prefetched it, the CXL cache coherence protocol will notify the NIC that it must invalidate its cached data. Coherence also enables the tail pointers for transmit ring buffers to be safely stored in host memory.
The host networking stack can update a tail pointer with a regular store instruction (rather than an MMIO write). The NIC can continually poll this value, using coherent reads. If the tail index pointer has not been updated since the last poll, the NIC will read a cached value and not generate any PCIe traffic. CXL Type-2 NICs allow packets and descriptors to be stored in NIC memory. The host CPU can cache data read from the NIC, as the NIC will generate the necessary coherence traffic when it reads or writes this data. The design space (what data goes into what memory) is large, and the results section has numbers for many possible configurations. Section 5.3 of the paper describes how a type-2 NIC can intelligently use the CXL operation to write received packet data directly into the host LLC. This is similar to DDIO (described in the two papers linked at the top of this post), but the key difference is that the NIC is in the driver's seat. The CEIO paper proposes monitoring LLC usage and falling back to storing received packets in DRAM local to the NIC if the LLC is too full. With CXL, the NIC has the option to write data to host memory directly (bypassing the LLC), thus avoiding the need for DRAM attached to the NIC. The authors implemented a CXL NIC on an Altera FPGA. They compared results against an NVIDIA BlueField-3 PCIe NIC. Fig. 10 compares loopback latency for the two devices, normalized to the BlueField-3 latency (lower is better) for a variety of CXL configurations. Source: https://dl.acm.org/doi/pdf/10.1145/3725843.3756102 Dangling Pointers One fact I took away from this paper is that CXL coherence messages are much cheaper than MMIOs and interrupts. Burning a CPU core polling a memory location seems wasteful to me. It would be nice if that CPU core could at least go into a low power state until a relevant coherence message arrives.
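A toy model (my own sketch, not the paper's hardware) of why coherent tail-pointer polling is cheap: the NIC polls its cached copy of the pointer and only pays for an interconnect read after a host store invalidates that copy.

```python
class CoherentTailPointer:
    """Toy model of a tail pointer kept in host memory and cached by the
    NIC under a coherence protocol."""
    def __init__(self):
        self.value = 0          # authoritative copy in host memory
        self.nic_cached = 0     # NIC's cached copy
        self.nic_valid = True
        self.bus_reads = 0      # coherence-traffic counter

    def host_store(self, v):
        # A regular host store; the coherence protocol invalidates the
        # NIC's cached copy as a side effect.
        self.value = v
        self.nic_valid = False

    def nic_poll(self):
        # Polling a valid cached copy generates no interconnect traffic.
        if not self.nic_valid:
            self.nic_cached = self.value  # one coherent read to refill
            self.nic_valid = True
            self.bus_reads += 1
        return self.nic_cached
```

However fast the NIC polls, the interconnect only sees one read per host update, which is the efficiency the paper exploits to replace MMIO doorbells.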


Forest: Access-aware GPU UVM Management

Forest: Access-aware GPU UVM Management Mao Lin, Yuan Feng, Guilherme Cox, and Hyeran Jeon ISCA'25 Unified virtual memory is an abstraction which presents a single unified address space for both the CPU and GPU. This is a convenient programming model because it allows one device to create a complex data structure (with pointers) and pass that directly to the other device. Maintaining this illusion in systems with discrete GPUs is a complex task. The state-of-the-art involves initially placing allocations in host memory and then copying them to GPU memory to resolve GPU page faults (far-faults). Prefetching can help avoid some far-faults. A state-of-the-art prefetcher is called the tree-based neighboring prefetcher (TBNp). Fig. 3 shows the TBNp data structure for a 512KiB allocation: Source: https://dl.acm.org/doi/10.1145/3695053.3731047 TBNp tracks each 2MiB of virtual memory with a binary tree containing 64KiB leaf nodes. When a far-fault occurs on a 4KiB page (yellow "1" in Fig. 3), the remaining 60KiB (blue "2" in Fig. 3) in the leaf are prefetched. Once the leaves contained by a particular sub-tree are >= 50% resident in GPU memory, the remainder of the sub-tree is prefetched. In Fig. 3 that situation occurs when the three left-most leaf nodes have been transferred to GPU memory. At that point, the remaining leaf node in the subtree (green "7" in Fig. 3) is prefetched. Access Pattern Profiling This paper proposes customizing the TBNp trees associated with each object, with the help of access pattern profiling hardware. Fig. 7 illustrates the design: Source: https://dl.acm.org/doi/10.1145/3695053.3731047 The GPU MMU tracks access times for each object (i.e., each allocation) and for each page. Once a healthy amount of profiling data has been collected, the GPU fires an interrupt which notifies the driver that it should grab the latest statistics.
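The TBNp policy above can be sketched as follows. This is my own simplification, not the paper's implementation: it models only one subtree level, with an illustrative four-leaf subtree and the 50% residency threshold described above.

```python
# Sketch of TBNp-style prefetching: a fault pulls in the faulting leaf,
# and once >= 50% of a subtree's leaves are resident, the rest of that
# subtree is prefetched too.
SUBTREE_LEAVES = 4  # illustrative subtree width

def on_far_fault(resident_leaves, faulting_leaf):
    """Return the set of leaves to transfer to GPU memory after a fault."""
    to_fetch = {faulting_leaf}                 # leaf containing the page
    resident = resident_leaves | to_fetch
    subtree = faulting_leaf // SUBTREE_LEAVES
    members = {subtree * SUBTREE_LEAVES + i for i in range(SUBTREE_LEAVES)}
    if len(members & resident) * 2 >= SUBTREE_LEAVES:
        to_fetch |= members - resident         # prefetch rest of subtree
    return to_fetch
```

Forest keeps this basic policy but, as described below, varies the tree and leaf sizes per object.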
Access Pattern Classification The driver classifies access patterns for each object into one of the following four buckets:
1. Linear/Streaming
2. Non-Linear, High-Coverage, High-Intensity
3. Non-Linear, High-Coverage, Low-Intensity
4. Non-Linear, Low-Coverage
The paper uses tree-based prefetching like TBNp, but configures the trees differently for each object depending on the access pattern bucket. This is where the name "Forest" comes from: each tree has a maximum size, so large objects are chopped up and tracked with multiple trees. Streaming accesses are detected with linear regression. If the R² value is close to 1, then the driver classifies the accesses as streaming. The prefetching trees used for objects accessed in a streaming manner are 4MiB in size and contain 256KiB leaf nodes (relatively large). If the driver determines accesses are not streaming, then the choice between the remaining three buckets is determined by the access coverage and access intensity. Access coverage is computed based on the minimum and maximum page numbers accessed during profiling. Access intensity is based on the number of accesses during profiling. For objects with high access coverage and high access intensity, the associated prefetching trees are 512KiB in size and contain 64KiB leaf nodes. For objects with high access coverage and low access intensity, the associated prefetching trees are 512KiB in size and contain 16KiB leaf nodes. Finally, for objects with low access coverage, the associated prefetching trees are 2MiB in size and contain 64KiB leaf nodes. Fig. 12 contains simulation results for a number of benchmarks: Source: https://dl.acm.org/doi/10.1145/3695053.3731047 One configuration is the design described in this paper; another is a modification that avoids the overheads associated with initial profiling by trying to make better initial guesses about access patterns before profiling data is available. I wonder how much vertical integration can help here.
Certainly, a number of applications have enough context to make smarter decisions than the driver relying on profiling information.
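The per-object classification might be sketched like this. The tree and leaf sizes come from the paper's description above, but the R², coverage, and intensity cutoffs here are placeholders I made up (the actual thresholds are not given above):

```python
KiB = 1024

def tree_config(r2, coverage, intensity,
                r2_cut=0.95, cov_cut=0.5, int_cut=0.5):
    """Return (tree_size, leaf_size) in bytes for an object, based on
    its profiled access pattern."""
    if r2 >= r2_cut:                        # linear/streaming
        return (4 * KiB * KiB, 256 * KiB)   # 4 MiB tree, 256 KiB leaves
    if coverage >= cov_cut and intensity >= int_cut:
        return (512 * KiB, 64 * KiB)        # high coverage, high intensity
    if coverage >= cov_cut:
        return (512 * KiB, 16 * KiB)        # high coverage, low intensity
    return (2 * KiB * KiB, 64 * KiB)        # low coverage: 2 MiB tree
```

Each object's trees are then instantiated at the chosen size, which is what makes the "forest" heterogeneous.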
