
Flexible I/O for Database Management Systems with xNVMe

Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier González, Vivek Shah, Pınar Tözün
CIDR'26

This paper describes xNVMe, a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB. Section 2 contains the hard sell for xNVMe. The "x" prefix serves a similar role to the "X" in DirectX. It is fast, while also being portable across operating systems and storage devices.

The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry). There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously). Listing 3 has an example, which feels like "Mantle for NVMe":

Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf

The API works on Linux, FreeBSD, Windows, and macOS. Some operating systems have multiple backends available. The point of this paper is that xNVMe is easy to drop into an existing application. The paper describes an implementation of the DuckDB filesystem interface built on xNVMe. It creates dedicated queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel).

The paper also describes how xNVMe supports shiny new NVMe features like Flexible Data Placement (FDP). This allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance).

Most of the results in the paper show comparable performance for xNVMe vs the baseline DuckDB filesystem. Fig. 5 shows one benchmark where xNVMe yields a significant improvement:

Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf

Dangling Pointers

I think the long-term success of xNVMe will depend on governance. Potential members of the ecosystem could be scared off by Samsung's potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?)
There is a delicate balancing act between an API driven by a sluggish bureaucratic committee and an API dominated by one vendor.
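The per-worker-queue trick is easy to sketch. Below is a toy Python model (class and method names are mine, not xNVMe's; the real API is C) showing why thread-local queues make the submission fast path lock-free:

```python
import threading
from collections import deque

class PerThreadQueues:
    """One submission queue per worker thread, so the submission
    path needs no locks (hypothetical model, not xNVMe's API)."""
    def __init__(self):
        self._local = threading.local()
        self._all = []                      # registry, for teardown/inspection
        self._reg_lock = threading.Lock()   # taken once per thread, not per I/O

    def queue(self):
        q = getattr(self._local, "q", None)
        if q is None:
            q = deque()
            self._local.q = q
            with self._reg_lock:
                self._all.append(q)
        return q

    def submit(self, io):
        # Lock-free fast path: only the calling thread touches its queue.
        self.queue().append(io)

qs = PerThreadQueues()
ids = {}

def worker(name):
    qs.submit(f"read-{name}")
    ids[name] = id(qs.queue())

t1 = threading.Thread(target=worker, args=("a",)); t1.start(); t1.join()
t2 = threading.Thread(target=worker, args=("b",)); t2.start(); t2.join()
assert ids["a"] != ids["b"]   # each worker got its own private queue
assert len(qs._all) == 2
```

The same shape appears in multi-threaded graphics code: per-thread command buffers, with synchronization only at registration and completion.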


A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation

Christian Lanius, Florian Freye, and Tobias Gemmeke
GLVLSI'25

This paper ostensibly describes an ASIC accelerator for NFA evaluation (e.g., regex matching), but it also describes two orthogonal techniques for optimizing NFA evaluation which are applicable to more than just this ASIC.

Any regular expression can be converted to a non-deterministic finite automaton (NFA). Think of an NFA like a state machine where some inputs can trigger multiple transitions. The state machine is defined by a set of transitions. A transition is a (current state, input symbol, next state) tuple. The non-deterministic naming comes from the fact that multiple tuples may exist with identical (current state, input symbol) values; they only differ in their next states. This means that an NFA can be in multiple states at once.

One way to evaluate an NFA is to use a bitmap to track the set of active states. For each new input symbol, the set of active states in the bitmap is used to determine which transitions apply. Each activated transition sets one bit in the bitmap used to represent the active states for the next input symbol.

The hardware described in this paper uses a compute-in-memory (CIM) microarchitecture. A set of columns stores the state machine, with each column storing one transition. This assumes that the transition function is sparse (i.e., the number of transitions used is much lower than the maximum possible). During initialization, the transitions are written into the CIM hardware. An input symbol is processed by broadcasting it and the current state bitmap to all columns. All columns evaluate whether their transition should be activated. The hardware then iterates (over multiple clock cycles) over all activated transitions and updates the state bitmap for the next input symbol. The left side of Fig. 5 illustrates the hardware in each column which compares the input symbol and current state against the stored tuple:

Source: https://dl.acm.org/doi/10.1145/3716368.3735157

Cool Trick #1 - Multiple Symbols per Cycle

The algorithm described above processes at most one input symbol per cycle (and it is slower for inputs that activate multiple transitions). The paper contains two tricks for overcoming this limitation. Fig. 4 illustrates how an NFA that accepts one symbol per cycle can be converted into an NFA which accepts two symbols per cycle. For example, rather than consider two consecutive symbols to be separate inputs, put them together into one mega-symbol. This is feasible as long as your NFA implementation isn't too sensitive to the number of bits per symbol.

Source: https://dl.acm.org/doi/10.1145/3716368.3735157

Cool Trick #2 - Bloom Filter

The target application for this hardware is monitoring network traffic for threats (e.g., Snort). A key observation is that most inputs (network packets) do not produce a match, so it is reasonable to assume that most of the time the NFA will be in the initial state, and most input symbols will not trigger any transitions. If that assumption holds, then a bloom filter can be used to quickly skip many input symbols before they even reach the core NFA evaluation hardware.

The bloom filter is built when the NFA transition function changes. To build the bloom filter, iterate over each transition whose current state is the initial state. For each such transition, compute a hash of the input symbol, decompose the hashed value into indices, and set the corresponding bits in the bloom filter. To test an input symbol against the bloom filter, hash the input symbol, decompose the hashed value into indices, and check to see if all of the corresponding bits are set in the bloom filter. If any bit is not set, then the input symbol does not trigger a transition from the initial state. When that symbol finally arrives at the NFA hardware, it can be dropped if the NFA is in the initial state.
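Both the bitmap evaluation and the bloom filter prefilter can be sketched in a few lines. The Python below is a toy model (state numbering, hash function, and filter size are all invented), with the initial-state bit held active so a match can begin at any input position:

```python
# Toy model of the paper's NFA evaluation: a bitmap of active states,
# a sparse transition list, and a bloom filter that skips symbols which
# cannot leave the initial state. Illustrative only.

INITIAL = 0
# (current_state, input_symbol, next_state) tuples
transitions = [
    (0, "a", 1),
    (0, "a", 2),   # non-deterministic: same (state, symbol), different next state
    (1, "b", 3),
    (2, "c", 3),
]

def step(active, symbol):
    """One input symbol: every matching transition sets a bit for the next step."""
    nxt = 1 << INITIAL            # keep the initial state active (new matches
                                  # may begin at any position in the stream)
    for cur, sym, dst in transitions:
        if sym == symbol and active & (1 << cur):
            nxt |= 1 << dst
    return nxt

# Bloom filter over symbols that trigger a transition out of the initial state.
BLOOM_BITS = 64
def hashes(sym):
    h = hash(sym) & 0xFFFFFFFF
    return [h % BLOOM_BITS, (h >> 8) % BLOOM_BITS]   # two indices from one hash

bloom = 0
for cur, sym, _ in transitions:
    if cur == INITIAL:
        for i in hashes(sym):
            bloom |= 1 << i

def maybe_triggers(sym):
    return all(bloom & (1 << i) for i in hashes(sym))

active = 1 << INITIAL
for sym in "xab":
    if active == (1 << INITIAL) and not maybe_triggers(sym):
        continue                  # skip: cannot leave the initial state
    active = step(active, sym)

assert active & (1 << 3)          # "a" then "b" reached state 3
```

A bloom false positive (such as "x" passing the filter) is harmless: the symbol merely takes the slow path through `step` and leaves the bitmap unchanged.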
Table 1 compares PPA results against other published NFA accelerators. It is a bit apples-to-oranges as the various designs target different technology nodes. The metric that stands out is the low power consumption of this design.

Source: https://dl.acm.org/doi/10.1145/3716368.3735157

Dangling Pointers

I wonder if the bloom filter trick can be extended. For example, rather than assuming the NFA will always be in the initial state, the hardware could dynamically compute which states are the most frequent and then use bloom filters to drop input symbols which cannot trigger any transitions from those states.


Better Memory Tiering, Right from the First Placement

João Póvoas, João Barreto, Bartosz Chomiński, André Gonçalves, Fedar Karabeinikau, Maciej Maciejewski, Jakub Schmiegel, and Kostiantyn Storozhuk
ICPE'25

This paper addresses the first placement problem in systems with multiple tiers of memory (e.g., DRAM paired with HBM, or local DRAM paired with remote DRAM accessed over CXL). The paper cites plenty of prior work which dynamically migrates pages/allocations out of suboptimal memory tiers. What is different about this paper is that it attempts to avoid placing data in a suboptimal tier in the first place. The key insight is: statistics from one allocation can be used to generate better placements for similar allocations which will occur in the future.

Fig. 3 offers insight into how much waste there is in a policy which initially places all pages into a fast tier and then migrates them to a slower tier if they are accessed infrequently. The figure shows results from one migration policy, applied to three benchmarks.

Source: https://dl.acm.org/doi/10.1145/3676151.3719378

Allocation Contexts

This paper proposes gathering statistics for each allocation context. An allocation context is defined by the source code location of the allocation, the call stack at the moment of allocation, and the size of the allocation. If two allocations match on these attributes, then they are considered part of the same context. The system hooks the heap allocation functions to track all outstanding allocations associated with each allocation context.

An x86 PMU event is used to determine how frequently each allocation context is accessed. A tidbit I learned from this paper is that some x86 performance monitoring features do more than just count events. For example, the event used here randomly samples load operations and emits the accessed (virtual) address. Given the accessed address, it is straightforward to map back to the associated allocation context.
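A minimal sketch of this context-keyed tracking, in Python (the context key follows the paper; the threshold, epoch mechanics, and sampling hook are hypothetical):

```python
# Sketch: allocations sharing a callsite, call stack, and size form one
# "allocation context"; sampled accesses divided by outstanding bytes
# give a hotness score that steers the next allocation's tier.
from collections import defaultdict

class Tiering:
    HOT_THRESHOLD = 0.01   # samples per byte per epoch (made-up value)

    def __init__(self):
        self.bytes_live = defaultdict(int)   # context -> outstanding bytes
        self.samples = defaultdict(int)      # context -> access samples this epoch
        self.hotness = {}                    # context -> last epoch's score
        self.addr_to_ctx = {}                # allocation address -> context

    def on_alloc(self, addr, size, callsite, stack):
        ctx = (callsite, stack, size)        # the paper's context key
        self.addr_to_ctx[addr] = ctx
        self.bytes_live[ctx] += size
        # First placement decision: use last epoch's hotness for this context.
        return "fast" if self.hotness.get(ctx, 0.0) >= self.HOT_THRESHOLD else "slow"

    def on_sample(self, addr):
        # A sampled load address (PMU-style) mapped back to its context.
        ctx = self.addr_to_ctx.get(addr)
        if ctx is not None:
            self.samples[ctx] += 1

    def end_epoch(self):
        self.hotness = {c: self.samples[c] / self.bytes_live[c]
                        for c in self.bytes_live if self.bytes_live[c]}
        self.samples.clear()

t = Tiering()
t.on_alloc(0x1000, 4096, "parse.c:42", ("main", "parse"))
for _ in range(100):
    t.on_sample(0x1000)
t.end_epoch()
# A later allocation from the same context now goes straight to the fast tier.
tier = t.on_alloc(0x2000, 4096, "parse.c:42", ("main", "parse"))
assert tier == "fast"
```

Real sampled addresses fall anywhere inside an allocation, so the lookup would be a range query rather than the exact-address dictionary used here.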
The hotness of an allocation context is the frequency of these access events divided by the total size of all allocations in the context. Time is divided into epochs. During an epoch, the hotness of each allocation context is recalculated. When a new allocation occurs, the hotness of the allocation context (from the previous epoch) is used to determine which memory tier to place the allocation into.

The paper only tracks large allocations (at least 64 bytes). For smaller allocations, the juice is not worth the squeeze. These allocations are assumed to be short-lived and frequently accessed.

This paper also describes a kernel component which complements the user space policy described so far. Whereas the user space code deals with allocations, the kernel code deals with pages. This is useful for allocations which do not access all pages uniformly. It is also useful for detecting and correcting suboptimal initial placements. All PTEs associated with all allocations are continually scanned. The accessed bit determines if a page has been read since the last scan. The dirty bit determines if a page has been written since the last scan. After 10 scans, the system has a pretty good idea of how frequently a page is accessed. These statistics are used to migrate pages between fast and slow tiers.

Fig. 8 shows execution time for three benchmarks, comparing the user and kernel solutions described by this paper against the baselines.

Source: https://dl.acm.org/doi/10.1145/3676151.3719378

Dangling Pointers

I wasn't able to find details in the paper about how PTE scanning works without interfering with other parts of the OS. For example, doesn't the OS use the dirty bit to determine if it needs to write pages back to disk? I assume the PTE scanning described in this paper must reset the dirty bit on each scan.

The definition of an allocation context seems ripe for optimization. I suspect that allowing some variability in call stack or allocation size would allow for better statistics.
Maybe this is a good use case for machine learning?


Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters

Kaiyang Zhao, Kaiwen Xue, Ziqi Wang, Dan Schatzberg, Leon Yang, Antonis Manousis, Johannes Weiner, Rik Van Riel, Bikash Sharma, Chunqiang Tang, and Dimitrios Skarlatos
ISCA'23

This paper has a lot of great statistics from the Meta fleet. Memory capacity per server is growing over time, but TLB size is not, thus TLB coverage (the fraction of main memory that can be referenced by the TLB at any one time) is trending downward:

Source: https://dl.acm.org/doi/10.1145/3579371.3589079

As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles!

Source: https://dl.acm.org/doi/10.1145/3579371.3589079

A larger page size could increase TLB coverage; however, large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world:

We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page.

The biggest culprit is unmovable (i.e., pinned) pages, e.g., pages pinned for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault (here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages.

Segmentation

The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions.
One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn't have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. The ultimate solution is dedicated hardware support for moving allocations in the unmovable region.

Source: https://dl.acm.org/doi/10.1145/3579371.3589079

Hardware Page Migration

The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving "unmovable" pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration.

The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When an access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page.

Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation.

Results

Fig. 10 has results for memory segmentation. "Partial" represents a case where physical memory is partially fragmented, whereas "full" represents full fragmentation.

Source: https://dl.acm.org/doi/10.1145/3579371.3589079

Dangling Pointers

If the NIC is the primary culprit, some vertical integration might be called for here.
For example, allocations used to send packets cycle through three states:

1. Empty
2. CPU writing to it
3. NIC reading from it

It seems like it would be fine for the OS to migrate a packet when it is not in state #3.
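The migration mechanism described above can be modeled in a few lines of Python (cache-line granularity and the redirection rule follow the paper; everything else is invented):

```python
# Toy model of the LLC-mediated page move: copy one cache line at a
# time, and redirect accesses to whichever copy currently holds the
# line's data. Purely illustrative.
LINE = 64
PAGE_LINES = 4096 // LINE

class Migration:
    def __init__(self, src, dst):
        self.src, self.dst = src, dst
        self.copied = [False] * PAGE_LINES   # per-cache-line progress

    def copy_next_line(self):
        i = self.copied.index(False)
        self.dst[i * LINE:(i + 1) * LINE] = self.src[i * LINE:(i + 1) * LINE]
        self.copied[i] = True

    def read(self, offset):
        # An access during migration is serviced from dst if that line moved.
        line = offset // LINE
        page = self.dst if self.copied[line] else self.src
        return page[offset]

    def write(self, offset, value):
        line = offset // LINE
        page = self.dst if self.copied[line] else self.src
        page[offset] = value

src = bytearray(b"A" * 4096)
dst = bytearray(4096)
m = Migration(src, dst)
m.copy_next_line()            # line 0 now lives in dst
m.write(0, ord("B"))          # lands in dst
m.write(LINE, ord("C"))       # line 1 not yet copied: lands in src
assert m.read(0) == ord("B") and m.read(LINE) == ord("C")
m.copy_next_line()            # copies line 1, carrying the "C" along
assert dst[LINE] == ord("C")
```

The real hardware additionally has to order the per-line copy against in-flight coherence traffic; this sketch only captures the redirection rule.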


Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs

Noushin Azami, Alex Fallin, and Martin Burtscher
ASPLOS'25

This paper describes four (creatively named) lossless compression algorithms:

SPspeed: 32-bit floating-point, optimized for speed
DPspeed: 64-bit floating-point, optimized for speed
SPratio: 32-bit floating-point, optimized for compression ratio
DPratio: 64-bit floating-point, optimized for compression ratio

The claim to fame here is excellent performance on both CPUs and GPUs. Each compressor is implemented as a pipeline of transformations, illustrated in Fig. 1:

Source: https://dl.acm.org/doi/10.1145/3669940.3707280

Decompression is similar, but the order of stages is reversed.

DIFFMS

The goal of DIFFMS is to transform the data such that most of the upper bits of each element to be compressed are zero. DIFFMS interprets inputs as integers (32- or 64-bit) and replaces each element with the difference between that element and the previous element. Differences are stored in sign-magnitude format. Neighboring values typically have small differences, so fewer bits are needed to store differences rather than raw values. Converting to sign-magnitude causes the upper bits of negative differences (which are close to zero) to be zero. The sign bit is stored in the least significant position, to ensure that it doesn't contaminate the most-significant bit with a one. The range of representable values in sign-magnitude format is one less than the range of values representable in two's complement, but I suppose that is OK because that situation can only arise if an input float is NaN.

MPLG

Next, MPLG (introduced in a previous paper) operates on chunks of elements. For each chunk, MPLG finds the element with the highest-order non-zero bit and uses that to determine the number of leading bits which can safely be removed from each element in the chunk. This number of leading bits is stored in per-chunk metadata, and the input stream is bit-packed after removing those leading bits. Fig. 3 illustrates the bit packing:

Source: https://dl.acm.org/doi/10.1145/3669940.3707280

BIT

The MPLG stage has the "one bad apple spoils the bunch" property. The BIT stage addresses that with a transpose. Each chunk is interpreted as a 2D array of bits, and that 2D array is transposed. Say most elements in a chunk only require the least significant 8 bits, but one element requires 10 bits. Then after the MPLG stage, most elements will have 2 leading zeros in them. After the BIT stage, these zeros will all be grouped together rather than spaced apart.

RZE

After BIT has arranged the data such that there are many long ranges of zero bits in it, Repeated Zero Elimination (RZE) finds and removes bytes which are equal to zero. RZE produces both the output stream (with "zero bytes" removed) and a bitmap indicating which bytes were removed. The authors found that RZE doesn't work well for the low-order bits of double-precision data. They address this with RAZE and RARE, which do not try to eliminate ranges of zeros from the low-order bits.

Each stage in the pipeline operates on chunks of data. The only interaction between chunks is that the size of the output data produced by chunk N is needed before the output offset of chunk N+1 is known. Efficient parallelization is possible in spite of this hazard on both CPU and GPU implementations. As far as I can tell, there is no explicit attempt to ensure that the data passed between stages is kept on-chip. It could be that CPU and GPU caches work well enough for this purpose.

These algorithms extend the Pareto frontier for both CPU and GPU.
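The core stages are easy to sketch. The Python below models DIFFMS, the MPLG bit-count decision, and RZE on toy data (chunk size and exact bit layouts are guesses, not the paper's format):

```python
# Sketch of three pipeline stages on 32-bit values: DIFFMS (delta in
# sign-magnitude form, sign bit moved to the LSB), MPLG-style per-chunk
# leading-zero trimming, and Repeated Zero Elimination.
CHUNK = 4

def diffms(values):
    out, prev = [], 0
    for v in values:
        d = (v - prev) & 0xFFFFFFFF
        prev = v
        if d & 0x80000000:                     # negative in two's complement
            mag = (-d) & 0xFFFFFFFF
            out.append((mag << 1) | 1)         # sign bit in the LSB
        else:
            out.append(d << 1)
    return out

def mplg_bits(chunk):
    # Bits to keep = position of the highest set bit anywhere in the chunk.
    return max(x.bit_length() for x in chunk) or 1

def rze(data: bytes):
    """Drop zero bytes; record a presence bitmap so they can be restored."""
    bitmap = [b != 0 for b in data]
    return bytes(b for b in data if b), bitmap

vals = [1000, 1001, 999, 1002]
deltas = diffms(vals)
# deltas: +1000 -> 2000, +1 -> 2, -2 -> 5, +3 -> 6
assert deltas == [2000, 2, 5, 6]

packed_bits = mplg_bits(deltas[:CHUNK])
assert packed_bits == 11          # 2000 needs 11 bits; 4*32 bits shrink to 4*11

stream, bm = rze(bytes([7, 0, 0, 3]))
assert stream == bytes([7, 3]) and bm == [True, False, False, True]
```

Note how the one large delta (2000) forces 11 bits on the whole chunk; that is exactly the "one bad apple" problem the BIT transpose then mitigates.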
Source: https://dl.acm.org/doi/10.1145/3669940.3707280

Dangling Pointers

The state of the art feels very hand-crafted, somewhat analogous to image classification before AlexNet moved the field from handcrafted feature engineering to more generalized models. In the text compression space, LZ-based compressors leave the same aftertaste.


EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation

Weigao Su and Vishal Shrivastav
ASPLOS'25

This paper describes incremental changes to Ethernet NICs and switches to enable efficient disaggregation of memory without the need for a separate network (e.g., CXL) for memory traffic. Fig. 1 shows the north star:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Servers are partitioned into Compute Nodes and Memory Nodes. When a compute node wants to access remote memory, it issues a request to its local NIC, which sends the request to the correct memory node (via a switch). The key problem this paper addresses is Ethernet fabric latency (i.e., the time taken for requests/responses to flow between NICs and switches). The paper assumes that the latency between the processor and the NIC is low (and cites other papers which describe techniques for reducing this latency to below 100ns). Typical Ethernet fabric latency is measured in microseconds, which is much higher than a local memory access.

The Ethernet hardware stack can be decomposed into MAC and PHY layers. The MAC is higher level and sits on top of the PHY. The paper proposes implementing EDM (Ethernet Disaggregated Memory) with modifications to the PHY layer in both the NIC and the switch. Normal network packets flow through the MAC and PHY as they usually would, but a side channel exists which allows remote memory accesses to be handled directly by the enhanced PHY layer. Fig. 3 illustrates the hardware changes in Ethernet NICs and switches.

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Remote memory access requests and responses are smaller than typical Ethernet packets. Additionally, end-to-end application performance is more sensitive to remote memory access latency than the latency of regular network traffic. The bulk of the paper describes how EDM achieves low latency for remote memory traffic. The EDM PHY modifications allow a memory request to preempt a non-memory packet.
Say the MAC sends a 1KiB packet to the PHY, which begins to send the packet over the wire in 66-bit blocks. If a memory request shows up in the middle of transmitting the network packet, the PHY can sneak the memory request onto the wire between 66-bit blocks, rather than waiting for the whole 1KiB to be sent.

Standard Ethernet requires 96 bits of zeros to be sent on the wire between each packet. This overhead is small for large packets, but it is non-trivial for small packets (like remote memory access requests). The EDM PHY modifications allow these idle bits to be used for remote memory accesses. The MAC still sees the gaps, but the PHY does not. If you ask an LLM what could possibly go wrong by trying to use the inter-frame gap to send useful data, it will spit out a long list. I can't find too much detail in the paper about how to ensure that this enhancement is robust. The possible problems are limited to the PHY layer, however, as the MAC still sees the zeros it expects.

To avoid congestion and dropping of memory requests, EDM uses an in-network scheduling algorithm somewhat like PFC. The EDM scheduler is in the PHY layer of the switch. Senders notify the switch when they have memory traffic to send, and the switch responds later with a grant, allowing a certain amount of data to be sent.

The authors implemented EDM on FPGAs (acting as both NIC and switch). Table 1 compares latencies for TCP/IP, RDMA, raw Ethernet packets, and EDM, breaking down latencies at each step:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Fig. 7 throws CXL into the mix:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Dangling Pointers

Section 3.3 "Practical Concerns" has a discussion of what could go wrong (e.g., fault tolerance and data corruption). It is hard to judge how much work is needed to make this into something that industry could rely on.
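The block-granular preemption can be sketched as a toy framer in Python (block labels and the arrival model are invented; real 64b/66b encoding details are omitted):

```python
# Toy PHY framer: a frame goes out as a sequence of 66-bit blocks, and a
# memory request that becomes ready mid-frame is slotted between blocks
# instead of waiting for the rest of the frame.

def serialize(frame_blocks, arrivals):
    """arrivals: {block_index: request} - the request becomes ready while
    the frame is still being transmitted."""
    wire = []
    for i, blk in enumerate(frame_blocks):
        if i in arrivals:                 # preempt between 66-bit blocks
            wire.append(("MEM", arrivals[i]))
        wire.append(("DATA", blk))
    return wire

# A 1 KiB frame is ~128 blocks; a request arriving at block 2 goes out
# immediately instead of ~126 block-times later.
wire = serialize(["blk0", "blk1", "blk2", "blk3"], {2: "read@0xbeef"})
assert wire == [("DATA", "blk0"), ("DATA", "blk1"),
                ("MEM", "read@0xbeef"), ("DATA", "blk2"), ("DATA", "blk3")]
```

The receiving PHY would strip the MEM blocks back out before reassembling the frame for its MAC, which is why the MAC layers on both sides are unaware of the preemption.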


An Analysis of User-space Idle State Instructions on x86 Processors

Malte-Christian Kuns, Hannes Tröpgen, and Robert Schöne
ICPE'25

I've long believed that busy waiting is poor form. The closest thing you should ever come to busy waiting is to lock a mutex, which will busy wait for a short while on your behalf. If your primary concern is power consumption, then busy waiting may be less offensive on a modern processor. This paper describes newly added x86 instructions to enable low power busy waiting from user space, and has a ton of data to help you sleep better at night.

TPAUSE puts the processor into a low power state for a user-specified amount of time. TPAUSE supports two low power states (C0.1 and C0.2), which trade power consumption for wake-up latency. TPAUSE can be called in user space but doesn't wrest control of the core away from the OS. The trick is the OS can set a maximum timeout value, which gives the OS a chance to switch away from the busy waiting thread.

The UMONITOR and UMWAIT instructions are similar to TPAUSE but allow the processor to be woken up when a write occurs in a specified memory range. UMONITOR sets up the memory range to be monitored, and UMWAIT causes the processor to enter a low power state. UMWAIT accepts a timeout value and a target power state (just like TPAUSE). AMD supports similar functionality via the MONITORX and MWAITX instructions.

A key question the paper investigates is how closely the user-specified timeout is honored. Fig. 1 shows results for three Intel cores:

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

Times are measured in timestamp counter cycles (roughly 3 GHz for Alder Lake). The plateau at the top right is caused by the OS-specified maximum timeout. The authors find that timeout values are quantized (e.g., 83 cycles on an Alder Lake P-core). Additionally, for short timeouts the processor may ignore the user-requested power state (presumably because it doesn't make sense to enter a deep sleep for a short amount of time).
On Alder Lake P-cores, the threshold below which the processor will not enter the lowest power state is around 23,000 TSC cycles. Alder Lake E-cores seem to only support one low power state.

Fig. 3 measures how much the processor can "oversleep" (wake up later than requested) depending on processor frequency and requested power state:

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

And finally, Table 2 shows measured power consumption for these new instructions vs old-fashioned busy wait loops that use the PAUSE instruction (which does not support a user-specified timeout):

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

I'm shocked by the advantage that AMD has here. If CPU core power during busy waiting is your primary concern, then you should choose your chip carefully.

Dangling Pointers

Condition variables are a general and useful abstraction. It would be nice if code that used condition variables could automatically benefit from these instructions. Maybe some compiler and/or hardware assistance is necessary to enable that.
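The plateau and quantization effects can be captured in a toy timing model (the constants echo the paper's Alder Lake observations, but the OS cap value and the rounding direction are my guesses):

```python
# Toy model of observed TPAUSE timing: the user-requested TSC deadline
# is capped by an OS-programmed maximum and rounded to a hardware
# quantum, and short waits are denied the deep C0.2 state.
OS_MAX_WAIT = 100_000   # cycles; set by the OS (hypothetical value)
QUANTUM = 83            # observed timeout granularity on an Alder Lake P-core
C02_THRESHOLD = 23_000  # below this, the deep C0.2 state is not entered

def tpause_model(now_tsc, deadline_tsc, want_deep):
    wait = max(0, min(deadline_tsc - now_tsc, OS_MAX_WAIT))
    wait = -(-wait // QUANTUM) * QUANTUM          # round up to the quantum (a guess)
    state = "C0.2" if want_deep and wait >= C02_THRESHOLD else "C0.1"
    return now_tsc + wait, state

# Short waits get only the shallow state and a quantized duration...
wake, state = tpause_model(0, 1000, want_deep=True)
assert state == "C0.1" and wake % QUANTUM == 0 and wake >= 1000
# ...and very long requests plateau at the OS-imposed cap.
wake, _ = tpause_model(0, 10**9, want_deep=True)
assert wake == -(-OS_MAX_WAIT // QUANTUM) * QUANTUM
```

This is just a model of the measurements in Fig. 1, not of the architectural specification; real behaviour also depends on C-state residency heuristics the paper does not fully characterize.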


Re-architecting End-host Networking with CXL: Coherence, Memory, and Offloading

Houxiang Ji, Yifan Yuan, Yang Zhou, Ipoom Jeong, Ren Wang, Saksham Agarwal, and Nam Sung Kim
MICRO'25

This paper is the third one that I've posted about which deals with the subtleties of interfacing a NIC and a host CPU. Here are links to the previous posts on this subject:

Disentangling the Dual Role of NIC Receive Rings
CEIO vs rxBisect: Fixing DDIO's Leaky DMA Problem

The authors bring a new hammer to the construction site: CXL, which offers some interesting efficiencies and simplifications. This paper shows how CXL can address two specific problems with the HW/SW interface of a typical PCIe NIC:

1. After the host prepares a packet to be transmitted, it notifies the NIC with an MMIO write. This MMIO write is expensive because it introduces serialization into the host processor pipeline.
2. When the NIC sends a received packet to the host, ideally it would write data to the LLC rather than host DRAM. However, if the host CPU cannot keep up, then the NIC should have a graceful fallback.

CXL Type-1 devices are asymmetric: the device has coherent access to host memory, but the host does not have coherent access to device memory. Practically speaking, both packet descriptors and packet payloads must still be stored in host memory (no change from PCIe-based NICs). Because the NIC has coherent access to host memory, it can safely prefetch receive descriptors (RxDs) into an on-NIC cache. When a packet arrives, the NIC can grab a descriptor from the cache and thus avoid an expensive host memory read to determine where to write packet data. If the host CPU updates an RxD after the NIC has prefetched it, the CXL cache coherence protocol will notify the NIC that it must invalidate its cached data.

Coherence also enables the tail pointers for transmit ring buffers to be safely stored in host memory.
The host networking stack can update a tail pointer with a regular store instruction (rather than an MMIO write). The NIC can continually poll this value, using coherent reads. If the tail index pointer has not been updated since the last poll, the NIC will read a cached value and not generate any PCIe traffic.

CXL Type-2 NICs allow packets and descriptors to be stored in NIC memory. The host CPU can cache data read from the NIC, as the NIC will generate the necessary coherence traffic when it reads or writes this data. The design space (what data goes into what memory) is large, and the results section has numbers for many possible configurations. Section 5.3 of the paper describes how a type-2 NIC can intelligently use a CXL operation to write received packet data directly into the host LLC. This is similar to DDIO (described in the two papers linked at the top of this post), but the key difference is that the NIC is in the driver’s seat. The CEIO paper proposes monitoring LLC usage and falling back to storing received packets in DRAM local to the NIC if the LLC is too full. With CXL, the NIC has the option to write data to host memory directly (bypassing the LLC), thus avoiding the need for DRAM attached to the NIC.

The authors implemented a CXL NIC on an Altera FPGA. They compared results against an NVIDIA BlueField-3 PCIe NIC. Fig. 10 compares loopback latency for the two devices, normalized to the BlueField-3 latency (lower is better) for a variety of CXL configurations.

Source: https://dl.acm.org/doi/pdf/10.1145/3725843.3756102

Dangling Pointers

One fact I took away from this paper is that CXL coherence messages are much cheaper than MMIOs and interrupts. Burning a CPU core polling a memory location seems wasteful to me. It would be nice if that CPU core could at least go into a low power state until a relevant coherence message arrives.
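The tail-pointer handshake can be sketched as a software analogy (names and structure are mine, not the paper's; a real NIC would observe the tail via coherent reads into its own cache rather than a Python field):

```python
# Software analogy of the coherent tail-pointer handshake: the "host"
# advances a tail index with a plain store, and the "NIC" polls it. If the
# tail has not moved, the NIC's poll would hit its cached copy and generate
# no PCIe traffic; here we only model the ring logic itself.

RING_SIZE = 8

class TxRing:
    def __init__(self):
        self.descriptors = [None] * RING_SIZE
        self.tail = 0   # written by the host with a regular store (no MMIO)
        self.head = 0   # the NIC's consumption point

    def host_post(self, packet):
        """Host writes a descriptor, then bumps the tail pointer."""
        self.descriptors[self.tail % RING_SIZE] = packet
        self.tail += 1  # plain store; the NIC observes it via coherent polling

    def nic_poll(self):
        """NIC compares its head against the (coherently read) tail."""
        sent = []
        while self.head < self.tail:
            sent.append(self.descriptors[self.head % RING_SIZE])
            self.head += 1
        return sent

ring = TxRing()
ring.host_post("pkt-a")
ring.host_post("pkt-b")
print(ring.nic_poll())  # ['pkt-a', 'pkt-b']
print(ring.nic_poll())  # [] -- tail unchanged, the poll hits the cached value
```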

Dangling Pointers 1 month ago

Forest: Access-aware GPU UVM Management

Forest: Access-aware GPU UVM Management Mao Lin, Yuan Feng, Guilherme Cox, and Hyeran Jeon ISCA'25

Unified virtual memory is an abstraction which presents a single unified address space for both the CPU and GPU. This is a convenient programming model because it allows one device to create a complex data structure (with pointers) and pass that directly to the other device. Maintaining this illusion in systems with discrete GPUs is a complex task. The state-of-the-art involves initially placing allocations in host memory and then copying them to GPU memory to resolve GPU page faults (far-faults). Prefetching can help avoid some far-faults. A state-of-the-art prefetcher is called the tree-based neighboring prefetcher (TBNp). Fig. 3 shows the TBNp data structure for a 512KiB allocation:

Source: https://dl.acm.org/doi/10.1145/3695053.3731047

TBNp tracks each 2MiB of virtual memory with a binary tree containing 64KiB leaf nodes. When a far-fault occurs on a 4KiB page (yellow “1” in Fig. 3), the remaining 60KiB (blue “2” in Fig. 3) in the leaf are prefetched. Once the leaves contained by a particular sub-tree are >= 50% resident in GPU memory, the remainder of the sub-tree is prefetched. In Fig. 3 that situation occurs when the three left-most leaf nodes have been transferred to GPU memory. At that point, the remaining leaf node in the subtree (green “7” in Fig. 3) is prefetched.

Access Pattern Profiling

This paper proposes customizing the TBNp trees associated with each object, with the help of access pattern profiling hardware. Fig. 7 illustrates the design:

Source: https://dl.acm.org/doi/10.1145/3695053.3731047

The GPU MMU tracks access times for each object (i.e., each allocation) and for each page. Once a healthy amount of profiling data has been collected, the GPU fires an interrupt which notifies the driver that it should grab the latest statistics.
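The baseline TBNp policy described above can be sketched as a toy model of one 512KiB region with 8 leaves of 64KiB each (the paper says ">= 50%"; I use a strict majority here so that the Fig. 3 example, where three left-most resident leaves trigger prefetch of the fourth, falls out of the model; the exact accounting in the paper may differ):

```python
# Toy model of tree-based neighboring prefetching (TBNp): a far-fault
# fetches the faulting page's whole 64 KiB leaf; whenever more than half
# of the leaves under an internal tree node are resident, the rest of that
# node's leaves are prefetched too.

NUM_LEAVES = 8  # 512 KiB region / 64 KiB leaves

def far_fault(resident, page_addr):
    """Handle a far-fault on a 4 KiB page; returns the new resident-leaf set."""
    resident = set(resident)
    resident.add((page_addr // (64 * 1024)) % NUM_LEAVES)

    # Walk subtree sizes bottom-up, prefetching the remainder of any
    # subtree in which a strict majority of leaves is already resident.
    size = 2
    while size <= NUM_LEAVES:
        for lo in range(0, NUM_LEAVES, size):
            subtree = set(range(lo, lo + size))
            if len(subtree & resident) > size / 2:
                resident |= subtree
        size *= 2
    return resident

r = set()
for leaf in (0, 1, 2):
    r = far_fault(r, leaf * 64 * 1024)
print(sorted(r))  # [0, 1, 2, 3] -- the fourth leaf was prefetched
```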
Access Pattern Classification

The driver classifies access patterns for each object into one of the following four buckets:

1. Linear/Streaming
2. Non-Linear, High-Coverage, High-Intensity
3. Non-Linear, High-Coverage, Low-Intensity
4. Non-Linear, Low-Coverage

The paper uses tree-based prefetching like TBNp, but configures the trees differently for each object depending on the access pattern bucket. This is where the name “Forest” comes from: each tree has a maximum size, so large objects are chopped up and tracked with multiple trees. Streaming accesses are detected with linear regression. If the R² value is close to 1, then the driver classifies the accesses as streaming. The prefetching trees used for objects accessed in a streaming manner are 4MiB in size and contain 256KiB leaf nodes (relatively large). If the driver determines accesses are not streaming, then the choice between the remaining three buckets is determined by the access coverage and access intensity. Access coverage is computed based on the minimum and maximum page numbers accessed during profiling. Access intensity is based on the number of accesses during profiling. For objects with high access coverage and high access intensity, the associated prefetching trees are 512KiB in size and contain 64KiB leaf nodes. For objects with high access coverage and low access intensity, the associated prefetching trees are 512KiB in size and contain 16KiB leaf nodes. Finally, for objects with low access coverage, the associated prefetching trees are 2MiB in size and contain 64KiB leaf nodes.

Fig. 12 contains simulation results for a number of benchmarks:

Source: https://dl.acm.org/doi/10.1145/3695053.3731047

One configuration is the design described in this paper; another is a modification that avoids the overheads associated with initial profiling by trying to make better initial guesses about access patterns before profiling data is available. I wonder how much vertical integration can help here.
Certainly, a number of applications have enough context to make smarter decisions than the driver relying on profiling information.
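The classification heuristic above can be sketched as follows (the R² cutoff, coverage cutoff, and intensity threshold are invented stand-ins for illustration; the paper's exact values differ):

```python
# Toy version of Forest's access-pattern classification: fit a line to the
# profiled (access-order, page-number) samples; near-perfect fits are
# streaming, otherwise coverage and intensity pick the bucket.

def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 1.0  # degenerate samples: treat as a perfect fit
    return (sxy * sxy) / (sxx * syy)

def classify(pages, object_num_pages, intensity_threshold=16):
    """Map a profiled sequence of page numbers to one of the four buckets."""
    xs = list(range(len(pages)))
    if r_squared(xs, pages) > 0.95:
        return "linear/streaming"
    coverage = (max(pages) - min(pages) + 1) / object_num_pages
    if coverage < 0.5:
        return "non-linear, low-coverage"
    if len(pages) >= intensity_threshold:
        return "non-linear, high-coverage, high-intensity"
    return "non-linear, high-coverage, low-intensity"

print(classify([0, 1, 2, 3, 4, 5, 6, 7], 8))  # linear/streaming
print(classify([0, 7, 3, 5, 1, 6, 2, 4], 8))  # non-linear, high-coverage, low-intensity
```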


UPP: Universal Predicate Pushdown to Smart Storage

UPP: Universal Predicate Pushdown to Smart Storage Ipoom Jeong, Jinghan Huang, Chuxuan Hu, Dohyun Park, Jaeyoung Kang, Nam Sung Kim, and Yongjoo Park ISCA'25

Working on hardware acceleration requires a healthy dose of honesty. If you try hard enough, you can find clever ways to accelerate a given application. Once that point is reached, it is helpful to step back and ask yourself: “are these ideas generally applicable to many hardware architectures, or only to the one I am targeting?” This paper describes techniques for high-performance filtering of OLAP data, and an FPGA implementation. I wonder if these ideas would also work well on other chips.

This paper rests on two assumptions:

1. Inputs are relatively static, which means that the cost of preprocessing can be amortized over many queries
2. Best-effort filtering is OK, because the system has another level of filtering to catch any false positives (rows which should have been removed, but were not)

A preprocessing step generates a 256-bit row vector (RV) associated with each row. These bits are partitioned among all columns (e.g., if there are 8 columns in the relation, then each column is represented with 32 bits per row). When a query is run, the relevant filters from the query are converted into a set of 256-bit query vectors (QVs) and simple instructions which perform logical operations between the row vectors and query vectors. The result of those instructions is a single bit per row which determines if the row can be safely removed.

Numerical expressions are supported for monotone functions. During the preprocessing step, the lower and upper bounds of each column are computed. This space is divided into a fixed number of buckets. For each value in a column, the associated bucket index is computed, and the associated bit in the row vector is set to 1. Hashing is used to handle the case where there are more buckets than bits allocated for a given column.
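The bucket encoding just described, and the query-side check it enables, can be sketched as a toy (8 equal-width buckets for one column, a 1:1 bucket-to-bit mapping with no hashing, and a `< threshold` filter; all of these choices are mine, for illustration):

```python
# Toy version of UPP-style numeric row-vector / query-vector filtering for
# one column: each row gets one bit set (its value's bucket), and a query
# vector sets a bit for every bucket that could contain a matching value.

NUM_BUCKETS = 8

def bucket_of(v, lo, hi):
    """Map a value into one of NUM_BUCKETS equal-width buckets over [lo, hi]."""
    width = (hi - lo) / NUM_BUCKETS
    return min(int((v - lo) / width), NUM_BUCKETS - 1)

def row_vector(v, lo, hi):
    """Per-row bits for this column: a single bucket bit."""
    return 1 << bucket_of(v, lo, hi)

def query_vector_less_than(threshold, lo, hi):
    """Set a bit for every bucket whose range could hold a value < threshold."""
    qv = 0
    width = (hi - lo) / NUM_BUCKETS
    for b in range(NUM_BUCKETS):
        if lo + b * width < threshold:  # bucket's lower bound below threshold
            qv |= 1 << b
    return qv

lo, hi = 0.0, 80.0                         # column min/max from preprocessing
qv = query_vector_less_than(25.0, lo, hi)  # filter: col < 25
for v in (3.0, 24.0, 27.0, 31.0):
    keep = (row_vector(v, lo, hi) & qv) != 0
    print(v, "keep" if keep else "filter out")
# 27.0 is kept even though 27 >= 25: a false positive, caught by the
# exact filtering stage later in the pipeline.
```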
When a query is executed, the software can determine the set of buckets which the query references, and build a query vector (LSB first) which selects the rows that should not be filtered. To determine if a row should be filtered, compute the bitwise AND of the row vector and the query vector. If all bits of the result are zero, then the row can be removed.

To convert a string into a row vector, the paper proposes tokenizing the string, and then hashing each token (i.e., word) to determine which bit in the row vector to set. This means that multiple bits in a row vector can be set (one bit per word). Only tokens which appear frequently in the dataset are hashed; the rest are ignored. A query expression containing a three-word phrase is decomposed into three tokens, each token is hashed, and the hash values determine which bits in the query vector are set. Rows are accepted if they have all 3 bits set. Note that this is best-effort filtering: a row whose column contains all three tokens, but not the exact phrase, will not be removed.

Table 3 shows FPGA resource usage numbers for an FPGA accelerator which executes queries by performing bitwise operations on row and query vectors, and compacting tables based on the generated bitmaps. These numbers seem pretty modest (i.e., good) to me. The authors argue that this implementation is small enough that it could be incorporated into a Smart SSD, allowing queries to be pushed down as far as possible.

Source: https://dl.acm.org/doi/10.1145/3695053.3731005

Fig. 6 shows TPC-H performance results. Each pair of bars represents a particular query run on a baseline system, and on a system with filtering pushed down to a Smart SSD. Q21 doesn’t see a speedup because it is bound by join and aggregation, not filtering.

Source: https://dl.acm.org/doi/10.1145/3695053.3731005

Dangling Pointers

I wonder how much of this is overfitting to TPC-H.
If you look at the TPC-H spec, a lot of string columns are generated by randomly sampling from a very small set of possible tokens. It would be great if the industry had a “held out test set” which could be used to evaluate OLAP performance on real-world yet hidden datasets which researchers could not directly see.


RayN: Ray Tracing Acceleration with Near-memory Computing

RayN: Ray Tracing Acceleration with Near-memory Computing Mohammadreza Saed, Prashant J. Nair, and Tor M. Aamodt MICRO'25

A previous paper on OLTP acceleration with processing-in-memory hardware came to the conclusion that the ripest fruit to pick is pointer chasing. This paper comes to a similar conclusion for a different application: ray tracing. Ray tracing applications spend most of their time traversing a bounding volume hierarchy (BVH) which contains the scene to be rendered. The BVH data structure analyzed in this paper is a tree, where each node in the tree contains an axis-aligned bounding box which bounds the subtree rooted at that node. To determine which part of the scene a ray intersects, a ray tracer traverses the BVH starting at the root, recursing into a node if the ray intersects its bounding box. A typical BVH is large and does not fit into on-chip memory.

The problem with this process is that it is not coherent. If a bundle of 8 rays is processed together, one would hope that the traversals performed by each ray are similar, but alas, they frequently are not. This means that the cost of reading BVH nodes from off-chip memory cannot be amortized across many rays. Fig. 3 shows the potential speedup if BVH reads were free:

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

BVH Traversal in HBM Logic Dies

Fig. 5 shows a GPU with four HBM stacks:

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

The yellow rectangles at the bottom of each stack are the key here: each one is an HBM logic die, which can implement a modest amount of functionality. The potential speedup comes from the fact that the BVH traversal logic in the HBM logic die has a much lower latency when accessing its local BVH data and thus can traverse the BVH faster than the GPU can. Typical GPUs contain sophisticated memory controllers. These are implemented in the GPU directly because the GPU die is the best location for complex (i.e., heat-generating) logic.
The authors propose three designs for integrating simple BVH traversal logic into HBM logic dies:

1. JEDEC-compatible: the GPU retains its sophisticated memory controller, and the HBM logic die integrates a simple memory controller and BVH traversal logic. At any one time, a particular HBM stack can be used for general compute or ray tracing. The driver on the host OS controls this configuration at a coarse granularity.
2. Unified: add a memory controller and BVH traversal logic to the HBM logic die, and remove the memory controller from the GPU. This avoids having to partition HBM stacks, but potentially limits the sophistication of the memory controller, and requires significant changes to the GPU.
3. Hybrid: there are memory controllers in both the GPU and HBM logic dies, but both can be active at the same time. When both are active, the GPU memory controller sends requests to the HBM logic die controller, which interleaves these requests with requests coming from the BVH traversal logic on the logic die. This scheme requires larger GPU changes than the JEDEC-compatible option but avoids coarse-grain partitioning.

Fig. 6 illustrates these three designs:

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

BVH Partitioning

A recurring puzzle with processing-in-memory applications is how to partition across memory modules in a way that doesn’t cause load imbalance. The authors of this paper propose that the root of the BVH and nodes which are close to the root should be accessed by the GPU as they normally would be. The proposed design builds on a common two-level BVH implementation for GPUs. In these implementations, the BVH is divided into the top-level acceleration structure (TLAS) and bottom-level acceleration structures (BLAS). The TLAS contains the root and some interior nodes; each BLAS contains other interior nodes and leaf nodes.
One key design choice here is that multiple instances of a BLAS can appear in a scene (e.g., a single BLAS describes a fish, and the scene can contain 15 fishes). In this case, the TLAS contains multiple pointers to a single BLAS. The paper examines many alternative designs and concludes that BLAS Breaking (BB) is the optimal approach. In this design the GPU treats the TLAS and top-level nodes of each BLAS as normal and lets the BVH traversal logic in the logic dies only handle traversal of nodes near the leaves in each BLAS. There are two reasons for this design. The first is that the memory coherence problem grows in significance the further away from the root a bundle of rays gets (i.e., near the root of the BVH, all rays in a bundle will typically follow similar paths). The second reason is that it would be complex to support ray tracing acceleration of the TLAS because multiple instances of each BLAS may exist. This would require the ray tracing units in the HBM logic dies to communicate with each other.

Subtrees within a BLAS are partitioned among the HBM stacks. Fig. 8 illustrates an example BVH which has been partitioned with BLAS Breaking. Green nodes are processed by the GPU as they normally would be; once a subtree of another color is reached, the rest of the BVH traversal is handled by the logic die in the HBM stack indicated by the color of the subtree.

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

Results

Fig. 11 has simulated speedups, comparing various designs to a baseline that does not have ray tracing offloads in HBM logic dies. The approach advocated by this paper is the unified memory controller combined with BLAS Breaking.

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

Dangling Pointers

If the ray tracing logic in the HBM die is bound by memory latency, I wonder if it needs to be so specialized.
Maybe one could get away with more general-purpose logic there (e.g., UPMEM processors) and still get similar results for ray tracing applications.
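Stepping back, the pointer-chasing loop that RayN moves into the logic die is ordinary BVH traversal. A minimal sketch, with 1-D bounds and a point query standing in for 3-D boxes and a real ray/AABB intersection test (the node layout is mine, not the paper's):

```python
# Minimal BVH traversal of the kind RayN offloads: recurse into a node only
# if the query hits its bounding interval, pruning whole subtrees otherwise.

class Node:
    def __init__(self, lo, hi, children=(), primitive=None):
        self.lo, self.hi = lo, hi    # axis-aligned bounds (1-D here)
        self.children = children
        self.primitive = primitive   # set on leaf nodes only

def traverse(node, point, hits):
    """Collect primitives whose bounding intervals contain `point`."""
    if not (node.lo <= point <= node.hi):
        return                       # miss: prune this entire subtree
    if node.primitive is not None:
        hits.append(node.primitive)
    for child in node.children:
        traverse(child, point, hits)

# Two-level flavor: a TLAS-like root over two BLAS-like subtrees.
blas_a = Node(0, 4, children=(Node(0, 2, primitive="tri-A"),
                              Node(2, 4, primitive="tri-B")))
blas_b = Node(6, 9, children=(Node(6, 9, primitive="tri-C"),))
root = Node(0, 9, children=(blas_a, blas_b))

hits = []
traverse(root, 3, hits)
print(hits)  # ['tri-B']
```

Because each recursive step depends on a node fetched by the previous one, the traversal is a chain of dependent memory reads, which is exactly why moving it next to the memory pays off.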


SHADOW: Simultaneous Multi-Threading Architecture with Asymmetric Threads

SHADOW: Simultaneous Multi-Threading Architecture with Asymmetric Threads Ishita Chaturvedi, Bhargav Reddy Godala, Abiram Gangavaram, Daniel Flyer, Tyler Sorensen, Tor M. Aamodt, and David I. August MICRO'25

Some processors excel at exploiting instruction level parallelism (ILP), others excel at thread level parallelism (TLP). One approach to dealing with this heterogeneity is to assign each application to a microarchitecture which is suited for it (e.g., OLTP on ILP-optimized processors, OLAP on TLP-optimized processors). The problem is that some applications defy categorization: the characteristics of the workload may flip-flop at high frequency (milliseconds or seconds). This paper proposes a microarchitecture (named SHADOW) that is specialized for these fickle workloads. Fig. 5 summarizes the design described in this paper:

Source: https://dl.acm.org/doi/10.1145/3725843.3756070

This processor exposes five hardware threads. One hardware thread (the Out-of-Order thread, in blue) exploits ILP. The other four threads (In-Order threads, in yellow) exploit TLP. Green components in Fig. 5 are shared between the threads. It may not look like much is shared, but a lot is (e.g., caches, fetch, decode, execute). The OoO thread has all the bells and whistles you would expect: branch prediction, a reorder buffer, register renaming. The InO threads look more like what you would see on a GPU: no branch prediction, no register renaming. Most of the pipeline resources are shared between threads with time-division multiplexing. For example, on each cycle, a single hardware thread accesses the instruction cache.

The SHADOW architecture assumes some help from software. The OS can use a new instruction during OS-level context switches to inform hardware about the thread being switched to; for example, it can tell the hardware whether the thread should be considered InO or OoO.
The hardware tracks the number of InO threads actively running and uses that to allocate physical registers to the OoO thread. For example, if no InO threads are running, then the OoO thread can use all physical registers.

The paper advocates for a simple software method for dividing work between OoO and InO threads: work-stealing. If an application has a bunch of parallel work to do, dispatch that work to a pool of threads (consisting of both OoO and InO threads). The idea is that if the workload is OoO friendly (not too many cache misses), then the OoO thread will naturally run faster than the InO threads and thus perform more of the work. The arbiters at each pipeline stage are designed with this policy in mind. If you are like me, when you read that, you thought “if an application has a bunch of parallel work, then shouldn’t it just use the InO threads?” The answer is “no”. The key point is that OoO can do better for some parallelizable workloads, for example because a single fast thread does not require N times the working set in caches.

Fig. 10 shows benchmark results for 9 hardware configurations:

Source: https://dl.acm.org/doi/10.1145/3725843.3756070

Table 3 adds more color to those results:

Source: https://dl.acm.org/doi/10.1145/3725843.3756070

The authors estimate that the power and area overhead of adding InO threads to an existing out-of-order core is about 1%. Wowzers, those are some inexpensive threads. I imagine the engineering/validation costs are a bigger concern. The proposed software parallelization model assumes that running N threads multiplies the working set by N, which is true for most parallelization schemes. Pipeline parallelism doesn’t have this property. Maybe everything looks like a nail to me, but pipeline parallelism on CPUs seems to be an under-researched topic.
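The work-stealing argument can be sketched with a deterministic simulation (made-up per-step speeds, and ordinary software loops standing in for OoO/InO hardware threads):

```python
# Sketch of the work-stealing idea: one fast (OoO-like) worker and several
# slow (InO-like) workers pull from a shared queue. Nobody partitions the
# work up front; the faster worker simply ends up doing more of it.

from collections import deque

def run(work_items, worker_speeds):
    """Simulate rounds in which worker w retires worker_speeds[w] items."""
    queue = deque(range(work_items))
    done = [0] * len(worker_speeds)
    while queue:
        for w, speed in enumerate(worker_speeds):
            for _ in range(speed):
                if not queue:
                    break
                queue.popleft()
                done[w] += 1
    return done

# OoO-friendly workload: the OoO "thread" retires 3 items per step,
# each InO "thread" retires 1.
print(run(100, [3, 1, 1, 1]))  # [51, 17, 16, 16] -- OoO does half the work
```

If the workload instead thrashed the cache and every thread ran at the same speed, the same code would spread the work evenly, which is the point: the split adapts without any software reconfiguration.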


Light-weight Cache Replacement for Instruction Heavy Workloads

Light-weight Cache Replacement for Instruction Heavy Workloads Saba Mostofi, Setu Gupta, Ahmad Hassani, Krishnam Tibrewala, Elvira Teran, Paul V. Gratz, and Daniel A. Jiménez ISCA'25

Caches with LRU policies are great, unless the working set exceeds the cache size (e.g., in RTL simulation). This paper describes a simple cache replacement policy (PACIPV) which outperforms LRU in situations where the working set can be thought of as two separate sets of data:

1. One set is accessed in a round-robin fashion, and is larger than the cache
2. The other set is accessed in a more random fashion, and/or is smaller than the cache

The findings of the paper can be boiled down to two insertion/promotion vectors (IPVs) which describe the cache replacement policy. The paper assumes a 4-way set associative cache and finds IPVs which work in that case, but the technique generalizes.

The cache tracks two bits of metadata for each cache line. These two bits define the re-reference prediction value (RRPV) for a cache line. A low RRPV (e.g., 0 or 1) indicates that the cache line will likely be referenced again before it is evicted. A high RRPV (e.g., 2 or 3) indicates that the cache line is unlikely to be referenced again before it is evicted. When the cache needs to evict a line from a particular set, it examines the RRPVs associated with each cache line. If any cache line has an RRPV of 3, then it can be evicted. If all RRPVs in the set are less than 3, then all RRPVs are repeatedly incremented (i.e., they move closer to 3) until at least one unlucky cache line has an RRPV of 3 and is chosen for eviction.

An IPV defines a state machine which is responsible for updating RRPVs on cache insertion and cache hit events. An IPV comprises five integers. The last integer in the list defines the RRPV assigned to a cache line when it is inserted into the cache. The remaining four integers define how RRPVs are updated on each cache hit.
Symbolically, if the IPV is (V0, V1, V2, V3, V4), then the RRPV associated with a newly inserted cache line is V4. If a cache hit occurs on a cache line with an RRPV of 3, then the RRPV of that line is updated to be V3. Similarly, if a hit occurs on a cache line with an RRPV of 1, then the new RRPV is set to V1. RRPVs are dynamically tracked by the hardware on a per-cache-line basis. RRPVs only apply to cache lines which are present in a cache; they are forgotten when a cache line is evicted.

Fig. 3 shows an example IPV ([0, 1, 1, 2, 2]) and two state transitions. Initially, one cache line has an RRPV of 0, another has an RRPV of 1, and so on. The RRPV of cache line C is updated when C is accessed. Because C has an RRPV of 2 at the time of the access, the new RRPV is 1 (the bold element in the IPV: [0, 1, 1, 2, 2]). Next, a new line is inserted into the cache, and the cache line with an RRPV of 3 is selected to be evicted (recall that only cache lines with RRPV=3 can be evicted). The RRPV associated with the newly inserted cache line is initialized to 2 (the last number in the IPV).

Source: https://dl.acm.org/doi/10.1145/3695053.3730993

Prefetch IPV

A cache policy is defined by two IPVs. One IPV determines how RRPVs are updated when a demand access (e.g., a load or store instruction) occurs. The other IPV defines how RRPVs are updated when a prefetch occurs. I assume this applies to both hardware and software prefetching.

Magic Values

Section 4 of the paper describes how many candidate IPVs were evaluated under simulation, yielding a best general-purpose pair: one IPV for demand accesses and one for prefetch accesses. The way I think of this is that cache lines are initially placed into the cache “under suspicion” (RRPV=3). This means that the cache policy does not have high confidence that a cache line will be referenced a second time. If it is referenced a second time, however, then the cache line becomes the teacher’s pet and is promoted all the way to RRPV=0. It is hard to stay the teacher’s pet for a long time.
If any eviction occurs and no other line has an RRPV of 3, then the cache line will be downgraded to RRPV=1. RRPV=1 is fairly sticky, however, because accesses at RRPV=1 or RRPV=2 will reset the RRPV to 1. The prefetch IPV is similar to the demand IPV. The critical difference is that the prefetch policy is more forgiving to cache lines at RRPV=2: they will be promoted all the way back to RRPV=0 on a prefetch access. The authors also found an IPV pair (again, one demand IPV and one prefetch IPV) which works better for data-heavy workloads.

Fig. 10 shows benchmark results for both IPV pairs described above against state-of-the-art cache replacement policies. PACIPV-D is the IPV pair which is better suited for data-heavy workloads. There is some overfitting, as PACIPV-D was trained on the benchmarks measured here.

Source: https://dl.acm.org/doi/10.1145/3695053.3730993

Dangling Pointers

I wonder if there is value in storing RRPVs in DRAM, kind of like how additional DRAM bits are used for ECC data. This would allow the cache to remember properties of a cache line over long periods of time.
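The IPV mechanics can be sketched as follows, using the [0, 1, 1, 2, 2] example IPV from Fig. 3 on one 4-way set (the victim tie-break and data structures are my own simplifications):

```python
# One set of a set-associative cache driven by an insertion/promotion
# vector: ipv[k] is the new RRPV after a hit on a line with RRPV k, and
# ipv[4] is the RRPV given to a newly inserted line.

class IPVSet:
    def __init__(self, ipv, ways=4):
        self.ipv = ipv
        self.ways = ways
        self.lines = {}  # tag -> RRPV

    def access(self, tag):
        if tag in self.lines:                    # hit: promote per the IPV
            self.lines[tag] = self.ipv[self.lines[tag]]
            return "hit"
        if len(self.lines) == self.ways:         # miss in a full set
            while not any(r == 3 for r in self.lines.values()):
                for t in self.lines:             # age every line toward 3
                    self.lines[t] += 1
            victim = next(t for t, r in self.lines.items() if r == 3)
            del self.lines[victim]
        self.lines[tag] = self.ipv[4]            # insert at ipv[4]
        return "miss"

s = IPVSet([0, 1, 1, 2, 2])
for tag in "ABCD":
    s.access(tag)  # all four lines inserted with RRPV 2
s.access("C")      # hit at RRPV 2 -> promoted to ipv[2] = 1
s.access("E")      # miss: lines age until one reaches RRPV 3 and is evicted
print(s.lines)     # {'B': 3, 'C': 2, 'D': 3, 'E': 2}
```

Note how C's earlier promotion saved it: after the aging pass it sits at RRPV 2 while the never-reused lines reached 3 and became eviction candidates.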


LoopFrog: In-Core Hint-Based Loop Parallelization

LoopFrog: In-Core Hint-Based Loop Parallelization Marton Erdos, Utpal Bora, Akshay Bhosale, Bob Lytton, Ali M. Zaidi, Alexandra W. Chadwick, Yuxin Guo, Giacomo Gabrielli, and Timothy M. Jones MICRO'25

To my Kanagawa pals: I think hardware like this would make a great target for Kanagawa, what do you think?

The message of this paper is that there is plenty of loop-level parallelism available which superscalar cores are not yet harvesting. Fig. 1 illustrates the classic motivation for multi-core processors: scaling the processor width by 4x yields a 2x IPC improvement. In general, wider cores are heavily underutilized.

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

The main idea behind LoopFrog is to add hints to the ISA which allow a wide core to exploit more loop-level parallelism in sequential code. If you understand Fig. 2, then you understand LoopFrog; the rest is just details:

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

The compiler emits instructions which the processor can use to understand the structure of a loop. Processors are free to ignore the hints. A loop which can be optimized by LoopFrog comprises three sections:

1. A header, which launches each loop iteration
2. A body, which accepts values from the header
3. A continuation, which computes values needed for the next loop iteration (e.g., the values of induction variables)

Each execution of the header launches two threadlets. A threadlet is like a thread but is only ever executed on the core which launched it. One threadlet launched by the header executes the body of the loop. The other threadlet launched by the header is the continuation, which computes values needed for the next loop iteration. Register loop-carried dependencies are allowed between the header and continuation, but not between body invocations. That is the key which allows multiple bodies to execute in parallel (see Fig. 2c above). At any one time, there is one architectural threadlet (the oldest one), which can update architectural state.
All other threadlets are speculative. Once the architectural threadlet for loop iteration i completes, it hands the baton over to the threadlet executing iteration i+1, which becomes architectural. Dependencies through memory are handled by the speculative state buffer (SSB). When a speculative threadlet executes a memory store, data is stored in the SSB and actually written to memory later on (i.e., after that threadlet is no longer speculative). Memory loads read from both the L1 cache and the SSB, and then disambiguation hardware determines which data to use and which to ignore. The hardware implementation evaluated by the paper does not support nested parallelization; it simply ignores hints inside of nested loops.

Fig. 6 shows simulated performance results for an 8-wide core. A core which supports 4 threadlets is compared against a baseline which does not implement LoopFrog.

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

LoopFrog can improve performance by about 10%. Fig. 1 at the top shows that an 8-wide core experiences about 25% utilization, so there may be more fruit left to pick.
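In software terms, the header/body/continuation split looks roughly like this sequential Python rendering of Fig. 2 (the structure is my own, and it ignores speculation and the SSB entirely):

```python
# Software rendering of the LoopFrog loop decomposition: the header hands
# each iteration's inputs to a body (which carries no dependencies across
# iterations, so bodies could run in parallel), while the continuation
# computes the next induction value. Here the bodies run eagerly in order.

def run_loop(data):
    results = {}
    i = 0                              # loop-carried state lives only in
    while i < len(data):               # the header/continuation pair
        # header: capture this iteration's inputs and launch two "threadlets"
        body_args = (i, data[i])

        def body(args, out=results):   # body: no cross-iteration deps
            idx, x = args
            out[idx] = x * x

        def continuation(idx):         # continuation: next induction value
            return idx + 1

        body(body_args)                # a real core overlaps several bodies
        i = continuation(i)
    return results

print(run_loop([1, 2, 3]))  # {0: 1, 1: 4, 2: 9}
```

Because each `body` receives everything it needs from the header and writes to a distinct slot, nothing forces body invocations to execute one after another, which is exactly the property the ISA hints expose to the hardware.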


Dynamic Load Balancer in Intel Xeon Scalable Processor

Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines Jiaqi Lou, Srikar Vanavasam, Yifan Yuan, Ren Wang, and Nam Sung Kim ISCA'25 This paper describes the DLB accelerator present in modern Xeon CPUs. The DLB addresses a problem similar to the one discussed in the state-compute replication paper: how to parallelize packet processing when RSS (static NIC-based load balancing) is insufficient. Imagine a 100 Gbps NIC is receiving a steady stream of 64B packets and sending them to the host. If RSS is inappropriate for the application, then another parallelization strategy would be for a single CPU core to distribute incoming packets to all of the others. To keep up, that load-distribution core would have to be able to process 200M packets per second, but state-of-the-art results top out at 30M packets per second. The DLB is an accelerator designed to solve this problem. Fig. 2 illustrates the DLB hardware and software architecture: Source: https://dl.acm.org/doi/10.1145/3695053.3731026 A set of producer cores can write 16B queue elements (QEs) into a set of producer ports (PPs). In a networking application, one QE could map to a single packet. A set of consumer cores can read QEs out of consumer queues (CQs). QEs contain metadata which producers can set to enable ordering within a flow/connection, and to control relative priorities. The DLB balances the load at each consumer, while honoring ordering constraints and priorities. A set of cores can send QEs to the DLB in parallel without suffering too much from skew. For example, imagine a CPU with 128 cores. If DLB is not used, and instead RSS is configured to statically distribute connections among those 128 cores, then skew could be a big problem. If DLB is used, and there are 4 cores which write into the producer ports, then RSS can be configured to statically distribute connections among those 4 cores, and skew is much less likely to be a problem. Fig.
5 shows that DLB works pretty well. Two software load balancers are compared against a configuration which uses the DLB accelerator. DLB offers similar throughput and latency to RSS, but with much more flexibility. Source: https://dl.acm.org/doi/10.1145/3695053.3731026 AccDirect One awkward point in the design above is the large number of CPU cycles consumed by the set of producer cores which write QEs into the DLB. The paper proposes AccDirect to solve this. The idea is that the DLB appears as a PCIe device, and therefore a flexible NIC can use PCIe peer-to-peer writes to send packets directly to the DLB. The authors find that the NVIDIA BlueField-3 has enough programmability to support this. Fig. 9 shows that this results in significant power savings, but not much of a latency improvement: Source: https://dl.acm.org/doi/10.1145/3695053.3731026 Dangling Pointers I feel like it is common knowledge that fine-grained parallelism doesn’t work well on multi-core CPUs. In the context of this paper, the implication is that it is infeasible to write a multi-core packet processor that primarily uses pipeline parallelism. Back-of-the-envelope: at 400 Gbps and 64B packets, there is a budget of about 40 8-wide SIMD instructions to process a batch of 8 packets. If there are 128 cores, then maybe the aggregate budget is 4K instructions per batch of 8 packets across all cores. This doesn’t seem implausible to me.


The XOR Cache: A Catalyst for Compression

The XOR Cache: A Catalyst for Compression Zhewen Pan and Joshua San Miguel ISCA'25 Inclusive caches seem a bit wasteful. This paper describes a clever mechanism for reducing that waste: store two cache lines of data in each physical cache line! In most of the paper there are only two caches in play (L1 and LLC), but the idea generalizes. Fig. 1 illustrates the core concept. In this example, there are two cache lines (A and B) in the LLC. A is also present in the L1 cache of a core. Here is the punchline: the LLC stores A⊕B. If the core which has cached A experiences a miss when trying to access B, then the core resolves the miss by asking the LLC for A⊕B. Once that data arrives at the L1, B can be recovered by computing A⊕(A⊕B). The L1 stores A and B separately, so as to not impact L1 hit latency. Source: https://dl.acm.org/doi/10.1145/3695053.3730995 Coherence Protocol The mechanics of doing this correctly are implemented in a cache coherence protocol described in section 4 of the paper. We’ve discussed the local recovery case, where the core which needs B already holds A in its L1 cache. If the core which requires B does not hold A, then two fallbacks are possible. Before describing those cases, the following property is important to highlight. The cache coherence protocol ensures that if the LLC holds A⊕B, then some L1 cache in the system will hold a copy of either A or B. If some action was about to violate this property, then the LLC would request a copy of A or B from some L1, and use it to split A⊕B into separate cache lines in the LLC holding A and B. Direct forwarding occurs when some other core holds a copy of B. In this case, the system requests the other core to send B to the core that needs it. The final case is called remote recovery. If the LLC holds A⊕B and no L1 cache holds a copy of B, then some L1 cache in the system must hold a copy of A. The LLC sends A⊕B to that core which has A locally cached. That core computes A⊕(A⊕B) to recover a copy of B and sends it to the requestor.
Another tricky case to handle is when a core writes to A. The cache coherence protocol handles this case similarly to eviction and ensures that the LLC will split cache lines as necessary so that all data is always recoverable. The LLC has a lot of freedom when it comes to deciding which cache lines to pair up. The policies described in this paper optimize for intra-cache line compression (compressing the data within a single cache line). The LLC hardware maintains a hash table. When searching for a partner for cache line A, the LLC computes a hash of the contents of A to find a set of potential partners. One hash function described by the paper is sparse byte labeling, which produces 6 bits for each 8-byte word in a cache line. Each bit is set to 0 if the corresponding byte in the word is zero. The lower two bytes of each word are ignored. The idea here is that frequently the upper bytes of a word are zero. If two cache lines have the same byte label, then when you XOR them together the merged cache line will have many zero bytes (i.e., they have low entropy and are thus compressible). The LLC can optimize for this case by storing compressed data in the cache and thus increasing its effective capacity. This paper relies on prior work related to compressed caches. The takeaway is that not only is there a potential 2x savings from logically storing two cache lines in one physical location, but there are also further savings from compressing these merged cache lines. Fig. 13 compares the compression ratio achieved by XOR cache against prior work (taller bars are better). The right-most set of bars show the geometric mean: Source: https://dl.acm.org/doi/10.1145/3695053.3730995 Fig. 15 shows the performance impacts associated with this scheme: Source: https://dl.acm.org/doi/10.1145/3695053.3730995 Dangling Pointers It seems to me that XOR Cache mostly benefits cache lines that are rarely written.
I wonder if there are ways to predict whether a particular cache line is likely to be written in the near future.


CHERIoT RTOS: An OS for Fine-Grained Memory-Safe Compartments on Low-Cost Embedded Devices

CHERIoT RTOS: An OS for Fine-Grained Memory-Safe Compartments on Low-Cost Embedded Devices Saar Amar, Tony Chen, David Chisnall, Nathaniel Wesley Filardo, Ben Laurie, Hugo Lefeuvre, Kunyan Liu, Simon W. Moore, Robert Norton-Wright, Margo Seltzer, Yucong Tao, Robert N. M. Watson, and Hongyan Xia SOSP'25 This paper is a companion to a previous paper which described the CHERIoT hardware architecture. This work presents an OS that doesn’t look like the systems you are used to. The primary goal is memory safety (and security more broadly). Why rewrite your embedded code in Rust when you can switch to a fancy new chip and OS instead? Recall that a CHERI capability is a pointer augmented with metadata (bounds, access permissions). CHERI allows a more restrictive capability to be derived from a less restrictive one (e.g., reduce the bounds or remove access permissions), but not the other way around. CHERIoT RTOS doesn’t have the notion of a process; instead it has compartments. A compartment comprises code and compartment-global data. Compartment boundaries are trust boundaries. I think of it like a microkernel operating system. Example compartments in CHERIoT include: Boot loader Context switcher Heap allocator Thread scheduler The boot loader is fully trusted and is the first code to run. The hardware provides the boot loader with the ultimate capability. The boot loader then derives more restrictive capabilities, which it passes to other compartments. You could imagine a driver compartment which is responsible for managing a particular I/O device. The boot loader would provide that compartment with a capability that enables the compartment to access the MMIO registers associated with the device. There is no user space/kernel space distinction here, only a set of compartments, each with a unique set of capabilities. Fig.
3 illustrates a compartment: Source: https://dl.acm.org/doi/10.1145/3731569.3764844 Sealed Capabilities The CHERIoT hardware architecture supports sealing of capabilities. Sealing a capability is similar to deriving a more restrictive one, only this time the derived capability is useless until it is unsealed by a compartment which holds a capability with unsealing permissions. I think of this like a client encrypting some data before storing it on a server. The data is useless to everyone except for the client who can decrypt it. Cross-compartment function calls are similar to system calls and are implemented with sealed capabilities. Say compartment A needs to be able to call a function exported by compartment B. At boot, the boot loader derives a “function call” capability which is a pointer into the export table associated with B, seals that capability, and passes it to compartment A at initialization. The boot loader also gives the switcher a capability which allows it to unseal the function call capability. When A wants to call the function exported by B, it passes the sealed capability to the switcher. The switcher then unseals the capability and uses it to read metadata about the exported function from B’s export table. The switcher uses this metadata to safely perform the function call. Capability sealing also simplifies inter-compartment state management. Say compartment A calls into compartment B (for networking) to create a TCP connection. The networking compartment can allocate a complicated tree of objects and then return a sealed capability which points to that tree. Compartment A can hold on to that capability and pass it as a parameter for future networking function calls (which B will unseal and then use). Compartment B doesn’t need to track per-connection objects in its global state. Heap Allocator The heap compartment handles memory allocation for all compartments. There is just one address space shared by all compartments, but capabilities make the whole thing safe.
As described in the previous summary, when an allocation is freed, the heap allocator sets the associated revocation bits to one. This prevents use-after-free bugs (in conjunction with the CHERIoT hardware load filter). Similar to garbage collection, freed memory is quarantined (not reused) until a memory sweep completes which ensures that no outstanding valid capabilities are referencing the memory to be reused. The allocator supports allocation capabilities which can enforce per-compartment quotas. Threads If you’ve had enough novelty, you can rest your eyes for a moment. The CHERIoT RTOS supports threads, and they mostly behave like you would expect. The only restriction is that threads are statically declared in code. Threads begin execution in the compartment that declares them, but then threads can execute code in other compartments via cross-compartment calls. Micro-reboots Each compartment is responsible for managing its own state with proper error handling. If all else fails, the OS supports micro-reboots, where a single compartment can be reset to a fresh state. The cross-compartment call mechanism supported by the switcher enables the necessary bookkeeping for micro-reboots. The steps to reboot a single compartment are: Stop new threads from calling into the compartment (these calls fail with an error code) Fault all threads which are currently executing in the compartment (this will also result in error codes being returned to other compartments) Release all resources (e.g., heap data) which have been allocated by the compartment Reset all global variables to their initial state I wonder how often a micro-reboot of one compartment results in an error code which causes other compartments to micro-reboot. If a call into a compartment which is in the middle of a micro-reboot can fail, then I could see that triggering a cascade of micro-reboots. The ideas here remind me of Midori, which relied on managed languages rather than hardware support.
I wonder which component is better to trust, an SoC or a compiler?


CHERIoT: Complete Memory Safety for Embedded Devices

CHERIoT: Complete Memory Safety for Embedded Devices Saar Amar, David Chisnall, Tony Chen, Nathaniel Wesley Filardo, Ben Laurie, Kunyan Liu, Robert Norton, Simon W. Moore, Yucong Tao, Robert N. M. Watson, and Hongyan Xia MICRO'23 If you are like me, you’ve vaguely heard of CHERI, but never really understood what it is about. Here it is in one sentence: hardware support for memory protection embedded in every single pointer. This particular paper focuses on CHERI implementation details for embedded/IoT devices. The C in CHERI stands for capability. A capability is a fat pointer which contains an address, bounds, and access permissions. Fig. 1 shows the bit layout (64 bits total) of the capabilities used by CHERIoT: Source: https://dl.acm.org/doi/10.1145/3613424.3614266 Here is the fundamental concept to understand: the only way to access memory in a CHERI architecture is via a capability. There are no pointers other than capabilities. The hardware uses special tag bits (associated with registers and memory locations) to track which registers or memory addresses contain valid capabilities, and which do not. In the following example, a regular old integer (whose associated tag bit will not be set) is reinterpreted as a pointer. The tag bit associated with the resulting pointer will not be set, and thus memory reads/writes using it will fail. If a programmer cannot create a capability willy-nilly, where do they come from? At boot, the hardware creates the uber-capability (i.e., one which has full permissions to access all memory) and places this capability into a specific register. The initial OS code that runs on boot can access this capability and can use special instructions to derive new capabilities from it. For example, the OS could derive a capability which has read-only access to the first 1 MiB of memory.
The owner of a capability may derive new capabilities from it, but hardware ensures a derived capability cannot have broader permissions than the capability from which it was derived. CHERIoT is designed for embedded use cases, which have real-time requirements. MMUs/MPUs can add variability because they usually contain caches (e.g., TLBs) which have dramatically different performance characteristics in hit vs. miss cases. CHERIoT does away with this. There is no memory translation, and memory protection is supported on a per-capability basis (as opposed to per-page tracking in an MPU). This is pretty cool: capabilities not only give fine-grained memory protection, but they also make performance more consistent by removing a cache from the system. Each capability represents a range of memory which can be accessed. Three fields (comprising 22 bits total) in each capability are used to represent the size of memory which is accessible by the capability. The encoding is a bit like floating point, with an exponent field which allows small sizes (i.e., less than 512 bytes) to be represented exactly, while larger sizes require padding. Astute readers will ask themselves: “how does CHERIoT prevent use-after-free bugs? A call to free must somehow invalidate all capabilities which point at the freed region, correct?” CHERIoT introduces heap revocation bits. Because the total amount of RAM is often modest in embedded use cases, CHERIoT can get away with a dedicated SRAM to hold heap revocation bits. There is 1 revocation bit per 8 bytes of RAM. Most software does not have direct access to these bits, but the heap allocator does. All revocation bits are initially set to zero. When the heap allocator frees memory, it sets the corresponding bits to one. The hardware uses these bits to prevent capabilities from accessing freed memory. You may think that CHERIoT checks revocation bits on each memory access, but it does not.
Instead, the hardware load filter checks the revocation bits when the special “load capability” instruction is executed. This instruction is used to load a capability from memory into a register. The tag bit associated with the destination register is set to one only if the revocation bit associated with the address the capability points to is zero, and the tag bit of the source address is one. The final ingredient in this recipe is akin to garbage collection. CHERIoT supports what I like to think of as a simple garbage collection hardware accelerator called the background pipelined revoker. Software can request this hardware to scan a range of memory. Scanning occurs “in the background” (i.e., in clock cycles where the processor is not accessing memory). The background revoker reuses existing hardware to load each potential capability in the specified memory range, and then store it back. The load operation reads the associated tag bit and revocation bit, while the store operation updates the tag bit. This clears the tag bit for any capability that points to revoked memory. Once the background revoker has finished scanning all memory, the heap allocator can safely set the revocation bits associated with recently freed allocations back to zero and reuse the memory to satisfy future heap allocations. The authors modified two existing processors to support CHERIoT. Flute is a 5-stage processor with a 64-bit memory bus. Ibex is a 2- or 3-stage processor with a 32-bit memory bus.
Table 2 shows the area and power cost associated with extending the Ibex processor to support CHERIoT (roughly 2x for both metrics): Source: https://dl.acm.org/doi/10.1145/3613424.3614266 Table 3 uses CoreMark to measure the performance overhead associated with CHERIoT: Source: https://dl.acm.org/doi/10.1145/3613424.3614266 Dangling Pointers I would be interested to know how easily C#/Go/Rust can be modified to use CHERI hardware bounds checking rather than software bounds checking for array accesses. This seems like an area where CHERI could win back some performance.


Analog in-memory computing attention mechanism for fast and energy-efficient large language models

Analog in-memory computing attention mechanism for fast and energy-efficient large language models Leroux, N., Manea, P.-P., Sudarshan, C. et al. Nature Computational Science The takeaway from this one is that there is a high cost to the synchronous digital logic abstraction. Attention is one of the core building blocks of the Transformer architecture. During inference, the input activation matrix is multiplied by 3 separate weight matrices to produce three intermediate matrices: Q, K, and V. Q and K are then multiplied together, scaled by a constant, and passed through a nonlinear activation function (e.g., softmax). The result is called an attention matrix, which is multiplied by V. Causal attention is a specialization which applies a mask to the attention matrix. This restricts each token so that it can only attend to itself and earlier tokens. This restriction enables the implementation of the KV cache. The KV cache stores rows of the K and V matrices for later re-use. Say the network is producing output token n: there is no need to recompute the first n-1 rows of K and V; they can simply be retrieved from the cache. Only row n needs to be computed (and stored in the cache for future use). There is a nice animation showing a KV cache here. What follows next is a description of CMOS hardware which can both store values from K and V and perform the computations required by the attention mechanism. I feel a bit out of my depth, but here we go! Fig. 1(d) shows a single gain cell (prior paper here). This single cell comprises a single capacitor and a handful of transistors. Source: https://www.nature.com/articles/s43588-025-00854-1 The voltage of capacitor C1 represents a single weight value from K or V (not a bit in a floating-point representation of a number, but an actual real-valued weight). The WE and WLW pins allow the weight value to be updated. An element of Q arrives at a gain cell, represented by a pulse on the WLR pins. The width of the pulse is proportional to the value of the element; the amplitude of the pulse is a constant.
For the duration of the input pulse, an output pulse is generated with an amplitude proportional to the voltage of capacitor C1. It’s been a while since I’ve had to compute an integral, but I think I can find the area under this curve. The total output charge is the product of the output amplitude and the pulse width, i.e., it is proportional to the product of the stored weight and the input element. The output currents of multiple gain cells are all dumped onto a shared bitline, and the aggregate charge produced by all of them represents a dot product. More analog hardware magic is used to produce a HardSigmoid charge-to-pulse circuit. It integrates all current received over the shared bitline, and produces a pulse whose width is a function of that current. Fig. 1(g) shows that function: Source: https://www.nature.com/articles/s43588-025-00854-1 One gets the sense that this is not the ideal function from a machine learning point of view, but it is efficient to implement in hardware. The proposed chip has a set of gain cells to store K and a set of gain cells to store V. There is a fixed number of these cells, so sliding window attention is implemented. In sliding window attention, the outputs for a token can only be affected by the tokens within a fixed-size window preceding it. Fig. 2(c) shows how sliding window attention is implemented over time in a ping-pong fashion. While the attention products for the current token are being computed, the capacitors for one token’s worth of K and V values are written. The counter step converts the computed value from analog to digital. Source: https://www.nature.com/articles/s43588-025-00854-1 One nice benefit of sliding window attention has to do with refreshing the gain cells. Leakage causes the capacitor in a gain cell to “forget” the value it is storing over time (~5 ms). With sliding window attention, there is a limit to the amount of time that a particular value needs to be remembered. If that limit is low enough, then there is no need to refresh the gain cells. Fig.
5 has the headline results (note the logarithmic scale); these are 100-100,000x improvements in latency and energy: Source: https://www.nature.com/articles/s43588-025-00854-1 If you are interested in how this design affects model accuracy, Table 1 has the relevant metrics. My takeaway is that it doesn’t seem to have a large effect. Search the web for ‘CMOS full adder’, and you’ll see circuits that seem about as complex as a single gain cell. Those digital circuits only compute a single bit of a sum (and a carry bit as well)! The abstraction of digital logic is an expensive one indeed. I suppose this is not unlike the Turing Tax associated with higher levels of abstraction.


A linear-time heuristic for improving network partitions

A linear-time heuristic for improving network partitions C. M. Fiduccia and R. M. Mattheyses DAC'82 This paper describes the Fiduccia–Mattheyses algorithm, which is similar to the Kernighan–Lin algorithm for graph partitioning. The ideas are old, but still very relevant. The object which this algorithm operates on is a network. A network is similar to a graph; the key difference is that edges are replaced with nets. A net connects 2 or more vertices (an edge is a degenerate case which always connects exactly 2 vertices). This makes sense in hardware applications, where vertices are circuit components, and nets are the wires that connect them. The problem is to divide the vertices in a network into two sets, A and B. Nets which cross from A to B are called cut nets. The number of cut nets should be minimized. The relative sizes of A and B must adhere to a balance criterion which ensures that A and B have roughly the same number of vertices. The graph partitioning problem is similar to the min-cut problem. The key difference is that in graph partitioning, there is a requirement that the two partitions are balanced (have roughly equal sizes). Graph partitioning is NP-hard; the algorithm described in this paper is not guaranteed to produce an optimal result. The algorithm starts by dividing the vertices into initial sets A and B, ensuring that the balance criterion is honored. It then iteratively executes passes until no more improvement can be made. A single pass can potentially move each vertex in the graph (to the complementary partition). Each vertex will only be moved at most once per pass. Within a pass, vertices are sorted according to their associated gain. The gain of a vertex is the overall improvement that would result by moving that vertex to the complementary partition. Fig. 1 has an example. The current partition is represented by the vertical dashed line.
The partitioning would be improved by moving the vertex with +1 gain to the left partition (one fewer net would be cut). Partitioning would also be improved by moving the +2 vertex over to the right partition, but the balance criterion would likely not allow this. The partitioning would be made worse if the vertex with -1 gain were to be moved (as the associated net would become cut). Source: https://dl.acm.org/doi/10.1145/62882.62910 A single pass produces a set of candidate partitions and then chooses the best candidate to be the input partition for the next pass. The algorithm to execute a single pass is: Among all vertices which have not been considered in the current pass, and which can be moved without violating the balance criterion, select a vertex with maximal gain Move that vertex and note the updated partition as a candidate partition After a single pass has completed, examine all candidate partitions the pass produced. Choose the best one. There are two subtle points about this process: The gain of each vertex is recomputed after each move. Step #1 will select vertices with negative gains. The idea is that within a pass, a small bad move can unlock future (but within the same pass) good moves. The best candidate partition is selected at the end of each pass, so partitions that directly result from bad moves will be ignored. A net is called critical if it is connected to zero or one vertices in A or in B. If a net is not critical, then it can be ignored when considering moving a single vertex, because an individual move cannot cause non-critical nets to change from cut to uncut (or the other way around). The gain of an individual vertex can be computed by summing the contributions from all critical nets that vertex is connected to. Each critical net contributes -1, 0, or +1 to the gain of a vertex.
The following steps enable incremental updates of data structures containing the following information when moving a vertex from set A to set B: The gain of each vertex The number of vertices in A that each net is connected to The number of vertices in B that each net is connected to Note that the conditionals which examine the number of vertices that a net is connected to in A or B only consider critical nets (nets connected to 0 or 1 vertices in the relevant partition). The inner loop of the pass algorithm chooses a vertex with maximum gain. The algorithm above shows how to update the gain associated with each vertex as the pass progresses but does not deal with efficiently selecting a high-gain vertex. The trick here is that there is a bounded range of values that a gain can have: gains lie in [-pmax, +pmax], where pmax is simply the maximum degree of any vertex. With this knowledge, vertices can be stored in an array of linked lists. Each entry in the array corresponds to a particular gain value. When the gain of a vertex changes, it can be removed from the list that it is currently contained in and inserted in the list corresponding to the new gain. Fig. 2 illustrates the data structure: Source: https://dl.acm.org/doi/10.1145/62882.62910 Results The following table shows performance results for four randomly sampled chips. The average chip has 267 vertices, 245 nets, and the average degree of a vertex is 2650. I would have thought that more passes were needed to converge on a solution. Source: https://dl.acm.org/doi/10.1145/62882.62910
