
EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation

EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation
Weigao Su and Vishal Shrivastav
ASPLOS'25

This paper describes incremental changes to Ethernet NICs and switches to enable efficient disaggregation of memory without the need for a separate network (e.g., CXL) for memory traffic. Fig. 1 shows the north star:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Servers are partitioned into Compute Nodes and Memory Nodes. When a compute node wants to access remote memory, it issues a request to its local NIC, which sends the request to the correct memory node (via a switch).

The key problem this paper addresses is Ethernet fabric latency (i.e., the time taken for requests/responses to flow between NICs and switches). The paper assumes that the latency between the processor and the NIC is low (and cites other papers which describe techniques for reducing this latency to below 100 ns). Typical Ethernet fabric latency is measured in microseconds, which is much higher than a local memory access.

The Ethernet hardware stack can be decomposed into MAC and PHY layers. The MAC is the higher-level layer and sits on top of the PHY. The paper proposes implementing EDM (Ethernet Disaggregated Memory) with modifications to the PHY layer in both the NIC and the switch. Normal network packets flow through the MAC and PHY as they usually would, but a side channel allows remote memory accesses to be handled directly by the enhanced PHY layer. Fig. 3 illustrates the hardware changes in Ethernet NICs and switches.

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Remote memory access requests and responses are smaller than typical Ethernet packets. Additionally, end-to-end application performance is more sensitive to remote memory access latency than to the latency of regular network traffic. The bulk of the paper describes how EDM achieves low latency for remote memory traffic.

The EDM PHY modifications allow a memory request to preempt a non-memory packet.
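The preemption idea can be illustrated with a toy model (my own sketch, not code from the paper): the PHY drains a packet one 66-bit block at a time, and any memory request that has arrived by a given block boundary is slipped onto the wire first.

```python
def transmit(packet_blocks, mem_requests):
    """Toy model of EDM PHY preemption: a memory request that arrives
    while a packet is on the wire is slipped in between the packet's
    66-bit blocks instead of waiting for the whole packet to finish.
    `mem_requests` maps a block index to the requests arriving then."""
    wire = []
    for i, block in enumerate(packet_blocks):
        for req in mem_requests.get(i, []):
            wire.append(("mem", req))   # preempt: the request goes out first
        wire.append(("pkt", block))
    return wire
```

The packet's blocks still reach the receiver in order; only the gaps between blocks are repurposed.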
Say the MAC sends a 1 KiB packet to the PHY, which begins to send the packet over the wire in 66-bit blocks. If a memory request shows up in the middle of transmitting the network packet, the PHY can sneak the memory request onto the wire between 66-bit blocks, rather than waiting for the whole 1 KiB to be sent.

Standard Ethernet requires 96 bits of zeros to be sent on the wire between each packet. This overhead is small for large packets, but it is non-trivial for small packets (like remote memory access requests). The EDM PHY modifications allow these idle bits to be used for remote memory accesses. The MAC still sees the gaps it expects, but on the wire the PHY fills them.

If you ask an LLM what could possibly go wrong by trying to use the inter-frame gap to send useful data, it will spit out a long list. I can't find much detail in the paper about how to ensure that this enhancement is robust. The possible problems are limited to the PHY layer, however, as the MAC still sees the zeros it expects.

To avoid congestion and dropped memory requests, EDM uses an in-network scheduling algorithm somewhat like PFC. The EDM scheduler sits in the PHY layer of the switch. Senders notify the switch when they have memory traffic to send, and the switch later responds with a grant, allowing a certain amount of data to be sent.

The authors implemented EDM on FPGAs (acting as both NIC and switch). Table 1 compares latencies for TCP/IP, RDMA, raw Ethernet packets, and EDM, breaking down the latency at each step:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Fig. 7 throws CXL into the mix:

Source: https://dl.acm.org/doi/10.1145/3669940.3707221

Dangling Pointers

Section 3.3 "Practical Concerns" has a discussion of what could go wrong (e.g., fault tolerance and data corruption). It is hard to judge how much work is needed to make this into something that industry could rely on.
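The notify/grant scheduling described above can be sketched as a toy switch scheduler (my own simplification; the FIFO grant policy and per-round byte budget are invented details, not the paper's algorithm):

```python
from collections import deque

class SwitchScheduler:
    """Toy model of an EDM-style in-network scheduler: senders notify
    the switch of pending memory traffic, and the switch issues grants
    that cap how much each sender may transmit per round."""
    def __init__(self, link_capacity):
        self.link_capacity = link_capacity   # bytes grantable per round
        self.requests = deque()              # (sender, bytes) notifications

    def notify(self, sender, nbytes):
        self.requests.append((sender, nbytes))

    def grant_round(self):
        """Issue grants in arrival order until the round's budget is spent."""
        budget = self.link_capacity
        grants = []
        while self.requests and budget > 0:
            sender, nbytes = self.requests.popleft()
            granted = min(nbytes, budget)
            budget -= granted
            grants.append((sender, granted))
            if granted < nbytes:   # leftover demand waits for the next round
                self.requests.appendleft((sender, nbytes - granted))
        return grants
```

Because senders only transmit what they were granted, the switch never has to buffer or drop memory requests.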


An Analysis of User-space Idle State Instructions on x86 Processors

An Analysis of User-space Idle State Instructions on x86 Processors
Malte-Christian Kuns, Hannes Tröpgen, and Robert Schöne
ICPE'25

I've long believed that busy waiting is poor form. The closest thing you should ever come to busy waiting is to lock a mutex, which will busy wait for a short while on your behalf. If your primary concern is power consumption, then busy waiting may be less offensive on a modern processor. This paper describes newly added x86 instructions that enable low-power busy waiting from user space, and has a ton of data to help you sleep better at night.

TPAUSE puts the processor into a low power state for a user-specified amount of time. It supports two low power states (C0.1 and C0.2), which trade power consumption for wake-up latency. TPAUSE can be called in user space but doesn't wrest control of the core away from the OS. The trick is that the OS can set a maximum timeout value, which gives the OS a chance to switch away from the busy-waiting thread.

The UMONITOR and UMWAIT instructions are similar to TPAUSE but allow the processor to be woken up when a write occurs in a specified memory range. UMONITOR sets up the memory range to be monitored, and UMWAIT causes the processor to enter a low power state. UMWAIT accepts a timeout value and a target power state (just like TPAUSE). AMD supports similar functionality via the MONITORX and MWAITX instructions.

A key question the paper investigates is how closely the user-specified timeout is honored. Fig. 1 shows results for three Intel cores:

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

Times are measured in timestamp counter cycles (roughly 3 GHz for Alder Lake). The plateau at the top right is caused by the OS-specified maximum timeout. The authors find that timeout values are quantized (e.g., to a granularity of 83 cycles on Alder Lake P-cores). Additionally, for short timeouts the processor may ignore the user-requested power state (presumably because it doesn't make sense to enter a deep sleep for a short amount of time).
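The observed timeout behavior (OS cap plus hardware quantization) can be modeled as follows. This is my own sketch of the measurements; the 83-cycle quantum matches the Alder Lake P-core observation above, but the cap value is illustrative:

```python
def effective_timeout(requested, os_cap=100_000, quantum=83):
    """Model of how a TPAUSE-style timeout appears to behave in the
    paper's measurements: the request is clamped to the OS-configured
    maximum, then rounded up to the hardware's quantization granularity."""
    clamped = min(requested, os_cap)
    # Round up to the next quantum boundary (ceiling division).
    return -(-clamped // quantum) * quantum
```

The plateau in Fig. 1 corresponds to the `min` clamp; the staircase at small timeouts corresponds to the rounding.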
On Alder Lake P-cores, the threshold below which the processor will not enter the lowest-power state is around 23,000 TSC cycles. Alder Lake E-cores seem to support only one low power state.

Fig. 3 measures how much the processor can "oversleep" (wake up later than requested) depending on processor frequency and requested power state:

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

And finally, Table 2 shows measured power consumption for these new instructions vs. old-fashioned busy-wait loops that use the PAUSE instruction (which does not support a user-specified timeout):

Source: https://dl.acm.org/doi/10.1145/3676151.3719370

I'm shocked by the advantage that AMD has here. If CPU core power during busy waiting is your primary concern, then you should choose your chip carefully.

Condition variables are a general and useful abstraction. It would be nice if code that uses condition variables could automatically benefit from these instructions. Maybe some compiler and/or hardware assistance is necessary to enable that.


Re-architecting End-host Networking with CXL: Coherence, Memory, and Offloading

Re-architecting End-host Networking with CXL: Coherence, Memory, and Offloading
Houxiang Ji, Yifan Yuan, Yang Zhou, Ipoom Jeong, Ren Wang, Saksham Agarwal, and Nam Sung Kim
MICRO'25

This paper is the third one that I've posted about which deals with the subtleties of interfacing a NIC and a host CPU. Here are links to the previous posts on this subject:

Disentangling the Dual Role of NIC Receive Rings
CEIO vs rxBisect: Fixing DDIO's Leaky DMA Problem

The authors bring a new hammer to the construction site: CXL, which offers some interesting efficiencies and simplifications. This paper shows how CXL can address two specific problems with the HW/SW interface of a typical PCIe NIC:

After the host prepares a packet to be transmitted, it notifies the NIC with an MMIO write. This MMIO write is expensive because it introduces serialization into the host processor pipeline.

When the NIC sends a received packet to the host, ideally it would write data to the LLC rather than host DRAM. However, if the host CPU cannot keep up, then the NIC should have a graceful fallback.

CXL Type-1 devices are asymmetric: the device has coherent access to host memory, but the host does not have coherent access to device memory. Practically speaking, both packet descriptors and packet payloads must still be stored in host memory (no change from PCIe-based NICs). Because the NIC has coherent access to host memory, it can safely prefetch receive descriptors (RxDs) into an on-NIC cache. When a packet arrives, the NIC can grab a descriptor from the cache and thus avoid an expensive host memory read to determine where to write packet data. If the host CPU updates an RxD after the NIC has prefetched it, the CXL cache coherence protocol will notify the NIC that it must invalidate its cached copy.

Coherence also enables the tail pointers for transmit ring buffers to be safely stored in host memory.
The host networking stack can update a tail pointer with a regular store instruction (rather than an MMIO write). The NIC continually polls this value using coherent reads. If the tail pointer has not been updated since the last poll, the NIC reads a cached value and generates no PCIe traffic.

CXL Type-2 NICs allow packets and descriptors to be stored in NIC memory. The host CPU can cache data read from the NIC, as the NIC will generate the necessary coherence traffic when it reads or writes this data. The design space (what data goes into what memory) is large, and the results section has numbers for many possible configurations.

Section 5.3 of the paper describes how a Type-2 NIC can intelligently use CXL to write received packet data directly into the host LLC. This is similar to DDIO (described in the two papers linked at the top of this post), but the key difference is that the NIC is in the driver's seat. The CEIO paper proposes monitoring LLC usage and falling back to storing received packets in DRAM local to the NIC if the LLC is too full. With CXL, the NIC has the option to write data to host memory directly (bypassing the LLC), thus avoiding the need for DRAM attached to the NIC.

The authors implemented a CXL NIC on an Altera FPGA. They compared results against an NVIDIA BlueField-3 PCIe NIC. Fig. 10 compares loopback latency for the two devices, normalized to the BlueField-3 latency (lower is better), for a variety of CXL configurations.

Source: https://dl.acm.org/doi/pdf/10.1145/3725843.3756102

Dangling Pointers

One fact I took away from this paper is that CXL coherence messages are much cheaper than MMIOs and interrupts. Burning a CPU core polling a memory location seems wasteful to me. It would be nice if that CPU core could at least go into a low power state until a relevant coherence message arrives.
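The tail-pointer scheme described earlier can be sketched as a toy transmit ring (my own sketch: a plain store to shared memory stands in for the coherent tail pointer, and the "NIC" notices the change on its next poll):

```python
class TxRing:
    """Toy model of a transmit ring whose tail pointer lives in host
    memory. With CXL, repeated NIC polls of an unchanged tail hit the
    NIC's own cache; only an actual update generates coherence traffic."""
    def __init__(self, size):
        self.descriptors = [None] * size
        self.tail = 0          # written by the host with a plain store
        self.nic_head = 0      # NIC's private consume index

    def host_post(self, desc):
        """Host: publish a descriptor, then bump the tail (no MMIO write)."""
        self.descriptors[self.tail % len(self.descriptors)] = desc
        self.tail += 1

    def nic_poll(self):
        """NIC: consume any descriptors posted since the last poll."""
        sent = []
        while self.nic_head < self.tail:
            sent.append(self.descriptors[self.nic_head % len(self.descriptors)])
            self.nic_head += 1
        return sent
```

In the PCIe world the `host_post` step would end with an expensive MMIO doorbell; here the doorbell is just the coherent `tail` variable.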


Forest: Access-aware GPU UVM Management

Forest: Access-aware GPU UVM Management
Mao Lin, Yuan Feng, Guilherme Cox, and Hyeran Jeon
ISCA'25

Unified virtual memory is an abstraction which presents a single unified address space for both the CPU and GPU. This is a convenient programming model because it allows one device to create a complex data structure (with pointers) and pass it directly to the other device. Maintaining this illusion in systems with discrete GPUs is a complex task. The state of the art involves initially placing allocations in host memory and then copying them to GPU memory to resolve GPU page faults (far-faults).

Prefetching can help avoid some far-faults. A state-of-the-art prefetcher is the tree-based neighboring prefetcher (TBNp). Fig. 3 shows the TBNp data structure for a 512 KiB allocation:

Source: https://dl.acm.org/doi/10.1145/3695053.3731047

TBNp tracks each 2 MiB of virtual memory with a binary tree containing 64 KiB leaf nodes. When a far-fault occurs on a 4 KiB page (yellow "1" in Fig. 3), the remaining 60 KiB (blue "2" in Fig. 3) in the leaf are prefetched. Once the leaves contained by a particular subtree are >= 50% resident in GPU memory, the remainder of the subtree is prefetched. In Fig. 3 that situation occurs when the three left-most leaf nodes have been transferred to GPU memory. At that point, the remaining leaf node in the subtree (green "7" in Fig. 3) is prefetched.

Access Pattern Profiling

This paper proposes customizing the TBNp trees associated with each object (i.e., each allocation created via the UVM allocation API), with the help of access pattern profiling hardware. Fig. 7 illustrates the design:

Source: https://dl.acm.org/doi/10.1145/3695053.3731047

The GPU MMU tracks access times for each object and for each page. Once a healthy amount of profiling data has been collected, the GPU fires an interrupt which notifies the driver that it should grab the latest statistics.
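TBNp's residency trigger can be sketched with a one-level simplification (my own sketch: a flat array of booleans, one per 64 KiB leaf, grouped into fixed-size subtrees; the real structure is a binary tree per 2 MiB region):

```python
def tbnp_fault(resident, leaf, subtree_leaves=4):
    """Toy TBNp sketch: `resident` holds one boolean per 64 KiB leaf.
    A far-fault fetches the faulting leaf; if the leaf's subtree is
    then >= 50% resident, the rest of the subtree is prefetched."""
    resident[leaf] = True
    lo = (leaf // subtree_leaves) * subtree_leaves
    subtree = resident[lo:lo + subtree_leaves]
    if sum(subtree) * 2 >= len(subtree):
        for i in range(lo, lo + subtree_leaves):
            resident[i] = True   # prefetch the remainder of the subtree
    return resident
```

The point of the sketch is the shape of the policy, not its exact thresholds: demand faults pull in whole leaves, and residency within a subtree eventually tips the whole subtree into GPU memory.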
Access Pattern Classification

The driver classifies access patterns for each object into one of four buckets:

Linear/Streaming
Non-Linear, High-Coverage, High-Intensity
Non-Linear, High-Coverage, Low-Intensity
Non-Linear, Low-Coverage

The paper uses tree-based prefetching like TBNp, but configures the trees differently for each object depending on its bucket. This is where the name "Forest" comes from: each tree has a maximum size, so large objects are chopped up and tracked with multiple trees.

Streaming accesses are detected with linear regression: if the R² value is close to 1, the driver classifies the accesses as streaming. The prefetching trees for objects accessed in a streaming manner are 4 MiB in size and contain 256 KiB leaf nodes (relatively large).

If the driver determines accesses are not streaming, then the choice between the remaining three buckets is determined by access coverage and access intensity. Access coverage is computed from the minimum and maximum page numbers accessed during profiling. Access intensity is based on the number of accesses during profiling. For objects with high access coverage and high access intensity, the associated prefetching trees are 512 KiB in size and contain 64 KiB leaf nodes. For objects with high access coverage and low access intensity, the trees are 512 KiB in size and contain 16 KiB leaf nodes. Finally, for objects with low access coverage, the trees are 2 MiB in size and contain 64 KiB leaf nodes.

Fig. 12 contains simulation results for a number of benchmarks:

Source: https://dl.acm.org/doi/10.1145/3695053.3731047

Forest is the design described in this paper. A second variant avoids the overheads associated with initial profiling by trying to make better initial guesses about access patterns before profiling data is available. I wonder how much vertical integration can help here.
Certainly, a number of applications have enough context to make smarter decisions than the driver relying on profiling information.
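The classification logic above can be sketched as follows (my own sketch: the R² cutoff, intensity threshold, and coverage threshold are illustrative stand-ins, not the paper's constants):

```python
def classify(page_accesses, object_pages, intensity_thresh=1000):
    """Bucket an object's profiled accesses Forest-style and return the
    (tree size, leaf size) its prefetch trees would use.
    `page_accesses` is a list of (timestamp, page_number) samples."""
    times = [t for t, _ in page_accesses]
    pages = [p for _, p in page_accesses]
    n = len(page_accesses)
    # Linear regression of page number vs. time: R^2 near 1 => streaming.
    mt, mp = sum(times) / n, sum(pages) / n
    cov = sum((t - mt) * (p - mp) for t, p in page_accesses)
    vt = sum((t - mt) ** 2 for t in times)
    vp = sum((p - mp) ** 2 for p in pages)
    r2 = (cov * cov) / (vt * vp) if vt and vp else 0.0
    if r2 > 0.95:
        return "streaming", ("4MiB", "256KiB")
    # Coverage: span of touched pages relative to the object's size.
    coverage = (max(pages) - min(pages) + 1) / object_pages
    if coverage < 0.5:
        return "low-coverage", ("2MiB", "64KiB")
    # Intensity: raw number of profiled accesses.
    if n >= intensity_thresh:
        return "high-coverage/high-intensity", ("512KiB", "64KiB")
    return "high-coverage/low-intensity", ("512KiB", "16KiB")
```

Each bucket maps to the tree/leaf geometry described above, so a streaming object gets big leaves (aggressive prefetch) while a sparse object gets a big tree with modest leaves.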


UPP: Universal Predicate Pushdown to Smart Storage

UPP: Universal Predicate Pushdown to Smart Storage
Ipoom Jeong, Jinghan Huang, Chuxuan Hu, Dohyun Park, Jaeyoung Kang, Nam Sung Kim, and Yongjoo Park
ISCA'25

Working on hardware acceleration requires a healthy dose of honesty. If you try hard enough, you can find clever ways to accelerate a given application. Once that point is reached, it is helpful to step back and ask yourself: "are these ideas generally applicable to many hardware architectures, or only to the one I am targeting?" This paper describes techniques for high-performance filtering of OLAP data, and an FPGA implementation. I wonder if these ideas would also work well on other chips.

This paper rests on two assumptions:

Inputs are relatively static, which means that the cost of preprocessing can be amortized over many queries.

Best-effort filtering is OK, because the system has another level of filtering to catch any false positives (rows which should have been removed, but were not).

A preprocessing step generates a 256-bit row vector (RV) associated with each row. These bits are partitioned among all columns (e.g., if there are 8 columns in the relation, then each column is represented with 32 bits per row). When a query is run, the relevant filters from the query are converted into a set of 256-bit query vectors (QVs) and simple instructions which perform logical operations between the row vectors and query vectors. The result of those instructions is a single bit per row which determines if the row can be safely removed.

Numerical expressions are supported for monotone functions. During the preprocessing step, the lower and upper bounds of each column are computed. This range is divided into a fixed number of buckets. For each value in a column, the associated bucket index is computed, and the associated bit in the row vector is set to 1. Hashing is used to handle the case where there are more buckets than bits allocated for a given column.
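The bucketization can be sketched for a single numeric column (my own sketch; the bit count and equal-width bucket policy are illustrative). For a monotone predicate, checking a bucket's two endpoints is enough to decide whether the bucket could contain a matching value:

```python
def row_bits(value, lo, hi, nbits):
    """Preprocessing: map a column value to one set bit by bucketizing
    the column's [lo, hi] range into `nbits` equal-width buckets."""
    span = (hi - lo) / nbits
    bucket = min(int((value - lo) / span), nbits - 1)
    return 1 << bucket

def query_bits(pred, lo, hi, nbits):
    """Query compilation for a monotone predicate: keep every bucket
    whose endpoints show it could contain a matching value."""
    span = (hi - lo) / nbits
    qv = 0
    for b in range(nbits):
        if pred(lo + b * span) or pred(lo + (b + 1) * span):
            qv |= 1 << b
    return qv

def keep_row(rv, qv):
    # AND of row vector and query vector; an all-zero result means the
    # row can be safely dropped before it ever reaches the query engine.
    return (rv & qv) != 0
```

Rows that survive may still fail the real predicate (a kept bucket can straddle the boundary), which is exactly the best-effort contract: the second-level filter cleans up.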
When a query is executed, the software determines the set of buckets which the query's filter expression references, and sets the corresponding bits in the query vector. To determine if a row should be filtered, compute the bitwise AND of the row vector and the query vector. If all bits of the result are zero, then the row can be removed.

To convert a string into a row vector, the paper proposes tokenizing the string, and then hashing each token (i.e., word) to determine which bit in the row vector to set. This means that multiple bits in a row vector can be set (one bit per word). Only tokens which appear frequently in the dataset are hashed; the rest are ignored. A string-matching query expression is decomposed into tokens the same way: each token is hashed, and the hash values determine which bits in the query vector are set. Rows are accepted only if they have all of those bits set. Note that this is best-effort filtering: a row whose column contains all of the query's tokens, but not the exact phrase, will not be removed.

Table 3 shows resource usage numbers for an FPGA accelerator which executes queries by performing bitwise operations on row and query vectors, and compacting tables based on the generated bitmaps. These numbers seem pretty modest (i.e., good) to me. The authors argue that this implementation is small enough that it could be incorporated into a Smart SSD, allowing queries to be pushed down as far as possible.

Source: https://dl.acm.org/doi/10.1145/3695053.3731005

Fig. 6 shows TPC-H performance results. Each pair of bars represents a particular query run on a baseline system and on a system with filtering pushed down to a Smart SSD. Q21 doesn't see a speedup because it is bound by join and aggregation, not filtering.

Source: https://dl.acm.org/doi/10.1145/3695053.3731005

Dangling Pointers

I wonder how much of this is overfitting to TPC-H.
If you look at the TPC-H spec, a lot of string columns are generated by randomly sampling from a very small set of possible tokens. It would be great if the industry had a "held out test set" which could be used to evaluate OLAP performance on real-world yet hidden datasets which researchers could not directly see.
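The string scheme described above can be sketched as follows (my own sketch; CRC32 and the frequent-token set are stand-ins for whatever hash and token statistics the real preprocessing uses):

```python
import zlib

def token_bits(text, frequent, nbits=32):
    """Set one bit per frequent token appearing in the text; tokens that
    are not frequent in the dataset are ignored."""
    bits = 0
    for token in text.split():
        if token in frequent:
            bits |= 1 << (zlib.crc32(token.encode()) % nbits)
    return bits

def may_match(rv, qv):
    """Best-effort string filter: keep the row iff every query-token bit
    is set. False positives survive (all tokens present, phrase absent)
    and are caught by the next filtering level."""
    return rv & qv == qv
```

The false-positive case is visible directly: a row containing the query's tokens in a different order sets the same bits as a true match.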


RayN: Ray Tracing Acceleration with Near-memory Computing

RayN: Ray Tracing Acceleration with Near-memory Computing
Mohammadreza Saed, Prashant J. Nair, and Tor M. Aamodt
MICRO'25

A previous paper on OLTP acceleration with processing-in-memory hardware came to the conclusion that the ripest fruit to pick is pointer chasing. This paper comes to a similar conclusion for a different application: ray tracing.

Ray tracing applications spend most of their time traversing a bounding volume hierarchy (BVH) which contains the scene to be rendered. The BVH data structure analyzed in this paper is a tree, where each node contains an axis-aligned bounding box which bounds the subtree rooted at that node. To determine which part of the scene a ray intersects, a ray tracer traverses the BVH starting at the root, recursing whenever the ray intersects the bounding box of the current node. A typical BVH is large and does not fit into on-chip memory.

The problem with this process is that it is not coherent. If a bundle of 8 rays is processed together, one would hope that the traversals performed by each ray are similar, but alas, they frequently are not. This means that the cost of reading BVH nodes from off-chip memory cannot be amortized across many rays. Fig. 3 shows the potential speedup if BVH reads were free:

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

BVH Traversal in HBM Logic Dies

Fig. 5 shows a GPU with four HBM stacks:

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

The yellow rectangles at the bottom of each stack are the key here: each one is an HBM logic die, which can implement a modest amount of functionality. The potential speedup comes from the fact that BVH traversal logic in an HBM logic die has much lower latency when accessing its local BVH data, and thus can traverse the BVH faster than the GPU can.

Typical GPUs contain sophisticated memory controllers. These are implemented in the GPU directly because the GPU die is the best location for complex (i.e., heat-generating) logic.
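As an aside, the traversal loop described above can be sketched in a few lines (my own sketch: nodes are plain dicts, and the box test is the standard slab method):

```python
def ray_hits_box(origin, inv_dir, lo, hi):
    """Slab test: does a ray starting at `origin` intersect the
    axis-aligned box spanning [lo, hi]?"""
    tmin, tmax = 0.0, float("inf")
    for o, inv, l, h in zip(origin, inv_dir, lo, hi):
        t1, t2 = (l - o) * inv, (h - o) * inv
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def traverse(root, origin, direction):
    """Descend into the BVH only where the ray hits a node's bounding
    box; return the primitives in every leaf the walk reaches."""
    inv_dir = tuple(1.0 / d if d else float("inf") for d in direction)
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        if not ray_hits_box(origin, inv_dir, node["lo"], node["hi"]):
            continue
        if "prims" in node:           # leaf: candidate primitives
            hits.extend(node["prims"])
        else:                         # interior: recurse into children
            stack.extend(node["children"])
    return hits
```

Every iteration of that `while` loop is a dependent memory read, which is exactly the pointer-chasing pattern that near-memory logic is good at.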
The authors propose three designs for integrating simple BVH traversal logic into HBM logic dies:

JEDEC-compatible: the GPU retains its sophisticated memory controller, and the HBM logic die integrates a simple memory controller and BVH traversal logic. At any one time, a particular HBM stack can be used for general compute or for ray tracing, but not both. The driver on the host OS controls this configuration at a coarse granularity.

Unified: a memory controller and BVH traversal logic are added to the HBM logic die, and the memory controller is removed from the GPU. This avoids having to partition HBM stacks, but potentially limits the sophistication of the memory controller, and requires significant changes to the GPU.

Hybrid: there are memory controllers in both the GPU and the HBM logic dies, and both can be active at the same time. When both are active, the GPU memory controller sends requests to the HBM logic die controller, which interleaves them with requests coming from the BVH traversal logic on the logic die. This scheme requires larger GPU changes than the JEDEC-compatible option but avoids coarse-grain partitioning.

Fig. 6 illustrates these three designs:

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

BVH Partitioning

A recurring puzzle with processing-in-memory applications is how to partition data across memory modules in a way that doesn't cause load imbalance. The authors of this paper propose that the root of the BVH and nodes which are close to the root should be accessed by the GPU as they normally would be.

The proposed design builds on a common two-level BVH implementation for GPUs. In these implementations, the BVH is divided into a top-level acceleration structure (TLAS) and bottom-level acceleration structures (BLASes). The TLAS contains the root and some interior nodes; each BLAS contains other interior nodes and leaf nodes.
One key design choice here is that multiple instances of a BLAS can appear in a scene (e.g., a single BLAS describes a fish, and the scene can contain 15 fish). In this case, the TLAS contains multiple pointers to a single BLAS.

The paper examines many alternative designs and concludes that BLAS Breaking (BB) is the best approach. In this design, the GPU handles the TLAS and the top-level nodes of each BLAS as it normally would, and the BVH traversal logic in the logic dies handles only the nodes near the leaves of each BLAS. There are two reasons for this design. The first is that the memory coherence problem grows in significance the further a bundle of rays gets from the root (near the root of the BVH, all rays in a bundle typically follow similar paths). The second is that it would be complex to accelerate traversal of the TLAS, because multiple instances of each BLAS may exist; this would require the ray tracing units in the HBM logic dies to communicate with each other.

Subtrees within a BLAS are partitioned among the HBM stacks. Fig. 8 illustrates a BVH which has been partitioned with BLAS Breaking. Green nodes are processed by the GPU as they normally would be. Once a subtree of another color is reached, the rest of the traversal is handled by the logic die in the HBM stack indicated by the color of the subtree.

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

Results

Fig. 11 has simulated speedups, comparing various designs to a baseline that does not offload ray tracing to HBM logic dies. The unified memory controller + BLAS Breaking configuration is the approach advocated by this paper.

Source: https://dl.acm.org/doi/10.1145/3725843.3756067

Dangling Pointers

If the ray tracing logic in the HBM die is bound by memory latency, I wonder if it needs to be so specialized.
Maybe one could get away with more general-purpose logic there (e.g., UPMEM processors) and still get similar results for ray tracing applications.
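BLAS Breaking can be sketched as a partitioning pass over the tree (my own sketch: the fixed cut depth and round-robin stack assignment are invented details, standing in for whatever balance heuristic the paper uses):

```python
def partition_blas(root, cut_depth, num_stacks):
    """Label each BVH node: nodes above `cut_depth` stay on the GPU;
    each subtree rooted at `cut_depth` is assigned (round-robin) to an
    HBM stack, whose logic die finishes the traversal."""
    assignment = {}
    next_stack = 0

    def visit(node, depth, owner):
        nonlocal next_stack
        if owner is None and depth >= cut_depth:
            owner = f"stack{next_stack % num_stacks}"
            next_stack += 1
        assignment[node["id"]] = owner or "gpu"
        for child in node.get("children", []):
            visit(child, depth + 1, owner)

    visit(root, 0, None)
    return assignment
```

The GPU-owned prefix is where bundled rays still travel together, so keeping it on the GPU costs little; the leaf-heavy subtrees, where rays diverge, are where near-memory traversal pays off.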


SHADOW: Simultaneous Multi-Threading Architecture with Asymmetric Threads

SHADOW: Simultaneous Multi-Threading Architecture with Asymmetric Threads
Ishita Chaturvedi, Bhargav Reddy Godala, Abiram Gangavaram, Daniel Flyer, Tyler Sorensen, Tor M. Aamodt, and David I. August
MICRO'25

Some processors excel at exploiting instruction-level parallelism (ILP); others excel at thread-level parallelism (TLP). One approach to dealing with this heterogeneity is to assign each application to a microarchitecture suited for it (e.g., OLTP on ILP-optimized processors, OLAP on TLP-optimized processors). The problem is that some applications defy categorization: the characteristics of the workload may flip-flop at high frequency (milliseconds or seconds). This paper proposes a microarchitecture (named SHADOW) that is specialized for these fickle workloads. Fig. 5 summarizes the design:

Source: https://dl.acm.org/doi/10.1145/3725843.3756070

This processor exposes five hardware threads. One hardware thread (the Out-of-Order (OoO) thread, in blue) exploits ILP. The other four threads (the In-Order (InO) threads, in yellow) exploit TLP. Green components in Fig. 5 are shared between the threads. It may not look like much is shared, but a lot is (e.g., caches, fetch, decode, execute).

The OoO thread has all the bells and whistles you would expect: branch prediction, a reorder buffer, register renaming. The InO threads look more like what you would see on a GPU: no branch prediction, no register renaming. Most of the pipeline resources are shared between threads with time-division multiplexing. For example, on each cycle, a single hardware thread accesses the instruction cache.

The SHADOW architecture assumes some help from software. The OS can use a new instruction during OS-level context switches to inform the hardware about the thread being switched to. For example, the OS can tell the hardware whether the thread should be considered InO or OoO.
The hardware tracks the number of InO threads actively running and uses that count to allocate physical registers to the OoO thread. For example, if no InO threads are running, then the OoO thread can use all physical registers.

The paper advocates a simple software method for dividing work between OoO and InO threads: work-stealing. If an application has a bunch of parallel work to do, it dispatches that work to a pool of threads (consisting of both OoO and InO threads). The idea is that if the workload is OoO-friendly (not too many cache misses), then the OoO thread will naturally run faster than the InO threads and thus perform more of the work. The arbiters at each pipeline stage are designed with this policy in mind.

If you are like me, when you read that, you thought "if an application has a bunch of parallel work, then shouldn't it just use the InO threads?" The answer is "no". The key point is that the OoO thread can do better for some parallelizable workloads, for example because it does not require N times the working set in the caches.

Fig. 10 shows benchmark results for 9 hardware configurations:

Source: https://dl.acm.org/doi/10.1145/3725843.3756070

Table 3 adds more color to those results:

Source: https://dl.acm.org/doi/10.1145/3725843.3756070

The authors estimate that the power and area overhead of adding InO threads to an existing out-of-order core is about 1%. Wowzers, those are some inexpensive threads. I imagine the engineering/validation costs are a bigger concern.

The proposed software parallelization model assumes that running N threads multiplies the working set by N, which is true for most parallelization schemes. Pipeline parallelism doesn't have this property. Maybe everything looks like a nail to me, but pipeline parallelism on CPUs seems to be an under-researched topic.
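The "faster thread naturally does more work" effect can be sketched deterministically (my own sketch: in SHADOW the arbitration happens in hardware, and here it is just a shared task queue drained by threads of different speeds):

```python
from heapq import heappush, heappop

def run_pool(num_tasks, task_cost, thread_speeds):
    """Deterministic toy of SHADOW's work-sharing model: a shared pool
    of equal-cost tasks drained by asymmetric threads. Each thread grabs
    the next task as soon as it is free, so a faster thread (the OoO
    thread on cache-friendly work) ends up doing more of the work."""
    done = {t: 0 for t in thread_speeds}
    # Min-heap of (time the thread becomes free, thread id).
    ready = [(0.0, t) for t in sorted(thread_speeds)]
    for _ in range(num_tasks):
        time_free, t = heappop(ready)
        done[t] += 1
        heappush(ready, (time_free + task_cost / thread_speeds[t], t))
    return done
```

If the workload were cache-hostile, the OoO thread's effective speed would drop toward the InO threads' speed and the split would even out, with no code changes.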


Light-weight Cache Replacement for Instruction Heavy Workloads

Light-weight Cache Replacement for Instruction Heavy Workloads Saba Mostofi, Setu Gupta, Ahmad Hassani, Krishnam Tibrewala, Elvira Teran, Paul V. Gratz, and Daniel A. Jiménez ISCA'25 Caches with LRU policies are great, unless the working set exceeds the cache size (e.g., in RTL simulation). This paper describes a simple cache replacement policy (PACIPV) which outperforms LRU in situations where the working set can be thought of as two separate sets of data: One set is accessed in a round-robin fashion, and is larger than the cache. The other set is accessed in a more random fashion, and/or is smaller than the cache. The findings of the paper can be boiled down to two insertion/promotion vectors (IPVs) which describe the cache replacement policy. The paper assumes a 4-way set-associative cache and finds IPVs which work in that case, but the technique generalizes. The cache tracks two bits of metadata for each cache line. These two bits define the re-reference prediction value (RRPV) for a cache line. A low RRPV (e.g., 0 or 1) indicates that the cache line will likely be referenced again before it is evicted. A high RRPV (e.g., 2 or 3) indicates that the cache line is unlikely to be referenced again before it is evicted. When the cache needs to evict a line from a particular set, it examines the RRPVs associated with each cache line. If any cache line has an RRPV of 3, then it can be evicted. If all RRPVs in the set are less than 3, then all RRPVs are repeatedly incremented (i.e., they move closer to 3) until at least one unlucky cache line has an RRPV of 3 and is chosen for eviction. An IPV defines a state machine which is responsible for updating RRPVs on cache insertion and cache hit events. An IPV comprises five integers (e.g., [0, 1, 1, 2, 2]). The last integer in the list defines the RRPV assigned to a cache line when it is inserted into the cache. The remaining four integers define how RRPVs are updated on each cache hit.
Symbolically, if the IPV is [v0, v1, v2, v3, vnew], then the RRPV associated with a newly inserted cache line is vnew. If a cache hit occurs on a cache line with an RRPV of 3, then the RRPV of that line is updated to v3. Similarly, if a hit occurs on a cache line with an RRPV of 1, then the new RRPV is set to v1. RRPVs are dynamically tracked by the hardware on a per-cache-line basis. RRPVs only apply to cache lines which are present in a cache; they are forgotten when a cache line is evicted. Fig. 3 shows an example IPV ([0, 1, 1, 2, 2]) and two state transitions. Initially, one cache line in the set has an RRPV of 0, another has an RRPV of 1, and so on; cache line C starts with an RRPV of 2. The RRPV associated with C is updated when C is accessed. Because C has an RRPV of 2 at the time of the access, the new RRPV is 1 (the bold element in the IPV: [0, 1, 1 , 2, 2]). Next, a new cache line is inserted into the cache. The line with an RRPV of 3 is selected to be evicted (recall that only cache lines with RRPV=3 can be evicted). The RRPV associated with the newly inserted line is initialized to 2 (the last number in the IPV). Source: https://dl.acm.org/doi/10.1145/3695053.3730993 Prefetch IPV A cache policy is defined by two IPVs. One IPV determines how RRPVs are updated when a demand access (e.g., a load or store instruction) occurs. The other IPV defines how RRPVs are updated when a prefetch occurs. I assume this applies to both hardware and software prefetching. Section 4 of the paper describes how many candidate IPVs were evaluated under simulation, yielding a best general-purpose pair: one IPV for demand accesses and one for prefetch accesses. The way I think of this is that cache lines are initially placed into the cache “under suspicion” (RRPV=3). This means that the cache policy does not have high confidence that a cache line will be referenced a second time. If it is referenced a second time, however, then the cache line becomes the teacher’s pet and is promoted all the way to RRPV=0. It is hard to stay the teacher’s pet for a long time.
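Here is a minimal sketch of the RRPV mechanics for one 4-way set, using the example IPV from Fig. 3. The tags A through E are hypothetical labels of my own, chosen so that the trace matches the figure's two transitions:

```python
def evict_index(rrpvs):
    """Pick a victim: age every line until some line reaches RRPV 3."""
    while 3 not in rrpvs:
        rrpvs[:] = [r + 1 for r in rrpvs]
    return rrpvs.index(3)

def access(rrpvs, tags, tag, ipv):
    """Apply an insertion/promotion vector [v0, v1, v2, v3, vnew]."""
    if tag in tags:                      # hit: promote using the IPV
        i = tags.index(tag)
        rrpvs[i] = ipv[rrpvs[i]]
    else:                                # miss: evict, then insert
        victim = evict_index(rrpvs)
        tags[victim] = tag
        rrpvs[victim] = ipv[4]

ipv = [0, 1, 1, 2, 2]                    # the example IPV from Fig. 3
tags = ["A", "B", "C", "D"]
rrpvs = [0, 1, 2, 3]
access(rrpvs, tags, "C", ipv)            # hit at RRPV 2 -> promoted to ipv[2] = 1
print(rrpvs)                             # [0, 1, 1, 3]
access(rrpvs, tags, "E", ipv)            # miss: the RRPV=3 line is evicted,
print(tags, rrpvs)                       # and "E" is inserted at ipv[4] = 2
```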
If an eviction occurs and no other line has an RRPV of 3, then the cache line will be downgraded to RRPV=1. RRPV=1 is fairly sticky, however, because accesses at RRPV=1 or RRPV=2 will reset the RRPV to 1. The prefetch IPV is similar to the demand IPV. The critical difference is that the prefetch policy is more forgiving to cache lines at RRPV=2: they will be promoted all the way back to RRPV=0 on a prefetch access. The authors also found an IPV pair which works better for data-heavy workloads. Fig. 10 shows benchmark results for both IPV pairs described above against state-of-the-art cache replacement policies. PACIPV-D is the IPV pair which is better suited for data-heavy workloads. There is some overfitting, as PACIPV-D was trained on the benchmarks measured here. Source: https://dl.acm.org/doi/10.1145/3695053.3730993 Dangling Pointers I wonder if there is value in storing RRPVs in DRAM. Kind of like how additional DRAM bits are used for ECC data. This would allow the cache to remember properties of a cache line over long periods of time.


LoopFrog: In-Core Hint-Based Loop Parallelization

LoopFrog: In-Core Hint-Based Loop Parallelization Marton Erdos, Utpal Bora, Akshay Bhosale, Bob Lytton, Ali M. Zaidi, Alexandra W. Chadwick, Yuxin Guo, Giacomo Gabrielli, and Timothy M. Jones MICRO'25 To my Kanagawa pals: I think hardware like this would make a great target for Kanagawa, what do you think? The message of this paper is that there is plenty of loop-level parallelism available which superscalar cores are not yet harvesting. Fig. 1 illustrates the classic motivation for multi-core processors: scaling the processor width by 4x yields only a 2x IPC improvement. In general, wider cores are heavily underutilized. Source: https://dl.acm.org/doi/10.1145/3725843.3756051 The main idea behind LoopFrog is to add hints to the ISA which allow a wide core to exploit more loop-level parallelism in sequential code. If you understand Fig. 2, then you understand LoopFrog; the rest is just details: Source: https://dl.acm.org/doi/10.1145/3725843.3756051 The compiler emits instructions which the processor can use to understand the structure of a loop. Processors are free to ignore the hints. A loop which can be optimized by LoopFrog comprises three sections: A header, which launches each loop iteration. A body, which accepts values from the header. A continuation, which computes values needed for the next loop iteration (e.g., the value of induction variables). Each execution of the header launches two threadlets. A threadlet is like a thread but is only ever executed on the core which launched it. One threadlet launched by the header executes the body of the loop. The other threadlet launched by the header is the continuation, which computes values needed for the next loop iteration. Register loop-carried dependencies are allowed between the header and continuation, but not between body invocations. That is the key which allows multiple bodies to execute in parallel (see Fig. 2c above). At any one time, there is one architectural threadlet (the oldest one), which can update architectural state.
All other threadlets are speculative. Once the architectural threadlet for loop iteration i completes, it hands the baton over to the threadlet executing iteration i+1, which becomes architectural. Dependencies through memory are handled by the speculative state buffer (SSB). When a speculative threadlet executes a memory store, data is stored in the SSB and actually written to memory later on (i.e., after that threadlet is no longer speculative). Memory loads read from both the L1 cache and the SSB, and then disambiguation hardware determines which data to use and which to ignore. The hardware implementation evaluated by the paper does not support nested parallelization; it simply ignores hints inside of nested loops. Fig. 6 shows simulated performance results for an 8-wide core. A core which supports 4 threadlets is compared against a baseline which does not implement LoopFrog. Source: https://dl.acm.org/doi/10.1145/3725843.3756051 LoopFrog can improve performance by about 10%. Fig. 1 at the top shows that an 8-wide core experiences about 25% utilization, so there may be more fruit left to pick.
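The header/body/continuation split can be sketched as follows. This is an illustrative software model of my own, not the paper's hardware: the continuation chain runs sequentially to produce each iteration's header values, while the bodies, which carry no iteration-to-iteration register dependencies, can legally execute in any order.

```python
def run_looplet(n, body, continuation, init):
    """Model a LoopFrog-style loop: the continuation chain is sequential,
    but each body sees only its header values and may run in any order."""
    # Header/continuation chain: compute each iteration's inputs in order.
    header_values = []
    state = init
    for _ in range(n):
        header_values.append(state)
        state = continuation(state)   # loop-carried values flow only here
    # Bodies are independent, so executing them out of order is legal.
    results = [None] * n
    for i in reversed(range(n)):      # deliberately not the sequential order
        results[i] = body(header_values[i])
    return results

# Hypothetical example: a squaring loop with induction variable i.
out = run_looplet(5, body=lambda i: i * i, continuation=lambda i: i + 1, init=0)
print(out)  # [0, 1, 4, 9, 16]
```

A hardware implementation would run the bodies concurrently as speculative threadlets rather than in reversed order, but the legality argument is the same: no body depends on another body's registers.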

Dangling Pointers, 1 month ago

Dynamic Load Balancer in Intel Xeon Scalable Processor

Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines Jiaqi Lou, Srikar Vanavasam, Yifan Yuan, Ren Wang, and Nam Sung Kim ISCA'25 This paper describes the DLB accelerator present in modern Xeon CPUs. The DLB addresses a problem similar to the one discussed in the state-compute replication paper: how to parallelize packet processing when RSS (static NIC-based load balancing) is insufficient. Imagine a 100 Gbps NIC is receiving a steady stream of 64B packets and sending them to the host. If RSS is inappropriate for the application, then another parallelization strategy would be for a single CPU core to distribute incoming packets to all of the others. To keep up, that load-distribution core would have to be able to process 200M packets per second, but state-of-the-art results top out at 30M packets per second. The DLB is an accelerator designed to solve this problem. Fig. 2 illustrates the DLB hardware and software architecture: Source: https://dl.acm.org/doi/10.1145/3695053.3731026 A set of producer cores can write 16B queue elements (QEs) into a set of producer ports (PPs). In a networking application, one QE could map to a single packet. A set of consumer cores can read QEs out of consumer queues (CQs). QEs contain metadata which producers can set to enable ordering within a flow/connection, and to control relative priorities. The DLB balances the load at each consumer, while honoring ordering constraints and priorities. A set of cores can send QEs to the DLB in parallel without suffering too much from skew. For example, imagine a CPU with 128 cores. If DLB is not used, and instead RSS is configured to statically distribute connections among those 128 cores, then skew could be a big problem. If DLB is used, and there are 4 cores which write into the producer ports, then RSS can be configured to statically distribute connections among those 4 cores, and skew is much less likely to be a problem. Fig. 5 shows that DLB works pretty well. Two software load balancer configurations are compared against one which uses the DLB accelerator. DLB offers similar throughput and latency to RSS, but with much more flexibility. Source: https://dl.acm.org/doi/10.1145/3695053.3731026 AccDirect One awkward point in the design above is the large number of CPU cycles consumed by the set of producer cores which write QEs into the DLB. The paper proposes AccDirect to solve this. The idea is that the DLB appears as a PCIe device, and therefore a flexible NIC can use PCIe peer-to-peer writes to send packets directly to the DLB. The authors find that the NVIDIA BlueField-3 has enough programmability to support this. Fig. 9 shows that this results in a significant power savings, but not too much of a latency improvement: Source: https://dl.acm.org/doi/10.1145/3695053.3731026 Dangling Pointers I feel like it is common knowledge that fine-grained parallelism doesn’t work well on multi-core CPUs. In the context of this paper, the implication is that it is infeasible to write a multi-core packet processor that primarily uses pipeline parallelism. Back-of-the-envelope: at 400 Gbps and 64B packets, there is a budget of about 40 8-wide SIMD instructions to process a batch of 8 packets. If there are 128 cores, then maybe the aggregate budget is 4K instructions per batch of 8 packets across all cores. This doesn’t seem implausible to me.
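A toy model may help build intuition for the balancing policy. This sketch is my own heavy simplification (the real DLB supports several scheduling modes, priorities, and hardware credit schemes): QEs in the same flow always land on one consumer, preserving per-flow ordering, while new flows are assigned to the least-loaded consumer.

```python
class ToyBalancer:
    """A much-simplified model of flow-aware load balancing: queue elements
    (QEs) in the same flow always go to one consumer (preserving order),
    and new flows are assigned to the least-loaded consumer."""

    def __init__(self, num_consumers):
        self.queues = [[] for _ in range(num_consumers)]
        self.flow_to_consumer = {}

    def enqueue(self, flow_id, payload):
        if flow_id not in self.flow_to_consumer:
            loads = [len(q) for q in self.queues]
            self.flow_to_consumer[flow_id] = loads.index(min(loads))
        self.queues[self.flow_to_consumer[flow_id]].append((flow_id, payload))

balancer = ToyBalancer(num_consumers=4)
for pkt in range(12):
    balancer.enqueue(flow_id=pkt % 6, payload=pkt)   # 6 flows, 12 packets
print([len(q) for q in balancer.queues])
```

Even this toy version shows the key property: consumers stay roughly balanced without any single core touching every packet.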


The XOR Cache: A Catalyst for Compression

The XOR Cache: A Catalyst for Compression Zhewen Pan and Joshua San Miguel ISCA'25 Inclusive caches seem a bit wasteful. This paper describes a clever mechanism for reducing that waste: store two cache lines of data in each physical cache line! In most of the paper there are only two caches in play (L1 and LLC), but the idea generalizes. Fig. 1 illustrates the core concept. In this example, there are two cache lines (A and B) in the LLC. A is also present in the L1 cache of a core. Here is the punchline: the LLC stores A ⊕ B in a single physical line. If the core which has cached A experiences a miss when trying to access B, then the core resolves the miss by asking the LLC for A ⊕ B. Once that data arrives at the L1, B can be recovered by computing (A ⊕ B) ⊕ A. The L1 stores A and B separately, so as to not impact L1 hit latency. Source: https://dl.acm.org/doi/10.1145/3695053.3730995 Coherence Protocol The mechanics of doing this correctly are implemented in a cache coherence protocol described in section 4 of the paper. We’ve discussed the local recovery case, where the core which needs B already holds A in its L1 cache. If the core which requires B does not hold A, then two fallbacks are possible. Before describing those cases, the following property is important to highlight. The cache coherence protocol ensures that if the LLC holds A ⊕ B, then some L1 cache in the system will hold a copy of either A or B. If some action was about to violate this property, then the LLC would request a copy of A or B from some L1, and use it to split A ⊕ B into separate cache lines in the LLC holding A and B. Direct forwarding occurs when some other core holds a copy of B. In this case, the system requests the other core to send B to the core that needs it. The final case is called remote recovery. If the LLC holds A ⊕ B and no L1 cache holds a copy of B, then some L1 cache in the system must hold a copy of A. The LLC sends A ⊕ B to the core which has A locally cached. That core computes (A ⊕ B) ⊕ A to recover a copy of B and sends it to the requestor.
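The recovery step is just XOR algebra, which a few lines make concrete. The 8-byte "lines" below are hypothetical (real cache lines would be 64 bytes):

```python
def xor_lines(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Two hypothetical 8-byte "cache lines".
A = bytes([1, 2, 3, 4, 0, 0, 0, 7])
B = bytes([9, 8, 7, 6, 0, 0, 0, 5])

merged = xor_lines(A, B)            # the single physical line the LLC stores
local_B = xor_lines(merged, A)      # local recovery: the L1 already holds A
remote_B = xor_lines(merged, A)     # remote recovery: the same math, just
                                    # performed by whichever core holds A
print(local_B == B, remote_B == B)  # True True
```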
Another tricky case to handle is when a core writes to A or B. The cache coherence protocol handles this case similarly to eviction and ensures that the LLC will split cache lines as necessary so that all data is always recoverable. The LLC has a lot of freedom when it comes to deciding which cache lines to pair up. The policies described in this paper optimize for intra-cache line compression (compressing the data within a single cache line). The LLC hardware maintains a hash table. When searching for a partner for a cache line, the LLC computes a hash of the line’s contents to find a set of potential partners. One hash function described by the paper is sparse byte labeling, which produces 6 bits for each 8-byte word in a cache line. Each bit is set to 0 if the corresponding byte in the word is zero. The lower two bytes of each word are ignored. The idea here is that frequently the upper bytes of a word are zero. If two cache lines have the same byte label, then when you XOR them together the merged cache line will have many zero bytes (i.e., they have low entropy and are thus compressible). The LLC can optimize for this case by storing compressed data in the cache and thus increasing its effective capacity. This paper relies on prior work related to compressed caches. The takeaway is that not only is there a potential 2x savings possible from logically storing two cache lines in one physical location, but there are also further savings in compressing these merged cache lines. Fig. 13 compares the compression ratio achieved by XOR cache against prior work (taller bars are better). The right-most set of bars show the geometric mean: Source: https://dl.acm.org/doi/10.1145/3695053.3730995 Fig. 15 shows performance impacts associated with this scheme: Source: https://dl.acm.org/doi/10.1145/3695053.3730995 Dangling Pointers It seems to me that XOR Cache mostly benefits from cache lines that are rarely written.
I wonder if there are ways to predict if a particular cache line is likely to be written in the near future.
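Going back to sparse byte labeling, here is a sketch of the hash as I read the description. The choice of which two bytes count as the "lower" bytes assumes a little-endian layout, which is my assumption, not something the paper's text above pins down:

```python
def sparse_byte_label(line):
    """6 bits per 8-byte word: one bit per upper byte, set when that byte
    is nonzero. The two low-order bytes of each word are ignored
    (assuming little-endian byte order, so they come first)."""
    label = 0
    for off in range(0, len(line), 8):
        word = line[off:off + 8]
        for byte in word[2:]:          # skip the two low-order bytes
            label = (label << 1) | (1 if byte else 0)
    return label

# Two 64-byte lines full of small little-endian integers: all upper bytes
# are zero, so the labels match and the XOR of the lines is mostly zeros.
A = (300).to_bytes(8, "little") * 8
B = (77).to_bytes(8, "little") * 8
print(sparse_byte_label(A) == sparse_byte_label(B))  # True
merged = bytes(x ^ y for x, y in zip(A, B))
print(merged.count(0))  # 48 of 64 bytes are zero: low entropy, compressible
```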


CHERIoT RTOS: An OS for Fine-Grained Memory-Safe Compartments on Low-Cost Embedded Devices

CHERIoT RTOS: An OS for Fine-Grained Memory-Safe Compartments on Low-Cost Embedded Devices Saar Amar, Tony Chen, David Chisnall, Nathaniel Wesley Filardo, Ben Laurie, Hugo Lefeuvre, Kunyan Liu, Simon W. Moore, Robert Norton-Wright, Margo Seltzer, Yucong Tao, Robert N. M. Watson, and Hongyan Xia SOSP'25 This paper is a companion to a previous paper which described the CHERIoT hardware architecture. This work presents an OS that doesn’t look like the systems you are used to. The primary goal is memory safety (and security more broadly). Why rewrite your embedded code in Rust when you can switch to a fancy new chip and OS instead? Recall that a CHERI capability is a pointer augmented with metadata (bounds, access permissions). CHERI allows a more restrictive capability to be derived from a less restrictive one (e.g., reduce the bounds or remove access permissions), but not the other way around. CHERIoT RTOS doesn’t have the notion of a process; instead it has compartments. A compartment comprises code and compartment-global data. Compartment boundaries are trust boundaries. I think of it like a microkernel operating system. Example compartments in CHERIoT include: Boot loader Context switcher Heap allocator Thread scheduler The boot loader is fully trusted and is the first code to run. The hardware provides the boot loader with the ultimate capability. The boot loader then derives more restrictive capabilities, which it passes to other compartments. You could imagine a driver compartment which is responsible for managing a particular I/O device. The boot loader would provide that compartment with a capability that enables the compartment to access the MMIO registers associated with the device. There is no user space/kernel space distinction here, only a set of compartments, each with a unique set of capabilities. Fig. 3 illustrates a compartment: Source: https://dl.acm.org/doi/10.1145/3731569.3764844 Sealed Capabilities The CHERIoT hardware architecture supports sealing of capabilities. Sealing a capability is similar to deriving a more restrictive one, only this time the derived capability is useless until it is unsealed by a compartment which holds a capability with unsealing permissions. I think of this like a client encrypting some data before storing it on a server. The data is useless to everyone except for the client who can decrypt it. Cross-compartment function calls are similar to system calls and are implemented with sealed capabilities. Say compartment A needs to be able to call a function exported by compartment B. At boot, the boot loader derives a “function call” capability which is a pointer into the export table associated with B, seals that capability, and passes it to compartment A at initialization. The boot loader also gives the switcher a capability which allows it to unseal the function call capability. When A wants to call the function exported by B, it passes the sealed capability to the switcher. The switcher then unseals the capability and uses it to read metadata about the exported function from B’s export table. The switcher uses this metadata to safely perform the function call. Capability sealing also simplifies inter-compartment state management. Say compartment A calls into compartment B (for networking) to create a TCP connection. The networking compartment can allocate a complicated tree of objects and then return a sealed capability which points to that tree. Compartment A can hold on to that capability and pass it as a parameter for future networking function calls (which B will unseal and then use). Compartment B doesn’t need to track per-connection objects in its global state. The heap compartment handles memory allocation for all compartments. There is just one address space shared by all compartments, but capabilities make the whole thing safe.
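The seal/unseal flow can be sketched as a toy object model. Everything here (the class, the "switcher" seal type, the export-table string) is invented for illustration; the real mechanism is hardware capability instructions, not Python objects.

```python
class Capability:
    """Toy model: a capability is a reference plus permissions; sealing
    makes it opaque until unsealed with a matching unseal permission."""
    def __init__(self, target, perms, sealed_with=None):
        self.target = target
        self.perms = frozenset(perms)
        self.sealed_with = sealed_with   # seal type, or None if directly usable

    def use(self):
        if self.sealed_with is not None:
            raise PermissionError("sealed capability cannot be used directly")
        return self.target

def seal(cap, seal_type):
    return Capability(cap.target, cap.perms, sealed_with=seal_type)

def unseal(cap, unseal_perm):
    if cap.sealed_with != unseal_perm:
        raise PermissionError("wrong unsealing capability")
    return Capability(cap.target, cap.perms)

# Boot loader derives a pointer into B's export table and seals it for A.
entry = Capability("B.export_table[0]", {"call"})
handed_to_A = seal(entry, seal_type="switcher")
try:
    handed_to_A.use()                    # A cannot dereference it directly...
except PermissionError as e:
    print(e)
print(unseal(handed_to_A, "switcher").use())  # ...but the switcher can
```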
As described in the previous summary, when an allocation is freed, the heap allocator sets the associated revocation bits. This prevents use-after-free bugs (in conjunction with the CHERIoT hardware load filter). Similar to garbage collection, freed memory is quarantined (not reused) until a memory sweep completes which ensures that no outstanding valid capabilities are referencing the memory to be reused. The allocator supports allocation capabilities which can enforce per-compartment quotas. If you’ve had enough novelty, you can rest your eyes for a moment. The CHERIoT RTOS supports threads, and they mostly behave like you would expect. The only restriction is that threads are statically declared in code. Threads begin execution in the compartment that declares them, but then threads can execute code in other compartments via cross-compartment calls. Each compartment is responsible for managing its own state with proper error handling. If all else fails, the OS supports micro-reboots, where a single compartment can be reset to a fresh state. The cross-compartment call mechanism supported by the switcher enables the necessary bookkeeping for micro-reboots. The steps to reboot a single compartment are: Stop new threads from calling into the compartment (these calls fail with an error code). Fault all threads which are currently executing in the compartment (this will also result in error codes being returned to other compartments). Release all resources (e.g., heap data) which have been allocated by the compartment. Reset all global variables to their initial state. I wonder how often a micro-reboot of one compartment results in an error code which causes other compartments to micro-reboot. If a call into a compartment which is in the middle of a micro-reboot can fail, then I could see that triggering a cascade of micro-reboots. The ideas here remind me of Midori, which relied on managed languages rather than hardware support.
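Returning to the micro-reboot procedure, the four steps can be sketched as a toy sequence. All names and state here are invented for illustration (the real OS tracks this bookkeeping through the switcher, not a Python object):

```python
class Compartment:
    def __init__(self, name, init_globals):
        self.name = name
        self.init_globals = dict(init_globals)
        self.globals = dict(init_globals)
        self.accepting_calls = True
        self.active_threads = set()
        self.heap_allocations = []

def micro_reboot(c):
    """The four steps from the paper, as a toy sequence."""
    c.accepting_calls = False            # 1. reject new cross-compartment calls
    for t in list(c.active_threads):     # 2. fault threads executing inside
        c.active_threads.discard(t)      #    (their callers see error codes)
    c.heap_allocations.clear()           # 3. release compartment resources
    c.globals = dict(c.init_globals)     # 4. reset globals to initial state
    c.accepting_calls = True             # (then resume accepting calls)

net = Compartment("network", {"connections": 0})
net.globals["connections"] = 17
net.active_threads.add("t1")
net.heap_allocations.append("tcp_state")
micro_reboot(net)
print(net.globals, net.active_threads, net.heap_allocations)
```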
I wonder which component is better to trust, an SoC or a compiler?


CHERIoT: Complete Memory Safety for Embedded Devices

CHERIoT: Complete Memory Safety for Embedded Devices Saar Amar, David Chisnall, Tony Chen, Nathaniel Wesley Filardo, Ben Laurie, Kunyan Liu, Robert Norton, Simon W. Moore, Yucong Tao, Robert N. M. Watson, and Hongyan Xia MICRO'23 If you are like me, you’ve vaguely heard of CHERI, but never really understood what it is about. Here it is in one sentence: hardware support for memory protection embedded in every single pointer. This particular paper focuses on CHERI implementation details for embedded/IoT devices. The C in CHERI stands for capability. A capability is a fat pointer which contains an address, bounds, and access permissions. Fig. 1 shows the bit layout (64 bits total) of the capabilities used by CHERIoT: Source: https://dl.acm.org/doi/10.1145/3613424.3614266 Here is the fundamental concept to understand: the only way to access memory in a CHERI architecture is via a capability. There are no pointers other than capabilities. The hardware uses special tag bits (associated with registers and memory locations) to track which registers or memory addresses contain valid capabilities, and which do not. For example, consider a regular old integer (its associated tag bit will not be set) and a pointer generated by reinterpreting the bits of that integer. The tag bit associated with the resulting pointer will also not be set, and thus memory reads/writes using it will fail. If a programmer cannot create a capability willy-nilly, where do they come from? At boot, the hardware creates the uber-capability (i.e., one which has full permissions to access all memory) and places this capability into a specific register. The initial OS code that runs on boot can access this capability and can use special instructions to derive new capabilities from it. For example, the OS could derive a capability which has read-only access to the first 1 MiB of memory.
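The tag-bit idea can be modeled in a few lines. This is a toy model of my own (real CHERI manages tags in hardware, and derivation uses dedicated instructions rather than object copies): a reinterpreted integer never gains a tag, so loads through it fault, while a value derived from a tagged capability keeps its tag.

```python
class TaggedValue:
    """A register or memory word: raw bits plus a hardware-managed tag bit
    saying whether the bits are a valid capability."""
    def __init__(self, bits, tag):
        self.bits = bits
        self.tag = tag

def load_via(ptr):
    """Model of a memory access: only tagged (valid) capabilities work."""
    if not ptr.tag:
        raise PermissionError("tag bit clear: not a valid capability")
    return f"loaded from {ptr.bits:#x}"

x = TaggedValue(0x8000_0000, tag=False)         # a regular integer
p = TaggedValue(x.bits, tag=False)              # reinterpreting x: tag stays clear
try:
    load_via(p)
except PermissionError as e:
    print(e)

root = TaggedValue(0x0, tag=True)               # the boot-time uber-capability
derived = TaggedValue(root.bits, tag=root.tag)  # derivation preserves the tag
print(load_via(derived))
```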
The owner of a capability may derive new capabilities from it, but hardware ensures a derived capability cannot have broader permissions than the capability from which it was derived. CHERIoT is designed for embedded use cases, which have real-time requirements. MMUs/MPUs can add variability because they usually contain caches (e.g., TLBs) which have dramatically different performance characteristics in hit vs. miss cases. CHERIoT does away with this. There is no memory translation, and memory protection is supported on a per-capability basis (as opposed to per-page tracking in the MPU). This is pretty cool: capabilities not only give fine-grained memory protection, but they also make performance more consistent by removing a cache from the system. Each capability represents a range of memory which can be accessed. Three fields (comprising 22 bits total) in each capability are used to represent the size of memory which is accessible by the capability. The encoding is a bit like floating point, with an exponent field which allows small sizes (i.e., less than 512 bytes) to be represented exactly, while larger sizes require padding. Astute readers will ask themselves: “how does CHERIoT prevent use-after-free bugs? A call to free() must somehow invalidate all capabilities which point at the freed region, correct?” CHERIoT introduces heap revocation bits. Because the total amount of RAM is often modest in embedded use cases, CHERIoT can get away with a dedicated SRAM to hold heap revocation bits. There is 1 revocation bit per 8 bytes of RAM. Most software does not have direct access to these bits, but the heap allocator does. All revocation bits are initially set to zero. When the heap allocator frees memory, it sets the corresponding bits to one. The hardware uses these bits to prevent capabilities from accessing freed memory. You may think that CHERIoT checks revocation bits on each memory access, but it does not.
Instead, the hardware load filter checks the revocation bits when the special “load capability” instruction is executed. This instruction is used to load a capability from memory into a register. The tag bit associated with the destination register is set to one only if the revocation bit associated with the address the capability points to is zero, and the tag bit of the source address is one. The final ingredient in this recipe is akin to garbage collection. CHERIoT supports what I like to think of as a simple garbage collection hardware accelerator called the background pipelined revoker. Software can request this hardware to scan a range of memory. Scanning occurs “in the background” (i.e., in clock cycles where the processor was not accessing memory). The background revoker reuses existing hardware to load each potential capability in the specified memory range, and then store it back. The load operation reads the associated tag bit and revocation bit, while the store operation updates the tag bit. This clears the tag bit for any capability that points to revoked memory. Once the background revoker has finished scanning all memory, the heap allocator can safely set the revocation bits associated with recently freed allocations back to zero and reuse the memory to satisfy future heap allocations. The authors modified two existing processors to support CHERIoT. Flute is a 5-stage processor with a 64-bit memory bus. Ibex is a 2- or 3-stage processor with a 32-bit memory bus.
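The load-filter and revoker mechanics described above can be modeled in a few lines. This is a toy model with invented names, not the hardware: each memory word holding a capability is modeled as an (address, tag) pair, and the "sweep" simply reloads and stores back every word, which is exactly what clears stale tags.

```python
GRANULE = 8  # one revocation bit per 8 bytes of RAM

class ToyCheriHeap:
    def __init__(self, size):
        self.revoked = [False] * (size // GRANULE)

    def free(self, addr):
        self.revoked[addr // GRANULE] = True   # allocator marks the granule

    def load_capability(self, mem_word):
        """Model of the load filter: the loaded tag is set only if the
        source was tagged AND its target granule is not revoked."""
        addr, tag = mem_word
        return (addr, tag and not self.revoked[addr // GRANULE])

heap = ToyCheriHeap(1024)
cap_in_memory = (128, True)       # a valid capability pointing at address 128
heap.free(128)                    # the allocator frees that region
addr, tag = heap.load_capability(cap_in_memory)
print(tag)                        # False: the load filter cleared the tag

# Background revoker: reload and store back every word, clearing stale tags.
memory = [(128, True), (512, True)]
memory = [heap.load_capability(w) for w in memory]
print(memory)                     # only the capability to live memory stays tagged
```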
Table 2 shows the area and power cost associated with extending the Ibex processor to support CHERIoT (roughly 2x for both metrics): Source: https://dl.acm.org/doi/10.1145/3613424.3614266 Table 3 uses CoreMark to measure the performance overhead associated with CHERIoT: Source: https://dl.acm.org/doi/10.1145/3613424.3614266

Dangling Pointers

I would be interested to know how easily C#/Go/Rust can be modified to use CHERI hardware bounds checking rather than software bounds checking for array accesses. This seems like an area where CHERI could win back some performance.


Analog in-memory computing attention mechanism for fast and energy-efficient large language models

Analog in-memory computing attention mechanism for fast and energy-efficient large language models Leroux, N., Manea, P.-P., Sudarshan, C., et al. Nature Computational Science The takeaway from this one is that there is a high cost to the synchronous digital logic abstraction. Attention is one of the core building blocks of the Transformer architecture. During inference, the input activation matrix is multiplied by 3 separate weight matrices to produce three intermediate matrices: Q, K, and V. Q and the transpose of K are then multiplied together, scaled by a constant, and passed through a nonlinear activation function (e.g., softmax). This result is called an attention matrix, which is multiplied by V. Causal attention is a specialization which applies a mask to the attention matrix. This restricts token i so that it can only attend to tokens j ≤ i. This restriction enables the implementation of the KV cache. The KV cache stores rows of the K and V matrices for later re-use. Say the network is producing output token t: there is no need to recompute the first t−1 rows of K and V, they can simply be retrieved from the cache. Only row t needs to be computed (and stored in the cache for future use). There is a nice animation showing a KV cache here. What follows next is a description of CMOS hardware which can both store values from K and V and perform the computations required by the attention mechanism. I feel a bit out of my depth, but here we go! Fig. 1(d) shows a single gain cell (prior paper here). This single cell comprises a single capacitor and a handful of transistors. Source: https://www.nature.com/articles/s43588-025-00854-1 The voltage of capacitor C1 represents a single weight value from K or V (not a bit in a floating-point representation of a number, but an actual real-valued weight). The WE and WLW pins allow the weight value to be updated. An element of Q arrives at a gain cell, represented by a pulse on the WLR pins. The width of the pulse is proportional to the value of the element; the amplitude of the pulse is a constant.
For the duration of the input pulse, an output pulse is generated with an amplitude proportional to the voltage of capacitor C1. It’s been a while since I’ve had to compute an integral, but I think I can find the area under this curve. The total charge delivered is the area under the output pulse: the pulse width is proportional to the input element, and the amplitude is proportional to the stored weight, so the charge is proportional to the product of the two. The output currents of multiple gain cells are all dumped onto a shared bitline, and the aggregate current produced by all of them represents a dot product. More analog hardware magic is used to produce a HardSigmoid charge-to-pulse circuit. It integrates all current received over the shared bitline, and produces a pulse whose width is a function of that current. Fig. 1(g) shows that function: Source: https://www.nature.com/articles/s43588-025-00854-1 One gets the sense that this is not the ideal function from a machine learning point of view, but it is efficient to implement in hardware. The proposed chip has a set of gain cells to store K and a set of gain cells to store V. There is a fixed number of these cells, so sliding window attention is implemented. In sliding window attention, the outputs for token i can only be affected by the most recent W tokens (for a fixed window size W). Fig. 2(c) shows how sliding window attention is implemented over time in a ping-pong fashion: while the dot products involving one operand matrix are being computed, the capacitors for 1 token worth of the other are written. The counter step converts the computed value from analog to digital. Source: https://www.nature.com/articles/s43588-025-00854-1 One nice benefit of sliding window attention has to do with refreshing the gain cells. Leakage causes the capacitor in the gain cell to “forget” the value it is storing over time (~5ms). With sliding window attention, there is a limit to the amount of time that a particular value needs to be remembered. If that limit is low enough, then there is no need to refresh gain cells. Fig. 5 has the headline results (note the logarithmic scale): these are 100-100,000x improvements in latency and energy: Source: https://www.nature.com/articles/s43588-025-00854-1 If you are interested in how this design affects model accuracy, table 1 has the relevant metrics. My takeaway is that it doesn’t seem to have a large effect. Search the web for ‘CMOS full adder’, and you’ll see circuits that seem about as complex as a single gain cell. Those digital circuits only compute a single bit of a sum (and a carry bit as well)! The abstraction of digital logic is an expensive one indeed. I suppose this is not unlike the Turing Tax associated with higher levels of abstraction.
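The width-times-amplitude trick can be sanity-checked numerically. Below is a minimal Python sketch of my own (an idealized model, not the published circuit: no leakage, perfectly linear cells) showing that per-cell charge equals input times weight, and that summing charges on the shared bitline yields a dot product:

```python
# Idealized numeric model of a gain-cell multiply-accumulate: the input pulse
# width encodes an element of Q, the output pulse amplitude encodes the stored
# weight, and the charge delivered is width * amplitude.

def cell_charge(x, w):
    # area under the output pulse: duration (prop. to x) times amplitude (prop. to w)
    return x * w

def bitline_dot(xs, ws):
    # every cell on a bitline dumps its charge onto the same wire, so the
    # aggregate charge is the dot product of input vector and weight vector
    return sum(cell_charge(x, w) for x, w in zip(xs, ws))

q_row = [0.5, -1.0, 2.0]   # one row of Q (hypothetical values)
k_col = [1.0, 0.25, 0.5]   # stored weights for one column of K
print(bitline_dot(q_row, k_col))  # 1.25, identical to the digital dot product
```

The "free" part is the accumulation: in digital logic an adder tree costs gates and cycles, while here summation is just currents meeting on a wire.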


A linear-time heuristic for improving network partitions

A linear-time heuristic for improving network partitions C. M. Fiduccia and R. M. Mattheyses DAC'82 This paper describes the Fiduccia–Mattheyses algorithm, which is similar to the Kernighan–Lin algorithm for graph partitioning. The ideas are old, but still very relevant. The object which this algorithm operates on is a network. A network is similar to a graph; the key difference is that edges are replaced with nets. A net connects 2 or more vertices (an edge is a degenerate case which always connects 2 vertices). This makes sense in hardware applications, where vertices are circuit components, and nets are the wires that connect them. The problem is to divide the vertices in a network into two sets, A and B. Nets which cross from A to B are called cut nets. The number of cut nets should be minimized. The relative sizes of A and B must adhere to a balance criterion which ensures that A and B have roughly the same number of vertices. The graph partitioning problem is similar to the min-cut problem. The key difference is that in graph partitioning, there is a requirement that the two partitions are balanced (have roughly equal sizes). Graph partitioning is NP-hard; the algorithm described in this paper is not guaranteed to produce an optimal result. The algorithm starts by dividing the vertices into initial sets A and B, ensuring that the balance criterion is honored. It then iteratively executes passes, until no more improvement can be made. A single pass can potentially move each vertex in the graph (to the complementary partition). Each vertex will only be moved at most once per pass. Within a pass, vertices are sorted according to their associated gain. The gain of a vertex is the overall improvement that would result by moving that vertex to the complementary partition. Fig. 1 has an example. The current partition is represented by the vertical dashed line.
The partitioning would be improved by moving the vertex with +1 gain to the left partition (one fewer net would be cut). Partitioning would also be improved by moving the +2 vertex over to the right partition, but the balance criterion would likely not allow this. The partitioning would be made worse if the vertex with -1 gain were to be moved (as the associated net would become cut). Source: https://dl.acm.org/doi/10.1145/62882.62910 A single pass produces a set of candidate partitions and then chooses the best candidate to be the input partition for the next pass. The algorithm to execute a single pass is:

Among all vertices which have not been considered in the current pass, and which can be moved without violating the balance criterion, select a vertex with maximal gain
Move that vertex and note the updated partition as a candidate partition

After a single pass has completed, examine all candidate partitions the pass produced. Choose the best one. There are two subtle points about this process:

The gain of each vertex is recomputed after each move.
Step #1 will select vertices with negative gains. The idea is that within a pass, a small bad move can unlock future (but within the same pass) good moves. The best candidate partition is selected at the end of each pass, so partitions that directly result from bad moves will be ignored.

A net is called critical if it is connected to zero or one vertices in A or B. If a net is not critical, then it can be ignored when considering moving a single vertex, because an individual move cannot cause non-critical nets to change from cut to uncut (or the other way around). The gain of an individual vertex can be computed by summing the contributions from all critical nets that vertex is connected to. Each critical net contributes +1, -1, or 0 to the gain of a vertex.
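The pass structure above can be sketched in a few lines of Python. This is a brute-force illustration of my own (hypothetical helper names, 2-pin nets only, gains recomputed from scratch); the real algorithm updates gains incrementally and runs in linear time:

```python
# Simplified sketch of one Fiduccia-Mattheyses pass. side maps each vertex
# to 0 or 1; a net is cut if its vertices land on both sides.

def cut_size(nets, side):
    return sum(1 for net in nets if len({side[v] for v in net}) == 2)

def fm_pass(nets, side, balance_ok):
    best, best_cut = dict(side), cut_size(nets, side)
    unlocked = set(side)
    while unlocked:
        candidates = []
        for v in unlocked:
            trial = dict(side)
            trial[v] = 1 - trial[v]
            if balance_ok(trial):
                # gain = reduction in cut size if v were moved
                candidates.append((cut_size(nets, side) - cut_size(nets, trial), v))
        if not candidates:
            break
        _, v = max(candidates)
        side[v] = 1 - side[v]      # move, even when the best gain is negative
        unlocked.remove(v)         # each vertex moves at most once per pass
        if cut_size(nets, side) < best_cut:
            best, best_cut = dict(side), cut_size(nets, side)
    return best, best_cut          # best candidate partition seen in the pass

nets = [(0, 1), (1, 2), (2, 3), (0, 2)]    # 2-pin nets for simplicity
side = {0: 0, 1: 0, 2: 1, 3: 1}            # initial partition, cut size 2
ok = lambda s: 1 <= sum(s.values()) <= 3   # crude balance criterion
best, cut = fm_pass(nets, side, ok)
print(best, cut)
```

On this tiny example the pass finds a partition with cut size 1 (vertex 2 moves left), keeps making forced negative-gain moves afterwards, and correctly reports the best prefix as the pass result.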
The following steps enable incremental updates of data structures containing the following information when moving a vertex from set A to set B:

The gain of each vertex
The number of vertices in A that each net is connected to
The number of vertices in B that each net is connected to

Note that the conditionals which examine the number of vertices that a net is connected to in A or B only consider critical nets (nets connected to 0 or 1 vertices in the relevant partition). The inner loop of the pass algorithm chooses a vertex with maximum gain. The algorithm above shows how to update the gain associated with each vertex as the pass progresses but does not deal with efficiently selecting a high gain vertex. The trick here is that there is a bounded range of values that a gain can have. The paper calls the bounds -pmax and +pmax, where pmax is simply the maximum degree of any vertex. With this knowledge, vertices can be stored in an array of linked lists. Each entry in the array corresponds to a particular gain value. When the gain of a vertex changes, it can be removed from the list that it is currently contained in and inserted in the list corresponding to the new gain. Fig. 2 illustrates the data structure: Source: https://dl.acm.org/doi/10.1145/62882.62910

Results

The following table shows performance results for four randomly sampled chips. The average chip has 267 vertices, 245 nets, and the average degree of a vertex is 2650. I would have thought that more passes were needed to converge on a solution. Source: https://dl.acm.org/doi/10.1145/62882.62910
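The bucket-array structure can be sketched as follows. This is an illustrative Python model of my own (set-based buckets for brevity; the paper uses doubly linked lists so that removing a vertex from its bucket is O(1)):

```python
# Sketch of the FM bucket-list structure: an array indexed by gain in
# [-p_max, +p_max], each slot holding the vertices currently at that gain.

class GainBuckets:
    def __init__(self, p_max):
        self.p_max = p_max
        self.buckets = [set() for _ in range(2 * p_max + 1)]
        self.gain = {}

    def insert(self, v, g):
        self.gain[v] = g
        self.buckets[g + self.p_max].add(v)

    def update(self, v, new_gain):
        # move v from its old bucket to the one matching its new gain: O(1)
        self.buckets[self.gain[v] + self.p_max].discard(v)
        self.insert(v, new_gain)

    def pop_max(self):
        # scan from the highest gain downwards for a non-empty bucket
        for g in range(self.p_max, -self.p_max - 1, -1):
            if self.buckets[g + self.p_max]:
                v = self.buckets[g + self.p_max].pop()
                del self.gain[v]
                return v, g
        return None

b = GainBuckets(p_max=3)
b.insert("a", 1); b.insert("b", -2); b.insert("c", 1)
b.update("b", 2)      # a gain change is a constant-time bucket move
print(b.pop_max())    # ('b', 2)
```

Because gains are bounded by the maximum vertex degree, this gives constant-time updates and near-constant-time max selection, which is what makes the whole pass linear in the number of pins.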


Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS Youmin Chen, Jiwu Shu, Yanyan Shen, Linpeng Huang, and Hong Mei SOSP'25 I love this paper, because it grinds one of my axes: efficient pipeline parallelism on general purpose CPUs. In many hardware designs, pipeline parallelism is the dominant form of parallelism, whereas data parallelism takes the cake on CPUs and GPUs. It has always seemed to me that there are applications where pipeline parallelism should be great on multi-core CPUs, and here is an example. Fig. 1 illustrates the design space for key-value stores: Source: https://dl.acm.org/doi/10.1145/3731569.3764794 One axis is preemptive vs non-preemptive (cooperative) multi-threading. Preemptive multithreading involves context switches, which are cheap relative to disk reads but expensive relative to DRAM reads. The other axis is how to assign work to threads. Thread per request (TPR) creates a new thread for each request. This approach has been subsumed by thread per queue (TPQ), which uses a static number of threads, each of which dequeues requests from a dedicated queue and executes all of the work for a single request to completion. Finally, there is thread per stage (TPS), which divides the steps necessary to complete a request into multiple pipeline stages, and then divides the pipeline stages among a set of threads. The work discussed here uses a non-preemptive, thread per stage architecture. A pipelined implementation seems more complicated than an imperative run-to-completion design, so why do it? The key reason is to take advantage of the CPU cache. Here are two examples: As we’ve seen in other networking papers , a well-designed system can leverage DDIO to allow the NIC to write network packets into the LLC where they are then consumed by software. Key-value stores frequently have hot tuples, and there are advantages to caching these (example here ). 
It is hard to effectively cache data in a TPR/TPQ model, because each request runs the entire key-value store request code path. For example, a CPU core may have enough cache capacity to hold network buffers or hot tuples, but not both. The key disadvantage to a TPS architecture is load balancing. One stage could become the bottleneck, leaving CPU cores idle. The authors propose dynamic reconfiguration of the pipeline based on workload changes. Another challenge with pipelining is implementing efficient communication between cores, because data associated with each request flows down the pipeline with the request itself. Fig. 3 shows the pipeline proposed in this paper: Source: https://dl.acm.org/doi/10.1145/3731569.3764794 The NIC writes request packets into the network buffer (stored in the LLC). The cache-resident layer reads data from this buffer and handles requests involving commonly used keys by accessing the hot index and hot data caches (also in the LLC). The memory-resident layer handles cold keys and values, which are stored in DRAM. One set of threads (pinned to CPU cores) implements the cache-resident layer, and a different set of threads (pinned to other CPU cores) implements the memory-resident layer. An auto-tuner continually monitors the system and adjusts the number of threads assigned to each layer. Section 3.5 describes the synchronization required to implement this adjustment. The NIC writes request packets into a single queue. The cache-resident threads cooperatively read requests from this queue. If there are N threads in the pool, then thread i reads all requests whose queue index satisfies index mod N = i. Next, threads check to see if the key associated with a request is hot (and thus cached in the LLC). Time is divided into epochs. During a given epoch, the set of cached items does not change. This enables fast lookups without costly synchronization between threads.
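The cooperative striding over the shared request queue can be sketched as follows (my reconstruction of the indexing from the description; names are illustrative):

```python
# Sketch of cooperative request striding among cache-resident threads:
# with N threads, thread i takes every request whose queue index
# satisfies index % N == i, so no two threads contend for a request.

def requests_for_thread(queue, thread_id, n_threads):
    return [req for idx, req in enumerate(queue) if idx % n_threads == thread_id]

queue = [f"req{i}" for i in range(8)]
print(requests_for_thread(queue, 1, 3))  # ['req1', 'req4', 'req7']
```

The striding partitions the queue statically, which is why the threads need no locks or atomic dequeue operations on the shared NIC queue.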
A background thread gathers statistics to determine the set of items to be cached in the next epoch and has the ability to atomically switch to the next epoch when the time comes. The number of hot keys is kept small enough that it is highly likely that hot keys will be stored in the LLC. Requests that miss in the cache-resident layer are passed on to the memory-resident layer for further processing (via the CR-MR queue). Typically, the LLC is treated like a global resource (shared by all cores). But this particular use case requires that most of the LLC be dedicated to the cache-resident layer. This is accomplished with the help of the PQOS utility from Intel, which uses “Intel(R) Resource Director Technology” to control which ways of the LLC are assigned to each layer. The memory-resident layer operates on batches of requests. Because the requests are not hot, it is highly likely that each request will require DRAM accesses for index lookups (keys) and data lookups (values). Software prefetching is used to hide DRAM latency during index lookups. When servicing get operations, data values are copied directly into the outgoing network buffer. The CR-MR queue is used to communicate between the two layers. Each (CR thread, MR thread) pair has a dedicated lock-free queue. Enqueue operations use a round-robin policy to spread messages from a CR thread across the MR threads. Dequeue operations must potentially scan queues corresponding to all possible senders. Multiple requests can be stored per message, to amortize control overhead. Fig. 7 has throughput results for synthetic workloads (A, B, and C have different ratios of put/get operations), uTPS-T is this work: Source: https://dl.acm.org/doi/10.1145/3731569.3764794

Dangling Pointers

The pipelining here is coarse-grained, and the design is only optimized for the LLC. I wonder if a more fine-grained pipeline would allow hot data to be stored in L2 caches.
For example, the set of hot keys could be sharded among N cores, with each core holding a different shard in its L2 cache. It seems redundant that this design requires software to determine the set of hot keys, when the hardware cache circuitry already has support to do something like this.


Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann SIGMOD'14 The giant upon whose shoulders this paper rests is Volcano. Parallelism in Volcano is achieved through a proper separation of concerns. Volcano contains many database operators, most of which are blissfully unaware of parallelism. A handful of operators in a query plan exist only to enable parallelism (for example, an operator could implement pipeline parallelism, or partition data between threads). Generally speaking, an elegant separation of concerns is good for performance. However, the thesis of morsel-driven parallelism is that this is not true. Deeply integrating the notion of parallel execution into each operator is the way to go for OLAP. The system described in this paper is named HyPer. Fig. 2 below illustrates how HyPer decomposes a single query into three pipelines: Source: https://dl.acm.org/doi/10.1145/2588555.2610507 The leftmost pipeline scans the input relation T and applies a filter to it. The middle pipeline scans and filters the input relation S. The rightmost pipeline scans and filters R, joins the result with S, and finally joins the result of the first join with T. A system like Volcano would be tempted to scan S and T in parallel. Not so with HyPer: the pipelines which make up a query plan are executed serially. Each relation (both inputs and temporary data) is divided into morsels. A morsel is a group of ~100,000 tuples. Each morsel resides on a single NUMA node (indicated by colors in the figures). Fig. 3 illustrates how HyPer uses morsel-level parallelism to implement the left pipeline (scan and filter T): Source: https://dl.acm.org/doi/10.1145/2588555.2610507 The pipeline that processes T operates in two phases. In the first phase, a pool of threads (each pinned to a core) collectively process all morsels in T. When a thread comes up for air, it grabs another morsel of input.
Threads are assigned to a NUMA node, and threads prefer to process morsels assigned to the same NUMA node. If no morsels of matching color are available, then a thread will reluctantly process a morsel from another NUMA node. During the first phase, each thread writes filtered tuples into a thread-local storage area. Once all input tuples have been processed, a hash table is created (conveniently, the hash table can be sized well because the number of tuples that must be stored is known). This hash table is global (i.e., not NUMA aware). In the second phase, tuples are inserted into the hash table. The HyPer hash table is designed to allow lock-free insertions, along with a fast path for the common case where a probe operation yields no hits. The hash table uses chaining, with very special pointers used to point to the next element in the chain. The lower 48 bits of each pointer in the hash table contain the memory address that is being pointed at; the upper 16 bits are a tag. Think of the tag as a 16-bit Bloom filter describing the set of elements in the sub-list that the pointer points to. When the hash table is probed, a hash of the join key from the probe tuple is used both to determine which chain to search in, and to stop the search early if no possible unexamined list element could contain the join key. Because both the pointer and the tag are packed into 64 bits, a compare-and-swap instruction can be used to insert elements into the hash table without using an OS lock. If the hash table is large enough, then most executions of the compare-and-swap instruction should succeed. Fig. 7 illustrates the hash table data structure and the insertion algorithm: Source: https://dl.acm.org/doi/10.1145/2588555.2610507 It is a bit odd that the design in this paper goes to such great lengths to avoid cross-socket (NUMA) reads from main memory, and yet the hash table is not NUMA aware. I think that the 16-bit tags are key here.
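The tagged-pointer trick can be sketched as follows. This is an illustrative Python model (the exact filter-bit choice is my own assumption; HyPer derives tag bits from the key's hash and installs the packed 64-bit word with a compare-and-swap):

```python
# Sketch of HyPer's tagged hash-table pointers. Low 48 bits: address of the
# chain head. High 16 bits: a tiny Bloom filter over the hashes of all keys
# reachable through that pointer.

ADDR_MASK = (1 << 48) - 1

def filter_bit(key_hash):
    # pick one of 16 filter bits from the key's hash (assumed scheme)
    return 1 << (key_hash & 0xF)

def pack(addr, tag):
    return (tag << 48) | (addr & ADDR_MASK)

def insert_head(bucket_word, new_elem_addr, key_hash):
    # the new element becomes the chain head; the tag keeps all old bits
    # and adds the new key's bit (installed with a single 64-bit CAS)
    old_tag = bucket_word >> 48
    return pack(new_elem_addr, old_tag | filter_bit(key_hash))

def probe_may_hit(bucket_word, key_hash):
    # fast path: if the key's filter bit is clear, the chain cannot
    # contain the key, so the probe stops without touching memory
    return (bucket_word >> 48) & filter_bit(key_hash) != 0

word = pack(0, 0)                             # empty bucket
word = insert_head(word, 0x7f00_0000, 0x21)   # insert a key, hash 0x21
print(probe_may_hit(word, 0x41))              # True: same filter bit, might hit
print(probe_may_hit(word, 0x22))              # False: definite miss, no memory read
```

Packing the filter into the pointer word is what makes the miss fast path free: a probe that misses never dereferences the chain at all.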
If the set of head pointers for all buckets in the hash table is small enough to fit into an L2 cache, then this data can be efficiently replicated into all L2 caches. As long as the hash table hit rate is low enough, the number of cross-socket memory accesses during probe operations will be low. Fig. 11 shows throughput for all 22 TPC-H queries, for 4 different configurations: Source: https://dl.acm.org/doi/10.1145/2588555.2610507 It is interesting how NUMA-awareness matters a lot for some queries, but not all. Fig. 10 shows the authors’ NUMA model: Source: https://dl.acm.org/doi/10.1145/2588555.2610507 What is interesting here is that a similar setup exists within each socket. Say a socket contains 32 cores, and 4 memory controllers. Those cores and memory controllers will be laid out in a grid, with a network connecting them. I wonder if there is performance to be gained by paying attention to the intra-socket layout (e.g., cores on the left side of the chip should only access memory controllers on the left side of the chip).
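The NUMA-preferred morsel dispatch described earlier can be sketched as follows (an illustrative Python model with hypothetical names; HyPer pins workers to cores and hands out ~100,000-tuple morsels with lock-free dispatch):

```python
# Sketch of NUMA-preferred morsel dispatch with reluctant work stealing.
from collections import deque

class Dispatcher:
    def __init__(self, n_nodes):
        self.queues = [deque() for _ in range(n_nodes)]  # one queue per NUMA node

    def add_morsel(self, node, morsel):
        self.queues[node].append(morsel)

    def next_morsel(self, my_node):
        # prefer a morsel that lives on the caller's NUMA node ...
        if self.queues[my_node]:
            return self.queues[my_node].popleft()
        # ... but reluctantly steal from another node rather than sit idle
        for q in self.queues:
            if q:
                return q.popleft()
        return None  # this pipeline phase is finished

d = Dispatcher(n_nodes=2)
d.add_morsel(0, "T[0:100000]")
d.add_morsel(1, "T[100000:200000]")
print(d.next_morsel(1))  # T[100000:200000] (local morsel preferred)
print(d.next_morsel(1))  # T[0:100000] (stolen from node 0)
```

The stealing fallback is what keeps all cores busy at the tail of a phase, at the cost of a few remote reads.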


The Decomposition of Switching Functions

The Decomposition of Switching Functions Robert Ashenhurst International Symposium on the Theory of Switching Functions'59 Say you are given a 4-input logic formula f and you want to rewrite it in a more structured form. One way to impose structure is to express f as the composition of simpler functions. You can perform such a rewrite with the Ashenhurst-Curtis decomposition, and the result expresses f as one function applied to some of the inputs plus the output of another. This transformation is useful in logic synthesis. The task of synthesizing a circuit to implement the 4-input function f can be broken down into synthesizing circuits to implement:

The 2-input function (called ϕ below)
The 3-input function (called F below)

The goal of the Ashenhurst-Curtis decomposition is to partition the inputs of a logic formula (the four input variables in the running example) into two disjoint sets. The partitioning depends on the logic formula being decomposed. The idea is to decompose the formula into two simpler formulas:

Formula 1 accepts one set of variables as inputs
Formula 2 accepts the other set of variables as inputs, along with the output of #1

A way to do this is to search through all possible partition matrices. Fig. 4 shows the partition matrix for the decomposition in the running example: Source: https://people.eecs.berkeley.edu/~alanmi/publications/other/ashenhurst1959.pdf Think of a partition matrix as a two-dimensional truth table: the matrix above gives the explicit value of f for every input combination. In other words, f is non-zero for exactly 4 input combinations (which correspond to the 4 entries in the partition matrix with the value 1). For a given logic formula, there exist many partition matrices. Only some are suitable for decomposing the logic formula. The key property to look for is the column multiplicity of the partition matrix. This column multiplicity is the number of unique column vectors in the partition matrix. The partition matrix above has a column multiplicity of 2 (i.e., there are only two unique column vectors in the matrix).
If you can find a partition matrix with a column multiplicity less than or equal to two, then you can decompose the logic formula. Note that a partition matrix is defined not only by the values in the matrix, but also by the partitioning of the input variables. If you have located a partition matrix with a column multiplicity no greater than two, then the row multiplicity must be no greater than 4. In particular, there are 4 kinds of row vectors that you will see in the partition matrix:

Vectors of all 0s
Vectors of all 1s
A particular vector (call it v)
The inverse of v

In the running example, the row vectors are:

[1,0,0,1] (we call this v)
[0,1,1,0] (note how this is the inverse of v)

To decompose f, first generate a function ϕ. The inputs to ϕ are the variables written at the top of the partition matrix. The truth table for ϕ is simply the first non-trivial row vector in the partition matrix ([1,0,0,1] in the example). Next, generate a function F. The inputs to F are the variables written at the left of the partition matrix, and the output of ϕ. In this example, F has 3 inputs, so the truth table defining F has 8 entries. A simple algorithm is used to generate the entries in that truth table. The algorithm iterates over the row vectors in the partition matrix, and each row vector defines 2 elements in the truth table of F.
Treating the output of ϕ as the low-order input bit of F, row vector i of the partition matrix defines the truth table elements of F at indices 2i and 2i+1:

If row vector i is all zeros, then the truth table elements at indices 2i and 2i+1 are 0
If row vector i is all ones, then the truth table elements at indices 2i and 2i+1 are 1
If row vector i is v, then the truth table element at index 2i is 0, and the element at index 2i+1 is 1
If row vector i is the inverse of v, then the truth table element at index 2i is 1, and the element at index 2i+1 is 0

These rules are described in Table 1: Source: https://people.eecs.berkeley.edu/~alanmi/publications/other/ashenhurst1959.pdf The final decomposed function is F applied to the left-hand variables and the output of ϕ. Until now, we’ve assumed that someone has handed us a partition matrix with column multiplicity no greater than two. But how does one find such a matrix? The paper proposes iterating through all partition matrices in the partition chart of the input formula. A partition chart contains many partition matrices, each one corresponding to a different partitioning of the input variables. A circled element in the partition chart corresponds to a 1 in the truth table of the input logic formula. Fig. 5 shows partition charts for the running example. The one in the lower right corner is the partition matrix in the running example. Astute readers will notice that it is actually the transpose of that partition matrix (i.e., the variable sets on the left and the top are swapped), but that is no big deal: a matrix can be transposed as long as the variable names are transposed along with the matrix entries. Source: https://people.eecs.berkeley.edu/~alanmi/publications/other/ashenhurst1959.pdf Once you have generated a partition chart, it is straightforward to search for partition matrices which have column (or row) multiplicity no greater than two. These are the ones that can be used to generate the decomposed function.
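The column-multiplicity test is easy to express in code. Here is a small Python sketch of my own (brute force over all input assignments; the variable names and the example formula are illustrative, not the paper's):

```python
# Test whether a partitioning of inputs admits a simple disjoint
# decomposition by counting unique columns of the partition matrix.
from itertools import product

def column_multiplicity(f, free_vars, bound_vars):
    """f maps a dict of 0/1 variable values to 0/1. Columns are indexed by
    the bound variables (top of the matrix), rows by the free variables."""
    columns = set()
    for bound in product((0, 1), repeat=len(bound_vars)):
        col = tuple(
            f({**dict(zip(bound_vars, bound)), **dict(zip(free_vars, free))})
            for free in product((0, 1), repeat=len(free_vars))
        )
        columns.add(col)
    return len(columns)

# Example: f = (x1 XOR x2) AND (x3 OR x4), decomposable with {x1, x2} bound
f = lambda v: (v["x1"] ^ v["x2"]) & (v["x3"] | v["x4"])
m = column_multiplicity(f, ["x3", "x4"], ["x1", "x2"])
print(m)  # 2 -> a decomposition F(x3, x4, phi(x1, x2)) exists
```

With a bad partitioning of the same formula (say, bound variables {x1, x3}) the multiplicity exceeds two, so that partition matrix cannot be used for the decomposition.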


Oasis: Pooling PCIe Devices Over CXL to Boost Utilization

Oasis: Pooling PCIe Devices Over CXL to Boost Utilization Yuhong Zhong, Daniel S. Berger, Pantea Zardoshti, Enrique Saurez, Jacob Nelson, Dan R. K. Ports, Antonis Psistakis, Joshua Fried, and Asaf Cidon SOSP'25 If you are like me, you’ve dabbled with software prefetching but never had much luck with it. Even if you care nothing about sharing of PCIe devices across servers in a rack, this paper is still interesting because it shows a use case where software prefetching really matters. I suppose it is common knowledge that CXL enables all of the servers in a rack to share a pool of DRAM . The unique insight from this paper is that once you’ve taken this step, sharing of PCIe devices (e.g., NICs, SSDs) can be implemented “at near-zero extra cost.” Fig. 2 shows how much SSD capacity and NIC bandwidth are stranded in Azure. A stranded resource is one that is underutilized because some other resource (e.g., CPU or memory) is the bottleneck. Customers may not be able to precisely predict the ratios of CPU:memory:SSD:NIC resources they will need. Even if they could, a standard VM size may not be available that exactly matches the desired ratio. Source: https://dl.acm.org/doi/10.1145/3731569.3764812 Additionally, servers may have redundant components which are used in case of a hardware failure. The paper cites servers containing redundant NICs to avoid the server disappearing off the network if a NIC fails. Pooling of PCIe devices could help both of these problems. The VM placement problem is easier if resources within a rack can be dynamically allocated, rather than at the server level. Similarly, a rack could contain redundant devices which are available to any server in the rack which experiences a hardware failure. Fig. 4 shows the Oasis architecture: Source: https://dl.acm.org/doi/10.1145/3731569.3764812 In this example, a VM or container running on host A uses the NIC located on host B. 
An Oasis frontend driver on host A and a backend driver on host B make the magic happen. The communication medium is a shared pool of memory that both hosts can access over CXL. The shared memory pool stores both the raw network packets and message queues which contain pointers to the network packets.

A tricky bit in this design is the assumption that the CPU caches in the hosts do not have a coherent view of the shared memory pool (i.e., there is no hardware cache coherence support). This quote sums up the reasoning behind this assumption:

Although the CXL 3.0 specification introduces an optional cross-host hardware coherence flow [11], the implementation requires expensive hardware changes on both the processor and the device [74, 105, 143, 145]. To make Oasis compatible with hardware available today, we do not assume cache-coherent CXL devices.

Cache Coherency

Here is the secret sauce that Oasis uses to efficiently send a message from the frontend driver to the backend driver. Note that this scheme is used for the message channels (i.e., descriptors, packet metadata). The shared memory pool is mapped by both drivers as cacheable. The frontend driver writes the message into shared memory, increments a tail pointer (also stored in shared CXL memory), and then forces the containing cache lines to be written back to the shared memory pool by executing a cache-line flush instruction.

The backend driver polls the tail pointer. If polling reveals that there are no new messages, the driver invalidates the cache line containing the tail pointer (with a cache-line flush followed by a memory fence). This handles the case where there actually are new messages available, but the backend driver is reading a cached (stale) copy of the tail pointer. The backend driver then speculatively prefetches 16 cache lines of message data (with software prefetch instructions). When the backend driver detects that the tail pointer has been incremented, it processes all new messages.
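A minimal user-space sketch of this message channel, assuming x86, with `_mm_clflush` standing in for whatever flush/invalidate instructions the paper actually uses and `_mm_prefetch` for the speculative prefetch. The ring layout and function names are illustrative, not from the paper:

```c
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence (SSE2, x86 assumed) */
#include <xmmintrin.h>  /* _mm_prefetch */

#define CL 64          /* cache line size */
#define RING_SLOTS 64  /* one message per cache line */

struct ring {
    _Alignas(CL) char slot[RING_SLOTS][CL];  /* message payloads */
    _Alignas(CL) uint64_t tail;              /* producer-advanced index */
};

/* Frontend (producer): write the message, flush its line, then bump and
 * flush the tail. On a single coherent host the flushes change nothing
 * observable; across non-coherent CXL hosts they are what push the data
 * out to the shared pool. */
static void ring_send(struct ring *r, const void *msg, size_t len)
{
    char *line = r->slot[r->tail % RING_SLOTS];
    memcpy(line, msg, len);
    _mm_clflush(line);          /* write payload back to the shared pool */
    _mm_sfence();               /* payload must land before the tail does */
    r->tail++;
    _mm_clflush(&r->tail);      /* write new tail back to the shared pool */
    _mm_sfence();
}

/* Backend (consumer): poll the tail. On a miss, invalidate the possibly
 * stale cached tail and speculatively prefetch upcoming message lines so
 * a later hit overlaps the transfer with computation. After consuming a
 * message, flush its line so future prefetches are not no-ops. */
static int ring_poll(struct ring *r, uint64_t *head, void *out, size_t len)
{
    if (*head == r->tail) {
        _mm_clflush(&r->tail);  /* drop the stale cached copy of tail */
        _mm_sfence();
        for (int i = 0; i < 16; i++)   /* speculative 16-line prefetch */
            _mm_prefetch(r->slot[(*head + i) % RING_SLOTS], _MM_HINT_T0);
        return 0;               /* nothing visible yet */
    }
    char *line = r->slot[*head % RING_SLOTS];
    memcpy(out, line, len);
    _mm_clflush(line);          /* invalidate so the next prefetch refetches */
    (*head)++;
    return 1;
}
```

On one host this compiles and runs as an ordinary ring buffer; the point of the sketch is where the flushes sit relative to the tail update, which is what makes the scheme safe without hardware coherence.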
Hopefully there is more than one message, and the software prefetch instructions will overlap computation with the transfer from the shared memory pool.

After processing the message(s), the backend driver invalidates the memory where those messages are stored. This is critical, because it allows subsequent prefetch instructions to work. A prefetch instruction does nothing if the target cache line is already cached (even though it may be stale). The speculative 16-cache-line prefetch also suffers from the same issue. Say 4 of the 16 prefetched lines had valid messages, and 12 did not. Those 12 cache lines are now in the backend CPU cache, and future prefetch instructions targeting them will do nothing. To solve this problem, the backend driver also invalidates speculatively prefetched cache lines that did not contain any messages.

Send and Receive Flows

Fig. 7 illustrates the end-to-end packet transmit flow:

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

Here are the steps:

The network stack running in the VM/container on host A writes packet data into the packet buffer in CXL memory. Note that the network stack doesn't "know" that it is writing network packets to shared memory.
The frontend driver writes a message in the queue stored in shared CXL memory.
The frontend driver uses cache-line flush instructions to flush the lines associated with the network packet data, the message, and the message queue tail pointer.
The backend driver polls the tail pointer for new messages in the queue (using the prefetching tricks described previously).
The backend driver uses DPDK to cause the NIC on host B to transmit the packet. Note that the CPU cores on host B do not need to actually read the network packet data; the NIC uses DMA to read this data directly from the shared memory pool.

The steps to receive a packet are similar:

The NIC on host B writes the packet data (via DMA) into the shared memory pool.
The backend driver uses DPDK to learn that a new packet has arrived.
The backend driver writes a message into the message queue in shared memory.
The frontend driver polls the message queue (using the prefetch tricks).
The network stack running in the VM/container on host A reads the packet data from shared memory.

One trick used here is flow tagging. This is a DPDK feature that enables the NIC to determine which host a packet is destined for, without the backend driver having to inspect network packet headers.

Fig. 8 shows measurements of the overhead added by Oasis. The solid lines are the baseline; the dotted lines are Oasis. Each color represents a different latency bucket. The baseline uses a NIC which is local to the host running the benchmark. The overhead is measurable, but not excessive.

Source: https://dl.acm.org/doi/pdf/10.1145/3731569.3764812

Dangling Pointers

The paper doesn't touch on the complexities related to network virtualization in a pooled device scheme. It seems to me that solving these problems wouldn't affect performance but would require significant engineering.
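To make the flow-tagging idea concrete: once hardware tagging rules are installed, each received packet already carries a small integer identifying its flow, so the backend's dispatch to the right frontend queue is just an array index. This sketch is my own construction; the field and function names are illustrative, not DPDK's or the paper's:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_HOSTS 16

/* Stand-in for a received packet descriptor. With hardware flow-tagging
 * rules installed, the NIC stamps each packet with a flow mark, so
 * software never parses headers. (Field names are illustrative.) */
struct rx_desc {
    uint32_t flow_mark;   /* set by the NIC in hardware */
    void    *pkt;         /* packet data in the CXL pool */
};

/* One message queue per frontend host. */
struct msg_queue { void *slots[256]; unsigned tail; };

static struct msg_queue queues[MAX_HOSTS];

/* Backend dispatch: the mark directly selects the destination host's
 * queue -- O(1), with no header inspection on the backend cores. */
static int dispatch(const struct rx_desc *d)
{
    if (d->flow_mark >= MAX_HOSTS)
        return -1;                      /* unknown flow: slow path */
    struct msg_queue *q = &queues[d->flow_mark];
    q->slots[q->tail++ % 256] = d->pkt;
    return (int)d->flow_mark;
}
```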

Dangling Pointers 2 months ago

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu
SOSP'25

Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn't that what an OS is for?

My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to try to harden a large OS such that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and instead introduce the hypervisor.

This kind of decision comes up in other circumstances too. There are often two ways to go about implementing something. The first involves widespread changes to legacy code, and the other involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions.

The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS.
The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling. Some cores are reserved for data-plane tasks, while other cores are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dynamic Scheduling Headaches

Dynamic scheduling seems to be a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren't many packets flowing.

Section 3.2 says this is a no-go. It would be great if there were more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured so that it will route packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock and then synchronously communicate with the accelerator.

The authors take this design as immutable. It seems plausible that the communication with the accelerator could be done asynchronously, but that would likely have ramifications for the entire control-plane software stack. This quote is telling:

Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification.
Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU); the other two are physical CPUs (pCPUs). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality, a Linux kernel module (5,800 lines of code) allows the vCPU to run only at times of low data-plane activity.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

One neat trick the paper describes is the hardware workload probe. This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parsing of packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This enables the system to hide the latency of the context switch from vCPU to pCPU.

Think of it like a group of students in a classroom with no teachers present (the teachers play the role of network packets). The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees a teacher (a network packet) returning, they shout "act responsible," and everyone returns to their schoolwork (running data-plane code).

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Results

Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851
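The interplay between the workload probe and the vCPU gate can be shown with a toy single-core simulation (entirely my own construction; the real mechanism is the 5,800-line kernel module). The probe goes high one tick before a packet reaches the core, so the gate can evict the control-plane vCPU in time:

```c
#include <stdbool.h>

/* Toy simulation of Tai Chi-style core gating. probe[t] models the
 * hardware workload probe: it goes high shortly before packet[t+1]
 * arrives at the core, hiding the vCPU-to-pCPU switch latency. */
struct sim_result { int pkts_handled; int cp_slices; int pkts_delayed; };

static struct sim_result run_gate(const bool *probe, const bool *packet, int ticks)
{
    struct sim_result r = {0, 0, 0};
    bool on_vcpu = false;   /* is the control-plane vCPU holding the core? */
    for (int t = 0; t < ticks; t++) {
        if (probe[t])
            on_vcpu = false;        /* lookout: preempt the vCPU early */
        if (packet[t]) {
            if (on_vcpu)
                r.pkts_delayed++;   /* vCPU still held the core: latency hit */
            on_vcpu = false;
            r.pkts_handled++;       /* data-plane work */
        } else if (!on_vcpu && !probe[t]) {
            on_vcpu = true;         /* coast is clear: run control plane */
            r.cp_slices++;
        } else if (on_vcpu) {
            r.cp_slices++;          /* keep running control plane */
        }
    }
    return r;
}
```

With the probe leading each packet by one tick, the vCPU is always off the core before the packet lands, so control-plane work fills the gaps without ever delaying a packet.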
Dangling Pointers

To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase will be tractable. If that happens, then Jiu-Jitsu moves like this one will be less important.
