
Oasis: Pooling PCIe Devices Over CXL to Boost Utilization

Oasis: Pooling PCIe Devices Over CXL to Boost Utilization
Yuhong Zhong, Daniel S. Berger, Pantea Zardoshti, Enrique Saurez, Jacob Nelson, Dan R. K. Ports, Antonis Psistakis, Joshua Fried, and Asaf Cidon
SOSP'25

If you are like me, you’ve dabbled with software prefetching but never had much luck with it. Even if you care nothing about sharing PCIe devices across servers in a rack, this paper is still interesting because it shows a use case where software prefetching really matters.

I suppose it is common knowledge that CXL enables all of the servers in a rack to share a pool of DRAM. The unique insight from this paper is that once you’ve taken this step, sharing of PCIe devices (e.g., NICs, SSDs) can be implemented “at near-zero extra cost.”

Fig. 2 shows how much SSD capacity and NIC bandwidth are stranded in Azure. A stranded resource is one that is underutilized because some other resource (e.g., CPU or memory) is the bottleneck. Customers may not be able to precisely predict the ratios of CPU:memory:SSD:NIC resources they will need. Even if they could, a standard VM size may not be available that exactly matches the desired ratio.

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

Additionally, servers may have redundant components which are used in case of a hardware failure. The paper cites servers containing redundant NICs to avoid a server disappearing off the network if a NIC fails.

Pooling PCIe devices could help with both of these problems. The VM placement problem is easier if resources can be allocated dynamically within a rack, rather than at the server level. Similarly, a rack could contain redundant devices available to any server in the rack that experiences a hardware failure.

Datapath

Fig. 4 shows the Oasis architecture:

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

In this example, a VM or container running on host A uses the NIC located on host B.
An Oasis frontend driver on host A and a backend driver on host B make the magic happen. The communication medium is a shared pool of memory that both hosts can access over CXL. The shared memory pool stores both the raw network packets and message queues which contain pointers to the network packets.

A tricky bit in this design is the assumption that the CPU caches in the hosts do not have a coherent view of the shared memory pool (i.e., there is no hardware cache coherence support). This quote sums up the reasoning behind this assumption:

Although the CXL 3.0 specification introduces an optional cross-host hardware coherence flow [11], the implementation requires expensive hardware changes on both the processor and the device [74, 105, 143, 145]. To make Oasis compatible with hardware available today, we do not assume cache-coherent CXL devices.

Cache Coherency

Here is the secret sauce that Oasis uses to efficiently send a message from the frontend driver to the backend driver. Note that this scheme is used for the message channels (i.e., descriptors, packet metadata). The shared memory pool is mapped by both drivers as cacheable.

The frontend driver writes the message into shared memory, increments a tail pointer (also stored in shared CXL memory), and then forces the containing cache lines to be written to the shared memory pool by executing a cache-line write-back instruction.

The backend driver polls the tail pointer. If polling reveals that there are no new messages, the driver invalidates the cache line containing the tail pointer (with a cache-line flush followed by a fence). This handles the case where there actually are new messages available, but the backend driver is reading a cached (stale) copy of the tail pointer. The backend driver then speculatively prefetches 16 cache lines of message data (with software prefetch instructions). When the backend driver detects that the tail pointer has been incremented, it processes all new messages.
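A toy simulation makes the flush/invalidate dance concrete. This is my sketch, not the paper's code: each host's non-coherent cache is modeled as a dict, and `flush`/`invalidate` stand in for the cache-management instructions.

```python
class SharedPool:
    """CXL shared memory: the single source of truth for both hosts."""
    def __init__(self):
        self.mem = {"tail": 0}

class Host:
    """A CPU with a private cache that is NOT coherent with other hosts."""
    def __init__(self, pool):
        self.pool = pool
        self.cache = {}

    def read(self, addr):
        # A cache hit returns the cached line even if it is stale.
        if addr not in self.cache:
            self.cache[addr] = self.pool.mem.get(addr)
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value                # line is dirty in cache

    def flush(self, addr):
        self.pool.mem[addr] = self.cache[addr]  # write back to the pool

    def invalidate(self, addr):
        self.cache.pop(addr, None)              # next read re-fetches

pool = SharedPool()
frontend, backend = Host(pool), Host(pool)

first = backend.read("tail")                    # backend caches tail = 0

# Frontend: write the message, bump the tail, flush both lines.
frontend.write("msg[0]", "packet-descriptor")
frontend.write("tail", 1)
frontend.flush("msg[0]")
frontend.flush("tail")

stale = backend.read("tail")                    # cache hit on a stale line!
backend.invalidate("tail")                      # the trick from the paper
fresh = backend.read("tail")                    # re-fetched from the pool
```

Without the invalidate step, the backend would spin forever on its stale cached copy of the tail pointer.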
Hopefully there is more than one message, and the software prefetch instructions will overlap computation with the transfer from the shared memory pool.

After processing the message(s), the backend driver invalidates the memory where those messages are stored. This is critical, because it allows subsequent prefetch instructions to work. A prefetch instruction does nothing if the target cache line is already cached (even though it may be stale). The speculative 16-cache-line prefetch also suffers from the same issue. Say 4 of the 16 prefetched lines had valid messages, and 12 did not. Those 12 cache lines are now in the backend CPU cache, and future prefetch instructions targeting them will do nothing. To solve this problem, the backend driver also invalidates speculatively prefetched cache lines that did not contain any messages.

Send and Receive Flows

Fig. 7 illustrates the end-to-end packet transmit flow:

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

Here are the steps:

The network stack running in the VM/container on host A writes packet data into the packet buffer in CXL memory. Note that the network stack doesn’t “know” that it is writing network packets to shared memory.

The frontend driver writes a message in the queue stored in shared CXL memory.

The frontend driver flushes the cache lines associated with the network packet data, the message, and the message queue tail pointer.

The backend driver polls the tail pointer for new messages in the queue (using the prefetching tricks described previously).

The backend driver uses DPDK to cause the NIC on host B to transmit the packet. Note that the CPU cores on host B do not need to actually read the network packet data; the NIC uses DMA to read it directly from the shared memory pool.

The steps to receive a packet are similar:

The NIC on host B writes the packet data (via DMA) into the shared memory pool.

The backend driver uses DPDK to learn that a new packet has arrived.
The backend driver writes a message into the message queue in shared memory.

The frontend driver polls the message queue (using the prefetch tricks).

The network stack running in the VM/container on host A reads the packet data from shared memory.

One trick used here is flow tagging. This is a DPDK feature that enables the NIC to determine which host a message is destined for, without the backend driver having to inspect network packet headers.

Fig. 8 shows measurements of the overhead added by Oasis. The solid lines are the baseline; the dotted lines are Oasis. Each color represents a different latency bucket. The baseline uses a NIC which is local to the host running the benchmark. The overhead is measurable, but not excessive.

Source: https://dl.acm.org/doi/pdf/10.1145/3731569.3764812

Dangling Pointers

The paper doesn’t touch on the complexities related to network virtualization in a pooled device scheme. It seems to me that solving these problems wouldn’t affect performance but would require significant engineering.


Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds
Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu
SOSP'25

Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn’t that what an OS is for?

My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to harden a large OS to the point that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and instead introduce the hypervisor.

This kind of decision comes up in other circumstances too. There are often two ways to implement something. The first involves widespread changes to legacy code; the other involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions.

The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS.
The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling: some cores are reserved for data-plane tasks, while others are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dynamic Scheduling Headaches

Dynamic scheduling seems like a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren’t many packets flowing.

Section 3.2 says this is a no-go. It would be great if there were more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured so that it will route packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock and then synchronously communicate with the accelerator.

The authors take this design as immutable. It seems plausible that the communication with the accelerator could be done asynchronously, but that would likely have ramifications for the entire control-plane software stack. This quote is telling:

Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification.
Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU); the other two are physical CPUs (pCPUs). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality, a Linux kernel module (5,800 lines of code) allows the vCPU to run only at times of low data-plane activity.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

One neat trick the paper describes is the hardware workload probe. This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parsing packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This lets the system hide the latency of the context switch from vCPU to pCPU.

Think of it like a group of students in a classroom whose teachers (the network packets) have stepped out. The kids nominate one student to be on the lookout for an approaching adult. While the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees a teacher (a network packet) returning, they shout “act responsible”, and everyone returns to their schoolwork (running data-plane code).

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Results

Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server.
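The lookout scheme can be modeled with a toy scheduler. This is purely illustrative: the 3-microsecond lead time comes from Fig. 10, but `schedule`, the tick granularity, and the probe set are my inventions, not the paper's design.

```python
PROBE_LEAD = 3   # Fig. 10: the accelerator sees a packet >= 3 us early

def schedule(ticks, probe):
    """For each tick, decide what runs on the shared ARM core. `probe`
    is the set of ticks at which the hardware probe spots an incoming
    packet, PROBE_LEAD ticks before the core must be in data-plane mode."""
    ran, pending = [], 0
    for t in ticks:
        if t in probe:
            pending += PROBE_LEAD        # packet inbound: reclaim the core
        if pending > 0:
            ran.append("data-plane")     # vCPU preempted ahead of arrival
            pending -= 1
        else:
            ran.append("control-plane")  # coast is clear: CP tasks may run
    return ran

# A packet is spotted by the probe at tick 3; the core switches to
# data-plane work before the packet reaches it, then switches back.
plan = schedule(range(8), probe={3})
```

The point of the lead time is that the vCPU-to-pCPU context switch completes before the packet needs CPU attention, so data-plane latency is unaffected.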
Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dangling Pointers

To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase will be tractable. If that happens, then Jiu-Jitsu moves like this one will be less important.


Backdoors To Typical Case Complexity

Backdoors To Typical Case Complexity
Ryan Williams, Carla P. Gomes, and Bart Selman
IJCAI'03

SAT solvers frequently perform much better on real-world problems than would be predicted by a worst-case analysis. This paper provides a clue as to why some problems are tractable and presents algorithms to solve such problems. The analysis here applies equally to SAT problems and CSP problems. This quote sums up the motivating idea behind the paper (all variables are not created equal):

Another powerful intuition in the design of search methods is that one wants to select variables that simplify the problem instance as much as possible when these variables are assigned values.

A backdoor is a subset of the variables in a SAT/CSP problem. There exists at least one assignment of backdoor variables to values such that the values of the remaining variables can easily be found by a sub-solver. A sub-solver is a SAT/CSP solver which runs in polynomial time and can solve a subset of SAT/CSP problems. The input to the sub-solver is a SAT/CSP problem, along with a partial assignment. A partial assignment assigns concrete values to some variables (e.g., the backdoor variables). When we say that a sub-solver “solves” a problem, we mean that it returns one of:

A mapping of variable names to values which satisfies all constraints

“Unsatisfiable” - the constraints cannot all be satisfied

“Too hard” - the problem is not one of the kinds that the sub-solver can handle

An example sub-solver for SAT problems is one that can only solve 2-SAT problems. Table 2 contains backdoor sizes from real-world SAT problems:

Source: https://www.ijcai.org/Proceedings/03/Papers/168.pdf

Think about that for a second. The first row means that in a problem with 6783 variables, the crux of the issue can be solved by only considering 12 variables.
So rather than searching for a needle in a haystack of size 2^6783, one only needs to find the correct values for the 12 variables in the backdoor (2^12 options to explore). Once that assignment has been found, the sub-solver can find the values of the other 6771 variables. The only trouble is how one finds a backdoor.

The deterministic strategy optimistically assumes there exists a small backdoor. First it assumes the backdoor contains 1 variable. If that fails, then the strategy assumes the backdoor contains 2 variables, and so on. At step k, the deterministic strategy enumerates all possible backdoors of size k. For each candidate backdoor, it uses a backtracking search (which will invoke a sub-solver) to try to solve the problem. Backtracking search is more efficient than exhaustively searching all 2^N variable assignments because this search can invoke the sub-solver with a partial assignment of fewer than k variables, in the hopes that the sub-solver can fail quickly.

The randomized strategy is similar to the deterministic strategy, but it chooses random subsets of variables to perform a backtracking search on. Again, the process begins with a bias that the size of the backdoor is small and gradually increases the expected size of the random subset.

The key question running through my mind is: “are there ways to transform SAT problems such that the output of the transformation has a smaller backdoor than the input?”
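To make the deterministic strategy concrete, here is a sketch. This is my code, not the paper's: it uses unit propagation as the polynomial-time sub-solver and, for brevity, tries full assignments of each candidate backdoor rather than a true backtracking search.

```python
from itertools import combinations, product

def unit_propagate(clauses, assignment):
    """A polynomial-time sub-solver: repeatedly assign variables forced
    by unit clauses. A literal is a nonzero int; negative means negated.
    Returns a satisfying assignment (dict), "UNSAT", or "TOO_HARD"."""
    assignment = dict(assignment)
    while True:
        changed = False
        remaining = []
        for clause in clauses:
            # Drop literals already made false by the assignment.
            lits = [l for l in clause if assignment.get(abs(l)) != (l < 0)]
            if any(assignment.get(abs(l)) == (l > 0) for l in lits):
                continue                     # clause already satisfied
            if not lits:
                return "UNSAT"               # every literal is false
            if len(lits) == 1:
                assignment[abs(lits[0])] = lits[0] > 0
                changed = True
            else:
                remaining.append(lits)
        if not remaining:
            return assignment
        if not changed:
            return "TOO_HARD"                # beyond this sub-solver
        clauses = remaining

def find_backdoor(clauses, variables):
    """Deterministic strategy: try every candidate backdoor of size 1,
    then size 2, and so on, until the sub-solver finishes the job."""
    for k in range(1, len(variables) + 1):
        for backdoor in combinations(variables, k):
            for values in product([False, True], repeat=k):
                result = unit_propagate(clauses, dict(zip(backdoor, values)))
                if isinstance(result, dict):
                    return set(backdoor), result
    return None

# (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3):
# fixing x1 alone lets unit propagation finish the rest.
bd, model = find_backdoor([[1, 2], [-1, 3], [-2, -3]], [1, 2, 3])
```

On the example formula, the search discovers that {x1} is a backdoor: once x1 is fixed, every remaining clause collapses to a unit clause.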


How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service

How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service
Jingkai He, Yunpeng Dong, Dong Du, Mo Zou, Zhitai Yu, Yuxin Ren, Ning Jia, Yubin Xia, and Haibo Chen
SOSP'25

Systems spend a lot of time copying bytes from here to there. Fig. 2(a) shows data for some benchmarks. Copy operations are bucketized by size and by whether the copy occurs on behalf of user or kernel code.

Source: https://dl.acm.org/doi/10.1145/3731569.3764800

One interesting piece of data missing from that figure is how much CPU time is spent waiting on cache misses. A more subtle point is how long applications wait between when a copy operation completes and when they actually read from the destination buffer. The authors call this the copy-use window. Fig. 3 has some data on this point. Bars above the gray triangle represent opportunities for latency hiding.

Source: https://dl.acm.org/doi/10.1145/3731569.3764800

API

This paper proposes a new OS service (Copier), which performs copy operations on behalf of user and kernel clients. Clients interact with Copier through a set of queues.

Copy requests are submitted via descriptors in the Copy Queue. Descriptors contain pointers to bitmaps, which clients can use to monitor progress. Each bit in the bitmap corresponds to the completion of a single segment of a copy. For example, a client could submit a 1MiB copy but read data from 4KiB destination segments as soon as they are written.

The Sync Queue is used to raise the priority of an already-submitted task. This is a fallback for when the application has left the copy-use window and has nothing useful to do other than wait for the copy to make progress.

The Handler Queue enables arbitrary functions to be called when a copy operation completes.

Applications interact with these queues indirectly via a set of functions (e.g., to initiate a copy, to wait for an operation to complete). Because Copier is a centralized service, it can perform global optimizations.
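Before getting to those optimizations, the per-segment progress bitmap is easy to picture with a sketch. Names like `CopyDescriptor` and `copier_step` are mine, not the paper's API; the point is just that one bit per 4KiB segment lets a client consume early parts of a copy while later parts are still in flight.

```python
SEGMENT = 4 * 1024                # bitmap granularity: one bit per 4 KiB

class CopyDescriptor:
    """A copy request whose progress is published one segment at a time."""
    def __init__(self, src, dst_len):
        self.src = src
        self.dst = bytearray(dst_len)
        self.bitmap = 0           # bit i set => destination segment i done

    def segment_ready(self, offset):
        return bool((self.bitmap >> (offset // SEGMENT)) & 1)

def copier_step(desc, seg):
    """One unit of work by the copy service: copy segment `seg` of the
    source into the destination, then publish completion via the bitmap."""
    lo = seg * SEGMENT
    hi = min(lo + SEGMENT, len(desc.dst))
    desc.dst[lo:hi] = desc.src[lo:hi]
    desc.bitmap |= 1 << seg

src = bytes(range(256)) * 4096    # a 1 MiB source buffer
desc = CopyDescriptor(src, len(src))

copier_step(desc, 0)              # the service has finished segment 0
assert desc.segment_ready(0)      # client may already consume dst[0:4096]
assert not desc.segment_ready(SEGMENT)  # segment 1 is still in flight
```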
Copy Absorption

An example given in the paper involves two copies. First, a large amount of data is copied from a kernel buffer to an intermediate buffer. Next, a smaller amount of data is copied from the intermediate buffer into the database. Copier can detect this situation and only read a subset of the bytes from the kernel buffer. One trick that Copier uses is watching calls to its wait API to know when data will be accessed. For example, if the application never waits on the intermediate buffer, and waits only on a subset of the final destination, then Copier knows that only a subset of the source needs to be read.

Clients can expose more global optimization opportunities to Copier by marking copy tasks as lazy. Lazy copy tasks do not execute unless they need to. An application could supply this hint if it knows that the destination of a copy operation will likely only be used as the source of a future copy operation. Copier will defer copying the bits associated with the first copy operation, under the assumption that the application will never wait on its destination. Subsequent copy operations will cut out the middleman and read from the original source.

Copy Threads

The task of actually copying bits is delegated to a set of kernel threads. The number of threads dynamically scales with the amount of work. These threads use AVX instructions in parallel with dedicated DMA hardware. One issue with this setup is that these kernel threads could fault if a specified userspace address is inaccessible (the data is paged out, or the client submitted a bad address). Copier proactively avoids this by locking pages in memory for the duration of the copy (this step is also needed when using DMA hardware).

Results

Fig. 11 shows benchmark results for Redis; section 6 of the paper has similar results for other use cases.

Source: https://dl.acm.org/doi/10.1145/3731569.3764800

Dangling Pointers

Copies that hit in cache are very different from those which do not. I think more exploration here is warranted.
Section 6.3.5 points out that Copier has pros and cons related to CPU caches:

Pro: because Copier uses separate threads, it hopefully runs on separate cores from the application. This can prevent large copy operations from evicting hot application data.

Con: when Copier finishes copying data, the destination bytes are likely not in the L1/L2 of an application core, causing it to have to read data from the LLC/DRAM.

Because copy operations are spread throughout the stack, careful design of the API is important. For example, the paper describes synchronization between user and kernel requests for the same process, but what about synchronization across two processes?


DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge

DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge Arjun Kashyap, Yuke Li, and Xiaoyi Lu HPDC'25 This paper ends with a counterintuitive result: DPUs aren’t amazingly better than traditional CPUs at implementing key-value stores. You would think they would have a shot, given that they have specialized networking accelerators, and key-value stores don’t require a lot of general-purpose computation. It all comes down to access to off-chip memory. Table 1 and Fig. 2 contain profiling results which motivate the designs presented by this paper: Source: https://dl.acm.org/doi/10.1145/3731545.3731571 When a key-value store is run on a traditional CPU, most of the time is spent in packet processing rather than CRUD operation on (key, value) tuples. DPUs are chips that are specialized for packet processing, so one would think a DPU would be especially helpful in this situation. The designs in this paper are evaluated on NVIDIA BlueField 2 and BlueField 3 DPUs. These DPUs contain a mix of fixed-function hardware and ARM cores. The fundamental idea proposed in this paper is to make the CPU’s job easier by having the DPU perform some of the awkward packet processing work and passing CRUD requests to the CPU in a convenient format. The DPU parses incoming packets, extracts the relevant fields, and writes the minimal amount of information (operation to perform, key, value) to queues in host memory. Fig. 8(d) shows the amount of cruft the DPU is able to remove from each CRUD packet. Source: https://dl.acm.org/doi/10.1145/3731545.3731571 The design described in this paper is optimized in all of the ways you would expect , requests are batched, and appropriate pipelining is used to avoid idle time. Cross-core synchronization (both DPU and host CPU cores) synchronization is minimized with the presence of many queues. Each ARM core owns a unique set of queues, and each host CPU core is assigned to one or more queues. As Fig. 
11 shows, the design described so far ( DPU-KV-lat and DPU-KV-sav ) doesn’t offer significant speedups. There are latency improvements, but throughput suffers. These designs are bound by the DPU. Evidence for this comes from the performance of DPU-KV-dual and DPU-KV-shrd . DPU-KV-dual allows idle host CPU cores to process raw packets. DPU-KV-shrd uses sharding. The set of keys is partitioned, with some keys handled by the CPU and some keys handled by the DPU. Source: https://dl.acm.org/doi/10.1145/3731545.3731571 Dangling Pointers The moral of the story is that there is no special advantage conferred on an ARM core inside of a SmartNIC over an Intel core inside of the host CPU. It is interesting to compare this work to a key-value store implemented directly in an RMT pipeline . It would be interesting to drill down into the profiling numbers which motivated this paper and understand how much memory-level parallelism a traditional CPU core can utilize. At the microarchitectural level, this problem has to be memory bound; it would be interesting to see whether it is bound by memory latency or bandwidth.
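The parse-and-enqueue handoff can be sketched in Python. Everything here is invented for illustration: the wire format, the queue layout, and the function names. The real system DMAs into ring buffers in host memory, not Python lists; the point is only that the host core never touches raw packet bytes, and per-core queue ownership avoids cross-core synchronization.

```python
# Sketch of the DPU-to-host handoff: the DPU parses a raw packet, drops
# the network cruft, and enqueues only (op, key, value); each host core
# drains its own queues, so no locking is needed.
from collections import deque

NUM_HOST_CORES = 4

# one queue per host core; the real design has many queues, with each
# DPU ARM core owning a unique set of them
host_queues = [deque() for _ in range(NUM_HOST_CORES)]

def dpu_process_packet(raw_packet):
    """Runs on a DPU ARM core: strip headers, forward the CRUD request."""
    # hypothetical wire format: "hdr|hdr|...|OP key value"
    payload = raw_packet.split(b"|")[-1]        # drop simulated headers
    op, key, value = payload.split(b" ", 2)
    core = hash(key) % NUM_HOST_CORES           # steer by key
    host_queues[core].append((op, key, value))

def host_core_drain(core_id, store):
    """Runs on a host CPU core: apply CRUD ops, no packet parsing."""
    while host_queues[core_id]:
        op, key, value = host_queues[core_id].popleft()
        if op == b"PUT":
            store[key] = value
        elif op == b"GET":
            _ = store.get(key)
        elif op == b"DEL":
            store.pop(key, None)
```

Steering by key hash is my assumption; the paper's point stands either way: the host core sees a clean (op, key, value) tuple instead of a packet.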


A Computing Procedure for Quantification Theory

A Computing Procedure for Quantification Theory Martin Davis and Hilary Putnam JACM Volume 7, Issue 3 This is the oldest paper I’ve summarized. I think SAT solvers are magic and wanted to peek inside the black box. This paper describes a precursor to the full DPLL algorithm. About half of the paper is about dealing with existential and universal quantifiers, which I’m not describing here. The input to the algorithm described here is a logic formula in conjunctive normal form (CNF). Conjunctive normal form comprises a set of clauses. A clause is the logical OR of a set of literals. A literal is either a variable, or the negation of a variable. Logical AND is used to combine all clauses. For example: (x1 ∨ ¬x2) ∧ (x2 ∨ x3). There are straightforward algorithms to convert any logic formula into CNF (and more complicated ones like the Tseytin transformation ). The goal of this algorithm is to determine if the input logic formula is satisfiable. If it is satisfiable, then the algorithm produces a list, assigning each variable to a specific value (true or false). An alternative normal form would be disjunctive normal form (DNF), but that isn’t practical. If you think about it, any algorithm that outputs DNF effectively outputs a list of all possible solutions to the SAT problem (each clause is a possible solution). So, either the list is small and SAT solving really is easy, or most interesting problems would be prohibitively large in DNF form. This paper even cites a textbook which says that the act of putting a formula into DNF “automatically reveals whether or not the formula is inconsistent”. The search for a satisfying assignment (i.e., mapping of variable names to values) is described as the iterative application of three rules. The third rule contains the magic necessary to keep the search going. If a variable appears in a clause that only contains a single literal, then the variable can be assigned a value consistent with the polarity of the literal.
For example, in the formula (p) ∧ (¬p ∨ q), p must be true. If all appearances of a variable have the same polarity, then that variable can be assigned a value consistent with that polarity. For example, in the formula (¬p ∨ q) ∧ (¬p ∨ r) ∧ (q ∨ r), p can safely be assigned to false. Again, the algorithm notes the value of p and then simplifies the formula by substituting that value for p. This one is interesting in that subsequent work found an equivalent rule which is more efficient. The first step in applying rule III is to pick a variable p (the choice can affect performance, but this paper doesn’t offer much guidance on how to choose), and refactor the formula such that it has the following form: (A ∨ p) ∧ (B ∨ ¬p) ∧ R, where A, B, and R are expressions that do not contain p. The paper doesn’t explicitly say this, but it seems this must cause the formula to temporarily leave CNF. For example, (p ∨ a) ∧ (¬p ∨ b) ∧ (a ∨ c) refactors with A = a, B = b, and R = (a ∨ c). Now that the formula has been refactored, recursively solve the simpler formula (A ∨ B) ∧ R. In the example above, the simpler formula is (a ∨ b) ∧ (a ∨ c). The intuition here is that p must either be true or false. If p is true, then B must be true. If p is false, then A must be true. The follow-on paper introduces an equivalent Rule III*, which says to simply assume p is true and check if the formula is satisfiable with that assumption. If not, then assume p is false and check if the formula is satisfiable with that assumption. This avoids having to transform the formula into the (A ∨ p) ∧ (B ∨ ¬p) ∧ R form, which “can easily increase the number and the lengths of the clauses in the expression rather quickly after several applications”. This paper has the greatest statement of an improvement in runtime: The superiority of the present procedure over those previously available is indicated in part by the fact that a formula on which Gilmore’s routine for the IBM 704 causes the machine to compute for 21 minutes without obtaining a result was worked successfully by hand computation using the present method in 30 minutes. Computers were slower in the 1960s.
Imagine someone today coming up with an alternative way of training LLMs which can be done faster by hand than a GPU. This follow-on paper has an interesting observation about the choice of which variable to select in rule III: By rearranging the clauses of a formula a different order would in general be created. In some cases, whether or not the program could actually prove the validity of a given formula (without running out of fast access storage) depended on how one shuffled the punched-data deck before reading it into the assembler! Thus, the variation in ordering of constants did affect by a factor of 10 (from 500 to 5000) … In modern lingo, the branching heuristic matters a lot.
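The three rules can be condensed into a toy recursive solver. This is a sketch of DPLL-style search using the later Rule III* (splitting) rather than the paper's original clause-transforming Rule III; the clause representation and the variable-choice heuristic are my own, not the paper's.

```python
# Toy solver applying the three rules. Clauses are sets of ints:
# 3 means x3, -3 means NOT x3. A formula is a list of clauses.

def simplify(clauses, lit):
    """Assume `lit` is true: drop satisfied clauses, shrink the rest."""
    out = []
    for c in clauses:
        if lit in c:
            continue                    # clause satisfied, drop it
        out.append(c - {-lit})          # falsified literal removed
    return out

def solve(clauses, assignment=()):
    if not clauses:
        return {abs(l): l > 0 for l in assignment}          # satisfiable
    if any(not c for c in clauses):
        return None                     # empty clause: contradiction
    # Rule I: unit clauses force an assignment
    for c in clauses:
        if len(c) == 1:
            (lit,) = c
            return solve(simplify(clauses, lit), assignment + (lit,))
    # Rule II: pure literals (same polarity everywhere) are safe to set
    lits = {l for c in clauses for l in c}
    for lit in lits:
        if -lit not in lits:
            return solve(simplify(clauses, lit), assignment + (lit,))
    # Rule III*: split on some variable (the choice is the heuristic!)
    lit = next(iter(clauses[0]))
    return (solve(simplify(clauses, lit), assignment + (lit,))
            or solve(simplify(clauses, -lit), assignment + (-lit,)))
```

Modern solvers add clause learning, watched literals, and much smarter branching on top of exactly this skeleton.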


High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance

High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance Maximilian Kuschewski, Jana Giceva, Thomas Neumann, and Viktor Leis SIGMOD'25 In database vernacular, spilling is the process of writing intermediate data to disk in order to evaluate a query with a finite amount of main memory. It goes without saying that database folks are control freaks, er, performance conscious and don’t want to rely on generic OS paging mechanisms to handle working sets which are larger than main memory. This tidbit about Snowflake is fascinating: Only 5% of analytical queries in Snowflake’s 2018 workload trace spill data, but those 5% contribute 45% of the overall CPU time and 29% of the total execution time [77]. One fundamental problem in this area is that it is very hard to predict up-front exactly how much working memory will be needed to efficiently execute a query. Databases have to estimate based on statistics of the relations used in a query, or assume no spilling will be needed, and then gracefully fall back to spilling if necessary. The two operators which this paper deals with are joins and aggregations. Both involve key columns, and the cardinality of the key columns is critical in determining if spilling is necessary. One obvious spilling mechanism is to use a partitioning approach for joins and aggregations. I’ve described partitioned joins in summaries of these papers: Efficiently Processing Joins and Grouped Aggregations on GPUs The reason why partitioning nicely solves the problem is that the working set requirements for both steps of a partitioned join are modest. The partitioning step only requires a small amount of memory per partition (e.g., accumulate 64KiB per partition before appending partitioned tuples to on-disk storage). The join step only needs enough working memory to join a single partition. Section 4.1 of the paper claims that partitioning slows down TPC-H queries by 2-5x.
My spider sense is tingling, but let’s take this as an axiom for now. Here is the premise of the paper: partitioning is better for queries that must spill, but worse for queries that can be completed without spilling . What is an efficient and simple design given that a database cannot perfectly predict up-front if it will need to spill? Prior work along similar lines introduced the hybrid hash join. A hybrid hash join partitions the left (build) input of the join and dynamically decides what percentage of build partitions must be spilled. A hash table is built containing all non-spilled build partitions. Next the right (probe) input to the join is processed. For each probe tuple, the database determines if the associated partition was spilled. If the partition was spilled, then the probe tuple is spilled. If not, then the probe tuple is processed immediately via a lookup in the hash table. Finally, all spilled partitions are processed. The downside of this approach is that it always partitions the build side, even when that is unnecessary. This paper proposes a join implementation that only pays the cost of partitioning when spilling is required. It is a two-phase process. In the materialization phase, the build table is scanned (and pushed-down filters are applied). The resulting tuples are stored in a list of pages. At first, the system optimistically assumes that no spilling is necessary and appends each tuple to the current page. If a memory limit is reached, then partitioning is enabled. Each tuple processed after that point is partitioned, and per-partition lists of pages are allocated. If a further memory limit is reached, then some partitions are spilled to disk. Next, the build and probe phase executes in a manner similar to hybrid hash join. However, there is a fast path for the case where no spilling occurred. 
In this case, the tuples produced by the materialization phase are inserted into one large hash table, and then the probe tuples are streamed, with one hash table lookup per tuple. If partitioning (but no spilling) occurred, then the hash table inserts will have high locality (assuming a chained hash table). If spilling did occur, then the build and probe phase operates like a hybrid hash join, spilling probe tuples if and only if the associated build partition was spilled. The paper isn’t clear on what happens to non-partitioned build tuples once the system decides to start spilling build partitions. My assumption is that in this case, probe tuples must probe both the spilled build tuples and those build tuples that were processed before partitioning was enabled. The implementation of aggregation described in this paper follows a similar 2-phase approach. The materialization phase performs pre-aggregation: the system starts by aggregating into an in-memory hash table. If the hash table grows too large, then partitioning kicks in. Tuples from in-memory hash tables are evicted into per-partition page lists, and subsequent tuples are directly stored in these per-partition page lists. These per-partition pages can be spilled if further memory limits are reached. The subsequent phase then performs any necessary aggregation. If no eviction occurred, then this phase has no work to do. The paper describes an adaptive compression algorithm that improves spilling performance. Some interesting numbers from the paper: The number of CPU cycles per input byte for executing in-memory TPC-H queries ranges from 3.3 to 60.3 The number of CPU cycles required to write and read a byte to/from SSD is 11.1 The adaptive nature of this scheme is driven by the fact that queries are diverse, and compressors have knobs which trade off speed for compression ratio, as illustrated by Fig.
3: Source: https://dl.acm.org/doi/10.1145/3698813 When spilling occurs, the system dynamically balances CPU usage and IO bandwidth by adjusting these knobs. Fig. 5 shows in-memory throughput while Fig. 6 shows throughput when spilling occurs: Source: https://dl.acm.org/doi/10.1145/3698813 Dangling Pointers The design of the unified hash join hinges on the fact that partitioning is a bad idea for the in-memory case. This is in contrast to papers describing in-memory partitioned join implementations on other types of chips like SPID-Join and Efficiently Processing Joins and Grouped Aggregations on GPUs . I imagine there is a lot of literature about this topic that I haven’t read. Leave a comment with your experience.
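The adaptive materialization phase of the join can be sketched as follows. The thresholds, partition count, and the write_partition_to_disk helper are all invented for illustration; the real system tracks memory in pages and spills compressed page lists, not Python lists.

```python
# Sketch of optimistic materialization with a partitioning/spilling
# fallback: append plainly until a memory limit, then partition,
# then spill whole partitions if a second limit is hit.

def write_partition_to_disk(pid, rows):
    """Stand-in for writing a partition's pages to SSD (hypothetical)."""
    pass

class BuildSide:
    def __init__(self, mem_limit=10_000, spill_limit=20_000, nparts=16):
        self.plain = []                       # optimistic: no partitioning
        self.parts = [[] for _ in range(nparts)]
        self.spilled = set()                  # indices of spilled partitions
        self.mem_limit, self.spill_limit = mem_limit, spill_limit
        self.nparts, self.in_mem = nparts, 0

    def add(self, key, row):
        self.in_mem += 1
        if self.in_mem <= self.mem_limit:
            self.plain.append((key, row))     # fast path: just append
            return
        p = hash(key) % self.nparts           # partitioning has kicked in
        self.parts[p].append((key, row))
        if self.in_mem > self.spill_limit:    # spill the fullest partition
            victim = max(range(self.nparts), key=lambda i: len(self.parts[i]))
            self.in_mem -= len(self.parts[victim])
            write_partition_to_disk(victim, self.parts[victim])
            self.parts[victim] = []
            self.spilled.add(victim)

    def probe_is_spilled(self, key):
        """During probe: spill the probe tuple iff its partition was."""
        return bool(self.spilled) and hash(key) % self.nparts in self.spilled
```

If no limit is ever hit, everything stays in `plain` and the build phase can load one big hash table, which is the fast path the paper is after.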

Dangling Pointers 1 month ago

Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt?

Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt? Kaisong Huang, Jiatang Zhou, Zhuoyue Zhao, Dong Xie, and Tianzheng Wang SIGMOD'25 Say you are a database, and your job is to execute two kinds of queries (both from different TPC benchmarks ): High-priority New-Order queries from TPC-C (OLTP) Low-priority Q2 queries from TPC-H (OLAP) Congratulations, you are a hybrid transaction/analytical processing ( HTAP ) database! You would like OLTP transactions to experience low tail latency, while OLAP transactions run at high throughput. How can transactions be scheduled to achieve these goals? A Non-Preemptive FIFO policy runs transactions to completion in the order they arrived. This is easy but has a high tail latency for OLTP transactions that have to sit in line behind a long queue of OLAP transactions. A cooperative policy involves yield points at specific places in the database code. At each yield point, an OLAP transaction can realize that an OLTP transaction is waiting and yield the CPU. It is a hassle to insert yield points, and it is hard to find the Goldilocks frequency. Checking for high-priority transactions too often adds overhead, but checking infrequently increases tail latency. A preemptive policy allows high-priority OLTP transactions to borrow a CPU core as soon as possible. In the past, the only practical way to do this involved OS context switching, which is expensive. Fig. 2 illustrates these policies: Source: https://dl.acm.org/doi/abs/10.1145/3725319 Enter userspace interrupts . These allow preemption without OS kernel involvement. Section 4.2 of the paper makes it clear that it isn’t totally easy to implement userspace context switching on top of userspace interrupts. An idiomatic use case for userspace interrupts is for an interrupt handler to quickly save some data and then return back to the code that was running. The context switch case is not idiomatic.
For each CPU core, two pthread threads are created, and there are two stacks. Say the CPU core is running a low-priority (OLAP) transaction and a userspace interrupt is delivered to the core. The userspace interrupt handler is invoked, which mucks around with the CPU registers (including the stack pointer), and then returns. But it doesn’t return to the code that was running the low-priority transaction, it returns to code which runs the high-priority (OLTP) transaction . Once the high-priority transaction finishes, it calls a voluntary context switch function, which again mucks around with CPU registers and the stack pointer in just the correct manner so that it returns back to the code running the low-priority transaction. There are some nitty-gritty details to get this working correctly. Tricky cases have to be handled such as: A userspace interrupt occurring in the middle of the context switch function Support for database code which uses thread-local storage (e.g., the thread_local modifier in C++ ) Avoiding deadlocks associated with a userspace interrupt occurring while a database lock is acquired As seen with xUI , while userspace interrupts are cheap, they still incur a cost. This paper proposes firing a single interrupt to execute a batch of high-priority transactions. Section 5 also describes a starvation avoidance mechanism to ensure that low-priority transactions eventually finish. Note that when a low-priority transaction is preempted, it is not automatically aborted . The paper assumes the underlying database uses multi-versioning and optimistic concurrency control. Fig. 10 has the headline results, comparing FIFO scheduling, a cooperative policy where the OLAP query occasionally yields the CPU core, and the preemptive scheme described in this paper. Tail latencies for OLTP queries are significantly reduced, while performance of OLAP queries does not change much.
Source: https://dl.acm.org/doi/abs/10.1145/3725319 Dangling Pointers It would be nice to see a comparison against a version which uses traditional OS thread synchronization rather than userspace interrupts. The details of userspace context switching are tricky and seem orthogonal to databases. A library or OS functionality which provides a robust implementation seems like a useful thing to exist. The paper doesn’t mention what happens if the work associated with a single query is parallelized across multiple CPU cores. I imagine this complicates the scheduling policy.
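The batch-per-interrupt idea can be illustrated with a small scheduling simulation. This is plain Python with invented cost constants, no real userspace interrupts: one "interrupt" preempts the OLAP work and drains the whole pending batch of high-priority transactions, amortizing the preemption cost.

```python
# Simulation of batched preemption: OLTP txns queue up while OLAP work
# runs; a single interrupt (with a fixed, invented cost) drains the
# whole batch, so the per-txn preemption overhead shrinks with batching.
INTERRUPT_COST = 5          # invented time units; xUI shows real cost is small

def run(olap_steps, oltp_arrivals, oltp_cost=1):
    """oltp_arrivals: map from OLAP step index -> #OLTP txns arriving then.
    Returns (total time, list of OLTP txn latencies)."""
    time, pending, latencies = 0, [], []
    for step in range(olap_steps):
        for _ in range(oltp_arrivals.get(step, 0)):
            pending.append(time)               # record each arrival time
        if pending:                            # one interrupt per batch
            time += INTERRUPT_COST
            for arrived in pending:
                time += oltp_cost              # run the high-priority txn
                latencies.append(time - arrived)
            pending = []
        time += 1                              # one step of OLAP work
    return time, latencies
```

Firing one interrupt per transaction instead would pay INTERRUPT_COST per arrival; the batch pays it once, which is the trade-off the paper's Section 5 is tuning.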


SuSe: Summary Selection for Regular Expression Subsequence Aggregation over Streams

SuSe: Summary Selection for Regular Expression Subsequence Aggregation over Streams Steven Purtzel and Matthias Weidlich SIGMOD'25 A common use case for regular expressions is to check whether a particular input string matches a regular expression. This paper introduces a new use case, which is generating statistics over streams. Imagine an infinite stream of characters constantly being processed, and a dashboard which a person occasionally views. The dashboard shows statistics, e.g., “there have been N matches in the last H hours”. It is OK for these statistics to be approximate. The final quirk introduced by this paper is to allow for noise. In particular, all regular expressions implicitly accept “unexpected inputs” at any time and simply behave as if the unexpected input had never appeared. Counterbalancing this flexibility is a sliding time window which requires that if a particular subset of the input stream is going to match, it must do so within a bounded number of characters. A regular expression can be represented as an NFA . The N stands for non-deterministic , and that is the tricky bit. For example, say the regular expression to be matched against is AB*BC. This matches strings which begin with A, followed by 0 or more B characters, followed by exactly 1 B, followed by C. Another way to write this expression is: AB+C (B+ stands for “one or more Bs”). Imagine you were designing a state machine to recognize strings which match this pattern. It starts off easy: the machine should expect the first input to be A. The next expected input is B, but then what? That first B could be one of many represented by the B*, or it could be the last B right before the C. The state machine doesn’t know how to classify the inputs that it has seen until it sees the final C (or some other non-B character, indicating no match). A common solution to this problem is like the multiverse: when that first B comes, split the universe into two, and then send all subsequent inputs to both.
An easy way to implement this is to use a bit vector to track all possible states the NFA could be in. The NFA starts in the initial state (so only one bit is set), and transitions through states like you would expect (only one bit is set at a time). When that first B comes, however, two bits are set, indicating that there are two possible ways the input could match. The NFA above can be represented with 6 states: The following table shows how the bit vector could evolve over time (one new input accepted each time step, bit vectors are written with the least-significant bit on the right): The key insight in this paper is to track states with counters, not bits. Each counter represents the number of matches that have been seen thus far in the stream. For example, when an A input comes along, the counter associated with state 1 is increased by one. When a C comes along, the counter associated with state 4 is increased by the value of the counter associated with state 3. Section 5.1 of the paper explains how counter increment rules can be derived for a specific NFA. The system described by this paper contains multiple sets of counters. Some counters are global, and some are local to each input character. The purpose of the local counters is to allow for “garbage collection”. The idea is that the system keeps a sliding window of the most important recent inputs, along with global and local state counters. When a new input comes, the local state counters are used to determine which input in the sliding window to delete. This determination is made based on an estimate of how likely each input is to generate a match. When the user views the dashboard, then all matches within the sliding window are identified. It is OK if this process is expensive, because it is assumed to be rare. Fig.
9 has a comparison of this work (StateSummary) against other approaches: Source: https://dl.acm.org/doi/10.1145/3725359 Dangling Pointers State machines are notoriously hard to optimize with SIMD or specialized hardware because of the fundamental feedback loop involved. It would be interesting to see if this approach is amenable to acceleration. Subscribe now
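The counters-instead-of-bits idea can be sketched for a simplified pattern: a single B rather than B+, so the increment rules stay short. The state names are mine, not the paper's, and this counts subsequence matches (noise characters simply leave the counters unchanged, mirroring the "unexpected inputs are ignored" semantics).

```python
# Counter-based NFA tracking: one counter per state instead of one bit.
# Simplified pattern A.B.C (exactly one B). Each counter holds the
# number of partial matches that have reached that state so far.
def count_subsequence_matches(stream):
    saw_a = 0        # counter for state "matched A"
    saw_ab = 0       # counter for state "matched A then B"
    matches = 0      # counter for the accepting state
    for ch in stream:
        if ch == 'A':
            saw_a += 1                # a new potential match begins
        elif ch == 'B':
            saw_ab += saw_a           # every A seen so far can precede this B
        elif ch == 'C':
            matches += saw_ab         # every A..B pair completes here
        # any other character is noise: counters unchanged
    return matches
```

With bits, two parallel universes is two set bits; with counters, a million parallel partial matches is just a larger integer, which is what makes aggregation over streams cheap.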


Fast and Scalable Data Transfer Across Data Systems

Fast and Scalable Data Transfer Across Data Systems Haralampos Gavriilidis, Kaustubh Beedkar, Matthias Boehm, and Volker Markl SIGMOD'25 We live in exciting times, with unimaginably large language models getting better each day and a constant stream of amazing demos. And yet, efficiently transferring a table between heterogeneous systems is an open research problem! An example from the paper involves transferring data from PostgreSQL to pandas. Optimizing this transfer time is important and non-trivial. The paper describes a system named XDBC. XDBC software runs on both the source and the destination data management systems (DMS), as illustrated by Fig. 4: Source: https://dl.acm.org/doi/10.1145/3725294 The XDBC client/server processes are organized as a pipeline. Data parallelism within a stage is exploited by assigning 1 or more workers (e.g., cores) to each stage. There are a lot of knobs which can affect end-to-end throughput: Number of workers assigned to each task Data interchange format (row-major, column-major, Arrow ) Compression ( zstd , snappy , lzo , lz4 ) Section 4.1 of the paper claims the search space is so large that brute force search will not work, so a heuristic algorithm is used. The heuristic algorithm assumes accurate performance models which can estimate performance of each pipeline stage given a specific configuration. This model is based on real-world single-core performance measurements, and Gustafson’s law to estimate multi-core scaling. The algorithm starts by assigning 1 worker to each pipeline stage (in both the client and server). An iterative process then locates the pipeline stage which is estimated to be the slowest and assigns additional workers to it until it is no longer the bottleneck.
This process continues until no more improvement can be found, due to one of the following reasons: All available CPU cores have been assigned Network bandwidth is the bottleneck If the process ends with more CPU cores available, then a hard-coded algorithm determines the best compression algorithm given the number of cores remaining. The data interchange format is determined based on which formats the source and destination DMSs support, and which compression algorithm was chosen. The XDBC optimizer has a lot of similarities with the Alkali optimizer . Here are some differences: Alkali does not require tasks to be executed on separate cores. For example, Alkali would allow a single core to execute two pipeline stages. Alkali uses an SMT solver to determine the number of cores to assign to each stage. The Alkali performance model explicitly takes into account inter-core bandwidth requirements. Alkali doesn’t deal with compression. Fig. 7(a) shows results from the motivating example (PostgreSQL→Pandas). Fig. 7(b) compares XDBC vs built-in Pandas functions to read CSV data over HTTP. connector-x is a more specialized library which supports reading data into Python programs specifically. Source: https://dl.acm.org/doi/10.1145/3725294 Dangling Pointers There are many search spaces which are too large for brute force. Special-case heuristic algorithms are one fallback, but as the Alkali paper shows, there are other approaches (e.g., LP solvers, ILP solvers, SMT solvers, machine learning models). It would be great to see cross-cutting studies comparing heuristics to other approaches.
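The greedy bottleneck-driven worker assignment can be sketched as follows. The Gustafson-style scaling model, the serial fraction, and all numbers are invented; the real optimizer also stops early when the network becomes the bottleneck and then spends leftover cores on compression.

```python
# Sketch of the iterative worker-assignment heuristic: start with one
# worker per stage, repeatedly give a core to the estimated bottleneck.
def throughput(single_core_tput, workers, serial_frac=0.05):
    # Gustafson-style scaling: speedup = workers - serial_frac*(workers-1)
    return single_core_tput * (workers - serial_frac * (workers - 1))

def assign_workers(stage_tputs, total_cores):
    """stage_tputs: per-stage single-core throughput estimates (e.g. MB/s)."""
    workers = [1] * len(stage_tputs)            # start with 1 worker each
    while sum(workers) < total_cores:
        # find the estimated bottleneck stage...
        rates = [throughput(t, w) for t, w in zip(stage_tputs, workers)]
        bottleneck = rates.index(min(rates))
        workers[bottleneck] += 1                # ...and give it one more core
    return workers
```

With stages at 100, 50, and 200 MB/s per core and 6 cores, the slow middle stage soaks up the extra workers first, which is exactly the behavior the heuristic is after.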


Falcon: A Reliable, Low Latency Hardware Transport

Falcon: A Reliable, Low Latency Hardware Transport Arjun Singhvi, Nandita Dukkipati, Prashant Chandra, Hassan M. G. Wassel, Naveen Kr. Sharma, Anthony Rebello, Henry Schuh, Praveen Kumar, Behnam Montazeri, Neelesh Bansod, Sarin Thomas, Inho Cho, Hyojeong Lee Seibert, Baijun Wu, Rui Yang, Yuliang Li, Kai Huang, Qianwen Yin, Abhishek Agarwal, Srinivas Vaduvatha, Weihuang Wang, Masoud Moshref, Tao Ji, David Wetherall, and Amin Vahdat SIGCOMM'25 Falcon is an IP block which can be integrated into a 3rd-party NIC. Fig. 7 shows an example integration of Falcon into a NIC. Blue components are part of Falcon: Source: https://dl.acm.org/doi/abs/10.1145/3718958.3754353 Multiple Upper Layer Protocols (ULPs, e.g., NVMe and RDMA ) are implemented on top of Falcon. Other protocols (e.g., Ethernet) can bypass Falcon and go straight to the standard NIC hardware. Falcon provides reliability and ordering via a connection-oriented interface to the ULPs. Multipathing is the ability for a single connection to use multiple network paths from the sender to the receiver. This improves throughput by allowing use of aggregate bandwidth and allows Falcon to quickly react to transient congestion on a subset of paths. The paper uses the term flow for a single path from sender to receiver. A single connection is associated with many flows. There are two parts to implementing multipathing, one easy and one not-so-easy. The easy task is to use the IPv6 Flow Label field. When the sending NIC chooses a flow for a particular packet, it sets the index of the flow in the flow label field. When a switch determines that there are multiple valid output ports for a packet, it hashes various fields from the packet (including the flow label) to determine which port to use. The switches are doing the hard work here. A Falcon NIC doesn’t need to maintain a local view of the network topology between the sender and receiver, nor does it have to pre-plan the exact set of switches a packet will traverse. 
The NIC simply sets the flow label field. The hard part is handling out-of-order packets. If the sending NIC is interleaving between flows at a fine granularity, then the receiving NIC will commonly receive packets out of order. Falcon burns 1-2 mm² of silicon on a packet buffer which holds received packets until they can be delivered to a ULP in order. ACK packets contain a packet sequence number and a 128-bit wide bitmap which represents a window of 128 recent packets that have been received. The sender uses these bitmaps to determine when to retransmit. The NIC maintains an estimate of the round-trip latency on each flow. If the most recent bitmap indicates that a packet has not been received, and a period of time longer than the round-trip latency has elapsed, then the packet is retransmitted. Falcon attempts to be a good citizen and minimize bufferbloat by estimating per-flow round-trip latency. These estimates are gathered via hardware near the edge of the NIC which records timestamps as packets (including ACKs) are sent and received. When Falcon is processing a packet to be sent for a given connection, it computes the open window associated with each flow. The open window is the difference between the round-trip latency and the number of unacknowledged packets. The flow with the largest open window is selected. You can think of the open window like a per-flow credit scheme, where the total credits available are determined from round-trip latency, sending a packet consumes a credit, and receiving an ACK produces credits. The trick here is that the round-trip latency associated with each flow is constantly changing. Section 5.2 of the paper describes three details which the authors felt were worth mentioning. The unspoken assumption is that these are non-standard design choices: As mentioned before, Falcon dedicates a non-trivial amount of on-chip resources to SRAM buffers which hold received packets before they are reassembled into the correct order.
The paper says 1.2MB is required for 200Gbps, and the buffer size grows linearly with throughput. One interesting fact is that the buffer size is independent of latency, because throughput decreases with latency. For example, the paper mentions the same size works well with “inter-metro use-cases” which have 5-10x higher latency, but also 5-10x lower bandwidth. Falcon has an on-chip cache to hold mutable connection state, but the paper says that it is very common to have a high miss rate in this cache. The solution to this is to provision enough bandwidth to be able to have good performance when most accesses to connection state must go off chip. Reading between the lines, it seems like there are two scenarios which are important. The first has a small number of connections, with each connection experiencing a high packet rate. The second is a large number of connections, each with a low packet rate. Falcon has hardware support for somewhat rare events (errors, timeouts) rather than letting software on the host handle this. Fig. 10 compares Falcon against RoCE for various RDMA verbs and drop rates. Note that the drop rate maxes out at 1%. Source: https://dl.acm.org/doi/abs/10.1145/3718958.3754353 Dangling Pointers Falcon contains a lot of great optimizations. I wonder how many of them are local optimizations, and how much more performance is on the table if global optimization is allowed. In particular, Falcon works with standard ULPs (RDMA, NVMe) and standard Ethernet switches. At some scale, maybe extending the scope of allowable optimizations to those components would make sense?
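The open-window flow selection can be sketched as follows. The credit mapping from RTT to a budget, the field names, and the smoothing constant are my guesses at plausible behavior; the real logic lives in NIC hardware with continuously updated per-flow RTT estimates.

```python
# Sketch of per-flow credit-based flow selection: credits scale with the
# measured round-trip latency; packets in flight consume them.
class Flow:
    def __init__(self, rtt_us):
        self.rtt_us = rtt_us          # continuously re-estimated in hardware
        self.unacked = 0              # packets in flight on this flow

    def open_window(self, credits_per_us=0.1):
        # hypothetical mapping from RTT to a credit budget
        return self.rtt_us * credits_per_us - self.unacked

def pick_flow(flows):
    """Send the next packet on the flow with the largest open window."""
    best = max(flows, key=lambda f: f.open_window())
    best.unacked += 1                 # sending consumes a credit
    return best

def on_ack(flow, new_rtt_us):
    flow.unacked -= 1                 # the ACK returns a credit
    flow.rtt_us = 0.8 * flow.rtt_us + 0.2 * new_rtt_us   # smoothed estimate
```

The interesting dynamic is that a congested flow's RTT rises while its ACKs slow down, so its open window shrinks from both directions and traffic naturally shifts to healthier paths.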
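The bitmap-driven retransmission check described above can be sketched as follows. This is a toy model: the field layout, names, and exact timer policy are my assumptions, not Falcon's actual wire format.

```python
ACK_WINDOW = 128  # ACKs carry a 128-bit bitmap of recently received packets

def packets_to_retransmit(ack_seq, ack_bitmap, send_times, rtt_estimate, now):
    """Return sequence numbers the sender should retransmit.

    ack_seq:      highest sequence number covered by the bitmap
    ack_bitmap:   int; bit i set means packet (ack_seq - i) was received
    send_times:   dict mapping in-flight sequence number -> send timestamp
    rtt_estimate: current round-trip latency estimate for this flow
    """
    resend = []
    for i in range(ACK_WINDOW):
        seq = ack_seq - i
        if seq not in send_times:
            continue  # already acknowledged and retired, or never sent
        received = (ack_bitmap >> i) & 1
        # Retransmit only if the packet is missing AND more than one
        # estimated round trip has elapsed since it was sent.
        if not received and (now - send_times[seq]) > rtt_estimate:
            resend.append(seq)
    return resend
```

For example, if the bitmap shows the packet three slots below `ack_seq` is still missing and it was sent more than one estimated RTT ago, it is retransmitted.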


Bounding Speculative Execution of Atomic Regions to a Single Retry

Bounding Speculative Execution of Atomic Regions to a Single Retry Eduardo José Gómez-Hernández, Juan M. Cebrian, Stefanos Kaxiras, and Alberto Ros ASPLOS'24

This paper proposes adding hardware support for a specific subset of atomic regions. An atomic region can either be a regular old critical section or a transaction in a system which supports transactional memory. Speculative lock elision is a microarchitectural optimization to speculatively remove synchronization between cores. A mis-speculation results in a finite number of retries, followed by falling back to locking as specified by the program. Hardware transactional memory (HTM) requires programmers to specify transactions at the language level. Again, conflicts are handled with a bounded number of retries, followed by executing user-specified “fallback” code.

The key insight in this paper is that many atomic regions can be implemented with at most a single retry. These atomic regions have an immutable set of memory addresses which they access. In other words, these regions will access the same set of cache lines on each retry. Table 1 shows statistics for each atomic region in a set of benchmarks analyzed by the paper:

Source: https://dl.acm.org/doi/10.1145/3622781.3674176

Cacheline-locked executed Atomic Region

The paper proposes hardware support for cacheline-locked executed atomic regions (CLEAR), to optimize execution of most atomic regions. The first invocation of an atomic region is part of the discovery phase. The processor keeps track of properties (e.g., the set of cache lines accessed) of the atomic region. If no conflicts occur with other transactions, then this first invocation of the atomic region can commit. If a conflict occurs, hardware finishes executing the atomic region, so that it gets a full picture of the set of cache lines touched by the region.
The hardware then retries executing the atomic region, this time locking each cache line which will be accessed (locking in a sorted order to avoid deadlocks with other cores also locking cache lines). In most cases, this does the trick, and this second retry will succeed. A few things can go wrong, but the paper claims they are not too common:

- The atomic region can be too large for the core to keep track of all relevant metadata during discovery.
- The atomic region could contain indirections, which means the set of cache lines accessed could change from run to run. The processor optimistically assumes the set of cache lines won’t change and detects if this assumption was incorrect.

For these reasons, the hardware must still have a fallback path (e.g., coarse-grained locking). But this is not the common case.

Fig. 8 has the headline results:

- B is a simplistic implementation of HTM (requester-wins).
- P is a more advanced implementation of HTM (PowerTM).
- C is CLEAR built on top of the requester-wins HTM design.
- PW is CLEAR built on top of the PowerTM implementation of HTM.

Source: https://dl.acm.org/doi/10.1145/3622781.3674176

Dangling Pointers

Similar patterns appear in other domains. Ripple atomics require the read and write set of each atomic block to be known at compile time. Calvin runs OLTP transactions first as reconnaissance queries to determine the read/write sets of a transaction. From a language design point of view, it seems worth considering a special syntax for the subset of transactions which have immutable read/write sets.
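The discovery-then-lock flow can be illustrated with a software analogy. This is my own sketch, not the paper's hardware design: Python locks per "line" stand in for the cache coherence protocol, and the discovery run executes against a snapshot instead of relying on hardware conflict detection.

```python
import threading

# One lock per "cache line" (here: per variable name), acquired in sorted
# order on the retry, mirroring CLEAR's sorted cache-line locking.
line_locks = {}
table_lock = threading.Lock()

def lock_for(addr):
    with table_lock:
        return line_locks.setdefault(addr, threading.Lock())

def run_atomic(region, state):
    """region(read, write) mutates `state` atomically.

    Assumes the address set is immutable across runs (the paper's key
    assumption); a real implementation must detect when it is not.
    """
    # Discovery phase: run the region once against a throwaway snapshot,
    # recording the set of addresses it touches.
    touched = set()
    snapshot = dict(state)

    def t_read(a):
        touched.add(a)
        return snapshot[a]

    def t_write(a, v):
        touched.add(a)
        snapshot[a] = v

    region(t_read, t_write)

    # Single retry: lock every touched line in sorted order, re-execute
    # against the real state, then release.
    locks = [lock_for(a) for a in sorted(touched)]
    for l in locks:
        l.acquire()
    try:
        region(lambda a: state[a], lambda a, v: state.__setitem__(a, v))
    finally:
        for l in reversed(locks):
            l.release()
```

In real CLEAR the discovery run can itself commit when no conflict occurs; this sketch always pays for the retry, to keep the two phases visible.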


CEIO: A Cache-Efficient Network I/O Architecture for NIC-CPU Data Paths

CEIO: A Cache-Efficient Network I/O Architecture for NIC-CPU Data Paths Bowen Liu, Xinyang Huang, Qijing Li, Zhuobin Huang, Yijun Sun, Wenxue Li, Junxue Zhang, Ping Yin, and Kai Chen SIGCOMM'25

Thanks to for the pointer to this paper. CEIO tries to solve the same problems as Disentangling the Dual Role of NIC Receive Rings (which proposes a solution named rxBisect). Here is a rehash from my summary of that paper: DDIO is a neat feature of Intel CPUs which causes the uncore to direct writes from I/O devices such that data lands in the LLC rather than main memory. In effect, DDIO speculates that a CPU core will read the data before it is evicted from the LLC. This feature is transparent to the I/O device.

The particular problem the CEIO paper addresses is called Leaky DMA in the rxBisect paper, so let’s stick with that. Recall the Leaky DMA problem occurs when the sum total of all per-core receive rings is too large, and thus the working set of the system no longer fits in the LLC. In this situation, when the NIC writes data for a packet into the LLC, it ends up inadvertently evicting data associated with an older packet. This defeats the whole purpose of DDIO.

The core idea of CEIO is for the NIC to track per-flow (flow and connection are used interchangeably in this paper) credits. When a new packet arrives, the NIC decrements credits from the associated flow and then sends the packet to host memory. If no credits are available, then the NIC stuffs the packet in DRAM that is local to the NIC (this paper assumes a NIC like Nvidia BlueField-3 with 16GB of local DRAM). The sum of the credits across all flows is sized such that LLC capacity will not be exceeded when all credits are in use. The case where credits are available is called the fast path, the other case is called the slow path. Section 4.2 of the paper goes into the details of how the system correctly switches back and forth between the slow and fast paths.
Credits are tracked in increments of 64 bytes, and the number of credits a packet consumes is proportional to the size of the packet. The CEIO driver running on the host returns per-flow credits to the NIC via register writes after the CPU has finished reading a packet. An ARM core on the NIC handles allocation of credits to flows. Credit allocation is unashamedly fair and balanced. The desired total number of credits per flow is simply the total number of credits the LLC can support, divided by the number of flows. When a new connection is established, the flow allocator reallocates available credits from existing flows. If a given flow has too many credits allocated to it, but those credits are all in use, then the allocator records that the flow is “in credit debt” and rectifies the situation when the CEIO driver on the host returns credits to the indebted flow. The authors implemented CEIO on a BlueField-3 NIC. Fig. 9(a) compares throughput and LLC miss rates of CEIO with other implementations. This particular comparison represents static network conditions (no new flows coming and going, skew properties not changing). Source: https://dl.acm.org/doi/10.1145/3718958.3750488 Fig. 10(b) shows throughput in bursty conditions: Source: https://dl.acm.org/doi/10.1145/3718958.3750488 One advantage CEIO has is that it tracks credits at 64-byte granularity. The design of rxBisect requires tracking credits at MTU granularity, which will be less efficient for small packets. rxBisect doesn’t require DRAM local to the NIC, which makes it applicable to cheaper hardware. What happens after the host CPU has processed a packet? With rxBisect, it is unlikely that the packet will be evicted from the LLC, whereas with CEIO the packet likely will be evicted (thus wasting DRAM bandwidth writing data that is never needed again). 
This is because with rxBisect, the sum of all credits (the sum of the sizes of all rings) is sized to fit in the LLC, whereas with CEIO only the set of all pending packets (those not yet processed by the CPU) will be in the LLC. Necro-reaper proposes another solution to this dead writeback problem. It seems like there is mounting evidence that DDIO is missing a coordination interface to allow it to reliably work. Hardware designs and drivers need to walk on eggshells to take advantage of DDIO. One incremental change would be to allow hardware to hint that a particular DMA write should not land in the LLC. That would enable CEIO to work with NICs that do not have local DRAM: the slow path could write data into host DRAM instead.
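The fast-path/slow-path decision can be sketched as follows. This is a toy model with invented names; among other simplifications, it does not model draining parked packets from NIC DRAM back to the host.

```python
CREDIT_UNIT = 64  # credits are tracked in 64-byte units

class CeioNic:
    def __init__(self, credits_per_flow):
        self.credits = dict(credits_per_flow)  # flow id -> available credits
        self.llc = []       # packets DMAed toward the host LLC (fast path)
        self.nic_dram = []  # packets parked in NIC-local DRAM (slow path)

    def rx(self, flow, packet):
        # Credit cost is proportional to packet size, in 64-byte units.
        cost = -(-len(packet) // CREDIT_UNIT)  # ceiling division
        if self.credits.get(flow, 0) >= cost:
            self.credits[flow] -= cost           # fast path: DMA into LLC
            self.llc.append((flow, packet))
        else:
            self.nic_dram.append((flow, packet))  # slow path: buffer on NIC

    def return_credits(self, flow, nbytes):
        # The host driver returns credits after the CPU consumes a packet.
        self.credits[flow] = self.credits.get(flow, 0) + -(-nbytes // CREDIT_UNIT)
```

Because the total credits across flows are sized to the LLC, a flow that runs out of credits spills to NIC DRAM instead of evicting another flow's pending packets.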


Automata All the Way Down

Fabs and EDA companies collaborate to provide the abstraction of synchronous digital logic to hardware designers. A hardware design comprises:

- A set of state elements (e.g., registers and on-chip memories), which retain values from one clock cycle to another
- A transfer function, which maps the values of all state elements at clock cycle N to new values of all state elements at clock cycle N+1

The transfer function cannot be too fancy. It can be large but cannot be defined with unbounded loops/recursion. The pragmatic reason for this restriction is that the function is implemented with physical gates on a chip, and each gate can only do one useful thing per clock cycle. You cannot loop the output of a circuit element back to itself without delaying the value by at least one clock cycle (via a state element). It feels to me like there is a deeper reason why this restriction must exist.

Many people dabbled with synchronous digital logic in college. If you did, you probably designed a processor, which provides the stored program computer abstraction to software developers. And here comes the inception: you can think of a computer program as a transfer function. In this twisted mindset, the stored program computer abstraction enables software engineers to define transfer functions. For example, a pseudo-assembly program can be thought of as a transfer function. In the stored program computer abstraction, state elements are the architectural registers plus the contents of memory. As with synchronous digital logic, there are limits on what the transfer function can do. The transfer function is essentially a big switch statement: it can have many cases, but the body of each case block is defined by one instruction. Alternatively, you can define the transfer function at the basic block level (one case per basic block, many instructions inside of each case). Programming in assembly is a pain, so higher level languages were developed to make us less crazy.
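As a concrete stand-in for the idea, here is a toy stored-program machine (the instruction set is invented for illustration): its step function maps the state (registers, memory, program counter) at step N to the state at step N+1, with one switch case per instruction.

```python
def step(state):
    """One evaluation of the transfer function: state N -> state N+1."""
    regs, mem, pc = state
    op, *args = mem[pc]            # instructions live in memory (stored program)
    regs = dict(regs)
    if op == "li":                 # li rd, imm: load immediate
        rd, imm = args
        regs[rd] = imm
        pc += 1
    elif op == "add":              # add rd, rs1, rs2
        rd, rs1, rs2 = args
        regs[rd] = regs[rs1] + regs[rs2]
        pc += 1
    elif op == "jnz":              # jnz rs, target: branch if rs != 0
        rs, target = args
        pc = target if regs[rs] != 0 else pc + 1
    elif op == "halt":
        pc = None
    return regs, mem, pc

def run(program, regs=None):
    """Iterate the transfer function until the machine halts."""
    state = (regs or {}, dict(enumerate(program)), 0)
    while state[2] is not None:
        state = step(state)
    return state[0]
```

Note that `step` itself contains no unbounded loop: each evaluation does one bounded piece of work, and all looping lives in the repeated application of the function, exactly as in synchronous logic.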
And here we go again, someone could write an interpreter for C. A user of this interpreter works at the C level of abstraction. Following along with our previous pattern, a C program comprises:

- A set of state elements (variables, both global and local)
- A transfer function

A C function, too, can be thought of as a transfer function, with intrinsics used to implement function calls. The key building blocks of the transfer function are state-ments. It is easy to just store the term “statement” into your brain without thinking of where the term comes from. A statement is a thing which can alter state. This transformation of an imperative program into a transfer function seems strange, but some PL folks do it all the time. In particular, the transfer function view is how small step operational semantics are defined.

And of course this can keep going. One could write a Python interpreter in C, which allows development at a higher level of abstraction. But even at that level of abstraction, programs are defined in terms of state elements (variables) and a transfer function (statements). The term Turing Tax was originally meant to describe the performance loss associated with working at the stored-program computer level of abstraction instead of the synchronous digital logic level of abstraction. This idea can be generalized.

Code vs Data

At a particular level of abstraction, code defines the transfer function while data is held in the state elements. A particular set of bits can simultaneously be described as code at one level of abstraction, while defined as data at a lower level. This code/data duality is intimately related to the Turing Tax. The Turing Tax collector is constantly looking for bags of bits which can be interpreted as either code or data, and he collects his tax each time he finds such a situation. An analogous circumstance arises in hardware design.
Some signals can be viewed as either part of the data path or the control path, depending on what level of abstraction one is viewing the hardware from.

Compilers vs Interpreters

A compiler is one trick to avoid the Turing Tax by translating code (i.e., a transfer function) from a higher level of abstraction to a lower level. We all felt awkward when I wrote “interpreter for C” earlier, and now we can feel better about it. JIT compilers for Python are one way to avoid the Turing Tax. Another example is an HLS compiler which avoids the Turing Tax between the stored-program computer abstraction layer and the synchronous digital logic layer.

Step Counting (Multiple Taxation)

No, this section isn’t about your Fitbit. Let’s call each evaluation of a transfer function a step. These steps occur at each level of abstraction. Let’s define the ultimate performance goal that we care about to be the number of steps required to execute a computation at the synchronous digital logic level of abstraction. The trouble with these layers of abstraction is that typically a step at a higher layer of abstraction requires multiple steps at a lower layer. For example, the multi-cycle processor implementation you learned about in a Patterson and Hennessy textbook could require 5 clock cycles to execute each instruction (instruction fetch, register fetch, execute, memory, register write back). Interpreters have the same behavior: one Python statement may be implemented with many C statements. Now imagine the following house of cards:

- A Python interpreter which requires an average of 4 C statements to implement 1 Python statement
- A C compiler which requires an average of 3 machine instructions to implement 1 C statement
- A processor which requires an average of 5 clock cycles to execute 1 machine instruction

When the Turing property tax assessor sees this house, they tax each level of the house. In this system, an average Python statement requires 4 x 3 x 5 = 60 clock cycles!
Much engineering work goes into avoiding this problem (pipelined and superscalar processors, multi-threading, JIT compilation, SIMD). Partial evaluation is another way to avoid the Turing Tax. Partial evaluation transforms data into code. There must be some other method of creating abstractions which is more efficient. Self-modifying code is rarely used in the real world (outside of JIT compilers). Self-modifying code seems crazy to reason about but potentially could offer large performance gains. Partial evaluation is also rarely used but has a large potential.
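Partial evaluation's "data into code" move can be illustrated with the classic power-function example. This is my own sketch, not from the post: the exponent (data) is compiled away into straight-line code with no loop left at runtime.

```python
def specialize_power(n):
    """Partially evaluate pow(x, n) for a fixed exponent n.

    The loop over n runs once, at specialization time; the generated
    function is pure straight-line code.
    """
    expr = " * ".join(["x"] * n) if n > 0 else "1"
    src = f"def power(x):\n    return {expr}\n"
    namespace = {}
    exec(src, namespace)  # "compile" the generated transfer function
    return namespace["power"]

cube = specialize_power(3)  # generated body: return x * x * x
```

The specializer pays the Turing Tax once per exponent rather than once per call, which is the same trade a JIT compiler makes.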


Gigaflow: Pipeline-Aware Sub-Traversal Caching for Modern SmartNICs

Gigaflow: Pipeline-Aware Sub-Traversal Caching for Modern SmartNICs Annus Zulfiqar, Ali Imran, Venkat Kunaparaju, Ben Pfaff, Gianni Antichi, and Muhammad Shahbaz ASPLOS'25

A virtual switch (vSwitch) routes network traffic to and from virtual machines. Section 2.1 of the paper describes the historical development of vSwitch technology, ending with a pipeline of match-action tables (MATs). A match-action table is a data-driven way to configure a vSwitch, comprising matching rules and associated actions to take when a matching packet is encountered. When a packet arrives at the vSwitch, it traverses the full pipeline of match-action tables. At each pipeline stage, header fields from the packet are used to perform a lookup into a match-action table. If a match is found, then the packet is modified according to the actions found in the table.

Megaflow is prior work which memoizes the full pipeline of MATs. The memoization data structure is treated like a cache. When a packet arrives, a cache lookup occurs. On a miss, the regular vSwitch implementation is called to transform the packet. Subsequent packets which hit in the cache avoid executing the vSwitch code entirely. Megaflow supports keys with wildcards, to allow one cache entry to serve multiple flows.

A problem with Megaflow is that even with wildcards, a large cache is needed to achieve a high hit rate. For throughput reasons, one may wish to place a Megaflow cache in on-chip memory on a SmartNIC. However, if the SmartNIC does not have enough on-chip memory to achieve a high hit rate, then throughput suffers. See this post , and this post for a description of SmartNIC architectures and their on-chip memories.

This paper introduces a different memoization scheme (Gigaflow) to make better use of SmartNIC memory. Rather than memoizing the entire vSwitch pipeline for a packet, Gigaflow divides the vSwitch pipeline into multiple smaller pipelines, and memoizes each one separately. Fig. 1 illustrates this:

Source: https://dl.acm.org/doi/10.1145/3676641.3716000

Gigaflow takes advantage of SmartNICs’ ability to perform many table lookups per packet while maintaining high throughput. The total working set in a typical workload is reduced, because many flows can share some table entries (rather than all-or-nothing sharing). Another way to think about this is that Megaflow combines all of the MATs in a vSwitch pipeline into one very large table, whereas Gigaflow partitions the MAT vSwitch pipeline into a handful of sub-pipelines, and combines each sub-pipeline into a medium-sized table. Sections 4.1.1 and 4.2.2 of the paper have the nitty-gritty details of how Gigaflow decides how to correctly assign subsets of the vSwitch MAT pipeline into a set of tables.

Fig. 9 shows cache misses for Megaflow and Gigaflow for a variety of benchmarks:

Source: https://dl.acm.org/doi/10.1145/3676641.3716000

Dangling Pointers

Memoization is useful in settings outside of networking. It would be interesting to see if the idea of separable memoization could be applied to other applications. Like I mentioned here , hardware support for memoization in general purpose CPUs seems compelling.
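The difference between memoizing the whole pipeline and memoizing sub-traversals can be sketched as follows. This is a toy model with invented structure (exact-match keys on single header fields); the paper's actual tables match on wildcarded fields.

```python
def run_stage(stage, packet):
    # One match-action stage: look up a header field, apply the action if any.
    action = stage["rules"].get(packet.get(stage["field"]))
    return action(packet) if action else packet

def run_with_gigaflow(packet, sub_pipelines, caches):
    """Each sub-pipeline has its own memoization cache, so different flows
    can share the sub-traversals they have in common instead of needing
    one all-or-nothing entry for the whole pipeline."""
    for sub, cache in zip(sub_pipelines, caches):
        # Cache key: the header fields this sub-pipeline actually reads.
        key = tuple(packet.get(s["field"]) for s in sub)
        if key in cache:
            packet = {**packet, **cache[key]}  # hit: replay the memoized delta
        else:
            before = dict(packet)
            for stage in sub:
                packet = run_stage(stage, packet)
            # Memoize the net effect of this sub-traversal on the packet.
            cache[key] = {k: v for k, v in packet.items() if before.get(k) != v}
    return packet
```

Two packets that differ in fields read only by one sub-pipeline still hit in every other sub-pipeline's cache, which is the working-set reduction the paper is after.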


Optimizing Datalog for the GPU

Optimizing Datalog for the GPU Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Sidharth Kumar, and Kristopher Micinski ASPLOS'25

Datalog source code comprises a set of relations and a set of rules. A relation can be explicitly defined with a set of tuples; the running example in the paper defines a graph this way, as a relation of edges. A relation can also be implicitly defined with a set of rules. The paper uses a same-generation relation as an example. Rule 1 states that two vertices are part of the same generation if they both share a common ancestor and they are not actually the same vertex. Rule 2 states that two vertices are part of the same generation if they have ancestors from the same generation. “Running a Datalog program” entails evaluating all rules until a fixed point is reached (no more tuples are added).

One key idea to internalize is that evaluating a Datalog rule is equivalent to performing a SQL join. For example, rule 1 is equivalent to joining the edge relation with itself, using the common ancestor as the join key and filtering out pairs of identical vertices. Semi-naïve Evaluation is an algorithm for performing these joins until convergence, while not wasting too much effort on redundant work. The tuples in a relation are put into three buckets:

- a “new” bucket holds tuples that were discovered on the current iteration
- a “delta” bucket holds tuples which were added in the previous iteration
- a “full” bucket holds all tuples that have been found in any iteration

For a join involving two relations (A and B), the new tuples are computed as the union of the result of 3 joins: delta(A) joined with full(B), full(A) joined with delta(B), and delta(A) joined with delta(B). The key fact for performance is that full(A) is never joined with full(B). More details on Semi-naïve Evaluation can be found in these notes .

This paper introduces the hash-indexed sorted array for storing relations while executing Semi-naïve Evaluation on a GPU. It seems to me like this data structure would work well on other chips too. Fig. 2 illustrates the data structure (join keys are in red):

Source: https://dl.acm.org/doi/10.1145/3669940.3707274

The data array holds the actual tuple data. It is densely packed in row-major order. The sorted index array holds pointers into the data array (one pointer per tuple). These pointers are lexicographically sorted (join keys take higher priority in the sort). The hash table is an open-addressed hash table which maps a hash of the join keys to the first element in the sorted index array that contains those join keys. A join of relations A and B can then be implemented by probing B's hash table with each tuple's join keys and scanning the run of matching entries in the sorted index array. Memory accesses when probing through the sorted index array are coherent. Memory accesses when accessing the data array are coherent up to the number of elements in a tuple.

Table 3 compares the results from this paper (GPULog) against a state-of-the-art CPU implementation (Soufflé). HIP represents GPULog ported to AMD’s HIP runtime and then run on the same Nvidia GPU.

Source: https://dl.acm.org/doi/10.1145/3669940.3707274

Dangling Pointers

The data structure and algorithms described by this paper seem generic; it would be interesting to see them run on other chips (FPGA, DPU, CPU, HPC cluster). I would guess most of GPULog is bound by memory bandwidth, not compute. I wonder if there are Datalog-specific algorithms to reduce the bandwidth/compute ratio.
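Here is a CPU-side Python model of the hash-indexed sorted array and the probe-and-scan join. The layout details (such as storing join keys as the leading tuple columns) are my assumptions based on the figure description, not the paper's exact implementation.

```python
class HISA:
    """Hash-indexed sorted array: a dense data array of tuples, an index
    sorted lexicographically with the join keys taking priority, and a hash
    table mapping join keys to the first matching index position."""

    def __init__(self, tuples, num_join_keys):
        self.k = num_join_keys            # join keys are the leading columns
        self.data = list(tuples)
        self.index = sorted(range(len(self.data)),
                            key=lambda i: self.data[i])
        self.first = {}
        for pos, i in enumerate(self.index):
            # Record only the FIRST sorted position for each join key.
            self.first.setdefault(self.data[i][:self.k], pos)

    def matches(self, key):
        # Probe the hash table, then scan the contiguous sorted run.
        pos = self.first.get(key)
        while pos is not None and pos < len(self.index):
            t = self.data[self.index[pos]]
            if t[:self.k] != key:
                break  # ran past the run for this key
            yield t
            pos += 1

def join(a, b):
    # For every tuple in A, probe B and pair it with each match.
    return [(ta, tb) for ta in a.data for tb in b.matches(ta[:a.k])]
```

The scan over `index` is the part the paper notes is memory-coherent: all matches for a key are adjacent in the sorted index array.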


Extended User Interrupts (xUI): Fast and Flexible Notification without Polling

Extended User Interrupts (xUI): Fast and Flexible Notification without Polling Berk Aydogmus, Linsong Guo, Danial Zuberi, Tal Garfinkel, Dean Tullsen, Amy Ousterhout, and Kazem Taram ASPLOS'25

This paper describes existing hardware support for userspace interrupts, and extensions to make it more efficient. The kernel is powerful, and yet slow. You only want to call on it when necessary. Kernel bypass technologies like DPDK and io_uring exist to allow applications to reduce the frequency of kernel calls. In addition to I/O, applications frequently call the kernel to communicate between threads. For example, in a producer/consumer design, the producer could use a signal to tell the consumer that more data is ready. Section 2 of the paper reminds readers that each of these signals costs about 2.4 microseconds. The idea behind userspace interrupts is to get the kernel out of the way and instead have dedicated hardware to support cheap signaling between threads.

UIPI is a hardware feature introduced by Intel with Sapphire Rapids. Section 3 of the paper describes how UIPI works, including reverse engineering some of the Sapphire Rapids microarchitecture. Like other kernel bypass technologies, the kernel is still heavily involved in the control path. When a process requests that the kernel configure UIPI, the kernel responds by creating or modifying the user interrupt target table (UITT) for the process. This per-process table has one entry per thread. The kernel thread scheduler updates this table so that the UIPI hardware can determine which core a thread is currently running on.

Once the control path is set up, the data path runs without kernel involvement. Userspace code which wants to send an interrupt to another thread can execute the senduipi instruction. This instruction has one operand, which is an index into the UITT (the index of the destination thread).
The hardware then consults the UITT and sends an inter-processor interrupt (IPI) to the core on which the destination thread is running. Userspace code in the destination thread then jumps to a pre-registered interrupt handler, which runs arbitrary code in userspace. Hardware has long supported IPIs, but typically only the kernel has had the ability to invoke them. The hardware has the ability to coordinate with the OS to handle the case where the destination thread is not currently running on any core. These cases do involve running kernel code, but they are the slow path. The fast path is handled without any kernel involvement.

The authors measure an end-to-end latency of 1360 clock cycles for a userspace interrupt on Sapphire Rapids. Fig. 2 illustrates where that time goes:

Source: https://dl.acm.org/doi/abs/10.1145/3676641.3716028

The largest cost is the pipeline flush when the receiving core receives the IPI. Section 3.4 describes experiments the authors performed to determine these numbers, including how they determined that the receiving processor pipeline is flushed. Note that “flush” here means that in-flight instructions are squashed (i.e., what happens when a branch misprediction is detected). An alternative strategy would be to drain the pipeline, which would let outstanding instructions commit before handling the interrupt. This would avoid duplicate work but would increase the latency of handling an interrupt.

Tracked Interrupts

The authors propose tracked interrupts to solve the performance problem associated with pipeline flushes. When an interrupt is received, the receiving core immediately injects the micro-ops needed to handle the interrupt into the pipeline. Outstanding instructions are not squashed. This may sound similar to draining, but there is a key difference. With draining, the processor waits for all inflight instructions to commit before injecting the interrupt handling micro-ops.
With tracked interrupts, micro-ops enter the pipeline immediately, and the out-of-order machinery of the processor will not see any dependencies between the interrupt handling micro-ops and the inflight instructions. This means that the interrupt handling micro-ops can execute quickly and typically do not need to wait behind all inflight instructions.

Results

The paper has simulation results which show that tracked interrupts save 414 clock cycles on the receiver side. The paper also discusses some related improvements to UIPI:

Userspace timer support

Userspace interrupts from I/O devices

Safepoints - to allow userspace interrupts to play nice with garbage collection

I wonder how much of the Sapphire Rapids design was dictated by simplicity (to reduce the risk associated with this feature)? A frequent theme in papers reviewed on this blog is how difficult it is to implement fine-grained multithreading on general purpose multi-core CPUs. I wonder how much userspace interrupt support can help with that problem?
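The three delivery policies discussed above (flush, drain, tracked) can be contrasted with a toy cost model. The cycle counts below are invented for illustration, not measurements from the paper.

```python
# Toy cost model for how a core starts an interrupt handler, per the
# flush/drain/tracked taxonomy. Numbers and the model are invented.

def handle_interrupt(inflight_cycles, policy):
    """Return (handler_start_delay, squashed_cycles) for one policy.

    inflight_cycles: remaining cycles of each in-flight instruction
    when the IPI arrives.
    """
    if policy == "flush":
        # Squash everything in flight (like a branch mispredict): the
        # handler starts immediately, but squashed work is redone later.
        return 0, sum(inflight_cycles)
    if policy == "drain":
        # Let every in-flight instruction commit before the handler runs.
        return max(inflight_cycles, default=0), 0
    if policy == "tracked":
        # Inject handler micro-ops immediately; they carry no
        # dependencies on in-flight work, so nothing waits or is squashed.
        return 0, 0
    raise ValueError(f"unknown policy: {policy}")

inflight = [3, 7, 12]
for policy in ("flush", "drain", "tracked"):
    print(policy, handle_interrupt(inflight, policy))
```

The point of the model is that tracked interrupts get the best of both: the zero start delay of a flush with the zero wasted work of a drain.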

Dangling Pointers 1 month ago

No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory

No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory Hyoungjoo Kim, Yiwei Zhao, Andrew Pavlo, Phillip B. Gibbons VLDB'25

This paper describes how processing-in-memory (PIM) hardware can be used to improve OLTP performance. Here is a prior paper summary from me on a similar topic, but that one is focused on OLAP rather than OLTP.

UPMEM is a specific PIM product (also used in the prior paper on this blog). A UPMEM DIMM is like a DRAM DIMM, but each DRAM bank is extended with a simple processor which can run user code. That processor has access to a small local memory and the DRAM associated with the bank. This paper calls each processor a PIM Module. There is no direct communication between PIM modules. Fig. 2 illustrates the system architecture used by this paper. A traditional CPU is connected to a set of boring old DRAM DIMMs and is also connected to a set of UPMEM DIMMs.

Source: https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory

Four Challenges

The paper identifies the following difficulties associated with using UPMEM to accelerate an OLTP workload:

PIM modules can only access their local memory

PIM modules do not have typical niceties associated with x64 CPUs (high clock frequency, caches, SIMD)

There is a non-trivial cost for the CPU to send data to UPMEM DIMMs (similar to the CPU writing data to regular DRAM)

OLTP workloads have tight latency constraints

The authors arrived at a solution that both provides a good speedup and doesn’t require boiling the ocean. The database code and architecture remain largely unchanged. Much of the data remains in standard DRAM DIMMs, and the database operates on it as it always has. In section 3.2 the authors identify a handful of data structures and operations with near-memory affinity which are offloaded.
These data structures are stored in UPMEM DIMMs, and the algorithms which access them are offloaded to the PIM modules. The key feature that these algorithms have in common is pointer chasing. The sweet spots the authors identify involve a small number of parameters sent from the CPU to a PIM module, then the PIM module performing multiple roundtrips to its local DRAM bank, followed by the CPU reading back a small amount of response data. The roundtrips to PIM-local DRAM have lower latency than accesses from a traditional CPU core.

One operation which involves a lot of pointer chasing is B+ tree traversal. Thus, the system described in this paper moves B+ tree indexes into UPMEM DIMMs and uses PIM modules to search for values in an index. Note that the actual tuples that hold row data stay in plain-old DRAM. The tricky part is handling range queries while distributing an index across many banks. The solution described in this paper is to partition the set of keys into 2^R partitions (the lower bits of a key define the index of the partition which holds that key). Each partition is thus responsible for a contiguous array of keys. For a range query, the lower bits of the lower and upper bounds of the range can be used to determine which partitions must be searched. Each PIM module is responsible for multiple partitions, and a hash function is used to convert a partition index into a PIM module index.

MVCC is a concurrency control method which requires the database to keep around old versions of a given row (to allow older in-flight queries to access them). The set of versions associated with a row is typically stored in a linked list (yet another pointer traversal). Again, the actual tuple contents are stored in regular DRAM, but the list links are stored in UPMEM DIMMs, with the PIM modules traversing the links. Section 4.3 has more information about how old versions are eventually reclaimed with garbage collection.

Fig. 7 has the headline results.
One system is the baseline, the other is the work described by this paper. It is interesting that the paper's system only beats the baseline for read-only workloads.

Source: https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory

Processing-in-memory can help with memory bandwidth and memory latency. It seems like this work is primarily focused on memory latency. I suppose this indicates that OLTP workloads are fundamentally latency-bound, because there is not enough potential concurrency between transactions to hide that latency. Is there no way to structure a database such that OLTP workloads are not bound by memory latency? It would be interesting to see if these tricks could work in a distributed system, where the PIM modules are replaced by separate nodes in the system.
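As an aside, the low-bit range-partitioning scheme described above can be sketched in a few lines. This is my reconstruction for illustration; R, the hash function, and all names are invented, not taken from the paper.

```python
# Sketch of the range-partitioned index routing: the low R bits of a
# key pick a partition, and a hash maps each partition to a PIM module.
# All names and constants here are invented for the example.

R = 4                       # 2**R partitions (example value)
NUM_PARTITIONS = 1 << R

def partition_of(key):
    return key & (NUM_PARTITIONS - 1)

def pim_module_of(partition, num_modules):
    # Any stable hash works; this is a simple multiplicative hash.
    return (partition * 2654435761) % num_modules

def partitions_for_range(lo, hi):
    """Return the set of partitions a range query [lo, hi] must search."""
    width = hi - lo + 1
    if width >= NUM_PARTITIONS:
        return set(range(NUM_PARTITIONS))   # range spans every partition
    start = partition_of(lo)
    return {(start + i) % NUM_PARTITIONS for i in range(width)}

def modules_for_range(lo, hi, num_modules):
    """Return the PIM modules that must participate in the query."""
    return {pim_module_of(p, num_modules)
            for p in partitions_for_range(lo, hi)}

print(sorted(partitions_for_range(5, 8)))   # -> [5, 6, 7, 8]
```

The routing is computed entirely on the CPU from the query bounds, so only the partitions (and thus PIM modules) that can hold matching keys are contacted.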


Parendi: Thousand-Way Parallel RTL Simulation

Parendi: Thousand-Way Parallel RTL Simulation Mahyar Emami, Thomas Bourgeat, and James R. Larus ASPLOS'25

This paper describes an RTL simulator running on (one or more) Graphcore IPUs. One nice side benefit of this paper is the quantitative comparison of IPU synchronization performance vs traditional CPUs. Here is another paper summary which describes some challenges with RTL simulation.

The Graphcore IPU used in this paper is a chip with 1472 cores, operating with a MIMD architecture. A 1U server can contain 4 IPUs. It is interesting to see a chip that was designed for DNN workloads adapted to the domain of RTL simulation.

Similar to other papers on RTL simulation, a fundamental step of the Parendi simulator is partitioning the circuit to be simulated. Parendi partitions the circuit into fibers. A fiber comprises a single (word-wide) register and all of the combinational logic which feeds it. Note that some combinational logic may be present in multiple fibers. Fig. 3 contains an example: node a3 is present in multiple fibers. As far as I can tell, Parendi does not try to deduplicate this work (extra computation to save synchronization).

Source: https://dl.acm.org/doi/10.1145/3676641.3716010

The driving factor in the design of this fiber-specific partitioning system is scalability. Each register has storage to hold the value of the register at the beginning and end of the current clock cycle. I think of the logic to simulate a single clock cycle as a simple loop: each fiber computes the next value of its root register from the current values of its inputs, all fibers hit a barrier, each fiber latches its next value into its current value, and all fibers hit a second barrier. Scalability comes from the fact that there are only two barriers per simulated clock cycle. This is an instance of the bulk synchronous parallel (BSP) model.

In many cases, there are more fibers than CPU/IPU cores. Parendi addresses this by distributing the simulation across chips and scheduling multiple fibers to run on the same core.
If the simulation is distributed across multiple chips, then a min-cut algorithm is used to partition the fibers across chips while minimizing communication. The Parendi compiler statically groups multiple fibers together into a single process. A core simulates all fibers within a process. The merging process primarily seeks to minimize inter-core communication. First, a special-case merging algorithm merges multiple fibers which reference the same large array. This is to avoid communicating the contents of such an array across cores. I imagine this is primarily for simulation of on-chip memories. Secondly, a general-purpose merging algorithm merges fibers which each have low compute cost and high data sharing with each other.

Fig. 7 compares Parendi vs Verilator simulation on two x64 machines: a 2-socket server with 28 Intel cores per socket, and a 2-socket server with 64 AMD cores per socket:

Source: https://dl.acm.org/doi/10.1145/3676641.3716010

Section 6.4 claims a roughly 2x improvement in cost per simulation using cloud pricing. As far as I can tell, this system doesn’t have optimizations for the case where some or all of a fiber’s inputs do not change between clock cycles. It seems tricky to optimize for this case while maintaining a static assignment of fibers to cores.

Fig. 4 has a fascinating comparison of synchronization costs between an IPU and a traditional x64 CPU. This microbenchmark loads up the system with simple fibers (roughly 6 instructions per fiber). Note that the curves represent different fiber counts (e.g., the red dotted line represents 7 fibers on the IPU graph, vs 736 fibers on the x64 graph). The paper claims that a barrier between 56 x64 threads implemented with atomic memory accesses consumes thousands of cycles, whereas the IPU has dedicated hardware barrier support.

Source: https://dl.acm.org/doi/10.1145/3676641.3716010

This seems to be one of many examples of how generic multi-core CPUs do not perform well with fine-grained multi-threading.
We’ve seen it with pipeline parallelism, and now with the BSP model. Interestingly, both cases seem to work better with specialized multi-core chips (pipeline parallelism works with CPU-based SmartNICs, BSP works with IPUs). I’m not convinced this is a fundamental hardware problem that cannot be addressed with better software.
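The two-barriers-per-cycle BSP loop at the heart of Parendi can be sketched with ordinary threads. This is my reconstruction of the cycle loop described above; the fiber logic and all names are invented, and a real IPU uses hardware barriers rather than a software Barrier.

```python
# Sketch of the BSP simulation loop: each fiber owns one register
# (cur/nxt hold its value at the start and end of the cycle) plus the
# combinational logic feeding it. Two barriers per simulated cycle.

from threading import Barrier, Thread

NUM_FIBERS = 4
cur = [0] * NUM_FIBERS          # register values at start of cycle
nxt = [0] * NUM_FIBERS          # register values at end of cycle
barrier = Barrier(NUM_FIBERS)

def combinational(fiber, state):
    # Stand-in for the fiber's combinational cone: here, each register's
    # next value is its neighbor's current value plus one.
    return state[(fiber + 1) % NUM_FIBERS] + 1

def simulate(fiber, cycles):
    for _ in range(cycles):
        nxt[fiber] = combinational(fiber, cur)  # phase 1: evaluate
        barrier.wait()                          # barrier 1
        cur[fiber] = nxt[fiber]                 # phase 2: latch
        barrier.wait()                          # barrier 2

threads = [Thread(target=simulate, args=(f, 3)) for f in range(NUM_FIBERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cur)  # -> [3, 3, 3, 3]
```

The first barrier guarantees every fiber has read the old register values before anyone latches; the second guarantees every fiber has latched before the next cycle's reads begin. No finer-grained synchronization is needed, which is what makes the scheme scale.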


Skia: Exposing Shadow Branches

Skia: Exposing Shadow Branches Chrysanthos Pepi, Bhargav Reddy Godala, Krishnam Tibrewala, Gino A. Chacon, Paul V. Gratz, Daniel A. Jiménez, Gilles A. Pokam, and David I. August ASPLOS'25

This paper starts with your yearly reminder of the high cost of the Turing Tax: “Recent works demonstrate that the front-end is a considerable source of performance loss [16], with upwards of 53% of performance [23] bounded by the front-end.”

Everyone knows that the front-end runs ahead of the back-end of a processor. If you want to think of it in AI terms, imagine a model that is told the current value and recent history of the program counter, and asked to predict future values of the program counter. The accuracy of these predictions determines how well utilized the processor pipeline is. What I did not know is that in a modern processor, the front-end itself is divided into two decoupled components, one of which runs ahead of the other. Fig. 4 illustrates this Fetch Directed Instruction Prefetching (FDIP) microarchitecture:

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

The Instruction Address Generator (IAG) runs the furthest ahead and uses tables (e.g., the Branch Target Buffer (BTB)) in the Branch Prediction Unit (BPU) to predict the sequence of basic blocks which will be executed. Information about each predicted basic block is stored in the Fetch Target Queue (FTQ). The Instruction Fetch Unit (IFU) uses the control flow predictions from the FTQ to actually read instructions from the instruction cache. Some mispredictions can be detected after an instruction has been read and decoded. These result in an early re-steer (i.e., informing the IAG about the misprediction immediately after decode). When a basic block is placed into the FTQ, the associated instructions are prefetched into the IFU (to reduce the impact of instruction cache misses).

Shadow Branches

This paper introduces the term “shadow branch”.
A shadow branch is a (static) branch instruction which is currently stored in the instruction cache but is not present in any BPU tables. The top of fig. 5 illustrates a head shadow branch. A branch instruction caused execution to jump to byte 24 and execute the non-shaded instructions. This causes an entire cache line to be pulled into the instruction cache, including the branch instruction starting at byte 19. The bottom of fig. 5 shows a tail shadow branch. In this case, the instruction at byte 12 jumped away from the cache line, causing the red branch instruction at byte 16 to not be executed (even though it is present in the instruction cache).

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

Skia

The proposed design (Skia) allows the IAG to make accurate predictions for a subset of shadow branches, thus improving pipeline utilization and reducing instruction cache misses. The types of shadow branches which Skia supports are:

Direct unconditional branches (target PC can be determined without looking at backend state)

Function calls

Returns

As shown in Fig. 6, these three categories of branches (purple, red, orange) account for a significant fraction of all BTB misses:

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

When a cache line enters the instruction cache, the Shadow Branch Decoder (SBD) decodes just enough information to locate shadow branches in the cache line and determine the target PC (for direct unconditional branches and function calls). Metadata from the SBD is placed into two new branch prediction tables in the BPU:

The U-SBB holds information about direct unconditional branches and function calls

The R-SBB holds information about returns

When the BPU encounters a BTB miss, it can fall back to the U-SBB or R-SBB for a prediction. Fig. 11 illustrates the microarchitectural changes proposed by Skia:

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

Section 4 goes into more details about these structures including:

Replacement policy

How a shadow branch is upgraded into a first-class branch in the BTB

Handling variable length instructions

Fig. 14 has (simulated) IPC improvements across a variety of benchmarks:

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

Dangling Pointers

A common problem that HW and SW architects solve is getting teams out of a local minimum caused by fixed interfaces. The failure mode is when two groups of engineers agree on a static interface, and then each optimize their component as best they can without changing the interface. In this paper, the interface is the ISA, and Skia is a clever optimization inside of the CPU front-end. Skia shows that there is fruit to be picked here. It would be interesting to examine potential performance gains from architectural (i.e., ISA) changes to pick the same fruit.
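As a postscript, the Shadow Branch Decoder's job can be illustrated with a toy pre-decoder. The instruction encoding below is invented for the example; a real SBD partially decodes variable-length x86 bytes as a line fills into the instruction cache.

```python
# Toy sketch of what an SBD might extract when a line enters the
# instruction cache: direct unconditional branches and calls (with
# their statically-known targets) feed a U-SBB-like table, returns
# feed an R-SBB-like table. Encoding and names are invented.

def shadow_decode(line_pc, instrs):
    """instrs: list of (opcode, target_or_None), one per instruction slot.

    Returns (u_sbb_entries, r_sbb_entries) for the two new BPU tables.
    """
    u_sbb, r_sbb = [], []
    for slot, (opcode, target) in enumerate(instrs):
        pc = line_pc + slot
        if opcode in ("jmp", "call"):
            # Direct: the target is encoded in the instruction itself,
            # so no backend state is needed to predict it.
            u_sbb.append((pc, opcode, target))
        elif opcode == "ret":
            # Target comes from a return stack at prediction time; the
            # SBD only needs to record where the return lives.
            r_sbb.append(pc)
        # Conditional branches are skipped: their direction depends on
        # backend state, so Skia does not cover them.
    return u_sbb, r_sbb

line = [("add", None), ("jmp", 64), ("load", None), ("ret", None)]
print(shadow_decode(32, line))  # -> ([(33, 'jmp', 64)], [35])
```

On a BTB miss, the BPU can then fall back to these tables instead of stalling until the branch is decoded and triggers a re-steer.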
