Latest Posts (20 found)

LoopFrog: In-Core Hint-Based Loop Parallelization

LoopFrog: In-Core Hint-Based Loop Parallelization Marton Erdos, Utpal Bora, Akshay Bhosale, Bob Lytton, Ali M. Zaidi, Alexandra W. Chadwick, Yuxin Guo, Giacomo Gabrielli, and Timothy M. Jones MICRO'25

To my Kanagawa pals: I think hardware like this would make a great target for Kanagawa, what do you think?

The message of this paper is that there is plenty of loop-level parallelism available which superscalar cores are not yet harvesting. Fig. 1 illustrates the classic motivation for multi-core processors: scaling the processor width by 4x yields only a 2x IPC improvement. In general, wider cores are heavily underutilized.

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

The main idea behind LoopFrog is to add hints to the ISA which allow a wide core to exploit more loop-level parallelism in sequential code.

Structured Loops

If you understand Fig. 2, then you understand LoopFrog; the rest is just details:

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

The compiler emits instructions which the processor can use to understand the structure of a loop. Processors are free to ignore the hints. A loop which can be optimized by LoopFrog comprises three sections:

A header, which launches each loop iteration
A body, which accepts values from the header
A continuation, which computes values needed for the next loop iteration (e.g., the values of induction variables)

Each execution of the header launches two threadlets. A threadlet is like a thread but is only ever executed on the core which launched it. One threadlet launched by the header executes the body of the loop. The other threadlet launched by the header is the continuation, which computes values needed for the next loop iteration.

Register loop-carried dependencies are allowed between the header and continuation, but not between body invocations. That is the key which allows multiple bodies to execute in parallel (see Fig. 2c above). A sketch of how a simple loop might be factored this way appears at the end of this summary.

At any one time, there is one architectural threadlet (the oldest one), which can update architectural state. All other threadlets are speculative. Once the architectural threadlet for loop iteration i completes, it hands the baton over to the threadlet executing iteration i+1, which becomes architectural.

Dependencies through memory are handled by the speculative state buffer (SSB). When a speculative threadlet executes a memory store, data is stored in the SSB and actually written to memory later on (i.e., after that threadlet is no longer speculative). Memory loads read from both the L1 cache and the SSB, and then disambiguation hardware determines which data to use and which to ignore.

The hardware implementation evaluated by the paper does not support nested parallelization; it simply ignores hints inside of nested loops.

Fig. 6 shows simulated performance results for an 8-wide core. A core which supports 4 threadlets is compared against a baseline which does not implement LoopFrog.

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

LoopFrog can improve performance by about 10%. Fig. 1 at the top shows that an 8-wide core experiences about 25% utilization, so there may be more fruit left to pick.
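To make the header/body/continuation split concrete, here is a minimal sketch in plain C++ rather than the actual LoopFrog ISA, showing how a compiler might factor a loop so that body invocations carry no register dependencies between them (the function and variable names are my own):

```cpp
#include <cstddef>

void scale(float* dst, const float* src, std::size_t n, float k) {
    std::size_t i = 0;            // loop-carried state lives here
    while (i < n) {               // header: launches one iteration
        // Body threadlet: receives {dst, src, i, k} from the header and has
        // no register dependency on other body invocations, so several of
        // them can run speculatively in parallel.
        dst[i] = src[i] * k;

        // Continuation threadlet: computes the values the next iteration
        // needs (here, just the induction variable).
        i = i + 1;
    }
}
```

Stores performed by speculative body invocations would be buffered in the SSB until the corresponding threadlet becomes architectural.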

0 views

Dynamic Load Balancer in Intel Xeon Scalable Processor

Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines Jiaqi Lou, Srikar Vanavasam, Yifan Yuan, Ren Wang, and Nam Sung Kim ISCA'25

This paper describes the DLB accelerator present in modern Xeon CPUs. The DLB addresses a similar problem discussed in the state-compute replication paper: how to parallelize packet processing when RSS (static NIC-based load balancing) is insufficient.

Imagine a 100 Gbps NIC is receiving a steady stream of 64B packets and sending them to the host. If RSS is inappropriate for the application, then another parallelization strategy would be for a single CPU core to distribute incoming packets to all of the others. To keep up, that load-distribution core would have to be able to process 200M packets per second, but state-of-the-art results top out at 30M packets per second. The DLB is an accelerator designed to solve this problem.

Fig. 2 illustrates the DLB hardware and software architecture:

Source: https://dl.acm.org/doi/10.1145/3695053.3731026

A set of producer cores can write 16B queue elements (QEs) into a set of producer ports (PPs). In a networking application, one QE could map to a single packet. A set of consumer cores can read QEs out of consumer queues (CQs). QEs contain metadata which producers can set to enable ordering within a flow/connection, and to control relative priorities. The DLB balances the load at each consumer, while honoring ordering constraints and priorities.

A set of cores can send QEs to the DLB in parallel without suffering too much from skew. For example, imagine a CPU with 128 cores. If DLB is not used, and instead RSS is configured to statically distribute connections among those 128 cores, then skew could be a big problem. If DLB is used, and there are 4 cores which write into the producer ports, then RSS can be configured to statically distribute connections among those 4 cores, and skew is much less likely to be a problem.

Fig. 5 shows that DLB works pretty well. Two of the configurations are software load balancers, while the third uses the DLB accelerator. DLB offers similar throughput and latency to RSS, but with much more flexibility.

Source: https://dl.acm.org/doi/10.1145/3695053.3731026

AccDirect

One awkward point in the design above is the large number of CPU cycles consumed by the set of producer cores which write QEs into the DLB. The paper proposes AccDirect to solve this. The idea is that the DLB appears as a PCIe device, and therefore a flexible NIC can use PCIe peer-to-peer writes to send packets directly to the DLB. The authors find that the NVIDIA BlueField-3 has enough programmability to support this. Fig. 9 shows that this results in a significant power savings, but not much of a latency improvement:

Source: https://dl.acm.org/doi/10.1145/3695053.3731026

Dangling Pointers

I feel like it is common knowledge that fine-grained parallelism doesn't work well on multi-core CPUs. In the context of this paper, the implication is that it is infeasible to write a multi-core packet processor that primarily uses pipeline parallelism. Back-of-the-envelope: at 400 Gbps and 64B packets, there is a budget of about 40 8-wide SIMD instructions to process a batch of 8 packets. If there are 128 cores, then maybe the aggregate budget is 4K instructions per batch of 8 packets across all cores. This doesn't seem implausible to me.
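For reference, the 200M packets-per-second figure above falls out of a quick calculation (ignoring Ethernet framing overhead on the wire):

```latex
\frac{100 \times 10^{9}\ \text{bit/s}}{64\ \text{B} \times 8\ \text{bit/B}}
  \approx 195 \times 10^{6}\ \text{packets/s} \approx 200\text{M packets/s}
```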

0 views

The XOR Cache: A Catalyst for Compression

The XOR Cache: A Catalyst for Compression Zhewen Pan and Joshua San Miguel ISCA'25

Inclusive caches seem a bit wasteful. This paper describes a clever mechanism for reducing that waste: store two cache lines of data in each physical cache line! In most of the paper there are only two caches in play (L1 and LLC), but the idea generalizes.

Fig. 1 illustrates the core concept. In this example, there are two cache lines (A and B) in the LLC. A is also present in the L1 cache of a core. Here is the punchline: the LLC stores A XOR B. If the core which has cached A experiences a miss when trying to access B, then the core resolves the miss by asking the LLC for A XOR B. Once that data arrives at the L1, B can be recovered by computing A XOR (A XOR B). The L1 stores A and B separately, so as to not impact L1 hit latency.

Source: https://dl.acm.org/doi/10.1145/3695053.3730995

Coherence Protocol

The mechanics of doing this correctly are implemented in a cache coherence protocol described in section 4 of the paper. We've discussed the local recovery case, where the core which needs B already holds A in its L1 cache. If the core which requires B does not hold A, then two fallbacks are possible.

Before describing those cases, the following property is important to highlight. The cache coherence protocol ensures that if the LLC holds A XOR B, then some L1 cache in the system will hold a copy of either A or B. If some action was about to violate this property, then the LLC would request a copy of A or B from some L1, and use it to split A XOR B into separate cache lines in the LLC holding A and B.

Direct forwarding occurs when some other core holds a copy of B. In this case, the system requests the other core to send B to the core that needs it.

The final case is called remote recovery. If the LLC holds A XOR B and no L1 cache holds a copy of B, then some L1 cache in the system must hold a copy of A. The LLC sends A XOR B to the core which has A locally cached. That core computes A XOR (A XOR B) to recover a copy of B and sends it to the requestor.

Another tricky case to handle is when a core writes to one of the paired lines. The cache coherence protocol handles this case similarly to eviction and ensures that the LLC will split cache lines as necessary so that all data is always recoverable.

The LLC has a lot of freedom when it comes to deciding which cache lines to pair up. The policies described in this paper optimize for intra-cache line compression (compressing the data within a single cache line). The LLC hardware maintains a hash table. When searching for a partner for a cache line, the LLC computes a hash of that line's contents to find a set of potential partners. One hash function described by the paper is sparse byte labeling, which produces 6 bits for each 8-byte word in a cache line. Each bit is set to 0 if the corresponding byte in the word is zero. The lower two bytes of each word are ignored. The idea here is that frequently the upper bytes of a word are zero. If two cache lines have the same byte label, then when you XOR them together the merged cache line will have many zero bytes (i.e., they have low entropy and are thus compressible). The LLC can optimize for this case by storing compressed data in the cache and thus increasing its effective capacity.

This paper relies on prior work related to compressed caches. The takeaway is that not only is there a potential 2x savings possible from logically storing two cache lines in one physical location, but there are also further savings from compressing these merged cache lines.

Fig. 13 compares the compression ratio achieved by the XOR cache against prior work (taller bars are better). The right-most set of bars shows the geometric mean:

Source: https://dl.acm.org/doi/10.1145/3695053.3730995

Fig. 15 shows performance impacts associated with this scheme:

Source: https://dl.acm.org/doi/10.1145/3695053.3730995

Dangling Pointers

It seems to me that the XOR Cache mostly benefits from cache lines that are rarely written. I wonder if there are ways to predict if a particular cache line is likely to be written in the near future.
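Here is a minimal sketch of the XOR trick at the heart of the scheme (my own illustration, not the coherence protocol itself): the LLC keeps only A XOR B, and a core that already holds A can recover B from the merged line with one more XOR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using CacheLine = std::array<std::uint8_t, 64>;

CacheLine xor_lines(const CacheLine& x, const CacheLine& y) {
    CacheLine out{};
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = x[i] ^ y[i];
    return out;
}

// LLC side: merge two logical lines into one physical line.
CacheLine merge(const CacheLine& a, const CacheLine& b) { return xor_lines(a, b); }

// L1 side (local recovery): given A and the merged line, recover B.
CacheLine recover_b(const CacheLine& a, const CacheLine& merged) { return xor_lines(a, merged); }
```

Remote recovery is the same last step, just performed by whichever core happens to hold A.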

0 views

CHERIoT RTOS: An OS for Fine-Grained Memory-Safe Compartments on Low-Cost Embedded Devices

CHERIoT RTOS: An OS for Fine-Grained Memory-Safe Compartments on Low-Cost Embedded Devices Saar Amar, Tony Chen, David Chisnall, Nathaniel Wesley Filardo, Ben Laurie, Hugo Lefeuvre, Kunyan Liu, Simon W. Moore, Robert Norton-Wright, Margo Seltzer, Yucong Tao, Robert N. M. Watson, and Hongyan Xia SOSP'25

This paper is a companion to a previous paper which described the CHERIoT hardware architecture. This work presents an OS that doesn't look like the systems you are used to. The primary goal is memory safety (and security more broadly). Why rewrite your embedded code in Rust when you can switch to a fancy new chip and OS instead?

Recall that a CHERI capability is a pointer augmented with metadata (bounds, access permissions). CHERI allows a more restrictive capability to be derived from a less restrictive one (e.g., reduce the bounds or remove access permissions), but not the other way around.

CHERIoT RTOS doesn't have the notion of a process; instead it has a compartment. A compartment comprises code and compartment-global data. Compartment boundaries are trust boundaries. I think of it like a microkernel operating system. Example compartments in CHERIoT include the boot loader, the context switcher, the heap allocator, and the thread scheduler.

The boot loader is fully trusted and is the first code to run. The hardware provides the boot loader with the ultimate capability. The boot loader then derives more restrictive capabilities, which it passes to other compartments. You could imagine a driver compartment which is responsible for managing a particular I/O device. The boot loader would provide that compartment with a capability that enables the compartment to access the MMIO registers associated with the device. There is no user space/kernel space distinction here, only a set of compartments, each with a unique set of capabilities. Fig. 3 illustrates a compartment:

Source: https://dl.acm.org/doi/10.1145/3731569.3764844

Sealed Capabilities

The CHERIoT hardware architecture supports sealing of capabilities. Sealing a capability is similar to deriving a more restrictive one, only this time the derived capability is useless until it is unsealed by a compartment which holds a capability with unsealing permissions. I think of this like a client encrypting some data before storing it on a server. The data is useless to everyone except for the client who can decrypt it.

Cross-compartment function calls are similar to system calls and are implemented with sealed capabilities. Say compartment A needs to be able to call a function exported by compartment B. At boot, the boot loader derives a "function call" capability which is a pointer into the export table associated with B, seals that capability, and passes it to compartment A at initialization. The boot loader also gives the switcher a capability which allows it to unseal the function call capability. When A wants to call the function exported by B, it passes the sealed capability to the switcher. The switcher then unseals the capability and uses it to read metadata about the exported function from B's export table. The switcher uses this metadata to safely perform the function call.

Capability sealing also simplifies inter-compartment state management. Say compartment A calls into compartment B (for networking) to create a TCP connection. The networking compartment can allocate a complicated tree of objects and then return a sealed capability which points to that tree. Compartment A can hold on to that capability and pass it as a parameter for future networking function calls (which B will unseal and then use). Compartment B doesn't need to track per-connection objects in its global state. A software analogy for this pattern appears at the end of this summary.

Heap Allocator

The heap compartment handles memory allocation for all compartments. There is just one address space shared by all compartments, but capabilities make the whole thing safe. As described in the previous summary, when an allocation is freed, the heap allocator sets the associated revocation bits. This prevents use-after-free bugs (in conjunction with the CHERIoT hardware load filter). Similar to garbage collection, freed memory is quarantined (not reused) until a memory sweep completes which ensures that no outstanding valid capabilities are referencing the memory to be reused. The allocator supports allocation capabilities which can enforce per-compartment quotas.

Threads

If you've had enough novelty, you can rest your eyes for a moment. The CHERIoT RTOS supports threads, and they mostly behave like you would expect. The only restriction is that threads are statically declared in code. Threads begin execution in the compartment that declares them, but then threads can execute code in other compartments via cross-compartment calls.

Micro-reboots

Each compartment is responsible for managing its own state with proper error handling. If all else fails, the OS supports micro-reboots, where a single compartment can be reset to a fresh state. The cross-compartment call mechanism supported by the switcher enables the necessary bookkeeping for micro-reboots. The steps to reboot a single compartment are:

Stop new threads from calling into the compartment (these calls fail with an error code)
Fault all threads which are currently executing in the compartment (this will also result in error codes being returned to other compartments)
Release all resources (e.g., heap data) which have been allocated by the compartment
Reset all global variables to their initial state

I wonder how often a micro-reboot of one compartment results in an error code which causes other compartments to micro-reboot. If a call into a compartment which is in the middle of a micro-reboot can fail, then I could see that triggering a cascade of micro-reboots.

The ideas here remind me of Midori, which relied on managed languages rather than hardware support. I wonder which component is better to trust, an SoC or a compiler?
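Here is a software analogy for the sealed-capability pattern above (my own sketch, not the CHERIoT API; the class and function names are made up): the networking compartment returns an opaque token that the caller can only hand back, and only the issuing compartment can resolve it to real per-connection state.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>

struct SealedHandle { std::uint64_t token; };   // opaque to the calling compartment

class NetworkCompartment {
    struct Connection { /* socket state, buffers, ... */ };
    std::map<std::uint64_t, Connection> connections_;
    std::uint64_t next_ = 1;
public:
    // Returns a "sealed" handle; the caller stores it but cannot peek inside
    // or forge one of its own.
    SealedHandle tcp_connect() {
        std::uint64_t t = next_++;
        connections_.emplace(t, Connection{});
        return SealedHandle{t};
    }
    // Later calls pass the handle back; only this compartment can "unseal" it,
    // i.e., map it back to the connection state it refers to.
    void tcp_send(SealedHandle h, const void* buf, std::size_t len) {
        auto it = connections_.find(h.token);
        if (it == connections_.end()) throw std::invalid_argument("bad handle");
        (void)buf; (void)len;   // ... transmit using it->second ...
    }
};
```

In CHERIoT the unforgeability comes from the hardware sealing mechanism rather than from hiding a lookup table, but the division of responsibility is the same: the caller keeps the handle, the issuer keeps the state.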

0 views

CHERIoT: Complete Memory Safety for Embedded Devices

CHERIoT: Complete Memory Safety for Embedded Devices Saar Amar, David Chisnall, Tony Chen, Nathaniel Wesley Filardo, Ben Laurie, Kunyan Liu, Robert Norton, Simon W. Moore, Yucong Tao, Robert N. M. Watson, and Hongyan Xia MICRO'23

If you are like me, you've vaguely heard of CHERI, but never really understood what it is about. Here it is in one sentence: hardware support for memory protection embedded in every single pointer. This particular paper focuses on CHERI implementation details for embedded/IoT devices.

The C in CHERI stands for capability. A capability is a fat pointer which contains an address, bounds, and access permissions. Fig. 1 shows the bit layout (64 bits total) of the capabilities used by CHERIoT:

Source: https://dl.acm.org/doi/10.1145/3613424.3614266

Here is the fundamental concept to understand: the only way to access memory in a CHERI architecture is via a capability. There are no pointers other than capabilities. The hardware uses special tag bits (associated with registers and memory locations) to track which registers or memory addresses contain valid capabilities, and which do not. For example, suppose x is a regular old integer (the associated tag bit will not be set), and p is a pointer generated by reinterpreting the bits in x. The tag bit associated with p will not be set, and thus memory reads/writes using p will fail. (A reconstructed sketch of this example appears at the end of this summary.)

If a programmer cannot create a capability willy-nilly, where do they come from? At boot, the hardware creates the uber-capability (i.e., one which has full permissions to access all memory) and places this capability into a specific register. The initial OS code that runs on boot can access this capability and can use special instructions to derive new capabilities from it. For example, the OS could derive a capability which has read-only access to the first 1 MiB of memory. The owner of a capability may derive new capabilities from it, but hardware ensures a derived capability cannot have broader permissions than the capability from which it was derived.

CHERIoT is designed for embedded use cases, which have real-time requirements. MMUs/MPUs can add variability because they usually contain caches (e.g., TLBs) which have dramatically different performance characteristics in hit vs. miss cases. CHERIoT does away with this. There is no memory translation, and memory protection is enforced on a per-capability basis (as opposed to per-page tracking in an MPU). This is pretty cool: capabilities not only give fine-grained memory protection, but they also make performance more consistent by removing a cache from the system.

Each capability represents a range of memory which can be accessed. Three fields (comprising 22 bits total) in each capability are used to represent the size of memory which is accessible by the capability. The encoding is a bit like floating point, with an exponent field which allows small sizes (i.e., less than 512 bytes) to be represented exactly, while larger sizes require padding.

Astute readers will ask themselves: "how does CHERIoT prevent use-after-free bugs? A call to free must somehow invalidate all capabilities which point at the freed region, correct?" CHERIoT introduces heap revocation bits. Because the total amount of RAM is often modest in embedded use cases, CHERIoT can get away with a dedicated SRAM to hold heap revocation bits. There is 1 revocation bit per 8 bytes of RAM. Most software does not have direct access to these bits, but the heap allocator does. All revocation bits are initially set to zero. When the heap allocator frees memory, it sets the corresponding bits to one. The hardware uses these bits to prevent capabilities from accessing freed memory.

You may think that CHERIoT checks revocation bits on each memory access, but it does not. Instead, the hardware load filter checks the revocation bits when the special "load capability" instruction is executed. This instruction is used to load a capability from memory into a register. The tag bit associated with the destination register is set to one only if the revocation bit associated with the address the capability points to is zero, and the tag bit of the source address is one.

The final ingredient in this recipe is akin to garbage collection. CHERIoT supports what I like to think of as a simple garbage collection hardware accelerator called the background pipelined revoker. Software can request this hardware to scan a range of memory. Scanning occurs "in the background" (i.e., in clock cycles where the processor was not accessing memory). The background revoker reuses existing hardware to load each potential capability in the specified memory range, and then store it back. The load operation reads the associated tag bit and revocation bit, while the store operation updates the tag bit. This clears the tag bit for any capability that points to revoked memory. Once the background revoker has finished scanning all memory, the heap allocator can safely set the revocation bits associated with recently freed allocations back to zero and reuse the memory to satisfy future heap allocations.

The authors modified two existing processors to support CHERIoT. Flute is a 5-stage processor with a 64-bit memory bus. Ibex is a 2- or 3-stage processor with a 32-bit memory bus. Table 2 shows the area and power cost associated with extending the Ibex processor to support CHERIoT (roughly 2x for both metrics):

Source: https://dl.acm.org/doi/10.1145/3613424.3614266

Table 3 uses CoreMark to measure the performance overhead associated with CHERIoT:

Source: https://dl.acm.org/doi/10.1145/3613424.3614266

Dangling Pointers

I would be interested to know how easily C#/Go/Rust can be modified to use CHERI hardware bounds checking rather than software bounds checking for array accesses. This seems like an area where CHERI could win back some performance.
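A reconstructed sketch of the integer-to-pointer example referenced above (the names x and p are my own; on a conventional machine this compiles and runs, but on a CHERI target the dereference would fault because p carries no valid capability tag):

```cpp
#include <cstdint>

void reinterpret_example() {
    std::uintptr_t x = 0x80001000;        // a regular old integer; no tag bit is set
    int* p = reinterpret_cast<int*>(x);   // the bit pattern looks like a pointer...
    // ...but p has no valid capability tag, so dereferencing it traps on CHERI:
    // int v = *p;
    (void)p;
}
```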

0 views

Analog in-memory computing attention mechanism for fast and energy-efficient large language models

Analog in-memory computing attention mechanism for fast and energy-efficient large language models Leroux, N., Manea, P.-P., Sudarshan, C., et al. Nature Computational Science

The takeaway from this one is that there is a high cost to the synchronous digital logic abstraction.

Attention is one of the core building blocks of the Transformer architecture. During inference, the input activation matrix is multiplied by 3 separate weight matrices to produce three intermediate matrices: Q, K, and V. Q and K are then multiplied together, scaled by a constant, and passed through a nonlinear activation function (e.g., softmax). This result is called an attention matrix, which is multiplied by V.

Causal attention is a specialization which applies a mask to the attention matrix. This restricts a token so that it can only attend to tokens at earlier (or its own) positions. This restriction enables the implementation of the KV cache. The KV cache stores rows of the K and V matrices for later re-use. Say the network is producing output token i: there is no need to recompute the earlier rows of K and V, they can simply be retrieved from the cache. Only row i needs to be computed (and stored in the cache for future use). There is a nice animation showing a KV cache here.

What follows next is a description of CMOS hardware which can both store values from K and V and perform the computations required by the attention mechanism. I feel a bit out of my depth, but here we go!

Fig. 1(d) shows a single gain cell (prior paper here). This single cell comprises a single capacitor and a handful of transistors.

Source: https://www.nature.com/articles/s43588-025-00854-1

The voltage of capacitor C1 represents a single weight value from K or V (not a bit in a floating-point representation of a number, but an actual real-valued weight). The WE and WLW pins allow the weight value to be updated. An element of Q arrives at a gain cell, represented by a pulse on the WLR pins. The width of the pulse is proportional to the value of the element; the amplitude of the pulse is a constant. For the duration of the input pulse, an output pulse is generated with an amplitude proportional to the voltage of capacitor C1. It's been a while since I've had to compute an integral, but I think I can find the area under this curve: the total output charge is the output amplitude (proportional to the stored weight) multiplied by the pulse width (proportional to the input element), i.e., proportional to the product of the two. The output currents of multiple gain cells are all dumped onto a shared bitline, and the aggregate current produced by all of them represents a dot product.

More analog hardware magic is used to produce a HardSigmoid charge-to-pulse circuit. It integrates all current received over the shared bitline, and produces a pulse whose width is a function of that current. Fig. 1(g) shows that function:

Source: https://www.nature.com/articles/s43588-025-00854-1

One gets the sense that this is not the ideal function from a machine learning point of view, but it is efficient to implement in hardware.

The proposed chip has a set of gain cells to store K and a set of gain cells to store V. There is a fixed number of these cells, so sliding window attention is implemented. In sliding window attention, the outputs for a token can only be affected by a fixed window of the most recent tokens. Fig. 2(c) shows how sliding window attention is implemented over time in a ping-pong fashion: while the attention product for the current token is being computed with one set of cells, the capacitors holding one token's worth of new values are written in the other. The counter step converts the computed value from analog to digital.

Source: https://www.nature.com/articles/s43588-025-00854-1

One nice benefit of sliding window attention has to do with refreshing the gain cells. Leakage causes the capacitor in the gain cell to "forget" the value it is storing over time (~5ms). With sliding window attention, there is a limit to the amount of time that a particular value needs to be remembered. If that limit is low enough, then there is no need to refresh gain cells.

Fig. 5 has the headline results (note the logarithmic scale): these are 100-100,000x improvements in latency and energy:

Source: https://www.nature.com/articles/s43588-025-00854-1

If you are interested in how this design affects model accuracy, table 1 has the relevant metrics. My takeaway is that it doesn't seem to have a large effect.

Search the web for 'CMOS full adder', and you'll see circuits that seem about as complex as a single gain cell. Those digital circuits only compute a single bit of a sum (and a carry bit as well)! The abstraction of digital logic is an expensive one indeed. I suppose this is not unlike the Turing Tax associated with higher levels of abstraction.
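Here is a toy digital model of the analog bitline (my own illustration, not the paper's circuit): each gain cell contributes charge proportional to (stored weight) x (input pulse width), and summing the charge on a shared bitline yields a dot product.

```cpp
#include <cstddef>
#include <vector>

// weights[i] models a capacitor voltage; inputs[i] models an input pulse width.
double bitline_dot(const std::vector<double>& weights,
                   const std::vector<double>& inputs) {
    double charge = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        charge += weights[i] * inputs[i];   // per-cell charge contribution
    return charge;                           // read out by the charge-to-pulse stage
}
```

The real hardware gets this "for free" by dumping currents onto a wire; the digital version needs a multiplier and an adder per term.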

0 views
Dangling Pointers 1 month ago

A linear-time heuristic for improving network partitions

A linear-time heuristic for improving network partitions C. M. Fiduccia and R. M. Mattheyses DAC'82

This paper describes the Fiduccia-Mattheyses algorithm, which is similar to the Kernighan-Lin algorithm for graph partitioning. The ideas are old, but still very relevant.

The object which this algorithm operates on is a network. A network is similar to a graph; the key difference is that edges are replaced with nets. A net connects 2 or more vertices (an edge is a degenerate case which always connects 2 vertices). This makes sense in hardware applications, where vertices are circuit components, and nets are the wires that connect them.

The problem is to divide the vertices in a network into two sets, A and B. Nets which cross from A to B are called cut nets. The number of cut nets should be minimized. The relative sizes of A and B must adhere to a balance criterion which ensures that the two sets have roughly the same number of vertices.

The graph partitioning problem is similar to the min-cut problem. The key difference is that in graph partitioning, there is a requirement that the two partitions are balanced (have roughly equal sizes). Graph partitioning is NP-hard; the algorithm described in this paper is not guaranteed to produce an optimal result.

The algorithm starts by dividing the vertices into initial sets A and B, ensuring that the balance criterion is honored. It then iteratively executes passes, until no more improvement can be made. A single pass can potentially move each vertex in the graph (to the complementary partition). Each vertex will only be moved at most once per pass.

Within a pass, vertices are sorted according to their associated gain. The gain of a vertex is the overall improvement that would result from moving that vertex to the complementary partition. Fig. 1 has an example. The current partition is represented by the vertical dashed line. The partitioning would be improved by moving the vertex with +1 gain to the left partition (one fewer net would be cut). Partitioning would also be improved by moving the +2 vertex over to the right partition, but the balance criterion would likely not allow this. The partitioning would be made worse if the vertex with -1 gain were to be moved (as the associated net would become cut).

Source: https://dl.acm.org/doi/10.1145/62882.62910

A single pass produces a set of candidate partitions and then chooses the best candidate to be the input partition for the next pass. The algorithm to execute a single pass is:

Among all vertices which have not been considered in the current pass, and which can be moved without violating the balance criterion, select a vertex with maximal gain
Move that vertex and note the updated partition as a candidate partition

After a single pass has completed, examine all candidate partitions the pass produced and choose the best one. There are two subtle points about this process:

The gain of each vertex is recomputed after each move.
Step #1 will select vertices with negative gains. The idea is that within a pass, a small bad move can unlock future (but within the same pass) good moves. The best candidate partition is selected at the end of each pass, so partitions that directly result from bad moves will be ignored.

A net is called critical if it is connected to zero or one vertices in A or B. If a net is not critical, then it can be ignored when considering moving a single vertex, because an individual move cannot cause non-critical nets to change from cut to uncut (or the other way around). The gain of an individual vertex can be computed by summing the contributions from all critical nets that vertex is connected to; each critical net contributes -1, 0, or +1 to the gain of a vertex.

When a vertex moves from one partition to the other, the algorithm incrementally updates data structures containing the following information (a sketch of this update appears at the end of this summary):

The gain of each vertex
The number of vertices in A that each net is connected to
The number of vertices in B that each net is connected to

Note that the conditionals which examine the number of vertices that a net is connected to in A or B only consider critical nets (nets connected to 0 or 1 vertices in the relevant partition).

The inner loop of the pass algorithm chooses a vertex with maximum gain. The update rules above show how to maintain the gain associated with each vertex as the pass progresses, but do not deal with efficiently selecting a high-gain vertex. The trick here is that there is a bounded range of values that a gain can have. The paper calls this bound pmax, which is simply the maximum degree of any vertex; every gain falls in the range [-pmax, +pmax]. With this knowledge, vertices can be stored in an array of linked lists. Each entry in the array corresponds to a particular gain value. When the gain of a vertex changes, it can be removed from the list that it is currently contained in and inserted into the list corresponding to the new gain. Fig. 2 illustrates the data structure:

Source: https://dl.acm.org/doi/10.1145/62882.62910

Results

The following table shows performance results for four randomly sampled chips. The average chip has 267 vertices, 245 nets, and the average degree of a vertex is 2650. I would have thought that more passes were needed to converge on a solution.

Source: https://dl.acm.org/doi/10.1145/62882.62910
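Here is a sketch of that incremental gain update, assuming the standard FM bookkeeping (the data structures and names are my own, and handling of locked vertices is simplified relative to the paper): moving vertex v from its "from" side F to the "to" side T only touches the gains of free vertices on v's critical nets.

```cpp
#include <vector>

struct Net  { std::vector<int> cells; };
struct Cell { std::vector<int> nets; int side; int gain; bool locked; };

// count[side][net] = number of vertices of `net` currently on `side` (0 or 1).
void move_cell(int v, std::vector<Cell>& cells, std::vector<Net>& nets,
               std::vector<std::vector<int>>& count) {
    int F = cells[v].side, T = 1 - F;
    cells[v].locked = true;   // a vertex moves at most once per pass
    for (int n : cells[v].nets) {
        // Before the move: nets critical on the "to" side.
        if (count[T][n] == 0) {
            for (int c : nets[n].cells) if (!cells[c].locked) cells[c].gain++;
        } else if (count[T][n] == 1) {
            for (int c : nets[n].cells)
                if (!cells[c].locked && cells[c].side == T) cells[c].gain--;
        }
        // Apply the move to this net's side counts.
        count[F][n]--; count[T][n]++;
        // After the move: nets critical on the "from" side.
        if (count[F][n] == 0) {
            for (int c : nets[n].cells) if (!cells[c].locked) cells[c].gain--;
        } else if (count[F][n] == 1) {
            for (int c : nets[n].cells)
                if (!cells[c].locked && cells[c].side == F) cells[c].gain++;
        }
    }
    cells[v].side = T;
}
```

Each gain change would also move the vertex between the linked lists in the bucket array described below, which is what keeps a whole pass linear in the number of pins.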

1 views
Dangling Pointers 1 month ago

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS Youmin Chen, Jiwu Shu, Yanyan Shen, Linpeng Huang, and Hong Mei SOSP'25

I love this paper, because it grinds one of my axes: efficient pipeline parallelism on general purpose CPUs. In many hardware designs, pipeline parallelism is the dominant form of parallelism, whereas data parallelism takes the cake on CPUs and GPUs. It has always seemed to me that there are applications where pipeline parallelism should be great on multi-core CPUs, and here is an example.

Fig. 1 illustrates the design space for key-value stores:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

One axis is preemptive vs. non-preemptive (cooperative) multi-threading. Preemptive multithreading involves context switches, which are cheap relative to disk reads but expensive relative to DRAM reads. The other axis is how to assign work to threads. Thread per request (TPR) creates a new thread for each request. This approach has been subsumed by thread per queue (TPQ), which uses a static number of threads, each of which dequeues requests from a dedicated queue and executes all of the work for a single request to completion. Finally, there is thread per stage (TPS), which divides the steps necessary to complete a request into multiple pipeline stages, and then divides the pipeline stages among a set of threads. The work discussed here uses a non-preemptive, thread-per-stage architecture.

Pipelining Advantages

A pipelined implementation seems more complicated than an imperative run-to-completion design, so why do it? The key reason is to take advantage of the CPU cache. Here are two examples:

As we've seen in other networking papers, a well-designed system can leverage DDIO to allow the NIC to write network packets into the LLC where they are then consumed by software.
Key-value stores frequently have hot tuples, and there are advantages to caching these (example here).

It is hard to effectively cache data in a TPR/TPQ model, because each request runs the entire key-value store request code path. For example, a CPU core may have enough cache capacity to hold network buffers or hot tuples, but not both.

The key disadvantage of a TPS architecture is load balancing. One stage could become the bottleneck, leaving CPU cores idle. The authors propose dynamic reconfiguration of the pipeline based on workload changes. Another challenge with pipelining is implementing efficient communication between cores, because data associated with each request flows down the pipeline with the request itself.

Fig. 3 shows the pipeline proposed in this paper:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

The NIC writes request packets into the network buffer (stored in the LLC). The cache-resident layer reads data from this buffer and handles requests involving commonly used keys by accessing the hot index and hot data caches (also in the LLC). The memory-resident layer handles cold keys and values, which are stored in DRAM. One set of threads (pinned to CPU cores) implements the cache-resident layer, and a different set of threads (pinned to other CPU cores) implements the memory-resident layer. An auto-tuner continually monitors the system and adjusts the number of threads assigned to each layer. Section 3.5 describes the synchronization required to implement this adjustment.

The NIC writes request packets into a single queue. The cache-resident threads cooperatively read requests from this queue: if there are N threads in the pool, then thread i reads the requests whose position in the queue is congruent to i modulo N (a sketch of this appears at the end of this summary). Next, threads check to see if the key associated with a request is hot (and thus cached in the LLC). Time is divided into epochs. During a given epoch, the set of cached items does not change. This enables fast lookups without costly synchronization between threads. A background thread gathers statistics to determine the set of items to be cached in the next epoch and has the ability to atomically switch to the next epoch when the time comes. The number of hot keys is kept small enough that it is highly likely that hot keys will be stored in the LLC. Requests that miss in the cache-resident layer are passed on to the memory-resident layer for further processing (via the CR-MR queue).

Typically, the LLC is treated like a global resource (shared by all cores). But this particular use case requires that most of the LLC be dedicated to the cache-resident layer. This is accomplished with the help of the PQOS utility from Intel, which uses "Intel(R) Resource Director Technology" to control which ways of the LLC are assigned to each layer.

The memory-resident layer operates on batches of requests. Because the requests are not hot, it is highly likely that each request will require DRAM accesses for index lookups (keys) and data lookups (values). Software prefetching is used to hide DRAM latency during index lookups. When servicing get operations, data values are copied directly into the outgoing network buffer.

The CR-MR queue is used to communicate between the two layers. Each (CR thread, MR thread) pair has a dedicated lock-free queue. Enqueue operations use a round-robin policy, with each CR thread spreading its messages across the MR threads in turn. Dequeue operations must potentially scan queues corresponding to all possible senders. Multiple requests can be stored per message, to amortize control overhead.

Fig. 7 has throughput results for synthetic workloads (A, B, and C have different ratios of put/get operations); uTPS-T is this work:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

Dangling Pointers

The pipelining here is coarse-grained, and the design is only optimized for the LLC. I wonder if a more fine-grained pipeline would allow hot data to be stored in L2 caches. For example, the set of hot keys could be sharded among N cores, with each core holding a different shard in its L2 cache. It seems redundant that this design requires software to determine the set of hot keys, when the hardware cache circuitry already has support to do something like this.
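A minimal sketch of the cooperative queue sharing described above (my own code, assuming the modulo-based split; the structure and function names are made up): each cache-resident thread claims only the slots congruent to its own index, so no per-dequeue synchronization is needed.

```cpp
#include <cstdint>
#include <vector>

struct Request { std::uint64_t seq; /* key, opcode, payload ... */ };

void cr_thread_loop(int t, int pool_size, const std::vector<Request>& ring) {
    for (const Request& r : ring) {
        if (r.seq % static_cast<std::uint64_t>(pool_size) != static_cast<std::uint64_t>(t))
            continue;                       // another thread owns this slot
        // ... probe the hot index / hot data caches; on a miss, forward the
        //     request to a memory-resident thread via the CR-MR queue ...
    }
}
```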

0 views
Dangling Pointers 1 month ago

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann SIGMOD'14

The giant upon whose shoulders this paper rests is Volcano. Parallelism in Volcano is achieved through a proper separation of concerns. Volcano contains many database operators, most of which are blissfully unaware of parallelism. A handful of operators in a query plan exist only to enable parallelism (for example, an operator could implement pipeline parallelism, or partition data between threads). Generally speaking, an elegant separation of concerns is good for performance. However, the thesis of morsel-driven parallelism is that this is not true. Deeply integrating the notion of parallel execution into each operator is the way to go for OLAP.

The system described in this paper is named HyPer. Fig. 2 below illustrates how HyPer decomposes a single query into three pipelines:

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

The leftmost pipeline scans the input relation T and applies a filter to it. The middle pipeline scans and filters the input relation S. The rightmost pipeline scans and filters the third relation, R, and then performs both joins, probing the hash tables built by the first two pipelines. A system like Volcano would be tempted to scan T and S in parallel. Not so with HyPer: the pipelines which make up a query plan are executed serially.

Each relation (both inputs and temporary data) is divided into morsels. A morsel is a group of ~100,000 tuples. Each morsel resides on a single NUMA node (indicated by colors in the figures). Fig. 3 illustrates how HyPer uses morsel-level parallelism to implement the left pipeline (scan and filter T):

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

The pipeline that processes T operates in two phases. In the first phase, a pool of threads (each pinned to a core) collectively process all morsels in T. When a thread comes up for air, it grabs another morsel of input. Threads are assigned to a NUMA node, and threads prefer to process morsels assigned to the same NUMA node. If no morsels of matching color are available, then a thread will reluctantly process a morsel from another NUMA node. During the first phase, each thread writes filtered tuples into a thread-local storage area. Once all input tuples have been processed, a hash table is created (conveniently, the hash table can be sized well because the number of tuples that must be stored is known). This hash table is global (i.e., not NUMA aware). In the second phase, tuples are inserted into the hash table.

The HyPer hash table is designed to allow lock-free insertions, along with a fast path for the common case where a probe operation yields no hits. The hash table uses chaining, with very special pointers used to point to the next element in the chain. The lower 48 bits of each pointer in the hash table contain the memory address that is being pointed at; the upper 16 bits are a tag. Think of the tag as a 16-bit Bloom filter describing the set of elements in the sub-list that the pointer points to. When the hash table is probed, a hash of the join key from the probe tuple is used both to determine which chain to search in, and to stop the search early if no possible unexamined list element could contain the join key. Because both the pointer and the tag are packed into 64 bits, a compare-and-swap instruction can be used to insert elements into the hash table without using an OS lock. If the hash table is large enough, then most executions of the compare-and-swap instruction should succeed. Fig. 7 illustrates the hash table data structure and the insertion algorithm (a sketch of the tagged-pointer insertion appears at the end of this summary):

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

It is a bit odd that the design in this paper goes to such great lengths to avoid cross-socket (NUMA) reads from main memory, and yet the hash table is not NUMA aware. I think that the 16-bit tags are key here. If the set of head pointers for all buckets in the hash table is small enough to fit into an L2 cache, then this data can be efficiently replicated into all L2 caches. As long as the hash table hit rate is low enough, the number of cross-socket memory accesses during probe operations will be low.

Fig. 11 shows throughput for all 22 TPC-H queries, for 4 different configurations:

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

It is interesting how NUMA-awareness matters a lot for some queries, but not all. Fig. 10 shows the authors' NUMA model:

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

What is interesting here is that a similar setup exists within each socket. Say a socket contains 32 cores and 4 memory controllers. Those cores and memory controllers will be laid out in a grid, with a network connecting them. I wonder if there is performance to be gained by paying attention to the intra-socket layout (e.g., cores on the left side of the chip should only access memory controllers on the left side of the chip).
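Here is a sketch of the tagged-pointer insertion (simplified, and my own code rather than HyPer's; in particular, how the 16-bit tag is derived from the hash is my assumption): the upper 16 bits of each 64-bit bucket head act as a tiny Bloom filter over the keys in that chain, and insertion uses compare-and-swap instead of a lock.

```cpp
#include <atomic>
#include <cstdint>

struct Entry { std::uint64_t key; Entry* next; /* payload ... */ };

constexpr std::uint64_t kPtrMask = (1ULL << 48) - 1;

// Map a hash to one of the 16 tag bits (bits 48..63 of the bucket head word).
std::uint64_t tag_bit(std::uint64_t hash) { return 1ULL << (48 + (hash >> 60)); }

void insert(std::atomic<std::uint64_t>& head, Entry* e, std::uint64_t hash) {
    std::uint64_t old_word = head.load(std::memory_order_relaxed);
    std::uint64_t new_word;
    do {
        e->next  = reinterpret_cast<Entry*>(old_word & kPtrMask);       // chain to old head
        new_word = reinterpret_cast<std::uint64_t>(e)                   // new head pointer
                 | (old_word & ~kPtrMask)                               // keep existing tags
                 | tag_bit(hash);                                       // add this key's tag
    } while (!head.compare_exchange_weak(old_word, new_word));
}

// Probe fast path: if the chain's tag bits cannot contain this key's tag,
// skip the pointer chase entirely.
bool maybe_contains(std::uint64_t head_word, std::uint64_t hash) {
    return (head_word & tag_bit(hash)) != 0;
}
```

The fast path is what makes the "probe yields no hits" case so cheap: a single 64-bit load answers most negative probes.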

0 views
Dangling Pointers 1 month ago

The Decomposition of Switching Functions

The Decomposition of Switching Functions Robert Ashenhurst International Symposium on the Theory of Switching Functions'59

Say you are given a logic formula f over four inputs x1, x2, x3, x4 and you want to rewrite it in a more structured form. One way to impose structure is to express f as the composition of simpler functions. You can perform such a rewrite with the Ashenhurst-Curtis decomposition, and the result has the form F(ϕ(xi, xj), xk, xl): ϕ takes two of the inputs, and F takes the remaining two inputs plus the output of ϕ.

This transformation is useful in logic synthesis. The task of synthesizing a circuit to implement the 4-input function f can be broken down into synthesizing circuits to implement:

The 2-input function ϕ
The 3-input function F

The goal of the Ashenhurst-Curtis decomposition is to partition the inputs of a logic formula (x1 through x4 in the running example) into two disjoint sets. The partitioning depends on the logic formula being decomposed. The idea is to decompose the formula into two simpler formulas:

Formula 1 (ϕ) accepts one set of the inputs
Formula 2 (F) accepts the other set of inputs, along with the output of #1

A way to do this is to search through all possible partition matrices. Fig. 4 shows the partition matrix for the decomposition in the running example:

Source: https://people.eecs.berkeley.edu/~alanmi/publications/other/ashenhurst1959.pdf

Think of a partition matrix as a two-dimensional truth table: the rows are indexed by the values of one set of inputs, the columns by the values of the other set, and each entry gives the value of f for that combination. In the running example, f is non-zero for exactly the 4 input combinations which correspond to the entries in the partition matrix with the value 1.

For a given logic formula, there exist many partition matrices. Only some are suitable for decomposing the logic formula. The key property to look for is the column multiplicity of the partition matrix. The column multiplicity is the number of unique column vectors in the partition matrix. The partition matrix above has a column multiplicity of 2 (i.e., there are only two distinct column vectors in the matrix). If you can find a partition matrix with a column multiplicity less than or equal to two, then you can decompose the logic formula. Note that a partition matrix is defined not only by the values in the matrix, but also by the partitioning of the input variables.

If you have located a partition matrix with a column multiplicity no greater than two, then the row multiplicity must be no greater than 4. In particular, there are 4 kinds of row vectors that you will see in the partition matrix:

Vectors of all 0s
Vectors of all 1s
A particular vector
The inverse of that vector

In the running example, the non-trivial row vectors are [1,0,0,1] and [0,1,1,0] (note how the second is the inverse of the first).

To decompose f, first generate the function ϕ. The inputs to ϕ are the variables written at the top of the partition matrix. The truth table for ϕ is simply the first non-trivial row vector in the partition matrix ([1,0,0,1] in the example).

Next, generate the function F. The inputs to F are the variables written at the left of the partition matrix, and the output of ϕ. In this example, F has 3 inputs, so the truth table defining F has 8 entries. A simple algorithm is used to generate the entries in that truth table. The algorithm iterates over the row vectors in the partition matrix, and each row vector defines 2 elements in the truth table of F (the entries where ϕ = 0 and ϕ = 1 for that row's assignment of the left-hand variables):

If a row vector is all zeros, then both truth table elements for that row are 0
If a row vector is all ones, then both truth table elements for that row are 1
If a row vector equals ϕ's truth table, then the element for ϕ = 0 is 0, and the element for ϕ = 1 is 1
If a row vector is the inverse of ϕ's truth table, then the element for ϕ = 0 is 1, and the element for ϕ = 1 is 0

These rules are described in Table 1:

Source: https://people.eecs.berkeley.edu/~alanmi/publications/other/ashenhurst1959.pdf

The final decomposed function is F applied to the left-hand variables and the output of ϕ.

Until now, we've assumed that someone has handed us a partition matrix with column multiplicity no greater than two. But how does one find such a matrix? The paper proposes iterating through all partition matrices in the partition chart of the input formula. A partition chart contains many partition matrices, each one corresponding to a different partitioning of the input variables. A circled element in the partition chart corresponds to a 1 in the truth table of the input logic formula. Fig. 5 shows partition charts for the running example. The one in the lower right corner is the partition matrix in the running example. Astute readers will notice that it is actually the transpose of the partition matrix in Fig. 4 (the variables on the top and the left are swapped), but that is no big deal: a matrix can be transposed as long as the variable names are transposed along with the matrix entries.

Once you have generated a partition chart, it is straightforward to search for partition matrices which have column (or row) multiplicity no greater than two. These are the ones that can be used to generate the decomposed function.
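Here is a small illustration (my own) of the column-multiplicity test for one candidate split of a 4-input function: put {x3, x4} on top and {x1, x2} on the left (this particular split is just an example, not necessarily the one used in the paper), build the partition matrix, and count distinct columns. A count of at most two means this variable partition admits an Ashenhurst decomposition.

```cpp
#include <array>
#include <set>

// f is the truth table of a 4-input function, indexed by the 4-bit value
// x1 x2 x3 x4 (x1 is the most significant bit).
int column_multiplicity(const std::array<int, 16>& f) {
    std::set<std::array<int, 4>> columns;
    for (int top = 0; top < 4; ++top) {          // column index: values of (x3, x4)
        std::array<int, 4> col{};
        for (int left = 0; left < 4; ++left)     // row index: values of (x1, x2)
            col[left] = f[(left << 2) | top];
        columns.insert(col);                     // set keeps only distinct columns
    }
    return static_cast<int>(columns.size());
}
```

Running this for every way of splitting the inputs is exactly the search over the partition chart described above, just expressed in code.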


Oasis: Pooling PCIe Devices Over CXL to Boost Utilization

Oasis: Pooling PCIe Devices Over CXL to Boost Utilization Yuhong Zhong, Daniel S. Berger, Pantea Zardoshti, Enrique Saurez, Jacob Nelson, Dan R. K. Ports, Antonis Psistakis, Joshua Fried, and Asaf Cidon SOSP'25 If you are like me, you’ve dabbled with software prefetching but never had much luck with it. Even if you care nothing about sharing of PCIe devices across servers in a rack, this paper is still interesting because it shows a use case where software prefetching really matters. I suppose it is common knowledge that CXL enables all of the servers in a rack to share a pool of DRAM . The unique insight from this paper is that once you’ve taken this step, sharing of PCIe devices (e.g., NICs, SSDs) can be implemented “at near-zero extra cost.” Fig. 2 shows how much SSD capacity and NIC bandwidth are stranded in Azure. A stranded resource is one that is underutilized because some other resource (e.g., CPU or memory) is the bottleneck. Customers may not be able to precisely predict the ratios of CPU:memory:SSD:NIC resources they will need. Even if they could, a standard VM size may not be available that exactly matches the desired ratio. Source: https://dl.acm.org/doi/10.1145/3731569.3764812 Additionally, servers may have redundant components which are used in case of a hardware failure. The paper cites servers containing redundant NICs to avoid the server disappearing off the network if a NIC fails. Pooling of PCIe devices could help both of these problems. The VM placement problem is easier if resources within a rack can be dynamically allocated, rather than at the server level. Similarly, a rack could contain redundant devices which are available to any server in the rack which experiences a hardware failure. Fig. 4 shows the Oasis architecture: Source: https://dl.acm.org/doi/10.1145/3731569.3764812 In this example, a VM or container running on host A uses the NIC located on host B. An Oasis frontend driver on host A and a backend driver on host B make the magic happen. The communication medium is a shared pool of memory that both hosts have access to over CXL. The shared memory pool stores both the raw network packets, and message queues which contain pointers to the network packets. A tricky bit in this design is the assumption that the CPU caches in the hosts do not have a coherent view of the shared memory pool (i.e., there is no hardware cache coherence support). This quote sums up the reasoning behind this assumption: Although the CXL 3.0 specification introduces an optional cross-host hardware coherence flow [11], the implementation requires expensive hardware changes on both the processor and the device [74, 105, 143, 145]. To make Oasis compatible with hardware available today, we do not assume cache-coherent CXL devices. Here is the secret sauce that Oasis uses to efficiently send a message from the frontend driver to the backend driver. Note that this scheme is used for the message channels (i.e., descriptors, packet metadata). The shared memory pool is mapped by both drivers as cacheable. The frontend driver writes the message into shared memory, increments a tail pointer (also stored in shared CXL memory) and then forces the containing cache lines to be written to the shared memory pool by executing the instruction. The backend driver polls the tail pointer. If polling reveals that there are no new messages, the driver invalidates the cache line containing the tail pointer (with followed by ). 
This handles the case where there actually are new messages available, but the backend driver is reading a cached (stale) copy of the tail pointer. The backend driver then speculatively prefetches 16 cache lines of message data. When the backend driver detects that the tail pointer has been incremented, it processes all new messages. Hopefully there is more than one message, and the software prefetch instructions will overlap computation with transfer from the shared memory pool. After processing the message(s), the backend driver invalidates the memory where those messages are stored. This is critical, because it allows subsequent prefetch instructions to work. A prefetch instruction does nothing if the target cache line is already cached (even though it may be stale). The speculative 16-cache-line prefetch also suffers from the same issue. Say 4 of the 16 prefetched lines had valid messages, and 12 did not. Those 12 cache lines are now in the backend CPU cache, and future prefetch instructions targeting them will do nothing. To solve this problem, the backend driver also invalidates speculatively prefetched cache lines that did not contain any messages.

Fig. 7 illustrates the end-to-end packet transmit flow: Source: https://dl.acm.org/doi/10.1145/3731569.3764812

Here are the steps:
The network stack running in the VM/container on host A writes packet data into the packet buffer in CXL memory. Note that the network stack doesn’t “know” that it is writing network packets to shared memory.
The frontend driver writes a message in the queue stored in shared CXL memory.
The frontend driver flushes the cache lines associated with the network packet data, the message, and the message queue tail pointer.
The backend driver polls the tail pointer for new messages in the queue (using the prefetching tricks described previously).
The backend driver uses DPDK to cause the NIC on host B to transmit the packet. Note that the CPU cores on host B do not need to actually read network packet data; the NIC uses DMA to read this data directly from the shared memory pool.

The steps to receive a packet are similar:
The NIC on host B writes the packet data (via DMA) into the shared memory pool.
The backend driver uses DPDK to learn that a new packet has arrived.
The backend driver writes a message into the message queue in shared memory.
The frontend driver polls the message queue (using the prefetch tricks).
The network stack running in the VM/container on host A reads the packet data from shared memory.

One trick used here is flow tagging. This is a DPDK feature that enables the NIC to determine which host the message is destined for, without the backend driver having to inspect network packet headers. Fig. 8 shows measurements of the overhead added by Oasis. The solid lines are the baseline; the dotted lines are Oasis. Each color represents a different latency bucket. The baseline uses a NIC which is local to the host running the benchmark. The overhead is measurable, but not excessive. Source: https://dl.acm.org/doi/pdf/10.1145/3731569.3764812

Dangling Pointers
The paper doesn’t touch on the complexities related to network virtualization in a pooled device scheme. It seems to me that solving these problems wouldn’t affect performance but would require significant engineering.
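The flush/prefetch dance is easier to see in code. Below is a rough C++ sketch of a single message ring between the two drivers. The post does not name the exact instructions Oasis uses, so clwb for write-back, clflushopt for invalidation, and prefetcht0 for the speculative prefetch are my assumptions, and ordinary heap memory stands in for the CXL-mapped pool (so this runs on one host, where coherence makes it trivially correct anyway).

```cpp
// Build with: g++ -O2 -mclwb -mclflushopt oasis_ring.cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <cstdio>

constexpr int kLine = 64;
constexpr int kSlots = 64;                     // ring capacity (messages)

struct alignas(kLine) Slot { char payload[kLine]; };
struct alignas(kLine) Ring {
    Slot slot[kSlots];
    alignas(kLine) volatile uint64_t tail;     // written by frontend, polled by backend
};

// Frontend (host A): write the message, write its cache line back to the pool,
// then publish by bumping the tail and writing that line back too.
void frontend_send(Ring* r, uint64_t seq, const char* msg) {
    Slot* s = &r->slot[seq % kSlots];
    std::strncpy(s->payload, msg, kLine - 1);
    _mm_clwb(s);                               // push payload line to the pool
    _mm_sfence();                              // order payload before tail update
    r->tail = seq + 1;
    _mm_clwb((const void*)&r->tail);
    _mm_sfence();
}

// Backend (host B): poll the tail; if nothing new, drop our cached copy of the
// tail line and speculatively prefetch the next batch of message lines so the
// data is (hopefully) already local when the tail finally moves.
uint64_t backend_poll(Ring* r, uint64_t head) {
    uint64_t tail = r->tail;
    if (tail == head) {
        _mm_clflushopt((const void*)&r->tail); // invalidate possibly stale cached tail
        _mm_mfence();
        for (int i = 0; i < 16; ++i)           // speculative 16-line prefetch
            _mm_prefetch((const char*)&r->slot[(head + i) % kSlots], _MM_HINT_T0);
        return head;                           // nothing new yet
    }
    for (uint64_t seq = head; seq < tail; ++seq) {
        Slot* s = &r->slot[seq % kSlots];
        std::printf("backend got: %s\n", s->payload);
        _mm_clflushopt(s);                     // drop the line so a later prefetch refetches it
    }
    // Prefetched lines beyond 'tail' carried no message; drop them too,
    // otherwise future prefetches of those lines would be no-ops on stale data.
    for (uint64_t seq = tail; seq < head + 16; ++seq)
        _mm_clflushopt(&r->slot[seq % kSlots]);
    _mm_mfence();
    return tail;
}

int main() {
    alignas(kLine) static Ring ring = {};
    uint64_t head = 0;
    frontend_send(&ring, 0, "hello over CXL");
    frontend_send(&ring, 1, "second message");
    head = backend_poll(&ring, head);          // processes both messages
    head = backend_poll(&ring, head);          // empty poll: invalidate + prefetch path
    return 0;
}
```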


Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu SOSP'25 Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn’t that what an OS is for? My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to try to harden a large OS such that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and instead introduce the hypervisor. This kind of decision comes up in other circumstances too. There are often two ways to go about implementing something. The first way involves widespread changes to legacy code, and the other way involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions. The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3 ) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS. The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling. Some cores are reserved for data-plane tasks, while other cores are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Dynamic Scheduling Headaches Dynamic scheduling seems to be a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren’t many packets flowing. Section 3.2 says this is a no-go. It would be great if there was more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured such that it will route packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock, and then synchronously communicate with the accelerator. The authors take this design as immutable. 
It seems plausible that the communication with the accelerator could be done in an asynchronous manner, but that would likely have ramifications to the entire control-plane software stack. This quote is telling: Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification. Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU), the other two are physical CPUs (pCPU). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality, a Linux kernel module (5,800 lines of code) is allowing the vCPU to run at times of low data-plane activity. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 One neat trick the paper describes is the hardware workload probe . This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parsing of packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This enables this system to hide the latency of the context switch from vCPU to pCPU. Think of it like a group of students in a classroom without any teachers (e.g., network packets). The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees the teacher (a network packet) returning, they shout “act responsible”, and everyone returns to their schoolwork (running data-plane code). Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Results Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Dangling Pointers To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase will be tractable. If that happens, then Jiu-Jitsu moves like this one will be less important. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts and support my work.
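As a toy illustration of the lookout idea (not the paper's mechanism, which preempts the vCPU transparently from the 5,800-line kernel module), here is a userspace sketch where an atomic flag stands in for the hardware workload probe and gates slices of control-plane work:

```cpp
// Two ordinary threads stand in for the probe and the gated vCPU; the flag is
// a stand-in for the hardware signal that fires a few microseconds before a
// packet reaches an ARM core.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> packet_incoming{false};   // set by the (simulated) hardware probe
std::atomic<bool> stop{false};

void data_plane_probe_sim() {               // pretend packets arrive in bursts
    for (int burst = 0; burst < 5; ++burst) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        packet_incoming = true;             // "the teacher is coming"
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
        packet_incoming = false;            // burst handled, coast is clear
    }
    stop = true;
}

void gated_control_plane() {
    long slices_run = 0, slices_skipped = 0;
    while (!stop) {
        if (packet_incoming) {              // leave the core to data-plane work
            ++slices_skipped;
            std::this_thread::yield();
            continue;
        }
        // Run one small slice of control-plane work
        // (e.g., part of configuring an accelerator for a new VM).
        ++slices_run;
    }
    std::printf("control-plane slices run=%ld skipped=%ld\n", slices_run, slices_skipped);
}

int main() {
    std::thread probe(data_plane_probe_sim);
    std::thread cp(gated_control_plane);
    probe.join();
    cp.join();
    return 0;
}
```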


Backdoors To Typical Case Complexity

Backdoors To Typical Case Complexity Ryan Williams, Carla P. Gomes, and Bart Selman IJCAI'03

SAT solvers frequently perform much better on real-world problems than would be predicted by a worst-case analysis. This paper provides a clue as to why some problems are tractable and presents algorithms to solve such problems. The analysis here applies equally to SAT problems and CSP problems. This quote sums up the motivating idea behind the paper (all variables are not created equal): Another powerful intuition in the design of search methods is that one wants to select variables that simplify the problem instance as much as possible when these variables are assigned values.

A backdoor is a subset of the variables in a SAT/CSP problem. There exists at least one assignment of backdoor variables to values such that the values of the remaining variables can easily be found by a sub-solver. A sub-solver is a SAT/CSP solver which runs in polynomial time and can solve a subset of SAT/CSP problems. The input to the sub-solver is a SAT/CSP problem, along with a partial assignment. A partial assignment assigns concrete values to some variables (e.g., the backdoor variables). When we say that a sub-solver “solves” a problem, we mean that it returns one of:
A mapping of variable names to values which satisfies all constraints
“Unsatisfiable” - the constraints cannot all be satisfied
“Too hard” - the problem is not one of the kinds that the sub-solver can handle

An example sub-solver for SAT problems is one that can only solve 2-SAT problems. Table 2 contains backdoor sizes from real-world SAT problems: Source: https://www.ijcai.org/Proceedings/03/Papers/168.pdf

Think about that for a second. The first row means that in a problem with 6783 variables, the crux of the issue can be solved by only considering 12 variables. So rather than searching for a needle in a haystack of size 2^6783, one only needs to find the correct values for the 12 variables in the backdoor (2^12 options to explore). Once that assignment has been found, the sub-solver can find the values of the other 6771 variables. The only trouble is how one finds a backdoor.

The deterministic strategy optimistically assumes there exists a small backdoor. First it assumes the backdoor contains 1 variable. If that fails, then the strategy assumes the backdoor contains 2 variables, and so on. At each step, the deterministic strategy enumerates all possible backdoors of the current size. For each candidate backdoor, it uses a backtracking search (which will invoke a sub-solver) to try to solve the problem. Backtracking search is more efficient than exhaustively searching all 2^N variable assignments because this search can invoke the sub-solver with a partial assignment covering only some of the candidate backdoor variables, in the hopes that the sub-solver can fail quickly. The randomized strategy is similar to the deterministic strategy, but it chooses random subsets of variables to perform a backtracking search on. Again, the process begins with a bias that the size of the backdoor is small and gradually increases the expected size of the random subset.

The key question running through my mind is: “are there ways to transform SAT problems such that the output of the transformation has a smaller backdoor than the input”? Thanks for reading Dangling Pointers! Subscribe for free to receive new posts and support my work.
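Here is a small C++ sketch of the deterministic strategy under simplifying assumptions: the sub-solver is plain unit propagation, and each candidate backdoor is checked by exhaustively trying its 2^k assignments rather than the paper's backtracking search. The CNF instance in main is made up for the demo.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>
using namespace std;

using Clause = vector<int>;   // literals: +v means v, -v means NOT v (v >= 1)
using CNF = vector<Clause>;

// Polynomial-time sub-solver: plain unit propagation. Returns 1 if every clause
// ends up satisfied, 0 on a conflict, and -1 ("too hard") if undecided clauses remain.
int unit_prop(const CNF& cnf, vector<int> val /* by value: we extend it */) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Clause& c : cnf) {
            int unassigned = 0, last = 0;
            bool sat = false;
            for (int lit : c) {
                int v = abs(lit), want = lit > 0 ? 1 : -1;
                if (val[v] == want) { sat = true; break; }
                if (val[v] == 0) { ++unassigned; last = lit; }
            }
            if (sat) continue;
            if (unassigned == 0) return 0;             // conflict
            if (unassigned == 1) {                     // unit clause: force it
                val[abs(last)] = last > 0 ? 1 : -1;
                changed = true;
            }
        }
    }
    for (const Clause& c : cnf) {                      // anything left undecided?
        bool sat = false, open = false;
        for (int lit : c) {
            int v = abs(lit), want = lit > 0 ? 1 : -1;
            if (val[v] == want) sat = true;
            if (val[v] == 0) open = true;
        }
        if (!sat && open) return -1;                   // too hard for this sub-solver
        if (!sat && !open) return 0;
    }
    return 1;
}

// Try every candidate backdoor of size k (all subsets), and for each candidate
// every one of its 2^k assignments, handing the rest to the sub-solver.
bool try_subsets(const CNF& cnf, int nvars, vector<int>& subset, int next, int k) {
    if ((int)subset.size() == k) {
        for (int bits = 0; bits < (1 << k); ++bits) {
            vector<int> val(nvars + 1, 0);
            for (int i = 0; i < k; ++i)
                val[subset[i]] = ((bits >> i) & 1) ? 1 : -1;
            if (unit_prop(cnf, val) == 1) {
                printf("backdoor of size %d: {", k);
                for (int v : subset) printf(" x%d", v);
                printf(" }\n");
                return true;
            }
        }
        return false;
    }
    for (int v = next; v <= nvars; ++v) {
        subset.push_back(v);
        if (try_subsets(cnf, nvars, subset, v + 1, k)) return true;
        subset.pop_back();
    }
    return false;
}

int main() {
    // Small made-up instance: once x1 is set, unit propagation finishes the job.
    CNF cnf = {{1, 2}, {-1, 3}, {-3, -2, 4}, {-4, 5}, {1, -5}};
    vector<int> subset;
    for (int k = 1; k <= 3; ++k)                       // deterministic strategy: grow k
        if (try_subsets(cnf, /*nvars=*/5, subset, 1, k)) break;
    return 0;
}
```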


How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service

How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service Jingkai He, Yunpeng Dong, Dong Du, Mo Zou, Zhitai Yu, Yuxin Ren, Ning Jia, Yubin Xia, and Haibo Chen SOSP'25 Systems spend a lot of time copying bytes from here to there. Fig. 2(a) shows data for some benchmarks. Copy operations are bucketized by size and if the copy occurs on behalf of user/kernel code. Source: https://dl.acm.org/doi/10.1145/3731569.3764800 One interesting piece of data missing from that figure is how much CPU time is spent waiting on cache misses. A more subtle point is how long applications wait after a copy operation completes and when they actually read from the destination buffer. The authors call this the copy-use window . Fig. 3 has some data on this point. Bars above the gray triangle represent opportunities for latency hiding. Source: https://dl.acm.org/doi/10.1145/3731569.3764800 API This paper proposes a new OS service ( Copier ), which performs copy operations on behalf of user and kernel clients. Clients interact with Copier through a set of queues. Copy requests are submitted via descriptors in the Copy Queue . Descriptors contain pointers to bitmaps, which clients can use to monitor progress. Each bit in the bitmap corresponds to the completion of a single segment of a copy. For example, a client could submit a 1MiB copy but read data from 4KiB destination segments as soon as they are written. The Sync Queue is used to raise the priority of an already-submitted task. This is a fallback for when the application has left the copy-use window and has nothing useful to do other than wait for the copy to make progress. The Handler Queue enables arbitrary functions to be called when a copy operation completes. Applications interact with these queues indirectly via a set of functions (e.g., to initiate a copy, to wait for an operation to complete). Because Copier is a centralized service, it can perform global optimizations. An example given in the paper involves two copies. First, a large amount of data is copied from a kernel buffer to an intermediate buffer ( → ). Next, a smaller amount of data is copied from the intermediate buffer into the database ( → ). Copier can detect this situation and only read a subset of the bytes from . One trick that Copier uses is watching calls to the API to know when data can be accessed. For example, if is never called on , and only called on a subset of , then Copier knows that only a subset of needs to be read. Clients can expose more global optimization opportunities to Copier by marking copy tasks as lazy. Lazy copy tasks do not execute unless they need to. An application could supply this hint if it knows that the destination of a copy operation will likely only be used as the source of a future copy operation. Copier will defer copying bits associated with the first copy operation, under the assumption that will never be called on it. Subsequent copy operations will cut out the middleman and read from the original source. The task of actually copying bits is delegated to a set of kernel threads. The number of threads dynamically scales with the amount of work. These threads use AVX instructions in parallel with dedicated DMA hardware . One issue with this setup is that these kernel threads could potentially fault if a specified userspace address is inaccessible (the data is paged out, or the client submitted a bad address). 
Copier proactively avoids this by locking pages in memory for the duration of the copy (this step is also needed when using DMA hardware). Fig. 11 shows benchmark results for Redis; section 6 of the paper has similar results for other use cases. Source: https://dl.acm.org/doi/10.1145/3731569.3764800

Dangling Pointers
Copies that hit in cache are very different from those which do not. I think more exploration here is warranted. Section 6.3.5 points out that Copier has pros and cons related to CPU caches:
Pro: because Copier uses separate threads, it hopefully runs on different cores from the application. This can prevent large copy operations from evicting hot application data.
Con: when Copier finishes copying data, the destination bytes are likely not in the L1/L2 of an application core, causing it to have to read data from LLC/DRAM.

Because copy operations are spread throughout the stack, careful design of the API is important. For example, the paper describes synchronization between user and kernel requests for the same process, but what about synchronization across two processes?
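To make the segment-bitmap idea concrete, here is a self-contained C++ toy of a client consuming a 1 MiB copy 4 KiB at a time. The descriptor layout, the 4 KiB segment size, and the background thread standing in for the kernel Copier service are all illustrative assumptions, not the paper's actual interface.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstring>
#include <cstdio>
#include <thread>
#include <vector>

constexpr size_t kSegment = 4096;

struct CopyDesc {                       // hypothetical copy-queue descriptor
    const uint8_t* src;
    uint8_t* dst;
    size_t len;
    std::vector<std::atomic<uint8_t>>* done;   // one flag per segment (bitmap stand-in)
};

// Stand-in for the Copier service: copy one segment at a time and mark it done.
void copier_service(CopyDesc d) {
    size_t segments = (d.len + kSegment - 1) / kSegment;
    for (size_t s = 0; s < segments; ++s) {
        size_t off = s * kSegment;
        size_t n = std::min(kSegment, d.len - off);
        std::memcpy(d.dst + off, d.src + off, n);
        (*d.done)[s].store(1, std::memory_order_release);
    }
}

int main() {
    const size_t len = 1 << 20;                          // 1 MiB copy
    std::vector<uint8_t> src(len, 42), dst(len, 0);
    size_t segments = len / kSegment;
    std::vector<std::atomic<uint8_t>> done(segments);
    for (auto& f : done) f.store(0);

    CopyDesc desc{src.data(), dst.data(), len, &done};
    std::thread copier(copier_service, desc);            // "submit" the descriptor

    // Client: consume each destination segment as soon as its bit flips,
    // instead of blocking until the whole 1 MiB copy has completed.
    uint64_t checksum = 0;
    for (size_t s = 0; s < segments; ++s) {
        while (!done[s].load(std::memory_order_acquire)) { /* spin; a real client would do useful work */ }
        for (size_t i = 0; i < kSegment; ++i) checksum += dst[s * kSegment + i];
    }
    copier.join();
    std::printf("checksum = %llu\n", (unsigned long long)checksum);
    return 0;
}
```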


DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge

DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge Arjun Kashyap, Yuke Li, and Xiaoyi Lu HPDC'25

This paper ends with a counterintuitive result: DPUs aren’t amazingly better than traditional CPUs at implementing key-value stores. You would think they would have a shot, given that they have specialized networking accelerators, and key-value stores don’t require a lot of general-purpose computation. It all comes down to access to off-chip memory. Table 1 and Fig. 2 contain profiling results which motivate the designs presented by this paper: Source: https://dl.acm.org/doi/10.1145/3731545.3731571

When a key-value store is run on a traditional CPU, most of the time is spent in packet processing rather than CRUD operations on (key, value) tuples. DPUs are chips that are specialized for packet processing, so one would think a DPU would be especially helpful in this situation. The designs in this paper are evaluated on NVIDIA BlueField 2 and BlueField 3 DPUs. These DPUs contain a mix of fixed-function hardware and ARM cores.

The fundamental idea proposed in this paper is to make the CPU’s job easier by having the DPU perform some of the awkward packet processing work and pass CRUD requests to the CPU in a convenient format. The DPU parses incoming packets, extracts the relevant fields, and writes the minimal amount of information (operation to perform, key, value) to queues in host memory. Fig. 8(d) shows the amount of cruft the DPU is able to remove from each CRUD packet. Source: https://dl.acm.org/doi/10.1145/3731545.3731571

The design described in this paper is optimized in all of the ways you would expect: requests are batched, and appropriate pipelining is used to avoid idle time. Cross-core synchronization (across both DPU and host CPU cores) is minimized by using many queues. Each ARM core owns a unique set of queues, and each host CPU core is assigned to one or more queues.

As Fig. 11 shows, the design described so far (DPU-KV-lat and DPU-KV-sav) doesn’t offer significant speedups. There are latency improvements, but throughput suffers. These designs are bound by the DPU. Evidence for this comes from the performance of DPU-KV-dual and DPU-KV-shrd. DPU-KV-dual allows idle host CPU cores to process raw packets. DPU-KV-shrd uses sharding. The set of keys is partitioned, with some keys handled by the CPU and some keys handled by the DPU. Source: https://dl.acm.org/doi/10.1145/3731545.3731571

Dangling Pointers
The moral of the story is that there is no special advantage conferred on an ARM core inside of a SmartNIC over an Intel core inside of the host CPU. It is interesting to compare this work to a key-value store implemented directly in an RMT pipeline. It would be interesting to drill down into the profiling numbers which motivated this paper and understand how much memory-level parallelism a traditional CPU core can utilize. At the microarchitectural level, this problem has to be memory bound; it would be interesting to see if it is bound by memory latency or bandwidth.
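Here is a rough C++ sketch of what the DPU-to-host handoff described above might look like: a fixed-size (op, key, value) descriptor pushed into a per-core single-producer/single-consumer ring, with the host core applying descriptors to its hash table. The struct layout, ring size, and names are assumptions for illustration, and an ordinary thread plays the role of the DPU core.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <string>
#include <thread>
#include <unordered_map>

enum class Op : uint8_t { GET, PUT, DEL };

struct KvRequest {                    // what the DPU hands to the host core
    Op op;
    char key[24];
    char value[40];
};

struct Spsc {                         // one ring per (DPU core, host core) pair
    static constexpr size_t N = 1024;
    KvRequest slot[N];
    std::atomic<size_t> head{0}, tail{0};
    bool push(const KvRequest& r) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;   // full
        slot[t % N] = r;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(KvRequest& r) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;       // empty
        r = slot[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

int main() {
    Spsc ring;
    std::atomic<bool> done{false};

    // Simulated DPU core: in the real system this is where packet parsing
    // happens; here we just emit pre-parsed requests.
    std::thread dpu([&] {
        for (int i = 0; i < 1000; ++i) {
            KvRequest r{Op::PUT, {}, {}};
            std::snprintf(r.key, sizeof r.key, "key%d", i);
            std::snprintf(r.value, sizeof r.value, "value%d", i);
            while (!ring.push(r)) { /* ring full: back-pressure */ }
        }
        done.store(true, std::memory_order_release);
    });

    // Host CPU core: no packet headers in sight, just CRUD on the store.
    std::unordered_map<std::string, std::string> store;
    auto apply = [&](const KvRequest& q) {
        switch (q.op) {
            case Op::PUT: store[q.key] = q.value; break;
            case Op::DEL: store.erase(q.key); break;
            case Op::GET: (void)store.count(q.key); break;
        }
    };
    KvRequest q;
    for (;;) {
        if (ring.pop(q)) { apply(q); continue; }
        if (done.load(std::memory_order_acquire)) {
            while (ring.pop(q)) apply(q);     // final drain after the producer exits
            break;
        }
    }
    dpu.join();
    std::printf("store now holds %zu keys\n", store.size());
    return 0;
}
```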


A Computing Procedure for Quantification Theory

A Computing Procedure for Quantification Theory Martin Davis and Hilary Putnam JACM Volume 7, Issue 3 This is the oldest paper I’ve summarized. I think SAT solvers are magic and wanted to peek inside the black box. This paper describes a precursor to the full DPLL algorithm. About half of the paper is about dealing with existential and universal quantifiers, which I’m not describing here. The input to the algorithm described here is a logic formula in conjunctive normal form (CNF). Conjunctive normal form comprises a set of clauses. A clause is the logical OR of a set of literals. A literal is either a variable, or the negation of a variable. Logical AND is used to combine all clauses. For example: There are straightforward algorithms to convert any logic formula into CNF (and more complicated ones like the Tseytin transformation ). The goal of this algorithm is to determine if the input logic formula is satisfiable. If it is satisfiable, then the algorithm produces a list, assigning each variable to a specific value ( or ). An alternative normal form would be disjunctive normal form (DNF), but that isn’t practical. If you think about it, any algorithm that outputs DNF effectively outputs a list of all possible solutions to the SAT problem (each clause is a possible solution). So, either the list is small and SAT solving really is easy, or most interesting problems would be prohibitively large in DNF form. This paper even cites a textbook which says that the act of putting a formula into DNF “automatically reveals whether or not the formula is inconsistent”. The search for a satisfying assignment (i.e., mapping of variable names to values) is described as the iterative application of three rules. The third rule contains the magic necessary to keep the search going. If a variable appears in a clause that only contains a single literal, then the variable can be assigned a value consistent with the polarity of the literal. For example, in the following formula, must be . If all appearances of a variable have the same polarity, then that variable can be assigned to a value consistent with that polarity. For example, in the following formula, can safely be assigned to false. Again, the algorithm notes the value of and then simplifies the formula by substituting that value for . This one is interesting in that subsequent work found an equivalent rule which is more efficient. The first step in applying rule III is to pick a variable (the choice can affect performance, but this paper doesn’t offer much guidance on how to choose), and refactor the formula such that it has the following form: Where , , and are expressions that do not contain . The paper doesn’t explicitly say this, but it seems this must cause the formula to temporarily leave CNF. Here is an example conversion. Now that the formula has been refactored, recursively solve the simpler formula: In the example above, the simpler formula is: The intuition here is that must either be or . If is , then must be . If is , then must be . This paper introduces an equivalent Rule III*, which says to simply assume and check if the formula is satisfiable with that assumption. If not, then assume and check if the formula is satisfiable with that assumption. This avoids having to transform the formula into [ form, which “can easily increase the number and the lengths of the clauses in the expression rather quickly after several applications”. 
This paper has the greatest statement of an improvement in runtime: The superiority of the present procedure over those previously available is indicated in part by the fact that a formula on which Gilmore’s routine for the IBM 704 causes the machine to compute for 21 minutes without obtaining a result was worked successfully by hand computation using the present method in 30 minutes. Computers were slower in the 1960s. Imagine someone today coming up with an alternative way of training LLMs which can be done faster by hand than a GPU. This follow-on paper has an interesting observation about the choice of which variable to select in rule III: By rearranging the clauses of a formula a different order would in general be created. In some cases, whether or not the program could actually prove the validity of a given formula (without running out of fast access storage) depended on how one shuffled the punched-data deck before reading it into the assembler! Thus, the variation in ordering of constants did affect by a factor of 10 (from 500 to 5000) … In modern lingo, the branching heuristic matters a lot. Subscribe now
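Rules I, II, and III* map almost directly onto a tiny recursive solver. Here is a minimal C++ sketch (essentially DPLL, with a naive choice of splitting variable, which the paper leaves open) run on a made-up formula:

```cpp
#include <cstdio>
#include <cstdlib>
#include <map>
#include <vector>
using namespace std;

using Clause = vector<int>;   // +v = v, -v = NOT v
using CNF = vector<Clause>;

// Substitute lit = true: drop satisfied clauses, delete the opposite literal.
CNF assign(const CNF& cnf, int lit) {
    CNF out;
    for (const Clause& c : cnf) {
        bool sat = false;
        Clause reduced;
        for (int l : c) {
            if (l == lit) { sat = true; break; }
            if (l != -lit) reduced.push_back(l);
        }
        if (!sat) out.push_back(reduced);
    }
    return out;
}

bool solve(CNF cnf, map<int, bool>& model) {
    for (;;) {
        if (cnf.empty()) return true;                              // all clauses satisfied
        for (const Clause& c : cnf) if (c.empty()) return false;   // empty clause: conflict

        // Rule I: a one-literal clause forces its variable.
        int unit = 0;
        for (const Clause& c : cnf) if (c.size() == 1) { unit = c[0]; break; }
        if (unit) { model[abs(unit)] = unit > 0; cnf = assign(cnf, unit); continue; }

        // Rule II: a variable whose literals all have the same polarity.
        map<int, int> polarity;                                    // var -> bitmask {1 = pos, 2 = neg}
        for (const Clause& c : cnf) for (int l : c) polarity[abs(l)] |= (l > 0 ? 1 : 2);
        int pure = 0;
        for (auto& [v, m] : polarity) if (m != 3) { pure = (m == 1 ? v : -v); break; }
        if (pure) { model[abs(pure)] = pure > 0; cnf = assign(cnf, pure); continue; }

        // Rule III*: pick a variable, assume p = true, and if that fails assume p = false.
        int p = abs(cnf[0][0]);
        map<int, bool> saved = model;
        model[p] = true;
        if (solve(assign(cnf, p), model)) return true;
        model = saved; model[p] = false;
        return solve(assign(cnf, -p), model);
    }
}

int main() {
    CNF cnf = {{1, 2}, {-1, 3}, {-2, -3}, {2, 3}};   // made-up example
    map<int, bool> model;
    if (solve(cnf, model)) {
        printf("satisfiable:");
        for (auto& [v, b] : model) printf(" x%d=%d", v, (int)b);
        printf("\n");
    } else {
        printf("unsatisfiable\n");
    }
    return 0;
}
```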


High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance

High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance Maximilian Kuschewski, Jana Giceva, Thomas Neumann, and Viktor Leis SIGMOD'25 In database vernacular, spilling is the process of writing intermediate data to disk in order to evaluate a query with a finite amount of main memory. It goes without saying that database folks are control freaks performance conscious and don’t want to rely on generic OS paging mechanisms to handle working sets which are larger than main memory. This tidbit about Snowflake is fascinating: Only 5% of analytical queries in Snowflake’s 2018 workload trace spill data, but those 5% contribute 45% of the overall CPU time and 29% of the total execution time [77]. One fundamental problem in this area is that it is very hard to predict up-front exactly how much working memory will be needed to efficiently execute a query. Databases have to estimate based on statistics of the relations used in a query, or assume no spilling will be needed, and then gracefully fall back to spilling if necessary. The two operators which this paper deals with are joins and aggregations. Both involve key columns, and the cardinality of the key columns is critical in determining if spilling is necessary. One obvious spilling mechanism is to use a partitioning approach for joins and aggregations. I’ve described partitioned joins in summaries of these papers: Efficiently Processing Joins and Grouped Aggregations on GPUs The reason why partitioning nicely solves the problem is that the working set requirements for both steps of a partitioned join are modest. The partitioning step only requires a small amount of memory per partition (e.g., accumulate 64KiB per partition before appending partitioned tuples to on-disk storage). The join step only needs enough working memory to join a single partition. Section 4.1 of the paper claims that partitioning slows down TPC-H queries by 2-5x. My spider sense is tingling, but let’s take this as an axiom for now. Here is the premise of the paper: partitioning is better for queries that must spill, but worse for queries that can be completed without spilling . What is an efficient and simple design given that a database cannot perfectly predict up-front if it will need to spill? Prior work along similar lines introduced the hybrid hash join. A hybrid hash join partitions the left (build) input of the join and dynamically decides what percentage of build partitions must be spilled. A hash table is built containing all non-spilled build partitions. Next the right (probe) input to the join is processed. For each probe tuple, the database determines if the associated partition was spilled. If the partition was spilled, then the probe tuple is spilled. If not, then the probe tuple is processed immediately via a lookup in the hash table. Finally, all spilled partitions are processed. The downside of this approach is that it always partitions the build side, even when that is unnecessary. This paper proposes a join implementation that only pays the cost of partitioning when spilling is required. It is a two-phase process. In the materialization phase, the build table is scanned (and pushed-down filters are applied). The resulting tuples are stored in a list of pages. At first, the system optimistically assumes that no spilling is necessary and appends each tuple to the current page. If a memory limit is reached, then partitioning is enabled. 
Each tuple processed after that point is partitioned, and per-partition lists of pages are allocated. If a further memory limit is reached, then some partitions are spilled to disk. Next, the build and probe phase executes in a manner similar to hybrid hash join. However, there is a fast path for the case where no spilling occurred. In this case, the tuples produced by the materialization phase are inserted into one large hash table, and then the probe tuples are streamed, with one hash table lookup per tuple. If partitioning (but no spilling) occurred, then the hash table inserts will have high locality (assuming a chained hash table). If spilling did occur, then the build and probe phase operates like a hybrid hash join, spilling probe tuples if and only if the associated build partition was spilled. The paper isn’t clear on what happens to non-partitioned build tuples once the system decides to start spilling build partitions. My assumption is that in this case, probe tuples must probe both the spilled build tuples and those build tuples that were processed before partitioning was enabled.

The implementation of aggregation described in this paper follows a similar 2-phase approach. The materialization phase performs pre-aggregation: the system starts by aggregating into an in-memory hash table. If the hash table grows too large, then partitioning kicks in. Tuples from in-memory hash tables are evicted into per-partition page lists, and subsequent tuples are directly stored in these per-partition page lists. These per-partition pages can be spilled if further memory limits are reached. The subsequent phase then performs any necessary aggregation. If no eviction occurred, then this phase has no work to do.

The paper describes an adaptive compression algorithm that improves spilling performance. Some interesting numbers from the paper:
The number of CPU cycles per input byte for executing in-memory TPC-H queries ranges from 3.3 to 60.3
The number of CPU cycles required to write and read a byte to/from SSD is 11.1

The adaptive nature of this scheme is driven by the fact that queries are diverse, and compressors have knobs which trade off speed for compression ratio, as illustrated by Fig. 3: Source: https://dl.acm.org/doi/10.1145/3698813 When spilling occurs, the system dynamically balances CPU usage and IO bandwidth by adjusting these knobs. Fig. 5 shows in-memory throughput while Fig. 6 shows throughput when spilling occurs: Source: https://dl.acm.org/doi/10.1145/3698813

Dangling Pointers
The design of the unified hash join hinges on the fact that partitioning is a bad idea for the in-memory case. This is in contrast to papers describing in-memory partitioned join implementations on other types of chips like SPID-Join and Efficiently Processing Joins and Grouped Aggregations on GPUs. I imagine there is a lot of literature about this topic that I haven’t read. Leave a comment with your experience. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts.
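Here is a C++ sketch of the adaptive materialization phase described above, with plain vectors standing in for pages and a vector of "on-disk" tuples standing in for the SSD. The thresholds, partition count, and eviction policy are made up; the point is only the three regimes: no partitioning, partitioning in memory, and spilling.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
using namespace std;

struct Tuple { uint64_t key; uint64_t payload; };

struct Materializer {
    static constexpr size_t kPartitions = 16;
    size_t partition_limit;                    // tuples before partitioning kicks in
    size_t spill_limit;                        // in-memory tuples before spilling
    vector<Tuple> flat;                        // fast path: one big list, no partitioning
    vector<vector<Tuple>> part = vector<vector<Tuple>>(kPartitions);   // per-partition lists
    vector<vector<Tuple>> disk = vector<vector<Tuple>>(kPartitions);   // stand-in for spilled pages
    bool partitioning = false;
    size_t in_memory = 0;

    void add(const Tuple& t) {
        if (!partitioning && flat.size() >= partition_limit) {
            partitioning = true;               // repartition what we have so far
            for (const Tuple& u : flat) part[u.key % kPartitions].push_back(u);
            in_memory = flat.size();
            flat.clear();
        }
        if (!partitioning) { flat.push_back(t); return; }

        part[t.key % kPartitions].push_back(t);
        if (++in_memory > spill_limit) {       // evict the largest partition to "disk"
            size_t victim = 0;
            for (size_t p = 1; p < kPartitions; ++p)
                if (part[p].size() > part[victim].size()) victim = p;
            in_memory -= part[victim].size();
            for (const Tuple& u : part[victim]) disk[victim].push_back(u);
            part[victim].clear();
        }
    }
};

int main() {
    Materializer m{/*partition_limit=*/10000, /*spill_limit=*/20000};
    for (uint64_t k = 0; k < 100000; ++k) m.add({k, k * 7});
    size_t spilled = 0;
    for (auto& d : m.disk) spilled += d.size();
    printf("partitioning=%d, in-memory tuples=%zu, spilled tuples=%zu\n",
           (int)m.partitioning, m.in_memory + m.flat.size(), spilled);
    return 0;
}
```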


Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt?

Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt? Kaisong Huang, Jiatang Zhou, Zhuoyue Zhao, Dong Xie, and Tianzheng Wang SIGMOD'25 Say you are a database, and your job is to execute two kinds of queries (both from different TPC benchmarks ): High-priority New-Order queries from TPC-C (OLTP) Low-priority Q2 queries from TPC-H (OLAP) Congratulations, you are a hybrid transaction/analytical processing ( HTAP ) database! You would like OLTP transactions to experience low tail latency, while OLAP transactions run at high throughput. How can transactions be scheduled to achieve these goals? A Non-Preemptive FIFO policy runs transactions to completion in the order they came. This is easy but has a high tail latency for OLTP transactions that have to sit in line behind a long queue of OLAP transaction. A cooperative policy involves yield points at specific places in the database code. At each yield point, an OLAP transaction can realize that an OLTP transaction is waiting and yield the CPU. It is a hassle to insert yield points, and it is hard to find the Goldilocks frequency. Checking for high-priority transactions too often adds overhead, but checking infrequently increases tail latency. A preemptive policy allows high-priority OLTP transactions to borrow a CPU core as soon as possible. In the past, the only practical way to do this involved OS context switching, which is expensive. Fig. 2. illustrates these policies: Source: https://dl.acm.org/doi/abs/10.1145/3725319 Enter userspace interrupts . These allow preemption without OS kernel involvement. Section 4.2 of the paper makes it clear that it isn’t totally easy to implement userspace context switching on top of userspace interrupts. An idiomatic use case for userspace interrupts is for an interrupt handler to quickly save some data and then return back to the code that was running. The context switch case is not idiomatic. For each CPU core, two pthreads threads are created, and there are two stacks. Say the CPU core is running a low-priority (OLAP) transaction and a userspace interrupt is delivered to the core. The userspace interrupt handler is invoked, which mucks around with the CPU registers and (including the stack pointer), and then returns. But it doesn’t return to the code that was running the low-priority transaction, it returns to code which runs the high-priority (OLTP) transaction . Once the high-priority transaction finishes, it calls a voluntary context switch function, which again mucks around with CPU registers and the stack pointer in just the correct manner so that it returns back to the code running the low-priority transaction. There are some nitty-gritty details to get this working correctly. Tricky cases have to be handled such as: A userspace interrupt occurring in the middle of the context switch function Support for database code which uses thread-local storage (e.g., the modifier in C++ ) Avoiding deadlocks associated with a userspace interrupt occurring while a database lock is acquired As seen with xUI , while userspace interrupts are cheap, they still incur a cost. This paper proposes firing a single interrupt to execute a batch of high-priority transactions. Section 5 also describes a starvation avoidance mechanism to ensure that low-priority transactions eventually finish. Note that when a low-priority transaction is not preempted, it is not automatically aborted . 
The paper assumes the underlying database uses multi-versioning and optimistic concurrency control. Fig. 10 has the headline results. One line represents FIFO scheduling, another represents the case where the OLAP query can occasionally yield the CPU core, and the third is the work described in this paper. Tail latencies for OLTP queries are significantly reduced, while performance of OLAP queries does not change much. Source: https://dl.acm.org/doi/abs/10.1145/3725319

Dangling Pointers
It would be nice to see a comparison against a version which uses traditional OS thread synchronization rather than userspace interrupts. The details of userspace context switching are tricky and seem orthogonal to databases. A library or OS facility which provides a robust implementation seems like a useful thing to have. The paper doesn’t mention what happens if the work associated with a single query is parallelized across multiple CPU cores. I imagine this complicates the scheduling policy. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts.
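The two-stack switch is the part that is easiest to get wrong, so here is a minimal Linux sketch of the same shape using POSIX ucontext as a stand-in for the userspace-interrupt path: one "core" alternates between a low-priority OLAP loop and a high-priority OLTP batch that borrows the core and then voluntarily switches back. The real system does the equivalent register manipulation inside a uintr handler, which is not shown here.

```cpp
#include <ucontext.h>
#include <cstdio>

static ucontext_t olap_ctx, oltp_ctx;
static char oltp_stack[64 * 1024];

// High-priority work: run a batch of OLTP transactions, then hand the core
// back to whatever low-priority code was preempted.
static void run_oltp_batch() {
    for (int i = 0; i < 3; ++i)
        std::printf("  OLTP txn %d (high priority)\n", i);
    swapcontext(&oltp_ctx, &olap_ctx);         // voluntary switch back
}

int main() {
    getcontext(&oltp_ctx);
    oltp_ctx.uc_stack.ss_sp = oltp_stack;      // the second stack
    oltp_ctx.uc_stack.ss_size = sizeof(oltp_stack);
    oltp_ctx.uc_link = &olap_ctx;              // fall back to OLAP if the batch returns
    makecontext(&oltp_ctx, run_oltp_batch, 0);

    for (int chunk = 0; chunk < 4; ++chunk) {
        std::printf("OLAP chunk %d (low priority)\n", chunk);
        if (chunk == 1) {
            // Stand-in for the userspace interrupt arriving mid-OLAP:
            // save the OLAP context and resume on the OLTP stack.
            std::printf("-- preemption point --\n");
            swapcontext(&olap_ctx, &oltp_ctx);
            std::printf("-- OLAP resumes where it left off --\n");
        }
    }
    return 0;
}
```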


SuSe: Summary Selection for Regular Expression Subsequence Aggregation over Streams

SuSe: Summary Selection for Regular Expression Subsequence Aggregation over Streams Steven Purtzel and Matthias Weidlich SIGMOD'25 A common use case for regular expressions is to check whether a particular input string matches a regular expression. This paper introduces a new use case, which is generating statistics over streams. Imagine an infinite stream of characters constantly being processed, and a dashboard which a person occasionally views. The dashboard shows statistics, e.g., “there have been matches in the last hours”. It is OK for these statistics to be approximate. The final quirk introduced by this paper is to allow for noise. In particular, all regular expressions implicitly accept “unexpected inputs” at any time and simply behave as if the unexpected input had never appeared. Counterbalancing this flexibility is a sliding time window which requires that if a particular subset of the input stream is going to match, it must do so with a bounded number of characters. A regular expression can be represented as an NFA . The N stands for non-deterministic , and that is the tricky bit. For example, say the regular expression to be matched against is . This matches strings which begin with , followed by 0 or more characters, followed by exactly 1 , followed by . Another way to write this expression is: ( stands for “one or more s”). Imagine you were designing a state machine to recognize strings which match this pattern. It starts off easy, the machine should expect the first input to be . The next expected input is , but then what? That first could be one of many represented by the , or it could be the last right before the . The state machine doesn’t know how to classify the inputs that it has seen until it sees the final (or some other non-B character, indicating no match). A common solution to this problem is like the metaverse, when that first B comes, split the universe into two, and then send all subsequent inputs to both. An easy way to implement this is to use a bit vector to track all possible states the NFA could be in. The NFA starts in the initial state (so only one bit is set), and transitions through states like you would expect (only one bit is set at a time). When that first comes however, two bits are set, indicating that there are two possible ways the input could match. The NFA above can be represented with 6 states: The following table shows how the bit vector could evolve over time (one new input accepted each time step, bit vectors are written with the least-significant bit on the right): The key insight in this paper is to track states with counters, not bits. Each counter represents the number of matches that have been seen thus far in the stream. For example, when an input comes along, the counter associated with state 1 is increased by one. When a comes along, the counter associated with state 4 is increased by the value of the counter associated with state 3. Section 5.1 of the paper explains how counter increment rules can be derived for a specific NFA. This system described by this paper contains multiple sets of counters. Some counters are global, and some are local to each input character. The purpose of the local counters is to allow for “garbage collection”. The idea is that the system keeps a sliding window of the most important recent inputs, along with global and local state counters. When a new input comes, the local state counters are used to determine which input in the sliding window to delete. 
This determination is made based on an estimate of how likely each input is to generate a match. When the user views the dashboard, then all matches within the sliding window are identified. It is OK if this process is expensive, because it is assumed to be rare. Fig. 9 has a comparison of this work (StateSummary) against other approaches: Source: https://dl.acm.org/doi/10.1145/3725359 Dangling Pointers State machines are notoriously hard to optimize with SIMD or specialized hardware because of the fundamental feedback loop involved. It would be interesting to see if this approach is amenable to acceleration. Subscribe now
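The counters-instead-of-bits idea is easiest to see on a stripped-down pattern. The sketch below counts subsequence matches of "A then B then C" in a stream, with noise characters skipped and no sliding window or summary eviction; there is one counter per NFA state, and each input adds the predecessor state's counter into the next one. The post's running example (A, one or more Bs, then C) needs an extra state but follows the same recipe.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

int main() {
    std::string stream = "xAyABzBCuC";   // arbitrary noise mixed in
    // Counters: matched "A", matched "A..B", matched "A..B..C".
    uint64_t a = 0, ab = 0, abc = 0;
    for (char ch : stream) {
        if (ch == 'C') abc += ab;   // every partial A..B match completes here
        if (ch == 'B') ab  += a;    // every partial A match extends here
        if (ch == 'A') a   += 1;    // a new potential match starts here
    }
    std::printf("subsequence matches of A..B..C: %llu\n",
                (unsigned long long)abc);   // prints 8 for the stream above
    return 0;
}
```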


Fast and Scalable Data Transfer Across Data Systems

Fast and Scalable Data Transfer Across Data Systems Haralampos Gavriilidis, Kaustubh Beedkar, Matthias Boehm, and Volker Markl SIGMOD'25

We live in exciting times: unimaginably large language models getting better each day, and a constant stream of amazing demos. And yet, efficiently transferring a table between heterogeneous systems is an open research problem! An example from the paper involves transferring data from PostgreSQL to pandas. Optimizing this transfer time is important and non-trivial. The paper describes a system named XDBC. XDBC software runs on both the source and the destination data management systems (DMS), as illustrated by Fig. 4: Source: https://dl.acm.org/doi/10.1145/3725294

The XDBC client/server processes are organized as a pipeline. Data parallelism within a stage is exploited by assigning 1 or more workers (e.g., cores) to each stage. There are a lot of knobs which can affect end-to-end throughput:
Number of workers assigned to each task
Data interchange format (row-major, column-major, Arrow)
Compression (zstd, snappy, lzo, lz4)

Section 4.1 of the paper claims the search space is so large that brute force search will not work, so a heuristic algorithm is used. The heuristic algorithm assumes accurate performance models which can estimate the performance of each pipeline stage given a specific configuration. This model is based on real-world single-core performance measurements, and Gustafson’s law to estimate multi-core scaling. The algorithm starts by assigning 1 worker to each pipeline stage (in both the client and server). An iterative process then locates the pipeline stage which is estimated to be the slowest and assigns additional workers to it until it is no longer the bottleneck. This process continues until no more improvement can be found, due to one of the following reasons:
All available CPU cores have been assigned
Network bandwidth is the bottleneck

If the process ends with more CPU cores available, then a hard-coded algorithm determines the best compression algorithm given the number of cores remaining. The data interchange format is determined based on which formats the source and destination DMSs support, and which compression algorithm was chosen. A sketch of the greedy core-assignment loop appears below.

The XDBC optimizer has a lot of similarities with the Alkali optimizer. Here are some differences:
Alkali does not require tasks to be executed on separate cores. For example, Alkali would allow a single core to execute both the and pipeline stages.
Alkali uses an SMT solver to determine the number of cores to assign to each stage.
The Alkali performance model explicitly takes into account inter-core bandwidth requirements.
Alkali doesn’t deal with compression.

Fig. 7(a) shows results from the motivating example (PostgreSQL→Pandas). Fig. 7(b) compares XDBC vs built-in Pandas functions to read CSV data over HTTP. connector-x is a more specialized library which supports reading data into Python programs specifically. Source: https://dl.acm.org/doi/10.1145/3725294

Dangling Pointers
There are many search spaces which are too large for brute force. Special-case heuristic algorithms are one fallback, but as the Alkali paper shows, there are other approaches (e.g., LP solvers, ILP solvers, SMT solvers, machine learning models). It would be great to see cross-cutting studies comparing heuristics to other approaches.
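Here is the greedy bottleneck heuristic sketched in C++, under made-up per-stage throughput numbers: start with one worker per stage, repeatedly give an extra worker to the estimated slowest stage, and stop when cores run out, when the network is the bottleneck, or when adding a worker no longer helps. The scaling model is a crude stand-in for the paper's calibrated single-core measurements plus Gustafson's law.

```cpp
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

struct Stage {
    string name;
    double one_worker_mbps;                    // measured single-worker throughput (made up)
    int workers = 1;
    double parallel_fraction = 0.9;            // stand-in scaling parameter
    double throughput() const {                // Gustafson-style speedup estimate
        double speedup = (1 - parallel_fraction) + parallel_fraction * workers;
        return one_worker_mbps * speedup;
    }
};

int main() {
    vector<Stage> stages = {
        {"read", 900}, {"deserialize", 350}, {"compress", 250},
        {"send", 2000}, {"decompress", 400}, {"write", 700},
    };
    const double network_mbps = 1200;          // link capacity (made up)
    int cores_left = 16 - (int)stages.size();  // one worker per stage already assigned

    while (cores_left > 0) {
        // Find the slowest stage under the current assignment.
        Stage* slowest = &stages[0];
        for (Stage& s : stages)
            if (s.throughput() < slowest->throughput()) slowest = &s;
        if (slowest->throughput() >= network_mbps) break;   // network-bound: stop
        double before = slowest->throughput();
        slowest->workers++;
        cores_left--;
        if (slowest->throughput() <= before) break;         // no further improvement
    }
    for (const Stage& s : stages)
        printf("%-11s workers=%d est=%.0f MB/s\n", s.name.c_str(), s.workers, s.throughput());
    printf("cores left over for compression tuning: %d\n", cores_left);
    return 0;
}
```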
