Latest Posts (20 found)

Optimizing Datalog for the GPU

Optimizing Datalog for the GPU Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Sidharth Kumar, and Kristopher Micinski ASPLOS'25

Datalog source code comprises a set of relations and a set of rules. A relation can be explicitly defined with a set of tuples; the running example in the paper defines a graph with an edge relation. A relation can also be implicitly defined with a set of rules; the paper uses a same-generation relation as an example. Rule 1 states that two vertices are part of the same generation if they share a common ancestor and are not actually the same vertex. Rule 2 states that two vertices are part of the same generation if they have ancestors that are themselves from the same generation. "Running a Datalog program" entails evaluating all rules until a fixed point is reached (no more tuples are added).

One key idea to internalize is that evaluating a Datalog rule is equivalent to performing a SQL join. For example, rule 1 is equivalent to joining the edge relation with itself, using the shared-ancestor column as the join key and the inequality of the two child vertices as a filter.

Semi-naïve evaluation is an algorithm for performing these joins until convergence, while not wasting too much effort on redundant work. The tuples in a relation are put into three buckets:

- new: holds tuples that were discovered in the current iteration
- delta: holds tuples which were added in the previous iteration
- full: holds all tuples that have been found in any iteration

For a join involving two relations (A and B), new is computed as the union of the results of 3 joins: A's delta joined with B's full, A's full joined with B's delta, and A's delta joined with B's delta. The key fact for performance is that full is never joined with full. More details on semi-naïve evaluation can be found in these notes.

This paper introduces the hash-indexed sorted array for storing relations while executing semi-naïve evaluation on a GPU. It seems to me like this data structure would work well on other chips too. Fig. 2 illustrates the data structure (join keys are in red):

Source: https://dl.acm.org/doi/10.1145/3669940.3707274

The data array holds the actual tuple data. It is densely packed in row-major order. The sorted index array holds pointers into the data array (one pointer per tuple). These pointers are lexicographically sorted (join keys take higher priority in the sort). The hash table is an open-addressed hash table which maps a hash of the join keys to the first element in the sorted index array that contains those join keys.

A join of relations A and B can be implemented with pseudo-code along the lines of the sketch at the end of this summary: probe the hash table with a tuple's join keys, then scan forward through the sorted index array while the keys match. Memory accesses when probing through the sorted index array are coherent. Memory accesses when accessing the data array are coherent up to the number of elements in a tuple.

Table 3 compares the results from this paper (GPULog) against a state-of-the-art CPU implementation (Soufflé). HIP represents GPULog ported to AMD's HIP runtime and then run on the same Nvidia GPU.

Source: https://dl.acm.org/doi/10.1145/3669940.3707274

Dangling Pointers

The data structure and algorithms described by this paper seem generic; it would be interesting to see them run on other chips (FPGA, DPU, CPU, HPC cluster).

I would guess most of GPULog is bound by memory bandwidth, not compute. I wonder if there are Datalog-specific algorithms to reduce the bandwidth/compute ratio.
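Since the join pseudo-code itself did not survive into this summary, here is a minimal CPU-side C sketch of the probe-and-scan loop over a hash-indexed sorted array. The struct layout, FNV-1a hash, and linear-probing scheme are my assumptions, not the paper's GPU implementation; on the GPU the outer loop over A's tuples would be spread across threads.

```c
/* Join relation A (outer) against relation B (inner), where B is stored as
 * a hash-indexed sorted array. Layout details are assumptions for the sketch. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint32_t **index;   /* sorted index array: one pointer per tuple  */
    size_t  num_tuples;
    size_t  num_join_cols;    /* leading columns used as the join key       */
    const size_t *hash_table; /* open-addressed: slot -> position in index  */
    size_t  hash_capacity;    /* power of two; SIZE_MAX marks an empty slot */
} relation_t;

static uint64_t hash_key(const uint32_t *tuple, size_t n) {
    uint64_t h = 14695981039346656037ULL;               /* FNV-1a, my choice */
    for (size_t i = 0; i < n; i++) { h ^= tuple[i]; h *= 1099511628211ULL; }
    return h;
}

static int keys_equal(const uint32_t *a, const uint32_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) if (a[i] != b[i]) return 0;
    return 1;
}

/* For every tuple of A, probe B's hash table, then scan forward through
 * B's sorted index array while the join keys still match. */
void join(const relation_t *A, const relation_t *B,
          void (*emit)(const uint32_t *a, const uint32_t *b)) {
    for (size_t i = 0; i < A->num_tuples; i++) {   /* one GPU thread per i */
        const uint32_t *a = A->index[i];
        size_t slot = hash_key(a, B->num_join_cols) & (B->hash_capacity - 1);
        while (B->hash_table[slot] != SIZE_MAX) {  /* linear probing */
            size_t pos = B->hash_table[slot];
            if (keys_equal(a, B->index[pos], B->num_join_cols)) {
                /* coherent scan over B's matching tuples */
                for (; pos < B->num_tuples &&
                       keys_equal(a, B->index[pos], B->num_join_cols); pos++)
                    emit(a, B->index[pos]);
                break;
            }
            slot = (slot + 1) & (B->hash_capacity - 1);
        }
    }
}
```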

2 views

Extended User Interrupts (xUI): Fast and Flexible Notification without Polling

Extended User Interrupts (xUI): Fast and Flexible Notification without Polling Berk Aydogmus, Linsong Guo, Danial Zuberi, Tal Garfinkel, Dean Tullsen, Amy Ousterhout, and Kazem Taram ASPLOS'25

This paper describes existing hardware support for userspace interrupts, and extensions to make it more efficient.

The kernel is powerful, and yet slow. You only want to call on it when necessary. Kernel bypass technologies like DPDK and io_uring exist to allow applications to reduce the frequency of kernel calls. In addition to I/O, applications frequently call the kernel to communicate between threads. For example, in a producer/consumer design, the producer could use a signal to tell the consumer that more data is ready. Section 2 of the paper reminds readers that each of these signals costs about 2.4 microseconds.

The idea behind userspace interrupts is to get the kernel out of the way and instead have dedicated hardware to support cheap signaling between threads. UIPI is a hardware feature introduced by Intel with Sapphire Rapids. Section 3 of the paper describes how UIPI works, including reverse engineering some of the Sapphire Rapids microarchitecture.

As with other kernel bypass technologies, the kernel is still heavily involved in the control path. When a process requests that the kernel configure UIPI, the kernel responds by creating or modifying the user interrupt target table (UITT) for the process. This per-process table has one entry per thread. The kernel thread scheduler updates this table so that the UIPI hardware can determine which core a thread is currently running on.

Once the control path is set up, the data path runs without kernel involvement. Userspace code which wants to send an interrupt to another thread can execute the SENDUIPI instruction. This instruction has one operand, which is an index into the UITT (the index of the destination thread). The hardware then consults the UITT and sends an inter-processor interrupt (IPI) to the core on which the destination thread is running. Userspace code in the destination thread then jumps to a pre-registered interrupt handler, which runs arbitrary code in userspace. Hardware has long supported IPIs, but typically only the kernel has had the ability to invoke them.

The hardware has the ability to coordinate with the OS to handle the case where the destination thread is not currently running on any core. These cases do involve running kernel code, but they are the slow path. The fast path is handled without any kernel involvement.

The authors measure an end-to-end latency of 1360 clock cycles for a userspace interrupt on Sapphire Rapids. Fig. 2 illustrates where that time goes:

Source: https://dl.acm.org/doi/abs/10.1145/3676641.3716028

The largest cost is the pipeline flush when the receiving core receives the IPI. Section 3.4 describes experiments the authors performed to determine these numbers, including how they determined that the receiving processor pipeline is flushed. Note that "flush" here means that in-flight instructions are squashed (i.e., what happens when a branch misprediction is detected). An alternative strategy would be to drain the pipeline, which would let outstanding instructions commit before handling the interrupt. This would avoid duplicate work but would increase the latency of handling an interrupt.

Tracked Interrupts

The authors propose tracked interrupts to solve the performance problem associated with pipeline flushes.
When an interrupt is received, the receiving core immediately injects the micro-ops needed to handle the interrupt into the pipeline. Outstanding instructions are not squashed. This may sound similar to draining, but there is a key difference. With draining, the processor waits for all inflight instructions to commit before injecting the interrupt handling micro-ops. With tracked interrupts, micro-ops enter the pipeline immediately, and the out-of-order machinery of the processor will not see any dependencies between the interrupt handling micro-ops and the inflight instructions. This means that the interrupt handling micro-ops can execute quickly and typically do not need to wait behind all inflight instructions.

Results

The paper has simulation results which show that tracked interrupts save 414 clock cycles on the receiver side. The paper also discusses some related improvements to UIPI:

- Userspace timer support
- Userspace interrupts from I/O devices
- Safepoints, to allow userspace interrupts to play nice with garbage collection

I wonder how much of the Sapphire Rapids design was dictated by simplicity (to reduce the risk associated with this feature)? A frequent theme in papers reviewed on this blog is how difficult it is to implement fine-grained multithreading on general purpose multi-core CPUs. I wonder how much userspace interrupt support can help with that problem?
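To make the producer/consumer example concrete, here is a hedged C sketch of the sender's fast path. The _senduipi intrinsic exists in recent GCC/Clang behind -muintr, but the control-path setup (registering the consumer's handler and obtaining its UITT index) needs kernel support that is not shown; uitt_index_of_consumer is a placeholder value, not a real API.

```c
/* Sketch of the UIPI fast path from the sender's perspective. Assumes a CPU
 * with user-interrupt support and a kernel that has already populated this
 * process's UITT (control path omitted). */
#include <x86gprintrin.h>   /* _senduipi(), requires -muintr */
#include <stdatomic.h>
#include <stdint.h>

/* Shared ring between producer and consumer threads. */
struct ring {
    _Atomic uint64_t head;
    uint64_t slots[1024];
};

/* Hypothetical: the kernel handed back this UITT index when the consumer
 * thread registered itself as an interrupt target. */
static unsigned long long uitt_index_of_consumer;

void produce(struct ring *r, uint64_t item)
{
    uint64_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    r->slots[h % 1024] = item;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);

    /* Data path: a single instruction. The hardware looks up entry
     * `uitt_index_of_consumer` in the UITT, finds the core the consumer is
     * currently running on, and sends it an IPI -- no syscall, no signal. */
    _senduipi(uitt_index_of_consumer);
}
```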

0 views

No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory

No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory Hyoungjoo Kim, Yiwei Zhao, Andrew Pavlo, Phillip B. Gibbons VLDB'25

This paper describes how processing-in-memory (PIM) hardware can be used to improve OLTP performance. Here is a prior paper summary from me on a similar topic, but that one is focused on OLAP rather than OLTP.

UPMEM is a specific PIM product (also used in the prior paper on this blog). A UPMEM DIMM is like a DRAM DIMM, but each DRAM bank is extended with a simple processor which can run user code. That processor has access to a small local memory and the DRAM associated with the bank. This paper calls each processor a PIM Module. There is no direct communication between PIM modules.

Fig. 2 illustrates the system architecture used by this paper. A traditional CPU is connected to a set of boring old DRAM DIMMs and is also connected to a set of UPMEM DIMMs.

Source: https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory

Four Challenges

The paper identifies the following difficulties associated with using UPMEM to accelerate an OLTP workload:

- PIM modules can only access their local memory
- PIM modules do not have the typical niceties associated with x64 CPUs (high clock frequency, caches, SIMD)
- There is a non-trivial cost for the CPU to send data to UPMEM DIMMs (similar to the CPU writing data to regular DRAM)
- OLTP workloads have tight latency constraints

The authors arrived at a solution that both provides a good speedup and doesn't require boiling the ocean. The database code and architecture remain largely unchanged. Much of the data remains in standard DRAM DIMMs, and the database operates on it as it always has. In section 3.2 the authors identify a handful of data structures and operations with near-memory affinity which are offloaded. These data structures are stored in UPMEM DIMMs, and the algorithms which access them are offloaded to the PIM modules.

The key feature that these algorithms have in common is pointer chasing. The sweet spots the authors identify involve a small number of parameters sent from the CPU to a PIM module, then the PIM module performing multiple roundtrips to its local DRAM bank, followed by the CPU reading back a small amount of response data. The roundtrips to PIM-local DRAM have lower latency than accesses from a traditional CPU core.

One operation which involves a lot of pointer chasing is B+ tree traversal. Thus, the system described in this paper moves B+ tree indexes into UPMEM DIMMs and uses PIM modules to search for values in an index. Note that the actual tuples that hold row data stay in plain-old DRAM. The tricky part is handling range queries while distributing an index across many banks. The solution described in this paper is to partition the set of keys into 2^R partitions (the lower bits of a key determine the index of the partition which holds that key). Each partition is thus responsible for a contiguous array of keys. For a range query, the lower bits of the range's lower and upper bounds can be used to determine which partitions must be searched. Each PIM module is responsible for multiple partitions, and a hash function is used to convert a partition index into a PIM module index (a sketch of this routing follows).
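Here is a small C sketch of the two-level routing described above: a key maps to one of 2^R partitions using some of its bits (the summary says the lower bits), and a partition maps to a PIM module via a hash. The constants, the hash function, and the callback interface are my own choices, not the paper's.

```c
#include <stdint.h>

#define R               10u
#define NUM_PARTITIONS  (1u << R)
#define NUM_PIM_MODULES 2048u

static uint32_t partition_of(uint64_t key) {
    return (uint32_t)(key & (NUM_PARTITIONS - 1));        /* lower R bits */
}

static uint32_t pim_module_of(uint32_t partition) {
    return (partition * 2654435761u) % NUM_PIM_MODULES;   /* any fixed hash */
}

/* Point lookup: the CPU asks exactly one PIM module to search its local
 * B+ tree partition; the module returns the tuple's address in host DRAM. */
uint32_t route_lookup(uint64_t key) {
    return pim_module_of(partition_of(key));
}

/* Range query over [lo, hi]: only the partitions whose keys can fall in the
 * range need to be searched; each is handled by the PIM module that owns it,
 * and the CPU merges the small per-partition results. */
void route_range(uint64_t lo, uint64_t hi,
                 void (*search)(uint32_t pim, uint32_t part,
                                uint64_t lo, uint64_t hi)) {
    if (hi - lo >= NUM_PARTITIONS) {                      /* wide range */
        for (uint32_t p = 0; p < NUM_PARTITIONS; p++)
            search(pim_module_of(p), p, lo, hi);
        return;
    }
    for (uint32_t p = partition_of(lo); ; p = (p + 1) & (NUM_PARTITIONS - 1)) {
        search(pim_module_of(p), p, lo, hi);
        if (p == partition_of(hi)) break;
    }
}
```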
MVCC (multi-version concurrency control) is a concurrency control method which requires the database to keep around old versions of a given row (to allow older in-flight queries to access them). The set of versions associated with a row is typically stored in a linked list (yet another pointer traversal). Again, the actual tuple contents are stored in regular DRAM, but the list links are stored in UPMEM DIMMs, with the PIM modules traversing the links (a sketch of this traversal appears at the end of this summary). Section 4.3 has more information about how old versions are eventually reclaimed with garbage collection.

Fig. 7 has the headline results, comparing the baseline against the system described by this paper. It is interesting that the new system only beats the baseline for read-only workloads.

Source: https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory

Processing-in-memory can help with memory bandwidth and memory latency. It seems like this work is primarily focused on memory latency. I suppose this indicates that OLTP workloads are fundamentally latency-bound, because there is not enough potential concurrency between transactions to hide that latency. Is there no way to structure a database such that OLTP workloads are not bound by memory latency?

It would be interesting to see if these tricks could work in a distributed system, where the PIM modules are replaced by separate nodes in the system.
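Here is a minimal C sketch of the version-chain lookup that gets offloaded to a PIM module under MVCC. The record layout and visibility rule are simplified assumptions: each version record lives in the PIM module's local DRAM bank and points at the actual tuple in host DRAM.

```c
#include <stdint.h>
#include <stddef.h>

struct version {
    uint64_t begin_ts;            /* timestamp of the txn that created it   */
    uint64_t tuple_addr;          /* where the row data lives in host DRAM  */
    const struct version *next;   /* older version, or NULL                 */
};

/* Runs on the PIM module: chase the chain in bank-local memory and return
 * the host-DRAM address of the newest version visible at snapshot `ts`.
 * The CPU then reads the tuple itself from regular DRAM. */
uint64_t visible_version(const struct version *head, uint64_t ts) {
    for (const struct version *v = head; v != NULL; v = v->next) {
        if (v->begin_ts <= ts)    /* simplified visibility check */
            return v->tuple_addr;
    }
    return 0;                     /* no visible version */
}
```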

0 views

Parendi: Thousand-Way Parallel RTL Simulation

Parendi: Thousand-Way Parallel RTL Simulation Mahyar Emami, Thomas Bourgeat, and James R. Larus ASPLOS'25

This paper describes an RTL simulator running on (one or more) Graphcore IPUs. One nice side benefit of this paper is the quantitative comparison of IPU synchronization performance vs traditional CPUs. Here is another paper summary which describes some challenges with RTL simulation.

The Graphcore IPU used in this paper is a chip with 1472 cores, operating with a MIMD architecture. A 1U server can contain 4 IPUs. It is interesting to see a chip that was designed for DNN workloads adapted to the domain of RTL simulation.

Similar to other papers on RTL simulation, a fundamental step of the Parendi simulator is partitioning the circuit to be simulated. Parendi partitions the circuit into fibers. A fiber comprises a single (word-wide) register and all of the combinational logic which feeds it. Note that some combinational logic may be present in multiple fibers. Fig. 3 contains an example: node a3 is present in multiple fibers. As far as I can tell, Parendi does not try to deduplicate this work (extra computation to save synchronization).

Source: https://dl.acm.org/doi/10.1145/3676641.3716010

The driving factor in the design of this fiber-specific partitioning system is scalability. Each register has storage to hold the value of the register at the beginning and end of the current clock cycle (i.e., its current and next values). I think of the logic to simulate a single clock cycle with pseudo-code along the lines of the sketch at the end of this summary: every fiber evaluates its combinational logic against current values, a barrier, every register takes its next value, another barrier. Scalability comes from the fact that there are only two barriers per simulated clock cycle. This is an instance of the bulk synchronous parallel (BSP) model.

In many cases, there are more fibers than CPU/IPU cores. Parendi addresses this by distributing the simulation across chips and scheduling multiple fibers to run on the same core. If the simulation is distributed across multiple chips, then a min-cut algorithm is used to partition the fibers across chips while minimizing communication. The Parendi compiler statically groups multiple fibers together into a single process. A core simulates all fibers within a process. The merging process primarily seeks to minimize inter-core communication. First, a special-case merging algorithm merges multiple fibers which reference the same large array. This is to avoid communicating the contents of such an array across cores. I imagine this is primarily for simulation of on-chip memories. Secondly, a general-purpose merging algorithm merges fibers which each have low compute cost and high data sharing with each other.

Fig. 7 compares Parendi against Verilator, with the Verilator baselines running on two x86 servers: a 2-socket server with 28 Intel cores per socket, and a 2-socket server with 64 AMD cores per socket:

Source: https://dl.acm.org/doi/10.1145/3676641.3716010

Section 6.4 claims a roughly 2x improvement in cost per simulation using cloud pricing.

As far as I can tell, this system doesn't have optimizations for the case where some or all of a fiber's inputs do not change between clock cycles. It seems tricky to optimize for this case while maintaining a static assignment of fibers to cores.

Fig. 4 has a fascinating comparison of synchronization costs between an IPU and a traditional x64 CPU. This microbenchmark loads up the system with simple fibers (roughly 6 instructions per fiber). Note that the curves represent different fiber counts (e.g., the red dotted line represents 7 fibers on the IPU graph, vs 736 fibers on the x64 graph).
The paper claims that a barrier between 56 x64 threads implemented with atomic memory accesses consumes thousands of cycles, whereas the IPU has dedicated hardware barrier support.

Source: https://dl.acm.org/doi/10.1145/3676641.3716010

This seems to be one of many examples of how generic multi-core CPUs do not perform well with fine-grained multi-threading. We've seen it with pipeline parallelism, and now with the BSP model. Interestingly, both cases seem to work better with specialized multi-core chips (pipeline parallelism works with CPU-based SmartNICs, BSP works with IPUs). I'm not convinced this is a fundamental hardware problem that cannot be addressed with better software.
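Here is the per-clock-cycle pseudo-code referenced above, written as a C sketch: every fiber computes its register's next value from current values, then all registers flip together, with two barriers per simulated cycle (the BSP structure). The thread and barrier plumbing is plain pthreads on a CPU; on the IPU the barriers map to hardware sync. Names and structure are my own, not Parendi's code.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_FIBERS 4096

typedef struct {
    uint64_t cur;                       /* register value this cycle      */
    uint64_t next;                      /* value computed for next cycle  */
    uint64_t (*comb)(const uint64_t *); /* combinational cone feeding it  */
} fiber_t;

static fiber_t   fibers[NUM_FIBERS];
static uint64_t  cur_values[NUM_FIBERS];   /* snapshot read by comb logic */
static pthread_barrier_t barrier;          /* initialized elsewhere       */

/* Each core runs this for the contiguous range of fibers assigned to it. */
void simulate_cycle(size_t first, size_t last)
{
    /* Phase 1: evaluate combinational logic against current values. */
    for (size_t f = first; f < last; f++)
        fibers[f].next = fibers[f].comb(cur_values);
    pthread_barrier_wait(&barrier);

    /* Phase 2: commit -- every register takes its next value. */
    for (size_t f = first; f < last; f++) {
        fibers[f].cur = fibers[f].next;
        cur_values[f] = fibers[f].cur;
    }
    pthread_barrier_wait(&barrier);
}
```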

0 views

Skia: Exposing Shadow Branches

Skia: Exposing Shadow Branches Chrysanthos Pepi, Bhargav Reddy Godala, Krishnam Tibrewala, Gino A. Chacon, Paul V. Gratz, Daniel A. Jiménez, Gilles A. Pokam, and David I. August ASPLOS'25

This paper starts with your yearly reminder of the high cost of the Turing Tax: "Recent works demonstrate that the front-end is a considerable source of performance loss [16], with upwards of 53% of performance [23] bounded by the front-end."

Everyone knows that the front-end runs ahead of the back-end of a processor. If you want to think of it in AI terms, imagine a model that is told the current value and recent history of the program counter, and asked to predict future values of the program counter. The accuracy of these predictions determines how utilized the processor pipeline is. What I did not know is that in a modern processor, the front-end itself is divided into two decoupled components, one of which runs ahead of the other. Fig. 4 illustrates this Fetch Directed Instruction Prefetching (FDIP) microarchitecture:

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

The Instruction Address Generator (IAG) runs the furthest ahead and uses tables (e.g., the Branch Target Buffer (BTB)) in the Branch Prediction Unit (BPU) to predict the sequence of basic blocks which will be executed. Information about each predicted basic block is stored in the Fetch Target Queue (FTQ). The Instruction Fetch Unit (IFU) uses the control flow predictions from the FTQ to actually read instructions from the instruction cache. Some mispredictions can be detected after an instruction has been read and decoded. These result in an early re-steer (i.e., informing the IAG about the misprediction immediately after decode). When a basic block is placed into the FTQ, the associated instructions are prefetched into the IFU (to reduce the impact of instruction cache misses).

Shadow Branches

This paper introduces the term "shadow branch". A shadow branch is a (static) branch instruction which is currently stored in the instruction cache but is not present in any BPU tables. The top of fig. 5 illustrates a head shadow branch. A branch instruction caused execution to jump to byte 24 and execute the non-shaded instructions. This causes an entire cache line to be pulled into the instruction cache, including the branch instruction starting at byte 19. The bottom of fig. 5 shows a tail shadow branch. In this case, the instruction at byte 12 jumped away from the cache line, causing the red branch instruction at byte 16 to not be executed (even though it is present in the instruction cache).

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

Skia

The proposed design (Skia) allows the IAG to make accurate predictions for a subset of shadow branches, thus improving pipeline utilization and reducing instruction cache misses. The types of shadow branches which Skia supports are:

- Direct unconditional branches (target PC can be determined without looking at backend state)
- Function calls
- Returns

As shown in Fig. 6, these three categories of branches (purple, red, orange) account for a significant fraction of all BTB misses:

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

When a cache line enters the instruction cache, the Shadow Branch Decoder (SBD) decodes just enough information to locate shadow branches in the cache line and determine the target PC (for direct unconditional branches and function calls); a simplified sketch follows.
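As a toy model of what the SBD does, the C sketch below scans a 64-byte line for the two simplest x86-64 encodings, JMP rel32 (opcode 0xE9) and CALL rel32 (opcode 0xE8), and computes their statically-known targets. This is my simplification, not the paper's design: a real decoder must cope with prefixes, short-form branches, instructions that straddle cache lines, and the fact that scanning every byte offset can flag bytes that are not real instruction starts.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

struct shadow_branch {
    uint64_t branch_pc;   /* address of the branch instruction  */
    uint64_t target_pc;   /* statically computable target       */
    int      is_call;     /* CALL vs unconditional JMP          */
};

/* Returns the number of candidate shadow branches found in the line. */
size_t scan_line(const uint8_t *line, uint64_t line_base,
                 struct shadow_branch *out, size_t max_out)
{
    size_t n = 0;
    for (size_t i = 0; i + 5 <= LINE_BYTES && n < max_out; i++) {
        uint8_t op = line[i];
        if (op != 0xE8 && op != 0xE9)
            continue;
        int32_t rel32;
        memcpy(&rel32, &line[i + 1], sizeof rel32);   /* little-endian disp */
        uint64_t next_pc = line_base + i + 5;         /* 5-byte instruction */
        out[n].branch_pc = line_base + i;
        out[n].target_pc = next_pc + (int64_t)rel32;  /* PC-relative target */
        out[n].is_call   = (op == 0xE8);
        n++;
    }
    return n;
}
```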
Metadata from the SBD is placed into two new branch prediction tables in the BPU:

- The U-SBB holds information about direct unconditional branches and function calls
- The R-SBB holds information about returns

When the BPU encounters a BTB miss, it can fall back to the U-SBB or R-SBB for a prediction. Fig. 11 illustrates the microarchitectural changes proposed by Skia:

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

Section 4 goes into more details about these structures, including:

- Replacement policy
- How a shadow branch is upgraded into a first-class branch in the BTB
- Handling variable length instructions

Fig. 14 has (simulated) IPC improvements across a variety of benchmarks:

Source: https://dl.acm.org/doi/10.1145/3676641.3716273

Dangling Pointers

A common problem that HW and SW architects solve is getting teams out of a local minimum caused by fixed interfaces. The failure mode is when two groups of engineers agree on a static interface, and then each optimize their component as best they can without changing the interface. In this paper, the interface is the ISA, and Skia is a clever optimization inside of the CPU front-end. Skia shows that there is fruit to be picked here. It would be interesting to examine potential performance gains from architectural (i.e., ISA) changes to pick the same fruit.

0 views

Principles and Methodologies for Serial Performance Optimization

Principles and Methodologies for Serial Performance Optimization Sujin Park, Mingyu Guan, Xiang Cheng, and Taesoo Kim OSDI'25

This paper is a psychological trojan horse for computer nerds of a certain vintage. Every paragraph of sections 3 and 4 inflates the ego a bit more. One arrives at section 5 feeling good about their performance optimization skillset, and then one learns that these skills can be replaced by an LLM. A faint hissing sound reaches one's ears as one's ego deflates into a shriveled piece of plastic on the ground.

Eight Methodologies

The authors reviewed 477 papers containing specific instances of serial performance optimization and boiled the techniques down into eight categories. Serial is the operative word here: this paper is about optimizing portions of code that cannot be parallelized. However, some of the techniques are applicable in computations that comprise a serial portion (the critical path) and a parallel portion. Here are the techniques:

- Batching: amortizing a fixed cost over many items. For example, refactoring code so that some computation can be hoisted out of a (per-item) loop (a toy example appears at the end of this summary).
- Caching: storing the results of a computation in memory so that they can be used later on. Memoization is a good example.
- Precomputing: computing something earlier than is necessary (possibly speculatively). This works in programs which alternate between serial and parallel portions. If the precomputation can be done during a parallel portion, then it can be thought of as "free" because it is off of the critical path.
- Deferring: acting like one of my high schoolers: don't do work now when it could be done later. This works great in late spring, when a teacher realizes they have assigned too much work for the year so they cancel an assignment that most students (not mine) have already completed. Deferring interacts with other techniques: similar to precomputing, deferring can move work off of the critical path, and deferring can enable batching by holding work until a large batch has been accumulated.
- Computing a quick and dirty approximation to the right answer rather than the exactly right answer.
- Making a generic component more efficient for a specific use case. For example, a library user could pass hints at runtime which give the library more information to make tradeoffs, or profile-guided optimization could enable the compiler to make better decisions when compiling the generic library.
- Using a hardware accelerator, to avoid paying the Turing Tax.
- Chipping away at performance inefficiencies caused by the abstractions in a layered software architecture. For example, DPDK allows applications to bypass many networking abstractions.

After finishing section 4 of the paper, I felt pretty good. My ego happily agreed with a world view that says that serial performance optimization can be expressed in terms of 8 "basis vectors", all of which I had extensive experience with. And then came section 5.

The authors fine-tuned GPT-4o with a dataset derived from analyzing SOSP and OSDI papers. Each item in the dataset comprises a problem description, observations, and solutions (in terms of the 8 techniques described earlier). The authors made data and scripts available here. The fine-tuned model is called SysGPT. Table 4 shows example inputs (problems + observations), the output from GPT-4, the output from SysGPT, and the actual solution from a real-world paper.
Here is an example:

Source: https://www.usenix.org/conference/osdi25/presentation/park-sujin

Table 5 has quantitative results, where each model is only asked to choose a subset of the 8 optimization strategies for a given problem (N represents the number of example problems given to the baseline GPT-4o model in a prompt):

Source: https://www.usenix.org/conference/osdi25/presentation/park-sujin

Dangling Pointers

It would be interesting to extend these recipes to also include problems which can be parallelized.

This paper assumes that the problem and observations are produced by humans. It would be fascinating to see how much of that can be automated. For example, a model could have access to source code and profiling information, and output a set of observations about system performance before optimization.
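As a toy illustration of the batching/amortization recipe mentioned in the first bullet above (my example, not one from the paper), here is the classic hoist-the-fixed-cost refactor, using POSIX regex compilation as the per-item fixed cost:

```c
#include <regex.h>
#include <stddef.h>

/* Before: the regex is compiled once per item, so the fixed cost is paid n times. */
size_t count_matches_slow(const char **items, size_t n, const char *pattern) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        regex_t re;
        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return 0;
        if (regexec(&re, items[i], 0, NULL, 0) == 0) hits++;
        regfree(&re);
    }
    return hits;
}

/* After: compilation is hoisted out of the loop and amortized over the batch. */
size_t count_matches_fast(const char **items, size_t n, const char *pattern) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return 0;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (regexec(&re, items[i], 0, NULL, 0) == 0) hits++;
    regfree(&re);
    return hits;
}
```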

0 views

Accelerate Distributed Joins with Predicate Transfer


1 views

To PRI or Not To PRI, That's the question


0 views

State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing


0 views

3D Gaussian Splatting for Real-Time Radiance Field Rendering


0 views

Scaling IP Lookup to Large Databases using the CRAM Lens


0 views

Enabling Silent Telemetry Data Transmission with InvisiFlow


0 views
Dangling Pointers 1 month ago

Enabling Portable and High-Performance SmartNIC Programs with Alkali


0 views
Dangling Pointers 1 month ago

Ripple: Asynchronous Programming for Spatial Dataflow Architectures


0 views
Dangling Pointers 1 month ago

Iso: Request-Private Garbage Collection


0 views
Dangling Pointers 1 month ago

Making Concurrent Hardware Verification Sequential


0 views
Dangling Pointers 1 month ago

Instant neural graphics primitives with a multiresolution hash encoding


0 views
Dangling Pointers 1 month ago

Disentangling the Dual Role of NIC Receive Rings


0 views