Efficient Remote Memory Ordering for Non-Coherent Systems
Efficient Remote Memory Ordering for Non-Coherent Systems
Wei Siew Liew, Md Ashfaqur Rahaman, Adarsh Patil, Ryan Stutsman, and Vijay Nagarajan
ASPLOS’26

It seems like every year there is a new PCIe standard that doubles bandwidth. The key takeaway from this paper is that these improvements are for the fast path, but there exist use cases which are crippled by details of the PCIe protocol. This paper describes two inefficiencies caused by the current protocol and suggests improvements to address them.

From my experience, here is the canonical way for the host x64 CPU to send a request to a PCIe device and receive a response. First, the request payload is stored in host memory (which can be mapped as CPU cacheable). During this time, the device does not read any of the payload. Next, a handful of MMIO writes (to uncacheable memory) are used to point the device to the payload and kick off the work. The device stores results via posted DMA writes to host memory. After producing all results, the device uses more posted DMA writes to update control information in host memory. Finally (and optionally), the device signals an interrupt. From the perspective of the device, the interrupt is another posted DMA write to the host.

Posted DMA writes are visible to the host in the order they are issued. In other words, the host is guaranteed that it will only observe the control information update after the response payload has been written. Similarly, the host will receive the interrupt after the control information is written.

The problems identified by this paper are caused by a lack of ordering guarantees. Table 1 illustrates where ordering is enforced today:

Source: https://dl.acm.org/doi/10.1145/3779212.3790156

W→W ordering means two writes from a device will appear to land in host memory in the order they were issued. W→R means that if a device writes to an address in host memory and then issues a read of that address, the read response will contain the updated data.
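To make the W→W guarantee concrete, here is a minimal toy model (not from the paper) of a host polling while a device's posted writes land. The address names `payload0`, `payload1`, and `done`, and the function `host_can_see_torn_response`, are made up for illustration. The model assumes the host may sample memory after any individual write lands.

```python
from itertools import permutations

# Hypothetical device writes: two payload words, then a "done" control word.
DEVICE_WRITES = [("payload0", 7), ("payload1", 9), ("done", 1)]


def host_can_see_torn_response(ordered):
    """Return True if the host could observe done == 1 before the full payload."""
    if ordered:
        # W->W ordering: writes land exactly in issue order.
        landing_orders = [tuple(DEVICE_WRITES)]
    else:
        # No ordering: writes may land in any order.
        landing_orders = permutations(DEVICE_WRITES)

    for order in landing_orders:
        mem = {"payload0": 0, "payload1": 0, "done": 0}
        for addr, val in order:
            mem[addr] = val  # one posted write lands in host memory
            # The host may poll at any point between writes.
            payload = (mem["payload0"], mem["payload1"])
            if mem["done"] == 1 and payload != (7, 9):
                return True
    return False
```

With `ordered=True` the function finds no schedule in which the host sees the control word before the payload, which is the W→W guarantee in Table 1. With `ordered=False` such a schedule exists.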
For more details, there is a great explainer in a LinkedIn post here. PCIe has provisions that allow devices to relax these ordering guarantees, but there is no way to enforce more ordering.

“R→R No” means that if a device issues two DMA read requests, the read response data could appear as if the reads occurred in the wrong order. This matters for scenarios where the host CPU is actively writing to the same data structures that the device is reading from.

Imagine an application that involves two data structures: an array of data and an array of flags. A data element is only valid if the corresponding flag is set. Carefully written software could update these data structures, ensuring that a flag is set only after the associated data is written. The trouble is there is no efficient way for a PCIe device to pipeline reads of a flag and its data. For a given index, the device has no choice but to read the flag, wait for the response, and then issue the read of the data. Some systems work around this by ensuring that data and metadata are stored in the same cache line.

A more realistic application is a key-value store that is concurrently updated by the host CPU and read by a device (e.g., RDMA reads from a NIC). It is hard to develop such a system in a way that offers both strong consistency and high performance.

The solution proposed by the authors is to add semantics similar to acquire and release to PCIe. A read TLP with the acquire bit set would not be reordered after subsequent reads. A write TLP with the release bit set would not be reordered before prior writes.

The other inefficient scenario described by this paper is caused by the lack of W→W MMIO ordering. The problem here is in the CPU architecture. On x86, an MMIO region can be mapped as write-combined, but the application must issue expensive instructions to ensure that writes appear in the correct order. This restricts MMIO writes to low-bandwidth scenarios.
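Returning to the flag-and-data scenario above, the R→R hazard can be shown with a small toy model. This is not the paper's simulator. It enumerates every interleaving of two host writes and two device reads and checks whether the device can see a set flag alongside stale data. The names `data`, `flag`, `interleavings`, and `stale_read_possible` are assumptions for illustration. An acquire on the flag read is modeled simply as forbidding the data read from executing before it.

```python
# Host program: store the data, then set the flag (in that order).
HOST_WRITES = [("data", 42), ("flag", 1)]


def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def stale_read_possible(acquire_on_flag):
    """Return True if the device can observe flag == 1 with stale data."""
    # The device issues "read flag, read data". Without an acquire on the
    # flag read, the two reads may execute in either order.
    read_orders = [["flag", "data"]]
    if not acquire_on_flag:
        read_orders.append(["data", "flag"])

    writes = [("W", addr, val) for addr, val in HOST_WRITES]
    for order in read_orders:
        reads = [("R", addr, None) for addr in order]
        for schedule in interleavings(writes, reads):
            mem = {"data": 0, "flag": 0}
            seen = {}
            for kind, addr, val in schedule:
                if kind == "W":
                    mem[addr] = val
                else:
                    seen[addr] = mem[addr]
            if seen["flag"] == 1 and seen["data"] != 42:
                return True
    return False
```

In this model, `stale_read_possible(False)` finds a bad schedule (the data read executes before the host's writes and the flag read executes after them), while `stale_read_possible(True)` finds none. That is why an acquire-style read would let a device issue both reads back to back instead of waiting for the flag response.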
The architectural solution described by this paper is to add four explicit MMIO instructions to the ISA: MMIO-Store, MMIO-Release, MMIO-Load, and MMIO-Acquire. MMIO-Store and MMIO-Release are store instructions. MMIO-Release has release semantics (it will not be reordered before prior stores). MMIO-Load and MMIO-Acquire are load instructions. MMIO-Acquire instructions are not reordered after subsequent loads.

The authors note that RISC-V has similar instructions, but they involve the CPU stalling to implement the desired memory ordering. The solution offered by this paper instead involves a re-order buffer in the PCIe root complex. The CPU assigns sequence numbers to MMIO operations, and the root complex uses those sequence numbers to restore operations to their correct order.

Fig. 6 has simulation numbers projecting the speedups an RDMA-based key-value store could see if it could properly pipeline DMA reads. Two of the configurations shown are the work described by this paper; a third assumes additional speculative optimizations in the root complex.

Source: https://dl.acm.org/doi/10.1145/3779212.3790156

Fig. 10 shows how fast a single core can perform unordered MMIO writes. The idea is that if the CPU architecture is enhanced to allow an application to express just the right amount of ordering, it could be possible for a single core to write packet data as fast as a NIC can consume it.

Source: https://dl.acm.org/doi/10.1145/3779212.3790156

Dangling Pointers

The paper ends with this food for thought:

By establishing a high-performance baseline for non-coherent I/O, this work raises the question of whether the complexity of coherent interconnects (like CXL) is truly necessary for future host-device communication.