Skia: Exposing Shadow Branches
Skia: Exposing Shadow Branches Chrysanthos Pepi, Bhargav Reddy Godala, Krishnam Tibrewala, Gino A. Chacon, Paul V. Gratz, Daniel A. Jiménez, Gilles A. Pokam, and David I. August ASPLOS'25 This paper starts with your yearly reminder of the high cost of the Turing Tax : Recent works demonstrate that the front-end is a considerable source of performance loss [16], with upwards of 53% of performance [23] bounded by the front-end. Everyone knows that the front-end runs ahead of the back-end of a processor. If you want to think of it in AI terms, imagine a model that is told about the current value of and recent history of the program counter, and asked to predict future values of the program counter. The accuracy of these predictions determines how utilized the processor pipeline is. What I did not know is that in a modern processor, the front-end itself is divided into two decoupled components, one of which runs ahead of the other. Fig. 4 illustrates this Fetch Direction Instruction Processing (FDIP) microarchitecture: Source: https://dl.acm.org/doi/10.1145/3676641.3716273 The Instruction Address Generator (IAG) runs the furthest ahead and uses tables (e.g., the Branch Target Buffer (BTB)) in the Branch Prediction Unit (BPU) to predict the sequence of basic blocks which will be executed. Information about each predicted basic block is stored in the Fetch Target Queue (FTQ). The Instruction Fetch Unit (IFU) uses the control flow predictions from the FTQ to actually read instructions from the instruction cache. Some mispredictions can be detected after an instruction has been read and decoded. These result in an early re-steer (i.e., informing the IAG about the misprediction immediately after decode). When a basic block is placed into the FTQ, the associated instructions are prefetched into the IFU (to reduce the impact of instruction cache misses). This paper introduces the term “shadow branch”. A shadow branch is a (static) branch instruction which is currently stored in the instruction cache but is not present in any BPU tables. The top of fig. 5 illustrates a head shadow branch. A branch instruction caused execution to jump to byte 24 and execute the non-shaded instructions. This causes an entire cache line to be pulled into the instruction cache, including the branch instruction starting at byte 19. The bottom of fig. 5 shows a tail shadow branch. In this case, the instruction at byte 12 jumped away from the cache line, causing the red branch instruction at byte 16 to not be executed (even though it is present in the instruction cache). Source: https://dl.acm.org/doi/10.1145/3676641.3716273 Skia The proposed design (Skia) allows the IAG to make accurate predictions for a subset of shadow branches, thus improving pipeline utilization and reducing instruction cache misses. The types of shadow branches which Skia supports are: Direct unconditional branches (target PC can be determined without looking at backend state) Function calls As shown in Fig. 6, these three categories of branches (purple, red, orange) account for a significant fraction of all BTB misses: Source: https://dl.acm.org/doi/10.1145/3676641.3716273 When a cache line enters the instruction cache, the Shadow Branch Decoder (SBD) decodes just enough information to locate shadow branches in the cache line and determine the target PC (for direct unconditional branches and function calls). Metadata from the SBD is placed into two new branch prediction tables in the BPU: The U-SBB holds information about direct unconditional branches and function calls The R-SBB holds information about returns When the BPU encounters a BTB miss, it can fall back to the U-SBB or R-SBB for a prediction. Fig. 11 illustrates the microarchitectural changes proposed by Skia: Source: https://dl.acm.org/doi/10.1145/3676641.3716273 Section 4 goes into more details about these structures including: Replacement policy How a shadow branch is upgraded into a first-class branch in the BTB Handling variable length instructions Fig. 14 has (simulated) IPC improvements across a variety of benchmarks: Source: https://dl.acm.org/doi/10.1145/3676641.3716273 Dangling Pointers A common problem that HW and SW architects solve is getting teams out of a local minimum caused by fixed interfaces. The failure mode is when two groups of engineers agree on a static interface, and then each optimize their component as best they can without changing the interface. In this paper, the interface is the ISA, and Skia is a clever optimization inside of the CPU front-end. Skia shows that there is fruit to be picked here. It would be interesting to examine potential performance gains from architectural (i.e., ISA) changes to pick the same fruit. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts. Source: https://dl.acm.org/doi/10.1145/3676641.3716273 The Instruction Address Generator (IAG) runs the furthest ahead and uses tables (e.g., the Branch Target Buffer (BTB)) in the Branch Prediction Unit (BPU) to predict the sequence of basic blocks which will be executed. Information about each predicted basic block is stored in the Fetch Target Queue (FTQ). The Instruction Fetch Unit (IFU) uses the control flow predictions from the FTQ to actually read instructions from the instruction cache. Some mispredictions can be detected after an instruction has been read and decoded. These result in an early re-steer (i.e., informing the IAG about the misprediction immediately after decode). When a basic block is placed into the FTQ, the associated instructions are prefetched into the IFU (to reduce the impact of instruction cache misses). Shadow Branches This paper introduces the term “shadow branch”. A shadow branch is a (static) branch instruction which is currently stored in the instruction cache but is not present in any BPU tables. The top of fig. 5 illustrates a head shadow branch. A branch instruction caused execution to jump to byte 24 and execute the non-shaded instructions. This causes an entire cache line to be pulled into the instruction cache, including the branch instruction starting at byte 19. The bottom of fig. 5 shows a tail shadow branch. In this case, the instruction at byte 12 jumped away from the cache line, causing the red branch instruction at byte 16 to not be executed (even though it is present in the instruction cache). Source: https://dl.acm.org/doi/10.1145/3676641.3716273 Skia The proposed design (Skia) allows the IAG to make accurate predictions for a subset of shadow branches, thus improving pipeline utilization and reducing instruction cache misses. The types of shadow branches which Skia supports are: Direct unconditional branches (target PC can be determined without looking at backend state) Function calls Source: https://dl.acm.org/doi/10.1145/3676641.3716273 When a cache line enters the instruction cache, the Shadow Branch Decoder (SBD) decodes just enough information to locate shadow branches in the cache line and determine the target PC (for direct unconditional branches and function calls). Metadata from the SBD is placed into two new branch prediction tables in the BPU: The U-SBB holds information about direct unconditional branches and function calls The R-SBB holds information about returns Source: https://dl.acm.org/doi/10.1145/3676641.3716273 Section 4 goes into more details about these structures including: Replacement policy How a shadow branch is upgraded into a first-class branch in the BTB Handling variable length instructions