Linker Pessimization
In a previous post , I wrote about linker relaxation : the linker’s ability to replace a slower, larger instruction with a faster, smaller one when it has enough information at link time. For instance, an indirect through the GOT can be relaxed into a direct plus a . This is a well-known technique to optimize the instructions for performance. Does it ever make sense to go the other direction ? 🤔 We’ve been working on linking some massive binaries that include Intel’s Math Kernel Library (MKL) , a prebuilt static archive. MKL ships as object files compiled with the small code-model ( ), meaning its instructions assume everything is reachable within ±2 GiB. The included object files also has some odd relocations where the addend is a very large negative number (>1GiB). The calculation for the relocation value is S + A - P : the symbol address plus the addend minus the instruction address. WIth a sufficiently large negative addend, the relocation value can easily exceed the 2 GiB limit and the linker fails with relocation overflows. We can’t recompile MKL (it’s a prebuilt proprietary archive), and we can’t simply switch everything to the large code model. What can we do? 🤔 I am calling this technique linker pessimization : the reverse of relaxation. Instead of shrinking an instruction, we expand one to tolerate a larger address space. 😈 The specific instructions that overflow in our case are (Load Effective Address) instructions. In x86_64, performs pure arithmetic: it computes and stores the result in without accessing memory. The is a 32-bit signed integer embedded directly into the instruction encoding, and the linker fills it in via an relocation. The relocation formula is S + A - P . Let’s look at an example with a large addend. A 32-bit signed integer can only represent ±2,048 MB (±2 GiB). Our value of −2,062 MB exceeds that range and the linker rightfully complains 💥: Note These instructions appear in MKL because the library uses them as a way to compute an address of a data table relative to the instruction pointer. The large negative addend ( ) is intentional ; it’s an offset within a large lookup table. The core idea is delightful because often as an engineer we are trained to optimize systems, but in this case we want the opposite. We swap the for a that reads through a nearby pointer. Recall from the relaxation post : relaxation shrinks instructions (e.g. indirect -> direct ). Here we do the opposite: we make the instruction do more work (pure arithmetic -> memory load) in exchange for a reachable displacement. That’s why I consider it a pessimization or reverse-relaxation . Both instructions use the same encoding length (7 bytes with a REX prefix), so the patch is a single byte change in the opcode. 🤓 The difference in behavior is critical: Original — the must reach across the entire binary: Pessimized — the reads a nearby pointer that holds the full address: We’ve traded one direct computation for an indirect through a pointer, and we make sure the displacement is now tiny. The 64-bit pointer slot can reach any address in the virtual address space. 👌 For each problematic relocation, three changes are needed in the object file: 1. Opcode Patch : In , change byte to (1 byte). This converts the (compute address) into a (load from address). The rest of the instruction encoding (ModR/M byte, REX prefix) stays identical because both instructions use the same operand format. 2. New Pointer Slot — Create a new section ( ) containing 8 zero bytes per patch site, plus a new relocation pointing to the original symbol with the original addend. is a 64-bit absolute relocation. Its formula is simply , no subtraction of . There is no 32-bit range limitation; it can address the entire 64-bit address space. This is the key insight that makes the fix work. 3. Retarget the Original Relocation — In the entry for the patched instruction, change the symbol to point at the new pointer slot in and update the type to . The addend becomes a small offset (the distance from the instruction to the fixup slot), which is guaranteed to fit. Note Because both and with a operand are exactly the same length (7 bytes with a REX prefix), we don’t shift any code, don’t invalidate any other relocations, and don’t need to rewrite any other parts of the object file. It’s truly a surgical patch. The pessimized now performs a memory load where the original did pure register arithmetic. That’s an extra cache line fetch and a data dependency. If this instruction is in a tight loop, it could be a performance hit. Optimization is the root of all evil, what does that make pessimization? 🧌 LEA : (arithmetic, no memory access). must encode the entire distance to the far-away data. This overflows. MOV : (memory load). points to a nearby 8-byte pointer slot. The pointer slot holds the full 64-bit address. This never overflows.