Leaving performance on the table
I have been working with LLVM at , and I have gotten to become familiar with the benefits of optimizing your workloads. I tend to think of optimizing my binaries as thinking about whether I have attached to my compiler flags; maybe if I’m particularly advanced that day I’ll sprinkle in some (link time optimziation) and call it a day. Turns out though that’s leaving lots of performance on the table. Compilers work under the assumption that every branch is is equally taken, unless you are hints like ( ref ). If we can feed the compilers more information about the likely path that our workloads often take, then they can produce much more performant code. There are two primary ways to optimize a binary: instrumented or statistical. When we instrument our binary, we run our workload with an instrumented binary and capture the exact paths that are executed. We will then optimize the binary perfectly tuned to that workload. If our workloads however are varied, we can collect profiles via over a length of time and create an optimized binary based on the statistical occurence of call graphs. Both approaches have their benefits however let’s start with the instrumented variant first, as it’s a little easier to follow and understand. Let’s look at a very simple benchmark. We will calculate fibonocci using SQL in sqlite3 . This is an ideal workload because it’s purely CPU-bound and ripe for optimizing. We will compile from source by downloading it. We can compile a “traditional” optimized binary that merely has and also a version that has LTO enabled since I was also keen to see how much LTO itself adds. Ok, so it looks like our program takes roughly 14-15 seconds to run. Sounds ok? How much better can we do…. 🤔 Next, we compile our program again but we instrument the binary , which effectively injects counters into the program to count invocations of functions. We get very accurate counts of our calls but the binary itself now runs much slower, which can be a problem if your workload was already very slow. Luckily for us, we are in a time domain (~15 seconds), where that is ok. After we have our instrumented binary, we run our workload again to generate the profile data and rebuild the binary with that data. The last step will be to optimize with BOLT, which is a post-link optimizer, which requires us to keep relocations so I’ve also added . When we run our workload with the final optimized binary, we see massive improvement already! 🤯 We’ve cut our workload time down to ~10 seconds which is a nearly a 1.5x improvement. Now let’s optimize the final binary with LLVM’s BOLT . BOLT is a post-link optimizer designed for “large applications”. What this means, is that it largely works by shuffling code around the binary to keep code-paths that have high temporal locality near each other (spatial locality). This can have positive impact on performance due to the instruction cache for instance. Looks like it was a little faster but not much. That makes sense since itself is a pretty small binary (~6MB), but nontheless was good to run through. Running a more thorough benchmark with we can get a final tally of our results. Looks like the I got from the Fedora ecosystem was the slowest . When all the optimizations were applied I was able to get a maximum of 1.38x faster than what was available. These optimizations would be even more dramatic for code-bases that are a sprawl and can heavily vary. Don’t worry also about getting the profile perfectly tuned to your workloads. I have a coworker who often cites that even poor profiles are still much better than no profile at all.