Does anyone actually use the large code-model?
I have been focused lately on trying to resolve relocation overflows when compiling large binaries under the small and medium code-models. When talking to others about the problem, they are quick to offer the idea of using the large code-model. Despite the performance downsides of the instructions it generates, it's true that its intent was to support arbitrarily large binaries. However, does anyone actually use it?

It turns out that large binaries do not only affect the instructions generated in the .text section; they may also have effects on other sections within the ELF file, such as .eh_frame (exception handling information), .eh_frame_hdr (an optimized binary search table for .eh_frame), and even .gcc_except_table.

Let's take .eh_frame and .eh_frame_hdr as an example. They specifically allow various encodings for the data within them (DW_EH_PE_sdata4 or DW_EH_PE_sdata8 for 4 bytes and 8 bytes respectively) irrespective of the code-model used. However, it looks like userland support for these encodings is terrible!

If we look at the .eh_frame_hdr format, we can see how these encodings are applied in practice. The "encoded" entries below are the ones that actually resolve to specific DWARF exception header encoding formats (like DW_EH_PE_udata4, DW_EH_PE_sdata4, DW_EH_PE_sdata8, etc.) depending on the values provided in the preceding *_enc fields.

.eh_frame_hdr format [ref]:

    version             : unsigned byte
    eh_frame_ptr_enc    : unsigned byte
    fde_count_enc       : unsigned byte
    table_enc           : unsigned byte
    eh_frame_ptr        : encoded (per eh_frame_ptr_enc)
    fde_count           : encoded (per fde_count_enc)
    binary search table : fde_count entries, encoded per table_enc

Note: The values of eh_frame_ptr_enc, fde_count_enc, and table_enc dictate the byte size and format of the fields they describe. For example, if fde_count_enc is set to DW_EH_PE_sdata4, the fde_count field will be processed as a signed 4-byte value.

Up until very recently ( pull#179089 ), LLVM's linker would crash if it tried to link exception data (.eh_frame_hdr) beyond 2GiB. This section is always generated to help stack-searching algorithms avoid a linear scan. Once we fix that, though, it looks like the unwinder runtimes ( gcc-patch@ ) and ( pull#964 ) explicitly either crash on the binary search table or avoid it completely, reverting back to linear search.

How devastating is linear search here? If you have a lot of exceptions, which you theoretically might with the large code-model, very: I had benchmarks that improved from ~13s to ~18ms, a ~700x speedup.
Other fun failure modes exist as well.

Note: Don't let the name confuse you; it's actually 32-bit.

It seems like the large code-model "exists", but no one is using it for its intended purpose, which was to build large binaries. I am working to make massive binaries possible without the large code-model while retaining much of the performance characteristics of the small code-model. You can read more about it in the x86-64-abi Google group, where I have also posted an RFC.