
ssh-agent broken in tmux? I've got you!

A little over two years ago, I wrote an article titled SSH agent forwarding and tmux done right. In it, I described how SSH agent forwarding works—a feature that lets a remote machine use the credentials stored in your local ssh-agent instance—and how using a console multiplexer like tmux or screen often breaks it.

In that article, I presented ssh-agent-switcher: a program I put together in a few hours to fix this problem. In short, ssh-agent-switcher exposes an agent socket at a stable default location and proxies all incoming credential requests to the transient socket that the sshd server creates on a per-connection basis.

In this article, I want to formalize this project by presenting its first actual release, 1.0.0, and explain what has changed to warrant this release number. I put effort into creating this formal release because ssh-agent-switcher has organically gained more interest than I imagined, as it solves a real problem that various people have.

Some background

When I first wrote ssh-agent-switcher, I did so to fix a problem I was having at work: we were moving from local developer workstations to remote VMs, we required SSH to work on the remote VMs for GitHub access, and I kept hitting problems with the ssh-agent forwarding feature breaking because I’m an avid user of tmux. To explain the problem to my peers, I wrote the aforementioned article and prototyped ssh-agent-switcher after hours to demonstrate a solution. In the end, the team took a different route for our remote machines, but I kept using this little program on my personal machines.

Because of work constraints, I had originally written ssh-agent-switcher in Go and had used Bazel as its build system. I also used my own shtk library to quickly write a bunch of integration tests and, because of the Bazel requirement, I even wrote my first ruleset, rules_shtk, to make that possible. The program worked, but due to the apparent lack of interest, I considered it “done”, and what you found on GitHub was a code dump of a little project I wrote in a couple of free evenings.

New OpenSSH naming scheme

Recently, however, ssh-agent-switcher stopped working on a Debian testing machine I run and I had to fix it. Luckily, someone had sent a bug report describing what the problem was: OpenSSH 10.1 had changed the location where sshd creates the forwarding sockets and even changed their naming scheme, so ssh-agent-switcher had to adapt.

Fixing this issue was straightforward, but doing so made me “touch” the ssh-agent-switcher codebase again and got me interested in tweaking it further.

My energy to work on side projects like this one and to write about them comes from your support. Subscribe now to motivate future content!

The Rust rewrite

As I wanted to modernize this program, one thing kept rubbing me the wrong way: I had originally forced myself to use Go because of potential work constraints. As those requirements never became relevant and I “needed to write some code” to quench some stress, I decided to rewrite the program in Rust. Why, you ask? Just because I wanted to. It’s my code and I wanted to have fun with it, so I did the rewrite.

Which took me on a detour. You see: while command-line parsing in Rust CLI apps is a solved problem, I had been using the ancient getopts crate in other projects of mine out of inertia. Using either library requires replicating some boilerplate across apps that I don’t like, so… I ended up cleaning up that “common code” as well and putting it into a new crate aptly-but-oddly named getoptsargs. Take a look and see if you find it interesting… I might write a separate article on it.
Doing this rewrite also made me question the decision to use Bazel (again imposed by constraints that never materialized) for this simple tool: as much as I like the concepts behind this build system and think it’s the right choice for large codebases, it was just too heavy for a trivial program like ssh-agent-switcher. So… I just dropped Bazel and wrote a Makefile—which you’d think isn’t necessary for a pure Rust project, but remember that this codebase includes shell tests too.

Daemonization support

With the Rust rewrite done, I was now on a path to making ssh-agent-switcher a “real project”, so the first thing I wanted to fix was the ugly setup instructions from the original code dump. Here is what the project README used to tell you to write into your shell startup scripts:
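Roughly this shape, anyway; the socket path, the duplicate-process guard, and the redirections here are illustrative rather than the verbatim README contents:

    # Start the switcher once per login and detach it from the session.
    if ! pgrep -x ssh-agent-switcher >/dev/null 2>&1; then
        nohup ssh-agent-switcher >/dev/null 2>&1 &
        disown 2>/dev/null || true    # bash/zsh-only builtin; other shells differ
    fi
    # Point SSH clients at the stable socket the switcher serves.
    export SSH_AUTH_SOCK="${HOME}/.ssh/agent.sock"    # illustrative path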
Yikes. You needed shell-specific logic to detach the program from the controlling session so that it didn’t stop running when you logged out, as that would have made ssh-agent-switcher suffer from the exact same problems as regular sshd socket handling. The solution was to make ssh-agent-switcher become a daemon on its own, with proper logging and “singleton” checking via PID file locking, so now you can reliably start it from any shell with a single, shell-agnostic command.

I suppose you could make systemd start and manage ssh-agent-switcher automatically with a per-user socket trigger, without needing the daemonization support in the binary per se… but I care about more than just Linux, so assuming the presence of systemd is not an option.

Going async

With that done, I felt compelled to fix a zero-day TODO that kept causing trouble for people: a fixed-size buffer used to proxy requests between the SSH client and the forwarded agent. This limitation caused connections to stall if the response from the ssh-agent contained more keys than fit in the buffer. The workaround had been to make the fixed-size buffer “big enough”, but that was still insufficient for some outlier cases and came with the assumption that the messages sent over the socket would fit in the OS’s internal buffers in one go as well. No bueno.

Fixing this properly required one of the following: adding threads to handle reads and writes over the two sockets in any order; dealing with the annoying select/poll family of system calls; or using an async runtime and library (tokio) to deal with the event-like nature of proxying data between two network connections. People dislike async Rust for some good reasons, but async is the way to get to the real fearless-concurrency promise. I did not fancy managing threads by hand, and I did not want to deal with manual event handling… so async it was.

And you know what? Switching to async had two immediate benefits:

- Handling termination signals with proper cleanup became straightforward. The previous code had to install a signal handler and deal with potential races in the face of blocking system calls by manually polling for incoming requests, which isn’t good if you like power efficiency. Using tokio made this trivial, and in a way that I can more easily trust is correct.
- I could easily implement the connection proxying with an event-driven loop, without having to reason about threads and their termination conditions.

Funnily enough, after a couple of hours of hacking, I felt proud of the proxying algorithm and the comprehensive unit tests I had written, so I asked Gemini for feedback, and… while it told me my code was correct, it also said I could replace it all with a single call to a primitive! Fun times. I still don’t trust AI to write much code for me, but I do like it a lot for code reviews.
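For flavor, here is a minimal sketch of what that kind of proxy looks like once tokio is in the picture; the primitive in question is presumably something like tokio::io::copy_bidirectional, and the socket paths below are placeholders rather than the ones ssh-agent-switcher actually uses:

    // Requires the tokio crate (net, io-util, rt, macros features).
    use tokio::io::copy_bidirectional;
    use tokio::net::{UnixListener, UnixStream};

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        // Stable socket that SSH_AUTH_SOCK points at (placeholder path).
        let listener = UnixListener::bind("/tmp/stable-agent.sock")?;
        loop {
            let (mut client, _) = listener.accept().await?;
            tokio::spawn(async move {
                // Transient socket created by sshd for this connection (placeholder path).
                match UnixStream::connect("/tmp/forwarded-agent.sock").await {
                    Ok(mut agent) => {
                        // Shuttle bytes in both directions until either side closes,
                        // with no fixed-size buffer assumptions in sight.
                        let _ = copy_bidirectional(&mut client, &mut agent).await;
                    }
                    Err(e) => eprintln!("cannot reach forwarded agent: {e}"),
                }
            });
        }
    }

The real program obviously does more (socket discovery, filtering, logging), but this captures why the fixed-size buffer could go away entirely.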
Even with tokio in the picture and all of the recent new features and fixes, the Rust binary of ssh-agent-switcher is still smaller (by 100 KB or so) than the equivalent Go one, and I trust its implementation more.

Knowing that various people had found this project useful over the last two years, I decided to conclude this sprint by creating an actual “formal release” of ssh-agent-switcher. Formal releases require:

- Documentation, which made me write a manual page.
- A proper installation process, which made me write a traditional install script because cargo install doesn’t support installing supporting documents.
- A tag and release number, which many people forget about doing these days but which are critical if you want the code to be packaged by upstream OSes.

And with that, ssh-agent-switcher 1.0.0 went live on Christmas day of 2025. pkgsrc already has a package for it; what is your OS waiting for? 😉


From Azure Functions to FreeBSD

On Thanksgiving morning, I woke up to one of my web services being unavailable. All HTTP requests failed with a “503 Service unavailable” error. I logged into the console, saw a simplistic “Runtime version: Error” message, and was not able to diagnose the problem. I did not spend a lot of time trying to figure the issue out, and I didn’t even want to contact the support black hole. Because… there was something else hidden behind an innocent little yellow warning at the top of the dashboard:

“Migrate your app to Flex Consumption as Linux Consumption will reach EOL on September 30 2028 and will no longer be supported.”

I had known for a few weeks now, while trying to set up a new app, that all of my Azure Functions apps were on death row. The free plan I was using was going to be decommissioned, and the alternatives I tried didn’t seem to support custom handlers written in Rust. I still had three years to deal with this, but hitting a showstopper error pushed me to take action. All of my web services are now hosted by the FreeBSD server in my garage with just a few tweaks to their codebase. This is their migration story.

Blog System/5 and the open source projects described below are all made in my limited free time. Subscribe now to show your support; it goes a long way!

How did I get here?

Back in 2021, I had been developing my EndBASIC language for over a year and I wanted to create a file sharing service for it. Part of this was to satisfy my users, but another part was to force myself into the web services world, as I felt “behind”. At that time, I had also been at Microsoft for a few months already, working on Azure Storage. One of the perks of the job was something like $300 of yearly credit to deploy stuff on Azure for learning purposes. It was only “natural” that I’d pick Azure for what I wanted to do with EndBASIC.

Now… $300 can be plentiful for a simple app, but it can also be paltry. Running a dedicated VM would eat through this in a couple of months, but the serverless model offered by Azure Functions with its “infinite” free tier would go a long way. I looked at their online documentation, found a very good guide on how to deploy Rust-native functions onto a Linux runtime, and… I was sold. I quickly got a bare-bones service up and running on Azure Functions and built it up from there. Based on these foundations, I later developed a separate service for my own site analytics (poorly named EndTRACKER), and I recently started working on a new service to provide secure auto-unlock of encrypted ZFS volumes (stay tuned!).

And, for the most part, the experience with Azure has been neat. I learned a bunch and I got to a point where I had set up “push on green” via GitHub Actions and dual staging vs. prod deployments. The apps ran completely on their own for the last three years, a testament to the stability of the platform and to the value of designing for testability. Until now, that is.

The cloud database

Compute-wise, I was set: Azure Functions worked fine as the runtime for my apps’ logic and it cost pennies to run, so the $300 was almost untouched. But web services aren’t made of compute alone: they need to store data, which means they need a database. My initial research in 2021 concluded that the only option for a database instance with a free plan was to go with, no surprise, serverless Microsoft SQL Server (MSSQL). I had never used Microsoft’s offering, but it couldn’t be that different from PostgreSQL or MySQL, could it? Maybe so, but I didn’t get very far in that line of research.

The very first blocker I hit was that the MSSQL connection required TLS, and this hadn’t been implemented in the connector I chose to use for my Rust-based functions. I wasted two weeks implementing TLS support in that connector (see PR #1200 and PR #1203) and got it to work, but the code was not accepted upstream because it conflicted with their business strategy. Needless to say, this was disappointing because getting that to work was a frigging nightmare. In any case, once I passed that point, I started discovering more missing features and bugs in the MSSQL connector, and then I also found some really weird surprises in MSSQL’s dialect of SQL. TL;DR, this turned into a dead end.

On the left, the default instance and cost selected by Azure when choosing to create a managed PostgreSQL server today. On the right, the minimum possible cost after dialing down CPU, RAM, disk, and availability requirements.

I had no choice other than to provision a full PostgreSQL server on Azure. Their onboarding wizard tried to push me towards a pretty beefy and redundant instance that would cost over $600 per month, when all I needed was the lowest machine you could get for the amount of traffic I expected. Those options were hidden under a “for development only” panel and riddled with warnings about no redundancy, but after I dialed all the settings down and accepted the “serious risks”, I was left with an instance that’d cost $15 per month or so. This fit well within the free yearly credit I had access to, so that was it.
The outage and trigger

About two months ago, I started working on a new service to securely auto-unlock ZFS encrypted volumes (more details coming). For this, I had to create a new Azure Functions deployment… and I started seeing the writing on the wall. I don’t remember the exact details, but it was really difficult to get the creation wizard to provision the same flex plan I had used for my other services, and it kept warning me that the selected plan was going to be axed in 2028.

At the time of this writing, 2028 is still three years out and this warning was for a new service I was creating, so I didn’t want to consider migrating either EndBASIC or EndTRACKER to something else just yet. Until Thanksgiving, that was.

On Thanksgiving morning, I noticed that my web analytics had stopped working. All HTTP API requests failed with a “503 Service unavailable” error but, interestingly, the cron-triggered APIs were still running in the background just fine, and the staging deployment slot of the same app worked fine end-to-end as well. I tried redeploying the app with a fresh binary, thinking that a refresh would fix the problem, but that was of no use. I also poked through the dashboard trying to figure out what “Runtime version: Error” would be about, making sure the runtime version spec was up to date, and couldn’t figure it out either.

Summary state of my problematic Azure Functions deployment. Note the cryptic runtime error along with the subtle warning at the top about upcoming deprecations.

So… I had to get out of Azure Functions, quick. Not accidentally, I had bought a second-hand, over-provisioned ThinkStation (2x36-core Xeon E5-2697, 64 GB of RAM, a 2 TB NVMe drive, and a 4x4 TB HDD array) just two years back. The justification I gave myself was to use it as my development server, but I had this idea in the back of my mind to use it to host my own services at some point. The time had come to put it to work serving real-world traffic with FreeBSD 14.x.

From serverless to serverful

The way you run a serverless Rust (or Go) service on Azure Functions is by creating a binary that exposes an HTTP server on a port provided to it through an environment variable. Then, you package the binary along with a set of metadata JSON files that tell the runtime which HTTP routes the binary serves, and you push the packaged ZIP file to Azure. From there on, the Azure Functions runtime handles TLS termination for those routes, spawns your binary server on a micro VM on demand, and redirects the requests to it.
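Stripped to its essence, such a custom-handler binary is just this (a standard-library-only sketch; the real services use a proper HTTP framework, and FUNCTIONS_CUSTOMHANDLER_PORT is the variable Azure documents for custom handlers):

    use std::env;
    use std::io::{Read, Write};
    use std::net::TcpListener;

    fn main() -> std::io::Result<()> {
        // Custom handler contract: listen on the port the runtime gives us;
        // fall back to a fixed port for local runs.
        let port: u16 = env::var("FUNCTIONS_CUSTOMHANDLER_PORT")
            .ok()
            .and_then(|p| p.parse().ok())
            .unwrap_or(8080);
        let listener = TcpListener::bind(("127.0.0.1", port))?;
        for stream in listener.incoming() {
            let mut stream = stream?;
            // Read (and ignore) whatever fits of the request; a real server parses it.
            let mut buf = [0u8; 4096];
            let _ = stream.read(&mut buf);
            stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")?;
        }
        Ok(())
    }

Because the binary is already a standalone HTTP server, the migration is mostly a matter of deciding who listens on the public side and who supervises the process.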
By removing the Azure Functions runtime from the picture, I had to make my server binary stand on its own. This was actually pretty simple because the binary was already an HTTP server: it just had to be coerced into playing nicely with FreeBSD’s approach to running services. In particular, I had to:

- Inject configuration variables into the server process at startup time. These used to come from the Azure Functions configuration page and are necessary to tell the server where the database lives and what credentials to use.
- Make the service run as an unprivileged user, easily.
- Create a PID file to track the execution of the process so that the framework could handle restarts and stop requests.
- Store the logs that the service emits via stderr in a log file, and rotate that log to prevent local disk overruns.

Most daemons implement all of the above as features in their own code, but I did not want to have to retrofit all of these into my existing HTTP service in a rush. Fortunately, FreeBSD provides a little tool, daemon(8), which wraps an existing binary and offers all of the above.
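The incantation that got me going was roughly of this shape; the service name, user, and paths are placeholders, and the flags map one-to-one to the needs listed above:

    # daemon(8) supervises the server, writes a PID file, redirects output to a
    # log file, reopens that log on SIGHUP, drops privileges, and sets a title.
    /usr/sbin/daemon -P /var/run/myservice.pid \
        -o /var/log/myservice/daemon.log -H \
        -u myservice -t myservice \
        /usr/local/libexec/myservice-server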
I won’t dive into the details of each flag, but in short: one specifies which PID file to create; another specifies where to store the stdout and stderr of the process; one is required for log rotation (much more on that below); one drops privileges to the given user; and one specifies the “title” of the process to display in ps(1) output. A similar trick was sufficient to inject configuration variables upon process startup, simulating the same environment that my server used to see when spawned by the Azure Functions runtime.

Hooking that up into an rc.d service script was then trivial:
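Here is the shape of such a script, using the stock rc.subr machinery; names and paths are again placeholders:

    #!/bin/sh
    # PROVIDE: myservice
    # REQUIRE: NETWORKING
    # KEYWORD: shutdown

    . /etc/rc.subr

    name="myservice"
    rcvar="myservice_enable"
    pidfile="/var/run/${name}.pid"
    command="/usr/sbin/daemon"
    command_args="-P ${pidfile} -o /var/log/${name}/daemon.log -H -u ${name} -t ${name} /usr/local/libexec/${name}-server"

    load_rc_config ${name}
    run_rc_command "$1"

Drop it into /usr/local/etc/rc.d/, add the matching _enable="YES" line to /etc/rc.conf, and the usual service(8) commands take care of the rest.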
And with that: ta-da! I had the service running locally, listening on a local port determined in the configuration file.

As part of the migration out of Azure Functions, I switched to self-hosting PostgreSQL as well. This was straightforward, but it required a couple of extra improvements to my web framework: one to stop using a remote PostgreSQL instance for tests (something I should have done eons ago), and another to support local peer authentication to avoid unnecessary passwords.

In the call to daemon(8) above, I briefly mentioned that one of the flags is needed to support log rotation. What’s that about? You see, in Unix-like systems, when a process opens a file, the process holds a handle to the open file. If you delete or rename the file, the handle continues to exist exactly as it was. This has two consequences:

- If you rename the file, all subsequent reads and writes go to the new file location, not the old one.
- If you delete the file, all subsequent reads and writes continue to go to disk, but to a file you cannot reference anymore. You can run out of disk space and, while df(1) will confirm the fact, du(1) will not let you find which file is actually consuming it!

For a long-running daemon that spits out verbose logs, writing them to a file can become problematic because you can end up running out of disk space.
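This is easy to see for yourself with a throwaway writer process:

    # A background writer that appends a line per second to demo.log.
    ( while :; do date; sleep 1; done ) > demo.log &
    writer=$!

    sleep 3
    mv demo.log demo.old   # the writer's open handle follows the rename...
    sleep 3
    wc -l demo.old         # ...so this file keeps growing
    ls demo.log            # ...while this name no longer exists

    kill "$writer"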
To solve this problem, daemons typically implement log rotation: a mechanism to keep log sizes in check by moving logs aside when a certain period of time passes or when they cross a size threshold, and then keeping only the last N files around. Peeking at one of the many examples on my server, the “live” log is the file that writes go to, and next to it sits a daily archive going back up to a week.

Having every daemon implement log rotation logic on its own would be suboptimal: you’d have duplicate logic throughout the system and you would not be able to configure policy easily for them all. This is where newsyslog(8) on FreeBSD (or logrotate on Linux) comes into play. newsyslog is a tool that rotates log files based on criteria such as size or time and optionally compresses them. But remember: the semantics of open file handles mean that simply renaming log files is insufficient! Once newsyslog takes action and moves a log file aside, it must ensure that the process that was writing to that file closes its file handle and reopens it, so that writes start going to the new place. This is typically done by sending a SIGHUP to the daemon, and it is why we need the log-reopening flag in the daemon(8) call. To illustrate the sequence:

- The system starts a service via its rc.d script, and daemon(8) redirects its logs to a file.
- newsyslog runs and determines that the log needs to be rotated because a day has passed.
- newsyslog renames the live log to its archived name and creates a new, empty file in its place. At this point the service is still writing to the renamed file!
- newsyslog sends a SIGHUP to the process.
- The process closes its file handle for the log, reopens the log path (which is the fresh new file), and resumes writing there.
- newsyslog compresses the rotated file for archival now that it has quiesced.

Configuring newsyslog is easy, but cryptic. We can create a service-specific configuration file under /etc/newsyslog.conf.d/ that provides entries for our service, such as:
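An entry along these lines does the job; the file name, ownership, and exact policy are illustrative, and this particular one keeps seven daily, compressed archives:

    # logfilename                  [owner:group]        mode count size when  flags [/pid_file]
    /var/log/myservice/daemon.log  myservice:myservice  640  7     *    @T00  JC    /var/run/myservice.pid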
I’ll leave you to the manpage to figure out what the magic fields mean, but in short, they control retention count, rotation schedule, and compression.

As I briefly mentioned earlier, the Azure Functions runtime was responsible for TLS termination in my previous setup. Without such a runtime in place, I had to configure TLS on my own in my HTTP server… or did I? I had been meaning to play with Cloudflare Tunnels for a while, given that I already use Cloudflare for DNS. Zero Trust Tunnels allow you to expose a service without opening inbound ports in your firewall. The way this works is by installing the tunnel daemon on your machine and configuring the tunnel to redirect certain URL routes to an internal address (typically a port on localhost). Cloudflare then acts as the frontend for the requests, handles TLS termination and DDoS protection, and redirects each request to your local service.

Interactions between client machines, Cloudflare servers, the cloudflared tunnel agent, and the actual HTTP servers I wrote.
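When the tunnel is driven from a local configuration file, the routing boils down to an ingress map from public hostnames to local ports; something like this, with the hostnames, tunnel ID, and ports made up for the example:

    # cloudflared configuration (config.yml)
    tunnel: 6f7b3c2a-example-tunnel-id
    credentials-file: /usr/local/etc/cloudflared/tunnel.json

    ingress:
      - hostname: api.example.org
        service: http://127.0.0.1:8080   # the HTTP server started via daemon(8)
      - hostname: analytics.example.org
        service: http://127.0.0.1:8081
      - service: http_status:404         # catch-all for everything else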
The obvious downside of relying on someone else to do TLS termination, instead of doing it yourself on your own server, is that they can intercept and modify your traffic. For the kinds of services I run this isn’t a big deal, and the simplicity of having someone else deal with certificates is very welcome. Note that I was already offloading TLS termination to Azure Functions anyway, so this isn’t a downgrade in security or privacy.

CORS

But using Cloudflare as the frontend came with a little annoyance: CORS handling. You see: the services I run require configuring extra allowed origins, and as soon as I tried to connect to them via the Cloudflare tunnel, I’d get the dreaded “405 Method not allowed” error in the requests.

Before, I used to configure CORS origins from the Azure Functions console, but no amount of peeking through the Cloudflare console showed me how to do this for my tunneled routes. At some point during the investigation, I assumed that I had to configure CORS on my own server. I’m not sure how I reached that bogus conclusion, but I ended up wasting a few hours implementing a configuration system for CORS in my web framework. Nice addition… but ultimately useless.

I had not accounted for the fact that, because Cloudflare acts as the frontend for the services, it is the one responsible for handling the pre-flight HTTP requests necessary for CORS. In turn, this means that Cloudflare is where CORS needs to be configured, but there is nothing “obvious” about configuring CORS in the Cloudflare portal. AI to the rescue! As skeptical as I am of these tools, it’s true that they work well for getting answers to common problems, and figuring out how to deal with CORS in Cloudflare was no exception. They told me to configure a transformation rule that explicitly sets CORS response headers for specific subdomains, and that did the trick:

Sample rule configuration on the Cloudflare portal to rewrite CORS response headers.

Even though AI was correct in this case, the whole thing looked fishy to me, so I did spend time reading about the inner workings of CORS to make sure I understood what the proposed solution was about and to gain my own confidence that it was correct.

Results of the transition

By now, my web services are fully running on my FreeBSD machine. The above may have seemed complicated, but in reality it was all just a few hours of work on Thanksgiving morning. Let’s conclude by analyzing the results of the transition. On the plus side, here is what I’ve gained:

- Predictability: Running in the cloud puts you at the mercy of the upgrade and product-discontinuation treadmill of the big cloud providers. It’s no fun to have to pay attention to deprecation messages and adjust to changes, no matter how long the deadlines are. FreeBSD also evolves, of course, but it has remained pretty much the same over the last 30 years and I have no reason to believe it’ll change significantly in the years to come.
- Performance: My apps are so much faster now it’s ridiculous. The serverless runtime of Azure Functions starts quickly for sure, but it just can’t beat a server that’s continuously running and has hot caches at all layers. That said, I bet the real difference in performance for my use case comes from colocating the app servers with the database, duh.
- Ease of management: In the past, having automated deployments from GitHub Actions to Azure Functions was pretty cool, not gonna lie. But… being able to deploy with a trivial file copy, perform PostgreSQL administration tasks with just a local psql session, and inspect logs trivially and quickly by looking at plain files beats any sort of online UI and distributed system. “Doesn’t scale”, you say, but it scales up my time.
- Cost: My Azure bill has gone from $20/month, the majority of which was going into the managed PostgreSQL instance, to almost zero. Yes, the server I’m running in the garage is probably costing me the same or more in electricity, but I was already running it anyway for other reasons.

And here is what I’ve lost (for now):

- Availability (and redundancy): The cloud gives you a shot at very high availability by providing access to multiple regions. Leveraging these extra availability features is not cheap and often requires extra work, and I wasn’t taking advantage of them in my previous setup. So I haven’t really decreased redundancy, but it’s funny that the day right after I finished the migration, I lost power for about 2 hours. Hah. I think I hadn’t suffered any outages with Azure other than the one described in this article.
- A staging deployment: In my previous setup, I had dual prod and staging deployments (via Azure Functions slots and separate PostgreSQL databases—not servers) and it was cool to deploy first to staging, perform some manual validations, and then promote the deployment to prod. In practice, this was rather annoying because the deployment flow was very slow and not fully automated (see “manual testing”), but it did save me from breaking prod a few times.
- Auto-deployments: Lastly, and also in my previous setup, I had automated the push to staging and prod by simply updating tags in the GitHub repository. Once again, this was convenient, but the biggest benefit of it all was that the prod build process was “containerized” and not subject to environmental interference. I could very well set up a cron job or a webhook-triggered local service that rebuilds and deploys my services on push… but for now it’s hard to beat the simplicity of building and copying by hand.

None of the above losses are inherent to self-hosting, of course. I could provide alternatives for them all and at some point I will; consider them to-dos!


BazelCon 2025 recap

It has been just over two years since I started Blog System/5, and that means it’s time for the now-usual(?) BazelCon 2025 trip report! The conference, arranged by the Linux Foundation, took place in Atlanta, GA, USA over three days: one for tutorials and two for the main talks. An extra hackathon day, organized by Aspect Build, followed. Unfortunately, a canceled flight meant I missed the tutorials, but I attended the rest of the events. As usual, it was a super-fun time to connect with old acquaintances and an energizing event that left me with plenty of new topics to research.

What follows is not a complete summary of the conference, as there were many talks I did not attend and conversations I missed. If you want the full firehose of videos, see the BazelCon 2025 YouTube playlist. And if you want a TL;DR… I’d pick the following highlights:

- The ecosystem is maturing, with bzlmod becoming mandatory and the BUILD Foundation becoming a reality on the near horizon.
- Performance remains a key focus of the Bazel core team and the community, with innovative approaches like Skycache for client-side speed, sophisticated RE improvements for backend efficiency, and new rulesets that focus on build speed.
- Community tooling is expanding Bazel’s scope, with projects like Aspect’s task runner aiming to solve long-standing workflow gaps.

But the above is just a tiny peek into the conference. So, strap in and let’s dive in. If you want more Bazel content (and much more than that), make sure to click subscribe and support Blog System/5!

Google opened by emphasizing their commitment to Bazel, highlighting its growing internal adoption. Their reasoning is that Bazel improves security and hermeticity, in addition to the usual benefits of faster builds and easier open-sourcing of code. This statement seems to be a response to last year’s proposal to create a non-Google Bazel Foundation, which would act as a “backup plan” should Google ever withdraw from the project.

Google provided two examples of its growing Bazel adoption. The first is Quantum AI, primarily written in Python and Rust, which saw an 80% reduction in CI time after a migration to Bazel that was driven by just one SWE. The second is their Google Distributed Cloud (GDC), a version of their cloud product that can run on-premise and in air-gapped instances. The GDC codebase weighs 2.6 GB, is developed by 1,300 engineers, and produces 600 GB of release artifacts. I have to question whether the latter is a number to be proud of: when does this madness in bloat stop?

The introduction concluded with a few statistics: Bazel’s Slack channel has grown by 18% from the previous year to 8,500 users, there are about 10,800 repos on GitHub with Bazel files, and there are about 120,000 such files on GitHub.

The next session was the customary round of community updates, presented by Jay Conrod from EngFlow and Alex Eagle from Aspect Build. Here are the highlights:

- Training day: There were six different sessions on the Sunday before the conference, and EngFlow is leading training efforts worldwide.
- Gazelle: C++ support is on the way for this tool. Version 2.0 will simplify the extensions interface and improve performance.
- BCR Mirror: Cloudflare is now hosting a mirror for the Bazel Central Registry (BCR). You can use it with Bazel 8.4+ by adding the mirror’s registry to your configuration—and this will become a default in a future release.
- Documentation: The most common complaint in community surveys remains the documentation and the steep learning curve. To address this, the BCR website now features icons for sponsorship requests, deprecation notices, and provenance attestations. Furthermore, the Starlark documentation has been published and is now easier to read. In a move to empower the community, the documentation has been migrated out of Google’s internal infrastructure and is available at https://preview.bazel.build/
- BUILD Foundation: The foundation currently has three founding members (Spotify, Uber, and Canva) and is looking for four more. The initial meeting is scheduled for December 4th. More on this in its own section.

Bzlmod was also a significant topic in this talk, but since it was covered at length in other presentations as well, I have dedicated a separate section to it below.

As is tradition, the next session was the State of the Union talk, led by John Field and Tobias Werth from Google. Here are some of the highlights from the update:

- Local remote repo caching: This new feature is intended to allow the caching of repository rules across different workspaces.
- Experimental WASM support: There is experimental support for WASM tools in repo rules to enable platform-independent tooling, but its future is still uncertain.
- Performance improvements: Recent changes can save up to 20% of memory. Optimizations in Merkle tree handling can reduce wall time by up to 30%. Analysis phase caching is coming soon (see the “Skycache” section below for more details). There are ongoing efforts to cap disk usage. Path stripping, a feature announced last year, is now more mature and integrated with more rule sets, offering up to an 84% reduction in build time. Analysis time on flag changes has been reduced.
- Java improvements: Caching has been improved, and method signature changes no longer affect downstream header builds.
- Starlark flags: A new scoping API is available for Starlark flags.
- A new file is being introduced to provide a canonical place for project owners to map targets to flags. More details in its own section.
- Starlarkification: This effort is almost complete. All rules are now decoupled, with the exception of a few integration points for C++. As of Bazel 9, autoloading of rules is disabled, which means users must now explicitly load all the rules they use (see the example right after this list).
- Starlark type system: Type annotations and type-checking are coming to Starlark. The syntax will be supported in Bazel 9, with type-checking planned for Bazel 10. The syntax aims for compatibility with Python 3 types, which introduces some limitations on what can be expressed.
- JetBrains Bazel plugin: The JetBrains-owned Bazel plugin has reached general availability, making the Google-owned plugin a legacy tool. This new plugin promises a much-improved user experience, as it is faster and better integrates the Bazel build graph with IntelliJ’s native understanding of the project structure, avoiding expensive “sync” steps.
- Internal APIs: There is ongoing work to separate core logic from service interactions (such as remote builds and file system operations), which sounds very similar to how Buck 2 was designed from the start.
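As an illustration of the point about explicit loads: with autoloading gone, a BUILD file has to pull in even the core language rules itself, along these lines:

    # BUILD.bazel
    # With rule autoloading disabled, even the C++ rules must be loaded explicitly.
    load("@rules_cc//cc:defs.bzl", "cc_binary")

    cc_binary(
        name = "hello",
        srcs = ["hello.cc"],
    )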
Much like the community updates talk, this one also opened with bzlmod. Let’s dive into that topic next.

The workspace is dead; long live bzlmod!

With Bazel 9, support for old-style workspaces has been removed, and given that the vast majority of rulesets now support bzlmod, it’s time for everyone to complete their migration. This is a positive development because bzlmod, through the Bazel Central Registry (BCR), simplifies rule discovery, project dependencies, and version conflict resolution. Interestingly, but not surprisingly, bzlmod and the BCR have effectively become a package manager for C++.

But it’s not all roses. The bzlmod migration has been a significant source of friction for the community due to the intrusive and difficult nature of the change. If you haven’t completed the transition, you can no longer upgrade to newer Bazel versions. The official documentation has also been subpar (which is not surprising), although a great set of articles from EngFlow is now available to clarify the migration process in great detail. To assist with the migration, an automated migration tool is now available, and various people have reported success using AI tools to help with the transition. In a related development, there is now a Maintainer Community Program (MCP) for the BCR.

One major pain point I have faced, and one that seems to affect many others, is the tendency of the Bazel ecosystem to couple ruleset versions with library versions. For example, if you are stuck on an old version of a ruleset that is not compatible with bzlmod, you must upgrade that ruleset to migrate; however, doing so in turn forces you to upgrade the library it wraps. Updating a library can introduce API incompatibilities and behavioral changes, making the upgrade to bzlmod and to subsequent major Bazel releases much more difficult than it needs to be.

Before we dive into remote execution, let me clarify some terminology. Remote Execution is abbreviated as RE, not RBE: RBE was Google’s now-discontinued cloud product for RE. While the terms are often used interchangeably today (even on Bazel’s own website), it’s a good idea to stick to RE. Similarly, avoid using the term “build farm”, as there is a specific RE implementation named Buildfarm.

Remote execution is always a hot topic at BazelCon, and for good reason. One of Bazel’s biggest selling points is the performance boost from distributing builds across multiple machines, and nearly every Bazel-related startup offers some form of remote execution solution.

In the first RE-focused talk, Son Luong Ngoc from BuildBuddy explained how their product routes actions to maximize performance and minimize execution latency. The talk began with a clear premise: remote builds are spiky and hard to binpack, so how can they be scheduled efficiently for both performance and cost? Here are some of the key features of their RE implementation:

- Executor types: BuildBuddy provides both managed (OCI, Firecracker, macOS) and self-hosted (Docker, Windows, GPU, and more) executors, each with different performance, isolation, and cost characteristics.
- Multiple action queues: Executors are organized into pools, and actions can specify which pool they should run on. When an action is received, the scheduler enqueues it in up to three different executors to minimize tail latency, based on the Sparrow scheduler research paper.
- Work stealing: Dynamic scaling of executor pools is critical for handling spiky build workloads while keeping costs down. To make this more efficient, BuildBuddy allows new executors to steal work from existing ones, which helps redistribute the load.
- Action merging: This feature coalesces multiple execution requests for the same action into a single execution. As we learned, this can be problematic if a misbehaving executor stalls multiple clients (e.g., several CI jobs). To address this, BuildBuddy speculatively re-executes a running action on a different executor after a certain threshold has passed.
Action cancellation: When a user presses Ctrl+C, they are likely to modify code and restart the build, so continuing to run in-flight actions is wasteful. For greater efficiency, BuildBuddy catches the finished event from the BEP and attempts to cancel all remotely-queued actions.
Binpacking: Different actions have different resource requirements, and it can be difficult to manually assign them to the right pool and executor. BuildBuddy automatically profiles executed actions (for metrics like peak memory and CPU consumption) and stores this information in the field of the action result, which is then stored on the server. The scheduler uses these details to route actions more effectively.
Cold starts: Executing remote actions is similar to running lambda functions: a worker must be started, a container image fetched, and the action executed. To optimize this, BuildBuddy uses affinity routing, where a key is computed based on the primary output name (which is unique for each action) to extract platform, target, and output details. This allows similar actions to be routed to similar executors.
Recycled runners: Some customers need to maintain heavyweight processes on the remote worker, such as test databases or the Docker daemon. While not hermetic, this is often desirable for high-performance scenarios. The use of these features is customized through execution properties.
Custom resources: Some actions may require access to specialized resources like GPUs, FPGAs, or simulators. For better binpacking, customers can define the “size” of different executor types and annotate actions with the resources they consume.
Fair scheduling: With multi-tenancy, users can set a “priority” for the actions of a build using the flag. A common use case is to define three priority bands: interactive builds, CI, and cron jobs. The BuildBuddy scheduler takes this property into account.

After detailing these existing features, the talk concluded with a glimpse into the future: extending the RE protocol with a remote build graph API. The current protocol is very chatty, making it difficult to colocate actions (such as a compile-link-test chain). A protocol that understands action relationships could significantly improve this.

This talk left me wondering which of these features are also offered by other major RE vendors and open-source implementations. I briefly chatted with the EngFlow folks at the conference and they told me they have most of these too; it’d be nice to have a comparison chart among vendors and free solutions.

The next major topic for RE was cost savings and reliability, which was covered in at least two talks. The first talk, presented by Rahul Roy from Glean, focused on how their adoption of Buildbarn for scalability unexpectedly doubled their CI costs. The primary causes for this increase were an up-to-20% per-action overhead in the Buildbarn worker and a lack of Bazel client caching in their GitHub Actions runners. To solve these issues, they chose to adopt spot instances in their deployment of Buildbarn on GCP’s Kubernetes offering, but this is not as easy as it seems:

The default Kubernetes autoscaler relies on CPU and memory utilization for its decisions, but these metrics are poor predictors of CI traffic patterns. Action queue length is a much better indicator of developer activity, so a custom autoscaler is needed to achieve reasonable behavior.
Fair scaling can disrupt ongoing builds because GCP only provides a 90-second shutdown notice before preempting instances, which is not enough time to terminate running actions gracefully.
Cold runners are significantly slower than hot ones because they start with empty local caches. To mitigate this, they implemented a solution to reuse runner disks, but only for disks that had been used for builds covering more than 50% of the build graph. This strategy reduced startup times from eight minutes to less than one.

The second talk, given by Gabriel Russo from Databricks and Yuval Kaplan from EngFlow, focused on building at scale and how a naive move to remote execution can actually make CI slower. They investigated the specific problem of using Docker in actions. The default behavior for remote actions is to bring up a fresh worker for each execution and tear it down afterward, wiping all state. However, Docker is stateful, which meant that actions were performing a great deal of redundant work. To solve this, they moved the snapshotter (a part of ) out of the action sandbox and into the execution container, allowing it to be shared across all actions on a given machine.

The takeaway is that you must be careful with RE. Your intuition for how local processes interact, especially with local state, does not always apply, and you can inadvertently make builds much slower and more expensive. But how do you develop such intuition? That question provides a perfect segue to the next talk.

Bazel is a complex piece of software, and its interactions with other systems are not always straightforward. When things go wrong, can the data tell us what happened? Users are often frustrated by unexpected cache misses, frequently rebuilt targets, non-hermetic actions, and flaky tests. This was the topic of a talk co-presented by Eloise Pozzi from Canva and Helen Altshuler from EngFlow.

The answer to the question above is obviously yes, the data can tell us. But it’s not easy because there is a lot of data to comb through. Bazel produces the following datasets:

Build Event Protocol (BEP): A stream of events that Bazel sends to a remote server to publish build metadata and report progress. The metadata is the closest thing you will get to “usage telemetry” from Bazel as it captures all builds that were executed (who ran them and with which flags, what was executed, etc.) My pet peeve is that the BEP is incredibly complex and really difficult to manipulate post-facto, but I encourage you to generate one locally (via ) and to spend “a few” minutes understanding what’s in there.
Exec log: This log captures everything that happened for actions, regardless of where they ran (the BEP only contains minimal details on local-only actions). It is not captured by default due to its verbosity. The flag, available since Bazel 7.1, makes it possible to capture this log unconditionally. Note that you need a parser to convert this binary log into something that can be read and compared across versions, and you need to manually build this parser out of Bazel’s source tree; yikes.
Exec graph log: This log captures how actions depend on each other. It can help quantify drag on the critical path, determine if the critical path is unique, or identify if there are competing ones. Use it to identify actions to prioritize for end-to-end build optimization.
Query commands: See the reference for , , and .
JSON profile: Also known as the performance profile, this captures a timeline of all actions executed by Bazel and can help understand build bottlenecks and tune parallelization.
RE profile: Similar to the JSON profile but this is captured server-side by some RE implementations. EngFlow generates one of these with specific details on how the workers executed actions (e.g. which pool ran an action, which is not something that’s visible to Bazel).

Returning to the BEP, it’s worth noting that one of the last messages it emits is , which contains links to some of the other logs mentioned above. If you are using remote caching, these links will point to remote cache entries, allowing you to fetch them after the fact for any user build you need to investigate.

The talk also included a description of how to debug cache misses between CI and interactive developer builds, and I felt it was an almost-literal rehearsal of the article I wrote months ago on the same topic.

The next talk I attended, which hits close to my heart, was on exposing developer tools to the . I believe that provides a terrible user experience, so I was keen to hear about alternatives. The talk was given by Florian Berchtold from Zipline.

One possible solution to this dilemma is to use direnv, a long-standing tool that hooks into the shell’s before-prompt command to run arbitrary code when entering specific directories. Scary? Yes. Useful? Also yes. The idea is to leverage direnv to bring project-specific tools into the . But where do these tools come from? While some people use package managers like nixpkgs, this can lead to duplication and inconsistencies in a Bazel-native world. For example, it’s common to pull in via bzlmod, but you might also want to expose it in the . This is where bazel_env comes in: a hook for direnv that fetches tools using Bazel.

The talk explained how to use with dev containers to install , , IDE extensions, and even starpls (the LSP for Starlark). It was also mentioned that for C++, bazel-compile-commands-extractor with works reasonably well for VSCode but struggles with large repositories. For those, configure-vscode-for-bazel is recommended for a better experience. The speaker also prepared a sample repository to demonstrate Bazel integration in the IDE for various languages, which you can find at hofbi/bazel-ide.

A few months ago, we hosted a Buildbarn mini-conference at Snowflake where, in my opinion, the most exciting talk was Ed Schouten’s presentation on Bonanza. Shortly after, I published an article imagining the next generation of Bazel builds, because Bazel’s fat client model is problematic in many scenarios. At BazelCon, we now heard Google’s approach to solving slow cold builds and Bazel client scalability in a talk on Skycache by Shahan Yang.

The core idea of Skycache is to serialize and remotely cache Skyframe, Bazel’s internal tracking system for build state (also known as the “build graph”). In his talk, Shahan outlined three major considerations for making this solution viable:

Top-down pruning: When you get a cache hit for a node in the graph, you don’t have to worry about anything below that node anymore. You can throw away everything underneath to keep memory usage constrained.
Invalidation computation: To determine what needs to be re-fetched from the cache, Skycache assumes “the same baseline” and then looks for file changes between the local system and the cache to find “what’s missing”. I know, this sounds fuzzy; refer back to the talk for the specifics.
Efficiency: For some nodes, it’s cheaper to recompute them than to fetch them from the cache, and this was true for many nodes before applying two optimizations. One was in the nested sets data structure, because the original approach to serializing them caused a 10x space blowup. The other was around serializing individual node values, because most of the time, those values share internals across nodes.

Internal dogfooding of Skycache showed that some builds dropped from 46 to 13 seconds, with similar reductions observed for analyzed targets, loaded packages, and more. On the server side, this solution is RAM-intensive (similar to Bazel’s in-memory representations) and is complicated by the fact that users want to build at older versions and with a high version cadence. To be effective, Skycache needs to maintain “thousands of base images”.

A specific insight toward the end of the talk was that, for Google as a whole, 2.5% of targets account for 90% of all targets built. This suggests a potential optimization where only those targets are cached, but this has not yet been implemented.

There is no open-source implementation of Skycache, but the talk provided hints about which classes would need to be implemented to make it work. It seems that it shouldn’t be too difficult: the serialization code is already in place, so all that’s required is integration with a key-value store and Git.

While this talk was fascinating, I can’t help but feel that Google’s solution is a bit strange. They are opting to maintain a fat Bazel client instead of moving it entirely to the cloud, as Bonanza is attempting to do, and this feels weird to me knowing how the rest of their infrastructure works (or used to work a few years back).

Yes, this was BazelCon, but given Buck’s spiritual heritage, it was no surprise to see some Buck 2 content. Andreas Herrmann from Tweag took the stage to compare Bazel and Buck 2’s approaches to the efficient compilation of Haskell, highlighting the key role of dynamic actions in Buck 2. The core of the issue lies in how Haskell modules and libraries are compiled and exposed in the Haskell rules. The summary is as follows:

Modules are individual source files. These act as the compilation unit.
Libraries are collections of modules, and are what’s often modeled in the build via rules. Therefore, library targets tend to group various modules.
Compiling an .hs file produces an object file but also an .hi interface file. Think of the latter as a precompiled header or an interface JAR.
To compile a module, we need the .hi files of its dependencies, not their object files. This is the key difference between compiling a Haskell library vs. a C/C++ one, because in the C/C++ case, all individual sources can be compiled in any order, but in the Haskell case, they cannot.

With this in mind, the central question is: how can we parallelize the compilation of modules within a library when they must be compiled in dependency order?

In Bazel, the solution is to model the internal library modules as separate rules, each with a static representation of its cross-module dependencies. However, this approach can be incredibly noisy. While Gazelle can help mitigate the issue, it is still not an ideal user experience. Buck 2, on the other hand, provides dynamic dependencies, which make it possible to infer the module-level dependency graph at build time.
The idea is to have an action that runs to emit the cross-module dependency “mini-graph” for a set of modules, and then use a dynamic action to generate module-level compilation actions with the correct dependencies.

One of my original critiques of Bazel in 2015 was that while Bazel is excellent at building , it is not well-suited for other workflows. The specific example I gave was that developers want to install the software they have just built (the equivalent of ), which is not easy to model in Bazel. Well, fear no more. Aspect Build is developing a solution to this problem with Starlark-defined tasks and a custom CLI to run them. I found this to be very exciting, and it was a “hot topic” at the hackathon that followed the conference.

The premise of the talk was that, even with Bazel, developers still often rely on auxiliary scripts to install tools, Makefiles to drive workflows like setting up test servers or linting code, and YAML files to define complex CI tasks. While all of this should ideally be expressed in Bazel, there is currently no good way to do so. In essence, Bazel is missing a “task runner”, and this is where Aspect’s newly-announced Extension Language (AXL) comes in. It’s a Starlark dialect for running tasks, which requires the Aspect CLI to execute. The CLI is a companion tool to Bazel that once “replaced” Bazel but no longer does.

With the new AXL language, you can define tasks in a way that is very similar to defining rules: you create a Starlark function, receive a context, and can then perform “stuff”. The key difference between tasks and rules is that tasks can trivially execute a build with . Even more exciting is the ability to iterate over the BEP events that the build emits and interact with them! The talk also demonstrated the use of WASM binaries for things like buildozer to write platform-agnostic AXL scripts that help with migrations and the like. But the sky is the limit here, and the new aspect-extensions GitHub organization is meant to collect all user-contributed tasks.

The desire to create a Bazel foundation to protect the project and ecosystem, should Google ever “pull the plug”, was announced a year ago, but not much has seemed to happen since. In reality, a lot has been going on behind the scenes, but nothing has yet materialized for the average user. As part of the unconference, we voted to have a BoF on the foundation to discuss its future.

The main question we tried to answer during the session was, “What could the foundation do?” Many ideas were brainstormed, including funding a technical writer, improving the quality of pull requests, maintaining important rulesets, and tackling tricky IDE integrations. However, the most popular idea was for the foundation to act as an intermediary between the community and Google, helping to prioritize the projects that the community needs most. I think there is an AI transcript of the session somewhere but there is no recording. You’ll have to stay tuned for the news, or you can get involved via Slack. Reach out to Alex Eagle or Helen Altshuler.

When you have a small repository with a single project, you can easily record project settings (such as compilation targets and debug flags) in the top-level file. But what do you do when you start combining multiple projects into one repository? The build settings for a backend service might be different from those required for a frontend application. files are here to help, and Susan Steinman and Greg Estren from Google were on hand to explain them.
The key problem being addressed is that while everyone intuitively understands what a “project” is, Bazel lacks a first-class representation of this concept. By introducing such a concept, the goal is to make work consistently everywhere, without the need to specify any flags. This is the opposite of the current situation, where it is common for developers to create auxiliary scripts to run Bazel with different flags for different targets.

As for the format of , the presenters reminded us that is an ad-hoc language and one of the few places where Bazel does not use Starlark, despite the community’s desire for consistency. As a result, these new project files are written in the language we have all come to appreciate.

More specifically, files can appear multiple times in the directory tree, just like files. The first one found when walking up the tree from a given build target is the one that is used. The file contains a project definition, which in turn contains buildable units. These units can enforce different policies for flags, such as setting default flag values for a target or preventing users from modifying certain flags. Finally, it is also possible to define multiple configurations for a unit (e.g., release vs. development) and to switch between them using .

One mistake that everyone makes when moving to a monorepo is retaining operations that scale with the size of the monorepo instead of the size of the change. In particular, it is extremely common to see CI workflows that run , either executing all tests from scratch or hoping that remote caching will prevent the re-running of unmodified tests. This is a bad practice. The overhead is significant, and the end-user experience is often terrible, especially when flaky tests are present.

bazel-diff is a tool that helps determine the targets affected by a given code change, allowing Bazel to build and test only those targets. Maxwell Elliott and Connor Wybranowsky were on hand to share the impact that developing this tool at Tinder has had on the company’s developer workflows.

The initial results of deploying were a 40% reduction in CI times at the 90th percentile, and up to a 76% reduction in the worst case. These kinds of improvements were transformative for users. In particular, because CI flows became much faster and more accurate, developers began to take ownership of test breakages and flakiness. Unfortunately, as is often the case with “transformative performance improvements” (remember SSDs or the M1 chip?), the codebase continued to grow and eventually consumed all the gains from .

To improve on the original deployment, the new approach is to integrate more deeply with CI. The idea is to dynamically generate pipelines based on the changes in a pull request and select which ones to run at review time. For example, if any of the modified files have automated formatters, only the formatter pipelines will be triggered.

To recap, the presenters mentioned that the end-to-end adoption of has helped them save up to 93% of their time in CI. While the extra gains beyond the initial 76% did not lead to the same kinds of cultural changes that were originally observed, developers always appreciate faster workflows.

If you have attended previous BazelCons, you will know that supply chain security is a recurring topic. This year was no different, with Mark Zeren from Broadcom and Tony Aiuto from Datadog presenting the latest news in this area.
The reason this topic is relevant to the conference is that Bazel is a key tool for producing reliable SBOMs, thanks to its hermeticity, sandboxing features, and fine-grained build graph entries. However, it’s not quite there yet.

From the beginning, Bazel included as a way to define per-package licensing details. However, this ruleset was “thrown over the wall” by Google when Bazel was first open-sourced and has not been fit for purpose. Today, there is a new ruleset called supply-chain, with only one person from Google on the eight-person team. This new ruleset focuses on two things: metadata rules that code authors can apply to their files, and tools to generate provenance information and produce SBOMs. These two components are separate because the metadata rules are designed to be stable over time, while the tools are expected to change frequently. What is missing from the new ruleset is licensing: the ability to generate copyright notices, validate linkage, and so on.

As I mentioned earlier, your intuition about what works well for local actions may not apply to remote actions, and container image building is a prime example of this. In this talk, Malte Poll from Tweag took the stage to introduce rules_img, a new ruleset that replaces and . It is designed to minimize large blob transfers, resulting in significantly more efficient container builds. I do not have written notes on this talk because I was too focused on absorbing the many, many details given during the talk, so I strongly encourage you to watch it.

To conclude this long recap, I will leave you with my own lightning talk on how our Java integration tests at Snowflake became significantly slower after we migrated from Maven to Bazel. It’s only eight minutes long, but if you want the summary:

Maven and Bazel compile Java code differently. Maven writes class files to a few directories on disk, while Bazel creates intermediate JAR files for every Java library. With Bazel’s more-detailed build graph, this causes an explosion in the size and means that class files must be read from compressed JAR (ZIP) files instead of from disk.

I spent some time analyzing the problem and ruled out obvious factors like sandboxing and ZIP compression. I concluded that reading from JAR files is indeed slower than reading individual files from disk. (Why? I’m not sure, but I suspect there is an optimization that could be made in the class loader to fix this.)

To mitigate the problem, I created a new rule that uses the tool to merge all intermediate JARs into one. But this was easier said than done. The resulting combined JAR was huge and could not be reused across tests, so I had to develop a complex dependency-pruning rule to generate a combined JAR that could be reused across all tests without introducing class duplicates—all while remaining remote-execution-friendly.

With this new rule in place, we saw test runtimes drop by about 10 seconds per test, which brought them back to pre-Bazel levels. And, combined with the improvements that Bazel brings to the build itself, this amounted to an end-to-end reduction in test times.

And that’s a wrap! This has been more of a detailed summary than a brief recap, but my goal was to clean up and share all the notes I took during the conference. Apologies for the many talks I could not cover in this recap. Once again, head to the BazelCon 2025 YouTube playlist for all recordings. If you are involved with Bazel at all or have any interest in build systems, I strongly encourage you to plan to attend next year.
You’ll learn a lot from the talks of course, but what’s more, you’ll get to meet key people from tens of companies—people that hold the keys to how modern build tools and scalable development processes are being developed worldwide.

Blog System/5 3 months ago

You are holding BUILD files wrong

I’ve heard it from people new to Bazel but also from people very familiar with the Bazel ecosystem: BUILD files must go away. And they must go away because they are redundant: they just repeat the dependency information that’s already encoded in the in-code import/use statements.

Hearing this from newcomers to Bazel isn’t surprising: after all, most newcomers are used to build tools that provide zero facilities to express dependencies across the sources of your own project. Hearing it from old-timers, however, is disappointing because it misses the point of what BUILD files can truly offer. In my opinion: if that’s how you are writing BUILD files, you are holding them wrong. There is much more to BUILD files than mindlessly repeating import statement dependencies. Let’s see why.

But before we do, take a moment to subscribe to Blog System/5. You do not want to miss out on future content!

Suppose you are given the following change to review: By looking at this diff, possibly from a Pull Request (PR) review, you can guess the following:

The Java package already depends on the package.
The Java package already depends on the package.
The addition of the line does not modify the dependency graph: the edge from the package to the package existed beforehand, and this new import statement is just leveraging it.
The addition of the line is… uh, well, given this limited context, you just can’t tell! Is it OK or is it not? Did already depend on via some other file in the same package—in which case this new import changes nothing dependency-wise—or did it not—in which case this new import deserves questioning from a high-level architecture perspective?

The snippet I presented above is for Java but, in reality, the problem I described applies to every other language: all languages out there have some sort of import/use statements and all languages have some sort of mechanism to group code in module-like entities. By inspecting standalone changes at the file level, we cannot tell whether new cross-module dependencies are being introduced or not.

And being able to reason about modules is critical: we humans work best when we can reason about higher level relationships than files. We think of software as a collection of modules with layered dependencies and constraints that should not be violated. Enforcing these conceptual models via import/use statements is impossible, but the build graph—the very thing that BUILD files define—is the best place to encode them in a programmatic manner.

So: my point is that BUILD files give you a chance to encode the high-level architecture of your software project as a graph of dependencies that lives outside of the code. If you keep your BUILD files lean and human-managed, you have a good chance of detecting invalid dependencies from a layering perspective as soon as they are introduced.

The word “lean” in the previous paragraph is doing a lot of the heavy lifting though because by “lean” I mean simple BUILD files that define targets that map to concepts. This bypasses “best practices” that dictate one BUILD file per directory because you may need to use recursive globs to group sources into larger conceptual units, and this can also result in reduced build performance because you end up with fewer, larger targets. And that’s fine. For one, if recursive globs are a problem because they end up bundling too many unrelated concepts in one target, you have got a problem with your directory structure and you should fix that.
And for another, if larger targets end up hurting build performance, you have got a problem with your modularity and you should work towards breaking those big targets apart. At the end of the day, these two issues are symptoms of having too many unrelated concepts in one module. Simplifying the build structure may result in a transient performance regression, but working towards breaking those apart will help everyone in your organization.

None of this is novel though, as these ideas can be found outside of Bazel. Think about shared libraries in large C or C++ projects, multiple Maven modules in a large Java code base, or multiple crates in a large Rust project. If you have ever done any of these, you know that manually defining modules is useful because it forces you and your fellow developers to think in terms of APIs at the module boundaries. Changing the module-level architecture of a project is something that happens infrequently and, when it does, you want the more senior people in the team to question and review such changes. And, for that, you must make these changes visible as soon as they happen.

Expressing modules in your build graph is great, but people seem to like having tools to automatically update dependencies based on code changes. This is not incompatible with what I have said so far, but in order to keep a clean software architecture, you will need to have a strong code review culture because any undesirable new edges introduced in a change will have to be vetted at code review. But… what if they aren’t? Can we do better? Of course we can!

Bazel gives us a way to express restrictions via reverse dependencies: aka visibility rules. When you maintain a conceptual dependency graph by hand, you will find cases where you want to express things like:

can be consumed from , which is the lowest level layer of the compiler.
cannot be consumed from any other layer unless we discuss the implications.

Visibility rules allow you to express these restrictions programmatically. The difference with forward dependencies is that, if you ever wanted to use from a module that has not been pre-declared as an allowed consumer, you would need to modify the BUILD file definitions in to widen the visibility rules. This would require talking to the owners of such module, either in person or via the code review, to be allowed as a consumer of those APIs.

Now that we know the theory behind my proposal, let’s revisit the package from the earlier example. To enforce our desired architecture, its BUILD file might look like the sketch shown a bit further below. It would be a “lean” file: it defines a single, conceptual library, and doesn’t bother about specifying source files: it trusts that whatever you throw into the directory truly belongs to that module. Most importantly, the visibility attribute declares that only code within the package is allowed to depend on this library.

With this rule in place, the problematic code change we saw earlier (adding an of to the ) would no longer be a silent, ambiguous change. The moment the developer (or the CI system) tries to build the code, they would get an immediate, explicit error from Bazel stating that the target is not allowed to see the target. The architectural violation is caught automatically. The desired conversation with the module owners is now forced to happen, exactly as intended.
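To make this concrete, a lean BUILD file along the lines described above could be sketched as follows; the package layout, rule names, and the exact visibility list are placeholders rather than the real code from the example:

load("@rules_java//java:defs.bzl", "java_library")

# BUILD file for a conceptual "parser" module; all names are placeholders.
java_library(
    name = "parser",
    # Trust the directory: whatever lives under this package is part of the module.
    srcs = glob(["**/*.java"]),
    # Only the compiler layer may depend on this library. Anyone else has to ask
    # the owners to widen this list through a reviewed change.
    visibility = ["//compiler:__subpackages__"],
)

The single target maps to a concept, the recursive glob keeps the file free of per-source bookkeeping, and the visibility attribute is where the layering rules are enforced.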
Finally, we get to the most hyped topic of all times: AI agents. Remember when I said above that a clean conceptual module-based architecture is critical for humans to understand how a project works? Well, guess what, the same applies to AI models. If you try to use AI agents on an existing codebase, you will notice that they try to reason about the current architecture by reading individual file names and their contents, and then chasing through their file-level dependencies. But what if you could make these AI agents follow your conceptual dependency chain by teaching them, via an MCP server, to follow your build graph? Presumably, their ability to reason would increase because they’d be faced with cleaner concepts that explain the story behind your codebase in big blocks.

I hope to have convinced you that manually managing your BUILD files in a Bazel project is a good idea for long-term maintainability and for the successful use of AI tools. For this to be possible, you have to forego the “standard practice” of having very small BUILD targets and instead capture your conceptual modular architecture in the build graph. And once you do that, BUILD files magically become manageable by humans, without the need for fancy automation that pushes complexity under the rug. But that’s just my opinion.

Blog System/5 3 months ago

Bazel and glibc versions

Imagine this scenario: your team uses Bazel for fast, distributed C++ builds. A developer builds a change on their workstation, all tests pass, and the change is merged. The CI system picks it up, gets a cache hit from the developer’s build, and produces a release artifact. Everything looks green. But when you deploy to production, the service crashes with a mysterious error. What went wrong?

The answer lies in the subtle but dangerous interaction between Bazel’s caching, remote execution, and differing glibc versions across your fleet. In previous posts in this series, I’ve covered the fundamentals of action non-determinism, remote caching, and remote execution. Now, finally, we’ll build on those to tackle this specific problem. This article dives deep into how glibc versions can break build reproducibility and presents several ways to fix it—from an interesting hack (which spawned this whole series) to the ultimate, most robust solution.

Before moving on and getting captivated by the intricate details of the problem, take a moment to support Blog System/5.

Suppose you have a pretty standard (corporate?) development environment like the following:

Developer workstations (WS). This is where Bazel runs during daily development, and Bazel can execute build actions both locally and remotely.
A CI system. This is a distributed cluster of machines that run jobs, including PR merge validation and production release builds. These jobs execute Bazel too, which in turn executes build actions both locally and remotely.
The remote execution (RE) system. This is a distributed cluster of worker machines that execute individual Bazel build actions remotely. The key components we want to focus on today are the AC, the CAS, and the workers—all of which I covered in detail in the previous two articles.
The production environment (PROD). This is where you deploy binary artifacts to serve your users. No build actions run here.

All of the systems above run some version of Linux, and it is tempting to wish to keep such version in sync across them all. The reasons would include keeping operations simpler and ensuring that build actions can run consistently no matter where they are executed. However, this wish is misguided and plain impossible. It is misguided because you may not want to run the same Linux distribution on all three environments: after all, the desktop distribution you run on WS may not be the best choice for RE workers, CI nodes, nor production. And it is plain impossible because, even if you aligned versions to the dot, you would need to take upgrades at some point: distributed upgrades must be rolled out over a period of time (weeks or even months) for reliability, so you’d have to deal with version skew anyway.

To make matters more complicated, the remote AC is writable from all of WS, CI, and RE to maximize Bazel cache hits and optimize build times. This goes against best security practices (so there are mitigations in place to protect PROD), but it’s a necessity to support an ongoing onboarding into Bazel and RE.

The question becomes: can the Linux version skew among all machines involved cause problems with remote caching? It sure can because C and C++ build actions tend to pick up system-level dependencies in a way that Bazel is unaware of (by default), and those influence the output the actions produce. Here, look at this:

The version of glibc leaks into binaries and this is invisible to Bazel’s C/C++ action keys.
glibc versions its symbols to provide runtime backwards compatibility when their internal details change, and this means that binaries built against newer glibc versions may not run on systems with older glibc versions.

How is this a problem though? Let’s take a look by making the problem specific. Consider the following environment:

In this environment, developers run Bazel in WS for their day-to-day work, and CI-1 runs Bazel to support development flows (PR merge-time checks) and to produce binaries for PROD. CI-2 sometimes runs builds too. All of these systems can write to the AC that lives in RE. As it so happens, one of the C++ actions involved in the build of , say , has a tag which forces the action to bypass remote execution. This can lead to the following sequence of events:

A developer runs a build on a WS. has changed so it is rebuilt on the WS. The action uses the C++ compiler, so the object files it produces pick up the dependency on glibc 2.28. The result of the action is injected into the remote cache.
CI-1 schedules a job to build for release. This job runs Bazel on a machine with glibc 2.17 and leverages the RE cluster, which also contains glibc 2.17. Many C++ actions get rebuilt but is reused from the cache. The production artifact now has a dependency on symbols from glibc 2.28.
Release engineering picks the output of CI-1, deploys the production binary to PROD, and… boom, PROD explodes.

The fact that the developer WS could write to the AC is very problematic on its own, but we could encounter this same scenario if we first ran the production build on CI-2 for testing purposes and then reran it on CI-1 to generate the final artifact.

So, what do we do now? In a default Bazel configuration, C and C++ action keys are underspecified and can lead us to non-deterministic behavior when we have a mixture of host systems compiling them.

Let’s start with the case where you aren’t yet ready to strictly restrict writes to the AC from RE workers, yet you want to prevent obvious mistakes that lead to production breaks. The idea here is to capture the glibc version that is used in the local and remote environments, pick the higher of the two, and make that version number an input to the C/C++ toolchain. This causes the version to become part of the cache keys and should prevent the majority of the mistakes we may see.

WARNING: This is The Hack I recently implemented and that drove me to writing this article series! Prefer the options presented later, but know that you have this one up your sleeve if you must mitigate problems quickly.

To implement this hack, the first thing we have to do is capture the local glibc version. We can do this with a dedicated build action (a simplified sketch appears below). One important tidbit here is the use of the volatile status file, indirectly via the requirement of stamping. This is necessary to force this action to rerun on every build because we don’t want to hit the case of using an old tree against an upgraded system. As a consequence, we need to modify the script pointed at by --workspace_status_command (you have one, right?) to emit the glibc version.
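As a point of reference, a heavily simplified sketch of this first step could look like the following. Unlike the real thing, it skips the workspace status indirection and just asks the local system directly, opting out of caching instead; the target name is illustrative:

# Captures the glibc version of the machine where this action runs.
# Simplified sketch: the actual approach described above ties this to the
# workspace status script and the volatile status file instead.
genrule(
    name = "local_glibc_version",
    outs = ["local_glibc_version.txt"],
    # getconf prints something like "glibc 2.28"; keep only the version number.
    cmd = "getconf GNU_LIBC_VERSION | awk '{print $$2}' > $@",
    local = True,
    tags = ["no-cache", "no-remote"],
)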
The second thing we have to do is capture the remote glibc version. This is… trickier because there is no tag to force Bazel to run an action remotely. Even if we assume remote execution, features like the dynamic spawn strategy or the remote local fallback could cause the action to run locally at random. To prevent problems, we have to detect whether the action is running within RE workers or not, and the way to do that will depend on your environment.

The third part of the puzzle is to select the highest glibc version between the two that we collected. We can do this with a small snippet that leverages sort’s -V flag to compare versions (see the sketch below). This flag is a GNU extension… but we are talking about glibc anyway here so I’m not going to be bothered by it.

And, finally, we can go to our C++ toolchain definition and modify it to depend on the file produced by the previous action. Ta-da! All of our C/C++ actions now encode the highest possible glibc version that the outputs they produce may depend on. And, while not perfect, this is an easy workaround to guard against most mistakes.

But can we do better? Of course. Based on the previous articles, what we should think about is plugging the AC hole and forcing build actions to always run on the RE workers. In this way, we would precisely control the environment that generates action outputs and we should be good to go. Unfortunately, we can still encounter problems! Remember how I said that, at some point, you will have to upgrade glibc versions? What happens when you are in the middle of a rolling upgrade to your RE workers? The worker pool will end up with different “partitions”, each with a different glibc version, and you will still run into this issue.

To handle this case, you would need to have different worker pools, one with the old glibc version and one with the new version, and then make the worker pool name be part of the action keys. You would then have to migrate from one pool to the other in a controlled manner. This would work well at the expense of reducing cache effectiveness, taking a big toll on operations, and making the rollout risky because the switch from one pool to another is an all-or-nothing proposition.

The real solution comes in the form of sysroots. The idea is to install multiple parallel versions of glibc in all environments and then modify the Bazel C/C++ toolchain to explicitly use a specific one. In this way, the glibc version becomes part of the cache key and all build outputs are pinned to a deterministic glibc version. This allows us to roll out a new version slowly with a code change, pinning the version switch to a specific code commit that can be rolled back if necessary, and keeping the property of reproducible builds for older commits. This is the solution outlined at the end of Picking glibc versions at runtime and is the only solution that can provide you 100% safety against the problem presented in this article. It is difficult to implement, though, because convincing GCC and clang to not use system-provided libraries is tricky and because this solution will sound alien to most of your peers.
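Going back to the hack for a moment, the version-selection step it relies on can be as small as the following sketch; the target and file names are illustrative and assume the two capture steps above each produced a single version string:

# Pick the highest of the local and remote glibc versions captured earlier.
genrule(
    name = "max_glibc_version",
    srcs = [
        ":local_glibc_version",
        ":remote_glibc_version",  # assumed to come from the remote-capture step
    ],
    outs = ["max_glibc_version.txt"],
    # sort -V performs a proper version comparison (the GNU extension noted above).
    cmd = "cat $(SRCS) | sort -V | tail -n 1 > $@",
)

The output file is what the C/C++ toolchain definition then takes as an extra input, so that the chosen version ends up in every action key.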
The problem presented in this article is far from theoretical, but it’s often forgotten about because typical build environments don’t present significant skew across Linux versions. This means that facing new glibc symbols is unlikely, so the chances of ending up with binary-incompatible artifacts are low. But they can still happen, and they can happen at the worst possible moment. Therefore, you need to take action. I’d strongly recommend that you go towards the sysroot solution because it’s the only one that’ll give you a stable path for years to come, but I also understand that it’s hard to implement.

Therefore, take the solutions in the order I gave them to you: start with the hack to mitigate obvious problems, follow that up with securing the AC, and finally go down the sysroot rabbit hole.

As for the glibc 2.17 mentioned in passing above, well, it is ancient by today’s standards at 13 years of age, but it is what triggered this article in the first place. glibc 2.17 was kept alive for many years by the CentOS 7 distribution—an LTS system used as a core building block by companies and that reached EOL a year ago, causing headaches throughout the industry. Personally, I believe that relying on LTS distributions is a mistake that ends up costing more money/time than tracking a rolling release, but I’ll leave that controversial topic for a future opinion post.
To make matters more complicated, the remote AC is writable from all of WS, CI, and RE to maximize Bazel cache hits and optimize build times. This goes against best security practices (so there are mitigations in place to protect PROD), but it’s a necessity to support an ongoing onboarding into Bazel and RE.

The problem

The question becomes: can the Linux version skew among all machines involved cause problems with remote caching? It sure can, because C and C++ build actions tend to pick up system-level dependencies in a way that Bazel is unaware of (by default), and those influence the output the actions produce. Here, look at this:

The version of glibc leaks into binaries and this is invisible to Bazel’s C/C++ action keys . glibc versions its symbols to provide runtime backwards compatibility when their internal details change, and this means that binaries built against newer glibc versions may not run on systems with older glibc versions.

How is this a problem though? Let’s take a look by making the problem specific. Consider the following environment:

In this environment, developers run Bazel in WS for their day-to-day work, and CI-1 runs Bazel to support development flows (PR merge-time checks) and to produce binaries for PROD. CI-2 sometimes runs builds too. All of these systems can write to the AC that lives in RE. As it so happens, one of the C++ actions involved in the build of , say , has a tag which forces the action to bypass remote execution. This can lead to the following sequence of events:

1. A developer runs a build on a WS. has changed so it is rebuilt on the WS. The action uses the C++ compiler, so the object files it produces pick up the dependency on glibc 2.28. The result of the action is injected into the remote cache.
2. CI-1 schedules a job to build for release. This job runs Bazel on a machine with glibc 2.17 and leverages the RE cluster, which also contains glibc 2.17. Many C++ actions get rebuilt but is reused from the cache. The production artifact now has a dependency on symbols from glibc 2.28.
3. Release engineering picks the output of CI-1, deploys the production binary to PROD, and… boom, PROD explodes.
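For reference, the kind of target that kicks off step 1 could look like the hedged sketch below. The names are invented, and I am assuming the bypass is expressed with Bazel's standard no-remote-exec tag (and that tag propagation to execution requirements is enabled in your Bazel version); the article does not say which tag was actually used.

```
cc_library(
    name = "liblegacy",      # hypothetical target rebuilt on the workstation
    srcs = ["legacy.cc"],
    hdrs = ["legacy.h"],
    # Asks Bazel to run this target's actions outside the RE cluster, which is
    # how the workstation's glibc 2.28 sneaks into the cached object files.
    tags = ["no-remote-exec"],
)
```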

Blog System/5 4 months ago

Trusting builds with Bazel remote execution

The previous article on Bazel remote caching concluded that using just a remote cache for Bazel builds was suboptimal due to limitations in what can and cannot be cached for security reasons. The reason behind the restrictions was that it is impossible to safely reuse a cache across users. Or is it?

In this article, we’ll see how leveraging remote execution in conjunction with a remote cache opens the door to safely sharing the cache across users. The reason is that remote execution provides a trusted execution environment for actions, and this opens the door to cross-user result sharing. Let’s see why and how.

Remote execution basics

As we saw in the article about action determinism , Bazel’s fundamental unit of execution is the action . Consequently, a remote execution system is going to concern itself with efficiently running individual actions, not builds, and caching the results of those. This distinction is critical because there are systems out there that work differently, such as Microsoft’s CloudBuild , Buildbuddy’s Remote Bazel , or even the shiny and new Bonanza .

When we configure remote execution via the flag, Bazel enables the action execution strategy by default for all actions, just as if we had done . But this is only a default and users can mix-and-match remote and local strategies by leveraging the various selection flags or by specifying execution requirements in individual actions.

A remote execution system is complicated as it is typically implemented by many services:

- Multiple frontends. These are responsible for accepting user requests and tracking results. These might implement a second-level CAS to fan out traffic to clients.
- A scheduler. This is responsible for enqueuing action requests and distributing them to workers. Whether the scheduler uses a pull or push model to distribute work is implementation dependent.
- Multiple workers. These are responsible for action execution and are organized in pools of distinct types (workers for x86, workers for arm64, etc.) Internally, a worker is divided into two conceptual parts: the worker itself, which is the privileged service that monitors action execution, and the runner , which is a containerized process that actually runs the untrusted action code.
- The components of a remote cache (a CAS and an AC). The CAS is essential for communication between Bazel and the workers. The AC, which is optional, is necessary for action caching. The architecture of the cache varies from service to service.

For the purposes of this article, I want to focus primarily on the workers and their interactions with the AC and the CAS. I’m not going to talk about frontends or schedulers except for showing how they help isolate remote action execution from the Bazel process. Let’s look at the interaction between these components in more detail.

To set the stage, take a look at the action from this sample build file:

The action has two types of inputs: a checked-in source file, , and a file generated during the build, . This distinction is interesting because the way these files end up in the CAS is different: Bazel is the one responsible for uploading into the CAS, but is uploaded by the worker upon action completion.

When we ask Bazel to build remotely, and assuming has already been built and cached at some point in the past, we’ll experience something like this:

That’s a lot of interactions, right?! Yes; yes they are. A remote execution system is not simple and it’s not always an obvious win: coordinating all of these networked components is costly.
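To keep the walkthrough above concrete, the sample build file could look roughly like this hedged sketch (file and target names are invented): the checked-in source is uploaded to the CAS by Bazel, while the generated input is uploaded by the worker that produced it.

```
genrule(
    name = "generate_version",
    outs = ["version.txt"],
    cmd = "echo '1.2.3' > $@",                  # produced during the build
)

genrule(
    name = "bundle",
    srcs = [
        "notice.txt",          # checked-in source file: Bazel uploads it to the CAS
        ":generate_version",   # generated input: uploaded by the worker that ran it
    ],
    outs = ["bundle.txt"],
    cmd = "cat $(SRCS) > $@",
)
```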
The overheads become tangible when dealing with short-lived actions—a better fit for persistent workers—or when you have a sequential chain of actions—a good fit for the dynamic execution strategy . What I want you to notice here, because it’s critical for our analysis, is the shaded area. Note how all interactions within this area are driven by the remote execution service, not Bazel. Once an action enters the remote execution system, neither Bazel nor the machine running Bazel have any way of tampering with the execution of the remote action. They cannot influence the action’s behavior, and they cannot interfere with the way it saves its outputs into the AC and the CAS. And this decoupling, my friend, is the key insight that allows Bazel to safely share the results of actions across users no matter who initiated them. However, the devil lies in the implementation details. Given the above, we now know that remote workers are a trusted environment: the actions that go into a worker are fully specified by their action key and, therefore, whatever they produce and is stored into the AC and the CAS will match that action key. So if we trust the inputs to the action, we can trust its outputs, and we can do this retroactively… right? Well, not so fast. For this to be true, actions must be deterministic, and they aren’t always as we already saw . Some sources of non-determinism are “OK” in this context though, like timestamps, because these come from within the worker and cannot be tampered with. Other sources of non-determinism are problematic though, like this one: An attacker could compromise the network request to modify the content of the downloaded file, but only for long enough to poison the remote cache with a malicious artifact. Once poisoned, they could restore the remote file to its original content and it would be very difficult to notice that the entry in the remote cache did not match the intent of this rule. It is tempting to say: “ah, the above should be fixed by ensuring the checksum of the download is valid”, like this: And I’d say, yes, you absolutely need to do checksum validation because there are legitimate cases where you’ll find yourself writing code like this… in repo rules. Unfortunately, such checks are still insufficient for safe remote execution because, remember: actions can run from unreviewed code, or the code that runs them can be merged into the tree after a careless review (which is more common than you think). Consequently, the only thing you can and must do here is to disable network access in the remote worker. That said, just disabling network access may still be “not good enough” to have confidence in the safety of remote execution. A remote execution system is trying to run untrusted code within a safe production environment: code that could try to attack the worker to escape whatever sandbox/container you have deployed, code that could try to influence other actions running on the same machine, or code that could exfiltrate secrets present in the environment. Securing these is going to come down to standard practices for untrusted code execution, none of which are specific to Bazel, so I’m not going to cover them. Needless to say, it’s a difficult problem. If we have done all of the above, we now have a remote execution system that we can trust to run actions in a secure manner and to store their results in both the AC and the CAS. 
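Before moving on, here is a hedged sketch of the network-download pattern flagged above, together with the checksum-validated variant. The URL, the names, and the digest are placeholders I made up for illustration.

```
genrule(
    name = "fetch_blob",
    outs = ["blob.tar.gz"],
    # Non-deterministic and attackable: whatever the server returns at
    # execution time ends up in the CAS and, via the AC, in everyone's builds.
    cmd = "curl -fsSL https://example.com/blob.tar.gz -o $@",
)

genrule(
    name = "fetch_blob_checked",
    outs = ["blob_checked.tar.gz"],
    # Checksum validation pins the content, but as argued above it still does
    # not make arbitrary, unreviewed actions safe to run with network access.
    cmd = """
        curl -fsSL https://example.com/blob.tar.gz -o $@
        echo "0000000000000000000000000000000000000000000000000000000000000000  $@" | sha256sum -c -
    """,
)
```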
But… this, on its own, is still insufficient to secure builds end-to-end, and we would like to have trusted end-to-end builds to establish a chain of trust between sources and production artifacts, right?

To secure a build, we must protect the AC and restrict writes to it to happen exclusively from the remote workers. Only they, who we have determined cannot be interfered with, know that the results of an action correspond to its declared inputs—and therefore, only they can establish the critical links between an AC entry and one or more files in the CAS.

You’d imagine that simply setting would be enough, but it isn’t. A malicious user could still tamper with this flag in transient CI runs or… well, on their local workstation. And it’s because of this latter scenario that the only possible way to close this gap is via network-level ACLs: the AC should only be writable from within the remote execution cluster.

But… you guessed it: that’s still insufficient. Even if we disallow Bazel clients from writing to the AC, an attacker can still make Bazel run malicious actions outside of the remote execution cluster—that is, on the CI machine locally, which does have network access. Such an action wouldn’t record its result in the AC, but the output of the action would go into the CAS, and this problematic action could then be consumed by a subsequent action as an input.

The problem here stems from users being able to bypass remote execution by tweaking flags. One option to protect against this situation is the same as we saw before: disallow CI runs of PRs that modify Bazel flags so that users cannot “escape” remote execution. Unfortunately, this doesn’t have great ergonomics because users often need to change the file as part of routine operation.

Bazel’s answer to this problem is the widely-unknown invocation policy feature. I say unknown because I do not see it documented in the output of and I cannot find any details about it whatsoever online—yet I know of its existence from my time at Google and I see its implementation in the Bazel code base, so we can reverse-engineer how it works.

Invocation policies

As the name implies, an invocation policy is a mechanism to enforce specific command-line flag settings during a build or test with the goal of ensuring that conventions and security policies are consistently applied. The policy does so by defining rules to set, override, or restrict the values of flags, such as .

The policy is defined using the protobuf message defined in src/main/protobuf/invocation_policy.proto . This message contains a list of messages, each of which defines a rule for a specific flag. The possible rules, which can be applied conditionally on the Bazel command being executed, are:

- : Sets a flag to a specific value. You can control whether the user can override this value. This is useful for enforcing best practices or build-time configurations.
- : Forces a flag to its default value, effectively preventing the user from setting it.
- : Prohibits the use of certain values for a flag. If a user attempts to use a disallowed value, Bazel will produce an error. You can also specify a replacement value to be used instead of the disallowed one.
- : Restricts a flag to a specific set of allowed values. Any other value will be rejected.

To use an invocation policy, you have to define the policy as an instance of the message in text or base64-encoded binary protobuf format and pass the payload to Bazel using the flag in a way that users cannot influence (e.g. directly from your CI infrastructure, not from workflow scripts checked into the repo).
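For illustration, a minimal policy and invocation might look like the following hedged sketch. The flag name and the protobuf field names reflect my reading of invocation_policy.proto and may vary across Bazel versions, so treat them as assumptions to verify rather than as the article's exact example.

```
# Assumed field names based on src/main/protobuf/invocation_policy.proto.
cat > policy.textproto <<'EOF'
flag_policies {
  flag_name: "remote_upload_local_results"
  commands: "build"
  set_value {
    flag_value: "false"
  }
}
EOF

# The payload must come from the CI infrastructure itself, not from
# user-editable workflow scripts.
bazel build --invocation_policy="$(cat policy.textproto)" //...
```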
Let’s say you want to enforce a policy where the flag is always set to when running the command, and you want to prevent users from overriding this setting. We define the following policy in a file:

And then we invoke Bazel like this (again, remember: this flag should be passed by CI in a way that users cannot influence):

If you now try to play with the flag, you’ll notice that any overrides you provide don’t work. (Bazel 9 will offer a new flag behavior to error out instead of silently ignoring overrides, which will make the experience nicer in this case.)

Before concluding, I’d like to show you an interesting outage we faced due to Bazel being allowed to write AC entries from a trusted CI environment. The problem we saw was that, at some point, users started reporting that their builds were completely broken: somehow, the build of our custom singlejar helper tool, a C++ binary that’s commonly used in Java builds, started failing due to the inability of the C++ compiler to find some header files.

This didn’t make any sense. If we built the tree at a previous point in time, the problem didn’t surface. And as we discovered later, if we disabled remote caching on a current commit, the problem didn’t appear either. Through a series of steps, we found that singlejar’s build from scratch would fail if we tried to build it locally without the sandbox. But… that’s not something we do routinely, so how did this kind of breakage leak into the AC?

The problem stemmed from our use of , a flag we had enabled long ago to mitigate flakiness when leveraging remote execution. Because of this flag, we had hit this problematic path:

1. An build started on CI. This build used a remote-only configuration, forcing all actions to run on the remote cluster.
2. Bazel ran actions remotely for a while, but at some point, encountered problems while building singlejar.
3. Because of , Bazel decided to build singlejar on the CI machine, not on the remote worker, and it used the strategy, not the strategy, to do so. This produced an action result that was later incompatible with sandboxed / remote actions.
4. Because of , the “bad” action result was injected into the AC.
5. From here on, any remote build that picked the bad action result would fail.

The mitigation to this problem was to flush the problematic artifact from the remote cache, and the immediate solution was to set which… Bazel claims is deprecated and a no-op, but in reality this works and I haven’t been able to find an alternative (at least not in Bazel 7) via any of the other strategy flags. The real solution, however, is to ensure that remote execution doesn’t require the local fallback option for reliability reasons, and to prevent Bazel from injecting AC entries for actions that do not run in the remote workers.

With that, this series to revisit Bazel’s action execution fundamentals, remote caching, and remote execution is complete. Which means I can finally tell you the thing that started this whole endeavor: the very specific, cool, and technical solution I implemented to work around a hole in the action keys that can lead to very problematic non-determinism. But, to read on that topic, you’ll have to wait for the next episode! So, make sure to subscribe to Blog System/5 now.

Blog System/5 4 months ago

Understanding Bazel remote caching

The previous article on Bazel action non-determinism provided an introduction to actions: what they are, how they are defined, and how they act as the fundamental unit of execution in Bazel. What the article did not mention is that actions are also the fundamental unit of caching during execution to avoid doing already-done work. In this second part of the series, I want to revisit the very basics of how Bazel runs actions and how remote caching ( not remote execution, because that’ll come later) works. The goal here is to introduce the Action Cache (AC) , the Content Addressable Storage (CAS) , how they play together, and then have some fun in describing the many ways in which it’s possible to poison such a cache in an accidental or malicious manner. Picture this build file with two targets that generate one action each: And now this sequence of commands: : This first command causes Bazel to build , possibly from scratch. This takes at least 10 seconds due to the calls we introduced in the commands. As a side-effect of this operation, Bazel populates its in-memory graph with the actions that it executed and the outputs they produced. : This second command asks Bazel to do the same build as the first command and, because we just built and Bazel’s in-memory state is untouched, we expect Bazel to do absolutely nothing. In fact, this command completes (or should complete) in just a few milliseconds due to Bazel’s design. : This third command shuts the background local Bazel server process down. All in-memory state populated by the first command is lost. : This fourth command asks Bazel to do the same build as the first and second commands. We didn’t run a so all on-disk state is still present, so we should expect Bazel to not rebuild nor . And indeed they are not rebuilt: Bazel also “does nothing” like in that second command (the s don’t run), but the build is visibly slower in this case. The problem or, rather, question is… how does Bazel know that there is nothing to do on that fourth command? A system like Make would discover that all output files are “up to date” by comparing their timestamps against those of their inputs, but remember that Bazel tracks output staleness by inspecting the content digests of the inputs and ensuring they haven’t changed. Enter the Action Cache (AC) : an on-disk persistent cache that helps Bazel determine whether the outputs of an action are already present on disk and whether they already have the expected (up-to-date) content. The AC lives under and is stored in a special binary format that you can dump with . Conceptually, the AC maps the s that we saw in the previous article to their corresponding s. In practice, the AC keys are a more succinct representation of the on-disk state of the inputs to the action: up until Bazel 9, the keys were known as “digest keys” but things have changed recently to allow Bazel to better explain why certain actions are rebuilt; the details are uninteresting in this article though. But what is an ? The “action result” records, well, the side-effects of the action. Among other things, the contains: The exit code of the process executed by the action. A mapping of output names to output digests . The mapping of output names to digests is the piece of information I want to focus on because this is what allows the fourth Bazel invocation above to discover that it has “nothing to do”. When Bazel has lost its in-memory state of an action, Bazel queries the AC to determine the names of its outputs. 
If those files exist on disk, then Bazel can compute their digests and compare those against the digests recorded in the . If they match, Bazel can conclude that the action does not have to be re-executed. And this explains why the fourth Bazel invocation is visibly slower than the second invocation, even if both are “fully cached”: when the in-memory state of an action is lost, Bazel has to re-digest all input and output files that are on disk, and these are slow operations, typically I/O-bound. If you have seen annoying pauses with the message, you now know what they are about. The AC is, maybe surprisingly, a concept that exists even if Bazel is not talking to a remote cache or execution system. But what happens when we introduce a remote cache into the mix? First of all, the remote cache needs to be able to answer the same questions as the local AC: “given an action key, is the action result already known?” But let’s think through what goes into the result of such a query. Does the remote AC capture the same information that goes into a local , or does it do something different? Given that we are talking about a remote cache, it’s tempting to say that the value of the cache entry should embed the content of the output files: after all, if Bazel scores a remote AC hit, Bazel will need to retrieve the resulting output files to use them for subsequent actions, right? Not so fast: What if Bazel already has the output files on disk but is just querying the remote AC because the local in-memory state was lost? In this case, you want the response from the cache to be as small and quick as possible: you do not want to fetch the output contents again because they may be very large. What if you (the user) don’t care about the output’s content? Take a look at the actions in the example above: if all of them are cached, when I ask Bazel to build I probably only want to download from the remote cache. I may not care about the intermediate , so why should I be forced to fetch it when I query the cache just to know if is known? What if I'm using remote execution and I'm building and running a test? The test runs remotely, so the local machine does not need to download any file at all! This is why the , even for the remote AC, does not contain the content of the outputs. But then… how is the remote cache ever useful across users or machines, or even in a simple sequence like this? The in between these two builds causes all disk state (the local AC and the local output tree) to be lost. In this situation, Bazel will leverage the remote AC to know the names of the output files and their digests for each action… but if those outputs are not present on the local disk anymore, then what? Do we just rebuild the action? That’d… work, but it’d defeat the whole purpose of remote caching. Enter the Content Addressable Storage (CAS) , another cache provided by the remote caching system to solve this problem. The CAS maps file digests ( not names!) to their contents . Nothing more, nothing less. By leveraging both the AC and the CAS, Bazel can recreate its on-disk view of an already-built target by first checking with the AC what files should exist and then leveraging the CAS to fetch those files. Let’s visualize everything explained above via sequence diagrams. This first diagram represents the initial invocation, assuming that has not been built at all by anyone beforehand. 
This means that Bazel will not be able to score any local nor remote AC hits and therefore will have to execute all actions: This second diagram represents the second invocation executed after . In this case, all local state has been lost, but Bazel is able to score remote cache hits and recreate the local disk state by downloading entries from the remote AC and CAS. The dashed lines against the CAS represent optional operations, controlled by the use (or not) of the “Build Without The Bytes” feature. Before moving on to the fun stuff, a little subtlety: whenever the AC and CAS use digests, they don’t just use a hash. Instead, they use hash/size pairs. This adds extra protection against length extension attacks and allows both Bazel and the remote cache to cheaply detect data inconsistencies should they ever happen. With any remote caching system, we must fear the possibility of invalid cached entries. We have two main types of “invalid entries” to worry about: Actions that point to results that, when reused, lead to inconsistent or broken builds. This can happen if the cache keys fail to capture some detail of the execution environment. For example: if we have different glibc versions on the machines that store action results into the remote cache, we can end up with object files that are incompatible across machines because the glibc version is not part of the cache key. Actions that point to malicious results injected by a malicious actor. The attack vector looks like this: a malicious actor makes a action point to a poisoned object file that steals credentials and uploads them to a remote server. This object file is later pulled by other machines when building tools that engineers run or when building binaries that end up in production, leading to the compromised code spreading to those binaries. Scary stuff. But can this attack vector happen? Let’s see how an attacker might try to compromise the remote cache. If the attacker can inject a malicious blob into the CAS, we now have a new entry indexed by its digest that points to some dangerous file. But… how can we access such file? To access such file, we must first know its digest. Bazel uses the digests stored in the AC to determine which files to download from the CAS so, as long as there is no entry in the AC pointing to the bad blob, the bad blob is invisible to users and is not used. We have no problem here. The real danger comes from an attacker having write access to the AC. If the attacker can write arbitrary entries to the AC, they can pretty much point any action to compromised results. Therefore, the content of the AC is precious. In order to offer a secure and reliable remote cache system, we must restrict who can write to the cache. And because we can’t control what users do on their machines (intentionally or not), the only option we have is to restrict writes to the AC to builds that run on CI. After all, CI is a trusted environment so we can assume attackers cannot compromise it. But that’s not enough! Attackers can still leverage a naive CI system to inject malicious outputs into the cache. Consider this: an attacker creates a PR that modifies the scripts executed by CI. This change leverages the credentials of the CI system to write a poisoned entry into the AC. This poisoned entry targets an action that almost-never changes (something at the bottom of the build graph) to prevent it from being evicted soon after. The attacker runs the PR through CI and then deletes the PR to erase traces of their actions. 
From there on, the poisoned artifact remains in the cache and can be reused by other users. Yikes. How do we protect against this?

The reality is that we just cannot, at least not in a very satisfactory way. If the CI system runs untrusted code as submitted in a PR, the CI system can be compromised. We can mitigate the threat by doing the following:

- For CI runs against PRs (code not yet reviewed and merged): Disallow running the CI workflows if the changes modify the CI infrastructure in any way (CI configurations, scripts run by CI, the Bazel configuration files, etc.) Configure Bazel with so that s or other actions that could produce tampered outputs cannot propagate those to other users.
- For CI runs against merged code: Configure Bazel with so that they are the only ones that can populate the remote cache. If there is any malicious activity happening at this stage, which could still happen via sloppy code reviews or smart deceit, at least you will be able to collect audit logs and have the possibility of tracing back the bad changes to a person.

This configuration should provide a reasonably secure system at the expense of slightly lower cache hit rates: users will not be able to benefit from cached artifacts until their code has been merged and later built by CI. But… doing otherwise would be reckless.

Before concluding: what about the CAS? Is it truly safe to allow users to freely write to the CAS? As we have seen before, it is really difficult for a malicious entry in the CAS to become problematic unless it is referenced by the AC. But still, we have a couple of scenarios to worry about:

- DoS attacks: Malicious users could bring the remote cache “down” (making it less effective) by flooding the CAS with noise, pushing valid artifacts out of it, or by exhausting all available network bandwidth. This is not a big concern in a corporate environment where you'd be able to trace abusive load to a user, but you might still run into this due to accidental situations.
- Information disclosure: If a malicious user can somehow guess the digest of a sensitive file (e.g. a file with secrets), they could fetch such a file. So… how much do you trust cryptography?

As presented in this article, deploying an effective remote cache for Bazel in a manner that’s secure is not trivial. And if you try to make the setup secure, the effectiveness of the remote cache is lower than desirable because users can only leverage remote caching for builds executed on CI: any builds they run locally, possibly with configurations that CI doesn’t plan for, won’t be cached. The only way to offer a truly secure remote caching system is by also leveraging remote execution. But we’ll see how and why in the next episode. Make sure to subscribe to Blog System/5 to not miss out on the promised follow-up!
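To tie the two CI modes above together, the split could look like this hedged .bazelrc sketch. The config names and the cache endpoint are made up, and the flag spellings should be double-checked against your Bazel version.

```
# Presubmit (unreviewed PRs): read from the remote cache, never write to it.
build:ci-presubmit --remote_cache=grpcs://cache.example.com
build:ci-presubmit --noremote_upload_local_results

# Postsubmit (merged code): the only builds allowed to populate the cache.
build:ci-postsubmit --remote_cache=grpcs://cache.example.com
build:ci-postsubmit --remote_upload_local_results
```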

Blog System/5 5 months ago

Bazel and action (non-) determinism

A key feature of Bazel is its ability to produce fast, reliable builds by caching the output of actions. This system, however, relies on a fundamental principle: build actions must be deterministic. For the most part, Bazel helps ensure that they are, but in the odd cases when they aren’t, builds can fail in subtle and frustrating ways, eroding trust in the build system. This article is the first in a series on Bazel’s execution model. Having explained these concepts many times, I want to provide a detailed reference before explaining a cool solution to a problem I recently developed at work. We will start with action non-determinism, then cover remote caching and execution, and finally, explore the security implications of these features. This first article explains what non-determinism is, how it manifests, and how you can diagnose and prevent it in your own builds. Let’s begin. Consider the following example build file: This build file specifies two targets : the target, which builds a C library from two source files, and the target, which builds a C binary from one source file and links it against the library. These two targets instantiate the and rules by binding them to specific attributes (the values of and ). Processing these rules during dependency analysis yields a collection of actions : The rule used to define the target generates: A action to compile the file into the object file. Its command line may be: “ A action to compile the file into the object file. Its command line may be: “ A action to link and together into the archive. Its command line may be: “ The rule used to define the target generates: A action to compile the file into the object file. Its command line may be: “ A action to link and together into the executable. Its command line may be: “ Note that nowhere in the list above do you see target names. Actions work with file -level dependencies, not target -level dependencies. If you need to visualize this, think of the target dependency graph and the action dependency graph as two disjoint entities. (Skyframe tracks them as just one graph but we can ignore that fact here.) It’s this, actions, that are the atomic unit of execution in Bazel. Once Bazel is done with its loading and analysis phases, it enters the execution phase. During execution, the “only” thing that Bazel does is dispatch actions for execution via its execution strategies , trying to maximize parallelism as determined by the constraints of the action dependency graph. To break down an action into its parts, let’s examine what goes into defining the action above, and to do that, let’s first focus on its simple command line to produce the binary from the object file and the static library: Bazel tracks the command line as part of the action, but things are a bit more complex than that. And to explain the “complexity”, let’s try to understand what problems Bazel is trying to solve compared to a more rudimentary build tool like Make. If you have used (or still use) Make, you would have likely expressed the corresponding build rule as: which looks… OK, I guess. But what happens if you do this? An inconsistent build! The binary is not stripped as you would expect because the second invocation does nothing ! has no idea that the variable is involved in the target definition so it doesn’t know that the target has to be rebuilt to honor the variable change. 
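To ground the comparison above, here is a hedged reconstruction of the two pieces being contrasted: the Bazel build file and the roughly equivalent Make rule. All file, target, and variable names are invented.

```
cc_library(
    name = "greet",
    srcs = ["greet.c", "greet_fmt.c"],
    hdrs = ["greet.h"],
)

cc_binary(
    name = "hello",
    srcs = ["hello.c"],
    deps = [":greet"],
)
```

And the Make side, showing the inconsistent rebuild:

```
hello: hello.o libgreet.a
	$(CC) $(LDFLAGS) -o hello hello.o libgreet.a

# $ make hello              # links hello
# $ make hello LDFLAGS=-s   # "is up to date": hello is not re-linked, so it is not stripped
```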
This type of scenario is what leads to having to run from time to time in a Make-based build system because the outputs that Make produces get out of sync with environmental changes. And the reason is that the only thing that tracks to determine whether a target needs to be rebuilt are the file timestamps of the inputs that are explicitly listed in the rule ( and in this example). Now you’d say: but you can fix it! “Just” do: And indeed this ensures that the target gets re-linked if changes. But I hope you’ll agree that this is awful and that nobody does it because: one, most folks writing s aren’t aware of the problem; and, two, even if they are, it’s too hard to get it right (see… we forgot about the value of and whatever other environment variables might influence ’s behavior like, you know, the ?). I didn’t come here to bash against Make. OK, maybe I did because folks out there often say “Make works just fine and it’s much simpler than Bazel!” when in reality they are oblivious to a bunch of very real problems that later waste other people ’s time when their build environment subtly breaks. </rant> Bazel and other next-generation build systems solve this specific problem and more by being comprehensive about what they track at the action level, and using that information to determine whether an action needs to be rebuilt or not. In particular, a Bazel action is defined by these parts: The command line to execute , which in this case is . Hashes of the input files required to execute the command. These include “obvious” inputs like the source files specified in the targets but also the files required to execute the tools of the action (e.g. the compiler’s own files). In this case, the list of input files could look like: , , and . The configuration of the environment in which the action runs. This includes environment variables, the host and target platforms, and things like that. In this case, the configuration could include the value of and whether we are building in debug or optimized mode. Configurations are expressed as a hash, though, because of the many details that go into computing them. These three properties define quite precisely the relation between the context of an action and the outputs it produces, and this is the main technique that Bazel uses to avoid clean builds at large scale. Enjoying the article so far? Please subscribe to show your support. But pay attention to the “quite” word in “quite precisely” right above. I did not say “perfectly” because there are still ways for non-deterministic behavior to leak into a Bazel build, meaning that and unexpected rebuilds (e.g. due to cache expiration) could still change the behavior of a build. Consider this innocuous example: This rule says: stick the output of the command, which prints the current date, into the file. Obviously, “current date” varies over time so we should expect the above to give us trouble. And indeed it does: look at this sequence of commands where I’ve removed all irrelevant Bazel console noise: The first Bazel build claims to have executed 1 action in the sandbox and the file shows us the date when that happened. The second Bazel build does nothing and remains unmodified. But if we later follow that by a Bazel clean and a third Bazel build, we see that the content of is now different. Non-determinism has leaked into the build, and… that’s problematic. Non-determinism is a problem because it prevents achieving reproducible builds . 
On the one hand, this voids the security guarantees that come from being able to reproduce builds in different environments: if the output of the build is not bit-for-bit reproducible from its inputs, you can’t verify that a binary that’s being used in production actually comes from the sources it claims to have been built from. On the other hand, this leads to situations where developers get different behavior depending on when/where they build the code: you do not want to hear the “works on my machine” excuse when troubleshooting a bug. So, it is bad.

But one interesting property of Bazel’s action model is that a single non-deterministic action does not necessarily poison the whole build. Take a look at this build file that defines a chain of actions:

The interesting bit here is in the target, which counts the lines in its input and writes the resulting number to its output. While this target consumes a non-deterministic input, its output is deterministic because the number of lines in the input is constant: writes a different timestamp each time, but it always produces one line. The fact that the target produces a deterministic output allows Bazel to stop “propagating” non-determinism across the build. Remember that actions track input hashes , not input timestamps . Once is re-executed after changes to , the output of will have the same hash as it did before, and will conclude that it doesn’t need to be rerun. Let’s try it:

The sequence of commands above proves the point: the first build of the target tells us that Bazel executed 4 sandboxed actions (one for each target). If we then remove the non-deterministic file from the output tree and ask Bazel to rebuild the target, we see how it only rebuilt 3 targets and 1 of them scored a cache hit. And by inspecting the log we asked Bazel to produce, we see that it effectively rebuilt , , and , but it didn’t have to rebuild because the non-determinism didn’t propagate further.

In a Make world, the above sequence of commands would have invalidated the whole build because Make just checks timestamps, and targets almost-always update the timestamps of their outputs unless we go to great lengths to prevent it (like I did earlier on in the stamp file rule with its call to ).
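Here is a hedged reconstruction of the kind of chain discussed above. The target names and the exact shape are invented; the point is that the line-counting step turns a non-deterministic input back into a deterministic output.

```
genrule(
    name = "stamp",
    outs = ["stamp.txt"],
    cmd = "date > $@",                                           # non-deterministic
)

genrule(
    name = "banner",
    srcs = [":stamp"],
    outs = ["banner.txt"],
    cmd = "echo \"built at $$(cat $(location :stamp))\" > $@",   # still non-deterministic
)

genrule(
    name = "count",
    srcs = [":banner"],
    outs = ["count.txt"],
    cmd = "wc -l < $(location :banner) > $@",                    # always "1": deterministic again
)

genrule(
    name = "report",
    srcs = [":count"],
    outs = ["report.txt"],
    cmd = "cp $(location :count) $@",                            # not rerun when only the stamp changes
)
```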
In the previous example, it was rather obvious that a call to could be problematic. But this is not the only source of non-determinism, and oftentimes the reason behind the non-determinism isn’t as obvious. Here is a more comprehensive list of possible causes:

- Date and time. You might not be calling , but build tools—especially code generators and archivers like zip—love injecting timestamps in their output files. These may be obvious, like comments in generated files, or subtle, like values written in binary metadata headers.
- System identifiers. Similarly to “current date”, there are tools that query the current PID, UID, GID, etc. and inject those values in their outputs.
- Sort ordering. Hash tables are the star data structure in computer science and they are everywhere. Unfortunately, there are tools that leak their internal use of hash tables into output files by, for example, emitting unsorted lists.
- Accessing the network. Just don’t .
- Unexpected/unknown dependencies on host tools. Calling a tool from the system means introducing hidden dependencies on whatever the tool itself depends on. For example, the tool might read a configuration file that alters its behavior.
- Dynamic execution . This powerful feature that helps improve incremental build times in interactive scenarios can easily lead to non-determinism if the remote execution environment and the local execution environment aren’t equivalent (where equivalent is tricky to define).
- Foreign CC rules . Bazel tries to enforce action determinism as we saw earlier, but other build systems make little effort to do so. If you end up nesting build systems, as is the case when using this ruleset, it’s very likely that you are introducing non-determinism.
- Randomness. Tools can decide to read from and do something with that value, in which case you definitely have non-determinism.

The list above is long, and there is this assumption, especially from newcomers to Bazel, that Bazel’s sandboxing ensures that build behavior is deterministic. In a theoretical world, that would be true: Bazel would execute each action in a precisely controlled environment to ensure that actions behaved exactly the same from run to run. This would require using a cycle-accurate virtual machine to precisely control instruction scheduling (multithreading can also introduce non-determinism) and entropy sources, but as you can imagine, this would make build execution extremely slow. In a practical world, sandboxing has to grant some concessions in the name of performance: otherwise, people will end up disabling sandboxing, nullifying all of its benefits.

Furthermore, sandboxing isn’t something magical you can “do” from userspace (unless you write a full machine emulator). Sandboxing requires kernel support, and different kernels offer different sandboxing technologies. In turn, this means that what Bazel can sandbox or not depends on the machine that Bazel is running on. For example: Bazel’s sandbox on Linux is able to restrict file accesses, offer stable PIDs, and forbid network accesses—but the macOS sandbox, based on the deprecated sandbox-exec, cannot mangle the PID namespace.

So. We know non-deterministic actions can exist in a Bazel build and that sandboxing isn’t going to protect us from them. In that case, how can we tell if such actions have leaked into our build? We can use the “execution log” feature in Bazel to write a detailed log of all the actions that Bazel executes. Then, we can compare the logs of two separate builds and see if they differ. Looking back to our chain of actions from the last example, we could capture two fresh execution logs by doing this:

Note: it is important to start from a clean build and to tell Bazel to not reuse remotely-cached actions. In this way, we force Bazel to reexecute the whole build, which should uncover non-determinism if it exists. Also, make sure to keep enabled (the default).

Once we have run the above, we can proceed to diff the logs. I like doing , but you can use whichever file diffing UI you prefer:

Voila. The first chunk of the log tells us that the first non-deterministic action is the one that writes the file, and the second chunk of the log tells us that there is another action that consumes said file as an input.
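The capture-and-diff sequence referenced above could look roughly like this. It is a hedged sketch: the target pattern and the flag choices are mine, so double-check the spellings against your Bazel version.

```
# Two fresh, uncached builds, each writing an execution log.
bazel clean
bazel build --noremote_accept_cached --execution_log_json_file=run1.json //...
bazel clean
bazel build --noremote_accept_cached --execution_log_json_file=run2.json //...

# Any difference between the two logs points at a non-deterministic action.
git diff --no-index run1.json run2.json
```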
Voila. The first chunk of the diff tells us that the first non-deterministic action is the one that writes the timestamp file, and the second chunk tells us that there is another action that consumes said file as an input.

Let’s finish the article by giving you some practical tips to remove non-determinism from the build and to make sure it doesn’t come back:

Set up a CI pipeline that identifies new instances of non-determinism. Unless you are proactive about it, non-determinism will creep back in because neither the local sandbox nor remote execution can fully prevent it.

Keep the local sandbox enabled. It may not be perfect but it’s much better than nothing. Also, explicitly disable network access in the sandbox because, for historical reasons, the sandbox did not forbid it and the default hasn’t been flipped yet.

Rely on hermetic toolchains. Do not use the system-provided ones because they tend to have dependencies on system-provided files that are invisible to the Bazel action definitions. (E.g. if you use the host-provided compiler, it will happily embed details of the host system into the final binary and this will be invisible to Bazel.)

Force remote execution. Sometimes, non-determinism is inevitable or really hard to avoid (e.g. if you use the Foreign CC ruleset). Under these conditions, your best bet is to force the problematic actions to run remotely under a strictly controlled environment and to provision the remote cache so that such actions “never” fall out. If done correctly, this will “hide” the non-determinism because, once an action has been built, it will never be rebuilt again until its known inputs actually change.

Sanitize the action’s environment. Use the relevant flags to keep settings like the PATH consistent across machines, and to minimize the environment variables that leak into action execution.

Think about network access. If you must, do it from repo rules and always verify that whatever you downloaded matches known checksums. If you are strict about checksum validation, you’ll still have a non-hermetic build, but at least, you’ll have a deterministic one. If you have such behavior in a test, tag the test so that its results are never cached.

And with that, it’s time to conclude until the next episode on remote caching. Remember: this is the first part of a series! Subscribe to Blog System/5 now to not miss out on the next one.

0 views
Blog System/5 7 months ago

Lessons along the EndBOX journey

About six months ago, during one of my long runs, I had a wild idea: what if I built an OS disk image that booted straight into EndBASIC, bundled it with a Raspberry Pi, a display, a custom 3D-printed case, and made a tiny, self-contained retro BASIC computer? Fast-forward to today and such an idea exists in the form of “the EndBOX prototype”! This article isn’t the product announcement though—that’s elsewhere. What I want to do here is look back at the Blog System/5 articles I’ve written over the past months because what might have seemed like scattered topics were actually stepping stones toward the EndBOX. Let’s look at what I learned along the way and why, even though developing EndBASIC may sound like a “useless waste of time”, it’s a great playground and the source of inspiration for the articles you’ve come to appreciate here. You know what to do. Take a moment to subscribe and support Blog System/5!

“Porting the EndBASIC console to an LCD” On April 26th, 2024

TL;DR: The article starts with an introduction to EndBASIC’s console framework and how I refactored it to separate display rendering primitives from higher-level operations. This design was inspired by NetBSD’s wscons, and the text explains how so. After that, the article continues to show how to extend the redesigned interface to talk to an SPI-attached LCD, how the SPI communication works, and how double-buffering and damage tracking allow for fast rendering performance. Relevance: This is the article that started the EndBOX but I didn’t know it at the time. You’ll notice that the article ends with a list of parts to build your own embedded box… but because the software wasn’t readily available as a downloadable SD card image, I haven’t heard of anyone trying to use it at all. Lessons learned: SPI bus access: I had to reverse-engineer the sample C code that came with the ST7735s LCD, figure out how to access the SPI bus from Rust, and re-implement parts of it in my own terms. I also had to learn about DTB overlays because the SPI bus is disabled by default. I did not have to dive deep into DTBs at this point, but that came back to bite me later. Rasterization algorithms: Up until that point, EndBASIC had leveraged the SDL library and HTML canvas elements for graphics rendering. But when writing directly to an LCD… the only thing you can do is poke pixels. So I had to read up on Bresenham’s line algorithm and the Midpoint circle algorithm, implement them, and, of course, find a way to write unit tests.

“Revisiting the NetBSD build system” On December 28th, 2024

TL;DR: The article presents a general overview of how the NetBSD build system shines in achieving cross-platform, cross-architecture, and root-less builds. Relevance: I had to get back into NetBSD after many years of not touching it because its cross-building features were key to getting EndBOX up and running. Lessons learned: Not much has changed: This is more a realization than a lesson, but it was good to see that not much had changed since I used to use NetBSD on a daily basis. This is good because it shows how resilient BSD systems are, but also bad because some problems that made me leave NetBSD are still present. NetBSD is still unique: I don’t know of any other OS that supports cross-building as trivially as NetBSD does, much less without requiring root access to generate disk images. I hear FreeBSD 15 will sport these same features, and that’s exciting, but “I’ll believe them when I see them”.
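To make the cross-building claim a bit more concrete, here is a hedged sketch of what such a build can look like; the flags come from build.sh’s documented options, but the directory layout is made up and the exact invocation used for the EndBOX differs:

# An unprivileged (-U) cross build of a full NetBSD release for a 64-bit ARM
# board, runnable as an ordinary user on any supported host OS.
cd /usr/src   # a checkout of the NetBSD source tree
./build.sh -U -m evbarm -a aarch64 \
    -O ../obj -T ../tooldir -D ../destdir -R ../releasedir \
    tools release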
“Self-documenting Makefiles” On January 10th, 2025

TL;DR: This article is a hands-on tutorial on how to write Makefiles that provide help messages by scanning their own content. Relevance: I had read about this idea before but had never put it into practice. In developing the EndBOX, I had to create a rather complex Makefile to glue together the patching of the NetBSD tree, the build of the NetBSD toolchain and release, the cross-compilation of EndBASIC for aarch64, and the bundling of all pieces together into a final disk image. Lessons learned: Fragility: Implementing this idea is simple, but in the discussions that followed the article publication, I realized at least two problems: trailing space in variable assignments in make is meaningful and this approach cannot easily work if you have includes in your Makefiles. Not a big deal for my use case, but having practiced this, I know when and when not to retry this in the future. Use make and you’ll have a bad time: Nothing surprising here, but getting the Makefile to orchestrate all these disparate builds and to do it correctly without spurious rebuilds has been painful. What’s worse is that the resulting build process takes a long time because make is unable to properly parallelize all build steps. It hurts to see my 72-core server “stall” when a script runs.

“Hands-on graphics without X11” On January 17th, 2025

TL;DR: This article presents a deep dive into NetBSD’s console drivers and how those can be used to render graphics directly to the framebuffer without relying on X11 or Wayland. Relevance: Figuring this out was the key to unblocking the EndBOX project. I had built an earlier prototype of the OS image that used EndBASIC’s SDL console over X11 but… the boot times were atrocious and I wasn’t sure I could make them better. Researching how to leverage the framebuffer removed this roadblock as I could get graphics rendering almost immediately after the kernel finished booting. Lessons learned: wsdisplay and wskbd APIs: I had looked at the internals of these devices in the past but never paid much attention to the versatile APIs they expose. I had to do this now, and it was insightful. gdb scripting: Nothing new (I knew this was possible), but I think it was the first time I put it into practice… for the article’s sake.

“ioctls from Rust” On February 13th, 2025

TL;DR: An overview on what ioctls are and how to invoke them from Rust when bindings don’t yet exist for them. Relevance: This was a direct follow-up to the previous article: that one was focused on gaining access to the framebuffer, which I prototyped from C, and this one was about productionizing those prototypes in the form of a Rust backend for EndBASIC’s console abstraction. Lessons learned: ioctl formats: Not all ioctls are the same. Some deal with simple data types whereas others deal with large structures—and the way they are expressed is different. Integrating ioctls in Rust: This was “just a matter of programming”, but it was cool to see how the crate I used makes it easy to expose ioctls in a Rust-native manner so that they just look like function calls.

“Hardware discovery: ACPI & Device Tree” On February 28th, 2025

TL;DR: A deep dive on how device discovery works on a modern machine, including details on ACPI and Device Tree, how they differ, and how they are put in memory so that the kernel can read them. Relevance: During the development of the EndBOX, I had to enable SPI to render to the LCD via a DT overlay.
I also had to find and install a DTB for the Raspberry Pi Zero 2 W so that NetBSD could boot on this board and so that the WiFi could work too. Lessons learned: ACPI and Device Tree: I pretty much knew nothing about these, so learning about the foundations behind them and how they differ was interesting on its own. Device Tree overlays: I had to write an overlay from scratch to enable the SPI bus on NetBSD—and then get the LCD to actually work (which was its own time sink). Linux vs. NetBSD DTBs: NetBSD reuses Linux DTBs, but NetBSD’s copy is ancient and contains local changes (to e.g. disable the unsupported Videocore). I had to figure out, through sweat and tears, how to port the DTB for the Pi Zero 2 W from Linux to NetBSD without breaking anything else.

“Beginning 3D printing” On May 28th, 2025

TL;DR: A beginner-level introduction to 3D printing, including 3D modeling basics, slicing, and actual printing considerations. Relevance: The final step in showing off the EndBOX was to create a case for it that matched the design I had envisioned months earlier. Lessons learned: Unprintable objects: The way 3D printers work imposes constraints on the kinds of items that can be printed, and I had to adjust my design a few times. Modeling vs. slicing: They are two very different things, and I had never imagined that the second existed. It’s not trivial: Even after figuring out the basics, getting a perfect print is difficult. I guess the real world is analog and subject to imperfections.

And that’s all for today. If you enjoyed any of these articles, it’s because working on the EndBOX gave me reasons to chase those topics down. To keep this kind of content going, I need time to play, explore, and tinker, so if you’d like to support the journey, please subscribe to or sponsor the project. Sponsor the EndBOX As for what’s next—well, I’m starting to rethink how I can apply the lessons of the EndBOX to something more impactful. Maybe it’s time to turn “just a fun ride” into something greater, because while BASIC may not be the future, the many components that have gone into building EndBASIC and the EndBOX may be. Stay tuned! Subscribe now

0 views
Blog System/5 7 months ago

Whatever happened to sandboxfs?

Back in 2017–2020, while I was on the Blaze team at Google, I took on a 20% project that turned into a bit of an obsession: sandboxfs. Born out of my work supporting iOS development, it was my attempt to solve a persistent pain point that frustrated both internal teams and external users alike: Bazel’s poor sandboxing performance on macOS. sandboxfs was a user-space file system designed to efficiently create virtual file hierarchies backed by real files—a faster alternative to the “symlink forests” that Bazel uses to prepare per-action sandboxes. The idea was simple: if we could lower sandbox creation overhead, we could make Bazel’s sandboxing actually usable on macOS. Unfortunately, things didn’t play out as I dreamed. Today, sandboxfs is effectively abandoned, and macOS sandboxing performance remains an unsolved problem. In this post, I’ll walk you through why I built sandboxfs, what worked, what didn’t, and why—despite its failure—I still think the core idea holds promise.

To understand how sandboxfs was intended to help with sandboxed build performance, we need to first dive into how Bazel runs build actions. For those unfamiliar with Bazel’s terminology, a build action or action is an individual build step, like a single compiler or linker execution. To run actions, Bazel uses the strategies abstraction to decouple action tracking in the build graph from how those actions are actually executed. The default strategy for local builds is the sandboxed strategy, which isolates the processes that an action runs from the rest of the system. The goal is to make these processes behave in a deterministic manner. The sandboxed strategy achieves action isolation via two different mechanisms: The use of kernel-level sandboxing features to restrict what the action can do (limit network access, limit reads and writes to parts of the file system, etc.). One such mechanism is sandbox-exec on macOS. The creation of an execution root (or execroot) in which the action runs. The execroot contains the minimum set of files required for the action to run: namely, the toolchain and the action inputs (source files, toolchain dependencies, etc.). One way to do this is via symlink forests.

The default mechanism to create an execroot in Bazel is to leverage symlink forests: file hierarchies that use symlinks to refer to files that live elsewhere. Creating a symlink forest is an operation that scales linearly with the number of files in it, and each symlink requires at least two system calls: one to create it and another to delete it when the sandbox is torn down. Plus, symlink forests typically have complex directory structures, so there are extra directory creation and deletion operations to handle all intermediate path components. Doing thousands of these operations may only take milliseconds, but… overheads in action execution quickly compound and turn into visible build slowdowns.

To illustrate what this means in practice, consider a target that compiles a single C source file. Such a target makes Bazel spawn one action to compile that file into an object file. Said action needs to: run the compiler; read the source file; and access any system includes that the source may reference. Thus, the sandbox used to run this action contains just those files, laid out as a symlink forest, and Bazel runs the equivalent of a plain compiler invocation from inside that directory (a hedged sketch of both follows below). When Bazel runs this, it expects that the compiler will only access files in the directory it previously created inside the sandbox.
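Neither the original sandbox listing nor the exact command survive in this copy, so here is a hedged sketch of the idea; every path, file name, and tool name below is invented, and a real Bazel execroot has more structure than this:

# A per-action sandbox is a throwaway directory of symlinks that point back at
# the real inputs (source files plus toolchain files).
SANDBOX=/tmp/bazel-sandbox.1234/execroot/mylib
mkdir -p "$SANDBOX/lib" "$SANDBOX/tools/bin"
ln -s "$HOME/src/project/lib/hello.c" "$SANDBOX/lib/hello.c"
ln -s /opt/toolchain/bin/cc "$SANDBOX/tools/bin/cc"
# ...one symlink (plus the intermediate mkdir calls) per action input...

# The action then runs the equivalent of this from inside the sandbox:
(cd "$SANDBOX" && tools/bin/cc -c lib/hello.c -o lib/hello.o)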
But because reality may not match expectations, Bazel wraps the command with whatever technology the host OS provides to enforce sandboxing. If you are enjoying this article, take a moment to subscribe to Blog System/5. It means a lot to me and you’ll get more content like this!

Creating symlink forests on an action basis was very expensive on macOS… or so everyone said. When I arrived on the Blaze team, sandboxing had already been disabled by default on macOS builds and the rationale behind that was that “symlinks were too slow”. There were some flaws with this claim: It was impossible to prove. I ran many microbenchmarks to exercise symlink creations and deletions in large amounts and could never observe a significant performance degradation compared to Linux. Building Bazel with itself, with sandboxing enabled, did not show any sort of substantial performance loss. Yet Bazel has relatively large C++ and Java actions in its own build, so you would have expected to see something. If macOS was truly bad at something as fundamental as “symlink management”, you’d imagine that someone else would have found the issue and asked about it online (as often happens with misguided NTFS complaints). But there were none to be found.

Still, I devised the sandboxfs plan right after developing sourcachefs—another short-lived stint in file systems development—and I charged ahead. I wanted sandboxfs to exist because it did solve an obvious scalability issue (issuing tens of thousands of syscalls per symlink forest creation is not free) and because I wanted sandboxfs to exist for pkg_comp’s own benefit.

sandboxfs replaces symlink forests with a virtual file hierarchy that can be materialized in constant time. Here is the flow of operations:

Bazel generates an in-memory manifest of the execroot structure and which files are backed by which other files.

Bazel sends this manifest to sandboxfs via an RPC (which means we have at least one system call to send a message through a socket and a couple of context switches).

sandboxfs updates its in-memory representation of the file system and exposes a new sandbox at its mount point.

Bazel runs the action in the new sandbox.

sandboxfs catches all I/O in the sandbox and redirects it to the relevant real backing files.

It’s this last point that presents the trade-off behind sandboxfs, because sandboxfs doesn’t make all costs magically go away. Instead of paying the cost of setting up the sandbox upfront via many system calls, we pay a different cost over all reads and writes that go through the virtual file system. The original hypothesis was that this would be worth it, because most (but not all) build actions are not I/O bound, and most build actions do not access all the files that are mapped into their sandbox.

Going back to the example from before, Bazel would send sandboxfs an RPC describing those mappings, and the corresponding file hierarchy would immediately become available under the mount point. Note that, unlike in the symlink forest, these entries do not point anywhere: sandboxfs does not use symlinks. sandboxfs exposes the files as if they were real files, and it does that to prevent tools from resolving symlinks and discovering sibling files they aren’t supposed to see. From the point of view of the tool when it runs, everything it sees under the mount point is a copy of whatever lives at the backing paths.

Overall, sandboxfs was a fun exercise and a great journey to learn more about Rust, FUSE and file systems, and macOS internals: I got to learn Rust.
I was lucky to find a random coworker at Google that offered to review my code, and his input was an invaluable learning resource for me. I got to learn about FUSE in quite a bit of detail. I had already played with it before, but by working on sandboxfs, I had to debug some gnarly problems. I got to experience rewriting pre-existing Go code in Rust (because the original sandboxfs implementation was in Go). This was an enlightening exercise because, as I tried to convert the code “verbatim”, I discovered many subtle concurrency bugs and data races that Rust just didn’t let me write. The initial performance evaluation of using sandboxfs for real iOS builds showed promise : I observed that a specific iOS app “only” got a 55% performance penalty when using sandboxfs instead of the 270% penalty it got from symlink forests. A good win, but insufficient to justify enabling sandboxing by default. Many things really. Let’s start with wrong assumptions: Symlink forest creation may not have been the biggest problem in sandboxing performance. As I mentioned in the opening, microbenchmarking this area of macOS didn’t show obvious slowdowns and building Bazel with itself didn’t show major performance differences with and without sandboxing. But iOS builds suffered massively from sandboxing, and the problem was elsewhere: the Objective C and Swift compilers cache persistent state on disk, and sandboxing was preventing such state from actually being persisted. The need for sandboxing on interactive builds was questionable. Yes, it’d have been neat to have it, but in practice, the benefits are little: if your CI builds are powered by remote execution, which tends to happen when you use Bazel, then the implicit sandboxing of remote execution gives you almost all protections that you’d get from using sandboxing. There were also implementation problems: The original implementation of sandboxfs was written in Go, and I hit performance issues with the way bazil/fuse dealt with FUSE operations. The previous was fixed by rewriting sandboxfs in Rust, but then I hit performance problems with the JSON-based RPC interface that sandboxfs had grown in a rush. Fixing this properly required a deep redesign to use path compression and to bypass JSON altogether. But I didn’t get to this because… Kernel bugs / limitations in OSXFUSE erased the possibility of implementing a critical performance optimization. And then I also hit unexpected changes in the ecosystem: Apple deprecated kernel extensions , making the use of FUSE really convoluted and its future uncertain. Apple provided alternate APIs to implement file systems in user space, but those were designed for iCloud-style services and were/are not suitable for sandboxfs. At around the same time in 2019, OSXFUSE went closed source . This meant that relying on it for any future work was not well-advised. There were still code dumps for older versions, but that was not something I was able to maintain. Because of the previous two, I would have had to expose the sandboxfs virtual file system over NFSv4 instead of FUSE. Buildbarn’s bb-clientd provides a dual FUSE/NFSv4 implementation, which proves that this is technically doable, but adding an NFSv4 frontend to sandboxfs meant having to rewrite it from scratch. Plus I’m not sure we’d have gotten good-enough performance if we went this route. 
At that point in mid-2019, given the other problems illustrated above… I had neither the interest nor the time to rewrite sandboxfs “correctly” (remember, this was a 20% project at first, which unsurprisingly turned into a 120% project). It’d have been nice to do though, because “now I know how to do it right”. I still believe that Bazel needs something like sandboxfs for efficient sandboxed builds. As I mentioned earlier, creating symlink forests does not scale for action execution, and with ever-growing toolchain sizes, the problem is getting worse over time. However, the benefits of local sandboxing are unclear if you are already using remote execution. That said, people keep complaining about poor Bazel sandboxing performance on macOS, which means there still is a clear user need to make this better. And I’m not convinced the various “workarounds” that have been tried in this area (like reusing sandboxes) are sound designs, nor that they can actually deliver on their promise. In my case… I don’t run Bazel on Mac anymore at work. What’s more: I do not even use a Mac for personal reasons these days, which means my ulterior motive to use sandboxfs in pkg_comp is gone. But! If you wanted to recreate sandboxfs from scratch, let’s talk! You made it to the end! Take a moment to subscribe to Blog System/5. It means a lot to me, and you’ll get more content like this!

0 views
Blog System/5 7 months ago

Beginning 3D printing

Hello readers and sorry for the 2-month radio silence. I’ve been pretty busy at work, traveling during school breaks, hacking on EndBASIC when time permitted, and… as of two weeks ago… tinkering with 3D printing as a complete beginner. So, today, I’d like to walk you through the latter because it has been a really fun and rewarding journey, albeit frustrating at times. Prusa i3 MK3S+ in its original container box. You’d think that to use a 3D printer, you’d design a 3D model and then… just… send it to the printer? That’s almost true, but it ignores the realities of producing a physical object from an “abstract” model: when designing such a model, you need to take into account the limitations of 3D printing and you need to translate your model into something the 3D printer can understand via a process called slicing . Let’s take a brief peek at all of these steps. I’ll assume you are a complete beginner like I am. The pictures I’ll show are all for a “first project” I did to remake the bars of a bird cage I have, as the birds had fully destroyed the previous ones. The very first step in printing a 3D object is to create the model of what you want to print, of course. You might think that this is trivial, but there are two difficulties: the first lies in choosing and using the software, and the second lies in the physical constraints of 3D printing. As for the software part, you’ll need a CAD program, and there are many to choose from. Here are the ones I considered: FreeCAD is the free and open source solution. This is the first program I reached for given my preferences to favor free software but… oh my: if you have ever thought that the GIMP ’s UI was difficult, you are in for a shock here. FreeCAD’s UI is not beginner friendly at all and it’s also not high-DPI friendly: the buttons that show up in my monitor are tiny , which brings a new meaning to hunt-and-peck. It seems extremely comprehensive though. Simple FreeCAD project showing the design of a bar for a bird cage. Fusion 360 is Autodesk’s answer to 3D modeling. This is a product I did not know about: I had never done “Computer Assisted Design” before, and the only times I heard of this term were in the context of AutoCAD by the same company, so I was a little surprised to see that they have another flagship brand for 3D design. As it turns out, Fusion 360 is free for non-commercial use and for hobbyist use if your revenue is less than 1K a year. I chose to steer clear of this product because I did not want to be bound by these terms and I did not want to be tied to Windows or macOS. TinkerCAD is another product from Autodesk, but this one is completely free and available as a web application. TinkerCAD is really well made and it is beginner friendly: after just a couple of minutes (literally), I was up and running designing my first model, and I have watched my kid come up with cool objects on his own with close to zero instructions. Unfortunately, as I made progress in my designs, I started suffering from its simplicity and by now I regret not having spent the time to learn FreeCAD from the get go. Simple TinkerCAD project showing the design of a bar for a bird cage. OpenSCAD is a scriptable CAD application where you write code to generate your model. Interestingly, I only came across this program because KDE’s Discover app mentioned it to me when I searched for “CAD”, not because I saw it recommended in any 3D printing-related forums. 
If you have ever used POV-Ray before, you know what this is about, and to be honest, the idea behind scripting the models sounds really tempting. But that’s where I left it. There are several more options out there, so go explore them if none of these satisfy you. My personal suggestion is that you start with TinkerCAD to quickly get something out of your printer and scratch your itch. But as soon as you get into designing anything “moderately complex”, watch a couple of introductory videos for FreeCAD to rip the band-aid off and use a real application. I’ve started doing that now and the “parametric” aspects of FreeCAD make me feel much more confident that my creations will work out and that they’ll not be messed up by me touching “the wrong mouse button”.

With software out of the way, let’s move to the fact that 3D models are just that: conceptual models that only exist on the computer. When you want to bring those to reality, you need to account for the constraints that 3D printing brings. Here are some:

Layered plastic deposition. This concept is the key to designing something you can actually print. A 3D printer works by melting filament (a long string of plastic) and depositing such plastic horizontally on top of another surface. Objects are printed layer by layer, starting from the layer that touches the bed and moving up the Z axis. Which means that… some shapes are impossible to print! If your model has any overhang larger than maybe a couple of millimeters, you can’t print it—unless you add superfluous support structures that need to be removed later.

Colors… or lack thereof. While it is certainly possible to print objects with multiple colors on them, the printer add-on to auto-switch colors is pretty expensive, and even if you buy that, you’ll likely be bound to a limited set (4, maybe 8) anyway. This means you have to design your model as separate pieces and perform some post-processing if you need multiple colors. You can combine different colors in one print if they are isolated to different layers though—but as you can imagine, manually switching colors half-way through a print is going to be annoying.

Warping. Matter shrinks as it cools, and you may end up in a situation where your print shrinks and warps as it cools down. This has happened to me a few times already and it was obviously annoying and not something I was expecting. You need to be aware that this can happen and design your model accordingly. I’ve seen various suggestions online but haven’t put them into practice yet, so I have nothing to suggest here.

These constraints are just the very minimum. I’ll leave you with this excellent 80-minute guide on designing for 3D printing written by Rahix, which covers these topics and more to try to get to a good print on the first try.

Let’s say you are done creating your 3D model. Can you send it to the printer? No! The printer doesn’t know anything about “3D models”: it only knows about Logo-like instructions—known as G-Code—that tell it how to operate on a layer-by-layer basis. The process of converting your 3D model to G-Code is called slicing and is performed by a slicer application. PrusaSlicer model view showing the imported bird cage bar model (in orange) with auto-generated support structures (in green).
The slicer takes the model as an input, “slices” through it on the horizontal plane to generate very fine layers, and then produces G-Code to make the printer operate the extruder (the word for the part that melts filament and deposits it on the printing plane) across the plane of every layer. The output of the slicer is a G-Code file which contains such instructions in detail, and this is the file that can be fed to the printer. This slicing step is very interesting and is also where you’ll typically mess things up (assuming you got a 3D printable design in the first place) because of the myriad parameters that exist. PrusaSlicer settings view in Expert Mode. There are a lot of settings. During slicing you’ll do things like: Laying out your model on the printing bed, possibly combining multiple objects to print them in one go and rotating them so that they can be printed. Adjusting settings like the infill, which tells the printer how much plastic to use in the “inner” parts of the object: more infill means a sturdier object, but also a more expensive and slower to print object. Asking the slicer to auto-generate a brim (an extra support structure on the base layer that increases bed adhesion), to add support structures for overhangs, or to add “mouse ears” to help with adhesion. Choosing the type of filament to use and its properties. Selecting the right printer (because the G-Code instructions are printer-specific, as you can imagine). Configuring whether you want to switch colors half-way through the print or not. And a long list of etceteras. The nice thing is that the slicer detects various problematic conditions and offers to resolve them before sending the output to the printer—but the bad thing is that it doesn’t detect everything, and some of the choices it makes are maybe-not-so-great. For example, if you ask the slicer to add support structures for overhangs, it may generate structures that you don’t like or that are harder to remove later on, whereas if you manually adjust the model to contain such structures, you have more control over the results. As for which pieces of software exist for slicing, every printer comes with its own slicer software. In my case, I’ve been using PrusaSlicer which is the one that matches my printer and is open source. And because PrusaSlicer is open source… you can imagine that other companies have taken it as the basis for their own printers, like Bambu Lab has done with their Bambu Studio . And finally, we come to the printing process itself. Once you send the G-Code to the printer, the printer starts by heating up the extruder and the bed. You need a hot extruder to melt the filament, and you need a hot bed to help the filament attach to the printing surface. The printer then performs mesh bed leveling (which not all printers do) to understand subtle variations in the height of the bed across its surface. And then the printer starts moving the extruder on the XY plane to generate the object one layer at a time. It is fascinating to watch; look: But… how do you send the G-Code to the printer? Well, it depends on the printer. If you choose to go the Bambu Lab route, which is “the Apple of 3D printers” as a coworker put it to me (hey Daniel!), it’s simple: you click a button from their slicing app and boom, the printer starts working. No need to worry about file transfers and no need to worry about calibration steps; it just works. 
And in fact, if you look for recommendations online, most people will point you towards choosing one of these printers:

Bambu Lab A1 Mini: Maximum print volume of 180 x 180 x 180 mm.

Bambu Lab A1: Maximum print volume of 256 x 256 x 256 mm.

But do you know how Bambu Lab’s magic operation happens? Via a cloud service, “of course”. And if the cloud service is down, good luck printing: from what I could find, it seems like these printers cannot operate without a network connection. And if the printer requires a cloud service, then you also face all the usual privacy and security (or lack thereof) problems, with possibly some risky ones (malicious G-Code trying to make the printer malfunction, maybe?). So, what are the alternatives? Well, there are a ton of them. Two machines from other vendors that pop up in almost all reviews are:

Prusa MK4S: This is the “entry level” machine from Prusa, another big brand in this space with similar price points to the well-regarded Bambu Lab machines.

Ender 3 V3: This is a much more affordable printer and comes from Creality.

Which one to pick? As a beginner, it’s difficult. I read many reviews online and, as I mention above, most people suggested “just” choosing Bambu Lab for its simplicity and beginner-friendliness to achieve good quality results. These reviewers highlighted that other machines require a lot of manual tinkering (which actually made them more appetizing to me) and also warned to steer clear of Creality due to quality and safety issues. And if you research deeper, you’ll start to realize that the way to “level up” in 3D printing is to build your own printer while 3D-printing some of its own pieces. In the end, I chose to go the Prusa route, but the high prices put me off: after all, I’m just getting started and I do not need to make a huge investment in a “hobby” that may not last, so I got a second-hand, lightly-used Prusa i3 MK3S+ for a fraction of the price of a new one.

And then the surprises started: the printer has zero network connectivity. This was a bit unexpected (I somehow assumed it’d offer local-only network printing), but it turned out to be great in the end because this means that there is zero cloud garbage involved in the printing process. The printer just has an SD card slot and a USB port, so I plugged the latter into a computer running PrusaSlicer and… nothing. PrusaSlicer did not have any option to actually print to the physically-attached printer. Weird. Prusa i3 MK3S+ next to the Mac Pro 2013 controlling it.

As it turns out, you need a print server to actually control the machine from its USB port. Researching this topic online will almost certainly convince you that you need OctoPi running on a Raspberry Pi in order to print. But… a Pi is just a computer, so whatever OctoPi does can also be done on a Linux machine—the Mac Pro I had already connected to the printer in the picture right above. And indeed that’s the case: once I installed OctoPrint, I could connect to the printer and drive it. (What’s more, it is possible to add a “physical printer” to PrusaSlicer, but all PrusaSlicer will do is embed the print server’s web interface.) PrusaSlicer printer view connected to the OctoPrint server. OctoPrint itself is “the CUPS of 3D printing”. It’s a piece of software that knows how to send the G-Code to the printer via its USB connection, but it’s also a system that queues print jobs, provides monitoring and G-Code inspection features, allows for timelapse recording via a webcam, and much more.
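For instance, once OctoPrint is driving the printer you don’t even need its web UI to queue a job: its REST API accepts G-Code uploads. A hedged sketch of what that looks like (the host name, API key, and file name are placeholders; check the OctoPrint API documentation for the details of your setup):

# Upload a sliced file to OctoPrint's local storage, select it, and start
# printing it right away.
curl -H "X-Api-Key: MY_OCTOPRINT_API_KEY" \
     -F "file=@birdcage-bar.gcode" \
     -F "select=true" \
     -F "print=true" \
     http://octoprint.local/api/files/local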
It feels like overkill, to be honest, but it does the job and there don’t seem to be any simpler alternatives, so that’s what it is. The very last thing to touch on regarding the printing process is the filament material. The basic filament type is PLA, and it seems like it’s the easiest one to get started with. There are alternatives of different quality and properties out there for different applications, but that’s a world I haven’t explored yet. Subscribe now

And with that, you should now have the very basic knowledge to start creating your own objects and feel like a Real Engineer (like I did). Keep in mind that you don’t actually need to own a 3D printer though: there are on-demand print services available that will ship the prints back to you for cheap—but there is no getting around learning CAD modeling within the constraints of 3D printing. I’d also recommend playing with the slicer software to understand the implications of certain choices in the model regarding printing limitations (the need for support structures and the like) and print times. In the end, I’m happy I got my own printer because of the various trial-and-error iterations I went through before getting some decent prints out.
TinkerCAD is another product from Autodesk, but this one is completely free and available as a web application. TinkerCAD is really well made and it is beginner friendly: after just a couple of minutes (literally), I was up and running designing my first model, and I have watched my kid come up with cool objects on his own with close to zero instructions. Unfortunately, as I made progress in my designs, I started suffering from its simplicity, and by now I regret not having spent the time to learn FreeCAD from the get-go. Simple TinkerCAD project showing the design of a bar for a bird cage. OpenSCAD is a scriptable CAD application where you write code to generate your model. Interestingly, I only came across this program because KDE’s Discover app mentioned it to me when I searched for “CAD”, not because I saw it recommended in any 3D printing-related forums. If you have ever used POV-Ray before, you know what this is about, and to be honest, the idea behind scripting the models sounds really tempting. But that’s where I left it. Layered plastic deposition. This concept is the key to designing something you can actually print. A 3D printer works by melting filament (a long string of plastic) and depositing that plastic horizontally on top of another surface. Objects are printed layer by layer, starting from the layer that touches the bed and moving up the Z axis. Which means that… some shapes are impossible to print! If your model has any overhang larger than maybe a couple of millimeters, you can’t print it—unless you add sacrificial support structures that need to be removed later. Colors… or lack thereof. While it is certainly possible to print objects with multiple colors on them, the printer add-on to auto-switch colors is pretty expensive, and even if you buy that, you’ll likely be bound to a limited set (4, maybe 8) anyway. This means you have to design your model as separate pieces and perform some post-processing if you need multiple colors. You can combine different colors in one print if they are isolated to different layers though—but as you can imagine, manually switching colors half-way through a print is going to be annoying. Warping. Matter shrinks as it cools, so you may end up with a print that warps as it cools down. This has happened to me a few times already and it was obviously annoying and not something I was expecting. You need to be aware that this can happen and design your model accordingly. I’ve seen various suggestions online but haven’t put them into practice yet, so I have nothing to suggest here. PrusaSlicer model view showing the imported bird cage bar model (in orange) with auto-generated support structures (in green). The slicer takes the model as an input, “slices” through it on the horizontal plane to generate very fine layers, and then produces G-Code to make the printer operate the extruder (the word for the part that melts filament and deposits it on the printing plane) across the plane of every layer. The output of the slicer is a G-Code file which contains such instructions in detail, and this is the file that can be fed to the printer. This slicing step is very interesting and is also where you’ll typically mess things up (assuming you got a 3D-printable design in the first place) because of the myriad parameters that exist. PrusaSlicer settings view in Expert Mode. There are a lot of settings.
During slicing you’ll do things like: Laying out your model on the printing bed, possibly combining multiple objects to print them in one go and rotating them so that they can be printed. Adjusting settings like the infill, which tells the printer how much plastic to use in the “inner” parts of the object: more infill means a sturdier object, but also a more expensive and slower to print object. Asking the slicer to auto-generate a brim (an extra support structure on the base layer that increases bed adhesion), to add support structures for overhangs, or to add “mouse ears” to help with adhesion. Choosing the type of filament to use and its properties. Selecting the right printer (because the G-Code instructions are printer-specific, as you can imagine). Configuring whether you want to switch colors half-way through the print or not.

Blog System/5 9 months ago

The next generation of Bazel builds

Today marks the 10th anniversary of Bazel’s public announcement, so this is the perfect moment to reflect on what the next generation of build systems in the Bazel ecosystem may look like. I write this with the inspiration that comes from attending the first ever conference on Buildbarn, one of the many remote execution systems for Bazel. In the conference, Ed Schouten, the creator of Buildbarn, presented Bonanza: a skunkworks reimagination of Bazel for truly large builds. In this article, I want to dive into what Bonanza is and what similar projects to “replace Bazel” have existed. To get there though, we need to start with a critique of the current implementation of Bazel. The predecessor to Bazel, Blaze, is a build system designed at Google for Google’s monorepo scale. Blaze grew over the years assuming: that every engineer had a beefy workstation under their desk; that remote execution was expected to be used by default; that the remote execution cluster was reachable through a fast and low-latency network; and that each office had physical hardware hosting a local cache of remote build artifacts. These assumptions allowed Blaze to “scale up” to the very large codebase that Google builds, but they came with some downsides. One consequence of these assumptions is that the Bazel process—confusingly named the “Bazel server”—that runs on your machine is very resource hungry. The reason is that this process has to scan the source tree to understand the dependency graph and has to coordinate thousands of RPCs against a remote cluster—two operations that aren’t cheap. What’s worse is that the Bazel server process is stateful: at startup, Bazel goes through the expensive steps of computing the analysis graph from disk and, to prevent redoing this work in every build, Bazel keeps this graph in its in-memory “analysis cache”. The analysis cache is fragile. The Bazel server process may auto-restart, and certain flags used to control the build cause the cache to be discarded. These are not rare flags, no: they include basic flags like the one to change the compilation mode from debug to release, among many others. Cache discards are very intrusive to user workflows because an iterative build that would have taken a second now takes maybe thirty, for example. This user experience degradation makes Bazel’s front-page claim of being fast hard to believe. Bazel is really fast at running gigantic builds from scratch and it is really efficient when executing incremental builds. But the problem is that “truly incremental builds” are a rarity, so you end up paying the re-analysis cost many more times than is necessary. If you run Bazel in a CI environment, you know that these costs are far from negligible because every single Bazel process invocation on a fresh CI node can take minutes to “warm up”. There is also the other side of the coin, which is that Bazel does not scale down very well. This was one of my original critiques when Bazel went open source in 2015: at that time, I really wished for a declarative build system like Bazel to replace the mess that was and is Make plus the GNU Autotools, but the fact that Bazel was written in Java meant that it would never do this (mostly for non-technical reasons). Regardless, I did join the Blaze team after that, and I spent most of my time coercing Blaze and Bazel to run nicely on “small” laptop computers. I succeeded in some areas, but it was a losing battle: Java has certain deficiencies that prevent implementing lean software.
Project Loom and Project Valhalla promise to bring the necessary features to Java, but these features aren’t quite there yet—and even when they are, retrofitting these into Bazel will be a very hard feat. In any case. Bazel works, and it works nicely for many use cases, but it lives in this limbo state where it isn’t awesome for very large builds and it isn’t awesome for very small builds either. So, let’s look at the former: how do we make Bazel awesome for humongous builds? By lifting it in its entirety to the cloud. Enjoying this article? If so, please subscribe to Blog System/5 to show your support. It’s free and you’ll appreciate future content! Bonanza is Ed’s playground for a new build system: a build system that takes remote execution to the limit. Where Bazel is only capable of shipping individual build actions to the cloud, Bonanza uses the cloud for everything , including external dependency handling, build graph construction and iteration, and more. To understand what Bonanza brings to the table in the context of hugely scalable builds, let’s distill the salient points that Ed highlighted in his “Bonanza in a nutshell” slide from his conference presentation (recording coming soon): Bonanza can only execute build actions remotely. There is no support for local execution, which makes the build driver (the process that runs on your machine) simpler and eliminates all sorts of inconsistencies that show up when mixing local and remote execution. Bazel’s execution strategies tend to enforce hermeticity, but they don’t always succeed because of sandboxing limitations. Bonanza performs analysis remotely. When traditional Bazel is configured to execute all actions remotely, the Bazel server process is essentially a driver that constructs and walks a graph of nodes. This in-memory graph is known as Skyframe and is used to represent and execute a Bazel build. Bonanza lifts the same graph theory from the Bazel server process, puts it into a remote cluster, and relies on a distributed persistent cache to store the graph’s nodes. The consequence of storing the graph in a distributed storage system is that, all of a sudden, all builds become incremental. There is no more “cold build” effect like the one you see with Bazel when you lose the analysis cache. Bonanza runs repo rules remotely. Repo rules are what Bazel uses to interact with out-of-tree dependencies, and they can do things like download Git repositories, toolchain binaries, or detect what compiler exists in the system. What you should know is that Blaze did not and does not have repo rules nor support for workspaces because Google uses a strict monorepo. Both the repo rules and the workspace were bolted-on additions to Blaze when it was open-sourced as Bazel, and it shows: these features do not integrate cleanly with the rest of Bazel’s build model, and they have been clunky for years. Bonanza fixes these issues with a cleaner design. Bonanza encrypts data in transit and at rest. Bonanza brings to life some of the features discussed for the Remote Execution v3 protocol, which never saw the light of day, and encryption is one of them. By encrypting all data that flows through the system, Bonanza can enforce provenance guarantees if you control the action executors. This is important because it allows security-conscious companies to easily trust using remote build service providers . Bonanza only supports rules written in Starlark. 
When Bazel launched, it included support for Starlark: a new extensibility language with which to write build logic. Unfortunately, for historical reasons, Bazel’s core still included Java-native implementations of the most important and complex rules: namely, C++, Java and protobuf. Google has been chasing the dream of externalizing all rule implementations into Starlark for the last 10 years, and only in Bazel 8 have they mostly achieved this goal. Bonanza starts with a clean design that requires build logic to be written in Starlark, and it pushes this to the limit: almost everything, including flags, is Starlark. Bonanza aims to be Bazel compatible. Of the modern build systems that use a functional evaluation model like Bazel, only Bazel has been able to grow a significant community around it. This means that the ecosystem of tools and rules, as well as critical features like good IDE support, is thriving in Bazel whereas this cannot be said of other systems. Bonanza makes the right choice of being Bazel compatible so that it can reuse this huge ecosystem. Anyone willing to evaluate Bonanza will be able to do so with relative ease. When you combine all of these points, you have a build system where the client process on your development machine or on CI is thin: all the client has to do is upload the project state to the remote execution cluster, which in the common case will involve just uploading modified source files. From there, the remote cluster computes the delta of what you uploaded versus any previously-built artifacts and reevaluates the minimum set of graph nodes to produce the desired results. Are your hopes up yet? Hopefully so! But beware that Bonanza is just a proof of concept for now. The current implementation shows that all of the ideas above are feasible and it can fully evaluate the complex bb-storage project from scratch—but it doesn’t yet provide the necessary features to execute the build. Ed appreciates PRs though! My time at Google is now long behind me, as I left almost five years ago, but back then there were two distinct efforts that attempted to tackle the scalability issues described in this article. (I can’t name names because they were never public, but if you probe ChatGPT to see if it knows about these efforts, it somehow knows specific details.) One such effort was to make Blaze “scale up” to handle even larger builds by treating them all as incremental. The idea was to persist the analysis graph in a distributed storage system so that Blaze would never have to recompute it from scratch. This design still kept a relatively fat Blaze process on the client machine, but it is quite similar to what Bonanza does. Google had advantages over Bonanza in terms of simplicity because, as I mentioned earlier, Blaze works in a pure monorepo and does not have to worry about repo rules. The other such effort was to make Blaze “scale down” by exploring a rewrite in Go. This rewrite was carefully crafted to optimize memory layouts, avoiding unnecessary pointer chasing (which was impossible to avoid in Java). The results of this experiment proved that Blaze could be made to analyze large portions of Google’s build graph in just a fraction of the time, without any sort of analysis caching or significant startup penalties. Unfortunately, this rewrite was way ahead of its time: the important rulesets required to build Google’s codebase were still implemented inside of Blaze as Java code, so this experimental build system couldn’t do anything useful outside of Go builds.
My knowledge of Meta’s build system is limited, but almost two years ago, Meta released Buck 2, a complete reimplementation of their original Bazel-inspired build system. This reimplementation checked most of the boxes for what I thought a “Bazel done right” would look like: Buck 2 is written in a real systems language (Rust). As explained earlier, I had originally criticized the choice of Java as one of Bazel’s weaknesses. It turns out Meta realized this same thing because Buck 1 had also been written in Java and they chose to go the risky full-rewrite route to fix it. (To be fair, you need to understand that Java had been a reasonable choice back then: when both Blaze and Buck 1 were originally designed, C++11—possibly the only reasonable edition of C++—didn’t even exist.) Buck 2 is completely language agnostic. Its core build engine does not have any knowledge of the languages it can build, and all language support is provided via Starlark extensions. This stems from learning about earlier design mistakes of both Blaze and Buck 1. Meta chose to address this as part of the Rust rewrite, whereas Google has been addressing it incrementally. Buck 2 has first-class support for virtual file systems. These are a necessity when supporting very large codebases and when integrating with remote build systems, but are also completely optional. Blaze also had support for these, but Bazel does not. At launch, I was excited and eager to give Buck 2 a try, but then the disappointment set in: as much as it walks and quacks like Bazel due to its use of Starlark… the API that Buck 2 exports to define rules is not compatible with Bazel’s. This means that Buck 2 cannot be used in existing Bazel codebases, so the barrier to evaluating its merits in a real codebase is… insurmountable. In my mind, this made Buck 2 dead on arrival, and it’s yet to be seen if Meta will be able to grow a significant public ecosystem around it. In any case, I do not have any experience with Buck 2 because of this, so I cannot speak to its ability to scale either up or down. And this is why I wrote this section: to highlight that being Bazel-compatible is critical in this day and age if you want to have a chance at replacing a modern system like Bazel. Bonanza is Bazel-compatible so it has a chance of demonstrating its value with ease. If you ask me, it seems impossible to come up with a single build system that can satisfy the wishes of tiny open-source projects that long for a lean and clean build system and that can satisfy the versatility and scale requirements of vast corporate codebases. Bazel-the-implementation tries to appeal to both and falls short, yet Bazel-the-ecosystem provides the lingua franca of what those implementations need to support. My personal belief is that we need two build systems that speak the same protocol (Starlark and Bazel’s build API) so that users can interchangeably choose whichever one works best for their use case: On the one hand, we need a massively scalable build system that does all of the work in the cloud. This is to support building monorepos, to support super-efficient CI runs, and to support “headless” builds like those offered by hosted VSCode instances. Bonanza seems to have the right ideas and the right people behind it to fulfill this niche. On the other hand, we need a tiny build system that does all of the work locally and that can be used by the myriad of open-source projects that the industry relies on.
This system has to be written in Rust (oops, I said it) with minimal dependencies and be kept lean and fast so that IDEs can communicate with it quickly. This is a niche that is not fulfilled by anyone right now and that my mind keeps coming to; it’d be fun to create this project given my last failed attempt. The time for these next-generation Bazel-compatible build systems is now. Google has spent the last 10 years Starlark-ifying Bazel, making the core execution engine replaceable. We are reaching a point where the vast majority of the build logic can be written in Starlark as Bonanza proves, and thus we should be able to have different build tools that implement the same build system for different use cases.

Blog System/5 10 months ago

Bazel at Snowflake two years in

Two and a half years ago, I joined Snowflake to help their mission of migrating to Bazel. I spent the first year of this period as an Individual Contributor (IC) diving deep into the migration tasks, and then I took over the Tech Lead (TL) role of the team to see the project through to completion. This week, we publicly announced that we completed our migration to Bazel for the largest part of our codebase and we provided details on our journey. I did not publish that article here for obvious reasons, so… today’s entry is going to be a light one: all I want to do is point you at our announcement as well as the various other related articles that came before it. Don’t despair though: those articles, including the announcement, are all full of technical details—just like the kind of content you expect to receive from Blog System/5. So, even if this piece is light, you have enough reading material for the weekend via the links below. Subscribe now “Addressing Bazel OOMs” March 16th, 2023 This was the very first article that we in the Engineering Systems organization—previously known as Developer Productivity Engineering—wrote publicly about our work. Me writing it was no coincidence, as I was the one advocating for more openness about the cool stuff we were doing. Yes, I missed blogging about Bazel as I had done in the years prior. In this article about Out-Of-Memory (OOM) conditions, I covered the very interesting problem of trying to fit Bazel builds into limited laptop resources. This was déjà vu for me: back when I was in the Blaze team at Google, I owned the same problem of making Blaze, a tool that had grown assuming massive workstations, run decently on machines with limited resources. The problems were technically challenging and worth talking about, hence this article. In it, I covered three issues: preventing Bazel from spawning too many memory-hungry compilers and linkers at once; making nested builds behave nicely; and tuning Bazel to drive a large number of remote tasks with limited memory resources. “Analyzing OOMs in IntelliJ with Bazel” October 6th, 2023 Our saga dealing with OOMs was arduous and long, which you can tell by the time gap between the previous article and this one. In this piece, I looked at how IntelliJ itself was running into memory limits, which resulted in the IDE and its container VM freezing during normal operation. The result of this work was careful tuning of the Bazel project settings to make it fit within reasonable limits, but the interesting part—and the one described here—was the process to arrive at those findings. Not too long after I wrote this, we pivoted away from constrained laptop builds to cloud-based workstations, which made all OOM conditions vanish at the expense of using much more RAM (maybe too much RAM, but alas… it’s cheap). Stay tuned for an upcoming article (not from me this time!) on this topic. “Build farm visualizations” October 20th, 2023 As part of our migration to Bazel, we didn’t just convert our build from one tool to another. We also decided to deploy our own remote execution cluster from the get-go based on Buildbarn which… gave us its own set of problems. Buildbarn’s architecture is straightforward on paper, but there are a ton of knobs to control how it runs. Making it scale to the huge volume of traffic we experience was no easy feat. One specific problem we faced was overall poor performance of our cluster, which was eventually root-caused to our remote execution workers using slow local storage volumes.
This issue had escaped us for a while, and it wasn’t until I wrote a tool to visualize the cluster behavior that it became obvious. From there, the solution was easy. Wanna know more? We are hosting a 1-day conference next week on Buildbarn specifically. See the schedule and sign up if you can make it! “Fast and Reliable Builds at Snowflake with Bazel” March 13th, 2025 And finally, the crown jewel. This is the official article, published just yesterday, where I present the 2-year journey of our migration. In it, I explain the challenges that we faced with our C++ and Java codebases specifically, the choices behind our use of remote execution, and the path we took to production. I conclude with a glimpse of what lies ahead of us. That’s all for today. I promised it would be short :)

Blog System/5 10 months ago

Hardware discovery: ACPI & Device Tree

If you grew up in the PC scene during the 1980s or early 1990s, you know how painful it was to get hardware to work. And if you did not witness that (lucky you), here is how it went: every piece of hardware in your PC—say a sound card or a network card—had physical switches or jumpers in it. These switches configured the card’s I/O address space, interrupts, and DMA channels, and you had to be careful to select values that did not overlap with other cards. Portion of a Sound Blaster Pro ISA card focusing on the jumpers to configure its I/O settings. But that wasn’t all. Once you had configured the physical switches, you had to tell the operating system and/or software which specific cards you had and how you had configured them. Remember BLASTER? This DOS environment variable told programs which specific Sound Blaster you had installed and which I/O settings you had selected via its jumpers. Not really fun. It was common to have hardware conflicts that yielded random lock-ups, and thus ISA “Plug and Play”, or PnP for short, was born in the early 1990s—a protocol for the legacy ISA bus to enumerate its devices and to configure their settings via software. Fast-forward to today’s scene where we just attach devices to external USB connectors and things “magically work”. But how? How does the kernel know which physical devices exist and how does it know which of the many device drivers it contains can handle each device? Enter the world of hardware discovery. If this sounds interesting, please consider subscribing to Blog System/5! It’s completely free but you can also choose to donate to keep me writing! When you learn about the Von Neumann architecture in school, you are typically told that there is a CPU, a chunk of memory, and… “I/O devices”. The CPU and memory portions are where all the focus is put and the I/O devices portion is always “left to the reader”. However, there is a lot of stuff happening in that nebulous cloud. The first question that arises is: what’s in that I/O cloud? Well, take a look: The Windows 2000 Device Manager showing the devices by their connection, not by their type. I've chosen to show you this configuration of a virtual machine instead of what a current Windows 11 system shows because the older view is simpler to digest. Whoa, that’s a lot of stuff, but we can classify the items in the “nebulous cloud of I/O devices” into two categories: The devices themselves, obviously. The busses that connect those devices to the CPU. Both are important: you might have a fancy keyboard with extra keys that requires a special driver, and this keyboard might come in PS/2 and USB versions. The driver for the keyboard may be the same for each version, but the “glue” that attaches this keyboard to either bus is different, and the way the kernel can tell whether the keyboard is attached to one port or another also differs. So how does the kernel know how to find hardware without tons of repeated code for every bus, you ask? It does so via its knowledge of the hardware topology. Just above, I showed you Windows’ view of this, but for the rest of this article, I’ll use the BSD internals (and NetBSD specifically) because that’s what I know best. Don’t let that put you off though: all kernels have to do something similar and the differences among them are likely not meaningful. Here is a little snippet of the default NetBSD kernel configuration file for the amd64 platform.
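The entries in question look roughly like the following. Note that this is my reconstruction from memory rather than a verbatim copy of the GENERIC file, so use it only to follow the explanation and consult the linked configuration for the authoritative details:

```
com0    at isa? port 0x3f8 irq 4    # standard PC serial ports...
com1    at isa? port 0x2f8 irq 3    # ...at their legacy ISA locations
isa0    at mainbus?                 # ISA as a directly-addressable bus
isa0    at pcib?                    # ISA behind a PCI-to-ISA bridge
com*    at acpi?                    # serial ports described by ACPI
puc*    at pci? dev ? function ?    # PCI "universal" communication cards
com*    at puc? port ?              # serial ports on those cards
```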
This snippet lays out the topology of serial ports in the PC and the busses in which they may appear: Daunting if you have never seen anything like this, I know, but let me translate this to a diagram: Representation of how the com device is expected to appear under various busses according to the kernel configuration file. Much clearer in picture form, right? What this chunk of configuration does is tell the kernel the places where the driver can find serial ports. We have a chunk that says that and can appear on the ISA bus at specific I/O addresses and interrupts, and in turn that the ISA bus may be a directly-addressable physical bus ( ) and/or an ISA bus exposed via a PCI bridge ( ). Then, we have additional entries telling us that the serial ports can also be configured via ACPI ( ), and that the serial ports may exist on expansion cards ( ) providing communication ports via the PCI bus ( ). The problem is: the kernel configuration tells us what may exist , not what actually exists on a machine. In a sense, the configuration file “wires” the code of the device drivers like so that they can find devices that appear under the , , or busses. But the kernel must still, at runtime, check and see where the devices actually are. How does that happen? To answer the question of how the kernel discovers which hardware is present and where it is, let’s dissect NetBSD’s autoconf(9) manual page: Autoconfiguration is the process of matching hardware devices with an appropriate device driver. In its most basic form, autoconfiguration consists of the recursive process of finding and attaching all devices on a bus, including other busses. From this paragraph, we can extract the following: the kernel contains a collection of device drivers (like the presented earlier). Device drivers are just code that knows how to interact with specific devices, but the “location” of these devices in the hardware topology may vary (the “bindings” to , , and in the earlier example). Moving on: The autoconfiguration framework supports direct configuration where the bus driver can determine the devices present. The autoconfiguration framework also supports indirect configuration where the drivers must probe the bus looking for the presence of a device. Direct configuration is preferred since it can find hardware regardless of the presence of proper drivers. Direct and indirect configuration. Hmm. This sounds like the PnP story, and it kinda does. See, pay close attention to these two lines from the earlier snippet: These are the BSD equivalent of the command for DOS I mentioned in the introduction: they tell the kernel which precise addresses and interrupts to use for the two standard PC serial ports if an ISA bus is present . But what about and ? These lines are neat because they do not tell us, in advance, where to find the serial ports: we expect the kernel to discover those details at runtime so that we don’t have to recompile the kernel when the hardware changes. But even if these two look similar, they are quite different: the is an indirect configuration entry: the driver will have to probe the PCI bus for the presence of a communications card and, if one exists, tell the driver that it can attach to it. On the other hand, the entry is direct: the kernel will read the ACPI configuration (a static table) to know where the ports are and then use those details to configure the driver. Alright, so this raises another question. What is ACPI? 
ACPI, despite being declared with in a form similar to , is not a bus: ACPI does not physically connect devices to one another. To understand what ACPI does, we can start by realizing that it stands for Advanced Configuration and Power Interface and then, quoting the Wikipedia article: Advanced Configuration and Power Interface (ACPI) is an open standard that operating systems can use to discover and configure computer hardware components, to perform power management (e.g. putting unused hardware components to sleep), auto configuration (e.g. Plug and Play and hot swapping), and status monitoring. ACPI is about configuration, and the kernel uses the ACPI tables, present in any modern PC, to find where devices are. To illustrate how this works, let’s look at the line. This line says that the serial port can be configured via ACPI if ACPI happens to have an entry for it. And you know, we can peek into the ACPI tables of a running machine to see what that might be: (Pro-tip: you can use to extract the Windows license key bound to your machine, if any. I’ve needed to do this in the past to install Windows in a VM after replacing the host OS with FreeBSD.) Voila. The ACPI tables provided by the system tell us a similar story to what the explicit entry did (and no surprise here because this is a legacy device): there is a serial port at base address 0x3f8 that uses interrupt 4. But also, this table tells us the hardware identifier for this entry: . Grepping through the NetBSD kernel code base for this identifier, we land on the dev/acpi/com_acpi.c file: (Another pro-tip: master ripgrep . Knowing how to find a needle in the haystack of a large code base will grant you super-powers among your coworkers. Being able to pinpoint where specifically to start an investigation based on a “random-looking” string is invaluable.) Aha! This file provides the necessary glue to direct the generic serial port driver to the hardware via whatever the ACPI tables prescribe. From here, the kernel can proceed to attach the driver to the device and connect the dots between the user-space interface to the physical serial port. In the world of embedded devices powered by SOCs—" System on a Chip ", a term that describes single chips that provide all functions to build a computer, ranging from the CPU to sound and network cards—we don’t have ACPI tables. What we used to have instead was explicit code for every board/chip combination that knew how to address the hardware in each SOC. Linux used to be a mess of half-baked and abandoned forks, each supporting a different board without hopes of unification into mainline due to the lack of generic interfaces. The Device Tree specification fixed this issue for the most part for architectures like aarch64. With Device Tree, each hardware or OS vendor provides a static table that describes the layout of a board’s hardware separately from the code of the kernel. Then, the kernel peeks into this table to know which devices exist and where they are in the machine layout, much like it does with ACPI. A big difference with ACPI, however, is that the kernel cannot query the Device Tree from the hardware because… well, the Device Tree is external to the hardware. The kernel expects the Device Tree to “exist in memory” either because the kernel image embeds the Device Tree for the target board or because the boot loader loads the Device Tree from disk and passes it to the kernel. 
Once the kernel has the Device Tree though, the hardware discovery process is similar to the one in ACPI: the kernel scans the Device Tree and looks for drivers that can attach to device nodes based on hardware identifiers. From the perspective of the kernel configuration, things look very similar between amd64 and aarch64. See this snippet from the generic kernel of the evbarm port: This configuration snippet tells the aarch64 kernel that a device may appear on , which stands for “Flat Device Tree”. In turn, says that there is a specific driver named that provides access to the loaded by the boot loader on an ARM machine. We can inspect the Device Tree from the command line, as the Device Tree is exposed via the same interface that OpenFirmware used due to its historical roots. For example, to fetch the portion of the Device Tree for the serial port on an aarch64 machine: A binary dump. OK, fine, we can intuit something out of this, but it isn’t particularly clear. The problem here is that we are looking at the binary encoding of the Device Tree (the DTB). But the DTB is built from a set of corresponding source files (one or more DTS files), and if we look at the common DTS for Broadcom 283x boards, we find the following more-readable content: The detail to highlight here is the identifier. If we search for this in the code base with the ripgrep super-powers you gained earlier, we find the arch/arm/broadcom/bcm2835_com.c file, which contains: Once again: we found the glue that connects a generic driver to a specific hardware device. Let’s dig a bit further though. I mentioned earlier that the boot loader is responsible for loading the Device Tree into memory and passing it to the kernel. How is that done? Well, it really depends on the specific machine you are dealing with. Here, I’m just going to very briefly touch upon how the Raspberry Pi does it because that’s the specific non-PC hardware I have access to. And for this, I’ll take you through the investigative journey I took. The specific problem I faced was that NetBSD was not able to discover the SPI bus even when I had enabled the right SPI driver in the kernel for my Raspberry Pi 3B. By that point, I was aware that DTBs existed and I suspected that something might be wrong with them, so my first instinct was to check and see what the DTB had to say about the SPI. Digging through the FAT partition of the disk image I was using, I found the file—the DTB for my specific board. The way the Raspberry Pi boot loader finds this file is by looking for a file matching the board’s own name ( ) in the location specified by the os_prefix configuration property in the placed at the root of the FAT partition. Once I found that file and after learning about the tool (the Device Tree Compiler), which transforms DTS files into DTBs and vice versa, I could decompile the DTB back into a readable DTS: Then, peeking into the decompiled DTS file, I found: I verified that the SPI kernel driver recognized the identifier just as we did earlier on for the serial port. And it did match, so I was puzzled for a moment. But then I noticed the innocuous line. Aha! The SPI device was disabled by default. I modified that line following what other entries in the DTS did, recompiled the DTS into the DTB: … rebooted the board and… voila! The kernel successfully attached the SPI driver and the SPI bus started working, which in turn led to a multi-hour debugging session to make the EndBASIC ST7735S driver work—but in the end it did.
Yes, there are nicer ways to do what I did here because the DTBs are provided by upstream and you should not be modifying them. What you should do instead is create a DTB overlay, which is a separate small DTB that “patches” the upstream DTB, and then tell the boot loader to process it via the stanza in the file. Details left to you, reader. Just beware that the Raspberry Pi boot loader is picky about file paths and the documentation is your friend here. Based on everything I told you about here, ACPI and Device Tree look oddly similar—and that’s because they are! From the perspective of describing hardware to the kernel, the two technologies are equivalent, but they differ for historical reasons. ACPI has its roots in APM, a PC technology, whereas Device Tree is based on OpenFirmware, a technology that originated at Sun Microsystems for its SPARC machines and that was later used by Apple on their PowerPC-based Macs. One difference between the two, though, is that ACPI does more than just describe hardware. ACPI provides a bunch of hardware-specific routines in a bytecode that the operating system can call into to manipulate the hardware. This is often maligned because ACPI introduces non-free and opaque blobs in the interaction between the operating system kernel and the hardware, but Matthew Garrett has a great essay on why ACPI is necessary and why it is better than all other alternatives, possibly including Device Tree. In any case, that’s all for today. I found this exercise of dealing with the Device Tree pretty fun and I thought I could share something interesting with you all. I intentionally omitted many details because the topic of hardware configuration is vast and tricky, but you can continue building your knowledge from the bits above and from the fabulous OSDev wiki. And as always, if any of this was interesting, subscribe to Blog System/5 now. You’ll receive more content like this and you’ll support my future writing!

Blog System/5 11 months ago

ioctls from Rust

In Unix-like systems, “everything is a file and a file is defined as a byte stream you can , from, to, and ultimately ”… right? Right? Well, not quite. It’s better to say file descriptors provide access to almost every system that the kernel provides, but not that they can all be manipulated with the same quartet of system calls, nor that they all behave as byte streams. Because you see: network connections are manipulated via file descriptors indeed, but you don’t them: you , / and/or to them. And then you don’t from and to network connections: you somehow to and from them. Device drivers are similar: yes, hardware devices are represented as “virtual files” in the hierarchy and many support and … but these two system calls are not sufficient to access the breadth of functionality that the hardware drivers provide. No, you need . is the poster child of the system call that breaks Unix’s “everything is a file” paradigm. is the API that allows out-of-band communication with the kernel side of an open file descriptor. To see cool examples, refer back to my previous article where I demonstrated how to drive graphics from the console without X11: in that post, we had to the console device, but then we had to use to obtain the properties of the framebuffer, and then we had to the device’s content for direct access: no s nor s involved. All the code I showed you in that earlier post was written in C to keep the graphics article to-the-point, but the code I’m really working on is part of EndBASIC, and thus it is all Rust. And the thing is, s are not easy to issue from Rust. In fact, after 7 years of Rust-ing, it’s the first time I’ve had to reach for code blocks, and there was no good documentation on how to deal with . So this post aims to fix that by presenting the ways there are to call s from Rust… and, of course, diving a bit deeper into what s actually are. Blog System/5 is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. For all examples below, I’ll be using a relatively simple from NetBSD’s wsdisplay(4) driver. This API is available via the console device file, typically , and is named . Here is what the manual page has to say: Calling this API from a C program would be trivial and look like this: The reason I’m picking specifically to talk about s in this article is three-fold: it returns a small structure with platform-dependent integers, so the sample code in the article will be relatively short; it relies on platform-specific integer types, so we’ll have to account for that in Rust; and it is “rare enough” (we are talking about NetBSD after all) that it is not going to be supported by any of the common Rust crates like libc or nix, so we’ll have to do extra work to call it. The manual page helpfully provides us with a copy of the data structure returned by the , and the BSD manual pages are typically awesome, but it’s worth double-checking that the code snippet actually matches the code it documents. Peeking into as the manual page directs us, we find: OK, great, the structure perfectly aligns with the manual page contents. But what’s more interesting is the , which says that is: an ioctl that reads from the kernel ( ), that invokes function number 65 from the class (which probably stands for “the W scons device driver”), and that places the returned data into a structure of type (not to be confused with ).
In a way, this is just like any other function or system call, except that it’s not defined as such and is instead funneled through a single API. is, therefore, “just” a grab bag of arbitrary functionality, and what can be invoked on a given file descriptor depends on what the file descriptor represents. The reasons for this design are historical and, of course, there could have been other options. For example: you know how regular files have an internal structure, right? The vast majority of file formats out there contain a header, which then specifies various sections within the file, which then contain data. The same could have been done with device drivers: their virtual files could have predefined some internal format such that, e.g., the structure always appeared at offset 0x1000 of the virtual file. and would have been sufficient for this design, although you’d almost certainly have wanted to combine it with mmap for more efficient access. Or another example: device drivers could have used an RPC-like mechanism where each write to the file descriptor is a “message” that requests a specific function, and that the kernel responds to with an answer. and would have been sufficient for this design. Or yet another example: the requests to the device driver could have been intermixed with the data, such that if the data contained a specific sequence, the kernel would invoke a function instead of processing data. Sounds crazy, right? But that’s what pseudo-terminals do: all those control sequences to change colors and the like are telling the terminal driver to do something special. In any case, these are all alternate designs and… I’m sure they all live in some form or another in current systems. There is no consistency in how pseudo-files expose their behavior, and s are just one of the options we have to deal with. So without further ado, let’s look at three different ways of calling these services from Rust. The first option to call an from Rust is to leverage the neat nix crate, which provides idiomatic access to Unix primitives. This crate is not to be confused with NixOS, with which it has no relation. To use nix to invoke s, we need to do two things. First, we need to define the data structure used by the . In C, we would just , but in Rust we don’t have access to the C-style headers. Instead, we have to do extra work to define the same memory layout of the C structure, but in Rust: It is very important to declare the structure as having a C representation so that its memory layout matches what the C compiler produces for the same structure. The kernel expects C semantics at its system call boundary, and we must adhere to that. Additionally, we must ensure that the types of each field match the C definitions. Rust only has fixed-size integer types like and , but C provides platform-dependent integer types like or , and these are sometimes used in public kernel interfaces (a mistake, if you ask me). Fear not, though: the module provides aliases for those C types. And second, we have to do something unique to the nix crate: we have to define a wrapper function for the so that we can invoke the as if it were any other function.
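To make those two steps concrete, here is a rough, self-contained sketch of how they can look. The field names come from the wsdisplay(4) manual page excerpt above; the struct name, the wrapper name, and the /dev/ttyE0 device path are assumptions of mine, so adjust them to whatever your system actually uses (and note that, depending on the nix version, you may need to enable its ioctl feature in Cargo.toml):

```rust
use std::fs::File;
use std::os::raw::c_uint;
use std::os::unix::io::AsRawFd;

use nix::ioctl_read;

// Rust mirror of NetBSD's `struct wsdisplay_fbinfo`.  The C representation
// makes the memory layout match what the kernel expects, and `c_uint` stands
// in for the platform-dependent `u_int` used by the C header.
#[repr(C)]
#[derive(Debug, Default)]
struct WsdisplayFbinfo {
    height: c_uint,
    width: c_uint,
    depth: c_uint,
    cmsize: c_uint,
}

// Mirrors `_IOR('W', 65, struct wsdisplay_fbinfo)`: group 'W', function
// number 65, reading a `WsdisplayFbinfo` back from the kernel.
ioctl_read!(wsdisplayio_ginfo, b'W', 65, WsdisplayFbinfo);

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed console device; /dev/ttyE0 is the usual wscons terminal on
    // NetBSD, but yours may differ.
    let console = File::open("/dev/ttyE0")?;

    let mut info = WsdisplayFbinfo::default();
    // The generated wrapper is unsafe because the kernel writes through the
    // raw pointer we hand to it.
    unsafe { wsdisplayio_ginfo(console.as_raw_fd(), &mut info) }?;

    println!("{}x{} pixels at depth {}", info.width, info.height, info.depth);
    Ok(())
}
```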
nix makes this very easy by providing macros that mimic the syntax of the C we saw earlier on: And with that, we are ready to put everything together in a fully-fledged program: There is one surprising detail in this code though: if we went through the hassle of defining a wrapper function for via the idiomatic nix crate, and idiomatic nix usage doesn’t require blocks… why did we have to wrap the call to in an block? The reason may be that can do whatever to the running process and Rust needs to be over-conservative. In any case, the above is clean and it works. But… using nix comes with a cost: We have pulled 5 crates into the project just to open a file and invoke an . Not a huge deal in this day and age but… this contributes to the perception that the Rust ecosystem is a mess of bloated dependencies. Can we do differently? What if we bypassed nix altogether and invoked libc directly? After all, we can see that nix depends on libc anyway, so we might as well use it at the expense of losing nix’s idiomatic representation of Unix’s interfaces. Sure, we can do that: we can invoke the function directly, which has this prototype: Alright then: we need a file descriptor as the first argument, which we have. And then we need an as the second argument, which we… wait, what is this type? If we look for its definition in the libc source code, we find that is an alias for an integer type ( or depending on the platform), and this matches the C definition of . OK, nothing special. But then… what do we pass in this second argument? If we were writing C, we would use the constant, but we don’t have that in Rust because we don’t get access to the C header files. So what is ? Remember that we previously saw that it is defined as such: … which doesn’t help us much at this point. But we can chase the definition of , end up in , and see: Ugh. We are combining the various arguments to into a number. This is hard to decipher by just reading the code, so we can ask the compiler to tell us the actual value of the constant instead: And if we run the program, we get that is . Knowing that, it’s an SMOP to call the using the libc crate alone: As you can see from this code snippet, we also have to define the structure to match the kernel’s, so avoiding nix didn’t really make things simpler for us—and in fact, it made them uglier because now we have to deal with libc’s oddities like raw C strings, global values, and an opaque constant for the number. Not great. Which makes one wonder… can we avoid replicating the C interfaces in Rust and instead leverage the system-provided header files? Yes we can. Instead of trying to invoke the s from Rust, we can invoke them via some custom C glue code. Rust is going to call into the system-provided libc anyway when we invoke a system call, so we might as well switch to C a bit “earlier”. The idea in this case, as in any other computing problem, is to add a layer of abstraction: instead of dealing with the kernel-defined data structures from Rust, we define our own structures and APIs to detach the Rust world from the C world. Here, look: We start by declaring our own version of , which I’ve called , that only includes the few fields we want to propagate to Rust. Yes, in this example the indirection is utterly pointless because we go from 4 to 3 fields so we haven’t made our lives easier, but there are s that return larger structures from which we might only need a few values. Then we define a trivial function to wrap the and transform its return value into our own structure. 
Then, we go to Rust, re-define our structure as (both of which we fully control so we can easily verify that they match) and we call the wrapping function: The trick now is to link the C code and the Rust code together, and for that, we create a script. In here, we leverage the cc crate to put things together, which is an extra dependency that is only used at build time: And with that, we are done. Well, let’s see: From a binary size perspective, there are no meaningful differences. As expected, the usage of the nix crate results in slightly more code than the other alternatives because nix has to do extra work to translate global values into Rust types and the like. But the libc and FFI alternatives seem identical. At runtime, however, we should expect the FFI option to perform a teeny tiny bit worse (though good luck measuring that) than the libc option because we have to convert between the kernel structure and our own structure in the happy path… all for dubious benefit. All in all, I’ll take option 1. I do not like having to manually replicate the kernel structures in my own code so, if I had the time, I’d try to upstream the definitions to the well-tested libc crate or write another reusable crate with just those. But, barring that, the idiomatic nix interfaces make calling Unix primitives a breeze. Because you see: network connections are manipulated via file descriptors indeed, but you don’t them: you , / and/or to them. And then you don’t from and to network connections: you somehow to and from them. Device drivers are similar: yes, hardware devices are represented as “virtual files” in the hierarchy and many support and … but these two system calls are not sufficient to access the breadth of functionality that the hardware drivers provide. No, you need . is the poster child of the system call that breaks Unix’s “everything is a file” paradigm. is the API that allows out-of-band communication with the kernel side of an open file descriptor. To see cool examples, refer back to my previous article where I demonstrated how to drive graphics from the console without X11: in that post, we had to the console device, but then we had to use to obtain the properties of the framebuffer, and then we had to the device’s content for direct access: no s nor s involved. All the code I showed you in that earlier post was written in C to keep the graphics article to the point, but the code I’m really working on is part of EndBASIC, and thus it is all Rust. And the thing is, s are not easy to issue from Rust. In fact, after 7 years of Rust-ing, it’s the first time I’ve had to reach for code blocks, and there was no good documentation on how to deal with . So this post aims to fix that by presenting the ways there are to call s from Rust… and, of course, diving a bit deeper into what s actually are . Blog System/5 is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Our target For all examples below, I’ll be using a relatively simple from NetBSD’s wsdisplay(4) driver. This API is available via the console device file, typically , and is named .
Here is what the manual page has to say: Calling this API from a C program would be trivial and look like this: The reason I’m picking specifically to talk about s in this article is three-fold: it returns a small structure with platform-dependent integers , so the sample code in the article will be relatively short; it relies on platform-specific integer types, so we’ll have to account for that in Rust; and it is “rare enough” (we are talking about NetBSD after all) that it is not going to be supported by any of the common Rust crates like libc or nix, so we’ll have to do extra work to call it. Decomposing its definition, we can see that it is: an ioctl that reads from the kernel ( ), that invokes function number 65 from the class (which probably stands for “the W scons device driver”), and that places the returned data into a structure of type (not to be confused with ).
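Putting the pieces together, the nix-based approach from option 1 might end up looking roughly like this. This is a sketch under assumptions: that the ioctl in question is WSDISPLAYIO_GINFO (matching the “read, group 'W', number 65” decomposition above), that the console device is /dev/ttyE0, and that the structure mirrors NetBSD’s wsdisplay_fbinfo:

```rust
use std::fs::File;
use std::os::fd::AsRawFd;

use libc::c_uint;
use nix::ioctl_read;

#[repr(C)]
#[derive(Debug, Default)]
struct WsdisplayFbinfo {
    height: c_uint,
    width: c_uint,
    depth: c_uint,
    cmsize: c_uint,
}

// Expands to an `unsafe fn wsdisplayio_ginfo(fd, *mut WsdisplayFbinfo)`,
// mirroring C's _IOR('W', 65, struct wsdisplay_fbinfo).
ioctl_read!(wsdisplayio_ginfo, b'W', 65, WsdisplayFbinfo);

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Opening the console device typically requires root.
    let console = File::open("/dev/ttyE0")?;
    let mut info = WsdisplayFbinfo::default();
    unsafe { wsdisplayio_ginfo(console.as_raw_fd(), &mut info) }?;
    println!("{}x{} pixels at depth {}", info.width, info.height, info.depth);
    Ok(())
}
```

Once the wrapper exists, the call site reads like any other fallible function call, which is precisely the appeal of the nix route.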

Blog System/5 12 months ago

Hands-on graphics without X11

Look at these two consoles: Side-by-side comparison of the NetBSD console right after boot vs. the EndBASIC console. Same colors, (almost) same font, same… everything? Other than for the actual text they display, they look identical, don’t they? But the one on the right can do things that the one on the left cannot. Witness this: A square? OK, meh, we had those in the DOS days with box-drawing characters . But a circle?! That’s only possible because the console on the right is a hybrid console that supports mixing the usual textual grid of a terminal with overlapping graphics. Now, if you have been following the development of EndBASIC , this is not surprising. The defining characteristic of the EndBASIC console is that it’s hybrid as the video shows. What’s newsworthy, however, is that the EndBASIC console can now run directly on a framebuffer exposed by the kernel. No X11 nor Wayland in the picture (pun intended). But how? The answer lies in NetBSD’s flexible wscons framework, and this article dives into what it takes to render graphics on a standard Unix system. I’ve found this exercise exciting because, in the old days, graphics were trivial ( mode 13h , anyone?) and, for many years now, computers use framebuffer-backed textual consoles. The kernel is obviously rendering “graphics” by drawing individual letters; so why can’t you, a user of the system, do so too? Your subscription to Blog System/5 fuels me to write comprehensive articles like this, and this one was particularly painful to put together. Click the button for more goodies. It’s free! wscons(4) , or Workstation Console in its full form, is NetBSD’s framework to access the physical console attached to a computer. wscons abstracts the details of the hardware display and input devices so that the kernel and the user-space configuration tools can treat them all uniformly across the tens of platforms that NetBSD supports. If you use wsconsctl(8) on a modern amd64 laptop to control its display, you use wsconsctl on an ancient vax box to control its display too. Layered architecture of wsdisplay and its backing devices. The output architecture of wscons is composed of multiple devices, layered like this: wsdisplay(4) sits at the top of the stack and implements the console in hardware-independent terms. The functionality at this level includes handling of VT100-like sequences, cursor positioning logic, text wrapping, scrolling decisions, etc. Under wsdisplay sit the drivers that know how to access specific hardware devices. These include, among others: vga(4) , which does not do graphics at all; genfb(4) , which is a generic framebuffer driver that talks to the “native” framebuffer of the system (e.g. the one configured by the EFI); and radeonfb(4) , which implements an accelerated console on AMD cards. These drivers know how to initialize and interact with the hardware. Under the graphical drivers sits vcons(4) , the driver that implements one or more graphical consoles in terms of a grid of pixels. vcons is parameterized on “raster operations” (rasops), a set of virtual methods to perform low-level operations. An example is the method, which is used by wsdisplay to implement scrolling in the most efficient way provided by the hardware. vcons provides default (inefficient) implementations of these methods, but the upper drivers like radeonfb can provide hardware-accelerated specializations when instantiating vcons. vcons also interacts with wsfont(4) to render text to the console. 
Layered architecture of wskbd and its backing devices, including the optional wsmux wrapper. The input architecture of wscons is similar in terms of layering of devices, albeit somewhat simpler: wsmux(4) is an optional component that multiplexes multiple input devices under a single virtual device for event extraction. wskbd(4) sits at the top of the stack (not accounting for wsmux) and implements generic keyboard handling. The functionality at this level includes translating keycodes to layouts, handling key input repetition, and more. wskbd exposes a stream of wsevents to user-space so that user-space can process state changes (e.g. key presses). Under wskbd sit the device drivers that know how to deal with specific hardware devices. These include, among others: ukbd(4) for USB keyboard input and pckbd(4) for PC/AT keyboard input. These drivers wait for hardware input, generate events, and provide a map of keycodes to key symbols to the upper layer so that wskbd can operate in generic terms. The input architecture can handle other types of devices like mice and touch panels (both via wsmouse(4) ), but I’m not going to cover those here. Just know that they sit under wsmux at the equivalent level of wskbd and produce a set of wsevents in the exact same manner as wskbd. As you can sense from the overview, the whole architecture under wsdisplay is geared towards video devices… if it wasn’t for the vga driver: in the common case, wsdisplay is backed by a graphical framebuffer managed by vcons for text rendering, yet the user only sees a textual console. But if the kernel has direct access to the framebuffer, so should user-space too. The details on how to do this click if you read through the operations described in the wsdisplay manual page. In particular, you may notice the call which retrieves extended information about, you guessed it, a framebuffer display. Let’s try it: I wrote a trivial program to open the display device (named for reasons that escape me), call this function, and store the results in an structure: Hmm, but this program does not have any visible output, right? The code just queries the framebuffer information and does nothing with it. The reason is that the content of the structure is large and I didn’t want to pretty-print it myself. I thought it’d be fun to show you how to use GDB to inspect large data structures and how to script the process. Here, look: This call to GDB starts the sample program shown above and automates various GDB commands to set a breakpoint, step through the program, and pretty-print the structure right before exiting. When we execute this command as root (which is important to get access to the device), we get this: Content of the fbinfo structure as grabbed by the sample wsdisplay-fbinfo program and printed by GDB. Neat. We get sensible stuff from the kernel! is 640 and is 480, which matches the 640x480 resolution I have configured in my test VM. But note these other fields in the structure printed above: The and fields are begging us to use to memory-map the area of the device starting at and spanning bytes. Presumably we can write to the framebuffer if we do this, but beforehand, we have to switch the console to “framebuffer mode” by using the (“set mode”) call. This call accepts an integer to indicate which mode to set: : Set the display to emulating (text) mode. This is the default operation mode of wsdisplay and configures the console to “emulate” a text terminal. : Set the display to mapped (graphics) mode. 
This allows access to the framebuffer and allows the operation to succeed. : Set the display to mapped (framebuffer) mode. This is similar to and, for our purposes in the demo below, works the same. I haven’t found a concise description of how these two differ, but from my reading of the code, the “mapped” mode offers access to the framebuffer as well as device-specific control registers, whereas “dumb framebuffer” just exposes the framebuffer memory. In any case. Once we know that we have to switch the console device to a graphical mode before mapping the framebuffer, and having access to the pixel format described in the structure… drawing something fun is just a few byte manipulation operations away: And if we run this: Voila. We’ve got graphics without paying the X11 startup tax. Switching from the console to graphics is instantaneous, like in the good old mode 13h days. Rendering graphics is just half of the puzzle when writing an interactive application though. The other half is handling input. And, for that, we have to turn to the wskbd device. After we switch the console to mapped mode, keystrokes don’t go to anymore. We have to write code to explicitly read from an attached keyboard, and we can do this via the device representing the first attached keyboard. Once we open the keyboard device for reading, wscons sends us its own representation of events known as wsevents. We can write a trivial program to read one key press: But… if we try to run it and press a key, say , we might get: Huh. We pressed but the character we got is . Not what we expected! Well, as it turns out, the “value” that wsevents report for key presses (37 in this case) is the raw keycode of the key. This is hardware-specific and needs to be translated to an actual symbol via a keymap. One feature of wskbd is that it exposes the keymap as configured in the kernel so there is a single source of truth for the machine. We can query a portion of it with another program: And if we run it, we might get: This dump is telling us how keycodes map to symbols, both in “normal” and in shifted form. If we look up keycode 37, we indeed find the letter . With this, it’s just an SMOP to come up with a program that parses the keymap as exposed by wskbd and converts keycodes to something useful. This is all good and dandy, but what happens if the keyboard is not connected when you try to open ? (Spoiler: the call fails.) Or what happens if your computer has more than one keyboard attached? (Spoiler: you can only read events from one.) This is where wsmux comes to the rescue—a device driver that multiplexes multiple input devices into one. By default, the system reserves as the multiplexer for all attached mice and as the multiplexer for all attached keyboards. We can define our own too via the wsmuxctl(8) command line utility. wsmux then supports “hot plugging”. You can then open a device even when there is no physical hardware attached, and whenever a peripheral is connected, it automatically becomes part of the mux. So, if we modify the program above to open instead of , the program will be resilient to missing keyboards and it’ll recognize multiple keyboards. Easy peasy! You are now equipped with the basics to write graphical applications on a NetBSD system (and maybe OpenBSD too) without running X11. I know NetBSD may not be your jam, but it is a good choice for embedded projects due to its console architecture and other features like its build system . 
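To give the drawing step described earlier a concrete shape, here is a minimal Rust sketch (Rust being what the EndBASIC code uses) that paints a rectangle into a memory-mapped framebuffer. It assumes the device is open for reading and writing, that the console has already been switched to mapped mode, and that pixels are 32 bits wide; real code must check all of this against what the kernel reports:

```rust
use std::fs::File;
use std::os::fd::AsRawFd;

// `stride` is the line width in pixels and `size` the mappable length in
// bytes; both values would come from the framebuffer information ioctl. A
// real program would also honor the reported mapping offset instead of 0.
fn fill_rect(fb: &File, stride: usize, size: usize) -> std::io::Result<()> {
    let base = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fb.as_raw_fd(),
            0,
        )
    };
    if base == libc::MAP_FAILED {
        return Err(std::io::Error::last_os_error());
    }
    let pixels = base as *mut u32;
    for y in 100..200 {
        for x in 100..300 {
            // Hardcoded 32-bit pixel value; the channel order depends on the
            // pixel format exposed by the driver.
            unsafe { *pixels.add(y * stride + x) = 0x00ff_00ff };
        }
    }
    unsafe { libc::munmap(base, size) };
    Ok(())
}
```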
If the code above still seems mysterious, you can read the source code for the xf86-video-wsfb and xf86-input-ws drivers for X.org. The code is easy enough to read, although it is longer because it has to support all the bells and whistles of wsdisplay and wskbd. (I took shortcuts above by making various assumptions on pixel formats and the like.) And, guess what, I am indeed working on an embedded project! A little dev box that can boot straight into EndBASIC with super-fast boot times and for which I couldn’t afford the X11 startup penalty. Stay tuned. In the meantime, what will YOU build? For those of us in the U.S., there is a 3-day weekend ahead and this can be a good distraction. Have fun! If you enjoyed this walk-through, don’t hesitate to subscribe to Blog System/5. It’s free and you’ll receive more articles like this one.

Blog System/5 1 year ago

Self-documenting Makefiles

Make, as arcane as a build tool can be, may still be a good first fit for certain scenarios. “Heresy!”, you say, as you hear a so-called “Bazel expert” utter these words. The specific problem I’m facing is that I need to glue together the NetBSD build system , a quilt patch set , EndBASIC’s Cargo-based Rust build, and a couple of QEMU invocations to produce a Frankenstein disk image for a Raspberry Pi. And the thing is: Make allows doing this sort of stitching with relative ease. Sure, Make is not the best option because the overall build performance is “meh” and because incremental builds are almost impossible to get right… but adopting Bazel for this project would be an almost-infinite time sink. Anyway. When using Make in this manner, you often end up with what’s essentially a “command dispatcher” and, over time, the number of commands grows and it’s hard to make sense of which one to use for what. Sure, you can write a with instructions, but I guarantee you that the text will get out of sync faster than you can read this article. There is a better way, though. Sample output of the make help command that we will implement in this article. What if we could provide a command that showed an overview of the project’s “build interface”? And what if we could embed such information inside the s themselves, close to the entities that they document? This idea is neither new nor mine, and it has been written about before by different people. However, I bet that most of you haven’t heard about it before, so it’s worth repeating here. And I think that my solution is a bit more comprehensive than others I’ve found. So here you go. Blog System/5 is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. As I mentioned in the introduction, Make is often used as a command dispatcher: with very little code, you can write what essentially are multiple shell scripts with automatic chaining, all wrapped in one single interface. It’s all pretty terrible, but people are used to this pattern due to Make’s ubiquity and somehow expect it when they face a Make-based project. To implement this command dispatcher idea, each user-facing action is exposed via a target . These targets tend to be marked as “phony”—i.e. they are targets that produce no outputs of their own. Take a look at this : In the snippet above, the target represents a built file . This target depends on a list of sources and specifies what command to run to generate the output when it is missing or out of date (according to file modification times, yikes ). When you type , you expect the file to exist on disk after the command completes. But the snippet also shows two phony targets: and . When you type or , you expect neither a nor a file to be created, no. What you expect is that the project is built and tested. And for this, Make evaluates the dependencies of the phony targets (if any are specified, as is the case for ) and then unconditionally executes any commands in the phony targets (as is the case for ). With this in mind, the first thing we want to do in our command is to document these “special” targets that represent user-facing actions. To do this, we’ll leverage one little-known aspect of Make’s syntax: the list of dependencies of a target is cumulative across multiple target definitions of the same name.
Basically, these target definitions are equivalent: Knowing this, we can add “extra” lines for a target and use one of those to document the target so that we do not end up with super-long lines. For example, we can do: And then we are just one away from extracting the targets and their documentation: OK fine. It’s a bit more complicated than just because we have to reformat the lines a bit and we need to create a nicely formatted table. Also, I know the syntax is awful, but I really don’t want to call into Perl or Python as other guides tell you just for this silly string manipulation. There are native Unix tools that can help us here, and they are much lighter-weight. All other “self-documenting ” tutorials I found out there focus exclusively on documenting targets. But s often expose another dimension of their API, and this is the collection of user-settable configuration variables that they accept. Many s do things like: … to indicate that is set to . But note: the operator invites users to override the variable’s value if they choose to. For example, if the user wanted to build the project in debug mode, they could probably do the following and get the code to build without optimizations and with debug symbols: Given that these variables are user-facing, we should document them as well as part of the output. To document variables, we don’t have the luxury of splitting their definition into multiple lines like we did with targets to prevent super-long lines. That said, we can still add comments at the end of the line, like shown below, and those comments won't be part of the variable's default value. It is important, however, to not leave any space between the default value and the comment, or else the spaces become part of the variable's value. Like with targets, we are also just one away from extracting the variables and their documentation: Again, more complicated than just a , but you get the idea. Alright. So now we know how to extract a table documenting targets and a table documenting variables, but these two lists may still be too obscure on their own. Which targets are important? Which variables might the user want to look into first? To address this deficiency, we can preface those tables with some prose that explains, at a very high level, what to do when interacting with the project for the first time. To implement this, we can write the instructions in a separate file (like a ) next to the , and then have our command print out the text file’s contents. And so without further ado, here is how we can tie everything together: If you copy/paste this text, beware that there are embedded tabs in it. The ones at the beginning of the line are obvious, but the ones in the character classes are not. The latter are supposed to be . Now, have fun with this, but please don’t use Make for new projects if you can avoid it!

Blog System/5 1 year ago

Revisiting the NetBSD build system

I recently picked up an embedded project in which I needed to build a highly customized full system image with minimal boot times. As I explored my options, I came to the conclusion that NetBSD, the often-forgotten BSD variant, was the best viable choice for my project. One reason for this choice is NetBSD’s build system. Once you look past the fact that it feels frozen in time since 2002, you realize it is still one of the most advanced build systems you can find for an OS. And it shows: the NetBSD build system allows you to build the full OS from scratch, on pretty much any host POSIX platform, while targeting any hardware architecture supported by NetBSD. All without root privileges. Another reason for this choice is that NetBSD was my daily workhorse for many years and I’m quite familiar with its internals, which is useful knowledge to quickly achieve the goals I have in mind. In fact, I was a NetBSD Developer with a capital D: I had commit access to the project from about 2002 through 2012 or so, and I have just revived my account in service of this project. is back! So, strap into your seats and let’s see what today’s NetBSD build system looks like and what makes it special. I’ll add my own critique at the end, because it ain’t perfect, but overall it continues to deliver on its design goals set in the late 1990s. Blog System/5 is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. The NetBSD build system is powerful and featureful, but it’s also arcane as it’s based on a combination of the BSD variant of make and shell scripts. Just peek through the files under src/share/mk/ , the directory that contains the bulk of the infrastructure, to see what I mean. As a user of the build system, however, you rarely interact with make directly. Instead, you use the script located at the top of the source tree. This script provides a user-friendly interface to most operations you may want to do, and abstracts away the intricacies of the targets that coordinate the build of the system and the configuration that controls it. The structure of the command is to pass high-level “goals” to as arguments, which indicate the operations to perform. In its simplest form, all you need to do to build a full system distribution targeting the architecture of the host is: But hey, I promised you can trivially cross-build too, right? Sure, let’s compile the system for a Raspberry Pi with a 64-bit chip, produce the USB image that we can write to an SD card, and do everything as an unprivileged user: That’s it, really. That’s all it takes. We must dig deeper though, so let’s look at some of those “goals” to see what means and understand how a release is put together without root privileges. The very first step of any invocation is to generate the toolchain used to build the rest of the system. This is true of any build, including those that target the host machine, because this ensures that the build is independent of the host’s state. In particular, this avoids the situation where you might have to upgrade certain components before building, which was/is common in other BSDs. And you have seen this prerequisite step in the previous section, by the way: all sample invocations I showed you included as the first goal. Now, you don’t have to provide to every invocation of the script: as soon as you have built a toolchain, you can reuse it in subsequent invocations.
Building a toolchain with is incredibly handy on its own, as you can produce cross-build toolchains and use them for other purposes outside of building NetBSD itself. Zig’s build system has often been praised for this reason, but NetBSD’s has nothing to envy. For example, if we do this: We end up with a cross-build C and C++ toolchain under that targets NetBSD running on ARM 64 bits. And what goes into the toolchain, you ask? Directory listing of tools as produced by the command above and its bin subdirectory. In this listing, you can see a bunch of binaries prefixed with . These are all part of the C and C++ toolchain. The rest of the tools, prefixed with , are NetBSD-specific tools required during the build. These are programs that are part of a normal NetBSD installation and would be available without the prefix if we were building on a NetBSD host, but remember, the build system supports any POSIX host OS. Take as an example: yes, all POSIX hosts provide a tool, but its syntax varies among systems so the NetBSD build system isolates itself from those differences by compiling as a host tool and using that throughout. One special tool from this listing is . This is not the binary for make (which is itself stored as ). This is a shell script that captures all settings provided to and then invokes with those, and this script is useful when you want to manually rebuild portions of the tree. Not something you would want to use as an “end user” of the build, but something you definitely will want to use as a NetBSD developer. Let’s explore the source tree a bit, which is the prime example of a monorepo in an open source project: Directory listing of the source tree, its bin and bin/ls subdirectories, and the content of bin/ls/Makefile. In this picture, you can see first the content of the top-level directory of the source tree. It all looks pretty simple: there are various subdirectories, such as , , or , that roughly track the structure of the installed system; there is the script that I previously described; and there is a as well. Looking into one subdirectory, like , we see another and many more subdirectories, one per tool installed onto . Knowing this directory-based structure, we can use the wrapper script I mentioned earlier to operate on just a portion of the monorepo. Focusing on the example shown in the screenshot, we could build and upgrade this piece of the system on its own by doing: Another extra detail to highlight from the screenshot is that NetBSD’s s are mostly declarative. Each defines a bunch of variables to specify what is being built and, at the end, includes one of the many files that pull in the build logic. Among these, we have to build one program, to build one static/shared library, and to recurse into subdirectories. Importantly, the general design is to build just one item per directory—although I myself broke this rule when I added to build tests because splitting them into subdirectories would have added too much noise to the tree. This declarative design is interesting because it maps well to the foundations of modern build systems like Bazel. In fact, the design of the NetBSD build system is what fueled my interest in build systems, influenced the design of my own Buildtool , and made me like Bazel Blaze as soon as I first saw it in 2008. In order to produce the structure of the final installation, the build system uses the “destdir” concept. 
A destdir is a staging location where built files are installed, but paths to this staging location are not used within the artifacts produced by the build. This idea exists in other build systems such as GNU Automake and is pretty much a necessity to build multiple pieces of software together before installing them or to package software without root privileges. Imagine that you want to build a library, say (the math library), and a tool that uses it, say (the calculator). typically goes into so we cannot just build and install it in place: for one, we may “break” the existing system if the new version happens to be backwards-incompatible; for another, we may be targeting a different architecture so we cannot just replace with an incompatible version. The destdir comes to the rescue. We first build as if it would be installed into . However, during installation , we prefix all file copy operations with the destdir. In this way, we build a separate “system root”, say , that contains the newly-built . After that, we build and point it to the that’s in , but… we have a problem: we can’t allow the path to appear anywhere inside the binary because this directory is transient. To fix this, we must separate build paths from runtime paths during the build: when we build , we tell the linker to look for libraries under via the flag, and we also tell the linker that, at runtime , libraries will be available in via the flag (which stands for runtime path). As you can imagine, the NetBSD build system heavily relies on this idea and, after a build (implied by the goal I showed earlier): we end up with a destdir that contains all system files laid out exactly as they need to be installed. In fact, if you run the build as root and target the host system (where the host is NetBSD), the destdir can serve as the target of a chroot. So, if you do: you essentially can enter the freshly-built system. This may or may not work, however: the newly-built binaries might require new kernel features, which is likely true if you are building a more modern NetBSD release from an older release or if you are tracking NetBSD-current. And this obviously won’t work if you are cross-building. The destdir serves as a staging area but it does not represent the final artifacts of the build. To put the destdir to use, we either have to “copy” the staging area onto the host to perform an in-place upgrade, or we need to build distribution media. The former case of an in-place upgrade is tricky because it requires issuing manual post-installation steps, so I’m not going to describe it here. But the latter case of producing distribution media is trivial. For example, we can do: to produce the release “sets” for the system from the contents of the destdir, or we can do: to create various types of installation media (a bootable CD, a bootable USB image, a live system image…) from the contents of the destdir as well. The release sets are an interesting thing to discuss because they form the core of a NetBSD distribution. You see: NetBSD ships as a collection of tarballs, and installing NetBSD amounts to simply unpacking those tarballs onto a file system and performing a few post-installation configuration steps. Content of the binary/sets directory of the NetBSD/amd64 distribution. Now, the way these tarballs are produced from the destdir is by leveraging , a really cool tool that is not known in Linux land. 
The purpose of this tool is to compare a textual “golden” representation of a directory against the actual contents of the directory, and highlight where they might differ. BSD systems use to describe how the installed system looks like and, as you can imagine, NetBSD is no exception. The NetBSD build system uses files to ensure the destdir contents match expectations, and also uses the “manifests” to “bucketize” the files from the destdir into the individual release sets. You can find these golden manifests in the directory. These files are also critical for another very important feature: namely, the ability to build the whole NetBSD system as an unprivileged user. This, to me, is one of the most impressive features of this build system: you can produce the full build, including disk images, without ever running or using weird intercept tools like Debian’s . Here is how this works. When building in unprivileged mode (enabled via the flag to ), the build system produces a file under the destdir. This file looks like this: Every line of this file maps a file system entry (a directory, a file, a device…) stored in the destdir to its properties, including ownership information and permissions. These entries are generated from metadata encoded in the s whenever the build system places a new file under the destdir via the command (another nice tool often unknown to Linux users). The is the key that allows building media images without root privileges. If you think about it, media images are simply files with an internal structure that represents disk partitions, file systems, and metadata. Because they are simply files, there is no need to have root access nor to make the host’s file contain all users represented by entries in these file systems. Traditionally, OS builds have needed root because it’s easier to leverage the kernel’s virtual devices and file system implementations, but there is not inherent reason for that to be the only choice. All the work can be done in user space, and that’s precisely what NetBSD does. Now, go back and revisit the screenshot above that showed the toolchain contents. You’ll notice tools like (the tool to format a file system) and (the tool to create a GPT partitioning scheme). These tools are part of the toolchain because they are needed to generate installation media, and these tools know how to read the in order to embed the right permissions and special file modes into the built images. All without ever becoming root. Now, as simple and powerful as might be, I find it cumbersome for day to day use if you want to customize any of its default settings. It is not uncommon to end up running with invocations like: which the official documentation describes as “golden invocations” and I have no desire to type or even remember. This is what drove me to write sysbuild : a layer of abstraction over and that coordinates updating the source tree and building it. The tool even integrates with trivially, providing a mechanism to keep NetBSD-current installations up-to-date. sysbuild is driven by configuration “profiles” which allow you to customize the paths and settings of a build in just one place and then puts them to use with a trivial command. For example, with a configuration file like the following stored in : We can simply run: to update the NetBSD source tree to the latest version, ensure that the tools are up-to-date, and produce the USB disk images for a Raspberry Pi. Not everything about the NetBSD build system is rosy though. 
The thing that differentiates a good build system from a “meh” one for me personally is the behavior of incremental builds and, in particular, two aspects of these: First, incremental builds need to do minimal work, especially when there is “nothing to do”. The NetBSD build system is a recursive make one (which comes with its own set of problems ), so it does not do minimal work. On my 72-core machine, it takes about 3 minutes to run through a invocation that does nothing. This is OK for end users looking to upgrade their running machine, but it is painful because it makes iterating on system changes difficult. As a developer, you end up needing to know how to surgically rebuild individual subdirectories using the wrappers I described earlier, and manually track dependencies across those. Second, incremental builds must always deliver correct results without having to do a in between. But that’s generally not true for make-based build systems—and NetBSD’s is no exception. Generally, an incremental build after a (sorry, a ) will work fine, but sometimes it won’t. And if you start playing with build-time switches (things like ), then you are out of luck and must resort to a to “switch configurations”. And there are other problems. Running a parallel build on a system with many cores sometimes leads to spurious build failures because the interdependencies between components are not always precisely specified (it’s really difficult to be correct with make). And the build is inefficient: of those 3 minutes I mentioned earlier, you can see that most of the time is wasted by make recursing through directories and discovering there is nothing to do, whereas other times, make “chokes” on way too many C++ compiles at once, which leads to out-of-memory situations. It has been my dream since the publication of Bazel as open source to have a Bazel-based build of NetBSD. I think Bazel is the perfect build system for such a project because it’d deliver correct and efficient incremental builds to NetBSD’s monorepo, and it would save tons of resources when running on many-core machines. Except for the fact that it’s written in Java, so it’d be a really odd choice for such a project. Maybe Buck 2 would be suitable. Anyway, one can only dream… Why am I looking at this at all again, after years of not touching NetBSD? I said it in the opening: I’m working on a new embedded project for which NetBSD is the greatest fit. I could tell you what it is about, but it’s easier to just show you: OK, fine, in words: I am building a minimal system that boots straight into EndBASIC with quick build times and low overhead. You’ll have to wait a bit more to get your hands on this though, as I’m still ironing out various details and want to end up providing a pre-built “box” with the right hardware and software combination. I’m also becoming super tempted to migrate NetBSD’s build to Bazel to make my own life easier in this journey. This is a monumental task… but I’m not sure that it’d be crazy to tackle the minimum subset of NetBSD that I need for this minimal disk image and port only those portions to Bazel. The results might impress some people, who might then want to help the effort. Right? In the meantime, I encourage you to read through the comprehensive Building the system portion of The NetBSD guide , and to play with building a NetBSD image straight from your Linux machine. You may like it. Blog System/5 is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
One reason for this choice is NetBSD’s build system. Once you look and get past the fact that it feels frozen in time since 2002, you realize it is still one of the most advanced build systems you can find for an OS. And it shows: the NetBSD build system allows you to build the full OS from scratch, on pretty much any host POSIX platform, while targeting any hardware architecture supported by NetBSD. All without root privileges. Another reason for this choice is that NetBSD was my daily workhorse for many years and I’m quite familiar with its internals, which is useful knowledge to quickly achieve the goals I have in mind. In fact, I was a NetBSD Developer with capital D: I had commit access to the project from about 2002 through 2012 or so, and I have just revived my account in service of this project. is back! So, strap onto your seats and let’s see how today’s NetBSD build system looks like and what makes it special. I’ll add my own critique at the end, because it ain’t perfect, but overall it continues to deliver on its design goals set in the late 1990s. Blog System/5 is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. The basics The NetBSD build system is powerful and featureful, but it’s also arcane as it’s based on a combination of the BSD variant of make and shell scripts. Just peek through the files under src/share/mk/ , the directory that contains the bulk of the infrastructure, to see what I mean. As a user of the build system, however, you rarely interact with make directly. Instead, you use the script located at the top of the source tree. This script provides a user-friendly interface to most operations you may want to do, and abstracts away the intricacies of the targets that coordinate the build of the system and the configuration that controls it. The structure of the command is to pass high-level “goals” to as arguments, which indicate the operations to perform. In its most simple form, all you need to do to build a full system distribution targeting the architecture of the host is: But hey, I promised you can trivially cross-build too, right? Sure, let’s compile the system for a Raspberry Pi with a 64-bit chip, produce the USB image that we can write to an SD card, and do everything as an unprivileged user: That’s it, really. That’s all it takes. We must dig deeper though, so let’s look at some of those “goals” to see what means and understand how a release is put together without root privileges. The toolchain The very first step of any invocation is to generate the toolchain used to build the rest of the system. This is true of any build, including those that target the host machine, because this ensures that the build is independent of the host’s state. In particular, this avoids the situation where you might have to upgrade certain components before building, which was/is common in other BSDs. And you have seen this prerequisite step in the previous section, by the way: all sample invocations I showed you included as the first goal. Now, you don’t have to provide to every invocation of the script: as soon as you have built a toolchain, you can reuse it in subsequent invocations. Building a toolchain with is incredibly handy on its own, as you can produce cross-build toolchains and use them for other purposes outside of building NetBSD itself. Zig’s build system has often been praised for this reason, but NetBSD’s has nothing to envy. 
For example, if we do this: We end up with a cross-build C and C++ toolchain under that targets NetBSD running on ARM 64 bits. And what goes into the toolchain, you ask? Directory listing of tools as produced by the command above and its bin subdirectory. In this listing, you can see a bunch of binaries prefixed with . These are all part of the C and C++ toolchain. The rest of the tools, prefixed with , are NetBSD-specific tools required during the build. These are programs that are part of a normal NetBSD installation and would be available without the prefix if we were building on a NetBSD host, but remember, the build system supports any POSIX host OS. Take as an example: yes, all POSIX hosts provide a tool, but its syntax varies among systems so the NetBSD build system isolates itself from those differences by compiling as a host tool and using that throughout. One special tool from this listing is . This is not the binary for make (which is itself stored as ). This is a shell script that captures all settings provided to and then invokes with those, and this script is useful when you want to manually rebuild portions of the tree. Not something you would want to use as an “end user” of the build, but something you definitely will want to use as a NetBSD developer. Build structure Let’s explore the source tree a bit, which is the prime example of a monorepo in an open source project: Directory listing of the source tree, its bin and bin/ls subdirectories, and the content of bin/ls/Makefile. In this picture, you can see first the content of the top-level directory of the source tree. It all looks pretty simple: there are various subdirectories, such as , , or , that roughly track the structure of the installed system; there is the script that I previously described; and there is a as well. Looking into one subdirectory, like , we see another and many more subdirectories, one per tool installed onto . Knowing this directory-based structure, we can use the wrapper script I mentioned earlier to operate on just a portion of the monorepo. Focusing on the example shown in the screenshot, we could build and upgrade this piece of the system on its own by doing: Another extra detail to highlight from the screenshot is that NetBSD’s s are mostly declarative. Each defines a bunch of variables to specify what is being built and, at the end, includes one of the many files that pull in the build logic. Among these, we have to build one program, to build one static/shared library, and to recurse into subdirectories. Importantly, the general design is to build just one item per directory—although I myself broke this rule when I added to build tests because splitting them into subdirectories would have added too much noise to the tree. This declarative design is interesting because it maps well to the foundations of modern build systems like Bazel. In fact, the design of the NetBSD build system is what fueled my interest in build systems, influenced the design of my own Buildtool , and made me like Bazel Blaze as soon as I first saw it in 2008. The destdir In order to produce the structure of the final installation, the build system uses the “destdir” concept. A destdir is a staging location where built files are installed, but paths to this staging location are not used within the artifacts produced by the build. 
This idea exists in other build systems such as GNU Automake and is pretty much a necessity to build multiple pieces of software together before installing them or to package software without root privileges. Imagine that you want to build a library, say (the math library), and a tool that uses it, say (the calculator). typically goes into so we cannot just build and install it in place: for one, we may “break” the existing system if the new version happens to be backwards-incompatible; for another, we may be targeting a different architecture so we cannot just replace with an incompatible version. The destdir comes to the rescue. We first build as if it would be installed into . However, during installation , we prefix all file copy operations with the destdir. In this way, we build a separate “system root”, say , that contains the newly-built . After that, we build and point it to the that’s in , but… we have a problem: we can’t allow the path to appear anywhere inside the binary because this directory is transient. To fix this, we must separate build paths from runtime paths during the build: when we build , we tell the linker to look for libraries under via the flag, and we also tell the linker that, at runtime , libraries will be available in via the flag (which stands for runtime path). As you can imagine, the NetBSD build system heavily relies on this idea and, after a build (implied by the goal I showed earlier): we end up with a destdir that contains all system files laid out exactly as they need to be installed. In fact, if you run the build as root and target the host system (where the host is NetBSD), the destdir can serve as the target of a chroot. So, if you do: you essentially can enter the freshly-built system. This may or may not work, however: the newly-built binaries might require new kernel features, which is likely true if you are building a more modern NetBSD release from an older release or if you are tracking NetBSD-current. And this obviously won’t work if you are cross-building. Distribution media The destdir serves as a staging area but it does not represent the final artifacts of the build. To put the destdir to use, we either have to “copy” the staging area onto the host to perform an in-place upgrade, or we need to build distribution media. The former case of an in-place upgrade is tricky because it requires issuing manual post-installation steps, so I’m not going to describe it here. But the latter case of producing distribution media is trivial. For example, we can do: to produce the release “sets” for the system from the contents of the destdir, or we can do: to create various types of installation media (a bootable CD, a bootable USB image, a live system image…) from the contents of the destdir as well. The release sets are an interesting thing to discuss because they form the core of a NetBSD distribution. You see: NetBSD ships as a collection of tarballs, and installing NetBSD amounts to simply unpacking those tarballs onto a file system and performing a few post-installation configuration steps. Content of the binary/sets directory of the NetBSD/amd64 distribution. Now, the way these tarballs are produced from the destdir is by leveraging , a really cool tool that is not known in Linux land. The purpose of this tool is to compare a textual “golden” representation of a directory against the actual contents of the directory, and highlight where they might differ. 
BSD systems use to describe how the installed system looks like and, as you can imagine, NetBSD is no exception. The NetBSD build system uses files to ensure the destdir contents match expectations, and also uses the “manifests” to “bucketize” the files from the destdir into the individual release sets. You can find these golden manifests in the directory. Unprivileged builds These files are also critical for another very important feature: namely, the ability to build the whole NetBSD system as an unprivileged user. This, to me, is one of the most impressive features of this build system: you can produce the full build, including disk images, without ever running or using weird intercept tools like Debian’s . Here is how this works. When building in unprivileged mode (enabled via the flag to ), the build system produces a file under the destdir. This file looks like this: Every line of this file maps a file system entry (a directory, a file, a device…) stored in the destdir to its properties, including ownership information and permissions. These entries are generated from metadata encoded in the s whenever the build system places a new file under the destdir via the command (another nice tool often unknown to Linux users). The is the key that allows building media images without root privileges. If you think about it, media images are simply files with an internal structure that represents disk partitions, file systems, and metadata. Because they are simply files, there is no need to have root access nor to make the host’s file contain all users represented by entries in these file systems. Traditionally, OS builds have needed root because it’s easier to leverage the kernel’s virtual devices and file system implementations, but there is not inherent reason for that to be the only choice. All the work can be done in user space, and that’s precisely what NetBSD does. Now, go back and revisit the screenshot above that showed the toolchain contents. You’ll notice tools like (the tool to format a file system) and (the tool to create a GPT partitioning scheme). These tools are part of the toolchain because they are needed to generate installation media, and these tools know how to read the in order to embed the right permissions and special file modes into the built images. All without ever becoming root. sysbuild Now, as simple and powerful as might be, I find it cumbersome for day to day use if you want to customize any of its default settings. It is not uncommon to end up running with invocations like: which the official documentation describes as “golden invocations” and I have no desire to type or even remember. This is what drove me to write sysbuild : a layer of abstraction over and that coordinates updating the source tree and building it. The tool even integrates with trivially, providing a mechanism to keep NetBSD-current installations up-to-date. sysbuild is driven by configuration “profiles” which allow you to customize the paths and settings of a build in just one place and then puts them to use with a trivial command. For example, with a configuration file like the following stored in : We can simply run: to update the NetBSD source tree to the latest version, ensure that the tools are up-to-date, and produce the USB disk images for a Raspberry Pi. Deficiencies Not everything about the NetBSD build system is rosy though. 
The METALOG is the key that allows building media images without root privileges. If you think about it, media images are simply files with an internal structure that represents disk partitions, file systems, and metadata. Because they are simply files, there is no need to have root access nor to make the host's passwd file contain all the users represented by entries in these file systems. Traditionally, OS builds have needed root because it's easier to leverage the kernel's virtual devices and file system implementations, but there is no inherent reason for that to be the only choice. All the work can be done in user space, and that's precisely what NetBSD does.

Now, go back and revisit the screenshot above that showed the toolchain contents. You'll notice tools like makefs (the tool to format a file system) and gpt (the tool to create a GPT partitioning scheme). These tools are part of the toolchain because they are needed to generate installation media, and they know how to read the METALOG in order to embed the right permissions and special file modes into the built images. All without ever becoming root.

sysbuild

Now, as simple and powerful as the build system might be, I find it cumbersome for day-to-day use if you want to customize any of its default settings. It is not uncommon to end up running it with long invocations, which the official documentation describes as "golden invocations" and which I have no desire to type or even remember. This is what drove me to write sysbuild: a layer of abstraction that coordinates updating the source tree and building it. The tool even integrates with cron trivially, providing a mechanism to keep NetBSD-current installations up-to-date. sysbuild is driven by configuration "profiles" which allow you to customize the paths and settings of a build in just one place and then put them to use with a trivial command. For example, with a profile that describes a Raspberry Pi build, a single sysbuild invocation updates the NetBSD source tree to the latest version, ensures that the tools are up-to-date, and produces the USB disk images for a Raspberry Pi.

Deficiencies

Not everything about the NetBSD build system is rosy though. The thing that differentiates a good build system from a "meh" one, for me personally, is the behavior of incremental builds and, in particular, two aspects of them.

First, incremental builds need to do minimal work, especially when there is "nothing to do". The NetBSD build system is a recursive make one (which comes with its own set of problems), so it does not do minimal work. On my 72-core machine, it takes about 3 minutes to run through a build invocation that does nothing. This is OK for end users looking to upgrade their running machine, but it is painful because it makes iterating on system changes difficult. As a developer, you end up needing to know how to surgically rebuild individual subdirectories using the wrappers I described earlier, and to manually track dependencies across those.

Second, incremental builds must always deliver correct results without having to do a clean build in between. But that's generally not true for make-based build systems—and NetBSD's is no exception. Generally, an incremental build after updating the source tree will work fine, but sometimes it won't. And if you start playing with build-time switches, then you are out of luck and must resort to a clean build to "switch configurations".

Blog System/5 1 year ago

Synology DS923+ vs. FreeBSD w/ZFS

My interest in storage is longstanding—I loved playing with different file systems in my early Unix days and then I worked on Google's and Microsoft's distributed storage solutions—and, about four years ago, I started running a home-grown NAS leveraging FreeBSD and its excellent ZFS support. I first hosted the server on a PowerMac G5 and then upgraded it to an overkill 72-core ThinkStation that I snapped up second-hand for a great price. But as stable and low-maintenance as FreeBSD is, running day-to-day services myself is not my idea of "fun". This drove me to replace this machine's routing functionality with a dedicated pfSense box a year ago and, for similar reasons, I have been curious about dedicated NAS solutions.

Synology DS923+ on top of its shipping box right after unboxing it.

I was pretty close to buying a second-hand NAS from the work classifieds channel when a Synology marketing person (hi Kyle!) contacted me to offer a partnership: they'd ship me one of their devices for free in exchange for me publishing a few articles about it. Given my interest in test-driving one of these appliances without committing to buying one (they ain't cheap and I wasn't convinced I wanted to get rid of my FreeBSD-based solution), I was game. And you guessed right: this article is one of those I promised to write but, before you stop reading, the answer is no. This post is not sponsored by Synology and has not been reviewed nor approved by them. The content here, including any opinions, is purely my own. And what I want to do here is compare how the Synology appliance stacks up against my home-built FreeBSD server.

Blog System/5 is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

The hardware

Here are the two contenders in my comparison:

Synology DS923+ next to my ThinkStation P710 on top of a dusty LackRack in the garage.

My home-built NAS: Hardware: ThinkStation P710 equipped with 2x Intel Xeon E5-2697 v4 processors (18c/36t) at 2.30GHz, 64GB of RAM, 2x Seagate 4TB Enterprise Capacity 7200 RPM drives, and a Samsung 970 EVO Plus 500GB SSD. Operating system: FreeBSD 14. Storage configuration: ZFS with the two HDDs in mirror mode and with the SSD set up as the L2ARC plus ZIL log for the drive pool.

The Synology NAS: Hardware: Synology DS923+ equipped with 3x Synology Plus Series 4TB 5400 RPM drives and the same Samsung 970 EVO Plus 500GB SSD that I moved from one machine to the other. The box is equipped with a 4-core AMD Ryzen Embedded R1600 and 4GB of RAM. Operating system: Synology's own DiskStation Manager (DSM). Storage configuration: btrfs with the three HDDs set up in RAID 5 mode and with the SSD set up to act as the cache for the drive pool.

The two configurations are quite different—after all, I am comparing a workstation machine with lots of spare CPU and RAM to a dedicated machine specifically crafted for file sharing—so it's going to be difficult to be "fair" in any comparison. In any case, here are a few things to contrast:

IOPS: The P710 runs with 2 7200 RPM drives in mirror mode whereas the DS923+ runs with 3 5400 RPM drives in RAID 5 mode. The total IOPS from each are going to differ but… for all purposes, the 1Gbit NIC that each machine has is the limiting factor in performance, so I haven't bothered to run any performance tests. Both can saturate the network, so there is that.

Quality: Both the P710 and the DS923+ are impressive machines—maybe not PowerMac G5 levels of impressive, but pretty, pretty close.
I love the ThinkStation's outer design and the interior shows great cable management and airflow. As for the NAS, I love the lightweight and small form factor that allows placing it pretty much anywhere. Both are tool-less enclosures.

Noise: When idle, both machines are equally quiet. The ThinkStation gets really, really loud under heavy load though, and it is also uncomfortably loud even at idle when the room temperature is warm (around or over 25C). The DS923+, however, seems quiet throughout. The Synology Plus Series HDDs are also quieter than the ones I had in the ThinkStation, and that's partly because they are slower 5400 RPM drives. In any case, these two machines stay in my garage so I don't care about their noise.

Power consumption: I unfortunately do not own a tool to measure it, but it'd be neat to compare how these two stack up. I'm sure I've thrown money away by keeping the ThinkStation online 24x7 and having those drives never rest (ZFS doesn't let them spin down), but it's hard to care about it because my power bill is dominated by heating almost year-round.

FreeBSD is my current favorite system for servers. FreeBSD remains close to its Unix roots and maintains rational and orthogonal tooling (unlike Linux), is quick and trivial to maintain (I could run through the 13 to 14 upgrade in… 15 minutes?), and bundles modern technology like ZFS and bhyve (unlike NetBSD, sadly, which used to be my BSD of choice). It is true that FreeBSD gave me some headaches when I ran it on the PowerMac G5, but that's expected due to the machine being 20 years old and FreeBSD's PowerPC support being a Tier 2 platform. The thing is that I never intended to run a NAS on the G5; it just so happened to be the only machine I had available for it. In any case, I have had zero problems on the ThinkStation. Mind you: when I bought this machine, both Windows and Fedora experienced occasional freezes with their default installations (before pulling upgrades from the network), yet FreeBSD has never shown any signs of instability.

As for ZFS, it is hard to convey the feeling of "power" you experience when you type zpool and zfs commands and see the machine coordinate multiple disks to offer you a dependable storage solution. Creating file systems on a pool, creating raw volumes for VMs or iSCSI targets, taking snapshots, replicating snapshots over the network or to backup USB drives, scrubbing the drives to verify data integrity… all are trivial commands away.
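As a taste of that feeling, here is a minimal sketch of the kind of zpool and zfs invocations involved; the pool, disk, and dataset names are invented for illustration and do not describe my actual setup:

    # Create a mirrored pool from two HDDs, then add the SSD as read cache
    # (L2ARC) and as a separate intent log (ZIL/SLOG).
    zpool create tank mirror /dev/ada0 /dev/ada1
    zpool add tank cache /dev/nvd0p1
    zpool add tank log   /dev/nvd0p2

    # Day-to-day operations are equally terse.
    zfs create tank/photos                 # a new file system within the pool
    zfs snapshot tank/photos@2024-01-01    # an instant, cheap snapshot
    zpool scrub tank                       # verify data integrity in the background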
ZFS feels slow though. I grew up thinking of file systems as contiguous portions of a disk and tinkering with their partitioning scheme to keep groups of data unfragmented for speedy access. Due to the way ZFS operates, however, none of this archaic knowledge applies (and, to be honest, I do not know how ZFS works in detail internally). That said, a hard drive's seek time is around 10ms and has been like that for decades, which, combined with the fact that we are now spoiled by having SSDs everywhere, exacerbates how slow an HDD-based solution feels no matter the file system.

Did I just mention disk fragmentation? Yeah I did, and I could not resist including this screenshot of MS-DOS 6.22's DEFRAG.EXE here. Just because.

Regarding network connectivity, FreeBSD offers all sorts of networked file systems and services. The base system provides NFSv3, NFSv4, FTP, and iSCSI targets. The ports (packages) system offers whatever else you may need, including SMB, DLNA, and even ancient protocols like AppleTalk or distributed protocols like Ceph. All configuration is done by hand over SSH in the traditional Unix way of editing configuration files—aka messing around with different, inconsistent text formats.

The DS923+ runs Synology's own operating system: the DiskStation Manager (DSM). The DSM is a headless-first system designed to be accessed over the network, which is no surprise. What is surprising is the choice of interface: while most networked devices offer some sort of web-based tabbed UI, the DSM offers a desktop environment—with overlapping windows no less. This feels like a gimmick to me, and quite a neat one, but overkill nonetheless.

The DSM desktop with the Control Panel and the File Station apps open to demonstrate the appearance of the web-based windowing system.

If we peek under the covers, which we can do by logging into the machine over SSH, we find that the DSM is a Linux system. No surprises here after seeing that its choice of file system is btrfs. But what kind of Linux is it? A weird one, let me tell you. Luckily, you do not have to interact with it at all if you don't want to, but hey, I'm curious so I did look. The DSM seems to be some sort of Debian derivative based on some of the tooling that is installed, but otherwise I cannot find any other obvious remains of what might have been a Debian installation; even querying the package manager shows nothing. What I can tell is that the btrfs file systems are mounted under per-volume directories and, in there, we can find one directory per shared folder that we create in the UI. We can also find various directories with a prefix that is a… weird choice for Unix file names, and which I understand are managed by the DSM. These include a few whose names carry strong Windows vibes.

Considering that the DS923+ is almost a PC and runs Linux, I did look to see if it was possible to run FreeBSD instead. It turns out some people have tried this, and FreeBSD does run, but it lacks drivers to control power to the drives so it cannot actually leverage the storage devices. From what I understood, the DS923+ seems to have dedicated hardware to control power to the drive pool, and this hardware logic is controlled via GPIO. Which got me even more curious: if the DSM explicitly controls power to the drive pool, how does it boot and how does it remain responsive even after shutting down the drives? I guessed that the device comes with a tiny drive to hold the DSM and boots off of it and, upon inspecting the attached storage devices and looking for non-obvious stuff, I found it: a small 120MB (flash?) drive that acts as the system device.

However… this separation of system vs. drive pool comes with a price: the drive pool often shuts off, and powering it back on is slow (10–15 seconds are not unusual). When you try to reach the NAS over the network and the pool is off, it can feel as if the NAS is unreachable / down, which has already confused me a few times and caused me to start diagnosing network issues. Quite annoying to be honest, but obviously this can be configured in the Hardware & Power menu:

The list of available options under the Hardware & Power menu.

Regarding network connectivity, the DSM offers multiple networked file systems and services out of the box, namely: SMB, NFS v3, NFS v4, AFP, FTP, and rsync. I'm actually surprised to see AFP, AppleTalk's "replacement", in the default set of supported file systems in this day and age, but there it is. Additionally, the DSM provides its own package management system to install a limited set of additional services and utilities, which I used to add DLNA support.
Anyhow, let's change topics and talk about the DS923+ itself. The first thing I had to do after getting the device was to seed it with data. I needed to copy about 1TB of photos and documents from the FreeBSD machine to the DS923+, which is not a lot, but is large enough that different copy mechanisms can make a huge difference in the time the copy takes. In particular, the choice of protocol matters for a quick copy: both SMB and NFS are OK with large files, but transferring many small files with them is painful. Fortunately, the DSM allows rsync over SSH and I used that to do the initial seeding.

Now, using rsync came with two "problems". The first is that I had to know the path to the shared folders in the file system to specify the rsync targets. This is not clearly exposed through the UI, so I had to rely on SSH to log into the machine and figure out where the shared folders live on disk, as I briefly mentioned in the previous section. Which led to the second problem or, rather, surprise: even though I had set up 2FA for the user accounts I created in the DSM UI, 2FA was meaningless when accessing the machine over SSH. I understand why that may be, but then I question what the point of enabling 2FA really is if one can gain access to the machine without it. The same is true of SMB by the way: you just need an account's password, not the 2FA. So, unless you disable all networked file systems and only allow web access to the NAS, all the 2FA does is give you a false sense of security.
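For reference, the initial seeding was done with an invocation along these lines; the hostnames and paths here are placeholders rather than the exact ones I used:

    # Push the photo library from the FreeBSD box to a shared folder on the NAS,
    # preserving permissions and timestamps and resuming cleanly if interrupted.
    rsync -avP -e ssh \
        /tank/photos/ \
        admin@nas.example.org:/volumeX/photos/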
When comparing my custom FreeBSD server to the newer DS923+, the main difference I notice is an increase in "peace of mind". Yes, FreeBSD is very stable and ZFS is great… but I wasn't running the ThinkStation as a dedicated NAS: I was mixing the machine's NAS responsibilities with those of a host for development tasks. I always felt uncomfortable about the health of the system and I wasn't convinced that my maintenance was "good enough" to safeguard my data. As you can imagine, being uncomfortable about your data is not something you want to feel: trust is The Thing you most want in a storage solution, and there are a few things that stood out during the setup of the system and gave me warm feelings on this topic.

The first and obvious one is that the DSM offers to encrypt the pool right from the beginning, which is something you should always do in this day and age: encryption is cheap in CPU terms, the data you own is precious, and the devices that store it tend to be small and light so they are at risk of theft. But beware: the key is stored on the device to allow auto-mounting by default, which means a knowledgeable thief could still gain access. Thus, this level of encryption is only useful to facilitate the disposal of drives. That said, you can configure additional encryption on each shared folder if you want per-user password-protected encryption.

The second thing is that at no point in the setup process was I asked to create a Synology account for the NAS to be fully functional. You might think that this is a given, but seeing the state of other hardware or software products these days… it's not. So this sent a very welcome message. I ended up needing to create an account for certain features that required it, but these were completely optional.

The third thing, which is actually one of those features that required creating a Synology account, is the ability to send email notifications for system alerts. Email-based alerts are terrible in large organizations (and of course the DSM offers better alternatives, like ActiveInsight), but for my personal use case, emails are perfect and make a huge difference compared with my previous FreeBSD setup. I can rest assured that I'll be told about anything unusual with the DS923+ in a way that I will notice. With my custom build, I had certain monitoring features in place, like weekly disk scrubs and periodic online backups… but no way to properly notify me of problems with either: if anything went wrong, I would not have known in time, and that was very unsettling.

The fourth and final thing is that the DSM has been built throughout the years by people that deal with storage all the time, and it's fair to assume that they know a thing or two about running a NAS. I have reasonable confidence that the configuration of the system and the storage pool is going to be "correct" over time, particularly across upgrades, whereas I was never quite sure of it with my manual FreeBSD setup. One such example is the NFS configuration: setting up the NFSv4 server in the DSM was pretty much knob-free—so when things didn't work with the clients, I could assume that the issue was almost certainly with the clients themselves.

As I mentioned earlier, the DS923+ is pretty much a PC with Linux in a tiny box, so in principle it can host any kind of software you like. The way this works is via DSM's own "marketplace" where you can find new services to add to the machine.

The DSM desktop showing the Package Center application.

I did add the optional DLNA service so that I could play videos from my Xbox, and also toyed with the Domain Server (and later gave up due to the sheer complexity of setting up a domain just for home use). But there are other interesting optional features. For example, you can create virtual machines on the DS923+ which, despite the limited CPU and RAM of the machine, can come in handy from time to time. I suppose with a more powerful box from Synology, the story here would be quite different. One thing to highlight is that these extra pieces of software are curated: you aren't just installing an extra service onto the underlying Linux machine. You are installing an extension to the DSM, which comes with new configuration panels in the UI and full integration with the system. You never have to know that Linux is there if you don't want to.

Having a dedicated NAS with a storage pool that protects against corruption is great, but a singly-homed box like this is still subject to massive data loss due to ransomware, physical damage caused by fire or flooding, and correlated disk failures. Backups are critically important. With the FreeBSD setup, my backup strategy involved using zfs send and receive to back up snapshots onto two USB drives: one kept in a fire safe and one kept offsite. This worked, but I had to figure out the syntax of these commands over and over again, which didn't make me feel confident about these actions. I did also back up irreplaceable data to a OneDrive account using a command-line sync tool, which was good but… manual and ad-hoc as well. Needless to say, while the backups existed, they were often stale and they were a pain to manage.
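For the record, the kind of ZFS incantation I kept having to look up looks roughly like this; the pool and dataset names are invented for illustration:

    # One-time: create a pool on the USB drive to receive the backups.
    zpool create usbbackup /dev/da2

    # Each backup run: snapshot the dataset and replicate it to the USB pool.
    zfs snapshot tank/photos@2024-06-01
    zfs send tank/photos@2024-06-01 | zfs receive -F usbbackup/photos

    # Later runs can send only the delta between two snapshots.
    zfs send -i tank/photos@2024-06-01 tank/photos@2024-07-01 | \
        zfs receive -F usbbackup/photos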
Backing up a NAS is difficult though. For one, a NAS is designed to store lots of data, which means the backup targets have to be large capacity-wise too. And for another, given the DS923+'s reduced physical size, it is likely to end up placed in a location that isn't super convenient to access, which will make attaching temporary USB drives and the like an annoying experience.

For now, what I have tried is the optional Cloud Sync package that allows replicating to many cloud services, including OneDrive. And it's a joy to use. I could trivially install the package, log into my Microsoft account, and configure the two specific folders I care about backing up. I could even respect the previous file layout of my OneDrive account so I did not have to re-upload anything. And the tool supports encryption too, which you may want if you really don't want those cloud providers training their AI systems on your data.

Sample configuration of cloud syncing to OneDrive.

There is more though. Synology offers a variety of products for remote and physical backup. The one that interests me the most is Snapshot Replication, which provides something similar to what I was doing with zfs send and receive, but unfortunately requires a second Synology system offsite that I don't have. The other solutions are Active Backup for Business and Hyper Backup, which I would still like to evaluate but haven't had a chance to look into.

To conclude this brief review and comparison, let me say that I'm happy with the DS923+ experience so far. I don't think I need it because my FreeBSD solution worked well enough, but considering that I like to use the ThinkStation as the host for my VMs and as my primary development machine—aka, a fun toy—I feel more at peace with a dedicated appliance that stores my precious data. For more details, you can visit Synology's product pages for the DS923+ and the Plus Series HDDs. And if you want to roll your own FreeBSD-based solution, Chapter 22 of the FreeBSD handbook on ZFS is a good place to start.

Blog System/5 1 year ago

Demystifying secure NFS

I recently got a Synology DS923+ for evaluation purposes, which led me to setting up NFSv4 with Kerberos. I had done this about a year ago with FreeBSD as the host, and going through this process once again reminded me of how painful it is to secure an NFS connection. You see, Samba is much easier to set up, but because NFS is the native file sharing protocol of Unix systems, I felt compelled to use it instead. However, if you opt for NFSv3 (the "easy default"), you are left with a system that has zero security: traffic travels unencrypted and unsigned, and the server trusts the client when the client asserts who is who. Madness by today's standards. Yet, when you look around, people say "oh, but NFSv3 is fine if you trust the network!" But seriously, who trusts the network in this day and age?

You have to turn to NFSv4 and combine it with Kerberos for a secure file sharing option. And let me tell you: the experience of setting these up and getting things to work is horrible, and the documentation out there is terrible. Most documents are operating-system specific so they only tell you what works when a specific server and a specific client talk to each other. Other documents just assume, and thus omit, various important details of the configuration. So. This article is my recollection of "lab notes" on how to set this whole thing up, along with the necessary background to understand NFSv4 and Kerberos. My specific setup involves the Synology DS923+ as the NFSv4 server; Fedora, Debian, and FreeBSD clients; and the supporting KDC on a pfSense (or FreeBSD) box.

Update (June 3rd, 2025): Following the discussion from Lobste.rs (and very belatedly, because I could not edit this post due to a magic string blocking my changes), I've updated the text to resolve issues with the kernel/user name mapping.

An upcoming article will compare the Synology NAS against my home-built FreeBSD-based NAS. Subscribe now to get it!

NFSv3's insecurity

NFSv3, or usually just NFS, is a protocol from the 1980s—and it shows. In broad terms, NFSv3 exposes the inodes of the underlying file system to the network. This became clear to me when I implemented tmpfs for NetBSD in 2005 and realized that a subset of the APIs I had to support were in service of NFS. This was… a mind-blowing realization. "Why would tmpfs, a memory file system, need NFS-specific glue?", I thought, and then I learned the bad stuff.

Anyhow. Now that you know that NFSv3 exposes inodes, you may understand why sharing a directory over NFSv3 is an all-or-nothing option for the whole file system. Even if you configure the server to export just a single directory, malicious clients can craft NFS RPCs that reference inodes outside of the shared directory. Which means… they get free access to the whole file system, and explains why system administrators used to put NFS-exported shares in separate partitions in an attempt to isolate access to subsets of data.

To make things worse, NFSv3 has no concept of security. A client can simply assert that a request comes from UID 1000 and the server will trust that the client is really operating on behalf of the server's UID 1000. Which means: a malicious client can pretend to be any user that exists on the server and gain access to any file in the exported file system. Which then explains why root-squashing options exist as an attempt to prevent impersonating root… but only root. Crazy talk.
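To illustrate that root squashing, this is roughly what an NFSv3 export with the root user remapped looks like; the paths and network are made up, and the exact spelling differs between Linux (root_squash) and the BSDs (maproot):

    # Linux /etc/exports: remap requests claiming to be root to an unprivileged user.
    /exports/data 192.168.1.0/24(rw,root_squash)

    # FreeBSD /etc/exports: equivalent effect via maproot.
    /exports/data -maproot=nobody -network 192.168.1.0 -mask 255.255.255.0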
All in all, NFSv3 may still be OK if you really trust the network, if you compartmentalize the exported file system, and if you are sharing inconsequential stuff. But can you trust the network? Maybe you can if you are using a P2P link, but otherwise… it is really, really risky and I do not want to do that.

How is NFSv4 better?

NFSv4, despite having the same NFS name as NFSv3, is a completely different protocol. Here are two main differences: NFSv4 operates on the basis of usernames, not UIDs. Each request to the server contains a username and the server is responsible for translating that username to a local UID while verifying access permissions. NFSv4 operates at the path level, not the inode level. Each request to the server contains the path of the file to operate on and thus the server can apply access control based on it.

Take these two differences together and NFSv4 can implement secure access to file systems. Because the server sees usernames and paths, the server can first verify that a user is who they claim to be. And because the server can authenticate users, it can then authorize accesses at the path level. That said, if all you have is NFSv4, you only get the sec=sys security level, which is… the same as having no security at all. In this mode, the server trusts the client and assumes that user X on the client maps exactly to user X on the server, which is almost the same as what NFSv3 did. The real security features of NFSv4 come into play when it's paired with Kerberos. When Kerberos is in the picture, you get to choose from the following security levels for each network share:

sec=krb5: Requires requests to be authenticated by Kerberos, which is good to ensure only trusted users access what they should, but offers zero "on-the-wire" security. Traffic flows unsigned and unencrypted, so an attacker could tamper with the data and slurp it before it reaches the client.

sec=krb5i: Builds on krb5 to offer integrity checks on all data. Basically, all packets on the network are signed but not encrypted. This prevents spoofing packets but does not secure the data against prying eyes.

sec=krb5p: Builds on krb5i to offer encrypted data on the wire. This prevents tampering with the network traffic and also stops anyone from seeing what's being transferred.

Sounds good? Yeah, but unfortunately, Kerberos and its ecosystem are… complicated. Kerberos is an authentication broker. Its goal is to detach authentication decisions between a client machine and a service running on a second machine, and move that responsibility to a third machine—the Key Distribution Center (KDC). Consequently, the KDC is a trusted entity between the two machines that try to communicate with each other. All the machines that interact with the KDC form a realm (AKA a domain, but not a DNS domain). Each machine needs a krb5.conf file that describes which realms the machine belongs to and who the KDC for each realm is.

The actors that exist within the realm are the principals. The KDC maintains the authoritative list of principals and their authentication keys (passwords). These principals represent:

Users, which have names of the form user@REALM. There has to be one of these principals for every person (or role) that interacts with the system.

Machines, which have names of the form host/fqdn@REALM. There has to be one of these principals for every server and, depending on the service, the clients may need one too, as is the case for NFSv4.

Services, which have names of the form service/fqdn@REALM. Some services like NFSv4 require one of these, in which case the service name is nfs, but others like SSH do not.
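Putting those three principal types together, and using a hypothetical EXAMPLE.ORG realm with a NAS called nas.example.org, the principals involved would look something like this:

    alice@EXAMPLE.ORG                    # a user principal for the person "alice"
    host/nas.example.org@EXAMPLE.ORG     # a machine principal for the NAS itself
    nfs/nas.example.org@EXAMPLE.ORG      # the NFS service principal on the NAS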
Let's say Alice wants to log into the Kerberos-protected SSH service running on SshServer from a client called LinuxLaptop, all within the same Kerberos realm. (Beware that the description below is not 100% accurate. My goal is for you to understand the main concepts so that you can operate a Kerberos realm.)

First, Alice needs to obtain a Ticket-Granting-Ticket (TGT) if she doesn't have a valid one yet. This ticket is issued by the KDC after authenticating Alice with her password, and allows Alice to later obtain service-specific tickets without having to provide her password again. For this flow:

Steps involved in obtaining a TGT from the KDC for a user on a client machine.

Alice issues a login request to the KDC from the client LinuxLaptop by typing kinit (or using other tools such as a PAM module). This request carries Alice's password. The KDC validates Alice's authenticity by checking her password against the KDC's database and issues a TGT. The TGT is encrypted with the KDC's key and includes an assertion of who Alice is and how long the ticket is valid for. The client LinuxLaptop stores the TGT on disk. Alice can issue klist to see the ticket.

The TGT, however, is not sufficient to access a service. When Alice wants to access the Kerberos-protected SSH service running on the SshServer machine, Alice needs a ticket that's specific to that service. For this flow:

Steps involved in obtaining a service-specific ticket from the KDC for a user on a client machine.

Alice sends a request to the Ticket-Granting Service (TGS) and asks for a ticket to SshServer. This request carries the TGT. The TGS (which lives in the KDC) verifies who the TGT belongs to and verifies that it's still valid. If so, the TGS generates a ticket for the service. This ticket is encrypted with the service's secret key and includes details on who Alice is and how long the ticket is valid for. The client LinuxLaptop stores the service ticket on disk. As before, Alice can issue klist to see the ticket.

At this point, all prerequisite Kerberos flows have taken place. Alice can now initiate the connection to the SSH service:

Accessing a remote SSH server using a Kerberos ticket without password authentication.

Alice sends the login request from the LinuxLaptop client to the SshServer server and presents the service/host-specific ticket that was granted to her earlier on. The SshServer server decrypts the ticket with its own key, extracts details of who the request is from, and verifies that they are correct. This happens without talking to the KDC and is only possible because SshServer trusts the KDC via a pre-shared key. The SSH service on SshServer decides if Alice has SSH access as requested and, if so, grants such access.

Note these very important details: The KDC is only involved in the ticket issuance process. Once the client has a service ticket, all interactions between the client and the server happen without talking to the KDC. This is essential to not make the KDC a bottleneck in the communication. Each host/service and the KDC have unique shared keys that are known by both the host/service and the KDC. These shared keys are created when registering the host or service principals and are copied to the corresponding machines as part of their initial setup. These keys live in machine-specific keytab files. Kerberos does authentication only, not authorization. The decision to grant Alice access to the SSH service on SshServer is made by the service itself, not Kerberos, after asserting that Alice is truly Alice.
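In day-to-day terms, and assuming the hypothetical EXAMPLE.ORG realm from before, the whole flow above boils down to a couple of commands on the client:

    kinit alice@EXAMPLE.ORG       # obtain a TGT; this is the only password prompt
    klist                         # inspect the TGT and any cached service tickets
    ssh sshserver.example.org     # the service ticket is requested and used transparently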
As you can imagine, the KDC must be protected with the utmost security measures. If an attacker can compromise the KDC's locally-stored database, they will get access to all shared keys, so they can impersonate any user against any Kerberos-protected service in the network. That's why attackers try to breach an Active Directory (AD) service as soon as they infiltrate a Microsoft network because… AD is a KDC.

Enough theory. Let's get our hands dirty and follow the necessary steps to set up a KDC. The KDC's needs are really modest. Per the discussion above, the KDC isn't in the hot data path of any service, so the number of requests it receives is limited. Those requests are not particularly complex to serve either: at most, there is some CPU time to process cryptographic material but no I/O involved, so for a small network, any machine will do. In my particular case, I set up the KDC on my little pfSense box as it is guaranteed to be almost-always online. This is probably not the best of ideas security-wise, but… it's sufficient for my paranoia levels. Note that most of the steps below will work similarly on a FreeBSD box, but if you are attempting that, please go read FreeBSD's official docs on the topic instead. Those docs are one of the few decent guides on Kerberos out there.

The pfSense little box that I run the KDC on.

Here are the actors that will appear throughout the rest of this article. I'm using the real names of my setup because, once again, these are my lab notes: the Kerberos realm itself; the user on the client machine wanting access to the NFSv4 share (the UID is irrelevant); the pfSense box running the KDC; the Synology DS923+ NAS acting as the NFSv4 server; a FreeBSD machine that will act as a Kerberized SSH server for testing purposes and as an NFSv4 client (it's a ThinkStation, hence its name); and a Linux machine that will act as an NFSv4 client (while in reality this is running Fedora, I'll use its hostname interchangeably for Fedora and Debian).

Knowing all actors, we can set up the KDC. The first step is to create the krb5.conf for the KDC, which tells the system which realm the machine belongs to. You'll have to open up SSH access to the machine via the web interface to perform these steps. The minimum content you need simply declares the realm and says where the KDC lives. With that, you should be able to start the kdc service, which implements the KDC proper. All documentation you find out there will tell you to also start kadmind but, if you don't plan to administer the KDC from another machine (why would you?), then you don't need this service. pfSense's configuration is weird because of the read-only nature of its root partition, so to enable the service you have to edit the configuration file stored in NVRAM and add the corresponding line right before the closing tag of the relevant section. If you were to set this up on a FreeBSD host instead of pfSense, you would modify /etc/rc.conf instead. Then, start the service from a root shell in either case.

It is now a good time to ensure that every machine involved in the realm has a DNS record and that reverse DNS lookups work. Failure to do this will cause problems later on when attempting to mount the NFSv4 shares, and clearing those errors won't be trivial because of caching at various levels.
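As a hedged reconstruction of those two pieces (the realm, domain, and hostname below are placeholders, and the exact stanzas in the original post may differ), a minimal krb5.conf plus the FreeBSD-style service enablement look roughly like this:

    # /etc/krb5.conf: declare the realm and point machines at the KDC.
    [libdefaults]
        default_realm = EXAMPLE.ORG

    [realms]
        EXAMPLE.ORG = {
            kdc = kdc.example.org
        }

    [domain_realm]
        .example.org = EXAMPLE.ORG
        example.org = EXAMPLE.ORG

    # On a plain FreeBSD host: enable and start the KDC daemon.
    #   /etc/rc.conf:  kdc_enable="YES"
    service kdc start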
Once the KDC is running, we must create principals for the hosts, the NFSv4 service, and the users that will be part of the realm. The client host and service principals aren't always necessary though: SSH doesn't require them, but NFSv4 does. To create the principals, we need access to the KDC's administrative console. Given that the KDC isn't fully configured yet, we can only gain such access by running the administrative tool in local mode on the KDC machine directly (the pfSense shell), which bypasses the networked kadmind service that we did not start.

Start the administrative console and initialize the realm. Next, create principals for the users that will be part of the realm. Then, create principals for the hosts (server and clients, but not the KDC) and for the NFSv4 service. And finally, extract the host and service credentials into the machine-specific keytab files. Note that, for the servers, we extract both the host and any service principals they need, but for the clients, we just extract the host principal. We do not export any user principals.

You now need to copy each extracted keytab file to the corresponding machine and install it as the system keytab. (We'll do this later on the Synology NAS via its web interface.) This file is what contains the shared key between the KDC and the host and is what allows the host to verify the authenticity of KDC tickets without having to contact the KDC. Make sure to protect it with restrictive permissions so that nobody other than root can read it. If copying the file over SSH is unsuitable or hard to do from the KDC to the client machines (as is my case because I restrict SSH access to the KDC to one specific machine), you can print out a textual representation of the keytab and use the local clipboard to carry it to a shell session on the destination machine.

At this point, the realm should be functional, but we need to make the clients become part of the realm. We also need to install all the necessary tools, like kinit, which aren't present by default on some systems. On Debian: install the Kerberos client packages and follow the prompts that the installer shows to configure the realm and the address of the KDC; this auto-creates the krb5.conf with the right contents so you don't have to do anything else. On Fedora: install the Kerberos client packages and edit the system-provided krb5.conf file to register the realm and its settings; use the file content shown above for the KDC as the template, or simply replace the placeholders with the name of your DNS domain and realm. On FreeBSD: create the krb5.conf file from scratch in the same way we did for the KDC.

All set! But… do you trust that you did the right thing everywhere? We could go straight into NFSv4, but due to the many pitfalls in its setup, I'd suggest you verify your configuration using a simpler service like SSH. To do this, modify the SSH server's sshd_config (aka sshd's configuration file) and enable GSSAPI authentication so that it can leverage Kerberos. Restart the SSH service and give it a go: run kinit on the client and then see how ssh logs you in without typing a password anymore.

But… GSSAPI? What's up with the cryptic name? GSS-API stands for Generic Security Services API and is the interface that programs use to communicate with the Kerberos implementation on the machine. GSS-API is not always enabled by default for a service, and the way you enable it is service-dependent. As you saw above, all we had to do for SSH was modify one configuration file… but for other services, you may need to take extra steps on the server and/or the client.
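A minimal sketch of that SSH sanity check, under the assumption that the server runs OpenSSH and that the hypothetical EXAMPLE.ORG realm from before is in place:

    # On the SSH server: enable GSSAPI authentication in /etc/ssh/sshd_config
    # and restart sshd. The relevant line is:
    #     GSSAPIAuthentication yes

    # On the client: get a TGT and try to log in; no password prompt should appear.
    kinit alice@EXAMPLE.ORG
    ssh alice@sshserver.example.org
    klist    # now also shows a service ticket for host/sshserver.example.org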
And, guess what, NFSv4 is weird on this topic. Not only do we need service-specific principals for NFS, but we also need the gssd daemon to be running on the server and the client machines. And we also need a separate daemon to map Unix user names to Kerberos principals (rpc.idmapd on Linux, nfsuserd on FreeBSD). This is because NFSv4 is typically implemented inside the kernel, but Kerberos is not, so the kernel needs a mechanism to "call into" Kerberos and that's precisely what these daemons do:

On the Synology NAS: Do nothing. The system handles these daemons by itself.

On Linux: You shouldn't have to do anything if you correctly installed the system keytab early enough, but make sure the gssd service is running (and know that its status command only shows useful diagnostic logs when run as root). Start the gssd and idmapd services if they aren't running.

On FreeBSD: Enable the NFS client, gssd, and nfsuserd daemons in /etc/rc.conf.

It's time to deal with NFSv4, so let's start by configuring the server on the NAS. The Synology Disk Station Manager (DSM) interface—the web UI for the NAS—is… interesting. As you might expect, it is a web-based interface but… it pretends to be a desktop environment in the browser, which I find overkill and unnecessary. But it's rather cool in its own way.

Navigating the Synology DSM menus to configure the NFS file service with Kerberos.

The first step is to enable the NFS service. Roughly follow these steps, which are illustrated in the picture just above: Open the File Services tab of the Control Panel. In the NFS tab, set NFSv4 as the Minimum NFS protocol. Click on Advanced Settings and, in the panel that opens, enter the Kerberos realm under the NFSv4 domain option. Click on Kerberos Settings and, in the panel that opens, select Import and then upload the keytab file that we generated earlier on for the NAS. This should populate the host and nfs service principals in the list. Finish and save all settings. That should be all to enable NFSv4 file serving.

Navigating the Synology DSM menus to configure the properties of a single shared folder over NFS.

Then, we need to expose shared folders over NFSv4, and we have to do this for every folder we want to share. Assuming you want to share the folder shown in the picture just above: Open the Shared Folder tab of the Control Panel. Select the folder you want to share and click Edit. In the NFS Permissions tab, click either Create or Edit to enter the permissions for every client host that should have access to the share. Fill in the NFS rule details. In particular, enter the hostname of the client machine, enable the Allow connections from non-privileged ports option, and select the Security level you desire. In my case, I want krb5p and only krb5p, so that's the only option I enable. But your risk profile and performance needs may be different, so experiment and see what works best for you.

Now that the server is ready and we have dealt with the GSS-API prerequisites, we can start mounting NFSv4 on the clients. On Linux, things are pretty simple: we mount the file system with a sec=krb5p NFSv4 mount, optionally persist the entry in /etc/fstab, and then we should be able to list its contents assuming we've got a valid TGT for the current user (run kinit if it doesn't work).
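A hedged sketch of what that looks like on the Linux side, using the hypothetical nas.example.org host and a share called photos (the real share name and export path in my setup differ):

    # One-off mount:
    sudo mount -t nfs4 -o sec=krb5p nas.example.org:/photos /mnt/photos

    # Or the equivalent /etc/fstab entry:
    # nas.example.org:/photos  /mnt/photos  nfs4  sec=krb5p  0  0

    # With a valid TGT (kinit), the listing should now work:
    ls -l /mnt/photos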
Easy peasy, right? But wait… why do all directories have 777 permissions? This is rather unfortunate and I'm not sure why the Synology does this. Logging onto the DS923+ via SSH, I inspected the shared directory and realized that it has various ACLs in place to control access to the directories but, somehow, the traditional Unix permissions are indeed wide open. Not great. I used chmod to tighten the permissions for all directories and things seem to be OK, but that doesn't give me a lot of comfort because I do not know if the DSM will ever undo my changes or if I might have broken something.

There might be one more problem though, which I did not encounter on Debian clients but that showed up later on Fedora clients: all entries in the listing appear with the wrong owner (typically nobody), which is… not correct. Yet the right access control is in effect: the directory is only accessible by the expected user, as in the earlier listing. Which means that the user mapping between Kerberos principals and local users is working correctly on the server… but not on the client, where the ID mapping is not returning the right information. This can happen for two reasons. The first is that the mapping service might not be running, but we already covered starting it up earlier. The second is that the client might not have the right domain name for NFSv4. You can check this by printing the host's fully-qualified name and, if the output doesn't include the DNS domain, updating the hostname to include a domain name portion.

We now have the Linux clients running just fine, so it is time to pivot to FreeBSD. If we try a similar "trivial" mount command, we get an error, and the error is pretty… unspecific. It took me quite a bit of trial and error to realize that I had to specify the nfsv4 option for it to attempt an NFSv4 connection and not NFSv3 (unlike Linux, whose mount command attempts the highest possible version first and then falls back to older versions). OK, progress. Now the command complains that the security flavor we request is wrong. Maybe we just need to be explicit and also pass sec=krb5p as an argument? Wait, what? The mount operation still fails?

This was more puzzling and also took a fair bit of research to figure out because logs on the client and on the server were just insufficient to see the problem. The reason for the failure is that we are trying to mount the share as root but… we don't have a principal for this user, so root cannot obtain an NFSv4 service ticket to contact the NAS. So… do we need to create a principal for root? No! We do not need to provide user credentials when mounting an NFSv4 share (unlike what you might be used to with Windows shares). What Kerberized NFSv4 needs during the mount operation is a host ticket: the NFSv4 server checks if the client machine is allowed to access the server and, if so, exposes the file system to it. This is done using the client's host principal. Once the file system is mounted, however, all operations against the share carry the ticket of the user requesting the operation.

Knowing this, we need to "help" FreeBSD and tell it that it must use the host's principal when mounting the share. Why this isn't the default, I don't know, particularly because non-root users are not allowed to mount file systems in the default configuration. Anyhow. The gssname option rescues us and finally allows the mount operation to succeed, and we should persist all this knowledge into an /etc/fstab entry.

Lastly, note that at the time of this writing with FreeBSD 14.2-STABLE, the username to Kerberos principal mapping provided by nfsuserd may not work correctly and you may end up seeing the wrong owner on all files on the NFSv4 share. This can be caused by a domain name mismatch as explained in the Linux case—which you can resolve by pointing nfsuserd at the right NFSv4 domain—or by an actual kernel bug I had to diagnose. Which speaks to the sheer complexity of NFSv4…
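A hedged sketch of the FreeBSD side, again with the hypothetical nas.example.org host and photos share; the options are the ones discussed above, but double-check them against mount_nfs(8) for your release:

    # One-off mount, forcing NFSv4, Kerberos with privacy, and the host principal:
    sudo mount -t nfs -o nfsv4,sec=krb5p,gssname=host \
        nas.example.org:/photos /mnt/photos

    # Or the equivalent /etc/fstab entry:
    # nas.example.org:/photos  /mnt/photos  nfs  rw,nfsv4,sec=krb5p,gssname=host  0  0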
Color me skeptical, but everything I described above seems convoluted and fragile, so I did not trust that my setup was sound. Consequently, I wanted to verify that the traffic on the network was actually encrypted. To verify this, I installed Wireshark and ran a traffic capture against the NAS with NFS traffic as the filter. Then, from the client, I created a text file on the shared folder and then read it. Inspecting the captured packets confirmed that the traffic is indeed flowing in encrypted form: I could not find the raw file content anywhere in the whole trace (but I could when using anything other than krb5p).

Content of an NFS reply packet with Kerberos-based encryption. The packet contents are not plain text.

And, as a final test, I tried to mount the network share without sec=krb5p and confirmed that this was not possible. All good! I think…

Open questions

That's about it. But I still have a bunch of unanswered questions from this setup: Kerberos claims to be an authentication system only, not an authorization system. However, the protocol I described above separates the TGT from the TGS, and this separation makes it sound like Kerberos could also implement authorization policies. Why doesn't it? Having to type kinit after logging into the machine is annoying. I remember that, back at Google when we used Kerberos and NFS—those are long gone days—the right tickets would be granted after logging in or unlocking a workstation. This must have been done with the Kerberos PAM modules… but I haven't gotten them to do this yet and I'm not sure why. The fact that the shared directories created by the Synology NAS have 777 permissions seems wrong. Why is it doing that? And does anything break if you manually tighten these permissions? And the most important question of all: is this all worth it? I'm tempted to just use password-protected Samba shares and call it a day. I still don't trust that the setup is correct, and I still encounter occasional problems here and there.

If you happen to have answers to any of the above or have further thoughts, please drop a note in the comments section. And… If you want to see how the DS923+ compares to my home-built FreeBSD NAS with ZFS, subscribe for a future post on that!

Credit and disclaimers: the DS923+ and the 3 drives it contains that I used throughout this article were provided to me for free by Synology for evaluation purposes in exchange for blogging about the NAS. The content in this article is not endorsed by and has not been reviewed by them.
The pfSense little box that I run the KDC on.

Here are the actors that will appear throughout the rest of this article. I'm using the real names of my setup here because, once again, these are my lab notes:

- The name of the Kerberos realm.
- The user on the client machine wanting access to the NFSv4 share. The UID is irrelevant.
- The pfSense box running the KDC.
- The Synology DS923+ NAS acting as the NFSv4 server.
- A FreeBSD machine that will act as a Kerberized SSH server for testing purposes and as an NFSv4 client. (It's a ThinkStation, hence its name.)
- A Linux machine that will act as an NFSv4 client. While in reality this is running Fedora, I'll use this hostname interchangeably for Fedora and Debian.

Each client machine then needs its Kerberos configuration file, /etc/krb5.conf, in place:

On Debian: Install the Kerberos client packages (typically apt install krb5-user) and follow the prompts that the installer shows to configure the realm and the address of the KDC. This will auto-create /etc/krb5.conf with the right contents so you don't have to do anything else.

On Fedora: Install the Kerberos client packages (typically dnf install krb5-workstation) and edit the system-provided /etc/krb5.conf file to register the realm and its settings. Use the file content shown above for the KDC as the template, or simply replace all the domain and realm placeholders with the name of your DNS domain and realm.

On FreeBSD: Create the /etc/krb5.conf file from scratch in the same way we did for the KDC.

On the Synology NAS: Do nothing. The system handles this by itself.

As for the services that need to be running:

On Linux: You shouldn't have to do anything if you correctly created the prerequisite early enough, but make sure the service is running (and know that querying its status only shows useful diagnostic logs when run as root). Start the service if it isn't running.

On FreeBSD: Add the corresponding service-enabling entries to /etc/rc.conf.

Navigating the Synology DSM menus to configure the NFS file service with Kerberos.

The first step is to enable the NFS service. Roughly follow these steps, which are illustrated in the picture just above:

1. Open the File Services tab of the Control Panel.
2. In the NFS tab, set NFSv4 as the Minimum NFS protocol.
3. Click on Advanced Settings and, in the panel that opens, enter the Kerberos realm under the NFSv4 domain option.
4. Click on Kerberos Settings and, in the panel that opens, select Import and then upload the keytab file that we generated earlier on for the NAS. This should populate the principals from the keytab in the list.
5. Finish and save all settings.

Navigating the Synology DSM menus to configure the properties of a single shared folder over NFS.

Then, we need to expose shared folders over NFSv4, and we have to do this for every folder we want to share. Assuming you want to share the folder shown in the picture just above:

1. Open the Shared Folder tab of the Control Panel.
2. Select the folder you want to share and click Edit.
3. In the NFS Permissions tab, click either Create or Edit to enter the permissions for every client host that should have access to the share.
4. Fill in the NFS rule details. In particular, enter the hostname of the client machine, enable the Allow connections from non-privileged ports option, and select the Security level you desire.
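With the share exposed, mounting it from a Linux client should look roughly like the sketch below. The NAS hostname, export path, and mount point are placeholders rather than the real names from my setup, and the sec= option must match the Security level chosen in the NFS rule (krb5 for authentication, krb5i to add integrity, krb5p to add privacy):

    # Make sure we hold a ticket; the gssd service must also be running, and the
    # host's keytab can provide credentials for the mount itself.
    kinit alice@EXAMPLE.ORG

    # Mount the NFSv4 export with Kerberos authentication.
    mount -t nfs4 -o sec=krb5 nas.example.org:/volume1/shared /mnt/shared

Without valid credentials, access to the share should be denied, which is what the final test below exercises.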
Content of an NFS reply packet with Kerberos-based encryption. The packet contents are not plain text.

And, as a final test, I tried to mount the network share without valid Kerberos credentials and confirmed that this was not possible. All good! I think…

Open questions

That's about it. But I still have a bunch of unanswered questions from this setup:

- Kerberos claims to be an authentication system only, not an authorization system. However, the protocol I described above separates the TGT from the TGS, and this separation makes it sound like Kerberos could also implement authorization policies. Why doesn't it do so?
- Having to type kinit after logging into the machine is annoying. I remember that, back at Google when we used Kerberos and NFS (those are long gone days), the right tickets would be granted after logging in or unlocking a workstation. This must have been done with the Kerberos PAM modules… but I haven't gotten them to do this yet and I'm not sure why.
- The fact that the shared directories created by the Synology NAS have 777 permissions seems wrong. Why is it doing that? And does anything break if you manually tighten these permissions?
- And the most important question of all: is this all worth it? I'm tempted to just use password-protected Samba shares and call it a day.

I still don't trust that the setup is correct, and I still encounter occasional problems here and there.
