
Self-hosting my photos with Immich

For every cloud service I use, I want to have a local copy of my data, for backup purposes and independence. Unfortunately, the tool I was using to download my Google Photos stopped working in March 2025 when Google restricted the OAuth scopes, so I needed an alternative for my existing Google Photos setup. In this post, I describe how I have set up Immich, a self-hostable photo manager.

Here is the end result: a few (live) photos from NixCon 2025:

I am running Immich on my Ryzen 7 Mini PC (ASRock DeskMini X600), which consumes less than 10 W of power in idle and has plenty of resources for VMs (64 GB RAM, 1 TB disk). You can read more about it in my blog post from July 2024:

When I saw the first reviews of the ASRock DeskMini X600 barebone, I was immediately interested in building a home-lab hypervisor (VM host) with it. Apparently, the DeskMini X600 uses less than 10 W of power but supports latest-generation AMD CPUs like the Ryzen 7 8700G! Read more →

I installed Proxmox, an Open Source virtualization platform, to divide this mini server into VMs, but you could of course also install Immich directly on any server. I created a VM (named “photos”) with 500 GB of disk space, 4 CPU cores and 4 GB of RAM. For the initial import, you could assign more CPU and RAM, but for normal usage, that’s enough.

I (declaratively) installed NixOS on that VM as described in this blog post:

For one of my network storage PC builds, I was looking for an alternative to Flatcar Container Linux and tried out NixOS again (after an almost 10 year break). There are many ways to install NixOS, and in this article I will outline how I like to install NixOS on physical hardware or virtual machines: over the network and fully declaratively. Read more →

Afterwards, I enabled Immich in my NixOS configuration. At this point, Immich is available locally, but not over the network, because NixOS enables a firewall by default.
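For reference, enabling Immich on NixOS can be as simple as the following sketch. The option names are from the nixpkgs services.immich module; double-check them against your nixpkgs release:

```nix
{
  services.immich = {
    enable = true;
    # Data (originals, thumbnails, SQL dumps) ends up under the
    # media location, /var/lib/immich by default.
    # The server listens on localhost only; the NixOS firewall
    # stays closed unless you also set openFirewall = true.
  };
}
```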
I could enable the firewall option, but I actually want Immich to only be available via my Tailscale VPN, for which I don’t need to open firewall access — instead, I use Tailscale’s serve feature to forward traffic to the local Immich port. Because I have Tailscale’s MagicDNS and TLS certificate provisioning enabled, that means I can now open https://photos.example.ts.net in my browser on my PC, laptop or phone.

At first, I tried importing my photos using the official Immich CLI. Unfortunately, the upload was not running reliably and had to be restarted manually a few times after running into a timeout. Later I realized that this was because the Immich server runs background jobs like thumbnail creation, metadata extraction or face detection, and these background jobs slow down the upload to the extent that the upload can fail with a timeout.

The other issue was that, even after the upload was done, I realized that Google Takeout archives for Google Photos contain metadata in separate JSON files next to the original image files. Unfortunately, these files are not considered by the official Immich CLI.

Luckily, there is a great third-party tool called immich-go, which solves both of these issues! It pauses background tasks before uploading and restarts them afterwards, which works much better, and it does its best to understand Google Takeout archives. I ran immich-go as follows and it worked beautifully:

My main source of new photos is my phone, so I installed the Immich app on my iPhone, logged into my Immich server via its Tailscale URL and enabled automatic backup of new photos via the icon at the top right. I am not 100% sure whether these settings are correct, but it seems like camera photos generally go into Live Photos, and Recent should cover other files…?! If anyone knows, please send an explanation (or a link!) and I will update the article.

I also strongly recommend disabling notifications for Immich, because otherwise you get notifications whenever it uploads images in the background.
These notifications are not required for background upload to work, as an Immich developer confirmed on Reddit. Open Settings → Apps → Immich → Notifications and un-tick the permission checkbox.

Immich’s documentation on backups contains some good recommendations. The Immich developers recommend backing up the entire contents of the media location, which is /var/lib/immich on NixOS. One subdirectory contains SQL dumps, whereas three other directories contain all user-uploaded data. Hence, I have set up a systemd timer that copies this directory onto my PC, which is enrolled in a 3-2-1 backup scheme.

Immich (currently?) does not contain photo editing features, so to rotate or crop an image, I download the image and use GIMP. To share images, I still upload them to Google Photos (depending on who I share them with).

The two most promising options in the space of self-hosted image management tools seem to be Immich and Ente. I got the impression that Immich is more popular in my bubble, while Ente’s scope seems far larger than what I am looking for:

Ente is a service that provides a fully open source, end-to-end encrypted platform for you to store your data in the cloud without needing to trust the service provider. On top of this platform, we have built two apps so far: Ente Photos (an alternative to Apple and Google Photos) and Ente Auth (a 2FA alternative to the deprecated Authy).

I don’t need an end-to-end encrypted platform. I already have encryption on the transit layer (Tailscale) and disk layer (LUKS), no need for more complexity.

Immich is a delightful app! It’s very fast and generally seems to work well. The initial import is smooth, but only if you use the right tool. Ideally, the official CLI could be improved, or maybe immich-go could be made the official one. I think the auto backup is too hard to configure on an iPhone, so that could also be improved. But aside from these initial stumbling blocks, I have no complaints.
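The backup timer mentioned above can be declared in NixOS roughly as follows. This is a sketch, not my exact setup: the rsync destination is hypothetical, and the schedule is up to you:

```nix
{ pkgs, ... }:
{
  systemd.services.immich-backup = {
    description = "copy Immich state to my PC";
    serviceConfig.Type = "oneshot";
    script = ''
      ${pkgs.rsync}/bin/rsync -a --delete /var/lib/immich/ backup@pc.example.ts.net:immich/
    '';
  };
  systemd.timers.immich-backup = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnCalendar = "daily";
  };
}
```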


My impressions of the MacBook Pro M4

I have been using a MacBook Pro M4 as my portable computer for the past half year and wanted to share a few short impressions. As always, I am not a professional laptop reviewer, so in this article you won’t find benchmarks, just subjective thoughts!

Back in 2021, I wrote about the MacBook Air M1, which was the first computer I used that contained Apple’s own ARM-based CPU. Having a silent laptop with long battery life was a game-changer, so I wanted to keep those properties. When the US government announced tariffs, I figured I would replace my 4-year-old MacBook Air M1 with a more recent model that should last a few more years. Ultimately, Apple’s prices remained stable, so, in retrospect, I could have stayed with the M1 for a few more years. Oh well.

I went to the Apple Store to compare the different options in person. Specifically, I was curious about the display and whether the increased weight and form factor of the MacBook Pro (compared to a MacBook Air) would be acceptable. Another downside of the Pro model is that it comes with a fan, and I really like absolutely quiet computers. Online, I read from other MacBook Pro owners that the fan mostly stays off.

In general, I would have preferred to go with a MacBook Air because it has enough compute power for my needs and I like the case better (no ventilation slots), but unfortunately only the MacBook Pro line has the better displays. Why aren’t all displays nano-textured? The employee at the Apple Store presented the trade-off as follows: the nano texture display is great at reducing reflections, at the expense of also making the picture slightly less vibrant.
I could immediately see the difference when placing two laptops side by side: the bright Apple Store lights showed up very prominently on the normal display (left), and were almost not visible at all on the nano texture display (right). Personally, I did not perceive a big difference in “vibrancy”, so my choice was clear: I’ll pick the MacBook Pro over the MacBook Air (despite the weight) for the nano texture display!

After using the laptop in a number of situations, I am very happy with this choice. In normal scenarios, I notice no reflections at all (where my previous laptop did show reflections!). This includes using the laptop on a train (next to the window), or using the laptop outside in daylight.

(When I chose the new laptop, Apple’s M4 chips were current. By now, they have released the first devices with M5 chips.)

I decided to go with the MacBook Pro with M4 chip instead of the M4 Pro chip because I don’t need the extra compute, and the M4 needs less cooling — the M4 Pro apparently runs hotter. This increases the chance of the fan staying off. Here are the specs I ended up with:

14" Liquid Retina XDR Display with nano texture
Apple M4 Chip (10 core CPU, 10 core GPU)
32 GB RAM (this is the maximum!), 2 TB SSD (enough for this computer)

One thing I noticed is that the MacBook Pro M4 sometimes gets warm, even when it is connected to power, is suspended to RAM and has been fully charged for hours. I’m not sure why. Luckily, the fan indeed stays silent. I think I might have heard it spin up once in half a year or so?

The battery life is amazing! The previous MacBook Air M1 had amazing all-day battery life already, and this MacBook Pro M4 lasts even longer. For example, watching videos on a train ride (with VLC) for 3 hours consumed only 10% of battery life. I generally never even carry the charger. Because of that, Apple’s re-introduction of MagSafe, a magnetic power connector (so you don’t damage the laptop when you trip over the cable), is nice-to-have but doesn’t really make much of a difference anymore. In fact, it might be better to pack a USB-C cable when traveling, as that makes you more flexible in how you use the charger.

I was curious whether the 120 Hz display would make a difference in practice. I mostly notice the increased refresh rate when there are animations, but not, for example, when scrolling. One surprising discovery (but obvious in retrospect) is that even non-animated interactions can become faster. For example, when running a Go web server on localhost, I noticed that navigating between pages by clicking links felt faster on the 120 Hz display! The following illustration shows why that is, using a page load that takes 6 ms of processing time. There are three cases (the illustration shows an average case and the worst case):

Best case: the page load finishes just before the next frame is displayed: no delay.
Worst case: the page load finishes just after a frame is displayed: one frame of delay.
Most page loads are somewhere in between: we’ll see 0.x to 1.0 frames of delay.

As you can see, the waiting time becomes shorter when going from 60 Hz (one frame every 16.6 ms) to 120 Hz (one frame every 8.3 ms). So if you’re working with a system that has <8 ms response times, you might observe actions completing (up to) twice as fast!

I don’t really notice going back to 60 Hz displays on computers. However, on phones, where a lot more animations are a key part of the user experience, I think 120 Hz displays are more interesting.

My ideal MacBook would probably be a MacBook Air, but with the nano-texture display! :)

I still don’t like macOS and would prefer to run Linux on this laptop. But Asahi Linux still needs some work before it’s usable for me (I need external display output, and M4 support). This doesn’t bother me too much, though, as I don’t use this computer for serious work.
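To sanity-check this reasoning, here is a small simulation (a sketch of mine, not from the original illustration): it averages the click-to-pixels latency for a 6 ms page load, assuming the result only becomes visible at the first frame boundary after processing completes.

```python
import random


def avg_display_latency(processing_ms, frame_ms, trials=100_000):
    """Average click-to-pixels latency when results only become visible
    at the next frame boundary after processing finishes."""
    rng = random.Random(0)  # fixed seed for reproducible averages
    total = 0.0
    for _ in range(trials):
        click = rng.uniform(0, frame_ms)   # click lands anywhere in a frame
        done = click + processing_ms       # processing finishes here
        frames = -(-done // frame_ms)      # ceiling: next frame boundary
        total += frames * frame_ms - click # result visible at that boundary
    return total / trials


# A 6 ms page load is displayed after ~14.3 ms on average at 60 Hz,
# but after only ~10.2 ms at 120 Hz.
print(avg_display_latency(6, 1000 / 60))
print(avg_display_latency(6, 1000 / 120))
```

The averages line up with the illustration: the extra latency beyond the 6 ms of processing shrinks from roughly half a 60 Hz frame to roughly half a 120 Hz frame.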


NixCon 2025 Trip Report 🐝

I liked the NixOS meetup earlier this year, and at the end of the meetup they told everyone about NixCon 2025, which would be happening in Switzerland this year, at the very same location: the University of Applied Sciences OST in Rapperswil. So I decided to go! In this trip report, I want to give you a rough impression of how I experienced this awesome conference :) The bee in the title is a NixCon inside joke ;)

I arrived at about 09:30 on a rainy Friday morning, meaning I hurried from the train station into OST building 1 to show my ticket QR code and pick up my conference badge and the custom name tag that I had pre-ordered. The custom ones have your name engraved and come with a strong magnet to attach them to your clothes.

After grabbing a bite to eat, I headed to the main lecture hall for the opening session. Prof. Dr. Farhad Mehta from OST, as well as the entire NixCon orga team, welcomed the 450 registered attendees to the 10th NixCon! I recognized many familiar faces from the Nix meetup, but many hands went up when the audience was asked for whom this was their first time at NixCon, or in Switzerland in general.

I want to thank Prof. Mehta in particular for making such meetups and events possible! 👏 If you work at a university, school or other organisation that has access to rooms, consider offering to host a meetup (on a regular basis, or even just once)! Locations are always hard to find, so offering a space is a great contribution to Open Source.

The first technical talk of the day was “What if GitHub Actions were local-first and built using Nix?” by Domen Kožar, the person behind cachix.org, which is a hosted Nix cache. The talk pitched cloud.devenv.sh, which is a Nix-based CI solution (like GitHub Actions) using devenv. By using this solution, you solve the problem that you can’t easily / completely run GitHub Actions locally (yes, we all know about act), and you get to (?) write Nix configs instead of YAML configs.
The solution seems nice, but I found the talk a little unstructured because the presenter jumped around between slides so much. One crucial question was left unanswered: how do you integrate this custom solution with your GitHub projects? To me, diverging from the default way of configuring GitHub Actions does not seem worth it for my projects. YMMV.

→ watch the recording (46 minutes) on media.ccc.de

Next up: “Rewriting the Hydra Queue Runner in Rust” by Simon Hauser from Helsinki Systems, a small German software company. Hydra is the component in the NixOS infrastructure which schedules builds: when nixpkgs changes, this is the component that runs the build whose result ends up on cache.nixos.org (the Debian equivalent is buildd). Simon explained that bottlenecks in the current queue runner result in stranding of infrastructure: the project has machines available that it cannot use fully. He outlined how they replaced a crufty SSH-based automation with a well-designed gRPC protocol. I got the impression that a group of people was involved in developing and reviewing this design, which is a great sign for a healthy project.

One thing that was unfortunately missing from the talk was metrics. It would have been great to see a few graphs that illustrate just how much better the rewritten queue runner is. Currently, the new queue runner is already used for Nix Community builds, but not yet in production for NixOS itself. Hopefully soon, though!

→ watch the recording (27 minutes) on media.ccc.de

This talk was presented by Zach Mitchell from Flox, which is a Nix-based dev environment solution. Thus far, I use plain Nix development shells (see Development shells with Nix: four quick examples), so I was curious what I’d learn from this talk. Zach explained that the standard Nix development shell commands were originally written to debug Nix package builds, not to provide general-purpose development environments. For users, this manifests as not being able to use your favorite shell — only Bash is supported.
One might read about the workaround of launching your favorite shell from within the Nix shell, but that doesn’t really work, because then the shell’s RC files run after the Nix setup, possibly destroying parts of the setup. One interesting thing I learnt is that the Nix garbage collector scans /proc to avoid removing Nix store paths that are still needed by running processes.

Zach mentioned https://github.com/zmitchell/proctrace, which is a bpftrace-based profiler that tracks forks/execs and generates Gantt chart syntax of the timing. Sounds cool, but is unfortunately broken right now…? Too bad.

→ watch the recording (45 minutes) on media.ccc.de

In this fireside chat, Tarus Balog shared how he ended up at AWS after 20 years of Open Source, and how his team wants to give back to the community. One specific way in which they’re doing that is by hosting cache.nixos.org.

→ watch the recording (24 minutes) on media.ccc.de

Josh Heinrichs from Shopify shared how they adopted Nix (again!), and I think real-world enterprise adoption stories like these are very interesting. In summary: Shopify has had an internal dev-environment command since 2016, which offered declarative configuration and then dispatched to the native package manager on Linux or macOS. The first attempt to move it to Nix didn’t reach stable footing (some folks couldn’t use it yet), and then a company-wide shift to cloud development happened, where the easier solution was to “just use ubuntu”. A few years in, folks are apparently not so happy with the cloud development environments, and one day Shopify CEO Tobias Lütke finds devenv, a Nix-based solution that is remarkably similar to Shopify’s own tool. So Tobi adopts devenv for one of their services and becomes supportive of using Nix. This time around, they spent a lot more time on a successful rollout within the organization, meaning incremental adoption, getting all stakeholders on board, etc. The takeaway is that one specific, well-supported use-case can be the adoption driver.
And once you have your development environments on a Nix-based solution, you can more easily adopt other parts of the ecosystem as well.

→ watch the recording (19 minutes) on media.ccc.de

In a similar spirit to the Shopify talk, Kavisha Kumar from ASML shared how she got into Nix after seeing a colleague use it to obtain a clean development shell. Kavisha spent a lot of time at ASML teaching others why and how to use Nix. She shared a number of nice metaphors that explained Nix concepts through the subject area of video gaming. I think many people are excited about Nix, but have trouble conveying that excitement to others. Kavisha showed us a good way that worked for her.

→ watch the recording (19 minutes) on media.ccc.de

The rest of the day was filled with lightning talks. Cole Mickens from Determinate Systems explained what features they are currently shipping in their downstream distribution “Determinate Nix” (features will be upstreamed): lazy trees (a performance optimization for evaluating Flakes), parallel evaluation (brings evaluation times down from 16 s to 7 s) and a native Linux builder for the Mac. Next up are Flake Schemas, which I haven’t read about yet.

Yvan Sraka from Numtide, a Nix and DevOps consultancy, showed how he manages Linux machines for friends and family with NixOS. He has his own configuration layer on top of NixOS and only uses the system as a base. Most actual programs are used through AppImage, Flatpaks, envfs and nix-ld. The latter two are solutions for using FHS-based programs (those that expect the usual filesystem hierarchy locations to be present) on non-FHS systems like NixOS. I had heard of nix-ld before, but not of envfs.

Jacek Galowicz from Nixcademy showed how to use systemd-sysupdate and systemd-repart to implement A/B style updates with NixOS and systemd. It’s great to see this technique becoming more and more mainstream, as I am also using A/B style updating successfully in gokrazy.
The weather on Saturday was a lot better, so I made sure to get a seat with a view of Lake Zürich.

In this talk, Silvan Mosberger from Tweag (and one of the main NixCon organizers!) explained how the official formatting tool for .nix files came to be. I was delighted to hear gofmt, the official Go formatter, being mentioned as a source of inspiration. Just like in other language ecosystems, introducing uniform formatting eliminates time-consuming back-and-forth in code review over adhering to coding style. Unfortunately, the formatting folks did not replicate one key aspect of gofmt’s success: gofmt has no options. As the famous Go proverb goes: “Gofmt’s style is no one’s favorite, yet gofmt is everyone’s favorite!” Meaning that it’s more important that everyone uses the same style than that everyone can express their personal style preferences.

→ watch the recording (20 minutes) on media.ccc.de

In this two-hour workshop, Jacek Galowicz from Nixcademy, who is not only a Nix teacher, but also happens to be the maintainer of the NixOS integration test driver, showed us how to write complex integration tests with a few lines of Nix and Python. Jacek showed an integration test example: a BitTorrent service, consisting of tracker, clients, firewalls and multiple networks! Nixpkgs contains over 1000 such integration tests, and running one on your laptop is easy. The various ways to debug your tests seem pretty cool: using vsock instead of port forwardings, and enabling a debug hook that will make a failed test hang and wait to be debugged. I thought this was a great overview, and Jacek is an engaging teacher. I would recommend booking his classes!

Ryota spoke about when to use Nix and when not to use Nix. For example, you could manage your dotfiles (config files) with Nix, or you could decide not to.
Having recently migrated more and more machines and configurations to Nix, I found myself agreeing with this talk: it’s important to understand what you’ll get out of declaratively or statefully managed configs, and when which approach is better.

→ watch the recording (19 minutes) on media.ccc.de

The rest of the day I spent in lightning talks, some of which were sponsored talk slots. I learnt about, in no particular order:

Cloud Hypervisor, a KVM-based hypervisor like qemu, but written in Rust.
nixbuild.net, a pay-as-you-go offering for extra build capacity you can rent. On Sunday I heard someone say that their company is using nixbuild.net and it’s very smooth.
NixCI, a Nix-based hosted CI. So the cloud.devenv.sh service we heard about on Friday is a competitor to this service.
Nix in the Wild, an effort by Flox where they do 45-60 minute interviews about Nix success stories. This might help you convince folks in your organization.
clan, a fleet management solution.
NovaCustom, a one-person laptop/PC company. The laptops come with coreboot and work with NixOS.
ExpressVPN is migrating their internal server setup (TrustedServer) from Debian to NixOS! Deploying weekly in 105+ countries.
Cyberus, a German company, is offering NixOS LTS releases, compliant with the EU Cyber Resilience Act obligations.
David’s styx project, a more bandwidth-efficient download mechanism for NixOS updates. It uses EROFS, which seems like an interesting alternative to SquashFS images.

After all the talks, we met outside for a group picture followed by barbecue at the lake. (NixCon 2025 by Arik Grahl. Licensed under CC BY-SA 4.0.)

Before the conference, I wasn’t sure if I would even bother showing up for Sunday (Hack day), but on Sunday, I was like “of course!”, and it was a great decision! Many people were still around and were working on their projects. It felt like the answer to any Nix question was just one chat message away — there was expertise and helping hands from many parts of the project. I ended up meeting a couple of people I only knew from online interactions before, and we also talked a lot about meetups. Now, I am invited to multiple meetups to give a talk :D

This was a wonderful conference! The orga team and all contributors did a great job! As always, the OST in Rapperswil is a great venue for Open Source events. Ticket sales and talk submission / scheduling were done using the Pretix and Pretalx Open Source systems, which makes me proud to have contributed to Pretix.

The selection of talks was great: some deeply technical, some covering only the human side of things, and many somewhere in between. I got the impression that all the presenters I saw genuinely cared about their topic, so the overall energy was very good! (You can watch the talk recordings at media.ccc.de: NixCon 2025.) Also outside of the talks, I had many friendly interactions and interesting conversations. There is a lot of interest in and adoption of Nix, which is great to see!

The production level of the conference was very high for such a volunteer-driven event. For example, the very cool sounding break music between talks was created specifically for NixCon: “Lava” by tonstr.studio. Similarly, the welcome bag contained dark Swiss chocolate, specifically made for NixCon (see picture below). I don’t even like dark chocolate, but this one was delicious!

Thanks again to all helpers, and I look forward to coming back soon!
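As an aside, the NixOS integration tests from Jacek’s workshop look roughly like this. A minimal sketch of mine, assuming pkgs.testers.runNixOSTest from nixpkgs; an nginx service stands in for the much more elaborate BitTorrent setup shown in the talk:

```nix
pkgs.testers.runNixOSTest {
  name = "nginx-starts";
  # Each attribute of nodes becomes a QEMU VM with its own NixOS config.
  nodes.machine = { ... }: {
    services.nginx.enable = true;
  };
  # The test script drives the VMs from Python.
  testScript = ''
    machine.wait_for_unit("nginx.service")
    machine.wait_for_open_port(80)
  '';
}
```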


Bye Intel, hi AMD! I’m done after 2 dead Intels

The Intel 285K CPU in my high-end 2025 Linux PC died again! 😡 Notably, this was the replacement CPU for the original 285K that died in March, and after reading through the reviews of Intel CPUs on my electronics store of choice, many of which (!) mention CPU replacements, I am getting the impression that Intel’s current CPUs just are not stable 😞. Therefore, I am giving up on Intel for the coming years and have bought an AMD Ryzen 9950X3D CPU instead.

On the 9th of July, I set out to experiment with layout-parser and tesseract in order to convert a collection of scanned paper documents from images into text. I expected that offloading this task to the GPU would result in a drastic speed-up, so I attempted to build layout-parser with CUDA. Usually, it’s not required to compile software yourself on NixOS, but CUDA is non-free, so the default NixOS cache does not compile software with CUDA. (Tip: enable the Nix Community Cache, which contains prebuilt CUDA packages, too!)

This lengthy compilation attempt failed with a weird symptom: I left for work, and after a while, my PC was no longer reachable over the network, but its fans kept spinning at 100%! 😳 At first, I suspected a Linux bug, but now I am thinking this was the first sign of the CPU being unreliable.

When the CUDA build failed, I ran the batch job without GPU offloading instead. It took about 4 hours and consumed roughly 300 W constantly. You can see it on this CPU usage graph (screenshot of a Grafana dashboard showing metrics collected by Prometheus).

On the evening of the 9th, the computer still seemed to work fine. But the next day, when I wanted to wake up my PC from suspend-to-RAM as usual, it wouldn’t wake up. Worse, even after removing the power cord and waiting a few seconds, there was no reaction to pressing the power button. Later, I narrowed the problem down to the mainboard and/or the CPU: the power supply, RAM and disk all still work in different hardware.
I ended up returning both the CPU and the mainboard, as I couldn’t further diagnose which of the two was broken.

To be clear: I am not saying the batch job killed the CPU. The computer was acting strangely in the morning already. But the batch job might have been what really sealed the deal. Tom’s Hardware recently reported that “Intel Raptor Lake crashes are increasing with rising temperatures in record European heat wave”, which prompted some folks to blame Europe’s general lack of air conditioning. But in this case, I actually did air-condition the room about half-way through the job (at about 16:00), when I noticed the room was getting hot. Here’s the temperature graph:

I would say that 25 to 28 degrees Celsius are normal temperatures for computers. I also double-checked whether the CPU temperature of about 100 degrees Celsius is too high, but no: this Tom’s Hardware article shows even higher temperatures, and Intel specifies a maximum of 110 degrees. So, running at “only” 100 degrees for a few hours should be fine. Lastly, even if Intel CPUs were prone to crashing under high heat, they should never die.

I wanted the fastest AMD desktop CPU (not a server CPU), which currently is the Ryzen 9 9950X, but there is also the Ryzen 9 9950X3D, a variant with 3D V-Cache. Depending on the use-case, the variant with or without 3D V-Cache is faster; see the comparison on Phoronix. Ultimately, I decided on the 9950X3D model, not just because it performs better in many of the benchmarks, but also because Linux 6.13 and newer let you control whether to prefer the CPU cores with the larger V-Cache or those with the higher frequency, which sounds like an interesting capability: by changing this setting, maybe one can see how sensitive certain workloads are to extra cache.

Aside from the CPU, I also needed a new mainboard (for AMD’s socket AM5), but I kept all the other components. I ended up selecting the ASUS TUF X870+ mainboard.
I usually look for low power usage in a mainboard, so I made sure to go with an X870 mainboard instead of an X870E one, because the X870E has two chipsets (both of which consume power and need cooling)! Given the context of this hardware replacement, I also like the TUF line’s focus on endurance…

The performance of the AMD 9950X3D seems to be slightly better than the Intel 285K’s. In case you’re curious, the commands used for each workload are:

(I have not included the gokrazy UEFI integration tests because I think there is an unrelated difference that prevents comparison of my old results with how the test runs currently.)

In my high-end 2025 Linux PC article I explained that I chose the Intel 285K CPU for its lower idle power consumption, and some folks were skeptical whether AMD CPUs are really worse in that regard. Having switched between 3 different PCs, but with identical peripherals, I can now answer the question of how the top CPUs differ in power consumption! I picked a few representative point-in-time power values from a couple of days of usage.

Looking at two typical evenings, here is the power consumption of the Intel 285K (measured using a myStrom WiFi switch smart plug, which comes with a REST API)…

…and here is the same PC setup, but with the AMD 9950X3D:

I get the general impression that the AMD CPU has higher power consumption in all regards: the baseline is higher, the spikes are higher (peak consumption) and it spikes more often / for longer. Looking at my energy meter statistics, I usually ended up at about 9.x kWh per day for a two-person household, cooking with induction. After switching my PC from Intel to AMD, I end up at 10-11 kWh per day.

I started buying Intel CPUs because they allowed me to build high-performance computers that ran Linux flawlessly and produced little noise. This formula worked for me over many years. On the one hand, I’m a little sad that this era has ended.
On the other hand, I have had a soft spot for AMD ever since I had one of their K6 CPUs in one of my early PCs, and in fact, I have never stopped buying AMD CPUs (e.g. for my Ryzen 7-based Mini Server). Maybe AMD can further improve their idle power usage in upcoming models? And, if Intel survives for long enough, maybe they will succeed at stabilizing their CPU designs again? I certainly would love to see some competition in the CPU market.

Back in 2008, I bought a mobile Intel CPU in a desktop case (article in German). Then, in 2012, I could just buy a regular Intel CPU (i7-2600K) for my Linux PC, because they had gotten so much better in terms of power saving. Over the years, I bought an i7-8700K, and later an i9-9900K. The last time this formula worked out for me was with my 2022 high-end Linux PC.
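For context, polling such a smart plug for power readings is simple. This is a hedged sketch, not my actual setup: the /report endpoint and its "power" field are from my reading of the myStrom REST API documentation (verify against your firmware), and the IP address is made up.

```python
import json
import urllib.request

PLUG_URL = "http://192.168.1.42"  # hypothetical address of the smart plug


def parse_power(report: str) -> float:
    """Extract the current power draw in watts from a /report JSON body."""
    return float(json.loads(report)["power"])


def read_power(base_url: str = PLUG_URL) -> float:
    # The switch answers GET /report with JSON that includes the
    # instantaneous power draw ("power", in watts).
    with urllib.request.urlopen(f"{base_url}/report", timeout=5) as resp:
        return parse_power(resp.read().decode())


# Example (requires a reachable plug): print(f"{read_power():.1f} W")
```

Feeding the readings into Prometheus then gives you graphs like the ones discussed above.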


Secret Management on NixOS with sops-nix

Passwords and secrets like cryptographic key files are everywhere in computing. When configuring a Linux system, sooner or later you will need to put a password somewhere — for example, when I migrated my existing Linux Network Storage (NAS) setup to NixOS, I needed to specify the desired Samba passwords in my NixOS config (or manage them manually, outside of NixOS). For personal computers, this is fine, but if the goal is to share system configurations (for example in a Git repository), we need a different solution: Secret Management. The basic idea behind Secret Management systems is to encrypt the secrets at rest, meaning if somebody clones the git repository containing your NixOS system configurations, they cannot access (and therefore also cannot deploy) the encrypted secrets. Conceptually, we need to:

- Encrypt the secrets such that the target system can decrypt them.
- Encrypt the secrets such that other people working on this config can decrypt them.
- Have the target system decrypt secrets at runtime.
- Tell our software where to access the decrypted secrets.

In this article, I will show how to accomplish the above using sops-nix. Here’s a quick overview of the three different building blocks we will use:

- sops is a tool to version-control secrets in git, in their encrypted form. sops makes it easy to re-encrypt these secrets when adding/removing authorized keys, and it is very flexible: it can work with tons of other tools/providers.
- sops-nix provides a way to integrate sops with Nix/NixOS.
- Using sops with age allows us to use our existing SSH private key (humans) or SSH host private key (machines) instead of managing a separate set of key files.

You might wonder why I chose sops-nix over agenix, the other contender? The instructions for setting up sops-nix made more sense to me when I first looked at it, and I wanted to have the option to use sops in other ways, not just with age. If you’re curious about agenix, check out Andreas Gohr’s blog post about agenix. I ran the following instructions on an Arch Linux machine on which I installed the Nix tool and enabled Nix Flakes. The linked instructions also cover other systems like Debian or Fedora. I don’t want to manage an extra key file, so I’ll use to derive a key from my SSH private key file, which I already take good care to back up: (The option is documented in the ssh-to-age README.) To display the age recipient (public key) of this age identity (private key), I used: Similarly, I will derive an age recipient from the SSH host key of the remote system: In my git repository (nix-configs), I have one subdirectory per NixOS system, i.e. 
shows: In the root of the git repository (next to the directory), I create like so: The more systems I manage, the more and I will need to configure. The creation rules tell sops which keys to use when encrypting a file. In my setups, I typically use only a single file per system, but I could imagine splitting out some secrets into a separate file if I wanted to collaborate with someone on just one aspect of the system. Now that we have told sops which recipients to encrypt for, we can decrypt and edit in our configured editor by running: The simplest key file contains just one key, for example: After saving and exiting your editor, sops will update the encrypted secrets/example.yaml. Now, we need to reference the encrypted file in NixOS and enable integration to make the decrypted secrets available on the system. In , I added to the section and added the NixOS module. I show the entire diff because the places where the lines go are just as important as what the lines say: Then, in , we tell to use the SSH host key as identity, where sops will find our secrets and which secrets should realize on the remote system: After deploying, we can access the secret on the running system: Of course, even after rebooting the machine, the secrets remain available without a re-deploy: Now that we have secrets stored in files under , how can we use these secrets? The following sections show a few common ways. Let’s assume you have deployed a custom Go server as a systemd service on NixOS as follows, and you want to start managing the cleartext secret passed via the and command-line flags: With the following sops secrets: …we need to adjust our NixOS config to read these secret files at runtime. Because the directive is interpreted by systemd and not passed through a shell, we use the helper and then just the files: What if the service in question does not use command-line flags, but environment variables for configuring secrets? 
We can put an environment variable file into a sops-managed secret: …and then we make systemd apply these environment variables from the secrets file: If you are configuring a NixOS module (instead of declaring a custom service), the option might not always be called . For example, for the oauth2-proxy service, you would need to configure the option : In the previous examples, we configured the of each secret to the user account under which the service is running. But what if there is no such user account, because the service uses systemd’s feature? We can use systemd’s feature! For example, I supply the SMTP password to my Prometheus Alertmanager as follows: In my blog post “Migrating my NAS from CoreOS/Flatcar Linux to NixOS”, I describe how to configure samba users and passwords (from sops-managed secrets) with a shell script (which is very similar to the techniques already explained). Managing secrets as separately-encrypted files in your config repository makes sense to me! age’s ability to work with SSH keys makes for a really convenient setup, in my opinion. Encrypting secrets for the destination system’s SSH host key feels very elegant. I hope the examples above are sufficient for you to efficiently configure secrets in NixOS!
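To tie the pieces together, here is a minimal sketch of how such a setup can look in a NixOS configuration. The secret name (smtp_password) and the service name (my-alerter) are hypothetical; the option names come from sops-nix and systemd:

```nix
{ config, ... }:
{
  # Decrypt secrets using the host's SSH key, converted to an age identity.
  sops.age.sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];
  sops.defaultSopsFile = ./secrets/example.yaml;

  # Realize the secret as a file (by default under /run/secrets).
  sops.secrets.smtp_password = { };

  # For a DynamicUser service, hand the secret over via LoadCredential;
  # the service can then read it from $CREDENTIALS_DIRECTORY/smtp_password.
  systemd.services.my-alerter.serviceConfig.LoadCredential =
    "smtp_password:${config.sops.secrets.smtp_password.path}";
}
```

This is only a sketch of the shape of such a config — the post above walks through the real variants (owner-based access, environment files, LoadCredential) in detail.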


Development shells with Nix: four quick examples

I wanted to use GoCV for one of my projects (to find and extract paper documents from within a larger scan), without permanently having OpenCV on my system. This seemed like a good example use-case to demonstrate a couple of Nix commands I like to use, covering everything from quick interactive one-off dev shells to fully declarative, hermetic, reproducible, shareable dev shells. Notably, you don’t need to use NixOS to run these commands! You can install and use Nix on any Linux system like Debian, Arch, etc., as long as you set a Nix path or use Flakes (see setup). Before we start looking at Nix, I will show how to get GoCV running on Debian. Let’s create a minimal Go program which uses a GoCV function like , just to verify that we can compile this program: If we try to build this on a Debian system, we get: On Debian, we can install OpenCV as follows: Saying “yes” to this prompt downloads and installs over 500 packages (takes a few minutes). Now the build works: …but we have over 500 extra packages on our system that will need to be updated for all eternity, therefore I would like to separate this one-off experiment from my usual system. We could use Docker to start a Debian container and work inside that container, but, depending on the task, this can be cumbersome precisely because it’s a separate environment. For this example, I would need to specify a volume mount to make my input files available to the Docker container, and I would need to set up environment variables before programs inside the Docker container can open graphical windows on the host… Let’s look at how we can use Nix to help us with that! Users of NixOS can skip this section, as NixOS systems include a ready-to-use Nix. Before you can try the examples on your own computer, you need to complete these three steps: Users of Debian, Arch, Fedora, or other Linux systems first need to install Nix. 
Luckily, Nix is available for many popular Linux distributions: Nix flakes are “a generic way to package Nix artifacts”. Examples 3 and 4 use Nix flakes to pin dependencies, so we need to enable Nix flakes. For examples 1 and 2, we want to use the Nix expression . On NixOS, this expression will follow the system version, meaning if you use on a NixOS 25.05 installation, that will reference nixpkgs in version nixos-25.05. On other Linux systems, you’ll see an error message like this: We need to tell Nix which version of to use by setting the Nix search path: Alright! Now we are set up. Let’s jump into the first example! Nix provides a middle-ground between installing OpenCV on your system (like in the example above) and installing OpenCV in a separate Docker container: Nix can make OpenCV available without permanently installing it. We can run to start a bash shell in which the specified packages are available. To successfully build Go code that uses GoCV, we need to have OpenCV available: In case you were wondering: Yes, we do need to specify in this command explicitly, otherwise running will run the host version (outside the dev shell), which cannot find . Once we have a combination of packages that works for our project (in our example, just and ), we can create a (in any directory, but usually in the root of a project) which (without the flag) will read: …and then, we just run : If you’re curious, here are a couple of documentation pointers regarding the boilerplate around the list of packages: By the way: With the nixd language server, editors with LSP support can show the versions that packages resolve to, point out your spelling mistakes, or provide features like “jump to definition”. For example, in this screenshot, I was editing in Emacs and was curious what the Nix source of the package looked like. 
By pressing ( ) with “point” over , I got to in my local Nix store: The previous examples used nixpkgs from your system (or Nix path), which means you don’t need to change the file when you upgrade your system — depending on the use-case, I see this behavior as either convenient or terrifying. For use-cases where it is important that the file is built exactly the same way, no matter what version the surrounding OS uses, we can use Nix Flakes to build in a hermetic way, with dependency versions pinned in the file. A contains the same expression as above, but declares structure around it: The expression goes into the attribute and the attribute contains Flake references that are available to this build: By the way: Despite the name, it is a best practice to use , which conceptually provides a single result ( for efficiency ). Now, I can use to get a shell with OpenCV: The first run creates a file, so running later will get us exactly the same environment. To update to newer versions, use . Tip: Instead of a shell, is also a useful variant. Unfortunately, the above hard-codes , so it will not be usable on, say, an (ARM) computer, or on a (Mac). Having to explicitly specify the by default is a long-standing criticism of Nix Flakes. There are a number of workarounds. For example, we can use numtide/flake-utils and refactor our to use its convenience function: Or we could use numtide/blueprint , its spiritual successor. LucPerkins’s dev-templates have effectively inlined a version of this technique. For a solution that isn’t part of Nix, but Nix-adjacent: devenv is a separate tool that is built on Nix (no longer using the CppNix implementation, but tvix actually ), but with its own .nix files. If you notice that or similar commands fetch packages despite the not having changed, you can install the Flake into your profile to declare it as a gcroot to Nix : But wait, isn’t that getting us into the same state as with The Debian Way ? No! 
While OpenCV will remain available indefinitely if you install the flake into your profile, there still is a layer of separation: within your system, OpenCV isn’t available, only when you start a development shell with or . How do the four examples above compare? Here’s an overview: For personal one-off experiments, I use . Once the experiment works, I typically want to pin the dependencies, so I use a . If this is software that isn’t just versioned, but also published (or worked on with multiple people/systems), I go through the effort of making it a system-independent . I hope that, in the future, it will become easier to write a system-independent flake. Despite the rough edges, I appreciate the reproducibility and control that Nix gives me!

- Install Nix
- Enable Flakes
- Set a Nix path

- Debian ships nix-setup-systemd.
- Arch Linux packages nix and provides documentation on the Nix Arch Wiki page. In practice, I installed the package and configured a couple of users.
- More generally, there are Nix builds (rpm, deb, pacman) available for a number of distributions: https://github.com/nix-community/nix-installers

Lines 1 to 3 declare a function with an argument set — this is the required structure for to be able to call your file. is a convenience helper for use with . The part allows us to write instead of .
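For reference, here is roughly what a system-independent dev-shell flake can look like, using the flake-utils helper mentioned earlier. This is a sketch under the assumption of nixpkgs 25.05; adjust the inputs and package list to your project:

```nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.05";
    flake-utils.url = "github:numtide/flake-utils";
  };

  outputs = { self, nixpkgs, flake-utils }:
    # eachDefaultSystem instantiates the dev shell for x86_64-linux,
    # aarch64-linux, aarch64-darwin, etc., so the flake is not
    # hard-coded to one system.
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = nixpkgs.legacyPackages.${system};
      in
      {
        devShells.default = pkgs.mkShell {
          # OpenCV is needed to build Go code that uses GoCV.
          packages = [ pkgs.go pkgs.opencv ];
        };
      });
}
```

Running `nix develop` in the directory containing this flake.nix should then drop you into the shell, with the lock file pinning the exact nixpkgs revision.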


Migrating my NAS from CoreOS/Flatcar Linux to NixOS

In this article, I want to show how to migrate an existing Linux server to NixOS — in my case the CoreOS/Flatcar Linux installation on my Network Attached Storage (NAS) PC. I will show in detail what the previous CoreOS setup looked like (lots of systemd units starting Docker containers), how I migrated it into an intermediate state (using Docker on NixOS) just to get things going, and finally how I migrated all units from Docker to native NixOS modules step-by-step. If you haven’t heard of NixOS, I recommend you read the first page of the NixOS website to understand what NixOS is and what sort of things it makes possible. The target audience of this blog post is people interested in trying out NixOS for the use-case of a NAS, who like seeing examples to understand how to configure a system. You can apply these examples by first following my blog post “How I like to install NixOS (declaratively)”, then making your way through the sections that interest you. If you prefer seeing the full configuration, skip to the conclusion. Over the last decade, I used a number of different operating systems for my NAS needs. Here’s an overview of the two NAS systems storage2 and storage3: When I started using CoreOS, Docker was a pretty new technology. I liked that using Docker containers allowed you to treat services uniformly — ultimately, they all expose a port of some sort (speaking HTTP, or Postgres, or…), so you got the flexibility to run much more recent versions of software on a stable OS, or older versions in case an update broke something. Over a decade later, Docker is established tech. People nowadays take for granted the various benefits of the container approach. So, here’s my list of reasons why I wasn’t satisfied with Flatcar Linux anymore. The CoreOS cloud-init project was deprecated at some point in favor of Ignition, which is clearly more powerful, but also more cumbersome to get started with as a hobbyist. 
As far as I can tell, I must host my config at some URL that I then provide via a kernel parameter. The old way of just copying a file seems to no longer be supported. Ignition also seems less convenient in other ways: YAML is no longer supported, only JSON, which I don’t enjoy writing by hand. Also, the format seems to change quite a bit . As a result, I never made the jump from cloud-init to Ignition, and it’s not good to be reliant on a long-deprecated way to use your OS of choice. At some point, I did an audit of all my containers on the Docker Hub and noticed that most of them were quite outdated. For a while, Docker Hub offered automated builds based on a obtained from GitHub. However, automated builds now require a subscription, and I will not accept a subscription just to use my own computers. If Docker at some point ceases operation of the Docker Hub, I am unable to deploy software to my NAS. This isn’t a very hypothetical concern: In 2023, Docker Hub announced the end of organizations on the Free tier and then backpedaled after community backlash. Who knows how long they can still provide free services to hobbyists like myself. The final nail in the coffin was when I noticed that I could not try Immich on my NAS system! Modern web applications like Immich need multiple Docker containers (for Postgres, Redis, etc.) and hence only offer Docker Compose as a supported way of installation. Unfortunately, Flatcar does not include Docker Compose . I was not in the mood to re-package Immich for non-Docker-Compose systems on an ongoing basis, so I decided that a system on which I can neither run software like Immich directly, nor even run Docker Compose, is not sufficient for my needs anymore. With all of the above reasons, I would have had to set up automated container builds, run my own central registry and would still be unable to run well-known Open Source software like Immich. 
Instead, I decided to try NixOS again (after a 10 year break) because it seems like the most popular declarative solution nowadays, with a large community and large selection of packages. How does NixOS compare for my situation? My NAS setup needs to work every day, so I wanted to prototype my desired configuration in a VM before making changes to my system. This is not only safer, it also allows me to discover any roadblocks, and what working with NixOS feels like without making any commitments. I copied my NixOS configuration from a previous test installation (see “How I like to install NixOS (declaratively)” ) and used the following command to build a VM image and start it in QEMU: The configuration instructions below can be tried out in this VM, and once you’re happy enough with what you have, you can repeat the steps on the actual machine to migrate. For the migration of my actual system, I defined the following milestones that should be achievable within a typical session of about an hour (after prototyping them in a VM): In practice, this worked out exactly as planned: the actual installation of NixOS and setting up my config to milestone M4 took a little over one hour. All the other nice-to-haves were done over the following days and weeks as time permitted. Tip: After losing data due to an installer bug in the 2000s, I have adopted the habit of physically disconnecting all data disks (= pulling out the SATA cable) when re-installing the system disk. After following “How I like to install NixOS (declaratively)” , this is my initial : All following sections describe changes within this . All devices in my home network obtain their IP address via DHCP. If I want to make an IP address static, I configure it accordingly on my router. My NAS PCs have one specialty with regards to IP addressing: They are reachable via IPv4 and IPv6, and the IPv6 address can be derived from the IPv4 address. 
Hence, I changed the systemd-networkd configuration from above such that it configures a static IPv6 address in a dynamically configured IPv6 network: ✅ This fulfills milestone M1. To unlock my encrypted disks on boot, I have a custom systemd service unit that uses and to split the key file between the NAS and a remote server (= an attacker needs both pieces to unlock). With CoreOS/Flatcar, my configuration looked as follows: I converted it into the following NixOS configuration: We’ll also need to store the custom TLS certificate file on disk. For that, we can use the configuration: The references like will be replaced with a path to the Nix store ( → nix.dev documentation ). On CoreOS/Flatcar, I was limited to using just the (minimal set of) software included in the base image, or I had to reach for Docker. On NixOS, we can use all packages available in nixpkgs. After deploying and ing, I can access my unlocked disk under ! 🎉 When listing my files, I noticed that the group id was different between my old system and the new system. This can be fixed by explicitly specifying the desired group id: ✅ M2 is complete. Whereas I want to configure remote disk unlock at the systemd service level, for Samba I want to use Docker: I wanted to first transfer my old (working) Docker-based setups as they are, and only later convert them to Nix. We enable the Docker NixOS module which sets up the daemons that Docker needs and whatever else is needed to make it work: This is already sufficient for other services to use Docker, but I also want to be able to run the command interactively for debugging. Therefore, I added to : After deploying this configuration, I can run to verify things work. The version of samba looked like this: We can translate this 1:1 to NixOS: ✅ Now I can manage my files over the network, which completes M3! See also: Nice-to-haves: N5. samba from NixOS For backing up data, I use rsync over SSH. 
I restrict this SSH access to run only rsync commands by using (in a Docker container). To configure the SSH , we set: ✅ A successful test backup run completes milestone M4! See also: Nice-to-haves: N6. rrsync from NixOS I like to monitor all my machines with Prometheus (and Grafana). For network connectivity and authentication, I use the Tailscale mesh VPN. To install Tailscale, I enable its NixOS module and make the command available: After deploying, I run and open the login link in my browser. The Prometheus Node Exporter can also easily be enabled through its NixOS module: However, this isn’t reliable yet: When Tailscale’s startup takes a while during system boot, the Node Exporter might burn through its entire restart budget when it cannot listen on the Tailscale IP address yet. We can enable indefinite restarts for the service to eventually come up: While migrating my setup, I noticed that calling from directly is not reliable, and it’s better to let systemd manage the mounting: Afterwards, I could just remove the call from : In systemd services, I can now depend on the mount unit: To save power, I turn off my NAS machines when they are not in use. My backup orchestration uses Wake-on-LAN to wake up the NAS and needs to wait until the NAS is fully booted up and has mounted its mount before it can start backup jobs. For this purpose, I have configured a web server (without any files) that depends on the mount. So, once the web server responds to HTTP requests, we know is mounted. 
The config looked as follows: The Docker version (ported from Flatcar Linux) looks like this: This configuration gets a lot simpler when migrating it from Docker to NixOS: The Docker version (ported from Flatcar Linux) looks like this: As before, when using jellyfin from NixOS, the configuration gets simpler: For a while, I had also set up compatibility symlinks that map the old location ( , inside the Docker container) to the new location ( ), but I encountered strange issues in Jellyfin and ended up just re-initializing my whole Jellyfin state. While the required configuration had more lines, I found it neat to move it into its own file, so here is how to do that: Remove the lines above from and move them into : Then, in , add to the : To use Samba from NixOS, I replaced my config from M3 with this: Note: Setting the samba password in the activation script works for small setups, but if you want to keep your samba passwords out of the Nix store, you’ll need to use a different approach. On a different machine, I use sops-nix to manage secrets and found that refactoring the call like so works reliably: I also noticed that NixOS does not create a group for each user by default, but I am used to managing my permissions like that. We can easily declare a group like so: The Docker version (ported from Flatcar Linux) looks like this: To use from NixOS, I changed the configuration like so: The Docker version (ported from Flatcar Linux) looks like this: I wanted to stop managing the following to ship : To get rid of the Docker container, I translated the file into a Nix expression that writes the Perl script to the Nix store: I can then reference this file by importing it in my and pointing it to the expression of my NixOS configuration: This works, but is it the best approach? Here are some thoughts: I want to configure all my NixOS systems such that my user settings are identical everywhere. 
To achieve that, I can extract parts of my into a and then declare an accompanying that provides this expression as an output. After publishing these files in a git repository, I can reference said repository in my : Everything declared in the can now be removed from ! One of the motivating reasons for switching away from CoreOS/Flatcar was that I couldn’t try Immich, so let’s give it a shot on NixOS: You can find the full configuration directory on GitHub . I am pretty happy with this NixOS setup! Previously (with CoreOS/Flatcar), I could declaratively manage my base system, but had to manage tons of Docker containers in addition. With NixOS, I can declaratively manage everything (or as much as makes sense). Custom configuration like my SSH+rsync-based backup infrastructure can be expressed cleanly, in one place, and structured at the desired level of abstraction/reuse. If you’re considering managing at least one other system with NixOS, I would recommend it! One of my follow-up projects is to convert storage3 (my other NAS build) from Ubuntu Server to NixOS as well to cut down on manual management. Being able to just copy the entire config to set up another system, or try out an idea in a throwaway VM, is just such a nice workflow 🥰 …but if you have just a single system to manage, probably all of this is too complicated. (This post is only about software! For my usage patterns and requirements regarding hardware selection, see “Design Goals” in my My all-flash ZFS NAS build post (2023) .) Remote management: I really like the model of having the configuration of my network storage builds version-controlled and managed on my main PC. It’s a nice property that I can regain access to my backup setup by re-installing my NAS from my PC within minutes. Automated updates, with easy rollback: Updating all my installations manually is not my idea of a good time. 
Hence, automated updates are a must — but when the update breaks, a quick and easy path to recovery is also a must. CoreOS/Flatcar achieved that with an A/B updating scheme (update failed? boot the old partition), whereas NixOS achieves that with its concept of a “generation” (update failed? select the old generation), which is finer-grained.

- Same: I also need to set up an automated job to update my NixOS systems. I already have such a job for updating my gokrazy devices.
- Docker push is asynchronous: After a successful push, I still need extra automation for pulling the updated containers on the target host and restarting the affected services, whereas NixOS includes all of that.
- Better: There is no central registry. With NixOS, I can push the build result directly to the target host via SSH.
- Better: The corpus of available software in NixOS is much larger (including Immich, for example) and the NixOS modules generally seem to be expressed at a higher level of abstraction than individual Docker containers, meaning you can configure more features with fewer lines of config.

- M1. Install NixOS
- M2. Set up remote disk unlock
- M3. Set up Samba for access
- M4. Set up SSH/rsync for backups

Everything extra is nice-to-have and could be deferred to a future session on another day.

- By managing this script in a Nix expression, I can no longer use my editor's Perl support.
- I could probably also keep as a separate file and use string interpolation in my Nix expression to inject an absolute path to the binary into the script.
- Another alternative would be to add a wrapper script to my Nix expression which ensures that contains and then the script wouldn’t need an absolute path anymore.
- For small glue scripts like this one, I consider it easier to manage the contents “inline” in the Nix expression, because it means one fewer file in my config directory.
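The mount-unit dependency technique described earlier in this post can be sketched in NixOS roughly like this. The device path and service name below are hypothetical; the idea is that declaring a filesystem generates a systemd mount unit, which other services can then require:

```nix
{
  fileSystems."/srv/data" = {
    device = "/dev/mapper/data";  # hypothetical unlocked LUKS mapping
    fsType = "ext4";
    # don't block boot on this disk; it is mounted on demand
    options = [ "noauto" ];
  };

  # systemd derives the unit name from the path: /srv/data → srv-data.mount,
  # so a service can wait for the mount before starting.
  systemd.services.samba-smbd = {
    requires = [ "srv-data.mount" ];
    after = [ "srv-data.mount" ];
  };
}
```

With this in place, systemd takes care of ordering: the service only starts once the filesystem is mounted, which is exactly the property the HTTP-readiness trick above relies on.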


How I like to install NixOS (declaratively)

For one of my network storage PC builds , I was looking for an alternative to Flatcar Container Linux and tried out NixOS again (after an almost 10 year break). There are many ways to install NixOS, and in this article I will outline how I like to install NixOS on physical hardware or virtual machines: over the network and fully declaratively. The term declarative means that you describe what should be accomplished, not how. For NixOS, that means you declare what software you want your system to include (add to config option , or enable a module) instead of, say, running . A nice property of the declarative approach is that your system follows your configuration, so by reverting a configuration change, you can cleanly revert the change to the system as well. I like to manage declarative configuration files under version control, typically with Git. When I originally set up my current network storage build, I chose CoreOS (later Flatcar Container Linux) because it was an auto-updating base system with a declarative cloud-init config. The NixOS manual’s “Installation” section describes a graphical installer (“for desktop users”, based on the Calamares system installer and added in 2022) and a manual installer. With the graphical installer, it’s easy to install NixOS to disk: just confirm the defaults often enough and you’ll end up with a working system. But there are some downsides: The graphical installer is clearly not meant for remote installation or automated installation. The manual installer on the other hand is too manual for my taste: expand “Example 2” and “Example 3” in the NixOS manual’s Installation summary section to get an impression. To be clear, the steps are very doable, but I don’t want to install a system this way in a hurry. For one, manual procedures are prone to mistakes under stress. And also, copy & pasting commands interactively is literally the opposite of writing declarative configuration files. 
Ideally, I would want to perform most of the installation from the comfort of my own PC, meaning the installer must be usable over the network. Also, I want the machine to come up with a working initial NixOS configuration immediately after installation (no manual steps!). Luckily, there is a (community-provided) solution: nixos-anywhere . You take care of booting a NixOS installer, then run a single command and nixos-anywhere will SSH into that installer, partition your disk(s) and install NixOS to disk. Notably, nixos-anywhere is configured declaratively, so you can repeat this step any time. (I know that nixos-anywhere can even SSH into arbitrary systems and kexec-reboot them into a NixOS installer, which is certainly a cool party trick, but I like the approach of explicitly booting an installer better as it seems less risky and more generally applicable/repeatable to me.) I want to use NixOS for one of my machines, but not (currently) on my main desktop PC. Hence, I installed only the tool (for building, even without running NixOS) on Arch Linux: Now, running should drop you in a new shell in which the GNU hello package is installed: By the way, the Nix page on the Arch Linux wiki explains how to use nix to install packages, but that’s not what I am interested in: I only want to remotely manage NixOS systems. Previously, I said “you take care of booting a NixOS installer”, and that’s easy enough: write the ISO image to a USB stick and boot your machine from it (or select the ISO and boot your VM). But before we can log in remotely via SSH, we need to manually set a password. I also need to SSH with the environment variable because the termcap file of rxvt-unicode (my preferred terminal) is not included in the default NixOS installer environment. Similarly, my configured locales do not work and my preferred shell (Zsh) is not available. Wouldn’t it be much nicer if the installer was pre-configured with a convenient environment? 
With other Linux distributions, like Debian, Fedora or Arch Linux, I wouldn’t attempt to re-build an official installer ISO image. I’m sure their processes and tooling work well, but I am also sure it’s one extra thing I would need to learn, debug and maintain. But building a NixOS installer is very similar to configuring a regular NixOS system: same configuration, same build tool. The procedure is documented in the official NixOS wiki. I copied the customizations I would typically put into , imported the module from and put the result in the file: To build the ISO image, I set the environment variable to point to the file and to select the upstream channel for NixOS 25.05: After about 1.5 minutes on my 2025 high-end Linux PC, the installer ISO can be found in (1.46 GB in size in my case). Unfortunately, the nix project has not yet managed to enable the “experimental” new command-line interface (CLI) by default, despite it having been available for 5+ years, so we need to create a config file and enable the modern interface: How can you tell old from new? The old commands are hyphenated ( ), the new ones are separated by a blank space ( ). You’ll notice I also enabled Nix flakes, which I use so that my nix builds are hermetic and pinned to a certain revision of nixpkgs and any other nix modules I want to include in my build. I like to compare flakes to version lock files in other programming environments: the idea is that building the system in 5 months will yield the same result as it does today. To verify that flakes work, run (not ): For reference, here is the configuration I use to create a new VM for NixOS in Proxmox. The most important setting is (= UEFI boot, which is not the default), so that I can use the same boot loader configuration on physical machines as in VMs: Before we can boot our (unsigned) installer, we need to enter the UEFI setup and disable Secure Boot. Note that Proxmox enables Secure Boot by default, for example. 
Then, boot the custom installer ISO on the target system, and ensure works without prompting for a password. Declare a with the following content: Declare your disk config in : Declare your desired NixOS config in : …and lock it: After about one minute, my VM was installed and rebooted! Tip: Last month, I had to temporarily pin to the latest released version (1.9.0) because of issue nixos-anywhere#510 like so: Now that the declarative part of the system is in place, we need to take care of the stateful part. In my case, the only stateful part that needs setting up is the Tailscale mesh VPN. To set up Tailscale, I log in via SSH and run . Then, I add the new node to my network by following the link. Afterwards, in the Tailscale Machines console , I disable key expiration and add ACL tags. Now, after I changed something in my configuration file, I use remotely to roll out the change to my NixOS system: Note that not all changes are fully applied as part of : while systemd services are generally restarted, newly required kernel modules are not automatically loaded (e.g. after enabling the coral hardware accelerator in Frigate). So, to be sure everything took effect, your system after deploying changes. One of the advantages of NixOS is that in the boot menu, you can select which generation of the system you want to run. If the latest change broke something, you can quickly reboot into the previous generation to undo that change. Of course, you can also undo the configuration change and deploy a new generation — whichever is more convenient in the situation. With this article, I hope I could convey what I wish someone would have told me when I started using Nix and NixOS: Where do you go from here? You need to manually enable SSH after the installation — locally, not via the network. The graphical installer generates an initial NixOS configuration for you, but there is no way to inject your own initial NixOS configuration. 
Using nixos-anywhere, fetch the hardware-configuration.nix from the installer and install NixOS to disk:
Enable flakes and the new CLI.
Use nixos-anywhere to install remotely.
Build a custom installer if you want, it’s easy!
Use ’s builtin flag for remote deployment.
Read through all documentation on nixos.org → Learn .
Here are a couple of posts from people in and around my bubble that I looked at for inspiration / reference, in no particular order:
Michael Lynch wrote about setting up an Oracle Cloud VM with NixOS and about managing his Zig configuration .
Nelson Elhage wrote about using Nix to test dozens of Python interpreters as part of his performance investigation into Python 3.14 tail-call interpreter performance .
Vincent Bernat wrote about using Nix to build an SD card image for an ARM single board computer .
Mitchell Hashimoto shared his extensive NixOS configs .
Wolfgang has a YouTube video about using NixOS for his Home Server ( → his configs )
Contact your local Nix community! I recently attended the “Zero Hydra Failures” event of the Nix Zürich group and the kind people there were happy to talk about all things Nix :)


My 2025 high-end Linux PC 🐧

Update (2025-09-07): The replacement CPU also died and I have given up on Intel. See Bye Intel, hi AMD! for more details on the AMD 9950X3D. Turns out my previous attempt at this build had a faulty CPU! With the CPU replaced, the machine now is stable and fast! 🚀 In this article, I’ll go into a lot more detail about the component selection, but in a nutshell, I picked an Intel 285K CPU for low idle power, chose a 4TB SSD so I don’t have to worry about running out of storage quickly, and a capable nvidia graphics card to drive my Dell UP3218K 8K monitor . Which components did I pick for this build? Here’s the full list: Total: 2350 CHF …and the next couple of sections go into detail on how I selected these components. I have been a fan of Fractal cases for a couple of generations. In particular, I realized that the “Compact” series offers plenty of space even for large graphics cards and CPU coolers, so that’s now my go-to case: the Fractal Define 7 Compact (Black Solid). My general requirements for a PC case are as follows: I really like building components into the case and working with the case. There are no sharp edges, the mechanisms are a pleasure to use and the cable-management is well thought-out. The only thing that wasn’t top-notch is that Fractal ships the case screws in sealed plastic packages that you need to cut open. I would have wished for a re-sealable plastic baggie so that one can keep the unused screws instead of losing them. With this build, I have standardized all my PCs into Fractal Define 7 Compact Black cases! I wanted to keep my options open regarding upgrading to an nvidia 50xx series graphics card at a later point. Those models have a TGP (“Total Graphics Power”) of 575 watts, so I needed a power supply that delivers enough power for the whole system even at peak power usage in all dimensions. 
I ended up selecting the Corsair RM850x, which reviews favorably (“leader in the 850W gold category”) and was available at my electronics store of choice. This was a good choice: the PSU indeed runs quiet, and I really like the power cables (e.g. the GPU cable) that they include: they are very flexible, which makes them easy to cable-manage. One interesting realization was that it’s more convenient to not use the PSU’s 12VHPWR cable, but instead stick to the older 8-pin power connectors for the GPU in combination with a 12VHPWR-to-8-pin adapter. The reason is that the 12VHPWR connector’s locking mechanism is very hard to unlock, so when swapping out the GPU (as I had to do a number of times while trouble-shooting), an 8-pin connector is much easier to unlock… I have been avoiding PCIe 5 SSDs so far because they consume a lot more power compared to PCIe 4 SSDs. While bulk streaming data transfer rates are higher on PCIe 5 SSDs, random transfers are not significantly faster. Most of my compute workload consists of random transfers, not large bulk transfers. The power draw situation with PCIe 5 SSDs seems to be getting better lately, with the Phison E31T being the first controller that implements power saving. A disk that uses the E31T controller is the Corsair Force Series MP700 Elite. Unfortunately, said disk was unavailable when I ordered. Instead, I picked the Samsung 990 Pro with 4 TB. I have had good experiences with the Samsung Pro series over the years (never had one die or degrade performance), and my previous 2 TB disk was starting to fill up, so the extra storage space is appreciated. One annoying realization is that most mainboard vendors seem to have moved to 2.5 GbE (= 2.5 Gbit/s ethernet) onboard network cards. I would have been perfectly happy to play it safe and buy another Intel I225 1 GbE network card, as long as it just works with Linux. In the 2.5 GbE space, the main players seem to be Realtek and Intel.
Most mainboard vendors opted for Realtek as far as I could see. Linux includes the driver for Realtek network cards, but whether the card will work out of the box depends on the exact revision of the network card! For example: For revision 8125D, you need a recent-enough Linux version (6.13+) that includes commit “ r8169: add support for RTL8125D ”, accompanied by a recent-enough linux-firmware package. Even with the latest firmware, there is some concern around stability and ASPM support. See for example this ServerFault post by someone working on the driver. But, despite the Intel 1 GbE options being well-supported at this point, Intel’s 2.5 GbE options might not fare any better than the Realtek ones: I found reports of instability with Intel’s 2.5 GbE network cards . That said, aside from the annoying firmware requirements, the Realtek 2.5 GbE card seems to work fine for me in practice. Despite the suboptimal network card choice, I decided to stick to the ASUS PRIME series of mainboards, as I have had good experiences with those in my past few builds. Here are a couple of thoughts on the ASUS PRIME Z890-P mainboard I went with: One surprising difference between the two mainboards I tested was that the AsRock Z890 Pro-A does not seem to report the correct DIMM clock in , whereas the ASUS does: I haven’t checked if there are measurable performance differences (e.g. if the XMP profile is truly active), but at least you now know to not necessarily trust what can show you. I am a long-time fan of Noctua’s products: This company makes silent fans with great cooling capacity that work reliably! For many years, I have swapped out all the fans of each of my PCs with Noctua fans, and it was always an upgrade. Highly recommended. Hence, it was a given that I would pick the latest and greatest Noctua CPU cooler for this build: the Noctua NH-D15 G2.
There are a couple of things to pay attention to with this cooler: Probably the point that raises most questions about this build is why I selected an Intel CPU over an AMD CPU. The primary reason is that Intel CPUs are so much better at power saving! Let me explain: Most benchmarks online are for gamers and hence measure a usage curve that goes “start game, run PC at 100% resources for hours”. Of course, when you never let the machine idle, you mainly care about power efficiency : how much power do you need to use to achieve the desired result? My use-case is software development, not gaming. My usage curve oscillates between “barely any usage because Michael is reading text” and “complete this compilation as quickly as possible with all the power available”. For me, both idle power consumption and absolute performance need to be best-of-class. AMD’s CPUs offer great performance (the recently released Ryzen 9 9950X3D is even faster than the Intel 9 285K), and have great power efficiency , but poor power consumption at idle: With ≈35W of idle power draw, Zen 5 CPUs consume ≈3x as much power as Intel CPUs! Intel’s CPUs offer great performance (like AMD) and excellent power consumption at idle. Therefore, I can’t in good conscience buy an AMD CPU, but if you want a fast gaming-only PC or run an always-loaded HPC cluster with those CPUs, definitely go ahead :) I don’t necessarily recommend any particular nvidia graphics card, but I have had to stick to nvidia cards because they are the only option that works with my picky Dell UP3218K monitor . From time to time, I try out different graphics cards. Recently, I got myself an AMD Radeon RX 9070 because I read that it works well with open source drivers. While the Radeon RX 9070 works with my monitor (great!), it seems to consume 45W in idle, which is much higher than my nvidia cards, which idle at ≈ 20W.
This is unacceptable to me: Aside from high power costs and wasting precious resources, the high power draw also means that my room will be hotter in summer and the fans need to spin faster and therefore run louder. People asked me on Social Media if this could be a measurement error (like, the card reporting inaccurate values), so I double-checked with a myStrom WiFi Switch and confirmed that with the Radeon card, the PC indeed draws 20-30W more from the wall socket. In the comments for my previous blog post about the first build of this machine not running stable , people were asking why it is worth it to optimize a few watts of power usage. People calculate what higher power usage might cost, put it in relation to the total cost of the components, and conclude that saving ±10% of the price can’t possibly be worth the effort. Let me try to illustrate the importance of low idle power with this anecdote: For one year, I was suffering from an nvidia driver bug that meant the GPU would not clock down to the most efficient power-saving mode (because of the high resolution of my monitor). The 10-20W of difference should have been insignificant. Yet, when the bug was fixed, I noticed how my PC got quieter (fans don’t need to spin up) and my room noticeably cooled down, which was great, as this was during peak summer temperatures. To me, having a whisper-quiet computing environment that does not heat up my room is a great, actual, real-life, measurable benefit. Not wasting resources and saving a tiny amount of money is a nice cherry on top. Obviously all the factors are very dependent on your specific situation: Your house’s thermal behavior might differ from mine, your tolerance for noise (and/or baseline noise levels) might be different, you might put more/less weight on resource usage, etc. On the internet, I read that there was some issue related to the Power Limits that mainboards come with by default.
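To put a number on the money side of this debate: a back-of-the-envelope calculation for 25W of constant extra idle draw (the electricity price is an assumed example value, not a quote from the post):

```shell
extra_watts=25        # extra idle draw, e.g. one graphics card vs. another
price_per_kwh=0.25    # assumed example price in CHF

# Energy over a year of 24/7 operation (worst case), in kWh:
kwh_per_year=$(awk -v w="$extra_watts" 'BEGIN { printf "%.1f", w * 24 * 365 / 1000 }')
# Resulting yearly cost:
cost_per_year=$(awk -v k="$kwh_per_year" -v p="$price_per_kwh" 'BEGIN { printf "%.2f", k * p }')
echo "${kwh_per_year} kWh/year => ${cost_per_year} CHF/year"
```

Roughly 55 CHF per year under these assumptions, which is indeed small relative to the component cost; the noise and heat arguments above carry more weight than the money.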
Therefore, I did a UEFI firmware update immediately after getting the mainboard. I upgraded to version 1404 (2025/01/10) using the provided ZIP file ( ) on an MS-DOS FAT-formatted USB stick with the EZ Flash tool in the UEFI firmware interface. Tip: do not extract the ZIP file, otherwise the EZ Flash tool cannot update the Intel ME firmware. Just put the ZIP file onto the USB disk as-is. I verified that with this UEFI version, the is 250W, and , which are exactly the values that Intel recommends. Great! I also enabled XMP and verified that memtest86 reported no errors. To copy over the data from the old disk to the new disk, I wanted to boot a live linux distribution (specifically, grml.org ) and follow my usual procedure: boot with the old disk and the new (empty) disk, then use to copy the data. It’s nice and simple, hard to screw up. Unfortunately, while grml 2024.12 technically does boot up, there are two big problems: There is no network connectivity because the kernel and linux-firmware versions are too old. I could not get Xorg to work at all. Not with the Intel integrated GPU, nor with the nvidia dedicated GPU. Not with or any of the other options in the grml menu. This wasn’t merely a convenience problem: I needed to use (the graphical version) for its partition moving/resizing support. Ultimately, it was easier to upgrade my old PC to Linux 6.13 and linux-firmware 20250109, then put in the new disk and copy over the installation. SSD disks can degrade over time, so it is essential that the Operating System tells the SSD firmware about freed-up blocks (for wear leveling). When using full-disk encryption, all involved layers need to have TRIM support enabled. I think I saw the effect of an incorrectly configured TRIM setup in action back in 2022, when I copied my data from a Force MP600 to a WD Black SN850 , which unexpectedly took many hours! 
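For reference, a sketch of the two TRIM layers worth checking in such a setup. The device names are made-up examples, and the systemd commands are commented out since they need a running Arch system:

```shell
# Continuous TRIM through dm-crypt: the crypttab entry needs the "discard" option.
# Example entry with hypothetical device names:
line='cryptroot  /dev/nvme0n1p2  none  luks,discard'
case "$line" in
  *discard*) trim_status="enabled" ;;
  *)         trim_status="disabled" ;;
esac
echo "continuous TRIM: $trim_status"

# Periodic TRIM on Arch Linux: the fstrim timer should be active and have run recently.
# systemctl status fstrim.timer
# journalctl -u fstrim.service
```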
To make sure my disk has a long and healthy life, I double-checked that both periodic and continuous TRIM are enabled on my Arch Linux system: The file contains the option (and lists the option), and ran within the last week: Speaking of copying data: the transfer from my WD Black SN850 to my Samsung 990 PRO ran at 856 MB/s and took about 40 minutes in total. Here are the total times for a couple of typical workloads I run: The performance boost is great! Building Linux kernels a whole minute faster is really nice. In March, I published an article about how the first build of this machine was not stable , in which you can read in detail about the various crashes I ran into. Now, in early May, I know for sure that the CPU was defective, after a lengthy troubleshooting process in which I swapped out all the other parts of this PC, sent back the CPU and got a new one. The CPU was the most annoying component to diagnose in this build because it uses the LGA 1851 socket and I don’t (yet) have any other machines that use the same socket. AMD’s approach of sticking to each socket for a longer time would have been better in this situation. When I published my earlier blog post about the PC being unstable, I did not really know how to reliably trigger the issue. Some compute-intensive tasks like running a Django test suite seemed to trigger the issue. I suspect that the problem somehow got worse, because when I started stress testing the machine, suddenly it would crash every time I built a Linux kernel. That got me curious to see if other well-known CPU stress testers like Prime95 would show problems, and indeed: within seconds, Prime95 would report errors. I figured I would use Prime95 as a quick signal: if it reports errors, the machine is faulty. This typically happens within seconds of starting Prime95. If Prime95 reported no errors, I would use Linux kernel compilation as a slow signal: if I can successfully build a kernel, the machine is likely stable enough.
The specific setup I used is to run , hit N (do not participate in distributed computation projects), then Enter a few times to confirm the defaults. Eventually, Prime95 starts calculating, which pushes the CPU to 100% usage (see the -like output by my implementation) and draws the expected ≈300W of power from the wall: In addition, I also ran MemTest86 for a few hours: To be clear: I also successfully ran MemTest86 on the previous, unstable build, so only running MemTest86 is not good enough if you are dealing with a faulty CPU. The retailer reported this to the manufacturer, who had the following questions: To make sure we understand you correctly: you tested the CPU on two different motherboards and the same problem persists? Could you tell us the brand and model of the two motherboards used? Was the latest BIOS version used on both motherboards? Did the problem exist from the start, or did it only appear later? Was the processor overclocked? (Please note that overclocking voids the warranty.) In summary, I spent March without a working PC, but that was because I didn’t have much time to pursue the project. Then, I spent April without a working PC because RMA’ing an Intel CPU through digitec seems pretty slow. I would have wished for a little more trust and a replacement CPU right away. What a rollercoaster and time sink this build was! I have never received a faulty-on-arrival CPU in my entire life before. How did the CPU I first received pass Intel’s quality control? Or did it pass QC, but was damaged in transport? I will probably never know. From now on, I know to extensively stress test new PC builds for stability to detect such issues more quickly. Should the CPU be faulty, unfortunately getting it replaced is a month-long process — it’s very annoying to have such a costly machine just gather dust for a month.
But, once the faulty component was replaced, this is my best PC build yet: The case is the perfect size for the components and offers incredibly convenient access to all components throughout the entire lifecycle of this PC, including the troubleshooting period, and the later stages of its life when this PC will be rotated into its “lab machine” role before I sell it second-hand to someone who will hopefully use the machine for another few years. The machine is quiet, draws little power (for such a powerful machine) and really packs a punch! As usual, I run Linux on this PC and haven’t noticed any problems in my day-to-day usage. I use suspend-to-RAM multiple times a day without any issues. I hope some of these details were interesting and useful to you in your own PC builds! If you want to learn about which peripherals I use aside from my 8K monitor (e.g. the Kinesis Advantage keyboard, Logitech MX Ergo trackball, etc.), check out my post stapelberg uses this: my 2020 desk setup . I might publish an updated version at some point :) No extra effort should be required for the case to be as quiet as possible. The case should not have any sharp corners (no danger of injury!). The case should provide just enough space for easy access to your components. The more support the case has to encourage clean cable routing, the better. USB3 front panel headers should be included. The AsRock Z890 Pro-A has rev 8125B. lshw: The ASUS PRIME Z890-P has rev 8125D. lshw: I like the quick-release PCIe mechanism: ASUS understood that people had trouble unlocking large graphics cards from their PCIe slot, so they added a lever-like mechanism that is easily reachable. In my couple of usages, this worked pretty well! I wrote about slow boot times with my 2022 PC build that were caused by time-consuming memory training. On this ASUS board, I noticed that the board blinks the Power LED to signal that memory training is in progress. Very nice!
It hadn’t occurred to me previously that the various phases of the boot could be signaled by different Power LED blinking patterns :) The downside of this feature is: While the machine is in suspend-to-RAM, the Power LED also blinks! This is annoying, so I might just disconnect the Power LED entirely. The UEFI firmware includes what they call a Q-Dashboard: An overview of what is installed/connected in which slot. Quite nice: I decided to configure it with one fan instead of two fans: Using only one fan will be the quietest setup, yet still have plenty of cooling capacity for this setup. There are 3 different versions that differ in how their base plate is shaped. Noctua recommends: “For LGA1851, we generally recommend the regular standard version with medium base convexity” ( https://noctua.at/en/intel-lga1851-all-you-need-to-know ) With a height of 168 mm, this cooler fits well into the Fractal Define 7 Compact Black. There is no network connectivity because the kernel and linux-firmware versions are too old. Kernel commit r8169: add support for RTL8125D is not included. I could not get Xorg to work at all. Not with the Intel integrated GPU, nor with the nvidia dedicated GPU. Not with or any of the other options in the grml menu. This wasn’t merely a convenience problem: I needed to use (the graphical version) for its partition moving/resizing support.
Jan 15th: I receive the components for my new PC
In January and February, the PC crashes occasionally.
Mar 4th: I switch back to my old PC and start writing my blog post
Mar 19th: I publish my blog post about the machine not being stable
The online discussion does not result in any interesting tips or leads.
Mar 20th: I order the AsRock Z890 Pro-A mainboard to ensure the mainboard is OK
Mar 24th: the AsRock Z890 Pro-A arrives
Apr 5th (Sat): started an RMA for the CPU
They ask me to send the CPU to orderflow, which is the merchant that fulfilled my order.
Typically, I prefer buying directly at digitec, but many PC components seem to only be available from orderflow on digitec nowadays.
Apr 9th (Wed): package arrives at orderflow (digitec gave me a non-priority return label)
Apr 14th (Mon): I got the following mail from digitec’s customer support and had to explain that I have thoroughly diagnosed the CPU as defective (a link to my blog post was sufficient):
Apr 25th (Fri): orderflow hands the replacement CPU to Swiss Post
May 1st (Thu): the machine successfully passes stress tests; I start using it


In praise of grobi for auto-configuring X11 monitors

I have recently started using the program by Alexander Neumann again and was delighted to discover that it makes using my fiddly (but wonderful) Dell 32-inch 8K (UP3218K) monitor much more convenient — I get a signal more quickly than with my previous, sleep-based approach. Previously, when my PC woke up from suspend-to-RAM, there were two scenarios: In scenario ②, or if the one-shot configuration attempt in scenario ① fails, I would need to SSH in from a different computer and run manually so that the monitor would show a signal: I have now completely solved this problem by creating the following file: …and installing / enabling (on Arch Linux) using: Whenever detects that my monitor is connected (it listens for X11 RandR output change events), it will run to configure the monitor resolution and positioning. To check what is seeing/doing, you can use: For example, on my system, I see: Notably, the instructions for getting out of a bad state (no signal) are now to power off the monitor and power it back on again. This will result in RandR output change events, which will trigger , which will run , which configures the monitor. Nice! No particular reason. I knew . If nothing else, is written in Go, so it’s likely to keep working smoothly over the years. Probably not. There is no mention of Wayland over on the grobi repository . As a bonus, this section describes the other half of my monitor-related automation. When I suspend my PC to RAM, I either want to wake it up manually later, for example by pressing a key on the keyboard or by sending a Wake-on-LAN packet, or I want it to wake up automatically each morning at 6:50 — that way, daily cron jobs have some time to run before I start using the computer. To accomplish this, I use , a wrapper program around and that integrates with the myStrom switch smart plug to turn off power to the monitor entirely. This is worthwhile because the monitor draws 30W even in standby!
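For reference, a sketch of what such a grobi configuration can look like. The output name DP-2 and the rule details are examples, not the author's actual config; consult the grobi repository README for the exact schema:

```yaml
# ~/.config/grobi.conf (sketch with an example output name)
rules:
  - name: 8k-monitor
    outputs_connected: [DP-2]
    configure_single: DP-2
    primary: true
    atomic: true
```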
To turn power to the monitor on after resuming, I placed the following shell script in : Once power is on, grobi will detect and configure the monitor. Here is the program in action:
① The monitor was connected. My sleep program would power on the monitor (if needed), sleep a little while and then run to (hopefully) configure the monitor correctly.
② The monitor was not connected, for example because it was still connected to my work PC.
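The scheduled wake-up can be sketched with rtcwake. This is not the author's exact wrapper (which also drives the myStrom plug); the privileged commands are commented out:

```shell
# Compute the epoch timestamp of the next 06:50 (today if still ahead, else tomorrow):
target=$(date -d 'today 06:50' +%s)
now=$(date +%s)
if [ "$target" -le "$now" ]; then
  target=$(date -d 'tomorrow 06:50' +%s)
fi
echo "next wake-up at epoch $target"

# Program the RTC alarm without suspending (-m no), then suspend (needs root):
# rtcwake -m no -t "$target"
# systemctl suspend
```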


Intel 9 285K on ASUS Z890: not stable!

Update (2025-05-15): Turns out the CPU was faulty! See My 2025 high-end Linux PC for a new article on this build, now with a working CPU. Update (2025-09-07): The replacement CPU also died and I have given up on Intel. See Bye Intel, hi AMD! for more details on the AMD 9950X3D. In January I ordered the components for a new PC and expected that I would publish a successor to my 2022 high-end Linux PC 🐧 article. Instead, I am now sitting on a PC which regularly encounters crashes of the worst-to-debug kind, so I am publishing this article as a warning for others in case you wanted to buy the same hardware. Which components did I pick for this build? Here’s the full list: Total: ≈1800 CHF, excluding the Graphics Card I re-used from a previous build. …and the next couple of sections go into detail on how I selected these components. I have been a fan of Fractal cases for a couple of generations. In particular, I realized that the “Compact” series offers plenty of space even for large graphics cards and CPU coolers, so that’s now my go-to case: the Fractal Define 7 Compact (Black Solid). I really like building components into the case and working with the case. There are no sharp edges, the mechanisms are a pleasure to use and the cable-management is well thought-out. The only thing that wasn’t top-notch is that Fractal ships the case screws in sealed plastic packages that you need to cut open. I would have wished for a re-sealable plastic baggie so that one can keep the unused screws instead of losing them. I wanted to keep my options open regarding upgrading to an nVidia 50xx series graphics card at a later point. Those models have a TGP (“Total Graphics Power”) of 575 watts, so I needed a power supply that delivers enough power for the whole system even at peak power usage in all dimensions. I ended up selecting the Corsair RM850x, which reviews favorably (“leader in the 850W gold category”) and was available at my electronics store of choice.
This was a good choice: the PSU indeed runs quiet, and I really like the power cables (e.g. the GPU cable) that they include: they are very flexible, which makes them easy to cable-manage. I have been avoiding PCIe 5 SSDs so far because they consume a lot more power compared to PCIe 4 SSDs. While bulk streaming data transfer rates are higher on PCIe 5 SSDs, random transfers are not significantly faster. Most of my compute workload consists of random transfers, not large bulk transfers. The power draw situation with PCIe 5 SSDs seems to be getting better lately, with the Phison E31T being the first controller that implements power saving. A disk that uses the E31T controller is the Corsair Force Series MP700 Elite. Unfortunately, said disk was unavailable when I ordered. Instead, I picked the Samsung 990 Pro with 4 TB. I have had good experiences with the Samsung Pro series over the years (never had one die or degrade performance), and my previous 2 TB disk is starting to fill up, so the extra storage space is appreciated. One annoying realization is that most mainboard vendors seem to have moved to 2.5 GbE (= 2.5 Gbit/s ethernet) onboard network cards. I would have been perfectly happy to play it safe and buy another Intel I225 1 GbE network card, as long as it just works with Linux. In the 2.5 GbE space, the main players seem to be Realtek and Intel. Most mainboard vendors opted for Realtek as far as I could see. Linux includes the driver for Realtek network cards, but you need a recent-enough Linux version (6.13+) that includes commit “ r8169: add support for RTL8125D ”, accompanied by a recent-enough linux-firmware package. Even then, there is some concern around stability and ASPM support. See for example this ServerFault post by someone working on the driver. Despite the Intel 1 GbE options being well-supported at this point, Intel’s 2.5 GbE options might not fare any better than the Realtek ones: I found reports of instability with Intel’s 2.5 GbE network cards .
Aside from the network cards, I decided to stick to the ASUS PRIME series of mainboards, as I have had good experiences with those in my past few builds. Here are a couple of thoughts on the ASUS PRIME Z890-P mainboard I went with: I am a long-time fan of Noctua’s products: This company makes silent fans with great cooling capacity that work reliably! For many years, I have swapped out all the fans of each of my PCs with Noctua fans, and it was always an upgrade. Highly recommended. Hence, it was a given that I would pick the latest and greatest Noctua CPU cooler for this build: the Noctua NH-D15 G2. There are a couple of things to pay attention to with this cooler: Probably the point that raises most questions about this build is why I selected an Intel CPU over an AMD CPU. The primary reason is that Intel CPUs are so much better at power saving! Let me explain: Most benchmarks online are for gamers and hence measure a usage curve that goes “start game, run PC at 100% resources for hours”. Of course, when you never let the machine idle, you mainly care about power efficiency : how much power do you need to use to achieve the desired result? My use-case is software development, not gaming. My usage curve oscillates between “barely any usage because Michael is reading text” and “complete this compilation as quickly as possible with all the power available”. For me, both idle power consumption and absolute performance need to be best-of-class. AMD’s CPUs offer great performance (the recently released Ryzen 9 9950X3D is even faster than the Intel 9 285K), and have great power efficiency , but poor power consumption at idle: With ≈35W of idle power draw, Zen 5 CPUs consume ≈3x as much power as Intel CPUs! Intel’s CPUs offer great performance (like AMD) and excellent power consumption at idle.
Therefore, I can’t in good conscience buy an AMD CPU, but if you want a fast gaming-only PC or run an always-loaded HPC cluster with those CPUs, definitely go ahead :) I don’t necessarily recommend any particular nVidia graphics card, but I have had to stick to nVidia cards because they are the only option that work with my picky Dell UP3218K monitor . From time to time, I try out different graphics cards. Recently, I got myself an AMD Radeon RX 9070 because I read that it works well with open source drivers. While the Radeon RX 9070 works with my monitor (great!), it seems to consume 45W in idle, which is much higher than my nVidia cards, which idle at ≈ 20W. This is unacceptable to me: Aside from high power costs and wasting precious resources, the high power draw also means that my room will be hotter in summer and the fans need to spin faster and therefore louder. Maybe I’ll write a separate article about the Radeon RX 9070. On the internet, I read that there was some issue related to the Power Limits that mainboards come with by default. Therefore, I did a UEFI firmware update first thing after getting the mainboard. I upgraded to version 1404 (2025/01/10) using the provided ZIP file ( ) on an MS-DOS FAT-formatted USB stick with the EZ Flash tool in the UEFI firmware interface. Tip: do not extract the ZIP file, otherwise the EZ Flash tool cannot update the Intel ME firmware. Just put the ZIP file onto the USB disk as-is. I verified that with this UEFI version, the is 250W, and , which are exactly the values that Intel recommends. Great! I also enabled XMP and verified that memtest86 reported no errors. To copy over the data from the old disk to the new disk, I wanted to boot a live linux distribution (specifically, grml.org ) and follow my usual procedure: boot with the old disk and the new (empty) disk, then use to copy the data. It’s nice and simple, hard to screw up. 
Unfortunately, while grml 2024.12 technically does boot up, there are two big problems: There is no network connectivity because the kernel and linux-firmware versions are too old. I could not get Xorg to work at all. Not with the Intel integrated GPU, nor with the nVidia dedicated GPU. Not with or any of the other options in the grml menu. This wasn’t merely a convenience problem: I needed to use (the graphical version) for its partition moving/resizing support. Ultimately, it was easier to upgrade my old PC to Linux 6.13 and linux-firmware 20250109, then put in the new disk and copy over the installation. At this point (early February), I switched to this new machine as my main PC. Unfortunately, I could never get it to run stably! This journal shows you some of the issues I faced and what I tried in order to troubleshoot them. One of the first issues I encountered with this system was that after resuming from suspend-to-RAM, I was greeted with a login window instead of my X11 session. The logs say: I couldn’t find any good tips online for this error message, so I figured I’d wait and see how frequently this happens before investigating further. On Feb 18th, after resume-from-suspend, none of my USB peripherals would work anymore! This affected all USB ports of the machine and could not be fixed, not even by a reboot, until I fully killed power to the machine! In the kernel log, I saw the following messages: The HC dying issue happened again when I was writing an SD card in my USB card reader: To try and fix the host controller dying issue, I updated the UEFI firmware to version and disabled the XMP RAM profile. To rule out any GPU-specific issues, I decided to switch back from the Inno3D GeForce RTX4070 Ti to my older MSI GeForce RTX 3060 Ti. On Feb 28th, my PC did not resume from suspend-to-RAM. It would not even react to a ping, I had to hard-reset the machine. When checking the syslog afterwards, there were no entries.
I checked my power monitoring and saw that the machine consumed 50W (well above idle power, and far above suspend-to-RAM power) throughout the entire night. Hence, I suspect that the suspend-to-RAM did not work correctly and the machine never actually suspended. On March 4th, I was running the test suite for a medium-sized Django project (= 100% CPU usage) when I encountered a really hard crash: The machine stopped working entirely, meaning all peripherals like keyboard and mouse stopped responding, and the machine did not even respond to a network ping anymore. At this point, I had enough and switched back to my 2022 PC. What use is a computer that doesn’t work? My hierarchy of needs contains stability as the foundation, then speed and convenience. This machine exhausted my tolerance for frustration with its frequent crashes. Manawyrm actually warned me about the ASUS board: ASUS boards are a typical gamble as always – they fired their firmware engineers about 10 years ago, so you might get a nightmare of ACPI troubleshooting hell now (or it’ll just work). ASRock is worth a look as a replacement if that happens. Electronics are usually solid, though… I didn’t expect that this PC would crash so hard, though. Like, if it couldn’t suspend/resume that would be one thing (a dealbreaker, but somewhat expected and understandable, probably fixable), but a machine that runs into a hard lockup when compiling/testing software? No thanks. I will buy a different mainboard to see if that helps, likely the ASRock Z890 Pro-A. If you have any recommendations for a Z890 mainboard that actually works reliably, please let me know! Update 2025-04-17: I have received the ASRock Z890 Pro-A, but the machine shows exactly the same symptoms! I also swapped the power supply, which also did not help. Running Prime95 crashed the machine almost immediately. At this point, I have to assume the CPU itself is defective and have started an RMA. I will post another update once (if?) I get a replacement CPU.
Update 2025-05-11: The CPU was indeed faulty! See My 2025 high-end Linux PC for a new article on this build, now with a working CPU. I like the quick-release PCIe mechanism: ASUS understood that people had trouble unlocking large graphics cards from their PCIe slot, so they added a lever-like mechanism that is easily reachable. In the handful of times I have used it, this worked pretty well! I wrote about slow boot times with my 2022 PC build that were caused by time-consuming memory training. On this ASUS board, I noticed that they blink the Power LED to signal that memory training is in progress. Very nice! It hadn’t occurred to me previously that the various phases of the boot could be signaled by different Power LED blinking patterns :) The downside of this feature is: While the machine is in suspend-to-RAM, the Power LED also blinks! This is annoying, so I might just disconnect the Power LED entirely. The UEFI firmware includes what they call a Q-Dashboard: An overview of what is installed/connected in which slot. Quite nice: I decided to configure the cooler with one fan instead of two fans: using only one fan is the quietest setup, yet still provides plenty of cooling capacity for this build. The cooler comes in 3 different versions that differ in how their base plate is shaped. Noctua recommends: “For LGA1851, we generally recommend the regular standard version with medium base convexity” ( https://noctua.at/en/intel-lga1851-all-you-need-to-know ) The height of this cooler is 168 mm. This fits well into the Fractal Define 7 Compact Black.


Tips to debug hanging Go programs

I was helping someone get my gokrazy/rsync implementation set up to synchronize RPKI data (used for securing BGP routing infrastructure), when we discovered that with the right invocation, my rsync receiver would just hang indefinitely. This was a quick problem to solve, but in the process, I realized that I should probably write down a few Go debugging tips I have come to appreciate over the years! If you want to follow along, you can reproduce the issue by building an older version of gokrazy/rsync, just before the bug fix commit (you’ll need Go 1.22 or newer ): Now we can try to sync the repository: …and then the program just sits there. The easiest way to look at where a Go program is hanging is to press (backslash) to make the terminal send it a signal . When the Go runtime receives , it prints a stack trace to the terminal before exiting the process. This behavior is enabled by default and can be customized via the environment variable, see the package docs . Here is what the output looks like in our case. I have made the font small so that you can recognize the shape of the output (the details are not important, continue reading below): Phew! This output is pretty dense. We can use the https://github.com/maruel/panicparse program to present this stack trace in a more colorful and much shorter version: The functions helpfully highlighted in red are where the problem lies: My rsync receiver implementation was incorrectly expecting the server to send a uid/gid list, despite the PreserveUid and PreserveGid options not being enabled. Commit fixes the issue. If dumping the stack trace in the moment is not sufficient to diagnose the problem, you can go one step further and reach for an interactive debugger. The most well-known Linux debugger is probably GDB, but when working with Go, I recommend using the delve debugger instead as it typically works better. Install delve if you haven’t already: In this article, I am using delve v1.24.0. 
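If you cannot (or do not want to) send a signal from the terminal, the same kind of dump is also available programmatically via the standard runtime package. Here is a small self-contained sketch (the blocked channel receive stands in for the hanging rsync receiver; the helper name is my own):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// dumpGoroutines returns the stack traces of all goroutines, similar
// to the dump the Go runtime prints when it receives SIGQUIT.
func dumpGoroutines() string {
	buf := make([]byte, 1<<20)    // 1 MiB is plenty for small programs
	n := runtime.Stack(buf, true) // true = all goroutines, not just the caller
	return string(buf[:n])
}

func main() {
	// Simulate a hang: this goroutine blocks forever on a channel
	// receive, like a receiver waiting for data that never arrives.
	go func() {
		ch := make(chan int)
		<-ch
	}()
	time.Sleep(100 * time.Millisecond) // give the goroutine time to block

	// The blocked goroutine shows up in the dump as "chan receive".
	fmt.Println(dumpGoroutines())
}
```

This is handy for wiring a stack dump into a debug HTTP endpoint or a custom signal handler in long-running services.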
Note: If you want to explore local variables, you should rebuild your program without optimizations and inlining (see the docs ): While you can run a new child process in a debugger (use ) without any special permissions, attaching existing processes in a debugger is disabled by default in Linux for security reasons. We can allow this feature (remember to turn it off later!) using: …and then we can just to the hanging process: Great. But if we just print a stack trace, we only see functions from the package: The reason is that no goroutine is running (the program is waiting indefinitely to receive data from the server), so we see one of the OS threads waiting in the Go scheduler. We first need to switch to the goroutine we are interested in ( prints all goroutines), and then the stack trace looks like what we expect: If you don’t have time to poke around in the debugger now, you can save a core dump for later. Tip: Check out my debugging Go core dumps with delve blog post from 2024 for more details! This section just explains how to collect core dumps. In addition to printing the stack trace on , we can make the Go runtime crash the program, which in turn makes the Linux kernel write a core dump, by running our program with the environment variable . Modern Linux systems typically include (but you might need to explicitly install it, for example on Ubuntu) to collect core dumps (and remove old ones). You can use to list and work with them. On macOS, collecting cores is more involved . I don’t know about Windows. In case your Linux system does not use , you can use and set the kernel’s sysctl setting. You can find more details and options in the CoreDumpDebugging page of the Go wiki . For this article, we will stick to : The last line is what we want to see: it should say “core dumped”. This core should now show up in : If you see only hexadecimal addresses followed by , that means could not symbolize (= resolve addresses to function names) your core dump. 
Here are a few possible reasons for missing symbolization: Linux 6.12 and 6.13 produced core dumps that elfutils cannot symbolize . uses elfutils for symbolization, so avoid 6.12/6.13 in favor of 6.14 or newer. With systemd v234-v256, did not have permission to look into programs living in the directory (fixed with commit in systemd v257+). Similarly, runs with , meaning it won’t be able to access programs you place in . And while Go builds with debug symbols by default, maybe you are explicitly stripping debug symbols in your build, by building with ? We can now use to launch delve for this program + core dump: In my experience, in the medium to long term, it always pays off to set up your environment such that you can debug your programs conveniently. I strongly encourage every programmer (and even users!) to invest time into their development and debugging setup. Luckily, Go comes with stack printing functionality by default (just press ) and we can easily get a core dump out of our Go programs by running them with — provided the system is set up to collect core dumps. Together with the delve debugger, this gives us all we need to effectively and efficiently diagnose problems in Go programs.

Michael Stapelberg 11 months ago

Go Protobuf: The new Opaque API

I originally published this post in the Go blog , but am publishing this copy of it in my own blog as well for readers who don’t follow the Go blog. [ Protocol Buffers (Protobuf) is Google’s language-neutral data interchange format. See protobuf.dev .] Back in March 2020, we released a major overhaul of the Go Protobuf API . The package introduced first-class support for reflection , a implementation and the package for easier testing. That release introduced a new protobuf module with a new API. Today, we are releasing an additional API for generated code, meaning the Go code in the files created by the protocol compiler ( ). This blog post explains our motivation for creating a new API and shows you how to use it in your projects. To be clear: We are not removing anything. We will continue to support the existing API for generated code, just like we still support the older protobuf module (by wrapping the implementation). Go is committed to backwards compatibility and this applies to Go Protobuf, too! We now call the existing API the Open Struct API, because generated struct types are open to direct access. In the next section, we will see how it differs from the new Opaque API. To work with protocol buffers, you first create a definition file like this one: Then, you run the protocol compiler ( ) to generate code like the following (in a file): Now you can import the generated package from your Go code and call functions like to encode messages into protobuf wire format. You can find more details in the Generated Code API documentation . An important aspect of this generated code is how field presence (whether a field is set or not) is modeled. For instance, the above example models presence using pointers, so you could set the field to: If you are used to generated code not having pointers, you are probably using files that start with . 
The field presence behavior changed over the years: We created the new Opaque API to uncouple the Generated Code API from the underlying in-memory representation. The (existing) Open Struct API has no such separation: it allows programs direct access to the protobuf message memory. For example, one could use the package to parse command-line flag values into protobuf message fields: The problem with such a tight coupling is that we can never change how we lay out protobuf messages in memory. Lifting this restriction enables many implementation improvements, which we’ll see below. What changes with the new Opaque API? Here is how the generated code from the above example would change: With the Opaque API, the struct fields are hidden and can no longer be directly accessed. Instead, the new accessor methods allow for getting, setting, or clearing a field. One change we made to the memory layout is to model field presence for elementary fields more efficiently: Using fewer variables and pointers also lowers load on the allocator and on the garbage collector. The performance improvement depends heavily on the shapes of your protocol messages: The change only affects elementary fields like integers, bools, enums, and floats, but not strings, repeated fields, or submessages (because it is less profitable for those types). Our benchmark results show that messages with few elementary fields exhibit performance that is as good as before, whereas messages with more elementary fields are decoded with significantly fewer allocations: Reducing allocations also makes decoding protobuf messages more efficient: (All measurements done on an AMD Castle Peak Zen 2. Results on ARM and Intel CPUs are similar.) Note: proto3 with implicit presence similarly does not use pointers, so you will not see a performance improvement if you are coming from proto3. 
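To illustrate the bit-field idea in isolation, here is a hypothetical sketch (not actual protoc-gen-go output; the Port field and method names are made up): instead of one pointer per elementary field, presence is tracked in one shared bitmask.

```go
package main

import "fmt"

// Hypothetical sketch of the two layouts; this is not real generated code.

// Open Struct style: presence of an elementary field costs a pointer.
type openStyle struct {
	Port *int32 // nil means "not set"
}

// Opaque style: the value is stored inline and presence is one bit in
// a bitmask shared by all elementary fields.
type opaqueStyle struct {
	port         int32
	presenceBits uint64
}

const portBit = 1 << 0

func (m *opaqueStyle) GetPort() int32 { return m.port }
func (m *opaqueStyle) HasPort() bool  { return m.presenceBits&portBit != 0 }
func (m *opaqueStyle) SetPort(v int32) {
	m.port = v
	m.presenceBits |= portBit
}
func (m *opaqueStyle) ClearPort() {
	m.port = 0
	m.presenceBits &^= portBit
}

func main() {
	var o openStyle
	fmt.Println(o.Port == nil) // true: "not set" is modeled as a nil pointer

	var m opaqueStyle
	m.SetPort(0)
	// Unlike implicit presence, "set to the zero value" and "not set"
	// remain distinguishable:
	fmt.Println(m.HasPort(), m.GetPort()) // true 0
	m.ClearPort()
	fmt.Println(m.HasPort()) // false
}
```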
If you were using implicit presence for performance reasons, forgoing the convenience of being able to distinguish empty fields from unset ones, then the Opaque API now makes it possible to use explicit presence without a performance penalty. Lazy decoding is a performance optimization where the contents of a submessage are decoded when first accessed instead of during . Lazy decoding can improve performance by avoiding unnecessarily decoding fields which are never accessed. Lazy decoding can’t be supported safely by the (existing) Open Struct API. While the Open Struct API provides getters, leaving the (un-decoded) struct fields exposed would be extremely error-prone. To ensure that the decoding logic runs immediately before the field is first accessed, we must make the field private and mediate all accesses to it through getter and setter functions. This approach made it possible to implement lazy decoding with the Opaque API. Of course, not every workload will benefit from this optimization, but for those that do benefit, the results can be spectacular: We have seen logs analysis pipelines that discard messages based on a top-level message condition (e.g. whether is one of the machines running a new Linux kernel version) and can skip decoding deeply nested subtrees of messages. As an example, here are the results of the micro-benchmark we included, demonstrating how lazy decoding saves over 50% of the work and over 87% of allocations! Modeling field presence with pointers invites pointer-related bugs. Consider an enum, declared within the message: A simple mistake is to compare the enum field like so: Did you spot the bug? The condition compares the memory address instead of the value. Because the accessor allocates a new variable on each call, the condition can never be true. The check should have read: The new Opaque API prevents this mistake: Because fields are hidden, all access must go through the getter. 
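The pointer-comparison bug is easy to reproduce without any protobuf involvement. In this sketch, Kind, msg and the accessors are made-up stand-ins for a generated enum and its getter:

```go
package main

import "fmt"

// Made-up stand-ins for a generated enum and its message.
type Kind int32

const KindGood Kind = 1

type msg struct{ kind Kind }

// getKindPtr mimics an accessor that allocates a new variable on every
// call and returns its address.
func (m *msg) getKindPtr() *Kind {
	k := m.kind
	return &k
}

// kindPtr mimics a helper like proto.Int32 that turns a value into a pointer.
func kindPtr(k Kind) *Kind { return &k }

func main() {
	m := &msg{kind: KindGood}

	// Buggy: compares two freshly allocated addresses, which is never true.
	fmt.Println(m.getKindPtr() == kindPtr(KindGood))

	// Correct: dereference and compare the values.
	fmt.Println(*m.getKindPtr() == KindGood)
}
```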
Let’s consider a slightly more involved pointer-related bug. Assume you are trying to stabilize an RPC service that fails under high load. The following part of the request middleware looks correct, but still the entire service goes down whenever just one customer sends a high volume of requests: Did you spot the bug? The first line accidentally copied the pointer (thereby sharing the pointed-to variable between the and messages) instead of its value. It should have read: The new Opaque API prevents this problem as the setter takes a value ( ) instead of a pointer: To write code that works not only with a specific message type (e.g. ), but with any message type, one needs some kind of reflection. The previous example used a function to redact IP addresses. To work with any type of message, it could have been defined as . Many years ago, your only option to implement a function like was to reach for Go’s package , which resulted in very tight coupling: you had only the generator output and had to reverse-engineer what the input protobuf message definition might have looked like. The module release (from March 2020) introduced Protobuf reflection , which should always be preferred: Go’s package traverses the data structure’s representation, which should be an implementation detail. Protobuf reflection traverses the logical tree of protocol messages without regard to its representation. Unfortunately, merely providing protobuf reflection is not sufficient and still leaves some sharp edges exposed: In some cases, users might accidentally use Go reflection instead of protobuf reflection. For example, encoding a protobuf message with the package (which uses Go reflection) was technically possible, but the result is not canonical Protobuf JSON encoding . Use the package instead. The new Opaque API prevents this problem because the message struct fields are hidden: accidental usage of Go reflection will see an empty message. 
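The aliasing bug can likewise be reproduced with plain structs. Here, request and logEntry are made-up stand-ins for two generated message types with a pointer-typed field:

```go
package main

import "fmt"

// Made-up stand-ins for two generated message types that both have an
// optional (pointer-typed) ip_address field.
type request struct{ IPAddress *string }
type logEntry struct{ IPAddress *string }

func main() {
	ip := "203.0.113.7"
	req := &request{IPAddress: &ip}

	// Buggy: copies the pointer, so both messages now share one variable.
	entry := &logEntry{IPAddress: req.IPAddress}
	*entry.IPAddress = "REDACTED"
	fmt.Println(*req.IPAddress) // the request was clobbered, too!

	// Correct: copy the value into a fresh variable before sharing.
	ip2 := "203.0.113.7"
	req2 := &request{IPAddress: &ip2}
	v := *req2.IPAddress
	entry2 := &logEntry{IPAddress: &v}
	*entry2.IPAddress = "REDACTED"
	fmt.Println(*req2.IPAddress) // unchanged
}
```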
This is clear enough to steer developers towards protobuf reflection. The benchmark results from the More Efficient Memory Representation section have already shown that protobuf performance heavily depends on the specific usage: How are the messages defined? Which fields are set? To keep Go Protobuf as fast as possible for everyone , we cannot implement optimizations that help only one program, but hurt the performance of other programs. The Go compiler used to be in a similar situation, up until Go 1.20 introduced Profile-Guided Optimization (PGO) . By recording the production behavior (through profiling ) and feeding that profile back to the compiler, we allow the compiler to make better trade-offs for a specific program or workload . We think using profiles to optimize for specific workloads is a promising approach for further Go Protobuf optimizations. The Opaque API makes those possible: Program code uses accessors and does not need to be updated when the memory representation changes, so we could, for example, move rarely set fields into an overflow struct. You can migrate on your own schedule, or even not at all—the (existing) Open Struct API will not be removed. But, if you’re not on the new Opaque API, you won’t benefit from its improved performance, or future optimizations that target it. We recommend you select the Opaque API for new development. Protobuf Edition 2024 (see Protobuf Editions Overview if you are not yet familiar) will make the Opaque API the default. Aside from the Open Struct API and Opaque API, there is also the Hybrid API, which keeps existing code working by keeping struct fields exported, but also enabling migration to the Opaque API by adding the new accessor methods. With the Hybrid API, the protobuf compiler will generate code on two API levels: the is on the Hybrid API, whereas the version is on the Opaque API and can be selected by building with the build tag. See the migration guide for detailed instructions. 
The high-level steps are: Small usages of protobuf can live entirely within the same repository, but usually, files are shared between different projects that are owned by different teams. An obvious example is when different companies are involved: To call Google APIs (with protobuf), use the Google Cloud Client Libraries for Go from your project. Switching the Cloud Client Libraries to the Opaque API is not an option, as that would be a breaking API change, but switching to the Hybrid API is safe. Our advice for such packages that publish generated code ( files) is to switch to the Hybrid API please! Publish both the and the files, please. The version allows your consumers to migrate on their own schedule. Lazy decoding is available (but not enabled) once you migrate to the Opaque API! 🎉 To enable: in your file, annotate your message-typed fields with the annotation. To opt out of lazy decoding (despite annotations), the package documentation describes the available opt-outs, which affect either an individual Unmarshal operation or the entire program. By using the open2opaque tool in an automated fashion over the last few years, we have converted the vast majority of Google’s files and Go code to the Opaque API. We continuously improved the Opaque API implementation as we moved more and more production workloads to it. Therefore, we expect you should not encounter problems when trying the Opaque API. In case you do encounter any issues after all, please let us know on the Go Protobuf issue tracker . Reference documentation for Go Protobuf can be found on protobuf.dev → Go Reference . 
There are three presence cases: (1) the field is set and contains “zrh01.prod”; (2) the field is set (non-nil pointer) but contains an empty value; (3) a nil pointer: the field is not set. proto2 uses explicit presence by default. proto3 used implicit presence by default (where cases 2 and 3 cannot be distinguished and are both represented by an empty string), but was later extended to allow opting into explicit presence with the keyword. Protobuf Editions, the successor to both proto2 and proto3, use explicit presence by default. The (existing) Open Struct API uses pointers, which adds a 64-bit word to the space cost of the field. The Opaque API uses bit fields, which require one bit per field (ignoring padding overhead). The high-level migration steps are: enable the Hybrid API, update existing code using the migration tool, then switch to the Opaque API.

Michael Stapelberg 11 months ago

Get a solar panel for your balcony now ☀️

A year ago, I got a solar panel for my balcony — an easy way to vote with your wallet to convert more of the world’s energy usage to solar power. That was a great decision and I would recommend everyone get a solar panel (or two)! In my experience, many people are surprised about the basics of how power works: You do not need to connect devices to a battery in order to enjoy solar power. You can just plug the solar panel into your household electricity setup. Any of your consumers (like a TV, or electric cooktop) will now use the power that your solar panel produces before consuming power from the grid. Here’s the panel I have (Weber barbecue for scale). As you can see, the panel is not yet mounted at an angle, just hung over the balcony. The black box at the back of the panel is the inverter (“Wechselrichter”). You connect the panel on one side and get electricity out the other side. There are two big questions to answer when choosing a solar panel: what peak capacity should your panel(s) have and which company / seller do you buy from? Regarding panel capacity: When I look at my energy usage, I see about 100 watts of baseline load. This includes always-on servers and other home automation devices. During working hours, running a PC and (power-hungry) monitor adds another 100 watts or so. Around noon, there is quite a spike in usage when cooking with my induction cooktop. Hence, I figured a plug & play solar panel with the smallest size of 385 Wp would be well equipped to cover baseline usage, compared to the next bigger unit with 780 Wp, which seems oversized for my usage. Note that a peak capacity of 385 Wp does not mean that you will actually measure 385 W of output. I did repeatedly measure energy production exceeding 300W. Regarding the company, the best offer I found in Switzerland was a small company called erneuer.bar , which means “renewable” in German.
They ship the panels with barely any packaging in fully electric vehicles and their offer is eligible for the topten bonus program from EWZ , meaning you’ll get back 200 CHF if you fill in a form. The specific model I ordered was called “385 Wp Plug & Play Solar (DE)”. Here’s the bill: Of course, you can save some money in various ways. For example, the measurement device and pre-mount option are both not required, but convenient. Similarly, you can probably find solar panels for cheaper, but the offer that erneuer.bar has put together truly is very convenient and works well, and to me that’s worth some money. One mistake I made when ordering is selecting a 5m cable. It turned out I needed a 10m cable, so I recommend you measure better than I did (or just select the longer cable). On the plus side, customer service was excellent: I quickly received an email response and could just send back my cable in exchange for a new one. Many people seem to consider only the financial aspect of buying a solar panel and calculate when the solar panel will have paid for itself. I don’t care. My goal is to convert more energy usage to green energy, not to save money. Similarly, some people install batteries so that they can use “their” energy for themselves, in case the solar panel produces more than they use at that moment. I couldn’t care less who uses the energy I produce — as long as it’s green energy, anyone is welcome to consume it. (Of course I understand these questions become more important the larger a solar installation gets. But we’re talking about one balcony and one solar panel (or two) covering someone’s baseline residential household electricity load. Don’t overthink it!) Aside from having a balcony, there is only one hard requirement: you need a power socket. This requirement is either trivially fulfilled if you already have an outdoor power socket on your balcony (lucky you!), or might turn out to be the most involved part of the project. 
Either way, because an electrician needs to install power sockets, all you can do is get permission from your landlord and make an appointment with your electrician of choice. In terms of cost, you will probably spend a few hundred bucks, depending on your area’s cost of living. A good idea that did not occur to me back then: Ask around in your house if any neighbors would be interested in getting a balcony power socket, too, and do it all in one go (for cheaper). One can easily find stories online about electricity providers and landlords not permitting the installation of solar panels for… rather questionable reasons. For example, some claimed that solar panels could overload the house electricity infrastructure! A drastic-sounding claim, but nonsense in practice. Luckily, law makers are recognizing this and are removing barriers. In Switzerland 🇨🇭, you can connect panels producing up to 600W without an electrician, but you need to notify your electricity provider. In Germany 🇩🇪, you can connect panels producing up to 800W (as of May 16th 2024) without an electrician, but you need to register with the Bundesnetzagentur . Be sure to check your country’s laws and your electricity provider’s rules and processes. In Switzerland 🇨🇭, you need to ask your landlord for permission because if your solar panel were to fall down from the balcony, the landlord would be liable. Usually, the landlord insists on proper mounting and the tenant taking over liability. In my case, the landlord also asked me to ensure the neighbors wouldn’t mind. I put up a letter, nobody complained, the landlord accepted. In Germany 🇩🇪, you do need to ask your landlord for permission, but the landlord pretty much has to agree ( as of October 17th 2024 ). The question is not “if”, but “how” the landlord wants you to install the solar panel. Earlier I wrote that you can just hang the solar panel onto your balcony and plug it in. 
While this is true, there is one factor that is worth optimizing (as time permits): the installation angle. If you want more details about the physics background and various considerations that go into chasing the optimal angle, check out these (German) articles about optimizing the installation angle (at Golem) or sizing solar installations (at Heise) . I’ll summarize: the angle is important and can result in twice as much energy production! Any angle is usually better than no angle. In my case, I first “installed” the solar panel (no angle) at 2023-09-30. Then, about a month later, I installed it at an angle at 2023-10-28. I unfortunately don’t have a great before/after graph because after I installed the proper angle mount, there were almost no sunny days. Instead, I will show you data from a comparable time range (early October) in 2023 (before mounting the panel at an angle) and in 2024 (with a properly mounted panel). As you can see, the difference is not that huge, but clearly visible: without an angle mount, I could never exceed 300 Wh per day. With a proper mount, a number of days exceed 300 Wh: The exact electricity production numbers depend on how much sun ends up on the solar panel. This in turn depends on the weather and how obstructed the solar panel is (neighbors, trees, …). I like measuring things, so I will share some measurements to give you a rough idea. But note that measuring your solar panel is strictly optional. On the best recorded day, my panel produced about 1.680 kWh of energy: The missing parts before 14:00 are caused by the neighbor’s house blocking the sun. Now, compare this best case with the worst case, a January day with little sun (< 50 Wh): Let’s zoom out a bit and consider an entire year instead. In 2024, the panel produced over 177 kWh so far, or, averaged to the daily value, ≈0.5 kWh/day: Or, in numeric form (all numbers in kWh): A solar panel is a great project to make incremental progress on. 
It’s just 3 to 4 simple steps, each of which is valuable on its own: (1) Check with your landlord that installing an outdoor power socket and solar panel is okay. Even if you personally do not go any further with your project, you can share the result with your neighbors, who might… (2) Order an outdoor power socket from your (or your landlord’s) preferred electrician. Power will come in handy for lighting when spending summer evenings on the balcony. (3) Order a solar panel and plug it in. (4) Optional, but recommended: Optimize the mounting angle later. That’s it! Come on, get started right away 🌞


Testing with Go and PostgreSQL: ephemeral DBs

Let’s say you created a Go program that stores data in PostgreSQL — you installed PostgreSQL, wrote the Go code, and everything works; great! But after writing a test for your code, you wonder: how do you best provide PostgreSQL to your automated tests? Do you start a separate PostgreSQL in a Docker container, for example, or do you maybe reuse your development PostgreSQL instance? I have come to like using ephemeral PostgreSQL instances for their many benefits: In this article, I want to show how to integrate ephemeral PostgreSQL instances into your test setup. The examples are all specific to Go, but I expect that users of other programming languages and environments can benefit from some of these techniques as well. When you are in the very early stages of your project, you might start out with just a single test file (say, ), containing one or more test functions (say, ). In this scenario, all tests will run in the same process. While it’s easy enough to write a few lines of code to start and stop PostgreSQL, I recommend reaching for an existing test helper package. Throughout this article, I will be using the package, which is based on Roxy Light’s package but was extended to work well in the scenarios this article explains. To start an ephemeral PostgreSQL instance before your test functions run, you would declare a custom function : Starting a PostgreSQL instance takes about: Then, you can create a separate database for each test on this ephemeral Postgres instance: Each CreateDatabase call takes about: Usually, most projects quickly grow beyond just a single file. In one project of mine, I eventually reached over 50 test functions in 25 Go packages. I stuck to the above approach of adding a custom to each package in which my tests needed PostgreSQL, and my test runtimes eventually looked like this: That’s not terrible, but not great either.
If you happen to open a process monitor while running tests, you might have noticed that there are quite a number of PostgreSQL instances running. This seems like something to optimize! Shouldn’t one PostgreSQL instance be enough for all tests of a test run? Let’s review the process model of before we can talk about how to integrate with it. The usual command to run all tests of a Go project is (see for details on the pattern syntax), which matches the Go package in the current directory and all Go packages in its subdirectories. Each Go package (≈ directory), including files, is compiled into a separate test binary: These test binaries are then run in parallel. In fact, there are two levels of parallelism at play here: The documentation explains that the test flag defaults to and references the parallelism: The parallelism is controlled by the flag, which also defaults to : To print on a given machine, we can run a test program like this : For me, defaults to the 24 threads of my Intel Core i9 12900K CPU , which has 16 cores (8 Performance, 8 Efficiency; only the Performance cores have Hyper Threading): So with a single command, we can expect 24 parallel processes each running 24 tests in parallel. With our current approach, we would start up to 24 concurrent ephemeral PostgreSQL instances (if we have that many packages), which seems wasteful to me. Starting one ephemeral PostgreSQL instance per run seems better. How can we go from starting 24 Postgres instances to starting just one? First, we need to update our test setup code to work with a passed-in database URL. For that, we switch from calling to using a for a database identified by a URL. The old code still needs to remain so that you can run a single test without bothering with : Inside the test function(s), we only need to update the receiver name: Then, we create a new wrapper program (e.g. 
initpg) which calls go test and passes the database URL in an environment variable to the process(es) it starts. While we could use go run to compile and run this wrapper program, it is a bit wasteful to recompile it over and over when it rarely changes. One alternative is to use go install instead of go run. I have two minor concerns with that: go install installs into the bin directory ($HOME/go/bin by default), and on my machine, go install takes about 100ms, even when nothing has changed (details at the end of this article). I like to define a Makefile in each of my projects with a set of consistently named targets. Given that I already use a Makefile, I like to set it up to build initpg in a project-local directory. Because initpg rarely changes, the program will typically not need to be recompiled. Note that this is only approximately correct: the Makefile target’s dependency on the initpg source is not modeled, so you need to delete the built binary to pick up changes to the source.

Comparing the before and after test runtimes on the Intel Core i9 12900K, and, even more pronouncedly, on the MacBook Air M1: sharing one PostgreSQL instance has reduced the total test runtime of a full run by about 20%!

We have measurably reduced the runtime of a full test run, but if you pay close attention during development, you will notice that now every test run is a full test run, even when you only change a single package! Why can Go no longer cache any of the test results? The problem is that the database URL environment variable has a different value on each run: the name of the temporary directory that the helper package uses for its ephemeral database instance changes on each run. The go test documentation explains this caching behavior in its last paragraph. (See also Go issue #22593 for more details.) For the Go test caching to work, all environment variables our tests access (including the database URL) need to contain the same value between runs. For us, this means we cannot use a randomly generated name for the Postgres data directory, but instead need to use a fixed name.
My package offers convenient support for specifying the desired directory. When running the tests now, starting with the second run (without any changes), you should see a “(cached)” suffix printed behind tests that were successfully cached, and the test runtime should be much shorter — under a second in my project.

In this article, I have shown how to integrate PostgreSQL into your test environment in a way that is convenient for developers, light on system resources, and measurably reduces total test time. Adopting this setup seems easy enough to me. If you want to see a complete example, have a look at how I converted one of my repositories to this approach.

Now that we have a detailed understanding of the go test process model and PostgreSQL startup, we can consider further optimizations. I won’t actually implement them in this article, which is already long enough, but maybe you want to go further in your project…

My journey into ephemeral PostgreSQL instances started with Eric Radman’s pg_tmp shell script. Ultimately, I ended up with the Go solution, which I much prefer: I don’t need to ship (or require) the shell script with my projects. The fewer languages, the better. Also, pg_tmp is not a wrapper program, which resulted in problems regarding cleanup: a wrapper program can reliably trigger cleanup when tests are done, whereas pg_tmp has to poll for activity. Polling is prone to running too quickly (cleaning up a database before tests were even started) or too slowly, requiring constant tuning. But pg_tmp does have quite a clever concept of preparing PostgreSQL instances in the background and thereby amortizing startup costs between test runs. There might be an even simpler approach that could achieve the same startup latency hiding: turning the sequential startup (the wrapper needs to wait for PostgreSQL to start and only then can begin running go test) into parallel startup using socket activation.
Note that PostgreSQL does not seem to support socket activation natively, so one would probably need to implement a program-agnostic solution, as described in this Unix Stack Exchange question or Andreas Rammhold’s blog post.

For isolation, we use a different PostgreSQL database for every test. This means we need to initialize the database schema for each of these per-test databases. We can eliminate this duplicative work by sharing the same database across all tests, provided we have another way of isolating the tests from each other: a test helper that runs all queries of an entire test in a single transaction. With such transaction isolation, we can safely share the same database between tests without running into conflicts, failing tests, or needing extra locking. Be sure to initialize the database schema before sharing the database this way: long-running transactions need to lock the PostgreSQL catalog as soon as you change the database schema (i.e. create or modify tables), meaning only one test can run at a time. (Watching the database’s running queries is a great way to understand such performance issues.)

I am aware that some people don’t like the transaction isolation approach. For example, Gajus Kuizinas’s blog post “Setting up PostgreSQL for running integration tests” finds that transactions don’t work in their (JavaScript) setup. I don’t share this experience at all: in Go, the approach works well, even with nested transactions, and I have used it for months without problems. In my tests, eliminating this duplicative schema initialization work saves about 0.5s on the Intel Core i9 12900K and about 1s on the MacBook Air M1.

To recap the benefits of ephemeral PostgreSQL instances mentioned at the beginning:

- Easier development setup: no need to configure a database, an installation is enough. I recommend installing PostgreSQL from your package manager, e.g. apt install postgresql (Debian) or brew install postgresql (macOS). No need for Docker :)
- No risk of “works on my machine” (but nowhere else) problems: every test run starts with an empty database instance, so your tests must set up the database correctly.
- The same approach works locally and on CI systems like GitHub Actions.
Footnotes:

- Starting a PostgreSQL instance takes about 300ms on my Intel Core i9 12900K CPU (from 2022) and about 800ms on my MacBook Air M1 (from 2020).
- Each CreateDatabase call takes about 5-10ms on the Intel Core i9 12900K and about 20ms on the MacBook Air M1.
- Two levels of parallelism: all test functions (within a single test binary) that call t.Parallel() will be run in parallel (in batches of size -parallel), and go test will run different test binaries in parallel.
- go install installs into the bin directory, which is $HOME/go/bin by default. This means we need to rely on the PATH environment variable containing the bin directory to run the installed program. Unfortunately, influencing or determining the destination path is tricky. It would also be nice to not litter the user’s bin directory: I think the bin directory should contain programs which the user explicitly requested to install, not helper programs that are only necessary to run tests. In addition, on my machine, go install takes about 100ms, even when nothing has changed.
- Eliminating the duplicative schema initialization saves about 0.5s on my Intel Core i9 12900K and about 1s on the MacBook Air M1.


Debug Go core dumps with delve: export byte slices

Not all bugs can easily be reproduced — sometimes, all you have is a core dump from a crashing program, but no idea about the triggering conditions of the bug yet. When using Go, we can use the delve debugger for core dump debugging, but I had trouble figuring out how to save byte slice contents (for example: the incoming request causing the crash) from memory into a file for further analysis, so this article walks you through how to do it.

Let’s imagine the following scenario: you are working on a performance optimization in Go Protobuf and have accidentally badly broken the marshaling function. The function is now returning an error, so let’s run one of the failing tests with delve. Go Protobuf happens to return the already encoded bytes even when returning an error, so we can inspect the byte slice to see how far the encoding got before the error happened. In this case, we can see that the entire (trivial) message was encoded, so our error must happen at a later stage — this allows us to rule out a large chunk of code in our search for the bug.

But what would we do if a longer part of the message was displayed and we wanted to load it into a different tool for further analysis, e.g. the excellent protoscope? The low-tech approach is to print the contents and copy & paste from the delve output into an editor or similar. This stops working as soon as your data contains non-printable characters. We have multiple options to export the byte slice to a file: we could add a call to os.WriteFile to the source code and re-run the test; this is definitely the simplest option, as it works with or without a debugger. Or, as long as delve is connected to a running program, we can use delve’s call command to execute the same code without having to add it to our source. Notably, both options only work when you can debug interactively. For the first option, you need to be able to change the source.
The second option requires that delve is attached to a running process that you can afford to pause and interactively control. These are trivial requirements when running a unit test on your local machine, but they get much harder when debugging an RPC service that crashes on specific requests, as you need to run your debugging code only for the troublesome requests, skipping the unproblematic requests that should still be handled normally. So let’s switch examples: we are no longer working on Go Protobuf. Instead, we now need to debug an RPC service where certain requests crash the service. We’ll use core dump debugging!

In case you’re wondering: the name “core dump” comes from magnetic-core memory. These days we should probably say “memory dump” instead. The picture above shows an exhibit from the MIT Museum (Core Memory Unit, Bank C, from Project Whirlwind, 1953-1959), a core memory unit with 4 KB of capacity.

To make Go write a core dump when panicking, run your program with the GOTRACEBACK=crash environment variable set (all possible values are documented in the runtime package). You also need to ensure your system is set up to collect core dumps, as they are typically discarded by default. You can find more details and options on the CoreDumpDebugging page of the Go wiki. For this article, we will stick to the coredumpctl route: we’ll use the gRPC Go Quick start example, a greeter client/server program, and add a panic() call to the server handler. The last line of output is what we want to see: it should say “core dumped”. We can now use coredumpctl to launch delve for this program + core dump.

Alright! Now let’s switch to frame 9 (our server’s handler) and inspect the name field of the incoming RPC request. In this case, it’s easy to see what the field was set to in the incoming request, but let’s assume the request contained lots of binary data that was not as easy to read or copy. How do we write the byte slice contents to a file?
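The answer is a custom delve command written in Starlark, which the next section walks through. For reference, such a helper looks roughly like the sketch below. Caution: the builtin names (eval, examine_memory, write_file) and the slice-header fields (Base, Len) are written from memory and may not match your delve version exactly; consult delve's Starlark documentation for the authoritative API.

```python
# writebytestofile.star — custom delve command (Starlark sketch).

def command_writebytestofile(args):
    # usage: writebytestofile <byte slice variable> <output path>
    var_name, filename = args.split(" ")
    v = eval(None, var_name).Variable
    # A Go slice header stores the backing array address and length;
    # read that many bytes from the inferior's memory and save them.
    mem = examine_memory(v.Base, v.Len).Mem
    write_file(filename, mem)
```

In delve, scripts are loaded with the source command, and functions named command_* become available as commands.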
In this scenario, we cannot modify the source code, and delve’s call command does not work on core dumps (it only works when delve is attached to a running process). Luckily, we can extend delve with a custom Starlark function to write byte slice contents to a file. You need a version of dlv that contains commit 52405ba; until the commit is part of a released version, you can install the latest dlv directly from git. Save the Starlark code to a file, then, in delve, load the Starlark code and run the function to export the byte slice contents to a file. Finally, let’s verify that we got the right contents.

When you want to apply the core dump debugging technique to a net/http server (instead of a gRPC server, as above), you will notice that panics in your HTTP handlers do not actually result in a core dump! Code in net/http recovers panics and logs a stack trace. Or, in other words: the GOTRACEBACK environment variable configures what happens for unrecovered panics, but this panic is handled by a recover() call, so no core is dumped. This default behavior of net/http servers is now considered regrettable but cannot be changed for compatibility. (We can probably add a struct field to optionally not recover panics, though. I’ll update this paragraph once there is a proposal.)

So, what options do we have in the meantime? We could recover panics in our own code (before net/http’s panic handler is called), but then how do we produce a core dump from our own handler? A closer look reveals that the Go runtime’s crash function sends SIGABRT to the current thread using the runtime’s raise function. The default action for SIGABRT is to “terminate the process and dump core”, see signal(7). We can follow the same strategy and send SIGABRT to our own process. There is one caveat: if you have any non-Go threads running in your program, e.g. by using cgo, they might pick up the signal, so ensure they do not install a SIGABRT handler (see also the cgo-related documentation in os/signal).
If this is a concern, you can make the above code more platform-specific and use the tgkill syscall to direct the signal to the current thread, as the Go runtime does.

Core dump debugging can be a very useful technique to quickly make progress on otherwise hard-to-debug problems. In small environments (single to few Linux servers), core dumps are easy enough to turn on and work with, but in larger environments you might need to invest in central core dump collection. I hope the technique shown above comes in handy when you need to work with core dumps.

Footnotes:

- On Linux, the easiest way is to install systemd-coredump, after which core dumps will automatically be collected. You can use coredumpctl to list and work with them.
- On macOS, you can enable core dump collection, but delve cannot open macOS core dumps. Luckily, macOS is rarely used for production servers.
- I don’t know about Windows and other systems.
