Posts in Performance (20 found)
Uros Popovic 3 days ago

How to use Linux vsock for fast VM communication

Discover how to bypass the network stack for Host-to-VM communication using Linux Virtual Sockets (AF_VSOCK). This article details how to use these sockets to build a high-performance gRPC service in C++ that communicates directly over the hypervisor bus, avoiding TCP/IP overhead entirely.

Anton Zhiyanov 4 days ago

Go proposal: Goroutine metrics

Part of the Accepted! series, explaining the upcoming Go changes in simple terms.

Export goroutine-related metrics from the Go runtime. Ver. 1.26 • Stdlib • Medium impact

Go's runtime/metrics package already provides a lot of runtime stats, but it doesn't include metrics for goroutine states or thread counts. New metrics in the package give better insight into goroutine scheduling.

Per-state goroutine metrics can be linked to common production issues. An increasing waiting count can show a lock contention problem. A high not-in-go count means goroutines are stuck in syscalls or cgo. A growing runnable backlog suggests the CPUs can't keep up with demand. Observability systems can track these counters to spot regressions, find scheduler bottlenecks, and send alerts when goroutine behavior changes from the usual patterns. Developers can use them to catch problems early without needing full traces.

The proposal adds the following metrics to the package:

- Total number of goroutines since the program started.
- Number of goroutines in each state.
- Number of active threads.

The per-state numbers are not guaranteed to add up to the live goroutine count (a metric available since Go 1.16). All metrics use uint64 counters.

Start some goroutines and print the metrics after 100 ms of activity: we read the new metric values the same way as before, using metrics.Read. No surprises here.

𝗣 15490 • 𝗖𝗟 690397, 690398, 690399

P.S. If you are into goroutines, check out my interactive book on concurrency.
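The reading mechanism already exists today; only the new metric names are pending. Here is a minimal sketch of that mechanism (my own example, not from the post), using the long-standing live-goroutine metric as a stand-in for the proposal's per-state metrics:

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"sync"
	"time"
)

// liveGoroutines reads /sched/goroutines:goroutines, available since
// Go 1.16; the proposal's per-state metrics would be read the same way.
func liveGoroutines() uint64 {
	samples := []metrics.Sample{{Name: "/sched/goroutines:goroutines"}}
	metrics.Read(samples)
	return samples[0].Value.Uint64()
}

func main() {
	var wg sync.WaitGroup
	stop := make(chan struct{})
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-stop // park the goroutine so it stays live
		}()
	}

	time.Sleep(100 * time.Millisecond) // let the goroutines start
	fmt.Println("live goroutines:", liveGoroutines())

	close(stop)
	wg.Wait()
}
```

Once the new metrics land, only the names in the samples slice would change; metrics.Read stays the same.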


Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds
Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu. SOSP'25

Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn't that what an OS is for?

My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to harden a large OS such that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and instead introduce the hypervisor.

This kind of decision comes up in other circumstances too. There are often two ways to go about implementing something. The first involves widespread changes to legacy code; the other involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions.

The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS.
The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling. Some cores are reserved for data-plane tasks, while other cores are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dynamic Scheduling Headaches

Dynamic scheduling seems to be a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren't many packets flowing. Section 3.2 says this is a no-go; it would be great if there were more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured so that it will route packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock and then synchronously communicate with the accelerator.

The authors take this design as immutable. It seems plausible that the communication with the accelerator could be done asynchronously, but that would likely have ramifications for the entire control-plane software stack. This quote is telling:

Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification.
Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU); the other two are physical CPUs (pCPUs). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality, a Linux kernel module (5,800 lines of code) allows the vCPU to run at times of low data-plane activity.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

One neat trick the paper describes is the hardware workload probe. This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parsing of packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This lets the system hide the latency of the context switch from vCPU to pCPU. Think of it like a group of students in a classroom without any teachers (the teachers play the role of network packets). The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees the teacher (a network packet) returning, they shout "act responsible", and everyone returns to their schoolwork (running data-plane code).

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Results

Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server.
Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dangling Pointers

To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase will be tractable. If that happens, then Jiu-Jitsu moves like this one will be less important.

./techtipsy 1 week ago

LattePanda IOTA review: how does it perform as a home server?

Disclosure: the review sample was provided by DFRobot, the makers of LattePanda. I am allowed to keep the review sample indefinitely, no money changed hands, and as always, this post covers my own thoughts and views on the product. 1 In 2023, I happened to find a LattePanda V1 for sale at a good price. Given the then-poor availability of affordable Raspberry Pi units, I got one for testing and finding potential use cases for it in my setup. However, it was just a little bit too weak for any practical uses in 2023, with its CPU and USB connectivity just slow enough to limit its usefulness, and the networking capped at 100 Mbit/s. In 2025, we have the spiritual successor to it: the LattePanda IOTA. It keeps the same form factor, but the connectivity and raw power have both received a significant jump, with the CPU performance rivalling my current home server, the trusty ThinkPad T430. The marketing materials list all sorts of sensible use cases for it. I'm sure that it works fine for those, but I'm only interested in one thing: how close does this board get to being the perfect home server? The perfect home server uses very little power, offers plenty of affordable storage and provides a lot of performance when it's actually being relied upon. In my case, low power means less than 5 W while idling, storage means 10+ TB of redundant storage for data resilience and integrity concerns, and performance means about 4 modern CPU cores' worth (low-to-midrange desktop CPU performance). The model I'm reviewing is the 8GB RAM/64GB eMMC one, with a Windows 11 installation on it (not activated). Along with the review unit itself, I got sent a set of accessories (listed at the end of this post). The board was tested with a Lenovo 65W USB-C power adapter, because that's what I had available. Given the specs of the board and the accessories, that should be plenty. As far as I know, USB power delivery seems to work fine, and it's not just a weird USB-C connector that requires specific voltages to work.
The M.2 NVMe SSD used in this review is a 512 GB Samsung PM9A1. I got that one from another PC that really didn't need a boot drive that large. Most of the testing was done with a fresh Fedora Server 43 installation, kernel version 6.17.7. I suggest looking at the spec sheet if you're interested in all the fine details and available configurations. The overall connectivity has been improved compared to the old board. The USB ports are all fast 10 Gbit/s ones, and we have actual PCIe connectivity to play with, although the available bandwidth is quite limited, with a PCIe 3.0 x1 lane available on the port that both the M.2 M-key and PoE adapters connect to. What caught my eye was the CPU performance. I've been proudly running an old ThinkPad T430 as a server for a while now, with some failed attempts to find a more low-power and efficient solution. The Intel N150 now offers similar levels of performance, but in a much smaller power envelope. When it comes to more specialized functionalities, such as GPIO and the RP2040 microcontroller, I don't currently have a solid use case for them, so they won't be covered in this review. I might fancy giving them a go in the future though; it would be nice to get some environmental sensors on it to monitor the temperature and humidity of the server room (which is a closet). Since I also don't have a 4G LTE modem available, I did not test the associated adapter. The way you can add expansion boards to the LattePanda IOTA is quite similar to how the Raspberry Pi 5 and other similar single board computers do it: you simply run a flexible cable to an adapter board, and bam, you have extra connectivity! With the M.2 M-key adapter kit, you get the adapter itself, some mounting screws and brass stand-offs, and a tiny little flexible cable for the PCIe signal. The link speed is PCIe 3.0, with one lane available. In theory, this means a maximum of 1 GB/s of throughput.
In practice, with this board and SSD combination, I got a maximum of ~810 MB/s. I expect some level of losses with these types of setups, so in my view this seems normal. For the test, I just did a . The SSD itself supports up to 4 lanes of PCIe connectivity, so that should not be a limiting factor here. The lovely part about M.2 NVMe ports is that you can use them for a lot of off-label use cases. Fancy some SATA ports? There's an adapter for that. 2 Or a network card? Some fancy AI accelerator thingy? Or a full-sized GPU? Anything will work (probably), as long as the cables and adapters are high quality, and you provide extra power to the device through other means. The only device on my network that is connected over PoE is currently an Ubiquiti Wi-Fi access point, and that is unlikely to change in the near future because that would require a full replacement of my networking gear. 3 However, I still gave this board a quick go, and I'm happy to report that it also works as an additional standalone Ethernet port. The Ethernet controller seems to be similar or the same as the one on the main board, and it shows up as a separate networking device. Both are Realtek NIC-s ( ), and they work with the driver. Realtek has a spotty compatibility story overall on Linux from what I've read, but this one seems to work fine on Fedora Server 43. I was very close to pulling the trigger and turning it into a beefy router so that I could finally move my Wireguard networks onto the router, as my current one cannot do more than 20 Mbit/s of Wireguard traffic, but I didn't end up going through with that idea because of how well the SBC did in other areas. As some of you might know, I'm a fan of playing with fire: 18650 Li-ion battery cells. I'm hoping to one day build a solar-powered server of my own (of which there are many examples). I took some spare 18650 cells that came from an old ThinkPad battery, made sure that the voltages were more-or-less the same, and threw them on the board.
Connecting the UPS board with the standoffs was fine, but the cable connecting it with the SBC was finicky. I triple-checked that the connector was the right way around, but still had to use an uncomfortable amount of force to connect it all up. The battery cells themselves sit snugly on the board, and unless you drop the board, they should not fall out on their own. You'd still want to build a case around it if you're going to actually put it to use in rough environments. The manual for the UPS board emphasizes that it only works on Windows 10/11, and sadly that seems to be the case: the UPS does not show up as a USB-listed device, and tools like NUT did not find anything to monitor in a quick 5-minute investigation. The UPS board also has an interesting selection of switches that you can use to adjust the behaviour of the board, like automatically turning the board on when power comes back, and setting an 80% battery charge limit. The first one was not really necessary to use; the board follows whatever setting you have enabled on the SBC itself. I configured mine via UEFI settings to automatically turn on with a power adapter connected, and that worked here as well. The run time of your LattePanda IOTA with the UPS expansion board will heavily depend on your workloads and the quality of your battery cells. Mine were used cells, and then I hit the board with to create some load on it. It ran for over an hour like that, and then I got bored and wanted to proceed with testing other accessories. The marketing materials mention up to 8 hours of runtime, and I suspect that with good Li-ion cells and workloads where you idle most of the time, it will likely be achievable. The board seems to trigger a hard shutdown on Linux because the host OS is not aware of a battery being connected. Not that catastrophic for most modern filesystems and database engines, but something to consider in your own workloads in case they are Linux-based.
The UPS board seems to handle power connection and disconnection events well enough; it did not do anything weird when repeatedly plugging and unplugging the USB-C cable. 4 Based on the readings from a wall outlet energy meter, the board uses up to 20W when charging the cells. It's possible for the board to pull more than that with a maximum CPU load and connected peripherals, so I wonder if that may be an issue with more intense usage scenarios. During charging and discharging cycles, even under heavy loads, the battery cells did not get hot and were at best warm to the touch. It's gigabit. Fine for my use case given that I still live in 2006 and only have devices that support gigabit Ethernet speeds at best (excluding the Ubiquiti Wi-Fi AP), but certainly less than some competing products. Compared to the LattePanda V1, the USB port performance is actually decent for my use case. I can connect up to three USB-connected storage devices to the board, so that's exactly what I did. I set up three different USB-connected devices (listed at the end of this post). For each device (including the on-board eMMC device), I ran a , which puts a sequential read workload on all the drives in an infinite loop. After about 72 TB of data read in less than 24 hours, I checked the kernel logs and there were no stability issues whatsoever. The NVMe SSD did start throttling due to heat, which was expected with that cheap adapter. Assuming no issues with any cables and adapters, the USB ports seem to be solid enough for running storage devices off of. Yes, it can be a horrible idea in some use cases, but at the same time my ThinkPad T430 has been excellent with USB-based storage, and that's with one of the USB ports being coffee-stained! The eMMC chip is also more performant compared to the previous iteration, with sequential read speeds averaging around 316 MB/s, writes around 175 MB/s, and average read latency around 0.15 ms. Certainly good enough for a boot drive.
The LattePanda V1 struggled with larger displays, and when I gave it a go during this review, it would not properly display an image on my 3440x1440 monitor. The LattePanda IOTA just did it, at 60 Hz. On Fedora Workstation and GNOME, the experience was smooth. Once you start doing things in the browser, like video playback, the situation is less optimal, but as a makeshift desktop PC it is alright for most low/mid-range activities. The board came with a Windows 11 installation (not activated). As is tradition with Windows, the initial impressions were horrible: update processes running in the background made the active cooler go wild and the device felt sluggish. But after that process is done, the experience is not bad at all if you look past the OS being Windows. I did not do a thorough investigation, and I suggest formatting the device boot drive either way when receiving it, but the Windows 11 installation looked clean enough, with no obvious bloatware. The LattePanda V1 had some quirks. The performance was iffy, and you had to specify a Linux kernel parameter on first boot so that Fedora Linux would not confuse the optional display interface for an always-connected primary display. The previous version also didn't include a real-time clock (RTC) by default, which meant that it was impossible to schedule some systemd timers, as the clock would jump years ahead on boot on distros like Fedora Server. I got stuck in a reboot loop with a scheduled reboot job that way; it was not fun to recover from. With the LattePanda IOTA, I have not observed any oddities or quirks. Even the kernel logs don't show anything problematic, and the RTC is handy to have around, as it helps avoid the issue mentioned above. With the LattePanda V1, the cooler was not strictly required, but strongly recommended if you were going to use the board with moderate to high sustained loads.
My solution was to slap an inappropriately sized heat sink onto it with a thermal pad and zip ties and/or velcro strips, which looked horrible, because it was. With the LattePanda IOTA, the cooler is now a mandatory part of the assembly. It can be fitted with either a passive cooler, a case with passive cooling, or an active cooler. The active cooler does a good job of keeping the board cool, but it does get super loud at higher loads. The default fan curve is very primitive, with the fan changing its speed in big, sudden increments. Bursty workloads certainly feel bursty with this fan. You will not want to be in the same room with this active cooler. The sound profile is very similar to a thin and light laptop, and the fan has a very strong high-pitched whine to it. Here's an audio recording of the noise under heavy load if you're interested (MP3 file). Recorded using a Google Pixel 8a. You can mitigate the active cooler noise issue by reducing the CPU clock speed, either by setting a lower power limit in UEFI settings, or, on Linux, by setting a lower CPU performance ceiling using the driver option once on boot. This comes at the obvious cost of some raw performance, but given that CPU power consumption scales non-linearly with clock speed, you may not even notice it that much. If you are sensitive to fan noise, then do get the passive cooler and slap a Noctua fan on it; it will likely be a much better experience with both the cooling performance and noise levels. Oh, and fun fact: I got so carried away with testing that I actually forgot to remove the plastic film on the larger thermal pad that cools the supporting components. And then I did about 24 hours of stress testing with that arrangement. I can confirm that the design of the board is idiot-proof, as I did not actually notice any severe throttling or thermal issues with that mistake. You can actually see the plastic film being present in a few photos of the board in this review.
I still can't believe that after all these years I ended up making that one mistake that you usually see online in tech support gore posts. The idle power consumption of the LattePanda IOTA seems to be around 4.0W, slightly higher than the Raspberry Pi 5 8GB (around 3.2W), but lower than most x86 mini PC-s, whose idle power consumption is typically in the range of 6-14W. During the disk read speed stress test, I saw a maximum of 24.4W pulled from the wall. With the disk read stress test and a full CPU stress test, I saw a peak of 36.3W, with it quickly dropping down as the CPU settled at a lower clock speed. This board came surprisingly close to my perfect home server criteria that I had outlined earlier this year. Less than 5W when idling? Check. 10+ TB of redundant storage? Check. 4 modern cores' worth of CPU performance? Check. Enough performance during bursty workloads? So far, yes. I then installed a fresh copy of Fedora Server 43 and moved all my home server workloads to it. The eMMC storage is used as a boot drive, writes are disabled, workloads requiring good latency and speed are on the 512GB NVMe SSD, and bulk storage is connected via two existing USB-SATA adapters taken from one of those WD Elements/MyBook external hard drive enclosures. Then it just worked. No issues. 5 The drop in the overall power consumption of my whole home server and networking stack was also immediately noticeable. Here are my observations of the CPU performance and behaviour after hitting it with an all-core CPU load: I have seen the CPU hit around 3.6 GHz with a single-core load while there is nothing running in the background, but during my normal home server operations there is enough work across all 4 cores that this doesn't happen all that often, and 2.9 GHz is the ceiling for single-core performance.
The only limiting factor so far has been the 8 GB of memory on my review unit, but on the bright side, that limitation forced me to review the memory usage of some of the jobs that I run on my home server, which ended up with me finding a few resource hogs and fixing them all up. Now I can run about 30 Docker containers of various resource consumption on a single board computer, with less than 4GB of RAM used. I set up an 8GB swap file on the SSD, just in case. Thanks to the relatively small boot drive, I also learned that even if you move the Docker folder to another location, will still clutter up your boot drive, so you'll have to change that path in its file setting. I'm genuinely impressed with how well the LattePanda IOTA runs as a home server. The board isn't really designed with that use case in mind, and I suspect that the Intel N150 might be doing most of the heavy lifting here, but still, very impressive! Is it the perfect home server? No, but it's pretty damn close to my definition of it. For those interested in what options are available on the board via its UEFI settings, here are some screenshots of the settings. 6 If the LattePanda IOTA with its adapters fits your project requirements, you're aware of its limitations, and the price is right, then I believe it's a solid choice for your next project. My testing didn't immediately break it, even when I forgot to remove the plastic film on one of the thermal pads. The current pricing of it and its accessories seems to be roughly in the ballpark of the Raspberry Pi 5 8GB (based on prices in Estonia). Boards like the Zimaboard 2 (I have not tested it myself) are more expensive, but they're also catering to a slightly different audience and have better specs, like 2.5G Ethernet ports and SATA ports with power delivery suitable for running two 3.5" hard drives straight from the board.
It's hard to beat the bargain that you can get from a used mini PC or NAS, but it won't come with the charm, low power consumption and bragging rights that a single board computer gets you, especially if you're using it for an off-label use case like I am. 7 In the meantime, I'll keep rocking it as a home server. In case something noteworthy happens, I'll update this post, which is brought to you by the very same LattePanda IOTA at the time of publishing.

1. this also marks the first time that I've been sent a review sample throughout the course of running this blog!
2. do note that with most M.2 PCIe->SATA adapters, the controller of the adapter determines how good of an experience you will have. With some, I've read that the controllers may not handle some failure scenarios well, one device having issues may throw off the whole controller, and now you have a bigger mess.
3. the earliest PC motherboard with a gigabit Ethernet connection that I've personally used was manufactured in 2006. That's how long gigabit Ethernet has been around for in the consumer space.
4. say that 10 times in a row!
5. I know, that usually does not happen on this blog.
6. being a prolific open source influencer does not bring in as much money as you'd think, so I haven't bought a proper capture device yet.
7. no, but seriously, I cannot be the only one who has a strange affection towards SBC-s with their bare PCB-s. I can't tell a capacitor from a resistor, but the boards are just so damn cool, right?

Accessories included with the review unit:
- active cooler
- M.2 M-key expansion board
- 51W PoE expansion board
- M.2 4G LTE expansion board
- UPS expansion board

Key specs:
- CPU: Intel N150, 4 cores, 4 threads, up to 3.6 GHz
- RAM: 8/16 GB (depending on model)
- Onboard storage: 64/128GB eMMC (depending on model)
- Networking: gigabit Ethernet port
- Real-time clock: yes!
USB storage devices used in the read stress test:
- USB hard drive (Seagate Basic)
- USB SATA SSD (Samsung QVO 4TB in ICY BOX USB-SATA adapter)
- USB NVMe SSD (512 GB Samsung PM9A1 with some random cheap USB to M.2 NVMe adapter)

CPU behaviour under an all-core load:
- 2.9 GHz for a short time period (10-15 seconds), with the CPU hovering around 80°C
- 2.2-2.3 GHz after that, with the CPU dropping to around 70°C

UEFI settings screenshots:
- Advanced -> ACPI
- Advanced -> CPU configuration
- Advanced -> Super IO configuration
- Advanced -> Serial port 1 configuration
- Advanced -> SMART Fan Control
- Advanced -> Trusted Computing
- Advanced -> NVMe configuration (no device connected at time of screenshot, oops)
- Advanced -> Power configuration
- Advanced -> USB configuration
- Advanced -> Serial Port console redirection
- Advanced -> SDIO configuration
- Advanced -> Realtek PCIe Ethernet controller
- Chipset -> System Agent (SA) configuration
- Chipset -> Device configuration
- Security -> Secure Boot
- Save & Exit

Pat Shaughnessy 1 week ago

Compiling Ruby To Machine Language

I've started working on a new edition of Ruby Under a Microscope that covers Ruby 3.x. I'm working on this in my spare time, so it will take a while. Leave a comment or drop me a line and I'll email you when it's finished. Here's an excerpt from the completely new content for Chapter 4, about YJIT and ZJIT. I'm still finishing this up… so this content is fresh off the page! It's been a lot of fun for me to learn about how JIT compilers work and to brush up on my Rust skills as well. And it's very exciting to see all the impressive work the Ruby team at Shopify and other contributors have done to improve Ruby's runtime performance. To find hot spots, YJIT counts how many times your program calls each function or block. When this count reaches a certain threshold, YJIT stops your program and converts that section of code into machine language. Later, Ruby will execute the machine language version instead of the original YARV instructions. To keep track of these counts, YJIT saves an internal counter alongside the YARV instruction sequence for each function or block. Figure 4-5 shows the YARV instruction sequence the main Ruby compiler created for the sum += i block at (3) in Listing 4-1. At the top, above the YARV instructions, Figure 4-5 shows two YJIT-related values: jit_entry and jit_entry_calls. As we'll see in a moment, jit_entry starts as a null value but will later hold a pointer to the machine language instructions YJIT produces for this Ruby block. Below jit_entry, Figure 4-5 also shows jit_entry_calls, YJIT's internal counter. Each time the program in Listing 4-1 calls this block, YJIT increments the value of jit_entry_calls. Since the range at (1) in Listing 4-1 spans from 1 through 40, this counter will start at zero and increase by 1 each time Range#each calls the block at (3). When jit_entry_calls reaches a particular threshold, YJIT will compile the YARV instructions into machine language.
By default, for small Ruby programs, YJIT in Ruby 3.5 uses a threshold of 30. Larger programs, like Ruby on Rails web applications, use a larger threshold value of 120. (You can also change the threshold by passing --yjit-call-threshold when you run your Ruby program.) While compiling your Ruby program, YJIT saves the machine language instructions it creates into YJIT blocks. YJIT blocks, which are distinct from Ruby blocks, each contain a sequence of machine language instructions for a range of corresponding YARV instructions. By grouping YARV instructions and compiling each group into a YJIT block, YJIT can produce more optimized code that is tailored to your program's behavior and avoid compiling code that your program doesn't need. As we'll see next, a single YJIT block doesn't correspond to a Ruby function or block. YJIT blocks instead represent smaller sections of code: individual YARV instructions or a small range of YARV instructions. Each Ruby function or block typically consists of several YJIT blocks. Let's see how this works for our example. After the program in Listing 4-1 executes the Ruby block at (3) 29 times, YJIT will increment the jit_entry_calls counter again, just before Ruby runs the block for the 30th time. Since jit_entry_calls reaches the threshold value of 30, YJIT triggers the compilation process. YJIT compiles the first YARV instruction getlocal_WC_1 and saves machine language instructions that perform the same work as getlocal_WC_1 into a new YJIT block: On the left side, Figure 4-6 shows the YARV instructions for the sum += i Ruby block. On the right, Figure 4-6 shows the new YJIT block corresponding to getlocal_WC_1. Next, the YJIT compiler continues and compiles the second YARV instruction from the left side of Figure 4-7: getlocal_WC_0 at index 2. On the left side, Figure 4-7 shows the same YARV instructions for the sum += i Ruby block that we saw above in Figure 4-6.
But now the two dotted arrows indicate that the YJIT block on the right contains the machine language instructions equivalent to both getlocal_WC_1 and getlocal_WC_0 . Let’s take a look inside this new block. YJIT compiles, or translates, the Ruby YARV instructions into machine language instructions. In this example, running on my Mac laptop, YJIT writes the following machine language instructions into this new block: Figure 4-8 shows a closer view of the new YJIT block that appeared on the right side of Figures 4-6 and 4-7. Inside the block, Figure 4-8 shows the assembly language acronyms corresponding to the ARM64 machine language instructions that YJIT generated for the two YARV instructions shown on the left. The YARV instructions on the left are: getlocal_WC_1 , which loads a value from a local variable located in the previous stack frame and saves it on the YARV stack, and getlocal_WC_0 , which loads a local variable from the current stack frame and also saves it on the YARV stack. The machine language instructions on the right side of Figure 4-8 perform the same task, loading these values into registers on my M1 microprocessor: x1 and x9 . If you’re curious and would like to learn more about what the machine language instructions mean and how they work, the section “Adding Two Integers Using Machine Language” discusses the instructions for this example in more detail. Next, YJIT continues down the sequence of YARV instructions and compiles the opt_plus YARV instruction at index 4 in Figures 4-6 and 4-7. But this time, YJIT runs into a problem: It doesn’t know the type of the addition arguments. That is, will opt_plus add two integers? Or two strings, floating point numbers, or some other types? Machine language is very specific. To add two 64-bit integers on an M1 microprocessor, YJIT could use the adds assembly language instruction. But adding two floating point numbers would require different instructions.
And, of course, adding or concatenating two strings is an entirely different operation. In order for YJIT to know which machine language instructions to save into the YJIT block for opt_plus , YJIT needs to know exactly what type of values the Ruby program might ever add at (3) in Listing 4-1. You and I can tell by reading Listing 4-1 that the Ruby code is adding integers. We know right away that the sum += i block at (3) is always adding one integer to another. But YJIT doesn’t know this. YJIT uses a clever trick to solve this problem. Instead of analyzing the entire program ahead of time to determine all of the possible types of values the opt_plus YARV instruction might ever need to add, YJIT simply waits until the block runs and observes which types the program actually passes in. YJIT uses branch stubs to achieve this wait-and-see compile behavior, as shown in Figure 4-9. Figure 4-9 shows the YARV instructions on the left, and the YJIT block for indexes 0000-0002 on the right. But note the bottom right corner of Figure 4-9, which shows an arrow pointing down from the block to a box labeled stub. This arrow represents a YJIT branch. Since this new branch doesn’t point to a block yet, YJIT sets up the branch to point to a branch stub instead.
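Here is a toy C model of that wait-and-see behavior (all names invented; real YJIT patches jumps inside generated machine code rather than swapping function pointers): the branch starts out pointing at a stub, the first execution observes the operand type and "compiles" a block specialized for it, and the branch is patched so later runs skip the stub entirely.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { T_INT, T_FLOAT } value_type;
typedef long (*block_fn)(long, long, value_type);

typedef struct { block_fn target; } branch_t;

/* Specialized "YJIT block" for integer addition. */
static long add_ints(long a, long b, value_type t) { (void)t; return a + b; }

static branch_t plus_branch = { NULL }; /* NULL: still points at the stub */
static int stub_hits = 0;

/* First call falls into the stub path: observe the operand type,
 * "compile" a specialized block for it, and patch the branch so
 * every later call jumps straight to the generated code. */
long run_plus(long a, long b, value_type t) {
    if (plus_branch.target == NULL) {
        stub_hits++;
        assert(t == T_INT);          /* this sketch only specializes integers */
        plus_branch.target = add_ints;
    }
    return plus_branch.target(a, b, t);
}
```

The payoff of the scheme is visible in the counter: no matter how many times the addition runs afterward, the stub (and the compiler behind it) is entered only once.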

0 views

How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service

How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service Jingkai He, Yunpeng Dong, Dong Du, Mo Zou, Zhitai Yu, Yuxin Ren, Ning Jia, Yubin Xia, and Haibo Chen SOSP'25 Systems spend a lot of time copying bytes from here to there. Fig. 2(a) shows data for some benchmarks. Copy operations are bucketized by size and by whether the copy occurs on behalf of user or kernel code. Source: https://dl.acm.org/doi/10.1145/3731569.3764800 One interesting piece of data missing from that figure is how much CPU time is spent waiting on cache misses. A more subtle point is how long applications wait between when a copy operation completes and when they actually read from the destination buffer. The authors call this the copy-use window . Fig. 3 has some data on this point. Bars above the gray triangle represent opportunities for latency hiding. Source: https://dl.acm.org/doi/10.1145/3731569.3764800

API

This paper proposes a new OS service ( Copier ), which performs copy operations on behalf of user and kernel clients. Clients interact with Copier through a set of queues. Copy requests are submitted via descriptors in the Copy Queue . Descriptors contain pointers to bitmaps, which clients can use to monitor progress. Each bit in the bitmap corresponds to the completion of a single segment of a copy. For example, a client could submit a 1MiB copy but read data from 4KiB destination segments as soon as they are written. The Sync Queue is used to raise the priority of an already-submitted task. This is a fallback for when the application has left the copy-use window and has nothing useful to do other than wait for the copy to make progress. The Handler Queue enables arbitrary functions to be called when a copy operation completes. Applications interact with these queues indirectly via a set of functions (e.g., to initiate a copy, to wait for an operation to complete).

Copy Absorption

Because Copier is a centralized service, it can perform global optimizations.
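The segment-granularity bitmap can be sketched in C as follows. The names are invented and the copier is simulated single-threaded here; in the real service, copier threads run concurrently while the client polls the shared bitmap and consumes finished destination segments early.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 4096
#define NSEGS    8     /* a small 32 KiB "copy" for the sketch */

typedef struct {
    const uint8_t *src;
    uint8_t       *dst;
    uint64_t       done_bits;  /* bit i set => segment i fully copied */
} copy_desc;

/* One unit of copier progress: copy the lowest pending segment. */
void copier_step(copy_desc *d) {
    for (int i = 0; i < NSEGS; i++) {
        if (!(d->done_bits & (1ULL << i))) {
            memcpy(d->dst + (size_t)i * SEG_SIZE,
                   d->src + (size_t)i * SEG_SIZE, SEG_SIZE);
            d->done_bits |= 1ULL << i;   /* publish completion of segment i */
            return;
        }
    }
}

/* Client side: may segment i of the destination be read already? */
int segment_ready(const copy_desc *d, int i) {
    return (int)((d->done_bits >> i) & 1);
}
```

The key property is that `segment_ready` can return true for early segments long before `done_bits` is all ones, which is what lets an application start reading the destination inside the copy-use window.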
An example given in the paper involves two copies. First, a large amount of data is copied from a kernel buffer to an intermediate buffer ( → ). Next, a smaller amount of data is copied from the intermediate buffer into the database ( → ). Copier can detect this situation and only read a subset of the bytes from . One trick that Copier uses is watching calls to the API to know when data can be accessed. For example, if is never called on , and only called on a subset of , then Copier knows that only a subset of needs to be read. Clients can expose more global optimization opportunities to Copier by marking copy tasks as lazy. Lazy copy tasks do not execute unless they need to. An application could supply this hint if it knows that the destination of a copy operation will likely only be used as the source of a future copy operation. Copier will defer copying bits associated with the first copy operation, under the assumption that will never be called on it. Subsequent copy operations will cut out the middleman and read from the original source.

Copy Threads

The task of actually copying bits is delegated to a set of kernel threads. The number of threads dynamically scales with the amount of work. These threads use AVX instructions in parallel with dedicated DMA hardware . One issue with this setup is that these kernel threads could potentially fault if a specified userspace address is inaccessible (the data is paged out, or the client submitted a bad address). Copier proactively avoids this by locking pages in memory for the duration of the copy (this step is also needed when using DMA hardware).

Results

Fig. 11 shows benchmark results for Redis; section 6 of the paper has similar results for other use cases. Source: https://dl.acm.org/doi/10.1145/3731569.3764800

Dangling Pointers

Copies that hit in cache are very different from those which do not. I think more exploration here is warranted.
Section 6.3.5 points out that Copier has pros and cons related to CPU caches: Pro: because Copier uses separate threads, it hopefully runs on separate cores than the application. This can prevent large copy operations from evicting hot application data. Con: when Copier finishes copying data, the destination bytes are likely not in the L1/L2 of an application core, causing it to have to read data from LLC/DRAM. Because copy operations are spread throughout the stack, careful design of the API is important. For example, the paper describes synchronization between user and kernel requests for the same process, but what about synchronization across two processes?

0 views

DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge

DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge Arjun Kashyap, Yuke Li, and Xiaoyi Lu HPDC'25 This paper ends with a counterintuitive result: DPUs aren’t amazingly better than traditional CPUs at implementing key-value stores. You would think they would have a shot, given that they have specialized networking accelerators, and key-value stores don’t require a lot of general-purpose computation. It all comes down to access to off-chip memory. Table 1 and Fig. 2 contain profiling results which motivate the designs presented by this paper: Source: https://dl.acm.org/doi/10.1145/3731545.3731571 When a key-value store is run on a traditional CPU, most of the time is spent in packet processing rather than CRUD operations on (key, value) tuples. DPUs are chips that are specialized for packet processing, so one would think a DPU would be especially helpful in this situation. The designs in this paper are evaluated on NVIDIA BlueField 2 and BlueField 3 DPUs. These DPUs contain a mix of fixed-function hardware and ARM cores. The fundamental idea proposed in this paper is to make the CPU’s job easier by having the DPU perform some of the awkward packet processing work and pass CRUD requests to the CPU in a convenient format. The DPU parses incoming packets, extracts the relevant fields, and writes the minimal amount of information (operation to perform, key, value) to queues in host memory. Fig. 8(d) shows the amount of cruft the DPU is able to remove from each CRUD packet. Source: https://dl.acm.org/doi/10.1145/3731545.3731571 The design described in this paper is optimized in all of the ways you would expect: requests are batched, and appropriate pipelining is used to avoid idle time. Cross-core synchronization (across both DPU and host CPU cores) is minimized by the presence of many queues. Each ARM core owns a unique set of queues, and each host CPU core is assigned to one or more queues. As Fig.
11 shows, the design described so far ( DPU-KV-lat and DPU-KV-sav ) doesn’t offer significant speedups. There are latency improvements, but throughput suffers. These designs are bound by the DPU. Evidence for this comes from the performance of DPU-KV-dual and DPU-KV-shrd . DPU-KV-dual allows idle host CPU cores to process raw packets. DPU-KV-shrd uses sharding: the set of keys is partitioned, with some keys handled by the CPU and some keys handled by the DPU. Source: https://dl.acm.org/doi/10.1145/3731545.3731571

Dangling Pointers

The moral of the story is that there is no special advantage conferred on an ARM core inside of a SmartNIC over an Intel core inside of the host CPU. It is interesting to compare this work to a key-value store implemented directly in an RMT pipeline . It would be interesting to drill down into the profiling numbers which motivated this paper and understand how much memory-level parallelism a traditional CPU core can utilize. At the microarchitectural level, this problem has to be memory bound; it would be interesting to see if it is bound by memory latency or bandwidth.

0 views
<antirez> 2 weeks ago

Scaling HNSWs

I’m taking a few weeks of pause on my HNSWs developments (now working on some other data structure, news soon). At this point, the new type I added to Redis is stable and complete enough, it’s the perfect moment to reason about what I learned about HNSWs, and turn it into a blog post. That kind of brain dump that was so common pre-AI era, and now has become, maybe, a bit more rare. Well, after almost one year of thinking and implementing HNSWs and vector similarity stuff, it is time for some writing. However this is not going to be an intro on HNSWs: too many are present already. This is the “extra mile” instead. If you know HNSWs, I want to share with you my more “advanced” findings, especially in the context of making them fast enough to allow for a “Redis” experience: you know, Redis is designed for low latency and high performance, and HNSWs are kinda resistant to that, so there were challenges to expose HNSWs as an abstract data structure. This blog post will be split into several sections. Think of them as pages of the same book, different chapters of the same experience. Oh and, by the way, I already wrote and subsequently lost this blog post :D [long, sad story about MacOS and bad habits – I hadn’t lost something like that since the 90s, during blackouts], so here most of the problem will be to recall what I wrote a few days ago and, while I’m at it, to better rephrase what I didn’t like very much. ## A few words about the state of HNSW Before digging into the HNSWs internals and optimizations, I want to say a few things about HNSWs. The original paper introducing HNSWs is a great piece of computer science literature, and HNSWs are amazing data structures, but: I don’t believe they are the last word for searching, in a greedy way, for nearby vectors according to a distance function. The paper gives the feeling it lacks some “pieces”, almost like if the researchers, given six months more, had a lot more to explore and say. 
For instance, I modified the paper myself, extending it in order to support removal of entries, actual removals, not just tombstone deletions where the element is marked as gone and collected later: deleting items is totally missing from the paper. Similarly, there are, right now, efforts in order to really check if the “H” in the HNSWs is really needed, and if instead a flat data structure with just one layer would perform more or less the same (I hope I’ll cover more about this in the future: my feeling is that the truth is in the middle, and that it makes sense to modify the level selection function to just have levels greater than a given threshold). All this to say that, if you are into data structures research, I believe that a great area is to imagine evolutions and refinements of HNSWs, without getting trapped within the idea that the evolutions are only in the sense of: let’s do it, but for disk (see Microsoft efforts), or the like. Ok, enough with the premise, let’s go to the actual low level stuff :)

## Scaling memory

Redis is an in-memory system, and both HNSWs and vectors have the unfortunate quality of being very space-hungry. There are three reasons for this:

1. HNSWs have a lot of pointers to neighbor nodes, like 16, 32 or more per node (this is a tunable parameter of HNSWs).
2. HNSWs have many levels, being a skiplist-alike data structure. This exacerbates the first problem.
3. HNSW’s satellite data is a vector of floating point numbers, so, in the vanilla case, 4 bytes per component, and normally you have 300-3000 components: this is the usual range.

So, what are the lessons learned here? There are folks that compress pointers, since it is very likely that many pointers (8 bytes in 64 bit systems) will have the highest four bytes all the same. This is smart, I didn’t implement it yet, because in Redis I need to go fast, and this is a tradeoff between space and time: but maybe it is worth it, maybe not. I’ll dig more.
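The pointer-compression idea (which, to be clear, vector-sets does not implement; this is just a sketch of what those folks do) amounts to storing 4-byte offsets into a single allocation arena instead of full 8-byte pointers:

```c
#include <assert.h>
#include <stdint.h>

/* If every node is allocated from one arena smaller than 4 GiB, the
 * high four bytes of all node pointers are identical, so a neighbor
 * link can be a 4-byte offset from the arena base instead of a full
 * 8-byte pointer, halving the per-link cost. */
static uint8_t *arena_base;

typedef uint32_t node_ref;   /* compressed 4-byte "pointer" */

static inline node_ref node_compress(const void *node) {
    return (node_ref)((const uint8_t *)node - arena_base);
}

static inline void *node_expand(node_ref ref) {
    return arena_base + ref;
}
```

The space/time tradeoff mentioned above is the extra add on every neighbor dereference, which sits on the hottest path of the greedy search.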
However, if you do the math, the fact that there are many layers is not *so* terrible as it looks. On average, the multiple layers per node make the situation worse by just ~1.3x (if the probability of level increase is 0.25 in the level selection function), since many nodes will be just at layer 0. But still 1.3 is more than 1, and if that “H” in HNSWs really is not *so* useful… [Spoiler: what I found is that the seek time if you have everything at layer 0 is greater; the main loop of the greedy search will start from less optimal places and will eventually reach the right cluster, but will take more computation time. However these are just early results.] So here the *real* low hanging fruit is: vector quantization. What I found is that if you use 8 bit quantization, what you get is an almost 4x speedup, a 4x reduction of your vectors (but not a 4x reduction of the whole node: the pointers are still there, and they take a lot of space), and a recall that is virtually the same in real world use cases. This is the reason why Redis Vector Sets use 8 bit quantization by default. You can specify, via VADD options, that you want full precision vectors or binary quantized vectors, where we just take the sign, but I’m skeptical about using both full size vectors and binary quantized vectors. Before talking about them, let’s see what kind of quantization I used for 8 bit. What I do is to compute the maximum absolute value of the components of each vector (so quantization is per-vector), then I use signed 8 bit values to represent the quants from -127 to 127. This is not as good as storing both min and max value, but it is faster when computing cosine similarity, since I can do this:

```c
/* Each vector is quantized from [-max_abs, +max_abs] to [-127, 127],
 * where range = 2*max_abs. */
const float scale_product = (range_a/127) * (range_b/127);
```

Then I multiply things together in the integer domain (actually in the code the main loop is unrolled and uses multiple accumulators, to make modern CPUs more busy):

```c
for (; i < dims; i++) dot0 += (int32_t)quants_a[i] * (int32_t)quants_b[i];
```

And finally we can return back to the floating point distance with:

```c
float dotf = dot0 * scale_product;
```

Check vectors_distance_q8() for more information, but I believe you got the idea: it is very simple to go from the integer quants domain back to the unquantized dot product with trivial operations. So, 8 bit quantization is a great deal, and full precision was a *needed* feature, because there will be people doing things with vectors generated in a way where each small amount makes a difference (no, with learned vectors this is not the case…). But why binary quantization? Because I wanted users to have a simple way to not waste space when their *original* information is already binary. Imagine you have a set of users and they have yes/no properties, and you want to find similar users, items, whatever. Well: this is where binary quantization should be used; it’s just, again, an option of the VADD command.

## Scaling speed: threading and locality

Oh, you know, I have to tell you something about myself: I’m not a fan of threaded systems when it is possible to do a lot with a single core, and then use multiple cores in a shared-nothing architecture. But HNSWs are different. They are *slow*, and they are accessed almost always in read-only ways, at least in most use cases. For this reason, my Vector Sets implementation is fully threaded. Not just reads: even writes are partially threaded, and you may wonder how this is possible without it resulting in a mess, especially in a system like Redis, where keys can be accessed in different ways by the background saving process, the clients, and so forth. Well, to start, let’s focus on reads.
What happens is that as long as nobody is writing in the data structure, we can spawn threads that do the greedy collection of near vectors and return the results to the blocked client. However, my implementation of HNSWs was written from scratch, I mean, from the empty C file opened with vim; it has 0% of shared code with the two implementations most other systems use, so there are a few “novelties”. One of such different things is that in order to avoid re-visiting already visited nodes, I use an integer stored in each node that is called the “epoch”, instead of marking already visited nodes in another data structure (like a hash table), which I believe is quite slow. The epoch is local to the node, and the global data structure increments its epoch for each search. So, in the context of each search, we are sure that a node whose stored epoch is lower than the current one has not been visited yet. But with threads, there are multiple searches occurring at the same time! And, yep, what I needed was an array of epochs:

```c
typedef struct hnswNode {
    uint32_t level;     /* Node's maximum level */
    /* ... many other fields ... */
    uint64_t visited_epoch[HNSW_MAX_THREADS];
} hnswNode;
```

That’s what you can read in hnsw.h. This is, again, a space-time tradeoff, and again time won against space. So, how was it possible to have threaded writes? The trick is that in HNSW inserts, a lot of time is spent looking for neighbor candidates. So writes are split into a reading half and a commit half; only the second needs a write lock, and there are a few tricks to make sure that the candidates we accumulated during the first part are discarded if the HNSW changed in the meantime and some nodes may no longer be valid. There is, however, another problem: what about the user deleting the key while background threads are working on the value? For this scenario, we have a function that waits for background operations to return before actually reclaiming the object.
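The per-thread epoch trick can be sketched like this (simplified names and types, not the actual vector-sets code): each node keeps one epoch slot per worker thread, a search bumps its thread's global epoch, and a node counts as visited only if its slot equals the current epoch.

```c
#include <assert.h>
#include <stdint.h>

#define HNSW_MAX_THREADS 4

typedef struct {
    uint64_t visited_epoch[HNSW_MAX_THREADS];
    /* ... vector, neighbors, level, ... */
} node_t;

typedef struct {
    uint64_t current_epoch[HNSW_MAX_THREADS];  /* one per worker thread */
} hnsw_t;

/* Starting a search just bumps this thread's epoch: no O(n) reset of
 * per-node flags, and no per-search hash table of visited nodes. */
void search_begin(hnsw_t *h, int tid) { h->current_epoch[tid]++; }

/* Returns 1 the first time a node is seen in this search, 0 after:
 * a node is "visited" iff its slot equals the thread's current epoch. */
int visit_once(hnsw_t *h, node_t *n, int tid) {
    if (n->visited_epoch[tid] == h->current_epoch[tid]) return 0;
    n->visited_epoch[tid] = h->current_epoch[tid];
    return 1;
}
```

Because slots are indexed by thread ID, concurrent searches never touch each other's bookkeeping, at the cost of 8 bytes per node per thread.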
With these tricks, it is easy to get 50k ops/sec on real world vector workloads, and these are numbers I got from redis-benchmark itself, with all the overhead involved. The raw numbers of the flat HNSW library itself are much higher.

## Scaling memory: reclaiming it properly

Before talking about how to scale HNSWs into big use cases with multiple instances involved, and why Redis Vector Sets expose the actual data structure in the face of the user (I believe programmers are smart and don’t need babysitting, but it’s not *just* that), I want to go back and talk again about memory, because there is an interesting story to tell about this specific aspect. Most HNSW implementations are not able to reclaim memory directly when you delete a node from the graph. I believe there are two main reasons for that:

1. People misunderstand the original HNSW paper in a specific way: they believe links can be NOT reciprocal among neighbors. And there is a specific reason why they think so.
2. The paper does not say anything about deletion of nodes and how to fix the graph after nodes go away and we get missing links in the “web” of connections.

The first problem is a combination (I believe) of lack of clarity in the paper and the fact that, while implementing HNSWs, people face a specific problem: when inserting a new node, and good neighbors are searched among existing nodes, often the candidates already have the maximum number of outgoing links. What to do, in this case? The issue is often resolved by linking unidirectionally from the new node we are inserting to the candidates that are already “full” of outgoing links. However, when you need to delete a node, you can no longer resolve all its incoming links, so you can’t really reclaim memory. You mark it as deleted with a flag, and later sometimes there is some rebuilding of the graph to “garbage collect” stale nodes; sometimes memory is just leaked.
So, to start, my implementation in Redis does things differently, by forcing links to be bidirectional: if A links to B, B links to A. But how to do so, given that A may already be full? Well, this gets into complicated territory, but what happens is that heuristics are used to drop links from existing nodes whose other neighbors are well connected, when our node is a better candidate even for the target node; and if this is not true, there are other ways to force a new node to have at least a minimal number of links, always trying to satisfy the small world property of the graph. This way, when Redis deletes a node from a Vector Set, it always has a way to remove all the pointers to it. However, what to do with the remaining nodes that are now missing a link? What I do is to create a distance matrix among them, in order to try to link the old node’s neighbors among themselves, trying to minimize the average distance. Basically, for each pair of nodes i,j in our matrix, we calculate how good their connection is (how similar their vectors are) and how badly linking them affects the *remaining* possible pairs (since there could be elements left without good pairs, if we link two specific nodes). After we build this matrix of scores, we then proceed with a greedy pairing step. This works so well that you can build a large HNSW with millions of elements, later delete 95% of all your elements, and the remaining graph still has good recall, no isolated nodes, and so forth. That is what I mean when I say that there is space in HNSWs for new papers to continue the work.

## Scaling HNSWs to multiple processes

When I started to work on Redis Vector Sets, there was already a vector similarity implementation in Redis-land, specifically as an index type of RediSearch, and this is how most people think of HNSWs: as a form of indexing of existing data. Yet I wanted to provide Redis with a new HNSW implementation exposed in a completely different way. Guess how?
As a data structure, of course. And this tells a story about how Redis-shaped my head is after so many years, or maybe it was Redis-shaped since the start, and it is Redis that is shaped after my head, since I immediately envisioned how to design a Redis data structure that exposed HNSWs to the users, directly, and I was puzzled that the work with vectors in Redis was not performed exactly like that. At the same time, when I handed my design document to my colleagues at Redis, I can’t say that they immediately “saw” it as an obvious thing. My reasoning was: vectors are like scores in Redis Sorted Sets, except they are not scalar scores where you have a total order. Yet you can VADD and VREM elements, and then you can call VSIM instead of ZRANGE in order to get *similar* elements. This made sense not just as an API: I thought of HNSWs as strongly composable, and not linked to a specific use case (not specific to text embeddings, or image embeddings, or even *learned* embeddings necessarily). You do:

    VADD my_vector_set VALUES [… components …] my_element_string

So whatever is in your components, Redis doesn't care: when you call VSIM it will report similar elements. But this also means that, if you have different vectors about the same use case split across different instances / keys, you can send the same query vector to all the instances with VSIM, add the WITHSCORES option (that returns the cosine distance), merge the results client-side, and you have magically scaled your hundreds of millions of vectors into multiple instances, splitting your dataset N ways. [One interesting thing about such a use case is that you can query the N instances in parallel using multiplexing, if your client library is smart enough.] Another very notable thing about HNSWs exposed in this raw way is that you can finally scale writes very easily. Just hash your element modulo N, and target the resulting Redis key/instance.
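That hashing rule is a one-liner on the client side; here is a sketch, with FNV-1a as an arbitrary hash choice (any decent hash works, as long as every writer and reader uses the same one):

```c
#include <assert.h>
#include <stdint.h>

/* Map an element name to one of nshards Vector Set keys/instances. */
unsigned shard_for(const char *element, unsigned nshards) {
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a 64-bit offset basis */
    for (; *element; element++) {
        h ^= (unsigned char)*element;
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return (unsigned)(h % nshards);
}
```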
Multiple instances can absorb the (slow, but still fast for HNSW standards) writes at the same time, parallelizing an otherwise very slow process. This way of exposing HNSWs also scales down in a very significant way: sometimes you want an HNSW for each user / item / product / whatever you are working with. This is very hard to model if you have an index on top of something, but it is trivial if your HNSWs are data structures: you can just have a Vector Set key for each of your items, with just a handful of elements. And of course, like with any other Redis key, you can set an expiration time on the key, so that it will be removed automatically later. All this can be condensed into a rule that I believe should be more present in our industry: many programmers are smart, and if instead of creating a magic system they have no access to, you show them the data structure and the tradeoffs, they can build more things, and model their use cases in specific ways. And your system will be simpler, too.

## Scaling loading times

Without threading, my HNSW library can insert word2vec vectors (300 components each) at 5,000 elements/second on a single thread, and can query the resulting HNSW at 90k queries per second. As you can see, there is a large gap. This means that loading back an HNSW with many millions of elements from a Redis dump file into memory would take a lot of time, and this time would impact replication as well. Not great. But this is true only if we add elements from disk to memory in the most trivial way, that is, storing “element,vector” on disk and then trying to rebuild the HNSW in memory. There is another lesson to learn here: when you use HNSWs, you need to serialize the nodes and their neighbors as they are, so you can rebuild everything in memory by just allocating stuff and turning neighbor IDs into pointers. This resulted in a 100x speedup. But do you really believe the story ends here? Hehe.
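The ID-to-pointer fixup can be sketched like this (simplified types and invented names; the real loader must also validate everything, which is exactly the corruption problem discussed next):

```c
#include <assert.h>
#include <string.h>

#define MAX_NEIGHBORS 16

typedef struct node {
    int          n_neighbors;
    int          neighbor_ids[MAX_NEIGHBORS]; /* as stored on disk */
    struct node *neighbors[MAX_NEIGHBORS];    /* rebuilt in memory */
} node;

/* Second pass after loading: turn serialized IDs (array indexes) back
 * into pointers. Returns 0 on an out-of-range value, which the loader
 * must treat as corruption and abort. */
int fix_neighbor_pointers(node *nodes, int count) {
    for (int i = 0; i < count; i++) {
        if (nodes[i].n_neighbors < 0 || nodes[i].n_neighbors > MAX_NEIGHBORS)
            return 0;
        for (int j = 0; j < nodes[i].n_neighbors; j++) {
            int id = nodes[i].neighbor_ids[j];
            if (id < 0 || id >= count) return 0;
            nodes[i].neighbors[j] = &nodes[id];
        }
    }
    return 1;
}
```

This is why the format is fast to load: no distance computations at all, just one allocation pass and one pointer-fixing pass.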
Redis recently gained stronger security features and avoids doing bad things even when the RDB file is corrupted by an attacker. So what I needed to do was to make sure the HNSW is valid after loading, regardless of the errors and corruption in the serialized data structure. This involved many tricks, but I want to take the liberty to just dump one comment I wrote here, as I believe the reciprocal check is particularly cool:

```c
/* Second pass: fix pointers of all the neighbors links.
 * As we scan and fix the links, we also compute the accumulator
 * register "reciprocal", that is used in order to guarantee that all
 * the links are reciprocal.
 *
 * This is how it works: we hash (using a strong hash function) the
 * following key for each link that we see from A to B (or vice versa):
 *
 *     hash(salt || A || B || link-level)
 *
 * We always sort A and B, so the same link from A to B and from B to A
 * will hash the same. Then we xor the result into the 128 bit accumulator.
 * If each link has its own backlink, the accumulator is guaranteed to
 * be zero at the end.
 *
 * Collisions are extremely unlikely to happen, and an external attacker
 * can't easily control the hash function output, since the salt is
 * unknown, and they would also need to control the pointers.
 *
 * This algorithm is O(1) for each node so it is basically free for
 * us, as we scan the list of nodes, and runs in constant and very
 * small memory. */
```

## Scaling use cases: JSON filters

I remember the day when the first working implementation of Vector Sets felt complete. Everything worked as expected, and it was the starting point for the refinements and the extra features. However, in the past weeks and months I internally received the feedback that most use cases need some form of mixed search: you want vectors near a given query vector (like the movies most similar to something), but also with some kind of filtering (only released between 2000 and 2010).
My feeling is that you need to query by different parameters less often than product people believe, and that most of the time you can obtain this more efficiently by adding, in this specific case, each year to a different vector set key (this is another instance of the composability of HNSWs exposed as data structures versus a kind of index). However, I was thinking about the main loop of the HNSW greedy search, which is something like this:

```
// Simplified HNSW greedy search algorithm. Don't trust it too much.
while (candidates.len() > 0) {
    c = candidates.pop_nearest(query);
    worst_distance = results.get_worst_dist(query);
    if (distance(query, c) > worst_distance) break;
    foreach (neighbor from c) {
        if (neighbor.already_visited()) continue;
        neighbor.mark_as_visited();
        if (results.has_space() OR neighbor.distance(query) < worst_distance) {
            candidates.add(neighbor);
            results.add(neighbor);
        }
    }
}
return results;
```

So I started to play with the idea of adding a set of JSON metadata to each node. What if, once I have things like {"year": 1999}, this was enough to filter while I perform the greedy search? Sure, the search needs to be bounded, but there is a key insight here: I want, to start with, elements that are *near* the query vector, so I don't really need to explore the whole graph if the condition on the JSON attributes is not satisfied by many nodes. I'll let the user specify the effort, and anyway very far away results that match the filter are useless. So that's yet another way in which my HNSW differs: it supports filtering by expressions similar to the ones you could write inside an "if" statement of a programming language. And your elements in the Vector Set can be associated with JSON blobs expressing their properties. Then you can do things like:

    VSIM movies VALUES … your vector components here … FILTER '.year >= 1980 and .year < 1990'

## A few words on memory usage

HNSW's fatal issue is — in theory — that they are normally served from memory.
Actually, you can implement HNSWs on disk, even if there are better data structures from the point of view of disk access latencies. However, in the specific case of Redis and Vector Sets the idea is to provide something that is very fast and easy to work with: the flexibility of in-memory data structures helps with that. So the question boils down to: is the memory usage really so bad? Loading the 3 million Word2Vec entries into Redis with the default int8 quantization takes 3GB of RAM, about 1KB per entry. Many use cases have just a few tens of millions of entries, or a lot less. And what you get back from HNSWs, if well implemented and in memory, is very good performance, which is crucial for a data structure and a workload that is in itself slow by definition. On my MacBook I get 48k ops per second with redis-benchmark and VSIM against this key (holding the word2vec dataset).

My feeling is that the memory usage of in-memory HNSWs is very acceptable for many use cases. And even in the use cases where you want the bulk of your vectors on disk, even at the price of slower performance, your hot set should likely be served from RAM. This is one of the reasons why I believe that being active in HNSW research is a good idea: I don't think they will be replaced anytime soon for most use cases. It seems more likely that we will continue to have different data structures that are ideal for RAM and for disk, depending on the use cases and data size. Moreover, what I saw recently, even just scanning the Hacker News front page, is people with a few million items fighting with systems that are slower or more complicated than needed. HNSWs, carefully exposed in the right way, can avoid all that.

## Conclusions

I like HNSWs, and working on and implementing them was a real pleasure.
I believe vectors are a great fit for Redis, even in an AI-less world (for instance, a few months ago I used them to fingerprint Hacker News users, replicating an old work published on HN in the past). HNSWs are simply too cool and powerful for a number of use cases, and with AI and learned embeddings, all this escalates to a myriad of potential use cases. However, like most features in Redis, I expect that a lot of time will pass before people realize they are useful and powerful, and how to use them (no, it's not just a matter of RAG). This also happened with Streams: finally there is mass adoption, after so many years.

If instead you are more interested in HNSW and the implementation I wrote, I believe the code is quite accessible, and heavily commented: https://github.com/redis/redis/blob/unstable/modules/vector-sets/hnsw.c

If you want to learn more about Redis Vector Sets, please feel free to read the README file I wrote myself. There is also the official Redis documentation, but I suggest you start from here: https://github.com/redis/redis/tree/unstable/modules/vector-sets

Thanks for reading such a long blog post! And have a nice day.

References. This is the paper about the "H" in HNSW and how useful it is -> https://arxiv.org/abs/2412.01940

0 views
JSLegendDev 2 weeks ago

What Caused Performance Issues in My Tiny RPG

In a previous post, I mentioned having strange performance issues with a tiny RPG I was secretly working on. The crux of the matter was that the game (built using web technologies) would run noticeably less smoothly when wrapped as a desktop app on my machine than when running in Firefox. I initially shared the project's executables on the KAPLAY Discord server (KAPLAY being the library I used to make the game in JavaScript) and no one reported the performance issues I had. Seeing this, I decided to make a Substack post inviting my wider audience to try the game out and report on its performance. This led to someone sharing the post on Hacker News, which resulted in a large influx of views and feedback. In this post, I would like to explain what went wrong (as best as I understood it myself).

First of all, I have mostly fixed the performance of my game. It now runs much more smoothly and I have updated the executables on itch. I would still greatly appreciate feedback regarding how these run on your machines. (Download the ones tagged with -v2.) Here's the link: https://jslegend.itch.io/small-rpg-performance-playtest

To build executables for Windows, Mac and Linux, I use either NW.js or GemShell. NW.js is similar to Electron but easier to set up. It packages a Chromium instance for rendering, which results in bloated executables, but the same rendering engine is used across platforms. GemShell is similar to Tauri in how it uses the operating system's webview to render the app rather than packaging a Chromium instance. This results in leaner executables, but different web engines with varying performance characteristics are used on different platforms. However, contrary to Tauri, GemShell is a one-click tool that generates executables for all three platforms. I wrote a post about it recently that delved into more detail. Since I wanted to build executables quickly, I decided to try out GemShell for this project. The executables that I distributed to get feedback were made with this tool.
KAPLAY is a library based on the concept of game objects and components. A game object is created from a list of components, many of which are ready-made and offer functionality out of the box. This makes the library very easy to understand and very productive. The issue is that it's by default less performant than other alternatives (Phaser, Excalibur, etc.). I use it because it's so fast to prototype game ideas in. However, I used it for this project because it was the tool I knew best that would allow me to immediately start working on the game and focus on game design. In hindsight, I should probably have invested in learning other, more performant alternatives to a sufficient level so that I could easily pivot away if the needs of my project demanded it.

This was often reported among people for whom the game didn't perform well. In KAPLAY, there is an option to cap the FPS at a maximum amount. It's used when first initializing the library. The idea behind this option is to enforce a consistent frame rate, resulting in more consistently smooth gameplay. However, setting this to 60 in my game resulted in the game eventually slowing down to a crawl, all while the debug FPS count still displayed 60 fps. I conclude that this happened when the machine running the game wasn't able to sustain 60 fps or more at all times. The fix was simple: remove it. Why did this option behave this way? Probably a bug in the library. A maintainer is currently working on it.

Since GemShell was used to make the executables, the underlying web engine used to render the game on a Mac is WebKit and not Chromium. While I knew that WebKit was less performant than Chromium, I didn't think it would be that noticeable. The apparent solution to this would be to move off of GemShell and use NW.js instead. However, it turns out that macOS does certain things that don't allow WebKit to use its full potential. The dev behind GemShell apparently has a fix for this situation.
Below is what they shared on their Discord, which you can join if you're interested in the tool. This would be great since it would be nice to have the best of both worlds: more consistent performance across platforms while still having lean executables. Since I'm still not done with the development of the game, I can afford to let the dev cook, as we say!

This is the strangest performance issue I experienced. While I could conceive of the game performing better on Firefox vs. Safari (since the latter uses WebKit), I was caught off guard by Chrome performing worse than Firefox. Chrome is supposed to be more performant due to the Chromium web engine. The Chrome profiler seemed to indicate that the way I was rendering text in my game probably needed to change.

KAPLAY Game Objects and Text Rendering

In KAPLAY, if you look at the examples provided in its playground, you'll find that it's often shown that to render text you first create a game object using a text component and then pass the text to that component. However, it turns out that this is inefficient, since game objects have a greater cost in terms of performance. For rendering simple text, it's far better to draw it directly in the draw loop. The same logic applies to drawing static images like backgrounds. Additionally, to not disable batching, it was important that all drawSprite calls be placed together in the draw loop before rendering text with drawText calls. Doing this led to better performance on Chrome and Safari.

However, there was one last obvious performance improvement I needed to make. In the battle section of my game, the player must avoid getting hit by a horde of projectiles thrown at them. A common optimization technique used in this case is to reuse bullets that leave the screen rather than creating new ones. This technique is known as object pooling.
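A minimal, framework-agnostic sketch of such a pool (illustrative only; none of this is KAPLAY API, and the caller is responsible for resetting a projectile's position and state on reuse):

```javascript
// Generic object pool: reuse projectiles that leave the screen instead of
// allocating new ones, which keeps garbage-collection pressure low.
class Pool {
  constructor(create) {
    this.create = create; // factory used only when the pool is empty
    this.free = [];
  }
  acquire() {
    return this.free.pop() ?? this.create();
  }
  release(obj) {
    this.free.push(obj); // caller resets/hides the object before release
  }
}
```

When a projectile exits the screen you `release` it instead of destroying it, and the next spawn `acquire`s the same object back, so steady-state gameplay allocates almost nothing.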
I had planned on implementing something like this eventually, but didn't do it, since I was developing the game using Firefox as my preview and the performance there was great. There weren't that many projectiles in the first place, so I felt that this would be useful later. However, considering that Chrome and Safari were struggling performance-wise, probably due to their garbage collectors working differently, I resigned myself to implementing it now. As expected, the game performed much better on both Chrome and Safari. For Chrome, I now had a constant 60 fps on my machine (MacBook Air M3, 16 GB RAM), but for Safari it was more around 45-60 fps. To stress test my pooling system, I added a bunch of projectiles. Here is footage.

I'm happy that I can now resume development and focus more on game design. However, what this adventure taught me is that I should probably invest my time in learning other frameworks and game engines. While I eventually ended up fixing the performance issues in my game, I can't help but think of scenarios where problems could arise later that are unfixable due to limitations of the tools I'm using. In that case, I would have to halt development to learn a new tool at a proficient level on top of having to reimplement the game, which would take a lot of time and result in my project getting significantly delayed. If I start learning something else right now, I can at least go faster if I eventually need to reimplement the game.

Finally, if you're interested in keeping up with the game or like technical posts like this one, I recommend subscribing so you don't miss out when new posts are published. Subscribe now

In the meantime, here are a few of my previous posts that could interest you!

0 views
JSLegendDev 3 weeks ago

I'm Making a Tiny RPG and I Need Feedback Regarding Performance

This past month, I’ve been working secretly on a small RPG game. While the game is not ready at all and I didn’t plan on talking about it, I’m now kind of forced to. I’ve been using JavaScript + the KAPLAY game library to make this game but I’ve been experiencing performance issues. However, it seems that others aren’t experiencing them so now I wonder, is it just my machine? I’m using a Macbook Air M3 with 16GB of RAM. Normally, things should be smooth and they are when playing the game in the browser via Firefox. However, since I’m making this game with Steam in mind, it’s especially important that the game performs well when wrapped as a desktop app. For this reason, I decided to reach out to you, my audience, for feedback. I have included a build of the unfinished game for Windows, Mac and Linux. It would be very nice if you could try it out on your machine. Additionally, recording gameplay and sharing the link in the comment section of this post would be greatly appreciated. Here is the link to the game in its unfinished current form : https://jslegend.itch.io/small-rpg-performance-playtest Below is a gameplay video to give you an idea of how the game is supposed to be played. You can move around with arrow keys and enter battle by overlapping with a star in the overworld. Performance issues, if any, occur mostly during battles. Thanks in advance! UPDATE 7/11/2025 : It seems that this post was shared on Hacker News and is getting more attention than usual! If you’re new to my Substack and are potentially interested in updates regarding this project, I recommend subscribing. Subscribe now In the meantime, you can read some of my previous posts!

0 views

High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance

High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance Maximilian Kuschewski, Jana Giceva, Thomas Neumann, and Viktor Leis SIGMOD'25

In database vernacular, spilling is the process of writing intermediate data to disk in order to evaluate a query with a finite amount of main memory. It goes without saying that database folks are performance-conscious control freaks and don't want to rely on generic OS paging mechanisms to handle working sets that are larger than main memory. This tidbit about Snowflake is fascinating: Only 5% of analytical queries in Snowflake's 2018 workload trace spill data, but those 5% contribute 45% of the overall CPU time and 29% of the total execution time [77].

One fundamental problem in this area is that it is very hard to predict up-front exactly how much working memory will be needed to efficiently execute a query. Databases have to estimate based on statistics of the relations used in a query, or assume no spilling will be needed and then gracefully fall back to spilling if necessary. The two operators this paper deals with are joins and aggregations. Both involve key columns, and the cardinality of the key columns is critical in determining whether spilling is necessary. One obvious spilling mechanism is to use a partitioning approach for joins and aggregations. I've described partitioned joins in summaries of these papers: Efficiently Processing Joins and Grouped Aggregations on GPUs. The reason why partitioning nicely solves the problem is that the working set requirements for both steps of a partitioned join are modest. The partitioning step only requires a small amount of memory per partition (e.g., accumulate 64KiB per partition before appending partitioned tuples to on-disk storage). The join step only needs enough working memory to join a single partition. Section 4.1 of the paper claims that partitioning slows down TPC-H queries by 2-5x.
My spider sense is tingling, but let’s take this as an axiom for now. Here is the premise of the paper: partitioning is better for queries that must spill, but worse for queries that can be completed without spilling . What is an efficient and simple design given that a database cannot perfectly predict up-front if it will need to spill? Prior work along similar lines introduced the hybrid hash join. A hybrid hash join partitions the left (build) input of the join and dynamically decides what percentage of build partitions must be spilled. A hash table is built containing all non-spilled build partitions. Next the right (probe) input to the join is processed. For each probe tuple, the database determines if the associated partition was spilled. If the partition was spilled, then the probe tuple is spilled. If not, then the probe tuple is processed immediately via a lookup in the hash table. Finally, all spilled partitions are processed. The downside of this approach is that it always partitions the build side, even when that is unnecessary. This paper proposes a join implementation that only pays the cost of partitioning when spilling is required. It is a two-phase process. In the materialization phase, the build table is scanned (and pushed-down filters are applied). The resulting tuples are stored in a list of pages. At first, the system optimistically assumes that no spilling is necessary and appends each tuple to the current page. If a memory limit is reached, then partitioning is enabled. Each tuple processed after that point is partitioned, and per-partition lists of pages are allocated. If a further memory limit is reached, then some partitions are spilled to disk. Next, the build and probe phase executes in a manner similar to hybrid hash join. However, there is a fast path for the case where no spilling occurred. 
In this case, the tuples produced by the materialization phase are inserted into one large hash table, and then the probe tuples are streamed, with one hash table lookup per tuple. If partitioning (but no spilling) occurred, then the hash table inserts will have high locality (assuming a chained hash table). If spilling did occur, then the build and probe phase operates like a hybrid hash join, spilling probe tuples if and only if the associated build partition was spilled. The paper isn't clear on what happens to non-partitioned build tuples once the system decides to start spilling build partitions. My assumption is that in this case, probe tuples must probe both the spilled build tuples and the build tuples that were processed before partitioning was enabled.

The implementation of aggregation described in this paper follows a similar two-phase approach. The materialization phase performs pre-aggregation: the system starts by aggregating into an in-memory hash table. If the hash table grows too large, then partitioning kicks in. Tuples from in-memory hash tables are evicted into per-partition page lists, and subsequent tuples are directly stored in these per-partition page lists. These per-partition pages can be spilled if further memory limits are reached. The subsequent phase then performs any necessary aggregation. If no eviction occurred, then this phase has no work to do.

The paper describes an adaptive compression algorithm that improves spilling performance. Some interesting numbers from the paper: the number of CPU cycles per input byte for executing an in-memory TPC-H query ranges from 3.3 to 60.3, while the number of CPU cycles required to write and read a byte to/from SSD is 11.1. The adaptive nature of this scheme is driven by the fact that queries are diverse, and compressors have knobs which trade off speed for compression ratio, as illustrated by Fig. 3: Source: https://dl.acm.org/doi/10.1145/3698813

When spilling occurs, the system dynamically balances CPU usage and IO bandwidth by adjusting these knobs. Fig. 5 shows in-memory throughput while Fig. 6 shows throughput when spilling occurs: Source: https://dl.acm.org/doi/10.1145/3698813

Dangling Pointers

The design of the unified hash join hinges on the fact that partitioning is a bad idea for the in-memory case. This is in contrast to papers describing in-memory partitioned join implementations on other types of chips, like SPID-Join and Efficiently Processing Joins and Grouped Aggregations on GPUs. I imagine there is a lot of literature about this topic that I haven't read. Leave a comment with your experience.

Thanks for reading Dangling Pointers! Subscribe for free to receive new posts.

0 views
./techtipsy 3 weeks ago

Why Nextcloud feels slow to use

Nextcloud. I really want to like it, but it's making it really difficult. I like what Nextcloud offers with its feature set and how easily it replaces a bunch of services under one roof (files, calendar, contacts, notes, to-do lists, photos, etc.), but no matter how hard I try and how much I optimize its resources on my home server, it feels slow to use, even on hardware ranging from decent to good. Then I opened developer tools and found the culprit. It's the JavaScript.

On a clean page load, you will be downloading about 15-20 MB of JavaScript, which does compress down to about 4-5 MB in transit, but that is still a huge amount of JavaScript. For context, I consider 1 MB of JavaScript to be on the heavy side for a web page/app. Yes, that JavaScript will be cached in the browser for a while, but you will still be executing all of it on each visit to your Nextcloud instance, and that will take a long time due to the sheer amount of code your browser now has to execute on the page. A significant contributor to this heft seems to be a common bundle which, based on its name, seems to provide some functionality that's shared across the different Nextcloud apps one can install. It's coming in at 4.71 MB at the time of writing. Then you want notifications, right? There's a bundle to cover you, at 1.06 MB. Then there are the app-specific views. The Calendar app takes up 5.94 MB to show a basic calendar view. The Files app includes a bunch of individual scripts (1.77 MB, 1.17 MB, 1.09 MB, 0.9 MB for a feature I've never used!) and many smaller ones. The Notes app with its basic bare-bones editor? 4.36 MB!

This means that even on an iPhone 13 mini, opening the Tasks app (to-do list) will take a ridiculously long time. Imagine opening your shopping list at the store and having to wait 5-10 seconds before you see anything, even with a solid 5G connection. Sounds extremely annoying, right?
I suspect that a lot of this is due to how Nextcloud is architected. There’s bound to be some hefty common libraries and tools that allow app developers to provide a unified experience, but even then there is something seriously wrong with the end result, the functionality to bundle size ratio is way off. As a result, I’ve started branching out some things from Nextcloud, such as replacing the Tasks app with using a private Vikunja instance, and Photos to a private Immich instance. Vikunja is not perfect, but its 1.5 MB of Javascript is an order of magnitude smaller compared to Nextcloud, making it feel incredibly fast in comparison. However, with other functionality I have to admit that the convenience of Nextcloud is enough to dissuade me from replacing it elsewhere, due to the available feature set comparing well to alternatives. I’m sure that there are some legitimate reasons behind the current state, and overworked development teams and volunteers are unfortunately the norm in the industry, but it doesn’t take away the fact that the user experience and accessibility suffers as a result. I’d like to thank Alex Russell for writing about web performance and why it matters, with supporting evidence and actionable advice, it has changed how I view websites and web apps and has pushed me to be better in my own work. I highly suggest reading his content, starting with the performance inequality gap series. It’s educational, insightful and incredibly irritating once you learn how crap most things are and how careless a lot of development teams are towards performance and accessibility.
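If you want to measure this on your own instance, the browser's standard Resource Timing API exposes per-resource transfer sizes; the aggregation itself is a pure function (in a real console session you would feed it `performance.getEntriesByType("resource")`):

```javascript
// Sum the compressed transfer size of all scripts loaded by a page.
// In a browser console: scriptBytes(performance.getEntriesByType("resource"))
function scriptBytes(entries) {
  return entries
    .filter((e) => e.initiatorType === "script")
    .reduce((sum, e) => sum + (e.transferSize ?? 0), 0);
}
```

Note that `transferSize` is the over-the-wire (compressed) size, so comparing it against `decodedBodySize` on the same entries shows how much JavaScript the browser actually has to parse and execute.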

0 views
pkh.me 4 weeks ago

Text rendering and effects using GPU-computed distances

Text rendering is cursed. Anyone who has worked on text will tell you the same; whether it's about layout, bi-directional text, shaping, Unicode, or the rendering itself, it's never a completely solved problem. In my personal case, I've been working on trying to render text in the context of a compositing engine for creative content. I needed crazy text effects, and I needed them to be reasonably fast, which implied working with the GPU as much as possible. The distance field was an obvious requirement because it unlocks anti-aliasing and the ability to make many great effects basically for free. In this article, we will see how to compute a signed distance field on the GPU, because it's much faster than doing it on the CPU, especially when targeting mobile devices. We will make the algorithm decently fast, then after lamenting about the limitations, we will see what kind of effects this opens up.

Non-bitmap fonts contain glyphs defined by outlines made of closed sequences of lines and (quadratic or cubic) Bézier curves. Extracting them isn't exactly complicated: FreeType or ttf-parser typically expose a way to do that. For the purpose of this article, we're going to hard code the list of the Bézier curves inside the shader, but of course in a more serious setup those would be uploaded through storage buffers or similar. Using this tiny program, a glyph can be dumped as a series of outlines into a fixed size array: one array contains how many Bézier curves there are for each sub-shape composing the glyph, and another contains the list of Bézier cubic curves. Even though glyphs are also composed of lines and quadratic curves, we expand them all into cubics: "who can do more can do less". We use the standard degree elevation formulas to respectively expand a line (P_0, P_1) and a quadratic (P_0, P_1, P_2) into a cubic (C_0, C_1, C_2, C_3):

    Line:      C_0 = P_0, C_1 = (2P_0 + P_1)/3, C_2 = (P_0 + 2P_1)/3, C_3 = P_1
    Quadratic: C_0 = P_0, C_1 = (P_0 + 2P_1)/3, C_2 = (2P_1 + P_2)/3, C_3 = P_2

where P_n are the Bézier control points. For simplicity, and because we want to make sure the most complex case is well tested, we will stick to this approach in this article.
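A sketch of these elevation formulas in code (control points assumed to be `[x, y]` arrays; the function names are mine):

```javascript
// Linear interpolation between two 2D points.
const lerp = (a, b, t) => [a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t];

// Elevate a quadratic Bézier (p0, p1, p2) to an equivalent cubic.
// The curve itself is unchanged; only the representation gains a point.
function quadraticToCubic([p0, p1, p2]) {
  return [p0, lerp(p0, p1, 2 / 3), lerp(p2, p1, 2 / 3), p2];
}

// A line segment elevates to a cubic with controls at 1/3 and 2/3.
function lineToCubic([p0, p1]) {
  return [p0, lerp(p0, p1, 1 / 3), lerp(p0, p1, 2 / 3), p1];
}
```

Because elevation is exact, the cubic traces the same points as the original line or quadratic for every t, which is what lets a single cubic solver handle all three cases.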
But it also means there is a lot of room for further optimization. Since solving lines and quadratics is much simpler, this is left as an exercise for the reader. You may be tempted to upload the polynomial form of these curves directly to save some computation in the shader. Don't. You will lose the exact stitching property, because one evaluated polynomial end B_n(1) will not necessarily match the next polynomial start B_{n+1}(0). This creates artificial "precision holes" that will break rendering in obscure ways.

In the previous article, we saw how to get the distance to a cubic Bézier curve. Each glyph being composed of multiple outlines, we can simply run over all of them and pick the shortest distance, using the squared distance to each Bézier curve as defined in the previous article. This works just fine but, as you can imagine, it's not cheap to solve so many distances per pixel. A first straightforward optimization is to ignore any curve whose bounding box is further away than our current best distance, because no point of that curve can give a shorter one. Each box encloses a Bézier curve. We could use a tighter bound, but it would require more computation, so this felt like a good trade-off. Implementing this in the inner loop is pretty simple. The distance-to-bounding-box formula comes from this explanatory video by Inigo Quilez (the basic one, without the inside distance), adapted to the Bézier control point coordinates.

This saves a lot of computation in certain cases, but the worst case is still pretty terrible, as shown by the heat map of this 'C' glyph: indeed, sometimes it takes a long time to reach a Bézier curve whose distance is small enough to disregard most of the others. We observe this the further we go away from the beginning of the shape. So the next step is to find a good initial candidate.
One cheap way to do that is to first compute the distance to the center point of each curve, and pick the smallest. This optimization is immediately reflected in the heat map, where only the central point seems to remain a critical point (this glyph is a pathological case as it forms a circle).

The last step is to figure out whether we are inside or outside the shape. There are two schools here: the even-odd rule and the non-zero rule. We'll pick the latter because that's the expectation in the case of font rendering. In deconstructing Bézier curves, we explained the theory of that specific algorithm, so we're not going to dive into the details again. The basic idea is to cast a ray in one direction from our current position, and count how many times we cross a given curve. Here we will arbitrarily choose a horizontal ray line y = P_y, where P is our current coordinate. The topology of each curve can hint at whether it's worth considering or not. For example, if every control point is above or below our current position, the curve can be ignored. We can store all the signs in a mask and bail out as soon as the ray is either completely below or completely above the bounding box of the curve. Each sign indicates the position of a control point with regard to the ray. We can use the relative position of the starting point as a reference for the overall orientation (if there is a crossing, we know whether it will come from below or above). We also need to convert the Bézier curves to the usual polynomial form at^3 + bt^2 + ct + d. Then we can find the y-roots and check every crossing point on the x-axis. For every crossing point (at most 3), we switch the sign. Since we already have a 5th degree root finder from the previous article, we just have to build a tiny version for the 3rd degree. Our root finder doesn't return roots outside [0,1], so no filtering is required.
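Before summarizing, the control-point sign test can be sketched like this (a standalone illustration, relying on the convex hull property: a Bézier curve never leaves the hull of its control points):

```javascript
// Early-out for the ray-crossing test: if all four control points of a
// cubic lie strictly on one side of the horizontal ray y = rayY, the
// curve cannot cross the ray and can be skipped entirely.
function mayCrossRay(controlPoints, rayY) {
  let above = 0;
  let below = 0;
  for (const [, y] of controlPoints) {
    if (y > rayY) above++;
    else if (y < rayY) below++;
  }
  // A crossing is only possible when the control points straddle
  // (or touch) the ray line.
  return !(above === 4 || below === 4);
}
```

Only curves for which this returns true need the cubic conversion and the 3rd-degree root solve.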
To summarize: For every sub-shape, we can accumulate the winding number, and use it at the end to decide whether we're inside or outside: This winding number logic might be too fragile: it doesn't cover potential degenerate cases such as horizontal tangents / duplicated roots. But for some reason, while I fought these issues for years, none of the weird corner cases seemed to glitch in my extensive tests, probably because the root finder is more resilient than what I was using before. This may look satisfying, but it's only the beginning of the problems. For example, variable fonts typically follow chaotic patterns: In addition to the self-overlapping part, notice the reverse folding triangle on the right. This completely wrecks the distance field: Even with a simple character display (meaning something that doesn't exploit the wide range of effects available with an SDF), it starts to glitch: little "cracks" appear around the overlaps. This can be mitigated by lowering the distance by a tiny constant to avoid the zero-crossing, but it impacts the overall glyph (it gets bolder). And it's not just a variable-font problem; sometimes designers rely on overlaps for simplicity: And sometimes... well, let's say they have a legitimate reason to do it: This is not something that can be addressed easily. For example, take these two overlapping shapes: We see that the actual distance (white circle) is not the smallest distance to either shape, and it's not even the smallest distance to any edge: it is at an intersection point between two curves, which we do not have. Here we're dealing with line segments, but with cubic curves, the problem explodes in complexity. At this point, we need another strategy, like feeding the GPU renderer with preprocessed outline-only curves. Many people rely on curve flattening to address this issue. This is unfortunately yet another field of research that we're not going to explore this time. 
Inigo talked about combining signed distances if you want some ideas, but aside from the first one (giving up), none seems particularly applicable here. Some effects such as blur or glow expand beyond the boundaries of the characters, so the distance field needs to be larger than the glyph itself. This means that when an effect spreads too wide, there will be an overlap. If we're making an effect on a word, the distance field must be the unified version of all the word's glyphs (or sometimes even the sentence). The classic approach of an atlas of glyph distances will not work reliably. In the following illustration, a geometry per glyph is used, each geometry is enlarged to account for the larger distance field, and we end up with potential overlaps when applying effects. Like all distance maps, ours suffers from the same limitations. The most common one is the rounded corners problem. This is typically addressed using a multi-channel signed distance field generator , but it's hard for me to tell how amenable it is to porting to the GPU. This problem only appears with intermediate textures. When computing exact distances directly in the shaders, like here, this is not an issue. Despite all these limitations, we can already do so much, so let's close this article on a positive note. This is not done here, but all of these effects are free as soon as we have the distance field stored in an intermediate texture. First, we have anti-aliasing / blur: I wrote a dedicated article on the subject of AA (and blur) on SDFs if you want more information on how to achieve that. The shape can also be drastically altered with a simple operator such as "rounding": This is the same technique we suggested to cover up the overlap glitch earlier, just rebranded as an effect. 
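The anti-aliasing step and the rounding operator are simple enough to sketch as pure functions of the signed distance d (a hedged illustration in Python rather than shader code; names and the pixel-wide AA width are assumptions):

```python
def smoothstep(e0, e1, x):
    """Classic GLSL-style smoothstep: 0 below e0, 1 above e1."""
    t = min(max((x - e0) / (e1 - e0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def coverage(d, aa=1.0):
    """Anti-aliased fill from a signed distance d (negative = inside):
    1 deep inside, 0 far outside, smooth over roughly aa pixels."""
    return 1.0 - smoothstep(-aa * 0.5, aa * 0.5, d)

def rounded(d, r):
    """'Rounding' operator: growing the shape by r rounds its corners —
    the same trick used earlier to hide the overlap cracks."""
    return d - r
```

Widening the `aa` parameter beyond one pixel is what turns the anti-aliasing into a blur.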
In the same spirit we can also create an outline stroke (on the outer edge, to preserve the original glyph design): This is sooo useful because it makes it possible for our text to be visible no matter what the background is. So many editors don't have this feature because it's hard and expensive to do correctly. Given a distance field though, all we have to do is this (which also includes anti-aliasing on every border): We can also dig into our character with : And maybe apply some glow to create a neon effect: We could also do drop shadows, all sorts of distortions, or so many other creative ways of exploiting this distance. You get the idea: it is fundamental as soon as you want fast visual effects. This article is the last of the series on 2D rendering for me. I've wanted to share this experience and knowledge after many years of struggling (mostly alone) with these issues. I wish I could have succeeded in providing a good free and open-source text effects rendering engine to compete with the industry standards. (Un)fortunately for me, the adventure stops here, but I hope this will benefit creators and future tinkerers interested in the subject.

0 views
The Tymscar Blog 4 weeks ago

From 400 Mbps to 1.7 Gbps: A WiFi 7 Debugging Journey

I recently upgraded from a UniFi Dream Machine to a UniFi Dream Router 7 because I’m getting 2.5 Gbps internet in two weeks and figured I’d jump on the WiFi 7 bandwagon while I’m at it. My iPhone 17 Pro Max supports it, so why not? After setting everything up, I was getting nowhere near the speeds I expected. Time for some debugging. My wired connection was pulling 950 Mbps through a 1 Gbps switch, and iperf3 directly to the UDR7’s 2.5 Gbps port showed around 2.3 Gbps. The backbone was solid. But on WiFi 7 (6 GHz, 160 MHz width), standing literally a foot from the router, I was getting around 400 Mbps with iperf3. With 10 concurrent streams it went up to 650 Mbps, but that’s still pathetic.

Dangling Pointers 1 month ago

Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt?

Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt? Kaisong Huang, Jiatang Zhou, Zhuoyue Zhao, Dong Xie, and Tianzheng Wang SIGMOD'25 Say you are a database, and your job is to execute two kinds of queries (from two different TPC benchmarks ): High-priority New-Order queries from TPC-C (OLTP) Low-priority Q2 queries from TPC-H (OLAP) Congratulations, you are a hybrid transaction/analytical processing ( HTAP ) database! You would like OLTP transactions to experience low tail latency, while OLAP transactions run at high throughput. How can transactions be scheduled to achieve these goals? A Non-Preemptive FIFO policy runs transactions to completion in the order they came. This is easy but has a high tail latency for OLTP transactions that have to sit in line behind a long queue of OLAP transactions. A cooperative policy involves yield points at specific places in the database code. At each yield point, an OLAP transaction can realize that an OLTP transaction is waiting and yield the CPU. It is a hassle to insert yield points, and it is hard to find the Goldilocks frequency. Checking for high-priority transactions too often adds overhead, but checking infrequently increases tail latency. A preemptive policy allows high-priority OLTP transactions to borrow a CPU core as soon as possible. In the past, the only practical way to do this involved OS context switching, which is expensive. Fig. 2 illustrates these policies: Source: https://dl.acm.org/doi/abs/10.1145/3725319 Enter userspace interrupts . These allow preemption without OS kernel involvement. Section 4.2 of the paper makes it clear that it isn't totally easy to implement userspace context switching on top of userspace interrupts. An idiomatic use case for userspace interrupts is for an interrupt handler to quickly save some data and then return back to the code that was running. The context switch case is not idiomatic. 
For each CPU core, two pthread threads are created, and there are two stacks. Say the CPU core is running a low-priority (OLAP) transaction and a userspace interrupt is delivered to the core. The userspace interrupt handler is invoked, which mucks around with the CPU registers (including the stack pointer), and then returns. But it doesn't return to the code that was running the low-priority transaction; it returns to code which runs the high-priority (OLTP) transaction . Once the high-priority transaction finishes, it calls a voluntary context switch function, which again mucks around with CPU registers and the stack pointer in just the correct manner so that it returns back to the code running the low-priority transaction. There are some nitty-gritty details to get this working correctly. Tricky cases have to be handled such as: A userspace interrupt occurring in the middle of the context switch function Support for database code which uses thread-local storage (e.g., the modifier in C++ ) Avoiding deadlocks associated with a userspace interrupt occurring while a database lock is acquired As seen with xUI , while userspace interrupts are cheap, they still incur a cost. This paper proposes firing a single interrupt to execute a batch of high-priority transactions. Section 5 also describes a starvation avoidance mechanism to ensure that low-priority transactions eventually finish. Note that when a low-priority transaction is preempted, it is not automatically aborted . The paper assumes the underlying database uses multi-versioning and optimistic concurrency control. Fig. 10 has the headline results. represents FIFO scheduling. represents the case where the OLAP query can occasionally yield the CPU core. is the work described in this paper. Tail latencies for OLTP queries are significantly reduced, while performance of OLAP queries does not change much. 
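The cooperative baseline is easy to sketch in miniature (the paper's uintr-based preemption needs hardware support and cannot be reproduced here; this toy uses a Python generator's yield points as the cooperative policy's yield points, and every name is made up):

```python
# Toy sketch of the cooperative scheduling policy: the OLAP job is a
# generator, and each `yield` is a yield point where waiting
# high-priority OLTP work may run. Note that OLTP work still waits
# for the current OLAP chunk to finish — the granularity problem the
# preemptive policy avoids.
from collections import deque

def olap_query(n_chunks, log):
    for i in range(n_chunks):
        log.append(("olap", i))   # one slice of analytical work
        yield                     # yield point: check for OLTP work

def run(high_priority, olap, log):
    """Drive the OLAP generator, draining all waiting high-priority
    OLTP transactions at every yield point."""
    hp = deque(high_priority)
    for _ in olap:
        while hp:                 # OLTP jumps the queue here
            log.append(("oltp", hp.popleft()))

log = []
run(["new-order-1", "new-order-2"], olap_query(3, log), log)
```

With coarse chunks, the OLTP latency grows with the chunk size; with fine chunks, the constant yield-point checks add overhead — the Goldilocks problem described above.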
Source: https://dl.acm.org/doi/abs/10.1145/3725319 Dangling Pointers It would be nice to see a comparison against a version which uses traditional OS thread synchronization rather than userspace interrupts. The details of userspace context switching are tricky and seem orthogonal to databases. A library or OS functionality which provides a robust implementation seems like a useful thing to exist. The paper doesn't mention what happens if the work associated with a single query is parallelized across multiple CPU cores. I imagine this complicates the scheduling policy.

Loren Stewart 1 month ago

I Built the Same App 10 Times: Evaluating Frameworks for Mobile Performance

Started evaluating 3 frameworks for work, ended up building 10. Next-gen frameworks (Marko, SolidStart, SvelteKit, Qwik) all deliver instant 35-39ms performance. The real differentiator? Bundle sizes range from 28.8 kB to 176.3 kB compressed. Choose based on your priorities, not microscopic FCP differences.

Dangling Pointers 1 month ago

Fast and Scalable Data Transfer Across Data Systems

Fast and Scalable Data Transfer Across Data Systems Haralampos Gavriilidis, Kaustubh Beedkar, Matthias Boehm, and Volker Markl SIGMOD'25 We live in exciting times, with unimaginably large language models getting better each day and a constant stream of amazing demos. And yet, efficiently transferring a table between heterogeneous systems is an open research problem! An example from the paper involves transferring data from PostgreSQL to pandas. Optimizing this transfer time is important and non-trivial. The paper describes a system named XDBC. XDBC software runs on both the source and the destination data management systems (DMS), as illustrated by Fig. 4: Source: https://dl.acm.org/doi/10.1145/3725294 The XDBC client/server processes are organized as a pipeline. Data parallelism within a stage is exploited by assigning 1 or more workers (e.g., cores) to each stage. There are a lot of knobs which can affect end-to-end throughput: Number of workers assigned to each task Data interchange format (row-major, column-major, Arrow ) Compression ( zstd , snappy , lzo , lz4 ) Section 4.1 of the paper claims the search space is so large that brute force search will not work, so a heuristic algorithm is used. The heuristic algorithm assumes accurate performance models which can estimate the performance of each pipeline stage given a specific configuration. This model is based on real-world single-core performance measurements, and Gustafson's law to estimate multi-core scaling. The algorithm starts by assigning 1 worker to each pipeline stage (in both the client and server). An iterative process then locates the pipeline stage which is estimated to be the slowest and assigns additional workers to it until it is no longer the bottleneck. 
This process continues until no more improvement can be found, due to one of the following reasons: All available CPU cores have been assigned Network bandwidth is the bottleneck If the process ends with more CPU cores available, then a hard-coded algorithm determines the best compression algorithm given the number of cores remaining. The data interchange format is determined based on which formats the source and destination DMSs support, and which compression algorithm was chosen. The XDBC optimizer has a lot of similarities with the Alkali optimizer . Here are some differences: Alkali does not require tasks to be executed on separate cores. For example, Alkali would allow a single core to execute both the and pipeline stages. Alkali uses an SMT solver to determine the number of cores to assign to each stage. The Alkali performance model explicitly takes into account inter-core bandwidth requirements. Alkali doesn't deal with compression. Fig. 7(a) shows results from the motivating example (PostgreSQL→Pandas). Fig. 7(b) compares XDBC vs built-in Pandas functions to read CSV data over HTTP. connector-x is a more specialized library which supports reading data into Python programs specifically. Source: https://dl.acm.org/doi/10.1145/3725294 Dangling Pointers There are many search spaces which are too large for brute force. Special-case heuristic algorithms are one fallback, but as the Alkali paper shows, there are other approaches (e.g., LP solvers, ILP solvers, SMT solvers, machine learning models). It would be great to see cross-cutting studies comparing heuristics to other approaches.
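The iterative worker-assignment loop described above can be sketched as follows (a hedged Python toy: the per-stage single-core rates and the serial fraction are invented stand-ins for the paper's measurement-calibrated cost model, and compression/format selection is omitted):

```python
# Sketch of XDBC's bottleneck-driven worker assignment. Stage
# throughput is estimated with a Gustafson-style scaling model; the
# real optimizer would also stop early when the network is the
# bottleneck and spend leftover cores on compression.

def gustafson(single_core_rate, workers, serial_frac=0.05):
    """Estimated throughput of one stage with `workers` cores."""
    return single_core_rate * (serial_frac + (1 - serial_frac) * workers)

def assign_workers(stage_rates, total_cores):
    """Start with 1 worker per stage, then repeatedly feed the
    estimated bottleneck stage until the core budget is spent."""
    workers = {stage: 1 for stage in stage_rates}
    cores_left = total_cores - len(workers)
    while cores_left > 0:
        # The stage with the lowest estimated throughput is the bottleneck.
        bottleneck = min(
            workers, key=lambda s: gustafson(stage_rates[s], workers[s])
        )
        workers[bottleneck] += 1
        cores_left -= 1
    return workers
```

For example, with hypothetical rates `{"read": 100, "ser": 50, "net": 200}` and 6 cores, the slow serialization stage ends up with most of the extra workers.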

Dangling Pointers 1 month ago

Falcon: A Reliable, Low Latency Hardware Transport

Falcon: A Reliable, Low Latency Hardware Transport Arjun Singhvi, Nandita Dukkipati, Prashant Chandra, Hassan M. G. Wassel, Naveen Kr. Sharma, Anthony Rebello, Henry Schuh, Praveen Kumar, Behnam Montazeri, Neelesh Bansod, Sarin Thomas, Inho Cho, Hyojeong Lee Seibert, Baijun Wu, Rui Yang, Yuliang Li, Kai Huang, Qianwen Yin, Abhishek Agarwal, Srinivas Vaduvatha, Weihuang Wang, Masoud Moshref, Tao Ji, David Wetherall, and Amin Vahdat SIGCOMM'25 Falcon is an IP block which can be integrated into a 3rd-party NIC. Fig. 7 shows an example integration of Falcon into a NIC. Blue components are part of Falcon: Source: https://dl.acm.org/doi/abs/10.1145/3718958.3754353 Multiple Upper Layer Protocols (ULPs, e.g., NVMe and RDMA ) are implemented on top of Falcon. Other protocols (e.g., Ethernet) can bypass Falcon and go straight to the standard NIC hardware. Falcon provides reliability and ordering via a connection-oriented interface to the ULPs. Multipathing is the ability for a single connection to use multiple network paths from the sender to the receiver. This improves throughput by allowing use of aggregate bandwidth and allows Falcon to quickly react to transient congestion on a subset of paths. The paper uses the term flow for a single path from sender to receiver. A single connection is associated with many flows. There are two parts to implementing multipathing, one easy and one not-so-easy. The easy task is to use the IPv6 Flow Label field. When the sending NIC chooses a flow for a particular packet, it sets the index of the flow in the flow label field. When a switch determines that there are multiple valid output ports for a packet, it hashes various fields from the packet (including the flow label) to determine which port to use. The switches are doing the hard work here. A Falcon NIC doesn’t need to maintain a local view of the network topology between the sender and receiver, nor does it have to pre-plan the exact set of switches a packet will traverse. 
The NIC simply sets the flow label field. The hard part is handling out-of-order packets. If the sending NIC is interleaving between flows at a fine granularity, then the receiving NIC will commonly receive packets out of order. Falcon burns 1-2 mm² of silicon on a packet buffer which holds received packets until they can be delivered to a ULP in order. ACK packets contain a packet sequence number and a 128-bit wide bitmap which represents a window of 128 recent packets that have been received. The sender uses these bitmaps to determine when to retransmit. The NIC maintains an estimate of the round-trip latency on each flow. If the most recent bitmap indicates that a packet has not been received, and a period of time longer than the round-trip latency has elapsed, then the packet is retransmitted. Falcon attempts to be a good citizen and minimize bufferbloat by estimating per-flow round-trip latency. These estimates are gathered via hardware near the edge of the NIC which records timestamps as packets (including ACKs) are sent and received. When Falcon is processing a packet to be sent for a given connection, it computes the open window associated with each flow. The open window is the difference between the round-trip latency and the number of unacknowledged packets. The flow with the largest open window is selected. You can think of the open window like a per-flow credit scheme, where the total credits available are determined from round-trip latency, sending a packet consumes a credit, and receiving an ACK produces credits. The trick here is that the round-trip latency associated with each flow is constantly changing. Section 5.2 of the paper describes three details which the authors felt were worth mentioning. The unspoken assumption is that these are non-standard design choices: As mentioned before, Falcon dedicates a non-trivial amount of on-chip resources to SRAM buffers which hold received packets before they are reassembled into the correct order. 
The paper says 1.2MB is required for 200Gbps, and the buffer size grows linearly with throughput. One interesting fact is that the buffer size is independent of latency, because throughput decreases with latency. For example, the paper mentions the same size works well with “inter-metro use-cases” which have 5-10x higher latency, but also 5-10x lower bandwidth. Falcon has an on-chip cache to hold mutable connection state, but the paper says that it is very common to have a high miss rate in this cache. The solution to this is to provision enough bandwidth to be able to have good performance when most accesses to connection state must go off chip. Reading between the lines, it seems like there are two scenarios which are important. The first has a small number of connections, with each connection experiencing a high packet rate. The second is a large number of connections, each with a low packet rate. Falcon has hardware support for somewhat rare events (errors, timeouts) rather than letting software on the host handle this. Fig. 10 compares Falcon against RoCE for various RDMA verbs and drop rates. Note that the drop rate maxes out at 1%. Source: https://dl.acm.org/doi/abs/10.1145/3718958.3754353 Dangling Pointers Falcon contains a lot of great optimizations. I wonder how many of them are local optimizations, and how much more performance is on the table if global optimization is allowed. In particular, Falcon works with standard ULPs (RDMA, NVMe) and standard Ethernet switches. At some scale, maybe extending the scope of allowable optimizations to those components would make sense?
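The open-window flow selection described earlier can be sketched in a few lines (a hedged toy in Python, not the NIC hardware; the "rtt_budget" credit unit and all field names are invented for illustration):

```python
# Sketch of per-flow open-window selection. Each flow's credit budget
# is derived from its (constantly changing) RTT estimate; sending on a
# flow consumes one credit, and an ACK would return credits by
# decrementing `unacked`.

def open_window(flow):
    # round-trip latency expressed as a packet budget, minus in-flight
    return flow["rtt_budget"] - flow["unacked"]

def pick_flow(flows):
    """Choose the flow with the largest open window, then account for
    the packet about to be sent on it."""
    best = max(flows, key=open_window)
    best["unacked"] += 1
    return best

flows = [
    {"label": 0, "rtt_budget": 8, "unacked": 7},  # nearly saturated
    {"label": 1, "rtt_budget": 4, "unacked": 1},  # most headroom
]
chosen = pick_flow(flows)
```

Note that the flow with the bigger absolute budget (label 0) loses here: what matters is headroom, which is how transient congestion on one path shifts traffic to the others.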

Brain Baking 1 month ago

Is It Worth It To Optimize Images For Your Site?

Yes, but it depends on how you define the verb “to optimize”. For any image conversion heavy lifting I rely on the trusty ImageMagick, yet I’ve been wondering whether my argument preset is correct: should it be more or less optimized? The problem with questions is that they lead to other questions, such as: how many assets is this blog actually generating each year? Is my image optimization technique sustainable enough or will I end up with terabytes full of nonsense in ten or twenty years? When it comes to size, opening up the handy gdu disk analyser in the folder is enough to get a first impression: Gdu summarizing how much disk usage the assets on this blog are for each year in MiB. As I maintain the same folder structure for both and —this post lives under , for example—generating an overview of asset sizes per year becomes trivial. Not taking the earlier Brain Baking years into account, the total amount of data that gets added each year is on average . Let’s make that between thirteen and fourteen as 2025 isn’t finished yet. That means in twenty years, I’ll have accumulated an extra . That’s not even half a classic CD-ROM. Is it really worth it then, to think twice about every MiB that gets checked in? Well, yes, since all those bytes need to leave one server and make an appearance at another in order to serve these pretty images to you, the visitor. Besides, as a proud member of The 512KB Club , I should keep my promise of reducing file sizes as much as possible. Of course, not all posts have assets attached to them: the average amount of assets linked to a post here is , with each post having about of data. That’s quite optimized! Yet can I do better? Or should I stop over-compressing those images up to the point that they’re losing their vivid shine? More questions! No wait, those were the same. Here’s the default ImageMagick command I rely on: What exactly does this do? 
I urge you to read Colin’s old but relevant post on chroma (colour detail) and luma (lightness and darkness) and how to optimize for the web/mobile. It even includes a regression analysis, concluding that: Resizing images matters most. It multiplies the size a little more than the square root of the total pixels. More pixels, many more bytes. Compression matters somewhat. For quality=80 the bytes are x23; for quality=100 bytes multiply x50. Subsampling of 4:2:0 could further reduce the bytes by 17%. What I did not realize until now, by testing and comparing images, is that -strip does something else besides stripping GPS Exif data. I noticed the export became washed out, as if a portion of the colour profile information was lost. Take a close look at the macOS dock screenshots re-rendered in Firefox: Above: using -strip; without ICC. Below: using +profile '!icc,*'; with ICC. Can you find the difference by inspecting the saturation of the red Firefox fox or the yellow wings of the NetNewsWire satellite? The difference is very subtle—and very difficult to showcase in a screenshot—but very annoying. Inspecting the images using ImageMagick’s reveals that the ICC profile is removed in the process: The embedded ICC profile is there to make sure the image looks the same on any computer and any piece of software; without it browsers can render it however they want. The result is a flat-looking image as you can see in the above screenshot (which does have an embedded profile). The option does not solve this: it tells ImageMagick to convert the colorspace, not to attach it. Instead of using -strip , use +profile '!icc,*' to throw away all profiles but the ICC one. Also, be sure to add a resize as this obviously has the highest impact on file sizes. But wait, what about providing a higher-resolution image to desktop browsers and a lower-resolution version to mobile browsers? For me, that’s a hassle I don’t want to bother with at all. 
It requires saving the assets in their original format and providing a couple of alternatives, greatly increasing the total size of the source repository, the total size of the deployable folder, and the total bandwidth for my humble server. For mobile users, that’s not a problem as downloading of data is less than the copious amounts of megabytes that will get slurped in when you visit your average newspaper site. For ultra widescreen 4K nerds, the max width on the container wrapping this will keep things somewhat in check. The biggest takeaway for me is that in twenty years I’ll have filled half a CD-ROM, which is significantly less than I expected. Should this incentivize me to bump the quality to , reduce downsampling, or instead increase the usage of assets in general? Maybe I should be less worried about the file size and more about the content. By Wouter Groeneveld on 23 October 2025.  Reply via email . : the sampling factor used for the JPEG encoder. If Colin Bendell tells me to use claiming a +/- 17% image size reduction, then I believe him. : Removes all profiles except for the ICC colour profile; gets rid of EXIF data. See What Exif Data Reveals About Your Site . : The compression quality of the image. With 90 or less, chroma channels are downsampled (which I instruct it to do anyway with the sampling factor argument). : explicitly tasks ImageMagick to create a progressive JPEG, allowing the browser to show a lower-resolution version of the image whilst data is still being transferred. Perceived download speed is also important! : Why on earth would you want to export JPEG files when the era of WebP is here? That’s an easy one to answer: because my retro hardware knows JPEG. Because I believe we should build websites that last. : the default and recommended option for the WWW for images that do not contain any extra colorspace information such as JPEG. Other options provide slightly better compression .
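A command assembled from the flags the footnotes describe might look like the following (hedged: this is not necessarily the author's exact preset — the resize width and quality value are arbitrary stand-ins, and IM6 users would call `convert` instead of `magick`):

```python
# Builds an ImageMagick argv from the options discussed above:
# shrink-only resize, 4:2:0 chroma subsampling, keep only the ICC
# profile, progressive JPEG. All flags are standard ImageMagick options.

def magick_args(src, dst, max_width=800, quality=85):
    return [
        "magick", src,
        "-resize", f"{max_width}>",      # only shrink, never enlarge
        "-sampling-factor", "4:2:0",     # chroma subsampling (~17% smaller)
        "+profile", "!icc,*",            # drop EXIF etc., keep ICC
        "-quality", str(quality),
        "-interlace", "plane",           # progressive JPEG
        "-colorspace", "sRGB",
        dst,
    ]

# e.g.: subprocess.run(magick_args("photo.png", "photo.jpg"), check=True)
```

Passing the arguments as a list (rather than one shell string) also sidesteps quoting trouble around `!icc,*`, which most shells would otherwise mangle.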
