Posts in Performance (20 found)

Running cheap and crappy USB hard drives in RAID0 is indeed a very terrible idea

Some of my dumb experiments result in interesting findings and unexpected successes. Some end up with very predictable failures. What happens when you have two crappy USB hard drives running in RAID0 mode? Nothing, until something goes wrong on one of the drives. Here’s what it looks like when that happens.

But in a way, this setup worked exactly as expected. If you want to have a lot of storage on the cheap, or simply care about performance, or both, then running disks in RAID0 mode is a very sensible thing to do. I used it mainly for having a place where I can store a bunch of data temporarily, such as full disk images or data that I can easily replace. Now I can test that theory out!

I feel like I need to point out that this is not the fault of the file system. When you instruct a file system to provide zero redundancy, then that is what you will get.


LoopFrog: In-Core Hint-Based Loop Parallelization

LoopFrog: In-Core Hint-Based Loop Parallelization
Marton Erdos, Utpal Bora, Akshay Bhosale, Bob Lytton, Ali M. Zaidi, Alexandra W. Chadwick, Yuxin Guo, Giacomo Gabrielli, and Timothy M. Jones
MICRO'25

To my Kanagawa pals: I think hardware like this would make a great target for Kanagawa, what do you think?

The message of this paper is that there is plenty of loop-level parallelism available which superscalar cores are not yet harvesting. Fig. 1 illustrates the classic motivation for multi-core processors: scaling the processor width by 4x yields a 2x IPC improvement. In general, wider cores are heavily underutilized.

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

The main idea behind LoopFrog is to add hints to the ISA which allow a wide core to exploit more loop-level parallelism in sequential code.

Structured Loops

If you understand Fig. 2, then you understand LoopFrog; the rest is just details:

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

The compiler emits instructions which the processor can use to understand the structure of a loop. Processors are free to ignore the hints. A loop which can be optimized by LoopFrog comprises three sections:

A header, which launches each loop iteration
A body, which accepts values from the header
A continuation, which computes values needed for the next loop iteration (e.g., the value of induction variables)

Each execution of the header launches two threadlets. A threadlet is like a thread but is only ever executed on the core which launched it. One threadlet launched by the header executes the body of the loop. The other threadlet launched by the header is the continuation, which computes values needed for the next loop iteration. Register loop-carried dependencies are allowed between the header and continuation, but not between body invocations. That is the key which allows multiple bodies to execute in parallel (see Fig. 2c above).

At any one time, there is one architectural threadlet (the oldest one), which can update architectural state. All other threadlets are speculative. Once the architectural threadlet for a loop iteration completes, it hands the baton over to the threadlet executing the next iteration, which becomes architectural.

Dependencies through memory are handled by the speculative state buffer (SSB). When a speculative threadlet executes a memory store, data is stored in the SSB and actually written to memory later on (i.e., after that threadlet is no longer speculative). Memory loads read from both the L1 cache and the SSB, and then disambiguation hardware determines which data to use and which to ignore.

The hardware implementation evaluated by the paper does not support nested parallelization; it simply ignores hints inside of nested loops.

Fig. 6 shows simulated performance results for an 8-wide core. A core which supports 4 threadlets is compared against a baseline which does not implement LoopFrog.

Source: https://dl.acm.org/doi/10.1145/3725843.3756051

LoopFrog can improve performance by about 10%. Fig. 1 at the top shows that an 8-wide core experiences about 25% utilization, so there may be more fruit left to pick.


Dynamic Load Balancer in Intel Xeon Scalable Processor

Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
Jiaqi Lou, Srikar Vanavasam, Yifan Yuan, Ren Wang, and Nam Sung Kim
ISCA'25

This paper describes the DLB accelerator present in modern Xeon CPUs. The DLB addresses a similar problem discussed in the state-compute replication paper: how to parallelize packet processing when RSS (static NIC-based load balancing) is insufficient. Imagine a 100 Gbps NIC is receiving a steady stream of 64B packets and sending them to the host. If RSS is inappropriate for the application, then another parallelization strategy would be for a single CPU core to distribute incoming packets to all of the others. To keep up, that load-distribution core would have to be able to process 200M packets per second, but state-of-the-art results top out at 30M packets per second. The DLB is an accelerator designed to solve this problem.

Fig. 2 illustrates the DLB hardware and software architecture:

Source: https://dl.acm.org/doi/10.1145/3695053.3731026

A set of producer cores can write 16B queue elements (QEs) into a set of producer ports (PPs). In a networking application, one QE could map to a single packet. A set of consumer cores can read QEs out of consumer queues (CQs). QEs contain metadata which producers can set to enable ordering within a flow/connection, and to control relative priorities. The DLB balances the load at each consumer, while honoring ordering constraints and priorities.

A set of cores can send QEs to the DLB in parallel without suffering too much from skew. For example, imagine a CPU with 128 cores. If DLB is not used, and instead RSS is configured to statically distribute connections among those 128 cores, then skew could be a big problem. If DLB is used, and there are 4 cores which write into the producer ports, then RSS can be configured to statically distribute connections among those 4 cores, and skew is much less likely to be a problem.

Fig. 5 shows that DLB works pretty well. Two of the configurations shown are software load balancers, while the third uses the DLB accelerator. DLB offers similar throughput and latency to RSS, but with much more flexibility.

Source: https://dl.acm.org/doi/10.1145/3695053.3731026

AccDirect

One awkward point in the design above is the large number of CPU cycles consumed by the set of producer cores which write QEs into the DLB. The paper proposes AccDirect to solve this. The idea is that the DLB appears as a PCIe device, and therefore a flexible NIC can use PCIe peer-to-peer writes to send packets directly to the DLB. The authors find that the NVIDIA BlueField-3 has enough programmability to support this. Fig. 9 shows that this results in a significant power savings, but not too much of a latency improvement:

Source: https://dl.acm.org/doi/10.1145/3695053.3731026

Dangling Pointers

I feel like it is common knowledge that fine-grained parallelism doesn’t work well on multi-core CPUs. In the context of this paper, the implication is that it is infeasible to write a multi-core packet processor that primarily uses pipeline parallelism. Back-of-the-envelope: at 400Gbps, and 64B packets, there is a budget of about 40 8-wide SIMD instructions to process a batch of 8 packets. If there are 128 cores, then maybe the aggregate budget is 4K instructions per batch of 8 packets across all cores. This doesn’t seem implausible to me.
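As a sanity check on that back-of-the-envelope number, here is a tiny Go calculation. The ~4 GHz clock and the one 8-wide SIMD instruction issued per cycle are my assumptions, not figures from the paper.

```go
package main

import "fmt"

func main() {
	const (
		linkBitsPerSec = 400e9 // 400 Gbps
		packetBits     = 64 * 8
		batchSize      = 8
		coreHz         = 4e9 // assumed ~4 GHz core clock
		cores          = 128
	)

	packetsPerSec := linkBitsPerSec / packetBits // ~781M packets/s
	batchesPerSec := packetsPerSec / batchSize   // ~98M batches of 8 per second
	cyclesPerBatch := coreHz / batchesPerSec     // ~41 cycles per batch on one core
	// If the core issues one 8-wide SIMD instruction per cycle, that is
	// roughly the "40 instructions per batch" budget quoted above.
	aggregatePerBatch := cyclesPerBatch * cores // ~5K instructions across 128 cores

	fmt.Printf("packets/s: %.0f\n", packetsPerSec)
	fmt.Printf("cycles per 8-packet batch (1 core): %.0f\n", cyclesPerBatch)
	fmt.Printf("aggregate budget per batch (128 cores): %.0f\n", aggregatePerBatch)
}
```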

Den Odell 1 week ago

The Main Thread Is Not Yours

When a user visits your site or app, their browser dedicates a single thread to running your JavaScript, handling their interactions, and painting what they see on the screen. This is the main thread, and it’s the direct link between your code and the person using it. As developers, we often use it without considering the end user and their device, which could be anything from a mid-range phone to a high-end gaming rig. We don’t think about the fact that the main thread doesn’t belong to us; it belongs to them.

I’ve watched this mistake get repeated for years: we burn through the user’s main thread budget as if it were free, and then act surprised when the interface feels broken. Every millisecond you spend executing JavaScript is a millisecond the browser can’t spend responding to a click, updating a scroll position, or acknowledging that the user did just try to type something. When your code runs long, you’re not causing "jank" in some abstract technical sense; you’re ignoring someone who’s trying to talk to you.

Because the main thread can only do one thing at a time, everything else waits while your JavaScript executes: clicks queue up, scrolls freeze, and keystrokes pile up hoping you’ll finish soon. If your code takes 50ms to respond, nobody notices, but at 500ms the interface starts feeling sluggish, and after several seconds the browser may offer to kill your page entirely. Users don’t know why the interface isn’t responding. They don’t see your code executing; they just see a broken experience and blame themselves, then the browser, then you, in that order.

Browser vendors have spent years studying how humans perceive responsiveness, and the research converged on a threshold: respond to user input within 100ms and the interaction feels instant, push past 200ms and users notice the delay. The industry formalized this as the Interaction to Next Paint (INP) metric, where anything over 200ms is considered poor and now affects your search rankings. But that 200ms budget isn’t just for your JavaScript. The browser needs time for style calculations, layout, and painting, so your code gets what’s left: maybe 50ms per interaction before things start feeling wrong. That’s your allocation from a resource you don’t own.

The web platform has evolved specifically to help you be a better guest on the main thread, and many of these APIs exist because browser engineers got tired of watching developers block the thread unnecessarily. Web Workers let you run JavaScript in a completely separate thread. Heavy computation, whether parsing large datasets, image processing, or complex calculations, can happen in a Worker without blocking the main thread at all. Workers can’t touch the DOM, but that constraint is deliberate since it forces a clean separation between "work" and "interaction."

requestIdleCallback lets you run code only when the browser has nothing better to do. (Due to a WebKit bug, Safari support is still pending at time of writing.) When the user is actively interacting, your callback waits; when things are quiet, your code gets a turn. This is ideal for non-urgent work like analytics, pre-fetching, or background updates.

isInputPending (Chromium-only for now) is perhaps the most user-respecting API of the lot, because it lets you check mid-task whether someone is waiting for you. You’re explicitly asking "is someone trying to get my attention?" and if the answer is yes, you stop and let them. Sketches of all three APIs follow below.
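The post’s original code samples didn’t survive this copy, so here are minimal sketches of the three APIs just mentioned. The worker file name and the largeDataset, render, analyticsQueue, sendAnalyticsEvent, and process helpers are placeholders for illustration, not code from the original post.

```js
// 1. Web Worker: move heavy parsing off the main thread.
const worker = new Worker("parse-worker.js"); // hypothetical worker script
worker.postMessage(largeDataset);
worker.onmessage = (event) => render(event.data);

// 2. requestIdleCallback: run low-priority work only when the browser is idle.
requestIdleCallback((deadline) => {
  while (deadline.timeRemaining() > 0 && analyticsQueue.length > 0) {
    sendAnalyticsEvent(analyticsQueue.shift());
  }
});

// 3. isInputPending: yield mid-task as soon as the user needs the thread.
async function processAll(items) {
  while (items.length > 0) {
    if (navigator.scheduling?.isInputPending()) {
      // Someone is waiting: give the main thread back for one task.
      await new Promise((resolve) => setTimeout(resolve, 0));
      continue;
    }
    process(items.shift());
  }
}
```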
The obvious main thread crimes like infinite loops or rendering 100,000 table rows are easy to spot, but the subtle ones look harmless. Calling JSON.parse, for example, on a large API response blocks the main thread until parsing completes, and while this feels instant on a developer’s machine, a mid-range phone with a throttled CPU and competing browser tabs might take 300ms to finish the same operation, ignoring the user’s interactions the whole time. The main thread doesn’t degrade gracefully; it’s either responsive or it isn’t, and your users are running your code in conditions you’ve probably never tested.

You can’t manage what you can’t measure, and Chrome DevTools’ Performance panel shows exactly where your main thread time goes if you know where to look. Find the "Main" track and watch for long yellow blocks of JavaScript execution. Tasks exceeding 50ms get flagged with red shading to mark the overtime portion. Use the Insights pane to surface these automatically if you prefer a guided approach. For more precise instrumentation, the performance.mark() and performance.measure() APIs let you time specific operations in your own code. The Web Vitals library can capture INP scores from real users across all major browsers in production, and when you see spikes you’ll know where to start investigating.

Before your application code runs a single line, your framework has already spent some of the user’s main thread budget on initialization, hydration, and virtual DOM reconciliation. This isn’t an argument against frameworks so much as an argument for understanding what you’re spending. A framework that costs 200ms to hydrate has consumed four times your per-interaction budget before you’ve done anything, and that needs to be a conscious choice you’re making, rather than an accident. Some frameworks have started taking this seriously: Qwik’s "resumability" avoids hydration entirely, while React’s concurrent features let rendering yield to user input. These are all responses to the same fundamental constraint, which is that the main thread is finite and we’ve been spending it carelessly.

The technical solutions matter, but they follow from a shift in perspective, and when I finally internalized that the main thread belongs to the user, not to me, my own decisions started to change. Performance stops being about how fast your code executes and starts being about how responsive the interface stays while your code executes. Blocking the main thread stops being an implementation detail and starts feeling like taking something that isn’t yours. The browser gave us a single thread of execution, and it gave our users that same thread for interacting with what we built. The least we can do is share it fairly.


The XOR Cache: A Catalyst for Compression

The XOR Cache: A Catalyst for Compression
Zhewen Pan and Joshua San Miguel
ISCA'25

Inclusive caches seem a bit wasteful. This paper describes a clever mechanism for reducing that waste: store two cache lines of data in each physical cache line! In most of the paper there are only two caches in play (L1 and LLC), but the idea generalizes.

Fig. 1 illustrates the core concept. In this example, there are two cache lines (A and B) in the LLC. A is also present in the L1 cache of a core. Here is the punchline: the LLC stores A ⊕ B, the bitwise XOR of the two lines. If the core which has cached A experiences a miss when trying to access B, then the core resolves the miss by asking the LLC for A ⊕ B. Once that data arrives at the L1, B can be recovered by computing A ⊕ (A ⊕ B). The L1 stores A and B separately, so as to not impact L1 hit latency.

Source: https://dl.acm.org/doi/10.1145/3695053.3730995

Coherence Protocol

The mechanics of doing this correctly are implemented in a cache coherence protocol described in section 4 of the paper. We’ve discussed the local recovery case, where the core which needs B already holds A in its L1 cache. If the core which requires B does not hold A, then two fallbacks are possible. Before describing those cases, the following property is important to highlight. The cache coherence protocol ensures that if the LLC holds A ⊕ B, then some L1 cache in the system will hold a copy of either A or B. If some action was about to violate this property, then the LLC would request a copy of A or B from some L1, and use it to split the merged line into separate cache lines in the LLC holding A and B.

Direct forwarding occurs when some other core holds a copy of B. In this case, the system requests the other core to send B to the core that needs it. The final case is called remote recovery. If the LLC holds A ⊕ B and no L1 cache holds a copy of B, then some L1 cache in the system must hold a copy of A. The LLC sends A ⊕ B to that core which has A locally cached. That core computes A ⊕ (A ⊕ B) to recover a copy of B and sends it to the requestor.

Another tricky case to handle is when a core writes to A or B. The cache coherence protocol handles this case similarly to eviction and ensures that the LLC will split cache lines as necessary so that all data is always recoverable.

The LLC has a lot of freedom when it comes to deciding which cache lines to pair up. The policies described in this paper optimize for intra-cache line compression (compressing the data within a single cache line). The LLC hardware maintains a hash table. When searching for a partner for cache line A, the LLC computes a hash of the contents of A to find a set of potential partners. One hash function described by the paper is sparse byte labeling, which produces 6 bits for each 8-byte word in a cache line. Each bit is set to 0 if the corresponding byte in the word is zero. The lower two bytes of each word are ignored. The idea here is that frequently the upper bytes of a word are zero. If two cache lines have the same byte label, then when you XOR them together the merged cache line will have many zero bytes (i.e., they have low entropy and are thus compressible). The LLC can optimize for this case by storing compressed data in the cache and thus increasing its effective capacity.

This paper relies on prior work related to compressed caches. The takeaway is that not only is there a potential 2x savings possible from logically storing two cache lines in one physical location, but there are also further savings in compressing these merged cache lines.
Fig. 13 compares the compression ratio achieved by XOR cache against prior work (taller bars are better). The right-most set of bars shows the geometric mean:

Source: https://dl.acm.org/doi/10.1145/3695053.3730995

Fig. 15 shows performance impacts associated with this scheme:

Source: https://dl.acm.org/doi/10.1145/3695053.3730995

Dangling Pointers

It seems to me that XOR Cache mostly benefits from cache lines that are rarely written. I wonder if there are ways to predict if a particular cache line is likely to be written in the near future.


Can Bundler Be as Fast as uv?

At RailsWorld earlier this year, I got nerd sniped by someone. They asked “why can’t Bundler be as fast as uv?” Immediately my inner voice said “YA, WHY CAN’T IT BE AS FAST AS UV????” My inner voice likes to shout at me, especially when someone asks a question so obvious I should have thought of it myself. Since then I’ve been thinking about and investigating this problem, going so far as to give a presentation at XO Ruby Portland about Bundler performance. I firmly believe the answer is “Bundler can be as fast as uv” (where “as fast” has a margin of error lol).

Fortunately, Andrew Nesbitt recently wrote a post called “How uv got so fast”, and I thought I would take this opportunity to review some of the highlights of the post and how techniques applied in uv can (or can’t) be applied to Bundler / RubyGems. I’d also like to discuss some of the existing bottlenecks in Bundler and what we can do to fix them. If you haven’t read Andrew’s post, I highly recommend giving it a read. I’m going to quote some parts of the post and try to reframe them with RubyGems / Bundler in mind.

Andrew opens the post talking about rewriting in Rust:

uv installs packages faster than pip by an order of magnitude. The usual explanation is “it’s written in Rust.” That’s true, but it doesn’t explain much. Plenty of tools are written in Rust without being notably fast. The interesting question is what design decisions made the difference.

This is such a good quote. I’m going to address “rewrite in Rust” a bit later in the post. But suffice to say, I think if we eliminate bottlenecks in Bundler such that the only viable option for performance improvements is to “rewrite in Rust”, then I’ll call it a success. I think rewrites give developers the freedom to “think outside the box”, and try techniques they might not have tried. In the case of uv, I think it gave the developers a good way to say “if we don’t have to worry about backwards compatibility, what could we achieve?”. I suspect it would be possible to write a uv in Python (PyUv?) that approaches the speeds of uv, and in fact much of the blog post goes on to talk about performance improvements that aren’t related to Rust.

pip’s slowness isn’t a failure of implementation. For years, Python packaging required executing code to find out what a package needed.

I didn’t know this about Python packages, and it doesn’t really apply to Ruby Gems, so I’m mostly going to skip this section. Ruby Gems are tar files, and one of the files in the tar file is a YAML representation of the GemSpec. This YAML file declares all dependencies for the Gem, so RubyGems can know, without evaling anything, what dependencies it needs to install before it can install any particular Gem. Additionally, RubyGems.org provides an API for asking about dependency information, which is actually the normal way of getting dependency info (again, no eval required).

There’s only one other thing from this section I’d like to quote:

PEP 658 (2022) put package metadata directly in the Simple Repository API, so resolvers could fetch dependency information without downloading wheels at all.

Fortunately RubyGems.org already provides the same information about gems. Reading through the number of PEPs required as well as the amount of time it took to get the standards in place was very eye opening for me. I can’t help but applaud folks in the Python community for doing this. It seems like a mountain of work, and they should really be proud of themselves.
I’m mostly going to skip this section except for one point:

Ignoring requires-python upper bounds. When a package says it requires python<4.0, uv ignores the upper bound and only checks the lower. This reduces resolver backtracking dramatically since upper bounds are almost always wrong. Packages declare python<4.0 because they haven’t tested on Python 4, not because they’ll actually break. The constraint is defensive, not predictive.

I think this is very very interesting. I don’t know how much time Bundler spends on doing “required Ruby version” bounds checking, but it feels like if uv can do it, so can we.

I really love that Andrew pointed out optimizations that could be made that don’t involve Rust. There are three points in this section that I want to pull out:

Parallel downloads. pip downloads packages one at a time. uv downloads many at once. Any language can do this.

This is absolutely true, and is a place where Bundler could improve. Bundler currently has a problem when it comes to parallel downloads, and needs a small architectural change as a fix.

The first problem is that Bundler tightly couples installing a gem with downloading the gem. You can read the installation code here. The problem with the method in question is that it inextricably links downloading the gem with installing it. This is a problem because we could be downloading gems while installing other gems, but we’re forced to wait because the installation method couples the two operations. Downloading gems can trivially be done in parallel since the files are just archives that can be fetched independently.

The second problem is the queuing system in the installation code. After gem resolution is complete, and Bundler knows what gems need to be installed, it queues them up for installation. You can find the queueing code here. The code takes some effort to understand. Basically it allows gems to be installed in parallel, but only gems that have already had their dependencies installed. So for example, if you have a dependency tree like “gem a depends on gem b, which depends on gem c” (a → b → c), then no gems will be installed (or downloaded) in parallel.

To demonstrate this problem in an easy-to-understand way, I built a slow Gem server. It generates a dependency tree of a → b → c (a depends on b, b depends on c), then starts a Gem server. The Gem server takes 3 seconds to return any Gem, so if we point Bundler at this Gem server and then profile Bundler, we can see the impact of the queueing system and download scheme. In my test app, I have a Gemfile that pulls in gem a from that tree. If we profile bundle install with Vernier, the swim lanes in the marker chart show that we get no parallelism during installation. We spend 3 seconds downloading gem c, then we install it. Then we spend 3 seconds downloading gem b, then we install it. Finally we spend 3 seconds downloading gem a, and we install it. Timing the process shows we take over 9 seconds to install (3 seconds per gem).

Contrast this with a Gemfile containing three gems which have no dependencies, but still take 3 seconds each to download: timing for that Gemfile shows it takes about 4 seconds. We were able to install the same number of gems in a fraction of the time. This is because Bundler is able to download siblings in the dependency tree in parallel, but unable to handle other relationships.

There is actually a good reason that Bundler insists dependencies are installed before the gems themselves: native extensions.
When installing native extensions, the installation process must run Ruby code (the extconf.rb file). Since the extconf.rb could require dependencies to be installed in order to run, we must install dependencies first. For example, a gem with a native extension may depend on another gem that is only used during the installation process, so that dependency needs to be installed before the extension can be compiled and installed.

However, if we were to decouple downloading from installation it would be possible for us to maintain the “dependencies are installed first” business requirement but speed up installation. In the a → b → c case, we could have been downloading gems a and b at the same time as gem c (or even while waiting on c to be installed). Additionally, pure Ruby gems don’t need to execute any code on installation. If we knew that we were installing a pure Ruby gem, it would be possible to relax the “dependencies are installed first” business requirement and get even more performance increases. The a → b → c case above could install all three gems in parallel since none of them execute Ruby code during installation.

I would propose we split installation into four discrete steps:

Download the gem
Unpack the gem
Compile the gem
Install the gem

Downloading and unpacking can be done trivially in parallel. We should unpack the gem to a temporary folder so that if the process crashes or the machine loses power, the user isn’t stuck with a half-installed gem. After we unpack the gem, we can discover whether the gem is a native extension or not. If it’s not a native extension, we “install” the gem simply by moving the temporary folder to the “correct” location. This step could even be a “hard link” step as discussed in the next point. If we discover that the gem is a native extension, then we can “pause” installation of that gem until its dependencies are installed, then resume (by compiling) at an appropriate time. Side note: there is a Bundler alternative that works mostly in this manner today. Here is a timing of the a → b → c case from above:

Let’s move on to the next point:

Global cache with hardlinks. pip copies packages into each virtual environment. uv keeps one copy globally and uses hardlinks.

I think this is a great idea, but I’d actually like to split the idea in two. First, RubyGems and Bundler should have a combined, global cache, full stop. I think that global cache should live in one well-known location, and we should store downloaded gem files there. Currently, both Bundler and RubyGems will use a Ruby version specific cache folder. In other words, if you install Rails on two different versions of Ruby, you get two copies of Rails and all its dependencies. Interestingly, there is an open ticket to implement this; it just needs to be done.

The second point is hardlinking on installation. The idea here is that rather than unpacking the gem multiple times, once per Ruby version, we simply unpack once and then hard link per Ruby version. I like this idea, but I think it should be implemented after some technical debt is paid: namely implementing a global cache and unifying Bundler / RubyGems code paths.

On to the next point:

PubGrub resolver

Actually Bundler already uses a Ruby implementation of the PubGrub resolver. You can see it here. Unfortunately, RubyGems still uses the molinillo resolver. In other words you use a different resolver depending on whether you do gem install or bundle install. I don’t really think this is a big deal since the vast majority of users will be doing bundle install most of the time.
However, I do think this discrepancy is some technical debt that should be addressed, and I think this should be addressed via unification of the RubyGems and Bundler codebases (today they both live in the same repository, but the code isn’t necessarily combined).

Let’s move on to the next section of Andrew’s post. Andrew first mentions “Zero-copy deserialization”. This is of course an important technique, but I’m not 100% sure where we would utilize it in RubyGems / Bundler. I think that today we parse the YAML spec on installation, and that could be a target. But I also think we could install most gems without looking at the YAML gemspec at all.

Thread-level parallelism. Python’s GIL forces parallel work into separate processes, with IPC overhead and data copying.

This is an interesting point. I’m not sure what work pip needed to do in separate processes. Installing a pure Ruby gem is mostly an IO bound task, with some ZLIB mixed in. Both of these things (IO and ZLIB processing) release Ruby’s GVL, so it’s possible for us to do things truly in parallel. I imagine this is similar for Python / pip, but I really have no idea.

Given the stated challenges with Python’s GIL, you might wonder whether Ruby’s GVL presents similar parallelism problems for Bundler. I don’t think so, and in fact I think Ruby’s GVL gets kind of a bad rap. It prevents us from running CPU bound Ruby code in parallel. Ractors address this, and Bundler could possibly leverage them in the future, but since installing Gems is mostly an IO bound task I’m not sure what the advantage would be (possibly the version solver, but I’m not sure what can be parallelized in there). The GVL does allow us to run IO bound work in parallel with CPU bound Ruby code. CPU bound native extensions are allowed to release the GVL, allowing Ruby code to run in parallel with the native extension’s CPU bound code. In other words, Ruby’s GVL allows us to safely run work in parallel.

That said, the GVL can work against us because releasing and acquiring the GVL takes time. If you have a system call that is very fast, releasing and acquiring the GVL could end up being a large percentage of that call. For example, if you do a read with a very small buffer, you could encounter a situation where GVL bookkeeping is the majority of the time. A bummer is that Ruby Gem packages usually contain lots of very small files, so this problem could be impacting us. The good news is that this problem can be solved in Ruby itself, and indeed some work is being done on it today.

No interpreter startup. Every time pip spawns a subprocess, it pays Python’s startup cost.

Obviously Ruby has this same problem. That said, we only start Ruby subprocesses when installing native extensions. I think native extensions make up the minority of gems installed, and even when installing a native extension, it isn’t Ruby startup that is the bottleneck. Usually the bottleneck is compilation / linking time (as we’ll see in the next post).

Compact version representation. uv packs versions into u64 integers where possible, making comparison and hashing fast.

This is a cool optimization, but I don’t think it’s actually Rust specific. Comparing integers is much faster than comparing version objects. The idea is that you take a version number, say 1.2.3, and then pack each part of the version into a single integer. For example, we could pack the major, minor, and patch numbers into separate bit fields of one integer, so that comparing two versions becomes a plain integer comparison (a rough sketch follows below).
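Here is a rough Ruby sketch of that packing idea. The 20-bits-per-segment layout is my own guess at a workable scheme (it keeps the result inside Ruby’s 62-bit Fixnum range, so the packed value stays an immediate); it is not uv’s actual encoding, and it ignores prerelease segments.

```ruby
# Pack a "major.minor.patch" version into one integer so that comparing two
# versions becomes a plain integer comparison. Returns nil if any segment is
# too large to fit, in which case a caller would fall back to a full
# Gem::Version-style comparison.
SEGMENT_BITS = 20
SEGMENT_MAX  = (1 << SEGMENT_BITS) - 1

def pack_version(string)
  segments = string.split(".").map!(&:to_i)
  return nil if segments.size > 3 || segments.any? { |s| s > SEGMENT_MAX }

  major, minor, patch = segments
  ((major || 0) << (2 * SEGMENT_BITS)) |
    ((minor || 0) << SEGMENT_BITS) |
    (patch || 0)
end

p pack_version("1.2.3") < pack_version("1.2.4")  # => true
p pack_version("1.10.0") > pack_version("1.9.9") # => true, unlike a string comparison
```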
It should be possible to use this trick in Ruby and encode versions to integer immediates, which would unlock performance in the resolver. Rust has an advantage here: compiled native code comparing u64s will always be faster than Ruby, even with immediates. However, I would bet that with YJIT or ZJIT in play, this gap could be closed enough that no end user would notice the difference between a Rust or Ruby implementation of Bundler. I started refactoring the version object so that we might start doing this, but we ended up reverting it because of backwards compatibility (I am jealous of uv in that regard). I think the right way to do this is to refactor the solver entry point and ensure all version requirements are encoded as integer immediates before entering the solver. We could keep the existing API as “user facing” and design a more internal API that the solver uses. I am very interested in reading the version encoding scheme in uv. My intuition is that minor numbers tend to get larger than major numbers, so would minor numbers have more dedicated bits? Would it even matter with 64 bits?

I’m going to quote Andrew’s last 2 paragraphs:

uv is fast because of what it doesn’t do, not because of what language it’s written in. The standards work of PEP 518, 517, 621, and 658 made fast package management possible. Dropping eggs, pip.conf, and permissive parsing made it achievable. Rust makes it a bit faster still.

pip could implement parallel downloads, global caching, and metadata-only resolution tomorrow. It doesn’t, largely because backwards compatibility with fifteen years of edge cases takes precedence. But it means pip will always be slower than a tool that starts fresh with modern assumptions.

I think these are very good points. The difference is that in RubyGems and Bundler, we already have the infrastructure in place for writing a “fast as uv” package manager. The difficult part is dealing with backwards compatibility, and navigating two legacy codebases. I think this is the real advantage the uv developers had. That said, I am very optimistic that we could “repair the plane mid-flight” so to speak, and have the best of both worlds: backwards compatibility and speed.

I mentioned at the top of the post I would address “rewrite it in Rust”, and I think Andrew’s own quote mostly does that for me. I think we could have 99% of the performance improvements while still maintaining a Ruby codebase. Of course if we rewrote it in Rust, you could squeeze an extra 1% out, but would it be worthwhile? I don’t think so.

I have a lot more to say about this topic, and I feel like this post is getting kind of long, so I’m going to end it here. Please look out for part 2, which I’m tentatively calling “What makes Bundler / RubyGems slow?” This post was very “can we make RubyGems / Bundler do what uv does?” (the answer is “yes”). In part 2 I want to get more hands-on by discussing how to profile Bundler and RubyGems, what specifically makes them slow in the real world, and what we can do about it.

I want to end this post by saying “thank you” to Andrew for writing such a great post about how uv got so fast.


Does the Internet know what time it is?

Time is one of those things that is significantly harder to deal with than you’d naively expect. It’s common in computing to assume that computers know the current time. After all, there are protocols like NTP for synchronizing computer clocks, and they presumably work well and are widely used. Practically speaking, what kinds of hazards lie hidden here?

I’ll start this post with some questions:

How often are computer clocks set to the wrong time?
How large do these offsets grow?
Can we model clock offsets, and make predictions about them?
Are out-of-sync clocks a historical concern that we’ve largely solved, or is this still a concern?

Some quick definitions:

Clock skew: the rate at which a clock deviates from a one-second-per-second standard, often measured in parts per million
Clock offset: the difference between the displayed time and Coordinated Universal Time (UTC), often measured in seconds

I just checked the system time of my laptop against time.gov, which reports a -0.073s offset. So for a N=1 sample size, I’m cautiously optimistic. There are research papers, like Spanner, TrueTime & The CAP Theorem, that describe custom systems that rely on atomic clocks and GPS to provide clock services with very low, bounded error. While these are amazing feats of engineering, they remain out of reach for most applications. What if we needed to build a system that spanned countless computers across the Internet and required each to have a fairly accurate clock?

I wasn’t able to find a study that measured clock offset in this way. There are, however, a number of studies that measure clock skew (especially for fingerprinting). Many of these studies are dated, so it seems like now is a good time for a new measurement. This post is my attempt to measure clock offsets, Internet-wide.

When processing HTTP requests, servers fill the HTTP Date header. This header should indicate “the date and time at which the message originated”. Lots of web servers generate responses on-the-fly, so the Date header reveals the server’s clock in seconds. Looks pretty good, so I’ll use this as the basis for the measurements. Unfortunately, there are a bunch of challenges we’ll need to deal with.

First, resources may get cached in a CDN for some time and the Date header would reflect when the resource was generated instead of the server’s current time reference. Requesting a randomized path will bypass the CDN, typically generating a 404 error. Unfortunately, I found some servers will set the Date header to the last modified time of the 404 page template. I considered performing multiple lookups to see how the Date header advances between requests, but some websites are distributed, so we’d be measuring a different system’s clock with each request. The safest way to avoid this hazard is to only consider Date headers that are offset to the future, which is the approach we’ll use.

HTTP responses will take some time to generate; sometimes spanning a couple seconds. We can’t be sure when the Date header was filled, but we know it was before we got the response. Since we only want to measure timestamps that are from the future, we can subtract the time at which we received the response from the timestamp in the Date header. This gives a lower bound for the underlying clock offset.

When performing broad Internet scans you’ll find many servers have invalid or expired TLS certificates. For the sake of collecting more data I’ve disabled certificate validations while scanning. Finally, our own system clock has skew. To minimize the effect of local clock skew I made sure I had a synchronization service running (systemd-timesyncd on Debian) and double checked my offset on time.gov. All offset measurements are given in whole seconds, rounding towards zero, to account for this challenge.

The measurement tool is mostly a wrapper around a small Go snippet (a sketch follows below). For performance reasons, the code performs an HTTP HEAD request instead of the heavier GET request.
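The original snippet didn’t survive this copy of the post, so here is a minimal sketch of the kind of measurement described above. The timeout value, URL, and error handling are my own choices, not necessarily the author’s.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// measureOffset sends a HEAD request and returns the server's Date header
// minus the local time at which the response was received: a lower bound on
// the server's clock offset, as described above.
func measureOffset(url string) (time.Duration, error) {
	client := &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			// Certificate validation is disabled for the scan, as in the post.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Head(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	received := time.Now()

	serverTime, err := http.ParseTime(resp.Header.Get("Date"))
	if err != nil {
		return 0, err
	}
	return serverTime.Sub(received), nil
}

func main() {
	offset, err := measureOffset("https://example.com/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Truncate rounds toward zero, matching the whole-second convention above.
	fmt.Printf("offset: %v\n", offset.Truncate(time.Second))
}
```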
Starting in late November I scanned all domain names on the Tranco top 1,000,000 domains list (NNYYW). I scanned slowly to avoid any undesired load on third-party systems, with the scan lasting 25 days. Of the million domain names, 241,570 systems could not be measured due to connection errors such as timeouts, DNS lookup failures, connection refusals, or similar challenges. Not all the domains on the Tranco list have Internet-accessible HTTPS servers running at the apex on the standard port, so these errors are expected. Further issues included HTTP responses that lacked a Date header (13,098) or had an unparsable Date header (102). In all, 745,230 domain names were successfully measured.

The vast majority of the measured domains had an offset of zero (710,189; 95.3%). Date headers set to the future impacted 12,717 domains (1.7%). Date headers set to the past will be otherwise ignored, but impacted 22,324 domains (3.0%). The largest positive offset was 39,867,698 seconds, landing us 461 days in the future (March 2027 at scan time).

If we graph this we’ll see that the vast majority of our non-negative offsets are very near zero. We also observe that very large offsets are possible but quite rare. I can’t make out many useful trends from this graph. The large amount of data points near zero seconds skews the vertical scale and the huge offsets skew the horizontal scale. Adjusting the graph to focus on 10 seconds to 86,400 seconds (one day) and switching offsets to a log scale provides this graph:

This curve is much closer to my expectations. I can see that small offsets of less than a minute have many observances. One thing I didn’t expect was spikes at intervals of whole hours, but it makes a lot of sense in hindsight. This next graph shows the first day, emphasizing data points that exactly align to whole hour offsets. The largest spikes occur at one, three, and nine hours with no clear trend. Thankfully, geography seems to explain these spikes quite well. Here are the top-level domains (TLDs) of domains seen with exactly one hour offset:

Germany (.DE), Czech Republic (.CZ), Sweden (.SE), Norway (.NO), Italy (.IT), and Belgium (.BE) are all currently using Central European Time, which uses offset UTC+1. TLDs of domains seen with exactly three hour offset:

The country-code top-level domain (ccTLD) for Russia is .RU and Moscow Standard Time is UTC+3. TLDs of domains with exactly nine hour offset:

South Korea (.KR) and Cocos (Keeling) Islands (.CC) follow UTC+9. So I strongly suspect these whole-hour offset spikes are driven by local time zones. These systems seem to have set their UTC time to the local time, perhaps due to an administrator who set the time manually to local time, instead of using UTC and setting their timezone. While this type of error is quite rare, impacting only 49 of the measured domain names (0.007%), the large offsets could be problematic.

Another anomalous datapoint at 113 seconds caught my attention. Almost all of the data points at the 113 second offset are for domain names hosted by the same internet service provider using the same IP block. A single server can handle traffic for many domain names, all of which will have the same clock offset. We’ll see more examples of this pattern later.

Knowing that we have some anomalous spikes due to shared hosting and spikes at whole hour intervals due to timezone issues, I smoothed out the data to perform modeling. Here’s a graph from zero to fifty-nine minutes, aggregating ten second periods using the median.
I added a power-law trend line, which matches the data quite well (R² = 0.92). I expected to see a power-law distribution, as these are common when modeling randomized errors, so my intuition feels confirmed. The average clock offset, among those with a non-negative offset, was 6544.8 seconds (about 109 minutes). The median clock offset was zero. As with other power-law distributions, the average doesn’t feel like a useful measure due to the skew of the long tail.

The HTTP Date header measurement has proven useful for assessing offsets of modern clocks, but I’m also interested in historical trends. I expect that computers are getting better at keeping clocks synchronized as we get better at building hardware, but can we measure it? I know of some bizarre issues that have popped up over time, like this Windows STS bug, so it’s even possible we’ve regressed. Historical measurements require us to ask “when was this timestamp generated?” and measure the error. This is obviously tricky as the point of the timestamp is to record the time, but we suspect the timestamp has error. Somehow, we’ve got to find a more accurate time to compare each timestamp against.

It took me a while to think of a useful dataset, but I think git commits provide a viable way to measure historical clock offsets. We’ve got to analyze git commit timestamps carefully as there’s lots of ways timestamps can be out of order even when clocks are fully synchronized. Let’s first understand how “author time” and “commit time” work. When you write some code and commit it, you’ve “authored” the code. The git history at this point will show both an “author time” and “commit time” of the same moment. Later you may merge that code into a “main” branch, which updates the “commit time” to the time of the merge. When you’re working on a team you may see code merged in an order that’s opposite the order it was written, meaning the “author times” can be out of chronological order. The “commit times”, however, should be in order.

The Linux kernel source tree is a good candidate for analysis. Linux was one of the first adopters of git, as git was written to help Linux switch source control systems. My local git clone of Linux shows 1,397,347 commits starting from 2005. It may be the largest substantive project using git, and provides ample data for us to detect timestamp-based anomalies. I extracted the timing and other metadata from the git history using git log with a custom pretty format (a sketch follows below).

Here’s a graph of the “commit time”, aggregating 1000-commit blocks using various percentiles, showing that commit times are mostly increasing. While there’s evidence of anomalous commit timestamps here, there are too few for us to find meaningful trends. Let’s keep looking. Here’s a graph of the “author time” showing much more variation:

We should expect to see author times vary, as it takes differing amounts of time for code to be accepted and merged. But there are also large anomalies here, including author times that are decidedly in the future and author times that pre-date both git and Linux. We can get more detail in the graph by zooming into the years Linux has been developed thus far:

This graph tells a story about commits usually getting merged quickly, but some taking a long time to be accepted. Certain code taking longer to review is expected, so the descending blue data points are expected. There are many different measurements we could perform here, but I think the most useful will be “author time” minus “commit time”.
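The extraction command didn’t survive this copy either; below is a sketch of how the author-minus-commit measurement could be pulled from a repository. The exact format string and the counting logic are my own reconstruction, not necessarily what the author ran.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

func main() {
	// %at = author date (unix timestamp), %ct = committer date (unix timestamp).
	cmd := exec.Command("git", "log", "--pretty=format:%at %ct")
	out, err := cmd.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	total, future := 0, 0
	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		authorTime, _ := strconv.ParseInt(fields[0], 10, 64)
		commitTime, _ := strconv.ParseInt(fields[1], 10, 64)
		total++
		if authorTime > commitTime {
			future++ // a "time travelling" commit
		}
	}
	if err := cmd.Wait(); err != nil {
		panic(err)
	}
	fmt.Printf("%d of %d commits have an author time after the commit time\n", future, total)
}
```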
Typically, we expect that code is developed, committed, reviewed, approved, and finally merged. This provides an author time that is less than the commit time, as review and approval steps take time. A positive value of author time minus commit time would indicate that the code was authored in the future, relative to the commit timestamp. We can’t be sure whether the author time or the commit time was incorrect (or both), but collectively they record a timestamp error. These commits are anomalous as the code was seemingly written, committed, then traveled back in time to be merged. We’ll refer to these commits as time travelling commits, although timestamp errors are very likely the correct interpretation.

Looking at the Linux git repo, I see 1,397,347 commits, of which 1,773 are time travelling commits. This is 0.127% of all commits, a somewhat rare occurrence. Here’s a graph of these timestamp errors:

There are some fascinating patterns here! Ignoring the marked regions for a moment, I notice that offsets below 100 seconds are rare; this is quite unlike the pattern seen for HTTP Date header analysis. I suspect the challenge is that there is usually a delay between when a commit is authored and when it is merged. Code often needs testing and review before it can be merged; those tasks absorb any small timestamp errors. This will make modeling historical clock offset trends much more difficult.

The region marked “A” shows many errors below 100 seconds, especially along linear spikes. There appear to be two committers in this region, both using “de.ibm.com” in their email address. The majority of authors in region A have “ibm.com” in their email address. So these anomalies appear to be largely due to a single company. These commits appear to have the author timestamp rewritten to a (mostly) sequential pattern. Here are the commits for two of the days:

The author dates here are perfectly sequential, with one second between each commit. The commit dates also increase, but more slowly, such that the difference between author date and commit date increases with later commits. I suspect these timestamps were set via some sort of automation software when processing a batch of commits. The software may have initially set both author and commit timestamps to the current time, but then incremented the author timestamp by one with each subsequent commit while continuing to use the current time for the commit timestamp. If the software processed commits faster than one per second, we’d see this pattern. I don’t think these timestamps are evidence of mis-set clocks, but rather an automated system with poor timestamp handling code.

The region marked “B” shows many errors near a 15.5 hour offset (with several exactly on the half-hour mark). Looking at the email addresses I see several “com.au” domains, suggesting some participants were located in Australia (.AU). Australia uses several time zones, including UTC+8, UTC+8:45, UTC+9:30, UTC+10, UTC+10:30, and UTC+11… but nothing near 15.5 hours. The GitHub profile for one of the committers shows a current timezone of UTC-5. This suggests that an author in Australia and a committer in the Americas both mis-set their clocks, perhaps combining UTC+10:30 and UTC-5 to reach the 15.5 hour offset. We saw examples of timezone related clock errors when looking at the HTTP Date header; this appears to be an example of two timezone errors combining.

The region marked “C” shows many errors around 30 to 260 days, which are unusually large errors.
The committer for each of these is the same email address, using the “kernel.org” domain name. If we render the author and committer timestamps we’ll see this pattern:

I notice that the day in the author timestamp usually matches the month in the committer timestamp, and when it doesn’t it’s one smaller. When the author day and the committer month match, the author month is less than or the same as the committer day. The days in the author timestamp vary between one and nine, while the days in the commit timestamp vary between eight and twenty-one. This suggests that the author timestamp was set incorrectly, swapping the day and month. Looking at these commits relative to the surrounding commits, the commit timestamps appear accurate. If I fix the author timestamps by swapping the day and month, then the data points are much more reasonable. The author timestamps are no longer after the commit timestamps, with differences varying between zero and thirty-six days, and an average of nine days. So it seems these author timestamps were generated incorrectly, swapping month and day, causing them to appear to travel back in time.

Git has had code for mitigating these sorts of issues since 2006, like this code that limits timestamps to ten days in the future. I’m not sure why the commits in region “C” weren’t flagged as erroneous. Perhaps a different code path was used? Region “C” doesn’t appear to be related to a mis-set system clock, but instead a date parsing error that swapped day and month. This type of error is common when working between different locales, as the ordering of month and day in a date varies by country.

Finally, the region marked “D” shows a relatively sparse collection of errors. This may suggest that git timestamp related errors are becoming less common. But there’s an analytical hazard here: we’re measuring timestamps that are known to time travel. It’s possible that this region will experience more errors in the future! I suspect regions “A” and “C” are due to software bugs, not mis-set clocks. Region “B” may be due to two clocks, both mis-set due to timezone handling errors. It seems unwise to assume that I’ve caught all the anomalies and can attribute the rest of the data points to mis-set clocks. Let’s continue with that assumption anyway, knowing that we’re not on solid ground.

The Linux kernel source tree is an interesting code base, but we should look at more projects. This next graph counts positive values of “author time” minus “commit time” for Linux, Ruby, Kubernetes, Git, and OpenSSL. The number of erroneous timestamps is measured per-project against the total commits in each year. It’s difficult to see a trend here. Linux saw the most time travelling commits from 2008 through 2011, each year above 0.4%, and has been below 0.1% since 2015. Git had zero time travelling commits since 2014, with a prior rate below 0.1%.

Digging into the raw data I notice that many time travelling commits were generated by the same pair of accounts. For Kubernetes, 78% were authored by one bot account and merged by another, although these were only one second in the future. These appear to be due to the “Kubernetes Submit Queue”, where the k8s-merge-robot authors a commit on one system and the merge happens within GitHub. For Ruby, 89% were authored by the same user and committed by a single account, with an offset near 30 seconds.
I attempted to correct for these biases by deduplicating commit-author pairs, but the remaining data points were too sparse to perform meaningful analysis. Time travelling usually reaches its peak two to four years after a project adopts source control, ramping up before, and generally falling after. This hints at a project-management-related cause for these spikes. I’ll speculate that this is due to developers initially using Git cautiously as it is new to them, then as they get comfortable with Git they begin to build custom automation systems. These new automation systems have bugs or lack well-synchronized clocks, but these issues are addressed over time.

I don’t think I can make any conclusion from this data about system clocks being better managed over time. This data doesn’t support my expectation that erroneous timestamps would reduce over time, and I’ll call this a “negative result”. There are too many challenges in this data set.

This analysis explored timestamps impacted by suspected mis-set clocks. HTTP scanning found that 1.7% of domain names had a Date header mis-set to the future. Web server offsets strongly matched a power-law distribution such that small offsets were by far the most common. Git commit analysis found up to 0.65% of commits (Linux, 2009) had author timestamps in the future, relative to the commit timestamp. No clear historical trend was discovered.

Timestamps with huge offsets were detected. The largest Linux commit timestamp was in the year 2085 and the largest HTTP Date header was in the year 2027. This shows that while small offsets were the most common, large errors will occur. Many underlying causes were proposed while analyzing the data, including timezone handling errors, date format parsing errors, and timestamps being overwritten by automated systems. Many data points were caused by the same group, like IP address blocks used by many domains or Git users (or robots) interacting with multiple commits. Deduplicating these effects left too few data points to perform trend analysis.

Synchronizing computer clocks and working with timestamps remains a challenge for the industry. I’m sure there are other data sets that support this kind of measurement. If you’ve got any, I’d love to hear what trends you can discover!

matklad 3 weeks ago

Static Allocation For Compilers

TigerBeetle famously uses “static allocation”. Infamously, the use of the term is idiosyncratic: what is meant is not static arrays, as found in embedded development, but rather a weaker “no allocation after startup” form. The amount of memory a TigerBeetle process uses is not hard-coded into the ELF binary. It depends on the runtime command line arguments. However, all allocation happens at startup, and there’s no deallocation. The long-lived event loop goes round and round happily without allocating.

I’ve wondered for years if a similar technique is applicable to compilers. It seemed impossible, but today I’ve managed to extract something actionable from this idea?

Static allocation depends on the physics of the underlying problem. And distributed databases have surprisingly simple physics, at least in the case of TigerBeetle. The only inputs and outputs of the system are messages. Each message is finite in size (1MiB). The actual data of the system is stored on disk and can be arbitrarily large. But the diff applied by a single message is finite. And, if your input is finite, and your output is finite, it’s actually quite hard to need to allocate extra memory!

This is worth emphasizing — it might seem like doing static allocation is tough and requires constant vigilance and manual accounting for resources. In practice, I learned that it is surprisingly compositional. As long as inputs and outputs of a system are finite, non-allocating processing is easy. And you can put two such systems together without much trouble. routing.zig is a good example of such an isolated subsystem.

The only issue here is that there isn’t a physical limit on how many messages can arrive at the same time. Obviously, you can’t process arbitrarily many messages simultaneously. But in the context of a distributed system over an unreliable network, a safe move is to drop a message on the floor if the required processing resources are not available. Counter-intuitively, not allocating is simpler than allocating, provided that you can pull it off!

Alas, it seems impossible to pull it off for compilers. You could say something like “hey, the largest program will have at most one million functions”, but that will lead to both wasted memory and poor user experience. You could also use a single yolo arena of a fixed size, like I did in Hard Mode Rust, but that isn’t at all similar to “static allocation”. With arenas, the size is fixed explicitly, but you can OOM. With static allocation it is the opposite — no OOM, but you don’t know how much memory you’ll need until startup finishes!

The “problem size” for a compiler isn’t fixed — both the input (source code) and the output (executable) can be arbitrarily large. But that is also the case for TigerBeetle — the size of the database is not fixed, it’s just that TigerBeetle gets to cheat and store it on disk, rather than in RAM. And TigerBeetle doesn’t do “static allocation” on disk, it can fail with a disk-full error at runtime, and it includes a dynamic block allocator to avoid that as long as possible by re-using no longer relevant sectors.

So what we could say is that a compiler consumes arbitrarily large input, and produces arbitrarily large output, but those “do not count” for the purpose of static memory allocation. At the start, we set aside an “output arena” for storing finished, immutable results of the compiler’s work. We then say that this output is accumulated after processing a sequence of chunks, where chunk size is strictly finite.
While limiting the total size of the code base is unreasonable, limiting a single file to, say, 4 MiB (runtime-overridable) is fine. Compiling then essentially becomes a “stream processing” problem, where both inputs and outputs are arbitrarily large, but the filter program itself must execute in O(1) memory. With this setup, it is natural to use indexes rather than pointers for the “output data”, which then makes it easy to persist it to disk between changes. And it’s also natural to think about “chunks of changes” not only spatially (the compiler sees a new file), but also temporally (the compiler sees a new version of an old file). Are there any practical benefits here? I don’t know! But it seems worth playing around with. I feel that a strict separation between O(N) compiler output and O(1) intermediate processing artifacts can clarify a compiler’s architecture, and I won’t be too surprised if O(1) processing in compilers leads to simpler code, the same way it does for databases.
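As a loose sketch of this shape (my own illustration, not TigerBeetle’s or any real compiler’s code; the 4 MiB limit and all names are hypothetical), in Rust:

    // Fixed-size scratch buffer for O(1) per-chunk processing, plus an
    // append-only "output arena" addressed by indexes rather than pointers.
    const MAX_CHUNK: usize = 4 << 20; // hypothetical 4 MiB per-file budget

    #[derive(Clone, Copy)]
    struct DeclIndex(u32); // results refer to each other by index

    struct Output {
        decls: Vec<(DeclIndex, u32)>, // (parent, payload) -- stand-in for real IR
    }

    struct Scratch {
        buf: Vec<u8>, // allocated once at startup to MAX_CHUNK, then reused
    }

    fn compile_chunk(src: &[u8], scratch: &mut Scratch, out: &mut Output) -> Result<(), &'static str> {
        if src.len() > MAX_CHUNK {
            return Err("file exceeds the fixed processing budget");
        }
        scratch.buf[..src.len()].copy_from_slice(src);
        // ...parse and analyze entirely inside `scratch`, then append the
        // finished, immutable result to the arena:
        let parent = DeclIndex(out.decls.len() as u32);
        out.decls.push((parent, src.len() as u32));
        Ok(())
    }

    fn main() {
        let mut scratch = Scratch { buf: vec![0; MAX_CHUNK] }; // the only startup allocation
        let mut out = Output { decls: Vec::new() };
        compile_chunk(b"fn main() {}", &mut scratch, &mut out).unwrap();
        println!("{} decls", out.decls.len());
    }

The output arena grows with the input (the O(N) part that “does not count”), while the scratch buffer stays at its startup size.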

0 views
matklad 3 weeks ago

Newtype Index Pattern In Zig

In efficiency-minded code, it is idiomatic to use indexes rather than pointers. Indexes have several advantages. First, they save memory. Typically a 32-bit index is enough, a saving of four bytes per pointer on 64-bit architectures. I haven’t seen this measured, but my gut feeling is that this is much more impactful than it might initially seem. On modern architectures, saving memory saves time (and energy) as well, because the computing bottleneck is often the bit pipe between the memory and the CPU, not the computation per se. Dense data structures use the CPU cache more efficiently, removing the prohibitive latency of memory accesses. Bandwidth savings are even better: a smaller item size obviously improves bandwidth utilization, but having more items in cache obviates the need to use the bandwidth in the first place. Best case, the working set fits into the CPU cache! Note well that memory savings are evenly spread out. Using indexes makes every data structure slightly more compact, which improves performance across the board, regardless of hotspot distribution. It’s hard to notice the potential for such a saving in a profiler, and even harder to test out. For these two reasons, I would default to indexes for code where speed matters, even when I don’t have the code written yet to profile it! There’s also a more subtle way in which indexes save memory. Using indexes means storing multiple items in an array, but such dense storage contains extra information in the relative positions of the items. If you need to store a list of items, you can often avoid materializing the list of indexes by storing a range “pointing” into the shared storage. Occasionally, you can even do the UTF-8 trick and use just a single bit to mark the end of a list. The second benefit of indexes is more natural modeling of cyclic and recursive data structures. Creating a cycle fundamentally requires mutability somewhere (“tying the knot” in Haskell relies on the mutability of lazy thunks). This means that you need to make some pointers nullable, and that usually gets awkward even without a borrow checker behind your back. Even without cycles, just with recursion, pointers are problematic, due to a combination of two effects: pointers encourage recursive functions, and recursive data structures lead to arbitrarily long (but finite) chains of pointers. The combination works fine at small scale, but then it fails with a stack overflow in production every single time, requiring awkward workarounds. For example, serializing error traces from nested macro expansions as a deeply nested tree of JSON objects requires a stacker hack when parsing the output (which you’ll learn about only after crashes in the hands of macro-connoisseur users). Finally, indexes greatly help serialization: they make it trivial to communicate data structures both through space (sending a network message) and time (saving to disk and reading later). Indexes are naturally relocatable; it doesn’t matter where in memory they are. But this is just half of the serialization benefit. The other half is that, because everything is in a few arrays, you can do bulk serialization. You don’t need to write the items one by one; you can copy whole arrays around directly (but be careful not to leak data via padding, and be sure to checksum the result). The big problem with “naive” indexes is of course using the right index with the wrong array, or vice versa. The standard solution here is to introduce a newtype wrapper around the raw index. @andrewrk recently popularized a nice “happy accident of language design” pattern for this in Zig.
The core idea is to define an index via a non-exhaustive enum. In Zig, an enum designates a strongly-typed collection of integer constants, not a Rust-style ADT (there’s a tagged union for that). By default the backing integer type is chosen by the compiler, but you can override it manually, e.g. enum(u32). Finally, Zig allows making enums non-exhaustive by adding a trailing _ member: in a non-exhaustive enum, any numeric value is valid, and some have symbolic labels. The @intFromEnum and @enumFromInt builtins switch abstraction level between a raw integer and an enum value. So a non-exhaustive enum(u32) is a way to spell “u32, but a distinct type”. Note that there’s no strong encapsulation boundary here; anyone can turn a raw integer into the index with @enumFromInt. Zig just doesn’t provide language-enforced encapsulation mechanisms. Putting everything together, this is how I would model an n-ary tree with parent pointers in Zig. Some points of note:

As usual with indexes, you start by defining the collective noun first: the container type, rather than the individual node. In my experience, you usually don’t want an extra suffix in your index types; the short name names the index, not the underlying data. Nested types are good! Nesting the index type inside its container feels just right. For readability, the order is fields, then nested types, then functions. In the index enum, we have a couple of symbolic constants: one for the root node that is stored first, and one for whenever we want to apply offensive programming and make bad indexes blow up. Here, we use a sentinel value for the “null” parent. An alternative would be an optional, but that would waste space; another would be making the root its own parent. If you care about performance, it’s a good idea to assert the sizes of structures, not to prevent changes, but as a comment that explains to the reader just how large the struct is. I don’t know which of the two obvious ways to represent ranges I like more; I picked the one whose field names align in length. Both shapes for the API (methods on the container taking an index, or methods on the index taking the container) are reasonable. I don’t know which one I prefer more; I default to the former because it works even if there are several node arguments.

P.S. Apparently I also wrote a Rust version of this post a while back? https://matklad.github.io/2018/06/04/newtype-index-pattern.html
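As a rough sketch of the pattern, here is a Rust analogue of the n-ary tree with parent pointers, in the spirit of the Rust version linked in the P.S. (all names are illustrative, not from the original post):

    // Newtype index: the index *is* the public handle; the payload lives in Tree.
    #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
    struct Node(u32);

    impl Node {
        const ROOT: Node = Node(0);            // root is stored first
        const INVALID: Node = Node(u32::MAX);  // offensive-programming sentinel
    }

    struct NodeData {
        parent: Node,         // Node::INVALID plays the role of "no parent"
        children: (u32, u32), // a range into a shared `children` array
    }

    struct Tree {
        nodes: Vec<NodeData>,
        children: Vec<Node>,
    }

    impl Tree {
        fn parent(&self, node: Node) -> Node {
            self.nodes[node.0 as usize].parent
        }
        fn children(&self, node: Node) -> &[Node] {
            let (start, end) = self.nodes[node.0 as usize].children;
            &self.children[start as usize..end as usize]
        }
    }

    fn main() {
        let tree = Tree {
            nodes: vec![NodeData { parent: Node::INVALID, children: (0, 0) }],
            children: Vec::new(),
        };
        assert_eq!(tree.parent(Node::ROOT), Node::INVALID);
        assert!(tree.children(Node::ROOT).is_empty());
    }

The child lists are ranges into a single shared array, which is the “avoid materializing the list of indexes” trick mentioned earlier.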

0 views
Jeff Geerling 3 weeks ago

Big GPUs don't need big PCs

Ever since I got AMD , Intel , and Nvidia graphics cards to run on a Raspberry Pi, I had a nagging question: What's the point? The Raspberry Pi only has 1 lane of PCIe Gen 3 bandwidth available for a connection to an eGPU. That's not much. Especially considering a modern desktop has at least one slot with 16 lanes of PCIe Gen 5 bandwidth. That's 8 GT/s versus 512 GT/s. Not a fair fight.

0 views
The Jolly Teapot 3 weeks ago

The Club Racer Treatment

In 2022, I wrote a post called The Lotus philosophy applied to blog design, in which I tried to explain how the Lotus philosophy of lighter cars for improved performance could apply to web design, and to my blog in particular. I wrote: For as long as I can remember, I’ve been a fan of Lotus. From the Esprit featured in The Spy Who Loved Me (1977), the one in Accolade’s Test Drive video game from 1987, to my fascination with the choices made by the engineers with the 900 kg Elise (and later the Elise CR): Lotus is more than a simple car brand, it is a way to think about product design […] The most acute observers probably noticed my mention of the Lotus Elise CR. This car is, to me at least, a fantastic example of what a company can do when driven by principles and a well laid-out order of priorities. The Elise CR, which stands for Club Racer, was basically a special edition of the regular Lotus Elise, with various modifications aimed at better handling on the track, and lightened by about 25 kilograms compared to the base car. 1 One may think that a weight reduction of around 3% is nothing, that it doesn’t matter, and that it may not influence performance that much. And to be honest with you, I don’t really know. I just know that I was always fascinated by the engineering that went into saving those 25 kg on a roughly 900 kg car. Compared to the regular Elise, the CR had its seats fitted with less padding, its floor mats were removed, it had no radio, no A/C, and even the Lotus badge on the back was a sticker instead of the usual metal letters. The result was a car marginally faster, slightly better to drive, less comfortable, and less practical. If you planned to drive a Lotus Elise on regular roads, you’d be better off with a regular Elise. The Club Racer was a prize among purists, a demonstration of what could be done, and I loved that it existed. 2 In its essence, the Club Racer was not about the results on paper or the weight itself; it was about the effort, the craft, and the experience. It was about giving a damn. For a while now, I’ve been generally happy with this site’s design, which feels very much in line with this Lotus philosophy. But there was always an itch that I couldn’t ignore: a Lotus Elise was great, but what I really wanted was a Lotus Elise CR. This is why, in the past couple of… checks notes … weeks, I spent hours and hours giving the Club Racer treatment to this website, for very marginal changes. 3 Now that all of this tedious, frustrating, and abstract work is over, I don’t even know how much weight I saved. Probably the equivalent of the Elise CR’s 25 kg: meaningless to most, meaningful to a few. Like I said, it wasn’t really about the results, but about the effort; it was about getting my hands dirty. Today, I am quite happy with the choices I made and with what I learned in the process. To give my project structure, I needed to identify my top 3 priorities, and the order they should come in. Obviously, weight saving was one of them, but did I really want to put it above all else? The Lotus Elise CR was about performance and driving experience, not weight saving. Weight saving was just a means to an end. For a blog like mine, the driving experience is obviously the readability, but I also wanted my site to pass the W3C validator and keep its perfect score on PageSpeed Insights (that’s the performance bit).
I ended up with priorities ordered like this: driving experience / readability; performance / W3C validation & PageSpeed Insights scores; weight saving. I decided to stick to a serif typeface, to make this website as comfortable as possible to read, just like a page of a paperback novel would be. I have been using STIX Two Text for a while now, and I really like it: it feels a lot like Times New Roman, but improved in every way possible. Not only do I think it looks great, but it comes preinstalled on Apple devices, it is open source, and if a visitor falls back on Times New Roman (via the browser’s default font settings), the site maintains enough of the typography to make it just as nice to read: line length, line height, size rendering, etc. Also with readability in mind, I’ve decided to keep the automatic light/dark mode feature, along with the responsive font size, as it keeps text nicely proportioned relative to the screen size. I certainly could have removed even more than I did, but I wanted to keep the 100 score on PageSpeed Insights and pass the W3C validator. This is why I still have a meta description, for example, and why I use a base64 format for the inline SVG used as the favicon. I kept some of the “branding” elements for good measure, even if what I feel is the visual identity of this site mainly revolves around its lightness. Even a Lotus Elise CR has a coat of paint after all. I could shave even more bytes off this site if the default browser stylesheets weren’t being needlessly updated. But a Club Racer treatment is only fun when talking about weight saving, so let’s get to the good stuff. This is what I removed:

Airbags: some structural HTML tags, which I learned are optional in HTML5; if you look at the Elements tab of the browser’s Web Inspector panel, they are automatically added by the browser anyway.
Floor mats: the quotation marks around most of the attributes in the head, but also on some of the permanent links (I didn’t go as far as reworking the Markdown parser of Eleventy to get rid of them in all attributes, but on the homepage and other pages, each link is now 2 bytes lighter — at least before Brotli compression and other shenanigans).
Power steering: the line-height setting for headings.
Foam: the left and right padding for the mobile view.
Sound isolation: a lot of unnecessary nodes on the homepage, now leaner and lighter, at the expense of extra CSS: very worth it. This includes the summaries for Blend of links posts, which felt very repetitive.
Air conditioning: the little tags around the “by” in the header that made it 16% smaller. I liked you guys, but you had to go.
Radio: the highlight colour, used since 2020 on this site, mostly as the bottom border colour for links: it felt distracting and didn’t work well in dark mode.
Metal logo: a CSS feature for headings. It makes titles look great, but for most of them it wasn’t even needed on desktop.
And a bunch of other little things that I mostly forgot (I should have kept a log). 4
I’ve mentioned this on my Now page, but the site now compiles in 1.5 seconds on my Intel-Core-i5-powered MacBook Air, which is roughly 2–3 times faster than before. I guess it’s when you have an underpowered engine that weight saving and simplification are the most noticeable. More noticeable than on the website, that’s for sure. I hope that when I finally upgrade my computer, probably next March, I won’t get fooled by the hugely improved chips on the newer Macs, to the point of forgetting Colin Chapman: Adding power makes you faster on the straights; subtracting weight makes you faster everywhere. Happy holidays everyone. I found a great review here, in French. ↩︎ Lotus nowadays surely doesn’t look like a brand Colin Chapman would recognise. ↩︎ I thought it would only take a couple of days, but here I am, three weeks later; this was a rather enjoyable rabbit hole. ↩︎ To help me with some of the decisions, I asked a lot of questions of ChatGPT. It sometimes gave me very useful answers, but sometimes it felt like I could have just tossed a coin instead. Also, I was starting to get very annoyed at the recurring “ah, your question is the classic dilemma between Y and Z”. ↩︎

0 views
Jeff Geerling 3 weeks ago

1.5 TB of VRAM on Mac Studio - RDMA over Thunderbolt 5

Apple gave me access to this Mac Studio cluster to test RDMA over Thunderbolt, a new feature in macOS 26.2 . The easiest way to test it is with Exo 1.0 , an open source private AI clustering tool. RDMA lets the Macs all act like they have one giant pool of RAM, which speeds up things like massive AI models.

0 views
Max Bernstein 4 weeks ago

How to annotate JITed code for perf/samply

Brief one today. I got asked “does YJIT/ZJIT have support for [Linux] perf?” The answer is yes, and it also works with samply (including on macOS!), because both understand the perf map interface. This is the entirety of the implementation in ZJIT 1 : whenever you generate a function, append a one-line entry of the form START SIZE symbolname to /tmp/perf-PID.map. Per the Linux docs linked above, START and SIZE are hex numbers without 0x. symbolname is the rest of the line, so it could contain special characters. You can now happily run perf or samply and have JIT frames be named in the output. We hide this behind a flag to avoid file I/O overhead when we don’t need it. Perf map is the older way to interact with perf: a newer, more complicated way involves generating a “dump” file and then injecting it with perf inject. We actually use a hex formatting that I noticed today is wrong: it leaves the 0x prefix in the entries, and it shouldn’t. ↩
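As a concrete illustration (a hedged sketch in Rust, not the actual ZJIT code; the function name and its caller are hypothetical), appending such an entry could look like:

    use std::fs::OpenOptions;
    use std::io::Write;

    // Append a "START SIZE symbolname" line to /tmp/perf-<pid>.map,
    // with START and SIZE as hex numbers without a 0x prefix.
    fn register_jit_symbol(start: usize, size: usize, name: &str) -> std::io::Result<()> {
        let path = format!("/tmp/perf-{}.map", std::process::id());
        let mut file = OpenOptions::new().create(true).append(true).open(path)?;
        // {:x} formats hex without the 0x prefix, as the perf map format requires.
        writeln!(file, "{:x} {:x} {}", start, size, name)
    }

    fn main() -> std::io::Result<()> {
        // Hypothetical values standing in for a freshly generated function.
        register_jit_symbol(0x7f00_dead_0000, 0x180, "jit_example_method")
    }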

0 views
Evan Schwartz 4 weeks ago

Short-Circuiting Correlated Subqueries in SQLite

I recently added domain exclusion lists and paywalled content filtering to Scour . This blog post describes a small but useful SQL(ite) query optimization I came across between the first and final drafts of these features: using an uncorrelated scalar subquery to skip a correlated subquery (if you don't know what that means, I'll explain it below). Scour searches noisy sources for content related to users' interests. At the time of writing, it ingests between 1 and 3 million pieces of content from over 15,000 sources each month. For better and for worse, Scour does ranking on the fly, so the performance of the ranking database query directly translates to page load time. The main SQL query Scour uses for ranking applies a number of filters and streams the item embeddings through the application code for scoring. Scour uses brute force search rather than a vector database, which works well enough for now because of three factors: Scour uses SQLite, so the data is colocated with the application code; it uses binary-quantized vector embeddings with Hamming distance comparisons, which only take ~5 nanoseconds each; and it cares most about recent posts, so the search set can be significantly narrowed by publish date. A simplified version of the ranking query streams candidate items through a handful of filters, and its query plan shows that it makes good use of indexes. To add user-specified domain blocklists, I created a domain-exclusion table and added a filter clause to the main ranking query. The exclusion table's primary key makes the per-domain lookup efficient. However, this lookup is done for every row returned from the first part of the query. This is a correlated subquery: the inner query references a column of the outer one, so it has to be re-evaluated for each row. A problem with the way we just added this feature is that most users don't exclude any domains, but we've added a check that is run for every row anyway. To speed up the queries for users who aren't using the feature, we could first check the user's settings and then dynamically build the query. But we don't have to, because we can accomplish the same effect within one static query. We can change our domain exclusion filter to first check whether the user has any excluded domains. Since the combined condition short-circuits, if the first check finds that the user has no excluded domains, SQLite never evaluates the correlated subquery at all. The first clause does not reference any column of the outer query, so SQLite can evaluate it once and reuse the boolean result for all of the rows. This "uncorrelated scalar subquery" is extremely cheap to evaluate and, when the user has no exclusions, lets us short-circuit and skip the more expensive correlated subquery that checks each item's domain against the exclusion list. In the query plan for the updated query, the second subquery is a plain (uncorrelated) scalar subquery, whereas the third is a correlated one. The latter is the per-row check, but it can be skipped thanks to the second subquery. To test the performance of each of these queries, I swapped in each variant and used a simple bash script to invoke the binary 100 times per query on my laptop. Starting up the process each time adds overhead, but we're comparing relative differences. At the time of this benchmark, the last week had 235,975 items, 144,229 of which were in English. The two example users I tested this for below only look for English content. The first test represents most users, who have not configured any excluded domains: the short-circuit query adds practically no overhead for them, whereas the correlated subquery alone makes their queries 17% slower. The second test uses an example user who has excluded content from 2 domains. In this case we do need to check each row against the domain filter, but the short-circuit still adds no measurable overhead on top of it.
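Here is a self-contained sketch of the short-circuit shape in Rust using the rusqlite crate. The schema, table names, and the exact clause structure are my own guesses for illustration, not Scour's actual schema or query:

    use rusqlite::{params, Connection, Result};

    fn main() -> Result<()> {
        let conn = Connection::open_in_memory()?;
        conn.execute_batch(
            "CREATE TABLE items (id INTEGER PRIMARY KEY, domain TEXT);
             CREATE TABLE excluded_domains (user_id INTEGER, domain TEXT,
                                            PRIMARY KEY (user_id, domain));
             INSERT INTO items (domain) VALUES ('example.com'), ('blog.example.org');",
        )?;

        // The first NOT EXISTS is uncorrelated (it only references the bound
        // user id), so SQLite evaluates it once. When the user has no excluded
        // domains it is true, OR short-circuits, and the second, correlated
        // subquery (which references items.domain) is never run per row.
        let mut stmt = conn.prepare(
            "SELECT id FROM items
             WHERE NOT EXISTS (SELECT 1 FROM excluded_domains WHERE user_id = ?1)
                OR NOT EXISTS (SELECT 1 FROM excluded_domains e
                               WHERE e.user_id = ?1 AND e.domain = items.domain)",
        )?;
        let visible: Vec<i64> = stmt
            .query_map(params![42_i64], |row| row.get(0))?
            .collect::<Result<_>>()?;
        println!("visible items: {visible:?}");
        Ok(())
    }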
When using SQL subqueries to filter down result sets, it's worth thinking about whether each subquery is really needed for most users or most queries. If the check is needed most of the time, this approach won't help. However, if the per-row check isn't always needed, using an uncorrelated scalar subquery to short-circuit a condition can dramatically speed up the average case with practically zero overhead. This is extra important because the slow-down from each additional subquery compounds. In this blog post, I described and benchmarked a single additional filter, but it is only one of multiple subquery filters. Earlier, I also mentioned that users had asked for a way to filter out paywalled content. This works similarly to filtering out content from excluded domains. Some users opt in to hiding paywalled content. For those users, we check whether each item is paywalled and, if so, whether it comes from a site the user has specifically allowed paywalled content from (because they have a subscription). I used the same uncorrelated subquery approach: first check whether the feature is enabled for the user, and only then make SQLite check each row. The paywalled-content filter subquery follows the same short-circuit shape. In short, a trivial uncorrelated scalar subquery can help us short-circuit and avoid a more expensive per-row check when we don't need it. There are multiple ways to exclude rows from an SQL query, and I ran the same benchmark with two other ways of checking whether an item comes from an excluded domain: one variation uses a different subquery form, and another joins against the exclusion table and then checks for missing matches. For users without excluded domains, the query using the short-circuit wins and adds no overhead. For users who do have excluded domains, the join-based variation is faster than the subquery version. However, the join raises the exact problem this whole blog post is designed to address: since joins happen no matter what, we cannot use the short-circuit to avoid the overhead for users without excluded domains. At least for now, this is why I've gone with the subquery using the short-circuit. Discuss on Hacker News, Lobsters, r/programming, r/sqlite.

0 views
Anton Zhiyanov 1 months ago

Timing 'Hello, world'

Here's a little unscientific chart showing the compile/run times of a "hello world" program in different languages. For interpreted languages, the times shown are only for running the program, since there's no separate compilation step. I had to shorten the Kotlin bar a bit to make it fit within 80 characters. All measurements were done in single-core, containerized sandboxes on an ancient CPU, and the timings include the sandboxing overhead. So the exact times aren't very interesting, especially for the top group (Bash to Ruby) — they all took about the same amount of time. Here is the program source code in C, along with the other languages: Bash · C# · C++ · Dart · Elixir · Go · Haskell · Java · JavaScript · Kotlin · Lua · Odin · PHP · Python · R · Ruby · Rust · Swift · V · Zig. Of course, this ranking will be different for real-world projects with lots of code and dependencies. Still, I found it curious to see how each language performs on a simple "hello world" task.

0 views
Marc Brooker 1 months ago

What Does a Database for SSDs Look Like?

Maybe not what you think. Over on X, Ben Dicken asked: What does a relational database designed specifically for local SSDs look like? Postgres, MySQL, SQLite and many others were invented in the 90s and 00s, the era of spinning disks. A local NVMe SSD has ~1000x improvement in both throughput and latency. Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random. If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain? How might we tackle this question quantitatively for the modern transaction-orientated database?

Approach One: The Five Minute Rule

Perhaps my single favorite systems paper, The 5 Minute Rule… by Jim Gray and Franco Putzolu, gives us a very simple way to answer one of the most important questions in systems: how big should caches be? The five minute rule is that, back in 1986, if you expected to read a page again within five minutes you should keep it in RAM. If not, you should keep it on disk. The basic logic is that you look at the page that's least likely to be re-used. If it's cheaper to keep it around until its next expected re-use, then you should keep more. If it's cheaper to reload it from storage than to keep it around, then you should keep less 1 . Let's update the numbers for 2025, assuming that pages are around 32kB 2 (this becomes important later). The EC2 instance I'm using as a reference delivers about 1.8 million read iops of this size, at a price of around $0.004576 per second, or \(10^{-9}\) dollars per transfer (assuming we're allocating about 40% of the instance price to storage). About one dollar per billion reads. It also has enough RAM for about 50 million pages of this size, costing around \(3 \times 10^{-11}\) dollars to store a page for one second. So, on this instance type, we should size our RAM cache to store pages for about 30 seconds. Not too different from Gray and Putzolu's result 40 years ago! That's answer number one: the database should have a cache sized so that the hot set contains pages expected to be accessed in the next 30 seconds, for optimal cost. For optimal latency, however, the cache may want to be considerably bigger.

Approach Two: The Throughput/IOPS Breakeven Point

The next question is what size accesses we want to send to our storage devices to take best advantage of their performance. In the days of spinning media, the answer to this was surprisingly big: a 100MB/s disk could generally do around 100 seeks a second, so if your transfers were less than around 1MB you were walking away from throughput. Give or take a factor of 2. What does it look like for modern SSDs? SSDs are much faster on both throughput and iops. They're less sensitive than spinning drives to workload patterns, but read/write ratios and the fullness of the drives still matter. Absent benchmarking on the actual hardware with the real workload, my rule of thumb is that SSDs are throughput limited for transfers bigger than 32kB, and iops limited for transfers smaller than 32kB. Making transfers bigger than 32kB doesn't help throughput much, reduces IOPS, and probably makes the cache less effective because of false sharing and related effects. This is especially important for workloads with poor spatial locality. So that's answer number two: we want our transfers to disk not to be much smaller than 32kB on average, or we're walking away from throughput.
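Spelling out the two back-of-the-envelope breakevens above, using only the numbers already quoted:

\[
\text{cache breakeven} \approx \frac{10^{-9}\ \$\ \text{per read}}{3 \times 10^{-11}\ \$\ \text{per page-second}} \approx 30\ \text{seconds},
\qquad
\text{HDD transfer breakeven} \approx \frac{100\ \text{MB/s}}{100\ \text{seeks/s}} = 1\ \text{MB}.
\]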
Approach Three: Durability and Replication

Building reads on local SSDs is great: tons of throughput, tons of iops. Writes on local SSDs, on the other hand, have the distinct problem of only being durable on the local box, which is unacceptable for most workloads. Modern hardware is very reliable, but thinking through the business risks of losing data on failover isn't very fun at all, so let's assume that our modern database is going to replicate off-box, making at least one more synchronous copy. Ideally in a different availability zone (AZ). The reference instance we were using for our comparison earlier has 100Gb/s (or around 12GB/s) of network bandwidth. That puts a cap on how much write throughput we can have for a single-leader database. Cross-AZ latency in EC2 varies from a couple hundred microseconds to a millisecond or two, which puts a minimum on our commit latency. That gives us answer number three: we want to incur cross-AZ latency only at commit time, and not during writes. Which is where we run into one of my favorite topics: isolation. The I in ACID. A modern database design will avoid read-time coordination using multiversioning, but to offer isolation stronger than what multiversioned snapshot reads alone provide, it will need to coordinate either on each write or at commit time. It can do that like, say, Aurora Postgres does, having a single leader at a time running in a single AZ. This means great latency for clients in that zone, and higher latency for clients in different AZs. Given that most applications are hosted in multiple AZs, this can add up for latency-sensitive applications which make a lot of round trips to the database. The alternative approach is the one Aurora DSQL takes, doing the cross-AZ round trip only at commit time, saving round-trips. Here's me talking about the shape of that trade-off at re:Invent this year. There's no clear answer here, because there are real trade-offs between the two approaches. But do make sure to ask your database vendor whether those impressive latency benchmarks are running where your application actually runs. In the spirit of the original question, though, the incredible bandwidth and low latency available in modern datacenter networks is as transformative as SSDs in database designs. Or should be. While we're incurring the latency cost of synchronous replication, we may as well get strongly consistent scale-out reads for free. In DSQL, we do this using high-quality hardware clocks that you can use too. Another nice win from modern hardware. There are other approaches too. That's answer number four for me: the modern database uses high-quality clocks and knowledge of actual application architectures to optimize for real-world performance (like latency in multiple availability zones or regions) without compromising on strong consistency.

Approach Four: What about that WAL?

Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random. WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn't like that: it doesn't depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don't want to lose writes even if a single system fails).
That's answer number five: the modern database commits transactions to a distributed log, which provides multi-machine, multi-AZ durability, and might provide other services like atomicity. Recovery is a replay from the distributed log, on any one of a number of peer replicas.

What About Data Structures?

B-Trees versus LSM-trees versus B-Tree variants versus LSM variants versus other data structures are trade-offs that have a lot to do with access patterns and workload patterns. Picking a winner would be a whole series of blog posts, so I'm going to chicken out and say it's complicated. If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain? I'd keep the relational model, atomicity, isolation (but would probably pick a different default isolation level), strong consistency, SQL, interactive transactions, and the other core design decisions of relational databases. But I'd move durability, read and write scale, and high availability into being distributed rather than single-system concerns. I think that helps with performance and cost, while making these properties easier to achieve. I'd mostly toss out local durability and recovery, and all the huge history of optimizations and data structures around that 3 , in favor of getting better properties in the distributed setting. I'd pay more attention to internal strong isolation (in the security sense) between clients and workloads. I'd size caches for a working set of between 30 seconds and 5 minutes of accesses. I'd optimize for read transfers around that 32kB sweet spot from local SSD, and the around-8kB sweet spot for networks. Probably more stuff too, but this is long enough as-is. Other topics worth covering include avoiding copies on IO, co-design with virtualization (e.g. see our Aurora Serverless paper), trade-offs of batching, how the relative performance of different isolation levels changes, what promises to give clients, encryption and authorization of data at rest and in motion, dealing with very hot single items, new workloads like vector, verifiable replication journals, handing off changes to analytics systems, access control, multi-tenancy, forking and merging, and even locales. The reasoning is slightly smarter, thinking about the marginal page and marginal cost of memory, but this simplification works for our purposes here. The marginal cost of memory is particularly interesting in a provisioned system, because it varies between zero (you've paid for it already) and huge (you need a bigger instance size). One of the really nice things about serverless (like DSQL) and dynamic scaling (like Aurora Serverless) is that it makes the marginal cost constant, greatly simplifying the task of reasoning about cache size. Yes, I know that pages are typically 4kB or 2MB, but bear with me here. Sorry ARIES.

0 views
Ash's Blog 1 months ago

Full Unicode Search at 50× ICU Speed with AVX‑512

ICU gets Unicode right and pays for it. This post shows a different approach: fold-safe windows, SIMD probes, and verifiers for fast UTF‑8 search.

0 views
nathan.rs 1 months ago

Text Diffusion Models are Faster at Writing Code

In this post, I run small experiments showing that diffusion language models generate code (and other structured text) at a faster rate. Increased structure tends to correlate with reduced entropy, which leads to higher-confidence token predictions, which directly means more tokens decoded in parallel per step. 1 In speculative decoding (for autoregressive models), we speed up generation by using a smaller model to generate multiple tokens, which are then verified in parallel by a larger model. The core idea is that most tokens are easily predictable; thus, we should be able to use a smaller and faster model for them. The classic example is the following sentence:

1 views
Jeff Geerling 1 months ago

Benchmarking NVENC video transcoding on the Pi

Now that Nvidia GPUs run on the Raspberry Pi , I've been putting all the ones I own through their paces. Many people have an older Nvidia card (like a 3060) laying around from an upgrade. So could a Pi be suitable for GPU-accelerated video transcoding, either standalone for conversion, or running something like Jellyfin for video library management and streaming? That's what I set out to do, and the first step, besides getting the drivers and CUDA going (see blog post linked above), was to find a way to get a repeatable benchmark going.

0 views