Posts in Performance (20 found)

Optimizing Datalog for the GPU

Optimizing Datalog for the GPU Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Sidharth Kumar, and Kristopher Micinski ASPLOS'25
Datalog source code comprises a set of relations and a set of rules. A relation can be explicitly defined with a set of tuples. A running example in the paper defines a graph with a relation of edge facts. A relation can also be implicitly defined with a set of rules; the paper uses a same-generation relation as an example. Rule 1 states that two vertices are part of the same generation if they share a common ancestor and are not actually the same vertex. Rule 2 states that two vertices are part of the same generation if they have ancestors from the same generation.
“Running a Datalog program” entails evaluating all rules until a fixed point is reached (no more tuples are added). One key idea to internalize is that evaluating a Datalog rule is equivalent to performing a SQL join. For example, rule 1 is equivalent to joining the edge relation with itself, using the shared ancestor as the join key, and the inequality of the two vertices as a filter.
Semi-naïve Evaluation is an algorithm for performing these joins until convergence, while not wasting too much effort on redundant work. The tuples in a relation are put into three buckets: new holds tuples that were discovered on the current iteration, delta holds tuples which were added in the previous iteration, and full holds all tuples that have been found in any iteration. For a join involving two relations (A and B), new is computed as the union of the result of 3 joins: delta(A) joined with full(B), full(A) joined with delta(B), and delta(A) joined with delta(B). The key fact for performance is that full(A) is never joined with full(B). More details on Semi-naïve Evaluation can be found in these notes.
This paper introduces the hash-indexed sorted array for storing relations while executing Semi-naïve Evaluation on a GPU. It seems to me like this data structure would work well on other chips too. Fig. 2 illustrates the data structure (join keys are in red): Source: https://dl.acm.org/doi/10.1145/3669940.3707274
The data array holds the actual tuple data. It is densely packed in row-major order. The sorted index array holds pointers into the data array (one pointer per tuple). These pointers are lexicographically sorted (join keys take higher priority in the sort). The hash table is an open-addressed hash table which maps a hash of the join keys to the first element in the sorted index array that contains those join keys. A join of relations A and B can be implemented with the pseudo-code sketched below. Memory accesses when probing through the sorted index array are coherent. Memory accesses when accessing the data array are coherent up to the number of elements in a tuple.
Table 3 compares the results from this paper (GPULog) against a state-of-the-art CPU implementation (Soufflé). HIP represents GPULog ported to AMD’s HIP runtime and then run on the same Nvidia GPU. Source: https://dl.acm.org/doi/10.1145/3669940.3707274
Dangling Pointers
The data structure and algorithms described by this paper seem generic; it would be interesting to see them run on other chips (FPGA, DPU, CPU, HPC cluster). I would guess most of GPULog is bound by memory bandwidth, not compute. I wonder if there are Datalog-specific algorithms to reduce the bandwidth/compute ratio.
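Here is a rough sketch of that join in C. The struct layout and helper names are my own guesses at a reasonable shape for the data structure, not the paper's actual code, and collision handling in the open-addressed table is glossed over.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of the hash-indexed sorted array and a join probe.
 * Tuples are rows of `arity` uint32 columns; the first `num_keys`
 * columns are the join keys. */
typedef struct {
    const uint32_t *data;         /* dense, row-major tuple data              */
    const uint32_t *sorted_index; /* tuple indices, lexicographically sorted  */
    const uint32_t *ht_slots;     /* open-addressed table: key hash -> first  */
                                  /* matching position in sorted_index        */
    size_t ht_size;
    size_t num_tuples;
    size_t arity;
    size_t num_keys;
} relation;

static uint32_t hash_keys(const uint32_t *keys, size_t n)
{
    uint32_t h = 2166136261u;                /* FNV-1a, just a placeholder hash */
    for (size_t i = 0; i < n; i++) { h ^= keys[i]; h *= 16777619u; }
    return h;
}

static int keys_equal(const uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}

/* Join every tuple of `outer` against `inner` (the hash-indexed relation).
 * On the GPU, each thread would handle one or a few outer tuples. */
void join(const relation *outer, const relation *inner)
{
    for (size_t o = 0; o < outer->num_tuples; o++) {
        const uint32_t *t = outer->data + o * outer->arity;

        /* 1. Hash the outer tuple's join keys and look up the starting
         *    position in the sorted index array (collision resolution in
         *    the open-addressed table is omitted for brevity). */
        uint32_t slot = hash_keys(t, inner->num_keys) % inner->ht_size;
        uint32_t pos  = inner->ht_slots[slot];

        /* 2. Scan forward through the sorted index array: because it is
         *    sorted with the join keys first, all matches are contiguous. */
        for (size_t i = pos; i < inner->num_tuples; i++) {
            const uint32_t *m = inner->data + inner->sorted_index[i] * inner->arity;
            if (!keys_equal(m, t, inner->num_keys))
                break;                       /* past the run of matching keys */
            printf("match: outer row %zu, inner row %u\n",
                   o, (unsigned)inner->sorted_index[i]);
        }
    }
}
```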

2 views
baby steps 2 days ago

We need (at least) ergonomic, explicit handles

Continuing my discussion on Ergonomic RC, I want to focus on the core question: should users have to explicitly invoke handle/clone, or not? This whole “Ergonomic RC” work was originally proposed by Dioxus and their answer is simple: definitely not. For the kind of high-level GUI applications they are building, having to call clone() to clone a ref-counted value is pure noise. For that matter, for a lot of Rust apps, even cloning a string or a vector is no big deal. On the other hand, for a lot of applications, the answer is definitely yes – knowing where handles are created can impact performance, memory usage, and even correctness (don’t worry, I’ll give examples later in the post). So how do we reconcile this?
This blog argues that we should make it ergonomic to be explicit. This wasn’t always my position, but after an impactful conversation with Josh Triplett, I’ve come around. I think it aligns with what I once called the soul of Rust: we want to be ergonomic, yes, but we want to be ergonomic while giving control 1 . I like Tyler Mandry’s Clarity of purpose construction, “Great code brings only the important characteristics of your application to your attention”. The key point is that there is great code in which cloning and handles are important characteristics, so we need to make that code possible to express nicely. This is particularly true since Rust is one of the very few languages that really targets that kind of low-level, foundational code.
This does not mean we cannot (later) support automatic clones and handles. It’s inarguable that this would benefit clarity of purpose for a lot of Rust code. But I think we should focus first on the harder case, the case where explicitness is needed, and get that as nice as we can; then we can circle back and decide whether to also support something automatic. One of the questions for me, in fact, is whether we can get “fully explicit” to be nice enough that we don’t really need the automatic version. There are benefits from having “one Rust”, where all code follows roughly the same patterns, where those patterns are perfect some of the time, and don’t suck too bad 2 when they’re overkill.
I mentioned this blog post resulted from a long conversation with Josh Triplett 3 . The key phrase that stuck with me from that conversation was: Rust should not surprise you. The way I think of it is like this. Every programmer knows what it’s like to have a marathon debugging session – to sit and stare at code for days and think, but… how is this even POSSIBLE? Those kinds of bug hunts can end in a few different ways. Occasionally you uncover a deeply satisfying, subtle bug in your logic. More often, you find that you wrote one thing when you meant another. And occasionally you find out that your language was doing something that you didn’t expect. That some simple-looking code concealed a subtle, complex interaction. People often call this kind of thing a footgun.
Overall, Rust is remarkably good at avoiding footguns 4 . And part of how we’ve achieved that is by making sure that things you might need to know are visible – like, explicit in the source. Every time you see a Rust match, you don’t have to ask yourself “what cases might be missing here” – the compiler guarantees you they are all there. And when you see a call to a Rust function, you don’t have to ask yourself if it is fallible – you’ll see a ? if it is. 5
So I guess the question is: would you ever have to know about a ref-count increment? The tricky part is that the answer here is application dependent.
For some low-level applications, definitely yes: an atomic reference count is a measurable cost. To be honest, I would wager that the set of applications where this is true is vanishingly small. And even in those applications, Rust already improves on the state of the art by giving you the ability to choose between Rc and Arc and then proving that you don’t mess it up.
But there are other reasons you might want to track reference counts, and those are less easy to dismiss. One of them is memory leaks. Rust, unlike GC’d languages, has deterministic destruction. This is cool, because it means that you can leverage destructors to manage all kinds of resources, as Yehuda wrote about long ago in his classic ode to RAII entitled “Rust means never having to close a socket”. But although the points where handles are created and destroyed are deterministic, the nature of reference-counting can make it much harder to predict when the underlying resource will actually get freed. And if those increments are not visible in your code, it is that much harder to track them down.
Just recently, I was debugging Symposium, which is written in Swift. Somehow I had two instances when I only expected one, and each of them was responding to every IPC message, wreaking havoc. Poking around I found stray references floating around in some surprising places, which were causing the problem. Would this bug have still occurred if I had to write something explicit to increment the ref count? Definitely, yes. Would it have been easier to find after the fact? Also yes. 6
Josh gave me a similar example from the “bytes” crate. A Bytes value is a handle to a slice of some underlying memory buffer. When you clone that handle, it will keep the entire backing buffer around. Sometimes you might prefer to copy your slice out into a separate buffer so that the underlying buffer can be freed. It’s not that hard for me to imagine trying to hunt down an errant handle that is keeping some large buffer alive and being very frustrated that I can’t see explicitly in the source where those handles are created.
A similar case occurs with APIs like Arc::try_unwrap 7 , which takes an Arc and, if the ref-count is 1, returns the inner value. This lets you take a shareable handle that you know is not actually being shared and recover uniqueness. This kind of API is not frequently used – but when you need it, it’s so nice it’s there.
Entering the conversation with Josh, I was leaning towards a design where you had some form of automated cloning of handles and an allow-by-default lint that would let crates which don’t want that turn it off. But Josh convinced me that there is a significant class of applications that want handle creation to be ergonomic AND visible (i.e., explicit in the source). Low-level network services and even things like Rust For Linux likely fit this description, but any Rust application that leans heavily on ref-counted handles might also.
And this reminded me of something Alex Crichton once said to me. Unlike the other quotes here, it wasn’t in the context of ergonomic ref-counting, but rather when I was working on my first attempt at the “Rustacean Principles”. Alex was saying that he loved how Rust was great for low-level code but also worked well for high-level stuff like CLI tools and simple scripts. I feel like you can interpret Alex’s quote in two ways, depending on what you choose to emphasize. You could hear it as, “It’s important that Rust is good for high-level use cases”. That is true, and it is what leads us to ask whether we should even make handles visible at all.
But you can also read Alex’s quote as, “It’s important that there’s one language that works well enough for both” – and I think that’s true too. The “true Rust gestalt” is when we manage to simultaneously give you the low-level control that grungy code needs but wrapped in a high-level package. This is the promise of zero-cost abstractions, of course, and Rust (in its best moments) delivers.
Let’s be honest. High-level GUI programming is not Rust’s bread-and-butter, and it never will be; users will never confuse Rust for TypeScript. But then, TypeScript will never be in the Linux kernel. The goal of Rust is to be a single language that can, by and large, be “good enough” for both extremes. The goal is to make enough low-level details visible for kernel hackers but do so in a way that is usable enough for a GUI. It ain’t easy, but it’s the job.
This isn’t the first time that Josh has pulled me back to this realization. The last time was in the context of async fn in dyn traits, and it led to a blog post talking about the “soul of Rust” and a followup going into greater detail. I think the catchphrase “low-level enough for a Kernel, usable enough for a GUI” kind of captures it.
There is a slight caveat I want to add. I think another part of Rust’s soul is preferring nuance to artificial simplicity (“as simple as possible, but no simpler”, as they say). And I think the reality is that there’s a huge set of applications that make new handles left-and-right (particularly but not exclusively in async land 8 ) and where explicitly creating new handles is noise, not signal. This is why e.g. Swift 9 makes ref-count increments invisible – and they get a big lift out of that! 10 I’d wager most Swift users don’t even realize that Swift is not garbage-collected 11 . But the key thing here is that even if we do add some way to make handle creation automatic, we ALSO want a mode where it is explicit and visible. So we might as well do that one first.
OK, I think I’ve made this point 3 ways from Sunday now, so I’ll stop. The next few blog posts in the series will dive into (at least) two options for how we might make handle creation and closures more ergonomic while retaining explicitness.
I see a potential candidate for a design axiom… rubs hands with an evil-sounding cackle and a look of glee   ↩︎
It’s an industry term.  ↩︎
Actually, by the standards of the conversations Josh and I often have, it wasn’t really all that long – an hour at most.  ↩︎
Well, at least sync Rust is. I think async Rust has more than its share, particularly around cancellation, but that’s a topic for another blog post.  ↩︎
Modulo panics, of course – and no surprise that accounting for panics is a major pain point for some Rust users.  ↩︎
In this particular case, it was fairly easy for me to find regardless, but this application is very simple. I can definitely imagine ripgrep’ing around a codebase to find all increments being useful, and that would be much harder to do without an explicit signal they are occurring.  ↩︎
Or Arc::make_mut, which is one of my favorite APIs. It takes an Arc and gives you back mutable (i.e., unique) access to the internals, always! How is that possible, given that the ref count may not be 1? Answer: if the ref-count is not 1, then it clones it. This is perfect for copy-on-write-style code. So beautiful. 😍  ↩︎
My experience is that, due to language limitations we really should fix, many async constructs force you into ’static bounds which in turn force you into ref-counted handles where you’d otherwise have been able to use plain references.
↩︎ I’ve been writing more Swift and digging it. I have to say, I love how they are not afraid to “go big”. I admire the ambition I see in designs like SwiftUI and their approach to async. I don’t think they bat 100, but it’s cool they’re swinging for the stands. I want Rust to dare to ask for more !  ↩︎ Well, not only that. They also allow class fields to be assigned when aliased which, to avoid stale references and iterator invalidation, means you have to move everything into ref-counted boxes and adopt persistent collections, which in turn comes at a performance cost and makes Swift a harder sell for lower-level foundational systems (though by no means a non-starter, in my opinion).  ↩︎ Though I’d also wager that many eventually find themselves scratching their heads about a ref-count cycle. I’ve not dug into how Swift handles those, but I see references to “weak handles” flying around, so I assume they’ve not (yet?) adopted a cycle collector. To be clear, you can get a ref-count cycle in Rust too! It’s harder to do since we discourage interior mutability, but not that hard.  ↩︎

0 views

JIT: so you want to be faster than an interpreter on modern CPUs…

Since my previous blog entry about my JIT compiler for PostgreSQL, sadly not much happened due to a lack of time, but still some things were done (biggest improvement was the port to ARM64, a few optimizations, implementing more opcodes…). But I am often asking myself how to really beat the interpreter… And on “modern” CPUs, with a well written interpreter, that’s far more complicated than many would imagine. So in order to explain all this and show how I am planning to improve performance (possibly of the interpreter itself too, thus making this endeavor self-defeating), let’s first talk about… If you already know about all the topics mentioned in this title, feel free to jump to the next section. Note that the following section is over-simplified to make the concepts more accessible.
I am writing this blog post on a Zen 2+ CPU. If I upgraded to a Zen 3 CPU, same motherboard, same memory, I would get an advertised 25% performance jump in single thread benchmarks while the CPU frequency would be only 2% higher. Why such a discrepancy? Since the 90s and the Pentium-class CPUs, x86 has followed RISC CPUs into the superscalar era. Instead of running one instruction per cycle, when conditions are right, several instructions can be executed at the same time.
Let’s consider pseudo-code where X and Y are computed from independent inputs and then combined into Z (all the examples from this section are gathered in the sketch below). X and Y can be calculated at the same time. The CPU can execute these on two integer units, fetch the results and store them. The only issue is the computation of Z: everything must be done before this step, making it impossible for the CPU to go further without waiting for the previous results. But now, what if the code was written as a chain of dependent steps? Every step would require waiting for the previous one, slowing down the CPU terribly. Hence the most important technique used to implement superscalar CPUs: out-of-order execution. The CPU will fetch the instructions, dispatch them in several instruction queues, and resolve dependencies to compute Y before computing Z1 in order to have it ready sooner. The CPU is spending less time idling, thus the whole thing is faster. But, alas, what would happen with a function that branches on a comparison of X and Y? Should the CPU wait for X and Y before deciding which Z to compute? Here is the biggest trick: it will try its luck and compute something anyway. This way, if its bet was right, a lot of time will be saved, otherwise the mistaken result will be dropped and the proper computation will be done instead. This is called branch prediction; it has been the source of many fun security issues (hello Meltdown), but the performance benefits are so huge that one would never consider disabling this.
Most interpreters will operate on an intermediate representation, using opcodes instead of directly executing from an AST or similar. So you could use a big switch over the current opcode as the main loop of an interpreter (also shown in the sketch below). This is how many, many interpreters were written. But this has a terrible drawback, at least when compiled that way: it has branches all over the place from a single starting point (most if not all optimizing compilers will generate a jump table to optimize the dispatch, but this will still jump from the same point). The CPU will have a hard time predicting the right jump, and is thus losing a lot of performance. If this was the only way an interpreter could be written, generating a function by stitching the code together would save a lot of time, likely giving a more than 10% performance improvement. Looking at Python: removing this switch made the interpreter 15 to 20% faster.
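To make the examples in this section concrete, here is a self-contained sketch in C. The three small functions and the three-opcode bytecode are made up purely for illustration (this is not PostgreSQL's executor), and the computed-goto variant at the end relies on a GCC/Clang extension, as discussed in the next section.

```c
#include <stdio.h>
#include <stddef.h>

/* X and Y only depend on the inputs: a superscalar core can compute them in
 * parallel on two integer units; only Z has to wait for both. */
static long independent(long a, long b, long c, long d)
{
    long x = a + b;
    long y = c + d;
    return x + y;                       /* Z */
}

/* Written as a chain, each step depends on the previous one.  An out-of-order
 * core can still pull the independent computation of Y forward, before Z1 and
 * Z2 are done. */
static long chained(long a, long b, long c, long d)
{
    long x  = a + b;
    long z1 = x * 3;
    long z2 = z1 * 5;
    long y  = c + d;                    /* independent, started early */
    return z2 + y;
}

/* Which Z gets computed depends on X and Y: the core predicts the branch and
 * speculatively executes one side instead of waiting. */
static long branchy(long x, long y)
{
    return (x > y) ? x - y : y - x;
}

/* A made-up three-opcode bytecode, to illustrate dispatch. */
enum op { OP_LOAD, OP_ADD, OP_DONE };
struct instr { enum op op; long arg; };

/* Classic switch-based main loop: every handler jumps back to one central
 * dispatch point, which is hard for the branch predictor. */
static long run_switch(const struct instr *code)
{
    long acc = 0;
    for (size_t pc = 0; ; pc++) {
        switch (code[pc].op) {
        case OP_LOAD: acc  = code[pc].arg; break;
        case OP_ADD:  acc += code[pc].arg; break;
        case OP_DONE: return acc;
        }
    }
}

/* Computed-goto dispatch (GCC/Clang extension): each handler jumps directly to
 * the next opcode's handler, giving the predictor one indirect branch per
 * handler instead of a single shared one. */
static long run_goto(const struct instr *code)
{
    void *labels[] = { &&do_load, &&do_add, &&do_done };
    long acc = 0;
    size_t pc = 0;
#define DISPATCH() goto *labels[code[pc].op]
    DISPATCH();
do_load: acc  = code[pc].arg; pc++; DISPATCH();
do_add:  acc += code[pc].arg; pc++; DISPATCH();
do_done: return acc;
#undef DISPATCH
}

int main(void)
{
    struct instr prog[] = { {OP_LOAD, 40}, {OP_ADD, 2}, {OP_DONE, 0} };
    printf("%ld %ld %ld %ld %ld\n",
           independent(1, 2, 3, 4), chained(1, 2, 3, 4), branchy(7, 3),
           run_switch(prog), run_goto(prog));
    return 0;
}
```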
Many projects, including PostgreSQL, use this same technique, called “computed gotos”. After a first pass to fill in “label” targets in each step, execution jumps directly from the end of one opcode’s handler to the start of the next one’s (see the computed-goto variant in the sketch above). When running a short sequence of operations in a loop, the jumps will be far more predictable, making the branch predictor’s job easier, and thus improving the speed of the interpreter.
Now that we have a very basic understanding of modern CPUs and the insane level of optimization they reach, let’s talk about fighting the PostgreSQL interpreter on performance. I will not discuss optimizing the tuple deforming part (aka. going from the on-disk structure to the “C” structure used by the code), this will be a topic for a future blog post when I implement it in my compiler.
As you may know, PostgreSQL has a very complete type system with operator overloading. Even the simple comparison in a query like SELECT * FROM demo WHERE a = 42 ends up being a call to int4eq, a strict function that will perform the comparison. Since it is a strict function, PostgreSQL must check that the arguments are not null, otherwise the function is not called and the result will be null. If you execute a very basic query like that one, PostgreSQL will run a short sequence of expression opcodes; among them, EEOP_FUNCEXPR_STRICT_2 performs the null check and then calls the function. If we unroll all the opcodes into real C code, we can already spot one optimization: why do we check the two arguments, including our constant, against null? The constant will never change for the entire run of this query, and yet each comparison is going to use an ALU, and branch depending on that comparison. But of course the CPU will notice the corresponding branch pattern, and will thus be able to remain active and feed its other units. What is the real cost of such a pointless comparison? For this purpose, I’ve broken a PostgreSQL instance and replaced all FUNCEXPR_STRICT with a check on one argument only, and one with no STRICT check (do not try this at home!). Doing 10 times a simple SELECT * FROM demo WHERE a = 42 on a 100 million row table, with no index, the perf results show a difference of only a few percent. So, even if this is not the optimization of the century, it’s not that expensive to make, so… why not do it? (Patch coming to pgsql-hackers soon)
But a better optimization is to go all-in on inlining. Indeed, instead of jumping through a pointer to the int4eq code (again, something that the CPU will optimize a lot), one could have a special opcode for this quite common operation. With this change alone (but keeping the two null checks, so there are still optimizations possible), the perf results improve markedly. Let’s sum up these results. The biggest change comes, quite obviously, from inlining the int4eq call. Why is it that much better? Because it reduces by quite a lot the number of instructions to run, and it removes a call to an address stored in memory. And this is again an optimization I could do in my JIT compiler that can also be done on the interpreter with the same benefits. The biggest issue here is that you must keep the number of opcodes within (unspecified) limits: too many opcodes could make the compiler’s job far worse.
Well. At first, I thought the elimination of null checks could not be implemented easily in the interpreter. The first draft in my compiler was certainly invalid, but gave me interesting numbers (around 5%, as seen above) and made me want to go ahead.
And I realized that implementing it cleanly in the interpreter was far easier than implementing it in my JIT compiler… Then I went with optimizing another common case, the call to int4eq, and, well… One could also add an opcode for that in the interpreter, and thus the performance gains of the JIT compiler are going to be minimal compared to the interpreter. Modern CPUs don’t make my job easy here. Most of the cost of an interpreter is taken away by the branch predictor and the other optimizations implemented in silicon. So is all hope lost? Am I to declare the interpreter the winner against the limitations of the copy-patch method I have available for my JIT? Of course not, see you in the next post to discuss the biggest interpreter bottleneck!
PS: help welcome. Last year I managed to spend some time working on this during my work time. Since then I’ve changed jobs, and can hardly get some time on this. I also tried to get some sponsoring to work on this and present at future PostgreSQL conferences, with no luck :/ If you can help in any way on this project, feel free to reach out to me (code contribution, sponsoring, missions, job offers, nudge nudge wink wink). Since I’ve been alone on this, a lot of things are scribbles on scratch paper; I benchmark code and stuff in my head when life gives me some boring time, but testing it for real is of course far better. I have some travel planned soon, so I hope for the next part to be released before next year, with interesting results since my experiments have been as successful as anticipated.

0 views

Extended User Interrupts (xUI): Fast and Flexible Notification without Polling

Extended User Interrupts (xUI): Fast and Flexible Notification without Polling Berk Aydogmus, Linsong Guo, Danial Zuberi, Tal Garfinkel, Dean Tullsen, Amy Ousterhout, and Kazem Taram ASPLOS'25 This paper describes existing hardware support for userspace interrupts, and extensions to make it more efficient. The kernel is powerful, and yet slow. You only want to call on it when necessary. Kernel bypass technologies like DPDK and io_uring exist to allow applications to reduce the frequency of kernel calls. In addition to I/O, applications frequently call the kernel to communicate between threads. For example, in a producer/consumer design, the producer could use a signal to tell the consumer that more data is ready. Section 2 of the paper reminds readers that each of these signals costs about 2.4 microseconds. The idea behind userspace interrupts is to get the kernel out of the way and instead have dedicated hardware to support cheap signaling between threads. UIPI is a hardware feature introduced by Intel with Sapphire Rapids. Section 3 of the paper describes how UIPI works, including reverse engineering some of the Sapphire Rapids microarchitecture. As with other kernel bypass technologies, the kernel is still heavily involved in the control path. When a process requests that the kernel configure UIPI, the kernel responds by creating or modifying the user interrupt target table (UITT) for the process. This per-process table has one entry per thread. The kernel thread scheduler updates this table so that the UIPI hardware can determine which core a thread is currently running on. Once the control path is set up, the data path runs without kernel involvement. Userspace code which wants to send an interrupt to another thread can execute the SENDUIPI instruction. This instruction has one operand, which is an index into the UITT (the index of the destination thread). The hardware then consults the UITT and sends an inter-processor interrupt (IPI) to the core on which the destination thread is running. Userspace code in the destination thread then jumps to a pre-registered interrupt handler, which runs arbitrary code in userspace. Hardware has long supported IPIs, but typically only the kernel has had the ability to invoke them. The hardware has the ability to coordinate with the OS to handle the case where the destination thread is not currently running on any core. These cases do involve running kernel code, but they are the slow path. The fast path is handled without any kernel involvement. The authors measure an end-to-end latency of 1360 clock cycles for a userspace interrupt on Sapphire Rapids. Fig. 2 illustrates where that time goes: Source: https://dl.acm.org/doi/abs/10.1145/3676641.3716028 The largest cost is the pipeline flush when the receiving core receives the IPI. Section 3.4 describes experiments the authors performed to determine these numbers, including how they determined that the receiving processor pipeline is flushed. Note that “flush” here means that in-flight instructions are squashed (i.e., what happens when a branch misprediction is detected). An alternative strategy would be to drain the pipeline, which would let outstanding instructions commit before handling the interrupt. This would avoid duplicate work but would increase the latency of handling an interrupt. The authors propose tracked interrupts to solve the performance problem associated with pipeline flushes.
When an interrupt is received, the receiving core immediately injects the micro-ops needed to handle the interrupt into the pipeline. Outstanding instructions are not squashed. This may sound similar to draining, but there is a key difference. With draining, the processor waits for all inflight instructions to commit before injecting the interrupt handling micro-ops. With tracked interrupts, micro-ops enter the pipeline immediately, and the out-of-order machinery of the processor will not see any dependencies between the interrupt handling micro-ops and the inflight instructions. This means that the interrupt handling micro-ops can execute quickly and typically do not need to wait behind all inflight instructions. The paper has simulation results which show that tracked interrupts save 414 clock cycles on the receiver side. The paper also discusses some related improvements to UIPI: Userspace timer support Userspace interrupts from I/O devices Safepoints - to allow userspace interrupts to play nice with garbage collection I wonder how much of the Sapphire Rapids design was dictated by simplicity (to reduce the risk associated with this feature)? A frequent theme in papers reviewed on this blog is how difficult it is to implement fine-grained multithreading on general purpose multi-core CPUs. I wonder how much userspace interrupt support can help with that problem?

0 views
Jeff Geerling 5 days ago

Resizeable BAR support on the Raspberry Pi

While not an absolute requirement for modern graphics card support on Linux, Resizeable BAR support makes GPUs go faster, by allowing GPUs to throw data back and forth on the PCIe bus in chunks larger than 256 MB. In January, I opened an issue in the Raspberry Pi Linux repo, Resizable BAR support on Pi 5 .

0 views

Python 3.14 Is Here. How Fast Is It?

In November of 2024 I wrote a blog post titled "Is Python Really That Slow?" , in which I tested several versions of Python and noted the steady progress the language has been making in terms of performance. Today is the 8th of October 2025, just a day after the official release of Python 3.14. Let's rerun the benchmarks to find out how fast the new version of Python is!

0 views

No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory

No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory Hyoungjoo Kim, Yiwei Zhao, Andrew Pavlo, Phillip B. Gibbons VLDB'25 This paper describes how processing-in-memory (PIM) hardware can be used to improve OLTP performance. Here is a prior paper summary from me on a similar topic, but that one is focused on OLAP rather than OLTP. UPMEM is a specific PIM product (also used in the prior paper summarized on this blog). A UPMEM DIMM is like a DRAM DIMM, but each DRAM bank is extended with a simple processor which can run user code. That processor has access to a small local memory and the DRAM associated with the bank. This paper calls each processor a PIM Module. There is no direct communication between PIM modules. Fig. 2 illustrates the system architecture used by this paper. A traditional CPU is connected to a set of boring old DRAM DIMMs and is also connected to a set of UPMEM DIMMs. Source: https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory Four Challenges The paper identifies the following difficulties associated with using UPMEM to accelerate an OLTP workload: PIM modules can only access their local memory PIM modules do not have typical niceties associated with x64 CPUs (high clock frequency, caches, SIMD) There is a non-trivial cost for the CPU to send data to UPMEM DIMMs (similar to the CPU writing data to regular DRAM) OLTP workloads have tight latency constraints The authors arrived at a solution that both provides a good speedup and doesn’t require boiling the ocean. The database code and architecture remain largely unchanged. Much of the data remains in standard DRAM DIMMs, and the database operates on it as it always has. In section 3.2 the authors identify a handful of data structures and operations with near-memory affinity which are offloaded. These data structures are stored in UPMEM DIMMs, and the algorithms which access them are offloaded to the PIM modules. The key feature that these algorithms have in common is pointer chasing. The sweet spots the authors identify involve a small number of parameters sent from the CPU to a PIM module, then the PIM module performing multiple roundtrips to its local DRAM bank, followed by the CPU reading back a small amount of response data. The roundtrips to PIM-local DRAM have lower latency than accesses from a traditional CPU core. One operation which involves a lot of pointer chasing is B+ tree traversal. Thus, the system described in this paper moves B+ tree indexes into UPMEM DIMMs and uses PIM modules to search for values in an index. Note that the actual tuples that hold row data stay in plain-old DRAM. The tricky part is handling range queries while distributing an index across many banks. The solution described in this paper is to partition the set of keys into 2^R partitions (the lower bits of a key define the index of the partition which holds that key). Each partition is thus responsible for a contiguous array of keys. For a range query, the lower bits of the lower and upper bounds of the range can be used to determine which partitions must be searched. Each PIM module is responsible for multiple partitions, and a hash function is used to convert a partition index into a PIM module index.
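As a rough sketch of that near-memory “sweet spot” (the types and names here are mine, not the paper’s code or UPMEM’s SDK): the CPU writes a small request, the PIM module chases pointers entirely within its local DRAM bank, and only a small response travels back.

```c
#include <stdint.h>
#include <stddef.h>

/* A linked structure living in one PIM module's local DRAM bank.  The payload
 * itself (e.g. the actual tuple) stays in normal DRAM; only a reference to it
 * is stored here. */
typedef struct node {
    uint64_t     key;
    uint64_t     payload_ref;   /* e.g. address of a tuple in regular DRAM */
    struct node *next;
} node_t;

typedef struct {                /* small request written by the CPU        */
    const node_t *head;
    uint64_t      key;
} pim_request_t;

typedef struct {                /* small response read back by the CPU     */
    uint64_t payload_ref;       /* 0 if not found                          */
    uint32_t hops;
} pim_response_t;

/* Runs on the PIM module: every `cur->next` dereference is a short trip to the
 * bank-local DRAM instead of a cache-missing access from the host CPU. */
pim_response_t pim_lookup(pim_request_t req)
{
    pim_response_t resp = { 0, 0 };
    for (const node_t *cur = req.head; cur != NULL; cur = cur->next) {
        resp.hops++;
        if (cur->key == req.key) {
            resp.payload_ref = cur->payload_ref;
            break;
        }
    }
    return resp;
}
```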
MVCC is a concurrency control method which requires the database to keep around old versions of a given row (to allow older in-flight queries to access them). The set of versions associated with a row is typically stored in a linked list (yet another pointer traversal). Again, the actual tuple contents are stored in regular DRAM, but the list links are stored in UPMEM DIMMs, with the PIM modules traversing the links. Section 4.3 has more information about how old versions are eventually reclaimed with garbage collection. Fig. 7 has the headline results, comparing the baseline against the work described by this paper. It is interesting that the PIM-based system only beats the baseline for read-only workloads. Source: https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory Processing-in-memory can help with memory bandwidth and memory latency. It seems like this work is primarily focused on memory latency. I suppose this indicates that OLTP workloads are fundamentally latency-bound, because there is not enough potential concurrency between transactions to hide that latency. Is there no way to structure a database such that OLTP workloads are not bound by memory latency? It would be interesting to see if these tricks could work in a distributed system, where the PIM modules are replaced by separate nodes in the system.

0 views
Chris Coyier 1 weeks ago

Local by Flywheel was Ultra Slow Because I Had The Wrong Version

Maybe a few months ago on my Mac Studio, Local by Flywheel became incredibly slow: clicking any button in the UI would take literally minutes. I was gonna just give up and switch to Studio, but I had tried that once and it had enough rough edges I gave up (can’t remember why now, but I’ll give it a shot again one of these days). There were a few releases of Local since my problem, and each time I’d upgrade I’d cross my fingers it would fix it, and it never did. But then they released a version that now does some kind of “are you running the correct version?” check and it warned me that I wasn’t. I must have been (somehow, someway) running the “Mac Intel” version on my “Mac Apple Silicon” machine. So anyway, if Local by Flywheel is ultra mega slow for you, make sure to check you’re running the appropriate version for your machine. Derp.

0 views

Parendi: Thousand-Way Parallel RTL Simulation

Parendi: Thousand-Way Parallel RTL Simulation Mahyar Emami, Thomas Bourgeat, and James R. Larus ASPLOS'25 This paper describes an RTL simulator running on (one or more) Graphcore IPUs. One nice side benefit of this paper is the quantitative comparisons of IPU synchronization performance vs traditional CPUs. Here is another paper summary which describes some challenges with RTL simulation. The Graphcore IPU used in this paper is a chip with 1472 cores, operating with a MIMD architecture. A 1U server can contain 4 IPUs. It is interesting to see a chip that was designed for DNN workloads adapted to the domain of RTL simulation. Similar to other papers on RTL simulation, a fundamental step of the Parendi simulator is partitioning the circuit to be simulated. Parendi partitions the circuit into fibers. A fiber comprises a single (word-wide) register, and all of the combinational logic which feeds it. Note that some combinational logic may be present in multiple fibers. Fig. 3 contains an example: node a3 is present in multiple fibers. As far as I can tell, Parendi does not try to deduplicate this work (extra computation to save synchronization). Source: https://dl.acm.org/doi/10.1145/3676641.3716010 The driving factor in the design of this fiber-specific partitioning system is scalability. Each register has storage to hold the value of the register at the beginning and end of the current clock cycle (i.e., its current and next values). I think of the logic to simulate a single clock cycle as a two-phase loop over fibers, in which each fiber first computes the next value of the register it roots and then commits it; see the sketch at the end of this post. Scalability comes from the fact that there are only two barriers per simulated clock cycle. This is an instance of the bulk synchronous parallel (BSP) model. In many cases, there are more fibers than CPU/IPU cores. Parendi addresses this by distributing the simulation across chips and scheduling multiple fibers to run on the same core. If the simulation is distributed across multiple chips, then a min-cut algorithm is used to partition the fibers across chips while minimizing communication. The Parendi compiler statically groups multiple fibers together into a single process. A core simulates all fibers within a process. The merging process primarily seeks to minimize inter-core communication. First, a special case merging algorithm merges multiple fibers which reference the same large array. This is to avoid communicating the contents of such an array across cores. I imagine this is primarily for simulation of on-chip memories. Secondly, a general-purpose merging algorithm merges fibers which each have low compute cost, and high data sharing with each other. Fig. 7 compares Parendi vs Verilator simulation on two machines: a 2-socket server with 28 Intel cores per socket, and a 2-socket server with 64 AMD cores per socket: Source: https://dl.acm.org/doi/10.1145/3676641.3716010 Section 6.4 claims a roughly 2x improvement in cost per simulation using cloud pricing. As far as I can tell, this system doesn’t have optimizations for the case where some or all of a fiber’s inputs do not change between clock cycles. It seems tricky to optimize for this case while maintaining a static assignment of fibers to cores. Fig. 4 has a fascinating comparison of synchronization costs between an IPU and a traditional x64 CPU. This microbenchmark loads up the system with simple fibers (roughly 6 instructions per fiber). Note that the curves represent different fiber counts (e.g., the red dotted line represents 7 fibers on the IPU graph, vs 736 fibers on the x64 graph).
The paper claims that a barrier between 56 x64 threads implemented with atomic memory accesses consumes thousands of cycles, whereas the IPU has dedicated hardware barrier support. Source: https://dl.acm.org/doi/10.1145/3676641.3716010 This seems to be one of many examples of how generic multi-core CPUs do not perform well with fine-grained multi-threading. We’ve seen it with pipeline parallelism, and now with the BSP model. Interestingly, both cases seem to work better with specialized multi-core chips (pipeline parallelism works with CPU-based SmartNICs, BSP works with IPUs). I’m not convinced this is a fundamental hardware problem that cannot be addressed with better software.
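Here is the per-cycle sketch promised above, written with pthreads for concreteness. The names, types, and barrier setup are mine and purely illustrative: Parendi generates one specialized evaluation function per fiber and uses the IPU’s hardware barriers rather than pthread barriers.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_REGS 4096

/* Two copies of every register: its value at the start of the cycle and the
 * value it will take at the end of the cycle. */
static uint64_t cur[NUM_REGS], nxt[NUM_REGS];
static pthread_barrier_t barrier;   /* initialized elsewhere with the worker count */

/* Placeholder: Parendi would generate this per fiber from the RTL, evaluating
 * the combinational cone that feeds register i. */
uint64_t eval_fiber(size_t i, const uint64_t *regs) { return regs[i] + 1; }

/* Each worker thread owns fibers [first, last) and runs this loop. */
void simulate(size_t first, size_t last, uint64_t cycles)
{
    for (uint64_t c = 0; c < cycles; c++) {
        /* Phase 1: every fiber computes its register's next value from the
         * current values.  Fibers never read nxt[], so they can run in any
         * order, on any core, with no other synchronization. */
        for (size_t i = first; i < last; i++)
            nxt[i] = eval_fiber(i, cur);
        pthread_barrier_wait(&barrier);

        /* Phase 2: commit, making the next values current for cycle c+1. */
        for (size_t i = first; i < last; i++)
            cur[i] = nxt[i];
        pthread_barrier_wait(&barrier);
    }
}
```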

0 views
Shayon Mukherjee 1 weeks ago

Exploring PostgreSQL to Parquet archival for JSON data with S3 range reads

PostgreSQL handles large JSON payloads reasonably well until you start updating or deleting them frequently. Once payloads cross the 8 KB TOAST threshold and churn becomes high, autovacuum can dominate your I/O budget and cause other issues. I have been exploring the idea of moving older JSON data (read: cold data) to Parquet on S3 while keeping recent data hot in PostgreSQL daily partitions, then simply dropping those partitions instead of running expensive DELETE operations and subsequent vacuum cycles.

0 views

Skia: Exposing Shadow Branches

Skia: Exposing Shadow Branches Chrysanthos Pepi, Bhargav Reddy Godala, Krishnam Tibrewala, Gino A. Chacon, Paul V. Gratz, Daniel A. Jiménez, Gilles A. Pokam, and David I. August ASPLOS'25 This paper starts with your yearly reminder of the high cost of the Turing Tax : Recent works demonstrate that the front-end is a considerable source of performance loss [16], with upwards of 53% of performance [23] bounded by the front-end. Everyone knows that the front-end runs ahead of the back-end of a processor. If you want to think of it in AI terms, imagine a model that is told about the current value and recent history of the program counter, and asked to predict future values of the program counter. The accuracy of these predictions determines how utilized the processor pipeline is. What I did not know is that in a modern processor, the front-end itself is divided into two decoupled components, one of which runs ahead of the other. Fig. 4 illustrates this Fetch Directed Instruction Prefetch (FDIP) microarchitecture: Source: https://dl.acm.org/doi/10.1145/3676641.3716273 The Instruction Address Generator (IAG) runs the furthest ahead and uses tables (e.g., the Branch Target Buffer (BTB)) in the Branch Prediction Unit (BPU) to predict the sequence of basic blocks which will be executed. Information about each predicted basic block is stored in the Fetch Target Queue (FTQ). The Instruction Fetch Unit (IFU) uses the control flow predictions from the FTQ to actually read instructions from the instruction cache. Some mispredictions can be detected after an instruction has been read and decoded. These result in an early re-steer (i.e., informing the IAG about the misprediction immediately after decode). When a basic block is placed into the FTQ, the associated instructions are prefetched into the IFU (to reduce the impact of instruction cache misses). This paper introduces the term “shadow branch”. A shadow branch is a (static) branch instruction which is currently stored in the instruction cache but is not present in any BPU tables. The top of fig. 5 illustrates a head shadow branch. A branch instruction caused execution to jump to byte 24 and execute the non-shaded instructions. This causes an entire cache line to be pulled into the instruction cache, including the branch instruction starting at byte 19. The bottom of fig. 5 shows a tail shadow branch. In this case, the instruction at byte 12 jumped away from the cache line, causing the red branch instruction at byte 16 to not be executed (even though it is present in the instruction cache). Source: https://dl.acm.org/doi/10.1145/3676641.3716273 Skia The proposed design (Skia) allows the IAG to make accurate predictions for a subset of shadow branches, thus improving pipeline utilization and reducing instruction cache misses. The types of shadow branches which Skia supports are: direct unconditional branches (whose target PC can be determined without looking at backend state), function calls, and returns. As shown in Fig. 6, these three categories of branches (purple, red, orange) account for a significant fraction of all BTB misses: Source: https://dl.acm.org/doi/10.1145/3676641.3716273 When a cache line enters the instruction cache, the Shadow Branch Decoder (SBD) decodes just enough information to locate shadow branches in the cache line and determine the target PC (for direct unconditional branches and function calls).
Metadata from the SBD is placed into two new branch prediction tables in the BPU: The U-SBB holds information about direct unconditional branches and function calls The R-SBB holds information about returns When the BPU encounters a BTB miss, it can fall back to the U-SBB or R-SBB for a prediction. Fig. 11 illustrates the microarchitectural changes proposed by Skia: Source: https://dl.acm.org/doi/10.1145/3676641.3716273 Section 4 goes into more details about these structures including: Replacement policy How a shadow branch is upgraded into a first-class branch in the BTB Handling variable length instructions Fig. 14 has (simulated) IPC improvements across a variety of benchmarks: Source: https://dl.acm.org/doi/10.1145/3676641.3716273 Dangling Pointers A common problem that HW and SW architects solve is getting teams out of a local minimum caused by fixed interfaces. The failure mode is when two groups of engineers agree on a static interface, and then each optimize their component as best they can without changing the interface. In this paper, the interface is the ISA, and Skia is a clever optimization inside of the CPU front-end. Skia shows that there is fruit to be picked here. It would be interesting to examine potential performance gains from architectural (i.e., ISA) changes to pick the same fruit.

0 views

Principles and Methodologies for Serial Performance Optimization

Principles and Methodologies for Serial Performance Optimization Sujin Park, Mingyu Guan, Xiang Cheng, and Taesoo Kim OSDI'25 This paper is a psychological trojan horse for computer nerds of a certain vintage. Every paragraph of sections 3 and 4 inflates the ego a bit more. One arrives at section 5 feeling good about their performance optimization skillset, and then one learns that these skills can be replaced by an LLM. A faint hissing sound reaches one’s ears as one’s ego deflates into a shriveled piece of plastic on the ground.
Eight Methodologies
The authors reviewed 477 papers containing specific instances of serial performance optimization and boiled the techniques down into eight categories. Serial is the operative word here: this paper is about optimizing portions of code that cannot be parallelized. However, some of the techniques are applicable in computations that comprise a serial portion (the critical path) and a parallel portion. Here are the techniques:
Batching: amortizing a fixed cost over many items. For example, refactoring code so that some computation can be hoisted out of a (per-item) loop.
Caching: store the results of a computation in memory so that it can be used later on. Memoization is a good example.
Precomputing: compute something earlier than is necessary (possibly speculatively). This works in programs which alternate between serial and parallel portions. If the precomputation can be done during a parallel portion, then it can be thought of as “free” because it is off of the critical path.
Deferring: act like one of my high schoolers: don’t do work now when it could be done later. This works great in late spring, when a teacher realizes they have assigned too much work for the year so they cancel an assignment that most students (not mine) have already completed. Deferring interacts with other techniques: similar to precomputing, deferring can move work off of the critical path, and deferring can enable batching by deferring work until a large batch has been accumulated.
Compute a quick and dirty approximation to the right answer rather than the exactly right answer.
Make a generic component more efficient for a specific use case. For example, a library user could pass hints at runtime which give the library more information to make tradeoffs. Or use profile-guided optimization to enable the compiler to make better decisions when compiling the generic library.
Use a hardware accelerator, to avoid paying the Turing Tax .
Chip away at performance inefficiencies caused by the abstractions in a layered software architecture. For example, DPDK allows applications to bypass many networking abstractions.
After finishing section 4 of the paper, I felt pretty good. My ego happily agreed with a world view that says that serial performance optimization can be expressed in terms of 8 “basis vectors”, all of which I had extensive experience with. And then came section 5.
The authors fine-tuned GPT-4o with a dataset derived from analyzing SOSP and OSDI papers. Each item in the dataset comprises a problem description, observations, and solutions (in terms of the 8 techniques described earlier). The authors made data and scripts available here . The fine-tuned model is called SysGPT . Table 4 shows example inputs (problems + observations), the output from GPT-4, the output from SysGPT, and the actual solution from a real-world paper.
Here is an example: Source: https://www.usenix.org/conference/osdi25/presentation/park-sujin Table 5 has quantitative results, where each model is only asked to choose a subset of the 8 optimization strategies for a given problem (N represents the number of example problems given to the baseline GPT-4o model in a prompt): Source: https://www.usenix.org/conference/osdi25/presentation/park-sujin Dangling Pointers It would be interesting to extend these recipes to also include problems which can be parallelized. This paper assumes that the problem and observations are produced by humans. It would be fascinating to see how much of that can be automated. For example, a model could have access to source code and profiling information, and output a set of observations about system performance before optimization.

0 views
Evan Schwartz 2 weeks ago

Subtleties of SQLite Indexes

In the last 6 months, Scour has gone from ingesting 330,000 pieces of content per month to over 1.4 million this month. The massive increase in the number of items slowed down the ranking for users' feeds and sent me looking for ways to speed it up again. After spending too many hours trying in vain to squeeze more performance out of my queries and indexes, I dug into how SQLite's query planner uses indexes, learned some of the subtleties that explained why my initial tweaks weren't working, and sped up one of my main queries by ~35%.

Scour is a personalized content feed that finds articles, blog posts, etc. related to users' interests. For better and for worse, Scour does its ranking on the fly whenever users load their feeds page. Initially, this took 100 milliseconds or less, thanks to binary vector embeddings and the fact that it's using SQLite, so there is no network latency to load data. The most important table in Scour's database is the table. It includes an ID, URL, title, language, publish date (stored as a Unix timestamp), and a text quality rating. Scour's main ranking query filters items based on when they were published, whether they are in a language the user understands, and whether they are above a certain quality threshold. The question is: what indexes do we need to speed up this query?

When I first set up Scour's database, I put a bunch of indexes on the table without really thinking about whether they would help. For example, I had separate indexes on the published date, the language, and the quality rating. Useless. It's more important to have one or a small handful of good composite indexes on multiple columns than to have separate indexes on each column. In most cases, the query planner won't bother merging the results from two indexes on the same table. Instead, it will use one of the indexes and then scan all of the rows that match the filter for that index's column. It's worth being careful to only add indexes that will be used by real queries. Having additional indexes on each column won't hurt read performance. However, each index takes up storage space, and more indexes will slow down writes, because all of the indexes need to be updated when new rows are inserted into the table.

If we're going to have an index on multiple columns, which columns should we include and what order should we put them in? The order of conditions in a query doesn't matter, but the order of columns in an index very much does. Columns that come earlier in the index should be more "selective": they should help the database narrow the results set as much as possible. In Scour's case, the most selective column is the publish date, followed by the quality rating, followed by the language. I put an index on those columns in that order: ...and found that SQLite was only using one of the columns. Running this query: Produced this query plan: It was using the right index but only filtering by (note the part of the plan that says ). Puzzling.

My aha moment came while watching Aaron Francis' High Performance SQLite course. He said the main rule for SQLite indexes is: "Left to right, no skipping, stops at the first range." (This is a much clearer statement of the implications of the Where Clause Analysis buried in the Query Optimizer Overview section of the official docs.) This rule means that the query planner will use the index's columns from left to right, will not skip over a column it can't use, and will stop using the index at the first range condition it encounters. My query includes two ranges ( ), so the "stops at the first range" rule explains why I was only seeing the query planner use one of those columns.
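To see the "left to right, no skipping, stops at the first range" rule in action, here is a small self-contained sketch of my own. The table, column, and index names are hypothetical stand-ins (not Scour's actual schema or queries), and it uses Go with the mattn/go-sqlite3 driver purely as a convenient way to script EXPLAIN QUERY PLAN:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

func main() {
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical schema, loosely modeled on the description in the post.
	if _, err := db.Exec(`
		CREATE TABLE items (
			id INTEGER PRIMARY KEY,
			published_at INTEGER, -- unix timestamp
			quality REAL,
			language TEXT
		)`); err != nil {
		log.Fatal(err)
	}

	// Two range conditions plus one equality condition.
	query := `SELECT id FROM items
	          WHERE published_at > ? AND quality <= ? AND language = ?`

	explain := func(label string) {
		rows, err := db.Query("EXPLAIN QUERY PLAN "+query, 0, 0.9, "en")
		if err != nil {
			log.Fatal(err)
		}
		defer rows.Close()
		fmt.Println(label)
		for rows.Next() {
			var id, parent, notused int
			var detail string
			if err := rows.Scan(&id, &parent, &notused, &detail); err != nil {
				log.Fatal(err)
			}
			fmt.Println("   ", detail)
		}
	}

	// "Most selective first": the leading column is a range condition,
	// so the planner stops there and can only use published_at from this index.
	if _, err := db.Exec(`CREATE INDEX idx_pub ON items (published_at, quality, language)`); err != nil {
		log.Fatal(err)
	}
	explain("index (published_at, quality, language):")

	// Equality column first: the planner can use the language equality
	// and then one range column (published_at).
	if _, err := db.Exec(`DROP INDEX idx_pub;`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`CREATE INDEX idx_lang_pub ON items (language, published_at)`); err != nil {
		log.Fatal(err)
	}
	explain("index (language, published_at):")
}
```

Running it prints the plan detail rows: with the first index the search should only constrain published_at, while the reordered index should let the planner constrain the language equality and then the published_at range.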
This rule does, however, suggest that I can create an index that will allow SQLite to filter by the one non-range condition ( ) before using one of the ranges: The query plan shows that it can use both the and columns (note the part that reads ): Now we're using two out of the three columns to quickly filter the rows. Can we use all three? (Remember, the query planner won't be able to use multiple range queries on the same index, so we'll need something else.)

SQLite has a nice feature called Partial Indexes that allows you to define an index that only applies to a subset of the rows matching some conditions. In Scour's case, we only really want items where the is less than or equal to 90%. The model I'm using to judge quality isn't that great, so I only trust it to filter out items if it's really sure they're low quality. This means I can create an index like this: And then update the query to use the same condition: And it should use our new partial index... right? Wrong. This query is still using the previous index.

There's a subtle mistake in the relationship between our index and our query. Can you spot it? Our index contains the condition but our query says . These are mathematically equivalent, but they are not exactly the same. SQLite's query planner requires the conditions to match exactly in order for it to use a partial index. Relatedly, a condition of or even in the query would also not utilize the partial index. If we rewrite our query to use the exact same condition of , we get the query plan: Now, we're starting with the items whose , then using the index to find the items in the desired language(s), and lastly narrowing down the results to the items that were published in the correct time range.

As mentioned in the intro, these changes to the indexes and one of Scour's main ranking queries yielded a ~35% speedup. Enabling the query planner to make better use of the indexes means that SQLite doesn't need to scan as many rows to find the ones that match the query conditions. Concretely, in Scour's case, filtering by language removes about 30% of items for most users, and filtering out low quality content removes a further 50%. Together, these changes reduce the number of rows scanned by around 66%. Sadly, however, a 66% reduction in the number of rows scanned does not directly translate to a 66% speedup in the query. If we're doing more than counting rows, the work to load the data out of the database and process it can be more resource intensive than scanning rows to match conditions. (The optimized queries and indexes still load the same number of rows as before; they just identify the desired rows faster.) Nevertheless, a 35% speedup is a noticeable improvement.

It's worth digging into how your database's query planner uses indexes to help get to the bottom of performance issues. If you're working with SQLite, remember that:

- one good composite index usually beats separate single-column indexes;
- the planner uses index columns left to right, without skipping, and stops at the first range condition;
- a partial index is only used when the query's condition matches the index's condition exactly.

Thanks to Aaron Francis for putting together the High Performance SQLite course! (I have no personal or financial relationship to him, but I appreciate his course unblocking me and helping me speed up Scour's ranking.) Thank you also to Adam Gluck and Alex Kesling for feedback on this post.

5 views

Accelerate Distributed Joins with Predicate Transfer


1 views
Dizzy Zone 2 weeks ago

Redis is fast - I'll cache in Postgres

There are books & many articles online, like this one arguing for using Postgres for everything. I thought I'd take a look at one use case: using Postgres instead of Redis for caching. I work with APIs quite a bit, so I'd build a super simple HTTP server that responds with data from that cache. I'd start from Redis, as this is something I frequently encounter at work, then switch it out to Postgres using unlogged tables and see if there's a difference.

I'll run the experiment on my homelab's k8s cluster. The idea is to run Postgres or Redis on one node, limiting it to 2 CPUs via k8s limits, as well as 8GiB of memory. On another node, I'll run the web server itself and then spin up a pod for the benchmark executed via k6 on the third. Both Postgres and Redis are used with the out-of-the-box settings for the following images:

I wrote a simple webserver, with 2 endpoints, a cache and a "Session" struct which we'll store in the cache: For Redis, I've implemented the cache using as follows: The Postgres cache is implemented using the library:

I'll seed Redis and Postgres with 30 million entries each, keeping record of the inserted uuids. From there, I'll generate a subset of existing uuids to use while benchmarking. This allows for simulating both hits and misses. I'll do a few runs of benchmarks for gets first, then sets, and then a mixed run. Each run will execute for 2 minutes. I'll look at the number of operations per second and latencies, as well as memory and CPU usage during those times. To simulate a somewhat real scenario where only a subset of keys exist in the cache, the set benchmark will have a 10% chance to update an existing key, whereas the get will have an 80% chance of picking an existing key. The mixed workload will have a 20% chance to execute a set scenario and 80% for the get scenario.

Redis performed better than Postgres, which did not surprise me at all. The bottleneck was actually the HTTP server. The machine running the HTTP server maxed out on CPU, with Redis running comfortably at ~1280mCPU, short of the 2000mCPU limit imposed. Redis used ~3800MiB of RAM, which stayed flat across the runs. For Postgres, the bottleneck was the CPU on the Postgres side. It consistently maxed out the 2 cores dedicated to it, while also using ~5000MiB of RAM. Redis also did better when it comes to latencies of the HTTP responses:

Once again Redis performed better. The CPU usage stayed roughly the same as in the GET experiment, with the RAM usage growing to ~4300MiB due to the new keys being inserted. The bottleneck stayed on the HTTP server side, with Redis using ~1280mCPU once again. Postgres once again was bottlenecked by the CPU, constantly using 100% of the 2 cores it was limited to. During the course of the run, the memory usage grew to ~5500MiB. During the test, the endpoints with the Redis cache implementation also had better latencies:

The mixed benchmark also returned the predictable result of Redis reigning superior. As has been the story so far, the CPU stayed put at ~1280mCPU, and RAM usage grew a bit due to the new keys being inserted. Postgres maxed out the two cores and reached around 6GiB of memory used. Latencies once again were better when using Redis:

In the benchmark, I've used an unlogged table for Postgres, but this has not seemed to help, or has it? If I rerun the same benchmark with a normal (logged) table we can look at the numbers.
The unlogged table makes a huge difference for the write benchmark and a somewhat smaller but still significant one for the mixed workload. This is because unlogged tables skip the write-ahead log, making them a lot faster for writes. There's very little difference in read performance, though, and I expect more runs would show the two test cases converging.

Redis is faster than Postgres when it comes to caching, there's no doubt about it. It conveniently comes with a bunch of other useful functionality that one would expect from a cache, such as TTLs. It was also bottlenecked by the hardware, my service, or a combination of both, and could definitely show better numbers. Surely, we should all use Redis for our caching needs then, right?

Well, I think I'll still use Postgres. Almost always, my projects need a database. Not having to add another dependency comes with its own benefits. If I need my keys to expire, I'll add a column for it, and a cron job to remove those keys from the table (a rough sketch of that idea is below). As far as speed goes, 7425 requests per second is still a lot. That's more than half a billion requests per day. All on hardware that's 10 years old and using laptop CPUs. Not many projects will reach this scale, and if they do I can just upgrade the Postgres instance or, if need be, spin up Redis then. Having an interface for your cache so you can easily switch out the underlying store is definitely something I'll keep doing exactly for this purpose. Thanks for reading!
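To make the "Postgres as cache" idea concrete, here is a rough sketch of an unlogged cache table with an expiry column, written in Go against database/sql with the pgx stdlib driver. This is my own illustration, not the post's actual code; the table name, schema, and API are invented for the example:

```go
package cache

import (
	"context"
	"database/sql"
	"errors"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver for database/sql
)

// PGCache is a minimal Postgres-backed key/value cache. The table is
// UNLOGGED, so writes skip the WAL (faster, but not crash-safe), which is
// usually an acceptable trade-off for cache data.
type PGCache struct {
	db *sql.DB
}

func New(ctx context.Context, dsn string) (*PGCache, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	// Hypothetical schema: a key, an opaque value, and an expiry column.
	_, err = db.ExecContext(ctx, `
		CREATE UNLOGGED TABLE IF NOT EXISTS kv_cache (
			key        text PRIMARY KEY,
			value      bytea NOT NULL,
			expires_at timestamptz NOT NULL
		)`)
	if err != nil {
		return nil, err
	}
	return &PGCache{db: db}, nil
}

// Set upserts a value with a TTL.
func (c *PGCache) Set(ctx context.Context, key string, value []byte, ttl time.Duration) error {
	_, err := c.db.ExecContext(ctx, `
		INSERT INTO kv_cache (key, value, expires_at)
		VALUES ($1, $2, $3)
		ON CONFLICT (key) DO UPDATE
		SET value = EXCLUDED.value, expires_at = EXCLUDED.expires_at`,
		key, value, time.Now().Add(ttl))
	return err
}

// Get returns the value if present and not yet expired.
func (c *PGCache) Get(ctx context.Context, key string) ([]byte, bool, error) {
	var value []byte
	err := c.db.QueryRowContext(ctx, `
		SELECT value FROM kv_cache
		WHERE key = $1 AND expires_at > now()`, key).Scan(&value)
	if errors.Is(err, sql.ErrNoRows) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	return value, true, nil
}

// Sweep deletes expired rows; run it from a cron job or a background goroutine.
func (c *PGCache) Sweep(ctx context.Context) error {
	_, err := c.db.ExecContext(ctx, `DELETE FROM kv_cache WHERE expires_at <= now()`)
	return err
}
```

A periodic job calling Sweep plays the role of the key-expiry cleanup mentioned above, and the whole thing can sit behind the same cache interface as a Redis implementation so the store can be swapped later.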

0 views
André Arko 3 weeks ago

Adventures in CPU contention

Recently on this blog, I wrote about in-memory filesystems in Rust, and concluded that I wasn't able to detect a difference between any form of in-memory filesystem and using a regular SSD on macOS. I also asked anyone who found a counterexample to please let me know. Last week, David Barsky of ERSC sent me an extremely compelling counter-example, and I spent several days running benchmarks to understand it better. The top level summary is that the test suite for the jj VCS exhibits an absolutely huge difference between running on an SSD and running against a ramdisk. In my first reproduction attempt, I found the SSD took 239 seconds, while the ramdisk took just 37 seconds. That's bananas! How was that even possible? What I discovered will amaze, distress, and astound you. Probably.

First, the context. The jj project recently shipped a change to always use when persisting a temporary file. My understanding is that this change was made to prevent certain kinds of bad data being written.

0 views
Dayvster 3 weeks ago

Why Zig Feels More Practical Than Rust for Real-World CLI Tools

## Introduction

So when it comes to memory management there are two terms you really need to know: the stack and the heap. The stack is a region of memory that stores temporary data that is only needed for a short period of time. It operates in a last-in, first-out (LIFO) manner, meaning that the most recently added data is the first to be removed, as the name suggests. Basically, imagine a stack of plates: if you want to remove one plate, you remove the top one; remove the middle plate and disaster awaits in this analogy. The stack is typically used for storing function parameters, local variables, and return addresses. It is fast and efficient because it has a fixed size and does not require dynamic memory allocation. The size of the stack is usually limited, and if a program tries to use more stack space than is available, it can result in a stack overflow error. This can happen if a function calls itself recursively too many times or if a program allocates too much memory on the stack.

The heap, as the name suggests, is a region of memory that is used for dynamic memory allocation. Unlike the stack, the heap does not have a fixed size and can grow or shrink as needed. The heap is typically used for storing data that needs to persist beyond the lifetime of a single function call, such as objects or data structures that are created at runtime. Imagine the heap as a pile of clothes in a disorganized household: you can add or remove clothes as needed, and as long as the pile isn't too big you can find what you need with relative speed and ease. But it will quickly become a nightmare if you let it grow out of control. The heap is managed by the operating system and requires dynamic memory allocation, which can be slower and less efficient than stack allocation. The heap can also become fragmented over time, since we do not always store data in a contiguous block of memory. This can lead to performance issues and make it more difficult to allocate large blocks of memory.

### Rust's Borrow Checker

Rust's borrow checker is a pretty powerful tool that helps ensure memory safety at compile time. It enforces a set of rules that govern how references to data can be used, preventing common memory safety errors such as null pointer dereferencing, dangling pointers, and so on. However, you may have noticed the words **compile time** in the previous sentence. If you have any experience with systems programming, you will know that compile time and runtime are two very different things. Basically, compile time is when your code is being translated into machine code that the computer can understand, while runtime is when the program is actually running and executing its instructions. The borrow checker operates at compile time, which means that it can only catch memory safety issues that can be determined statically, before the program is actually run. It will not fix the underlying issue: developers misunderstanding memory lifetimes or overcomplicating ownership. The compiler can only enforce the rules you're trying to follow; it can't teach you good patterns, and it won't save you from bad design choices.

### Story Time

Last weekend I made a simple CLI tool for myself to help me manage my notes. It parses `~/.notes` into a list of notes, then builds a tag index mapping strings to references into that list. Straightforward, right? Not in Rust.
The borrow checker blocks you the moment you try to add a new note while also holding references to the existing ones. Mutability and borrowing collide, lifetimes show up, and suddenly you're restructuring your code around the compiler instead of the actual problem. In Zig, we would just allocate the list with an allocator, store pointers into it for the tag index, and mutate freely when we need to add or remove notes. No lifetimes, no extra wrappers, no compiler gymnastics; that's a lot more straightforward.

### But Dave, isn't that the exact point of Rust's borrow checker?

Yes it is. However, by using Zig I managed to get most of the benefits of Rust's memory safety without the complexity or ceremony of the borrow checker. All it took was some basic understanding of memory management and a bit of discipline. I was able to produce two CLIs that are both memory safe and efficient; however, the Zig one was way more straightforward, easier to reason about, and took less time to write.

## What is Safety, Really, for CLI Tools?

This is where a lot of developers trip up. Rust markets itself as a language that produces safe software. Great marketing hook, but one tiny problem: memory safety is one puzzle piece of overall software safety. I'm not sure if the Rust foundation does this on purpose, as a sort of blanket statement to make it seem like memory safety is the end all be all of software safety, or if they just don't want to constantly prefix safety with memory safety (even though they should). But back to the main point: memory safety is just one aspect of software safety. You can argue whether it's a big or small piece of the puzzle; I'd say it depends on the software and use-case, but it's definitely not the only piece.

So what exactly is safety in terms of CLI tools? Memory safety alone does not make a program safe. Your CLI tool can still crash, produce wrong results, corrupt files, leak sensitive data, be vulnerable to various types of attacks, or just behave in a way that is not expected. Let's go back to my `Notes CLI`: its Rust version may never segfault, but it could silently overwrite my index or tags or corrupt my files if I make a mistake in my logic, or perhaps it could store my file in a temporary location that is world readable, exposing my notes to anyone on the system. Is that safe? No. Would using Zig solve any of those issues automatically? Also no. Is my example a bit contrived? Yes, but it illustrates the point that memory safety is not the only thing that matters when it comes to software safety. In fact, you should also consider other aspects of safety, such as:

- **Predictable Behavior**: The program should do what the user expects, even when input is malformed or unexpected. A CLI that panics on a missing file or fails silently on a corrupted note is not safe.
- **Avoiding Crashes or Silent Corruption**: The program should handle errors gracefully, providing meaningful feedback to the user instead of crashing or corrupting data. A CLI that crashes on a malformed note or silently overwrites existing notes is not safe.
- **Manageable Performance**: The program should perform well under expected workloads, avoiding excessive resource consumption or slowdowns. A CLI that becomes unresponsive when managing a large number of notes is not safe. This is where it really helps to understand memory allocations and the performance characteristics of your language of choice.
- **Sensitive Data Handling**: The program should protect sensitive data from unauthorized access or exposure.
A CLI that stores notes in a world-readable temporary file is not safe.
- **Robustness Against Attacks**: The program should be resilient against common attack vectors, such as injection attacks or buffer overflows. A CLI that can be exploited to execute arbitrary code or corrupt data is not safe.

And this is precisely where Rust's memory safety shines: it can help prevent certain types of vulnerabilities that arise from memory mismanagement. However, it's not a silver bullet that guarantees overall safety.

## The Borrow Checker: Strengths and Limitations

The borrow checker is impressive. It prevents dangling references, double frees, and mutable aliasing at compile time, things that would otherwise cause segfaults or undefined behavior. It's why Rust can claim "memory safe without a garbage collector."

### Strengths:

- **Zero data races / mutable aliasing issues**: The compiler guarantees that only one mutable reference exists at a time, and that immutable references cannot be combined with mutable ones.
- **Strong compile-time guarantees**: Many memory-related bugs are caught before you even run the program.
- **Early bug detection**: You find mistakes before shipping code, which is a huge win in long-lived services or concurrent systems.

### Limitations / Pain Points:

- **Cognitive overhead**: You're constantly thinking about lifetimes, ownership, and borrow scopes, even for simple tasks. A small CLI like my notes tool suddenly feels like juggling hot potatoes.
- **Boilerplate and contortions**: You end up introducing clones, wrappers (Rc, RefCell), or redesigning data structures just to satisfy the compiler. Your code starts serving the compiler, not the problem.
- **Compile-time only**: The borrow checker cannot fix logic bugs, prevent silent corruption, or make your CLI behave predictably. It only ensures memory rules are followed.
- **Edge cases get messy**: Shared caches, global state, or mutable indexes often trigger lifetime errors that are annoying to work around.

At this point, the Rust borrow checker can feel more like a mental tax than a helpful tool, especially for short-lived CLI projects. You're trading developer ergonomics for a compile-time guarantee that, in many CLI scenarios, may be overkill.

## Zig's Approach to Safety and Simplicity

Zig takes a different approach to safety and simplicity. It provides manual memory management with optional safety checks, allowing developers to choose the level of control they need. This can lead to more straightforward code for certain use cases, like CLI tools. However, where it really shines is in how it does manual memory management; I've briefly touched upon this in my other blog post [Zig Allocators Explained](https://dayvster.com/blog/zig-allocators-explained/). Long story short, Zig gives you allocators: a set of tools that help you manually manage your memory in a more structured and predictable way. You can choose to use a general-purpose allocator like `std.heap.GeneralPurposeAllocator`, or you can create your own custom allocator that fits your specific needs. This allows you to have more control over how memory is allocated and deallocated, which can lead to more efficient and predictable memory usage. Combined with Zig's `defer` statement, which allows you to schedule cleanup code to run when a scope is exited, this makes it easy to manage resources and gives you most of the power of Rust's borrow checker at your disposal without the complexity and ritual.
However, it asks one thing of you in return: discipline. Your software will only be as safe as you make it. We can make the same claim about Rust: you can throw `copy` and `clone` and `unsafe` around your code and throw away all the benefits of the borrow checker in a heartbeat. The two languages are polar opposites in this regard. Zig places the burden on the developer and makes it easy for them to produce memory safe software, whereas Rust places the burden on the compiler and makes it hard for developers to produce memory unsafe software.

Back to the main point: Zig's approach to memory management is, in my subjective opinion, more practical for most of my use cases, especially for CLI tools. It allows me to write straightforward code that is easy to reason about and maintain, without the overhead of the borrow checker. I can allocate a list of notes, store pointers to them in a tag index, and mutate the list freely when I need to add or remove notes. No lifetimes, no extra wrappers, no compiler gymnastics; that's a lot more straightforward. Oh, I almost forgot: Zig also has the `comptime` feature, which allows you to execute code at compile time. This can be useful for generating code, performing static analysis, or optimizing performance, and even for testing, which is a really nice bonus and can be a small helper when it comes to memory safety.

## Developer Ergonomics Matter and Developers are not Idiots

When developing software we want to be productive and efficient, and most of all we want to be correct and produce good software. However, we also want to enjoy the process of creation and not feel like we are fighting the tools we use. Developer ergonomics is a term that refers to how easy and comfortable it is to use a programming language or framework. It encompasses things like syntax, tooling, documentation, and community support. A language with good developer ergonomics can make it easier to write correct code, while a language with poor developer ergonomics can make it harder to do so. I'd say, as it currently stands, Rust has poor developer ergonomics but produces memory safe software, whereas Zig has good developer ergonomics and allows me to produce memory safe software with a bit of discipline.

I personally prefer languages where I do not have to succumb to too much ceremony and ritual to get things done. I want to be able to express my ideas in code without having to constantly think about the underlying mechanics of the language, and yet I want to be responsible and produce good software. With C and C++ this was a tiny bit harder, as you basically had to learn some useful and practical memory management patterns and techniques; Zig comes with them baked in. I feel like Zig really respects its developers and treats them like adults: it gives you the tools and expects you to use them wisely. Rust, on the other hand, feels like it treats developers like children that need to be constantly supervised and guided, which can be frustrating and demotivating.

Developers are not idiots. Sure, even the smartest amongst us still produce memory safety issues or bugs in their software, and it's silly to assume that with enough training and practice we can become perfect developers, but we can become better developers. We can learn from our mistakes and improve our skills; we can learn to write better code and produce better software. It's not good to abstract that away to the compiler and assume that it will magically make us better developers; I don't personally think it will.
In fact, not to sound too cliché, but I think that the journey to becoming a better developer is a series of mistakes and fixes: we learn from our mistakes and improve our skills. What does it say about a language that tries to abstract away the mistakes we make? Does it really help us become better developers?

## Final Thoughts

Rust is amazing if you're building something massive, multithreaded, or long-lived, where compile-time guarantees actually save your life. The borrow checker, lifetimes, and ownership rules are a boon in large systems. But for small, practical CLI tools? Rust can feel like overkill. That's where Zig shines. Lightweight, fast, and straightforward: you get memory safety without constantly bending over backward for the compiler. You can allocate a list, track pointers, and mutate freely without extra wrappers, lifetimes, or contortions. Iterating feels natural, the code is easier to reason about, and you get stuff done faster.

Memory safety is important, but it's just one piece of the puzzle. Predictable behavior, maintainable code, and robustness are just as critical, and that's exactly why Zig often feels more practical for real-world CLI tools. At the end of the day, it's not about which language is "better." It's about what fits your workflow and the kinds of projects you build. For me, Zig hits the sweet spot: memory safe, low ceremony, and developer-friendly, perfect for small tools that actually get things done.

## References

- The Stack and the Heap
- Zig Allocators Explained
- Rustonomicon - The Dark Arts of Unsafe Rust
- Zig Documentation - Memory Management
- Rust Documentation - The Rust Programming Language
- Zig Documentation - Comptime
- Rust Documentation - Ownership
- Zig Documentation - Defer
- Rust Documentation - Error Handling
- Zig Documentation - Error Handling
- Rust Documentation - Concurrency
- Zig Documentation - Concurrency
- Rust Documentation - Testing
- Zig Documentation - Testing
- Rust Documentation - Performance

0 views
Anton Zhiyanov 3 weeks ago

Native threading and multiprocessing in Go

As you probably know, the only way to run tasks concurrently in Go is by using goroutines. But what if we bypass the runtime and run tasks directly on OS threads or even processes? I decided to give it a try. To safely manage threads and processes in Go, I'd normally need to modify Go's internals. But since this is just a research project, I chose to (ab)use cgo and syscalls instead. That's how I created multi, a small package that explores unconventional ways to handle concurrency in Go.

Multi offers three types of "concurrent groups". Each one has an API similar to , but they work very differently under the hood:

runs Go functions in goroutines that are locked to OS threads. Each function executes in its own goroutine. Safe to use in production, although unnecessary, because the regular non-locked goroutines work just fine.

runs Go functions in separate OS threads using POSIX threads. Each function executes in its own thread. This implementation bypasses Go's runtime thread management. Calling Go code from threads not created by the Go runtime can lead to issues with garbage collection, signal handling, and the scheduler. Not meant for production use.

runs Go functions in separate OS processes. Each function executes in its own process forked from the main one. This implementation uses process forking, which is not supported by the Go runtime and can cause undefined behavior, especially in programs with multiple goroutines or complex state. Not meant for production use.

All groups offer an API similar to .

Runs Go functions in goroutines that are locked to OS threads. starts a regular goroutine for each call, and assigns it to its own thread. Here's a simplified implementation: goro/thread.go You can use channels and other standard concurrency tools inside the functions managed by the group.

Runs Go functions in separate OS threads using POSIX threads. creates a native OS thread for each call. It uses cgo to start and join threads. Here is a simplified implementation: pthread/thread.go You can use channels and other standard concurrency tools inside the functions managed by the group.

Runs Go functions in separate OS processes forked from the main one. forks the main process for each call. It uses syscalls to fork processes and wait for them to finish. Here is a simplified implementation: proc/process.go You can only use to exchange data between processes, since regular Go channels and other concurrency tools don't work across process boundaries.

Running some CPU-bound workload (with no allocations or I/O) on Apple M1 gives these results: And here are the results from GitHub actions: One execution here means a group of 4 workers each doing 10 million iterations of generating random numbers and adding them up. See the benchmark code for details. As you can see, the default concurrency model ( in the results, using standard goroutine scheduling without meddling with threads or processes) works just fine and doesn't add any noticeable overhead. You probably already knew that, but it's always good to double-check, right?

I don't think anyone will find these concurrent groups useful in real-world situations, but it's still interesting to look at possible (even if flawed) implementations and compare them to Go's default (and only) concurrency model. Check out the nalgeon/multi repo for the implementation. P.S. Want to learn more about concurrency? Check out my interactive book.
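The simplified implementations the post refers to (goro/thread.go, pthread/thread.go, proc/process.go) aren't reproduced above. As a rough stand-in for the first flavor, here is a minimal sketch of a WaitGroup-style group that runs each function in a goroutine locked to its own OS thread via runtime.LockOSThread. The API and names are my own guesses, not necessarily what the multi package actually exposes:

```go
package goro

import (
	"runtime"
	"sync"
)

// Group runs functions concurrently, each in its own goroutine that is
// locked to a dedicated OS thread for the duration of the call.
type Group struct {
	wg sync.WaitGroup
}

// Go starts fn in a new goroutine pinned to an OS thread.
func (g *Group) Go(fn func()) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		// Lock this goroutine to its current OS thread. Because we never
		// unlock, the runtime terminates the thread when the goroutine exits
		// instead of returning it to the thread pool.
		runtime.LockOSThread()
		fn()
	}()
}

// Wait blocks until every function started with Go has returned.
func (g *Group) Wait() {
	g.wg.Wait()
}
```

Usage mirrors sync.WaitGroup: create a Group, call Go for each task, then Wait. This is also the only one of the three approaches that stays entirely within what the Go runtime officially supports, which matches the post's observation that it is the lone production-safe variant.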

0 views