Posts in Performance (20 found)
The Coder Cafe 4 days ago

Metastable Failures in Distributed Systems

☕ Welcome to The Coder Cafe! Today, we explore one of the nastiest failure patterns in distributed systems: metastable failures. Based on the Metastable Failures in Distributed Systems whitepaper, we break down why these failures happen, why they persist, what we can do about them, and why our instinct to fix them is probably wrong. Get cozy, grab a coffee, and let’s begin! Stable, Vulnerable, Metastable Metastable failures borrow their name from physics, where metastable means something that looks stable but isn’t . To understand how a distributed system can end up in such a state, we need to look at three distinct states it can be in: Stable: The system recovers on its own after any disruption. This is what we call resilience in Resilient, Fault-tolerant, Robust, or Reliable . Vulnerable : The system looks perfectly healthy, but it's operating above its hidden capacity : the load level below which it can self-heal from any disruption. It responds fast, metrics are green, and nothing is alarming. Many production systems deliberately operate here because it's more efficient: resources are used closer to their limit. But there's no slack left . And the deeper the system operates in a vulnerable state, the smaller the trigger needed to push it over the edge. Indeed, a system just above its hidden capacity can survive large disruptions; a system near its advertised capacity can be tipped by almost anything. Metastable failure : A trigger (e.g., a network blip, a deployment, a traffic spike) pushes the system over its hidden capacity. The system is not fully broken: processes are alive, and it’s still running. But goodput collapses: it’s no longer doing any useful work. Technically up, effectively down . And unlike a regular outage, removing the trigger doesn’t fix it. Getting out requires a strong corrective push: a restart, a dramatic load reduction, a manual intervention. NOTE : If you’re not familiar with the concept of goodput, it’s the throughput of useful work completed successfully. For example, in a web application receiving 1000 requests per second but returning errors for 800 of them, the goodput is only 200 RPS. The three states of a metastable failure. A system can drift into the vulnerable state unnoticed, and a single trigger is enough to push it into the metastable state it cannot escape on its own. The most disorienting property of a metastable failure: stopping the trigger doesn’t stop the failure. To understand why, we need to talk about feedback loops. In a previous post on Systems Thinking Explained , we defined a feedback loop as: If causes , then influences . A feedback loop is exactly the mechanism that keeps a system stuck in the metastable state . There is always a sustaining effect, a feedback loop, that prevents recovery. The trigger is just what pushes the system over the edge. The loop is what keeps it there. Blaming the trigger is the natural instinct, and almost always the wrong diagnosis. Let’s discuss a concrete example to make this clear. Imagine a web application that queries a database. The database comfortably handles up to 300 QPS. The application retries any query that doesn’t respond within 1 second. The system is running at 280 QPS, healthy and fast, within the database’s capacity. Then, a transient network issue occurs for 10 seconds. When the issue is over, all the queued requests flood in at once. The database gets hit with a surge it can’t absorb: latency spikes and queries start timing out. So the application retries them. This doubles the effective load to 560 QPS. The database, already struggling, falls further behind. More timeouts. More retries. The loop is now self-sustaining: High load → Timeouts → Retries → Higher load → More timeouts → More retries The transient network issue was fixed minutes ago. Yet, the system is still completely broken. The trigger is gone; the feedback loop is not . The only way out is to dramatically cut the load or disable retries entirely. This is a metastable failure . The system was vulnerable because it was operating close to its hidden capacity . A minor, transient trigger pushed it over the edge and into a self-sustaining failure state it couldn’t escape on its own. The retry mechanism, a feature designed to improve reliability, became the very thing that prevented recovery. This is one example, but the same pattern appears with caches, connection pools, failover logic, and more. The shape is always the same: a feedback loop that turns a temporary problem into a permanent one . Two things make metastable failures particularly nasty. We can be tempted to blame the wrong thing . When an outage happens, the trigger is what’s visible and recent: a spike, a deployment, a hardware fault. It’s the obvious culprit. But the trigger only exposed the problem; it didn’t create it. The sustaining feedback loop was already there, structural and invisible. When analyzing the problem in retrospect, teams focus on the trigger; fixes address the trigger; and the system remains vulnerable to the next one. The authors of the paper observed teams declare a metastable failure “resolved” multiple times before realizing the real cause had never been touched. The feedback loop grows stronger with scale . Small-scale tests won’t reveal it. A staging environment running at 10% capacity may handle the same trigger without falling into a metastable state, because the loop isn’t strong enough at that scale to be self-sustaining. This means these failures can slip past even rigorous testing regimes and only manifest in production at full load. We defined hidden capacity earlier as the load level below which the system can self-heal from any disruption. It’s different, and always lower, than the advertised capacity. In our example, the numbers make it concrete: the advertised capacity is 300 QPS, but the hidden capacity is only 150 QPS, because retries double the load under failure. The gap between those two numbers is where vulnerability lives . Measuring the hidden capacity is not straightforward, though. One possible approach is to apply a trigger at a given load level and observe whether the system recovers on its own: If it does, we are below the hidden capacity. If it doesn’t, we are above it. We can also estimate it indirectly: in the retry example, retries double the load under failure, so the hidden capacity is roughly half the advertised capacity. Metastable failures are not bugs . We can’t write a unit test that catches them. They are emergent behaviors: properties that arise from the interaction of a system’s components under specific conditions, not logic errors in any individual component. No single piece of code is buggy, no single configuration is wrong. The failure is a consequence of how everything fits together under load. This changes how we need to think about them. The right question after an outage is not “ What failed? ” but “ What loop sustained it? ” And before an outage, the danger is not having bugs; it’s optimizing so aggressively for efficiency that we push the system deeper into the vulnerable state without realizing it . Retries, caches, failover logic, connection pools: these are all features that improve reliability in the common case. They are also, under the right conditions, the sustaining mechanisms of metastable failures. The same design decision that makes a system more resilient in normal operation can also prevent it from recovering when things go wrong. The paper describes several approaches to reduce the risk of metastable failures: Retry budgets and circuit breakers : Instead of retrying indefinitely, cap the total number of retries in flight at any given time. This directly weakens the feedback loop by limiting work amplification. LIFO scheduling under overload : Counterintuitively, switching from FIFO to LIFO when the system is overloaded allows some requests to complete within their deadline, preserving goodput instead of letting every request time out. NOTE : I already wrote a post about that approach in Adaptive LIFO . Fast error paths : Success paths are heavily optimized, but error paths often aren’t. An expensive error path (stack traces, DNS lookups, disk writes) under high failure rates can itself become a sustaining mechanism. Optimizing error paths reduces this risk. Read-through caches over look-aside caches : A read-through cache (where the cache itself fetches missing data from the database) can continue filling itself even when the application has given up on a request, steadily increasing the hit rate and helping the system recover. A look-aside cache (where the application is responsible for populating the cache) can’t. Production stress testing : Small-scale tests won’t reveal metastable failures. Testing against a portion of production traffic, with engineers ready to intervene, is the most reliable way to surface them. A note of humility from the paper: there is no systematic solution yet. These are ad-hoc mitigations developed in response to known failures. Detecting vulnerable states before they collapse remains an open problem. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. A distributed system can pass through three states: stable, vulnerable, and metastable. The vulnerable state looks healthy, but it isn’t. The threshold between stable and vulnerable is invisible. Systems can operate in the vulnerable state for months without any sign of trouble. When a trigger pushes a vulnerable system into a metastable failure, a feedback loop sustains the failure even after the trigger is gone. The trigger is not the root cause. The feedback loop is. Fixing the trigger leaves the system vulnerable to the next one. Reliability features like retries and caches can become the sustaining mechanism of a metastable failure under the right conditions. Metastable failures are emergent behaviors, not bugs. We can’t unit test for them, and optimizing for efficiency makes them more likely. Mitigations exist (retry budgets, circuit breakers, LIFO scheduling, fast error paths), but they are all ad-hoc responses to known failures. Detecting vulnerable states before they collapse remains an open problem. Resilient, Fault-tolerant, Robust, or Reliable? Adaptive LIFO Fail Open vs. Fail Closed Metastable Failures in Distributed Systems Metastability and Distributed Systems Stable, Vulnerable, Metastable Metastable failures borrow their name from physics, where metastable means something that looks stable but isn’t . To understand how a distributed system can end up in such a state, we need to look at three distinct states it can be in: Stable: The system recovers on its own after any disruption. This is what we call resilience in Resilient, Fault-tolerant, Robust, or Reliable . Vulnerable : The system looks perfectly healthy, but it's operating above its hidden capacity : the load level below which it can self-heal from any disruption. It responds fast, metrics are green, and nothing is alarming. Many production systems deliberately operate here because it's more efficient: resources are used closer to their limit. But there's no slack left . And the deeper the system operates in a vulnerable state, the smaller the trigger needed to push it over the edge. Indeed, a system just above its hidden capacity can survive large disruptions; a system near its advertised capacity can be tipped by almost anything. Metastable failure : A trigger (e.g., a network blip, a deployment, a traffic spike) pushes the system over its hidden capacity. The system is not fully broken: processes are alive, and it’s still running. But goodput collapses: it’s no longer doing any useful work. Technically up, effectively down . And unlike a regular outage, removing the trigger doesn’t fix it. Getting out requires a strong corrective push: a restart, a dramatic load reduction, a manual intervention. NOTE : If you’re not familiar with the concept of goodput, it’s the throughput of useful work completed successfully. For example, in a web application receiving 1000 requests per second but returning errors for 800 of them, the goodput is only 200 RPS. We can be tempted to blame the wrong thing . When an outage happens, the trigger is what’s visible and recent: a spike, a deployment, a hardware fault. It’s the obvious culprit. But the trigger only exposed the problem; it didn’t create it. The sustaining feedback loop was already there, structural and invisible. When analyzing the problem in retrospect, teams focus on the trigger; fixes address the trigger; and the system remains vulnerable to the next one. The authors of the paper observed teams declare a metastable failure “resolved” multiple times before realizing the real cause had never been touched. The feedback loop grows stronger with scale . Small-scale tests won’t reveal it. A staging environment running at 10% capacity may handle the same trigger without falling into a metastable state, because the loop isn’t strong enough at that scale to be self-sustaining. This means these failures can slip past even rigorous testing regimes and only manifest in production at full load. If it does, we are below the hidden capacity. If it doesn’t, we are above it. Retry budgets and circuit breakers : Instead of retrying indefinitely, cap the total number of retries in flight at any given time. This directly weakens the feedback loop by limiting work amplification. LIFO scheduling under overload : Counterintuitively, switching from FIFO to LIFO when the system is overloaded allows some requests to complete within their deadline, preserving goodput instead of letting every request time out. NOTE : I already wrote a post about that approach in Adaptive LIFO . Fast error paths : Success paths are heavily optimized, but error paths often aren’t. An expensive error path (stack traces, DNS lookups, disk writes) under high failure rates can itself become a sustaining mechanism. Optimizing error paths reduces this risk. Read-through caches over look-aside caches : A read-through cache (where the cache itself fetches missing data from the database) can continue filling itself even when the application has given up on a request, steadily increasing the hit rate and helping the system recover. A look-aside cache (where the application is responsible for populating the cache) can’t. Production stress testing : Small-scale tests won’t reveal metastable failures. Testing against a portion of production traffic, with engineers ready to intervene, is the most reliable way to surface them. A distributed system can pass through three states: stable, vulnerable, and metastable. The vulnerable state looks healthy, but it isn’t. The threshold between stable and vulnerable is invisible. Systems can operate in the vulnerable state for months without any sign of trouble. When a trigger pushes a vulnerable system into a metastable failure, a feedback loop sustains the failure even after the trigger is gone. The trigger is not the root cause. The feedback loop is. Fixing the trigger leaves the system vulnerable to the next one. Reliability features like retries and caches can become the sustaining mechanism of a metastable failure under the right conditions. Metastable failures are emergent behaviors, not bugs. We can’t unit test for them, and optimizing for efficiency makes them more likely. Mitigations exist (retry budgets, circuit breakers, LIFO scheduling, fast error paths), but they are all ad-hoc responses to known failures. Detecting vulnerable states before they collapse remains an open problem. Resilient, Fault-tolerant, Robust, or Reliable? Adaptive LIFO Fail Open vs. Fail Closed Metastable Failures in Distributed Systems Metastability and Distributed Systems

0 views

Kafka Share Groups and Parallelizing Consumption - Part 2: Producer Batches and share.acquire.mode

All tests were executed against Kafka 4.3.0 using Dimster .  In the last post we used simulated consumer processing time to reveal how important it is to set an appropriate value for to ensure the consumer parallelism that we expect. With a uniform distribution of messages over partitions, the rule of thumb was a value somewhat lower than: But there’s more to parallel consumption than . The size of producer batches also plays a role when using the default ( ). Share group members are assigned to partitions like consumer group members are, except that share group assignment allows multiple consumers to be assigned to the same partition. If the number of share consumers is less than the partition count, then each consumer will be assigned multiple partitions. If the consumer count matches or exceeds the partition count, then each consumer will be assigned one partition. Fig 1. Share consumer assignments. Left: consumer count < partition count. Right: consumer count > partition count. When a consumer is assigned only one partition, it will always be fetching from one broker. If a consumer is assigned multiple partitions, it may fetch from multiple brokers concurrently. There are two values for : The Javadoc says the following: The application chooses between the two modes using the consumer share.acquire.mode configuration property. If the application sets the property to batch_optimized or does not set it at all, the share consumer fetches records based on batch boundaries which may mean that the number of records returned may exceed the max.poll.records configuration property. The share consumer may also prefetch records and buffer them temporarily awaiting the application's next call to poll(Duration). If the application sets the property to record_limit, the share consumer fetches no more than records at a time and does not prefetch. This is slower but gives the application tighter control on how many records are fetched and when the acquisition locks begin. So why two modes?  It comes down to efficiency ( ) and consumer control ( ). First of all the sentence “ the share consumer fetches records based on batch boundaries” is correct but a little misleading. No matter what mode is used, whole batches are returned to the consumer over the wire . In other words, the data sent over the network is always based on batch boundaries as the record batch is the unit of data delivery.  What that sentence refers to is what records are acquired by the consumer and returned to the application: With , the config is a soft cap. The consumer acquires any batches (in their entirety) that are covered by the offset range determined by . These acquired batches are returned to the consumer, and the consumer returns the records of those batches to the calling application (that invoked ). With , the config is a strict cap. The consumer only acquires the records that are covered by the offset range determined by (though less if less records are available). However, the unit of data delivery is the record batch, so the consumer receives whole batches but only returns a specific offset range to the calling application. For example, in the figure below we have three consumers sending fetch requests with and . Fig 2. Three consumers fetching with batch_optimized Despite asking for only one record, each consumer acquires and receives records along batch boundaries. The result of consumer.poll(Duration) for c1 is three records, not one. If we rerun this scenario with record_limit: c1 acquires record 0 c2 acquires record 1 c3 acquires record 2 However, the batch is the unit of data delivery, so batch 1 is sent in its entirety to each consumer (the consumer internals only returns the acquired records of the batch to the application). Fig 3. Three consumers fetching with record_limit This is obviously less efficient… We just sent the same batch three times! Nonetheless, exists because sometimes that inefficiency over the wire is countered by other concerns (one of which is covered in this post). Another efficiency gain that has is that because each batch is only sent to one consumer, Kafka only needs to do share group housekeeping of the batch as a whole, not each record individually. This reduces CPU and makes metadata more compact. If we get mixed acknowledgments of the batch records (2 success, 1 reject) only then does the record tracking explode the metadata to be per-record. With , the housekeeping always tracks state per record, which is more expensive. The final difference between the modes is that in mode, a consumer can send concurrent fetches to all the brokers of its assigned partitions. This further increases the number of records that a consumer might receive as is a soft cap per broker. With , the consumer sends one fetch at a time, round-robin between the brokers of its partitions. This difference only manifests when the consumer count is less than the partition count. We’ll cover this aspect more in the next post. The main implications are that: With , the effective consumer parallelism can be impacted by the average number of records per record batch. With , the network throughput will increase in most scenarios as offset ranges are unlikely to align with batch boundaries. If the is larger than the average number of records per batch, then each batch may only be delivered twice. The network throughput can a lot if the is much smaller than the average number of records per record batch. Don’t worry if that isn’t clear yet, we’ll gather some empirical results next which should make it clearer. Let’s test this out with Dimster’s interactive mode, using the same workload as the last post. In the last post, we calculated that the maximum theoretical consumption rate for 300 consumers with a processing time of 5 ms per message would be 60,000 msg/s. By setting to 30 we reached 55,000 msg/s and then finally reached 60,000 with low end-to-end latency by adding an additional 12 consumers (2 per partition). So we use the following workload file (no dimensional stuff in this one as we’re going to use live-interaction): In this test we’re going to make the record batches bigger and see what happens to the consumption rate. First we start Dimster and ensure it’s handling the 60k msg/s. Once it has started and settled in, we see it’s coping well. If I look at the metrics, the current record batch size is around 5KB with 10 records per batch. The average fetch size is 7KB with 14 records. This means some consumers get 1 record batch per fetch and some get 2 record batches per fetch. Let’s increase the batch size. To do this we’ll drop to 1 producer, and set the linger.ms to 10 to reach the default batch.size of 16KB batches. We see that the batch size has risen to the default of 16KB, or 32 records per batch. The consumers should now, on average, receive 32 records per fetch (2 above the max.poll.records). Fig 4. The record batches sent by the producers increase from 5.5 KB to 16 KB The coordinator output shows that the consumers are still coping, as expected. With 500b records, the number of records returned per fetch will be 32 which is close enough to the max.poll.records of 30 to not impact consumption. Now let’s double batch.size to 32786. From a separate terminal window to the coordinator output, we’ll run the following: We see the batch size increase again in the dashboard. Fig 5. The record batches sent by the producers increase from 5.5 KB to 16 KB to 32 KB The coordinator output shows that the consumers are no longer keeping up! Only managing 37K msg/s with a fast growing backlog. The problem is that each partition has an inflight budget of 2000 records and each record batch contains 64 records. That allows up to 31 effective consumers per partition (2000 / 64), leaving 21 consumers starved at any point in time. This explains the 37K msgs/s: We can fix this problem in three ways: Set in the producer. Increase to create a larger inflight budget We already know the default 16KB batch size is ok. Let’s first increase the inflight budget. We’ll double the budget and see what happens. First we’ll stop the producers and remove the processing time on the consumers to drain the backlog. Next we need to update the broker config and restart the brokers. In we add: Then we’ll redeploy Kafka (again from a separate terminal window). Now we’ll start the producers again and apply the 5 ms processing time to the consumers. We’re in business! The consumers are now coping with the larger batch sizes with this increased inflight budget. This time we’ll try . First let’s walk back that inflight budget change by  1) stopping the producers, 2) commenting out the added line to our broker config, 3) redeploying Kafka. While the producers are still stopped, I’ll change the consumers to use : Then start the producers again: In the coordinator, we see that the consumers are now coping with the 60K msg/s. The reason that allows the consumers to keep up, despite the larger record batches, is that each consumer is only allocated a max of 30 records per fetch, even though each batch contains 64 records. However, each batch is now being delivered three times as 30 doesn’t align well with 64. We can see this in the Kafka client metrics. Fig 6. On the left, with the larger inflight budget and batch_optimized. The middle was when we stopped the producers to restart Kafka with the original inflight budget. The right is with record_limit and each batch being sent three times. We could make this more efficient if we increase to 32 to align with the 64 record batches. If I simply change the to 32, we don’t see much of an improvement as most offset ranges of 32 records will touch two batches. But if we stop the producers, ensure there is no backlog at all then set , the fetches will be perfectly aligned. Fig 6. On the left, with unaligned fetches with max.poll.records=32 (each batch delivered 3 times). Right: aligned fetches with max.poll.records=32 (each batch delivered 2 times). Let’s not over-index on this one case. The purpose of this post was to explain the underlying mechanics and back that up with some empirical benchmarks, sticking with the same workload example as the last post. What we’ve learned: Consumer parallelism is impacted by more than just consumer count and . It is also impacted by: Record batch sizes (determined by the producers) The inflight budget ( ) The share consumer config Record acquisition is along batch boundaries with , and record ranges with . Record batches are the unit of delivery, so can cause consumer network bandwidth to increase because fetches likely will not align on batch boundaries causing batches to be delivered at least twice (more if is much smaller than the average number of records per batch). In the next post we’re going to look a bit closer at . ps: you can run this whole scenario with two terminal windows: Window 1 - kick off the benchmark (using the workload yaml described in the post) Window 2 - wait a few minutes then run the following bash script: Happy testing! If the application sets the property to batch_optimized or does not set it at all, the share consumer fetches records based on batch boundaries which may mean that the number of records returned may exceed the max.poll.records configuration property. The share consumer may also prefetch records and buffer them temporarily awaiting the application's next call to poll(Duration). If the application sets the property to record_limit, the share consumer fetches no more than records at a time and does not prefetch. This is slower but gives the application tighter control on how many records are fetched and when the acquisition locks begin. With , the config is a soft cap. The consumer acquires any batches (in their entirety) that are covered by the offset range determined by . These acquired batches are returned to the consumer, and the consumer returns the records of those batches to the calling application (that invoked ). With , the config is a strict cap. The consumer only acquires the records that are covered by the offset range determined by (though less if less records are available). However, the unit of data delivery is the record batch, so the consumer receives whole batches but only returns a specific offset range to the calling application. c1 acquires record 0 c2 acquires record 1 c3 acquires record 2 With , the effective consumer parallelism can be impacted by the average number of records per record batch. With , the network throughput will increase in most scenarios as offset ranges are unlikely to align with batch boundaries. If the is larger than the average number of records per batch, then each batch may only be delivered twice. The network throughput can a lot if the is much smaller than the average number of records per record batch. Don’t worry if that isn’t clear yet, we’ll gather some empirical results next which should make it clearer. Set in the producer. Increase to create a larger inflight budget Consumer parallelism is impacted by more than just consumer count and . It is also impacted by: Record batch sizes (determined by the producers) The inflight budget ( ) The share consumer config Record acquisition is along batch boundaries with , and record ranges with . Record batches are the unit of delivery, so can cause consumer network bandwidth to increase because fetches likely will not align on batch boundaries causing batches to be delivered at least twice (more if is much smaller than the average number of records per batch).

0 views
Jeff Geerling 4 days ago

I patched iozone for better disk benchmarks on modern macOS

A decade ago, I settled on for disk benchmarking on all my systems. Tools like ('Flexible IO' tester) are a little more capable for raw disk performance testing, and other tools test network-scale filesystems better, but gives me an easy overview of real-world disk performance across hard drives and SSDs, and runs on Mac, Windows, and Linux (and a smattering of other OSes). It's been around since 1991 , and is still updated today—in fact, the two latest updates (version 509 and 510) contain patches I sent in to get iozone to compile on Apple Silicon Macs running newer releases of macOS.

0 views
Josh Comeau 5 days ago

CSS vs. JavaScript

There are a bunch of JavaScript animation libraries out there, and you might have wondered whether there’s a performance cost compared to traditional CSS transitions and keyframe animations. In this blog post, we’ll compare the same animation across several different strategies and see the differences firsthand. There’s some interesting nuance here!

0 views

Pipeline Parallel Decompression

This isn’t a paper summary, but rather a description of a hobby experiment I’ve been hacking on ("research quality" code). This quote (attributed to either Anonymous or David Clark ) originally referred to networking, but applies to parallel programming as well: There is an old network saying: Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed—you can’t bribe God. Standard "cured with money" parallelization techniques (e.g., shared-nothing architectures, data parallelism) try to minimize cross-core communication. These hammers are great for hitting nails labeled: "improve throughput by throwing more cores at the problem”. Not everything is a nail. Important problems which cannot be solved with this kind of approach include: Parallel network packet processing in cases where load balancing schemes like RSS do not apply Parallel transaction processing when there is high contention between transactions Parallel encryption of a single stream of data Pipeline parallelism has the potential to provide "bribing God” solutions to some of these problems. A potential additional benefit that pipeline parallelism brings to the table is better usage of CPU caches because of a smaller working set. For example, if 8 cores cooperate to process 1 input file, the working set (input data, output data, intermediate data structures) is potentially 8 times smaller than the case where each core processes a separate input file. This caching advantage also applies to instruction caches, as pipeline parallelism distributes the computational steps of an algorithm across cores. Pipeline parallelism has some major drawbacks: Fine-grain synchronization/communication Load imbalance The purpose of this experiment is to put some numbers on the costs and benefits in a real-world application ( DEFLATE decompression). DEFLATE decompression is hard to parallelize because of two tight feedback loops: The position of encoded token in the input stream is not known until token is decoded (because input data is encoded with a variable length code). The output generated by a match (i.e., length & distance tuple) cannot be computed until some amount of previous output has been generated (because a match references previously generated output) A Negative Nancy might view these as problems, but a Positive Pipeliner views them as a guide for how to decompose the algorithm into pipeline stages. The general technique is to dedicate a pipeline stage to each of these feedback loops and whittle them down to be as tight as possible. The design I’ve landed on has three pipeline stages: , , and . The stage computes the length of each encoded token. It simply reads the next 13 bits from the input stream and uses them as an index into a lookup table. The inner loop looks like this: Note that in contrast to non-pipelined implementations, the only thing this code (and the lookup table) are concerned with is finding the length of each token, everything else is dealt with in another pipeline stage. Each iteration of this loop runs in about 8 clock cycles, and the lookup table fits in the L1 cache. The CPU cannot run multiple iterations of this loop in parallel due to the tight dependency chain. The input to the lookup stage is the encoded bits associated with each input token ( in the code above). These bits are used to perform another lookup (in a larger lookup table, stored in the L2 cache) which results in much more information about each token. Optimizing this stage is easy, because it doesn’t contain any tight feedback loops. The CPU can process multiple loop iterations in parallel, which enables it to hide the latency of accessing the L2. If necessary, it would be easy to split this pipeline stage into two. The inner loop looks like this: The structure contains metadata about the input token (literal value and/or information about a match). This data structure does not contain the exact distance associated with the match, the variables named deal with that detail from the DEFLATE spec. The stage writes literals and matches to the output buffer. This code leans on the CPU store-to-load forwarding hardware to deal with match operations which must read data that was recently produced. Each iteration of the inner loop performs a word-sized write of literal data, plus a 32B read and write to read and write match data. Actual store-to-load forwarding is rare, as most match distances are large. The Silesia Corpus contains commonly used files to benchmark compression algorithms. has English text with short matches whereas contains data dumps with longer matches. is an optimized library which can decompress roughly 2-3x faster than the standard . The following chart shows baseline performance on in a shared-nothing architecture where each CPU core decompresses a separate input file. There is one data point for each core count (1, 2, …, 8). As you would expect, throwing more cores at the problem improves throughput, at the cost of slight latency increase. If you want a more interesting tradeoff of throughput vs. latency, you have to bribe God. For example, say you are writing a decompression application. If the user requests a bulk decompression of 100 files, then the optimal choice may assign each file to a CPU core. But if the user requests to decompress a single file, then you would prefer to decompress using multiple CPU cores. And here is the same chart with the 3-stage pipeline implementation added in orange (compare it to the third blue dot from the left for a 3-core vs 3-core comparison): For a 37% cost in throughput, you get a 2x reduction in latency. Here is the chart for , which shows a similar story. Data-parallel throughput saturates at 6 cores. Pipeline parallelism allows a 2.6x latency reduction at the cost of 14% throughput. Dangling Pointers I think there is room for language/runtime support to improve performance of pipeline parallel algorithms on multicore CPUs (by reducing load imbalance). is bound by the chase stage, whereas is bound by the output stage. The programmer could supply multiple implementations of the pipeline (with some compiler help to reduce code duplication), and the runtime could dynamically switch between them depending on which stage is the bottleneck. High level synthesis tools are capable of automatic pipelining. Such techniques could be used to automatically generate many pipeline implementations for the runtime to choose between. The description above leaves out a few implementation details regarding the lookup tables. Because the lookup table data is spread across two cores (i.e., pipeline stages), there is enough room to store data for 2 Huffman tokens (2 literals, or a full match). This provides a large speedup compared to traditional implementations that store all data in the caches of a single core. Because the stage is throughput bound rather than latency bound, it can afford to access the lookup table via a layer of indirection. The 13 input bits are used to lookup a index, and that index is used to access the final data in another lookup table. The second lookup table has fewer entries, but each entry is larger. This reduces the total working set. This design leans heavily on CPU branch prediction. The code snippets shown earlier are for the common cases, with branches used to implement uncommon cases (e.g., a single encoded token that is wider than 13 bits). As long as those cases are rare, branch prediction does a great job of keeping the inner loops humming. An interesting puzzle arose during this experiment. I found that performance could swing widely (~10%) based on where the operating system located stacks of the various threads. The stack address would change from run to run because of ASLR . A little to offset the stack by a small amount would resolve this issue. It seems to be an important consideration when trying to maximize usage of the L1 cache. Subscribe now Parallel network packet processing in cases where load balancing schemes like RSS do not apply Parallel transaction processing when there is high contention between transactions Parallel encryption of a single stream of data Fine-grain synchronization/communication Load imbalance The position of encoded token in the input stream is not known until token is decoded (because input data is encoded with a variable length code). The output generated by a match (i.e., length & distance tuple) cannot be computed until some amount of previous output has been generated (because a match references previously generated output)

0 views

Kafka Share Groups and Parallelizing Consumption — Part 1: Tuning max.poll.records

All tests were executed against Kafka 4.2.0 using Dimster (and also validated against 4.3.0).  In the last post we measured the overhead that the mechanics of share groups adds, and saw that it is pretty small. Likewise we saw that raw throughput was also comparable to consumer groups and even saw it exceed consumer group throughput on one test. In this post we’re going to simulate processing time in the consumers to make these benchmarks more realistic and show the utility of share groups (namely the ability to parallelize processing beyond the partition count). We’ll see how the following two configurations play an important role in parallelizing consumption with share groups: max.poll.records (consumer config) group.share.partition.max.record.locks (broker-side config) If we know the average processing time and the number of consumers, we can calculate the theoretical max throughput of a topic: For example: If we have 100 consumers, with an average processing time of 5 ms, then the maximum throughput will be 20,000 message/s. If we have a topic which peaks at 60K message/s and our average processing time is 5 ms. We’ll need 300 consumers to handle that. If we use consumer groups we’ll also need 300 partitions.  > Of course we could do some fancy concurrent work in the consumer to parallelize the consumer work but that comes with some downsides, principally that consumer groups track a position in the log which doesn’t map well to concurrently processing multiple positions in the log simultaneously, should the consumer encounter problems, abruptly terminate or get reassigned partitions. The ParallelConsumer does some clever tricks to handle this. With a share group we don’t need 300 partitions, we could have a handful of partitions with a group of 300 consumers and it should handle the load.  Let’s test it out with share group config defaults . Like last time, I’m going to ensure that load is even across the partitions so that load skew doesn’t pollute the results (I’ll be looking at load skew in a future post). See the last post for how I did that. We’ll use Dimster’s live interaction feature to model the workload on the fly, to see the impact of changing configurations and consumer counts. Fig 1. Dimster supports mutating the running workload. The max theoretical throughput of 300 consumers with 5 ms processing time is 60K msg/s so we’ll start with see what happens. Remember, Dimster uses named environments where commands take the format: , my environment is called localBeefy (detailed in the last post). Note: I prepend workload files with to so they don’t appear as untracked files in Git. The coordinator output shows: Straight away we see that consumption is really low at 4800 msg/s, nowhere near 60K msg/s. Let’s stop the producers and set processing time to 0 ms to allow the consumers to catch up so we can try again. From a separate terminal window: Viewing the coordinator output in the other terminal window, we wait until the backlog is drained, then slowly ramp up the producer rate to 50K msg/s. Then from the live interaction terminal: The coordinator output shows the consumers are now managing 50K msg/s. Much better than 4800 msg/s: However, it soon degrades, dropping to 20K msg/s and eventually back down to 4800 msg/s. Fig 2. The ramp up to 50k msg/s, followed by a reduction back down to 4800 msg/s What on earth is going on? The metrics tell a story. The max poll size (the number of records the consumer is returned after calling poll) is 500. Most calls to poll return few records, with p50 at 8. But the max tells us some return 500. Fig 3. The number of records returned by consumer.poll() The Kafka client metrics show that the average records per fetch response batch climbs toward 450. Fig 4. The records per fetch start low and grow larger and larger What we’re seeing is that most consumers aren’t getting very many records, but a tiny number are getting a lot. When the throughput was low and growing, the records per fetch were low (up to 10). But then the average records per batch started creeping up (while the consumption kept dropping) until the average fetch size was around 450 records. The average is high despite only few being full because most fetch requests sit idle until they can be serviced (by default up to 500 ms). Fig 5. The average fetch latency creeps up and almost reaches the default fetch.max.wait.ms of 500. It’s clear that the default of 500 is at play here. There is an interesting phenomenon here: at low producer rates, the broker does not have enough available records to fill each consumer’s , so each poll tends to return a small batch. Since many consumers are polling, processing, and acknowledging at roughly the same cadence, the available records get spread across the group. The result is an accidental fair-sharing regime : lots of consumers are active, each processing small batches, and aggregate throughput can approach the theoretical maximum. But this regime is fragile. It is not guaranteed by the broker-side allocation policy. Once enough records are available to fill large polls, the greedy allocation behavior takes over. A small number of consumers can acquire large batches, occupying the partition’s inflight record budget while the rest of the consumers sit idle. We’ll call this the greedy-capture regime , as a few consumers greedily capture the inflight window. This regime works as follows. The broker config determines how many records can be locked/inflight per partition, and defaults to 2000. With the default of 500, a single consumer can acquire 25% of that budget. At 5 ms per record, that one batch takes 2.5 seconds to process. While those records are locked, other fetches may sit idle even as new records arrive. This creates a feedback loop: larger batches consume more of the inflight budget, queued fetches wait longer, lag builds, and future fetches are more likely to be filled with large batches. Eventually the group collapses into the greedy-capture regime. We see this in the behavior above . With an inflight budget of 2000 messages and large fetches of 500 records, we only have 4 effective consumers per partition at a time. Across 6 partitions we only have 24 effective consumers each able to process 200 messages a second, resulting in an aggregate 4800 msg/s (exactly the number we’ve seen). To test whether the fair-sharing state was actually stable, I tried ramping only to 30K msg/s and holding it there. I left it for ten minutes and it remained stable. Then I restarted the consumers. Sure enough, throughput dropped back down to 4800 msg/s again. Fig 6. Fair-sharing regime collapses after a consumer restart Why go on about this accidental fair-sharing? Because the system can appear healthy under a slowly changing throughput and a moderate load because it has entered accidental fair-sharing, despite a bad choice of max.poll.records . I imagine this could trip some people up. Consumption may look fine for a long time, but suddenly degrade causing some head scratching and stress! The solution here is simple: reduce max.poll.records .  In theory we should carve up the inflight window between all consumers. So let’s take the configured and divide it by the number of consumers per partition. In our case, we are using the default of 2000. With 50 consumers per partition, we should set to . First, let’s drain the backlog. Once the backlog is drained, let’s set . This causes all 300 consumers to restart with the new config. Now we’ll attempt 60K msg/s with 5 ms processing time abruptly, no ramp up. The coordinator output shows: We’re close, about 55K msg/s consumption, however this soon drops to 45K and remains stable there. It seems that 40 was still too high as it did not account for all the overhead of the fetch/response time, the timing of commits, etc. After dropping to 30 and finally we hit 60K msg/s consumption! But the coordinator output shows that end-to-end latency is growing, little by little, it still isn’t quite keeping up. Let’s add 2 more consumers per partition (300 -> 312 consumers). The coordinator output now shows that end-to-end latency has dropped and stabilized. At this point, the benchmark has a few minutes left. We can discard all the cumulative latency histograms so we record the last minutes with this stable configuration. The final e2e latency distribution for 10 or so minutes with 312 consumers and is: Fig 7. End-to-end latency distribution of 60K msg/s, 5 ms processing time and 312 consumers Rules of thumb:  Set by taking and dividing it by the number of consumers per partition . Then set it somewhat lower to leave room for timing variance, uneven fetch timing, partition skew, and transient backlog. If you have very long processing time (over 1 second) you can even drop max.poll.records to 1 as the cost of a fetch is dwarfed by the processing time. You can also try increasing the group.share.partition.max.record.locks ( max of 10000) which will allow for a larger inflight budget and be more forgiving of a suboptimal max.poll.records. Now armed with a good rule-of-thumb, we’ll run two scenarios with Dimster’s explore limits mode , a benchmark mode for finding the highest sustainable throughput (see the last post for how it works): Fig 8. All test points achieved 57,000 msg/s while staying under p75, 100 ms end-to-end latency. All 5 workloads achieved 57K msg/s, just short of the max theoretical throughput (likely due to the latency constraint of explore mode). Adding some more consumers would be enough to reach 60K msg/s. Next, with 1 ms processing time. Fig 9. Share groups with 12+ partitions reached 95% of the theoretical max consumption throughput. Share groups with 12, 30 and 60 partitions did best, reaching 95% of the max theoretical throughput. The reason 6 partitions fared a little worse is likely due to contention over the inflight window (6 * 2000 records). The higher partition tests had a larger window across the same number of consumers. I expect the consumer groups could have gotten higher throughput, just not within the latency target of the test (100 ms, p75, based on the worst partition). First of all, I hope you see how useful live interaction using Dimster is! You can mutate a live workload to explore the impact of changing client configurations, producer rate, the number of producers and consumers, all on the fly. Once you have a topology you want to record stats for, clear the stats, set a new running time and get all the usual Dimster results. You can download the results from this blog post: tarball for the interactive session tarball of the explore limits run Regarding Kafka, the important lesson is that share groups change the parallelism bottleneck. With consumer groups, it’s the partition count. With share groups, it’s easy to think it simply comes down to the number of consumers, but it’s a little more complicated than that. Parallelism is determined by the inflight budget and the size of fetch requests .  Setting carefully might seem obvious, but I think it could trip people up for a few reasons: Defaults can come into play easily, especially for people without a lot of Kafka experience. The greedy behavior is not necessarily obvious (especially in terms of message queues in general). Synthetic benchmarks with 0 processing time will miss this (4 consumers per partition can handle whatever you throw at the partition). Only once you add processing time to a benchmark does the relationship between the inflight budget and fetch size become apparent. This greedy algorithm makes a very important configuration for share groups and the default of 500 is arguably the wrong value for share groups. It would be nice for a future version of Kafka to offer an alternative which enforces fair-sharing broker-side. I’ve posted this sentiment to the dev mailing list. Next : isn’t the only config that determines the size of consumer fetches ! In the next post we’ll look at the role of the following (in terms of how they can affect consumer fetch sizes and therefore the parallelism of share groups): Producer batch sizes. Share group consumer config :  (default, used in this post) max.poll.records (consumer config) group.share.partition.max.record.locks (broker-side config) If we have 100 consumers, with an average processing time of 5 ms, then the maximum throughput will be 20,000 message/s. If we have a topic which peaks at 60K message/s and our average processing time is 5 ms. We’ll need 300 consumers to handle that. If we use consumer groups we’ll also need 300 partitions.  Set by taking and dividing it by the number of consumers per partition . Then set it somewhat lower to leave room for timing variance, uneven fetch timing, partition skew, and transient backlog. If you have very long processing time (over 1 second) you can even drop max.poll.records to 1 as the cost of a fetch is dwarfed by the processing time. You can also try increasing the group.share.partition.max.record.locks ( max of 10000) which will allow for a larger inflight budget and be more forgiving of a suboptimal max.poll.records. tarball for the interactive session tarball of the explore limits run Defaults can come into play easily, especially for people without a lot of Kafka experience. The greedy behavior is not necessarily obvious (especially in terms of message queues in general). Synthetic benchmarks with 0 processing time will miss this (4 consumers per partition can handle whatever you throw at the partition). Only once you add processing time to a benchmark does the relationship between the inflight budget and fetch size become apparent. Producer batch sizes. Share group consumer config :  (default, used in this post)

0 views
Farid Zakaria 1 weeks ago

Leaving performance on the table

I have been working with LLVM at , and I have gotten to become familiar with the benefits of optimizing your workloads. I tend to think of optimizing my binaries as thinking about whether I have attached to my compiler flags; maybe if I’m particularly advanced that day I’ll sprinkle in some (link time optimziation) and call it a day. Turns out though that’s leaving lots of performance on the table. Compilers work under the assumption that every branch is is equally taken, unless you are hints like ( ref ). If we can feed the compilers more information about the likely path that our workloads often take, then they can produce much more performant code. There are two primary ways to optimize a binary: instrumented or statistical. When we instrument our binary, we run our workload with an instrumented binary and capture the exact paths that are executed. We will then optimize the binary perfectly tuned to that workload. If our workloads however are varied, we can collect profiles via over a length of time and create an optimized binary based on the statistical occurence of call graphs. Both approaches have their benefits however let’s start with the instrumented variant first, as it’s a little easier to follow and understand. Let’s look at a very simple benchmark. We will calculate fibonocci using SQL in sqlite3 . This is an ideal workload because it’s purely CPU-bound and ripe for optimizing. We will compile from source by downloading it. We can compile a “traditional” optimized binary that merely has and also a version that has LTO enabled since I was also keen to see how much LTO itself adds. Ok, so it looks like our program takes roughly 14-15 seconds to run. Sounds ok? How much better can we do…. 🤔 Next, we compile our program again but we instrument the binary , which effectively injects counters into the program to count invocations of functions. We get very accurate counts of our calls but the binary itself now runs much slower, which can be a problem if your workload was already very slow. Luckily for us, we are in a time domain (~15 seconds), where that is ok. After we have our instrumented binary, we run our workload again to generate the profile data and rebuild the binary with that data. The last step will be to optimize with BOLT, which is a post-link optimizer, which requires us to keep relocations so I’ve also added . When we run our workload with the final optimized binary, we see massive improvement already! 🤯 We’ve cut our workload time down to ~10 seconds which is a nearly a 1.5x improvement. Now let’s optimize the final binary with LLVM’s BOLT . BOLT is a post-link optimizer designed for “large applications”. What this means, is that it largely works by shuffling code around the binary to keep code-paths that have high temporal locality near each other (spatial locality). This can have positive impact on performance due to the instruction cache for instance. Looks like it was a little faster but not much. That makes sense since itself is a pretty small binary (~6MB), but nontheless was good to run through. Running a more thorough benchmark with we can get a final tally of our results. Looks like the I got from the Fedora ecosystem was the slowest . When all the optimizations were applied I was able to get a maximum of 1.38x faster than what was available. These optimizations would be even more dramatic for code-bases that are a sprawl and can heavily vary. Don’t worry also about getting the profile perfectly tuned to your workloads. I have a coworker who often cites that even poor profiles are still much better than no profile at all.

0 views
flowtwo.io 1 weeks ago

Othello World

I was introduced to the board game Othello (also known as Reversi) on a recent trip to Japan. It's one of those games where you can learn the rules in 5 minutes, but the gameplay dynamics are surprisingly deep. When I saw it's played on an 8x8 board, like chess is, I immediately started thinking about how to program a game engine for it. The 8x8 board is helpful because it allows you to represent the board state with 64-bit longs; each set bit in the number indicates the presence of a piece on that square. When you perform a bitwise operation on these numbers you're essentially computing multiple piece movements in parallel with a single CPU instruction. This computational efficiency enables deep searching of the move tree. I purposely started out without reading too much about game strategies because I wanted to explore it through coding the engine logic. It didn't take long to create an algorithm that is significantly stronger than me. Although it's not a high bar. There's a demo available here if you're interested in playing it. The basic building blocks of the game engine are as follows: Once you have these four elements built and wired together, you have a functional game engine to play against. The first two pieces are fairly straightforward—the real strength of an engine comes from how the last two are implemented. Like I mentioned above, we can represent the complete board state with just two 64-bit numbers. One number represents the black piece positions and the other for the white pieces. How you encode the 64 squares to the 64 bits is arbitrary, but I chose to represent each row as one byte (8 bits) and from left to right, top to bottom in terms of bit significance. In other words: And that's all that's needed to represent the piece positions. I created an immutable data class to encapsulate this: In Othello, if one player has no legal moves at any point in time, they skip their turn and the other player gets to go again. If both players have no legal moves, the game ends. Instead of computing both player's legal moves every time to check for those situations, I created a enum so that information somewhat pre-computed. The combination of and provides everything needed to determine the state of the game for the other stages in the engine. This is where things get tricky. Move generation requires codifying the rules of Othello in such a way that, given a board state, all the legal moves for either player can be computed—quickly, ideally. In Othello, you can only place a piece somewhere that will "sandwich" the other player's piece(s) between the piece you're placing and another "anchor" piece of yours. There can't be any blank spaces either. This rule applies to any of the 8 directions of the board (diagonals count). This screenshot illustrates the valid moves for black in this position: This function will calculate all the eligible squares for a single direction of movement (up, down, up-left etc.). What's cool is that it calculates eligible squares for all 8 rows/columns/diagonals at the same time. It's invoked as follows. For each of the 8 directions, you pass in a movement function and an ineligible square bitmask if required for that direction. For example, if shifting towards the left, you need to mask out the pieces on the leftmost column to prevent wrapping to the other side of the board (similarly for moving right). Moving up or down doesn't require a mask because shifting the bits "up" or "down" enough will just drop them from the number entirely. The function will return all valid moves for a given position for the "moving" pieces (the 1st argument). The moves are returned as a where each set bit is a valid square to place a piece. This part was interesting to me as I don't know much about strategy in Othello besides that the corners are important. The corners are important because once you claim a corner it can't be unflipped by the other player. Also, simply maximizing for the most pieces isn't the best strategy either, apparently. I do have a "greedy" algorithm that you can select in the demo app if you want to see that strategy in action. But of course, closer to the end of the game, having more pieces is more important since that's how the winner is determined. I represented this in the eval function by linearly shifting the weighting towards piece score as you get closer to the end of the game. I have two piece scores actually. The is a step function that only returns 1 or -1 depending on which piece colour has more pieces. But in the heuristic evaluation, I look at the actual piece differential score which returns between -100% and +100% depending on what "percentage" of the overall possible pieces the leading player has. That score is given 40% weighting in the heuristic evaluation function, the other 60% is a positional score based on the following square values I came up with: This was my best guess at which squares matter most. My reasoning is that the more central the square is, the more likely it is to be flipped. The closer to the edge it is, the less likely it is to be flipped and the more likely it is to be used as an anchor piece. So putting this all together, the heuristic evaluation is computed as follows: And that's it. The top-level function provides a relative score between -1.0 and +1.0 which represents the strength of a given position, relative to black. Since Othello is a zero-sum game, a good score for one player is an equivalently bad score for the other player. This is important in the next phase, the move search algorithm. This part of the engine is fairly "textbook". There's lots of explanation for how these algorithms work on wikipedia and chessprogramming.org is an incredible knowledge base for this sort of thing too. For zero-sum games, you can use a variant of minimax search called Negamax . That's what's shown here: For Othello specifically, the Negamax function needs to handle the case that the moving player has no legal moves and must pass to the opposing player. This is in the branch in the middle. We check if we're already in a position where the previous player had to pass, which means both players can't move and the game would be over in this branch. If not, we simply call again with the SAME and reverse the score returned from that call. With those 4 components built, I now had a functional engine to play against. I created an class that accepts a move selection algorithm. It exposes 3 methods: - for showing valid player moves in the UI - which validates and then applies a specified player move - which chooses and applies the best move using the I exposed the via a stateless REST API. Each request needs to supply the current game state information in order to make a move. For example: For the demo , it uses HTMX instead to return a rendered board component. The request format is the same but it returns HTML instead of JSON. I read this article recently that took a contrarian view on agentic coding and it's pitfalls. The author makes a lot of good points and it was thought-provoking. While I don't agree that using agentic coding will make you dumber per se ... I do think there's something to be said for regularly exercising the critical thinking and problem solving part of your brain if you want to be a good software engineer. Side projects like this are a great opportunity to do that. The incredible rise in coding competency for AI agents over the last 12 months has made a project like this into a one-shot, one prompt task for a recent LLM. I obviously didn't do that, because the point of this project was the act of doing it, not the end result. I learned a bit about Othello and refreshed myself on bitwise operations. The parts I wasn't interested in doing, the UI and the API wiring, I delegated to an agent to implement for me. To me, that's one of the best parts about coding with AI. I can now offload the tasks I'm not interested in or that's not as critical, and focus on the parts of the system I want to work on. It's never been easier to build and bring ideas to life with software. Board representation Move generation Position evaluation Game tree search

0 views
Jack Vanlightly 1 weeks ago

Benchmarking Apache Kafka Consumer Groups vs Share Groups (overhead test)

In my last blog post I introduced Dimster (DIMensional teSTER), a performance benchmarking tool for Apache Kafka with a specific set of philosophies. In this first share group benchmarking post, we’re going to use share groups as they are not intended to be used, but for a good reason. Share groups allow you to move past partitions as the unit of parallelism by allowing multiple consumers to read from the same partition, using message queue semantics. We’ll run those kinds of tests in the next post. In this post I just want to understand if the mechanics of how share groups work add any additional overhead compared to consumer groups. So we’ll use share groups as if they were consumer groups (by capping consumer count to partition count). Objective : Use synthetic tests to measure the overhead of share groups compared to consumer groups in identical conditions. How : Like-for-like tests which use an identical workload/topology using consumerType (CONSUMER_GROUP|SHARE_GROUP) as a dimension. Given identical producer/consumer counts, producer rate, topic/partition counts, do share groups scale as well as consumer groups? Do they add any latency overhead? These benchmarks are educational , they are not hard numbers, they are not some kind of canonical result (in fact, no such benchmark exists). And again, this is not a realistic test at all, they only serve to understand share group overhead. I ran all these benchmarks on a k3d Kubernetes cluster on my Threadripper 9980X: 64 cores (128 threads) 256 GB DDR5 memory Two Samsung 9100 PRO 8 TB (with one dedicated to the benchmarks) Pretty decent CPU and RAM cooling.  This is not a production setup, but the hardware is more than capable of handling a small to medium sized Kafka cluster with excellent performance. The SSD can sustain around 1.7 GB/s once the SLC cache has filled up and none of these benchmarks exceed that in aggregate across the 3 brokers. All tests were run with TLS between the clients and brokers and between each broker. I prefer to run benchmarks with TLS enabled (though it reduces the numbers) because most people (hopefully?) run Kafka with full TLS.  Dimster uses named environments located in the dimster-config.yaml . Each environment targets a specific k8s cluster (via kubectl context), specifies the Kafka and client versions, sizes the Kafka pods, determines heap sizes, broker and log config files etc, all in one yaml block. This environment uses 36 of 128 CPU threads (16 of 64 cores) and 72 GB of 256 GB of RAM of my workstation, so we’re not pushing the Threadripper too hard. Note, the ‘requests’ field block is applied to both k8s requests and limits. The client pod is over-provisioned with 12 CPU cores (24 threads) and 24 GB RAM to avoid any client bottlenecks causing spurious results. The tests in this post compare consumer groups with share groups. To do that, I tried to isolate other factors as much as possible. Random load skew is one such important factor.  In these tests, I ensured that load was as even as possible over the brokers: Message distribution over the partitions of a given topic was even. I used the Dimster message distributor PINNED_PARTITIONS which ensures the number of producers is divisible by the number of brokers and pins each producer to a set of partitions, and each producer round-robin sends to its partitions directly. Multi-topic tests used a topic count divisible by the number of brokers to ensure even distribution of leaders over brokers. Consumer counts per group were divisible by the number of brokers to ensure even distribution of partitions over consumers. Fig 1. Dimster’s partition pinning for even load distribution This is not like in real-life, but for this post I want to avoid the randomness involved with partition and broker skew so that we can compare consumer group vs share group performance without load skew randomness playing a role. I’ll be writing about and running benchmarks with partition and broker skew in a future post. Link to results as a tarball For the throughput benchmarks, I used Dimster’s explore mode, which probes the cluster to find the highest sustainable throughput while staying under a target end-to-end latency in ms and percentile (50 ms, p75 in this case). It measures e2e latency per-partition and uses the latency of the poorest performing partition as the yardstick.  Explore mode runs in phases: Ramp . Start with a low throughput and keep doubling the throughput after a configured interval. Once the e2e latency exceeds the limit, move to the next phase. Search : Perform a binary search within the bounds of [0 - max-ramp-throughput ]. It starts at the midpoint and if it can sustain that throughput, it searches the high range starting at the midpoint. If it can’t sustain it, then it searches the low range. It recursively performs the search until the current search range size is < 5% of the throughput. Then it moves to the sustain phase. Sustain : The throughput identified by the search phase is maintained for a prolonged period. If it passes, the test is complete. If it fails to sustain (under the target e2e latency), it goes back to the search phase, with the failed sustain throughput as the new upper bound of the search range. The sustain phase is successful if 80% of the intervals (30 intervals of 10 seconds by default) meet the latency criteria. This rule exists as explore mode is trying to find the highest sustainable throughput which sits on the edge of the cluster’s limit, allowing for some latency spikes. I ran explore mode on the following workload: The first scenario has 4 test points which co-varies 4 workload aspects related to partition, client counts and consumer type as dimensions, repeating the tests 3 times. Fig 2. The merged result of three repeats (only small variance between runs) We see that share groups matched or even exceeded consumer group performance. Moreover, this pattern was broadly the same across the three test repeats. We can’t infer this as a generalizable result based on this one test, but my general observation, having been running these tests for a few weeks, on EKS clusters, my Threadripper and my Mac, is that throughput in this kind of synthetic test is comparable (between consumer/share groups). Scenario 2 - Varying fanout This scenario involved 1 topic with 12 partitions with a fanout of 2 and then 6. Fig 3. The merged result of three repeats (only small variance between runs) The surprising result was that share groups maintained a higher sustainable throughput with a fanout of 6. Explore mode is sensitive to spiky latency, and one thing I’ve observed is that share group latency can be more stable under stressful loads than consumer groups. Again, this may not be generalizable, but it shows that share groups might actually outperform consumer groups in some cases. I think the main takeaway from these limited tests is that share groups and consumer groups are in the same ball  park in terms of raw throughput. Link to results as a tarball The throughput benchmarks were a stress test of sorts, pushing Kafka right up to its limit. CPU was maxed out. We don’t want that for the latency benchmarks. We’re not going to push the Kafka cluster to the limit as we want to measure latencies within the performance envelope. With 4 vCPUs, around 100 clients and TLS, a 15 MB/s (1.3 TB daily) workload fits comfortably inside that envelope. I used run-mode , which are the standard fixed throughput benchmarks (best for measuring latency). I ran a single test campaign with 3 scenarios where consumerType was the dimension: 1 topic with 60 partitions, 30 producers, 60 consumers. 12 topics with 6 partitions, 6 consumers per topic, 3 producers per topic. 6 topics with 6 partitions, 3 consumer groups per topic with 6 consumers each, 3 producers per topic. All ran with an aggregate producer rate of 15000 msg/s with a 1 KB message size (15 MB/s). Fig 4. End-to-end latency (p99) over time (10 second intervals). Note: you can select a time range on Dimster charts to zoom into a sub-range. Under this lighter load, we see that share groups add some overhead, with the e2e p99 latency being a little more choppy than the much flatter consumer group latency. Fig 5. End-to-end latency distribution. Note: you can select a percentile range on Dimster charts to zoom into a sub-range. Fig 6. p99 end-to-end latency over time (10 second intervals) The sharegroup overhead is more pronounced in this test. Fig 7. End-to-end latency distribution. Fig 8. p99 end-to-end latency over time (10 second intervals) Again we see the same overhead. The takeaway is that for an adequately sized cluster that is not stressed by the workload, we can expect to see some small share group end-to-end latency overhead. Just to show you this isn’t an artifact of running these tests on k3d on a single workstation, we see the same pattern on a 50 MB/s test I ran a few weeks ago on AWS EKS with the m6i.2xlarge instance (8 vCPU, 32 GB RAM, EBS). Fig 9. 50 MB/s test, p99 end-to-end latency over time (10 second intervals) on an EKS cluster And a 150 MB/s test which was more stressful Fig 10. 150 MB/s test, p99 end-to-end latency over time (10 second intervals) on an EKS cluster We see the typical Kafka latency spikes related to log flushing and rotation (which has this predictable cadence due to how all load starts at the same time, at a constant rate, on one topic). The share group tests consistently used more CPU than the consumer group tests, which is understandable given share groups do a lot more accounting and state management than consumer groups. For example, the first repeat of scenario 1 of the latency test (executed as test points CG, SG, CG, SG, CG, SG): Fig 11. CPU over three apache/kafka pods In all these tests, consumers did nothing with the messages except record some metrics. In the real world consumers write to databases and call APIs. It might take anywhere from < 1 ms to 30+ seconds to process a message. More useful benchmarks simulate consumer processing time which is exactly what we’ll do in the next post. When we add processing time, we start to see where share groups really shine. To summarize some findings from this post: Share groups add a little overhead which might show up in a latency benchmark. Share groups consume more CPU. Raw throughput benchmarks will probably see varied results, but share groups are not fundamentally slower than consumer groups. 64 cores (128 threads) 256 GB DDR5 memory Two Samsung 9100 PRO 8 TB (with one dedicated to the benchmarks) Pretty decent CPU and RAM cooling.  Message distribution over the partitions of a given topic was even. I used the Dimster message distributor PINNED_PARTITIONS which ensures the number of producers is divisible by the number of brokers and pins each producer to a set of partitions, and each producer round-robin sends to its partitions directly. Multi-topic tests used a topic count divisible by the number of brokers to ensure even distribution of leaders over brokers. Consumer counts per group were divisible by the number of brokers to ensure even distribution of partitions over consumers. Ramp . Start with a low throughput and keep doubling the throughput after a configured interval. Once the e2e latency exceeds the limit, move to the next phase. Search : Perform a binary search within the bounds of [0 - max-ramp-throughput ]. It starts at the midpoint and if it can sustain that throughput, it searches the high range starting at the midpoint. If it can’t sustain it, then it searches the low range. It recursively performs the search until the current search range size is < 5% of the throughput. Then it moves to the sustain phase. Sustain : The throughput identified by the search phase is maintained for a prolonged period. If it passes, the test is complete. If it fails to sustain (under the target e2e latency), it goes back to the search phase, with the failed sustain throughput as the new upper bound of the search range. 1 topic with 60 partitions, 30 producers, 60 consumers. 12 topics with 6 partitions, 6 consumers per topic, 3 producers per topic. 6 topics with 6 partitions, 3 consumer groups per topic with 6 consumers each, 3 producers per topic. Share groups add a little overhead which might show up in a latency benchmark. Share groups consume more CPU. Raw throughput benchmarks will probably see varied results, but share groups are not fundamentally slower than consumer groups.

0 views
Jack Vanlightly 1 weeks ago

Introducing Dimster, a performance benchmarking tool for Apache Kafka

Dimster = DIMensional teSTER for Apache Kafka On GitHub: https://github.com/dimster-hq/dimster Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past, for RabbitMQ and Apache Pulsar but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past. With Claude becoming so much better since Christmas I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past, to make a performance testing tool for testing Apache Kafka. In this post I’ll introduce some aspects of Dimster that are core to its design: Dimensional testing Shareable, self-contained results with reproducibility in mind Benchmark prep and post-processing Kubernetes as a standardized runtime A benchmarking and stress testing technique I’ve used for years is something I have called “Dimensional Testing”. We can think of all the configs and workload aspects as forming N-dimensional space. Within that space we can explore the impact of points in that space along a single dimension, or even co-varying dimensions. Take a config or an aspect of a workload as a dimension, and run a series of identical benchmarks where a set of points along that dimension are explored (while everything else remains the same). The dimension could be a client config, such as batch.size or acks. It could be an aspect of the workload such as number of consumers, type of consumer, number of consumer groups, the partition count, the produce rate and so on. There are hundreds of dimensions to explore, which requires some patience and care lest you become overwhelmed. The below depicts just three dimensions, and a set of three scenarios which test performance along one or two dimensions at a time. Fig 1. Three examples of varying or co-varying an aspect of a workload as dimensions Each of the above 16 test points (across 3 scenarios) is a separate benchmark, with a fresh topic, warm-up time, recorded time, and cooldown time etc. The generated charts for throughput and various latencies are repeated for each of the three scenarios, with each test point within a scenario plotted as a series/bar on those charts. This makes it easy to compare the performance results of varying the values of a single dimension (or co-varying values across multiple dimensions). Fig 2. Each scenario maps to a set of charts, with the test points as data series. With share groups being relatively new, I could compare the performance of regular consumers against share group consumers, with identical benchmarks where the dimension explored is consumer type (CONSUMER_GROUP|SHARE_GROUP). The following test has as the base workload of ten topics with each topic having 6 partitions, 6 consumers and 4 producers. Each scenario changes the producer rate, and compares consumer groups to share groups. Record keys are used, so batch sizes will be small, which is a tougher workload than a no-key test which typically results in larger batches. The charts below show the results for an EKS deployment with Kafka deployed on 3x m6i.2xlarge with 300 MB/s provisioned gp3. At 50 MB/s we see that p99 end-to-end latency is stable, with roughly 15 ms overhead for share groups. At 200 MB/s, p99 end-to-end exhibits peaks in a periodic fashion. Dimster uses environments. The sizing of a test is determined by which environment is used. I ran some share group consumer scaling tests, with full mTLS, on Kafka clusters assigned 2, 4, and 8 CPUs. These are the equivalent of vCPUs, as my Threadripper has SMT (hyperthreading) enabled. 2-CPU environment on my Threadripper: I ran the following workload with the above environment, with the CPU requests/limit of 2, 4 and 8. Then I used the dimster compare command to generate comparison charts based on the JSON result files of each run. Each chart compares each test point side-by-side. 10k msg/s - 1000 consumers (6th test point in 1st scenario) We see that 2 CPUs fare a lot worse than 4 and 8 CPUs. 100k msg/s, 250 consumers (4th test point, 3rd scenario) The 2 CPU cluster simply can’t keep up with 100k msg/s and 250 consumers. If we unselect 2-CPU, we see that 4-CPU and 8-CPU was ok. Dimster charts are interactive. Series can be toggled, time and percentile ranges can be selected. One thing I really like about OMB is that it produces a JSON file for the results. These files are easy to store and easy to share. But there was also a lot missing for full traceability and reproducibility. Dimster includes the following in every test campaign result (a set of files in a result directory): Results :  The JSON result file which contains all the test point performance results. For each test point, it includes the effective workload and client configuration. It also includes the hardware and other metadata to know what the benchmark was run against. A CSV file generated from the result JSON file (to make it easy to put in a spreadsheet or run custom visualizations). Source configs : The source workload file itself, as well as any additional files such as any dedicated client config file, the broker config file, the version of Kafka, the version of the Kafka clients, and the CPU/memory/disk given to the brokers and clients. Log files : the log files of dimster-core, the benchmarking framework, and each Kafka broker. Charts : Throughput and latency charts (clickable, zoomable) generated from the result JSON file. Dashboards : Grafana dashboards converted to interactive HTML files. I can run a test campaign then send you the results and you’ll be able to reproduce the results because you know exactly what was run and on what. The results are also completely self-contained, if you want to see the dashboard to look at Kafka metrics during the test, it’s right there as an HTML file in the results. No need for access to Grafana and Prometheus and no need to keep monitoring infrastructure around, it can be ephemeral. Dimster comes with four test modes (which all support dimensional testing): Run : Fixed throughput benchmarks, plus: Live-interaction . Run-mode also supports live interaction with the user. The user can change the producer rate, number of producers and consumers, message size, etc.  Availability : Optionally measure availability (producer/consumer/aggregate) during the standard run-mode benchmark. Explore : Discover the highest sustainable throughput while staying under a target end-to-end latency and percentile. Drain-backlog : Build a backlog and time how long it takes for the consumers to drain it. Optionally set a producer rate during the drain phase, such as when testing if a cluster is big enough to drain a backlog while under normal producer load. Correctness : Detects data loss, data corruption, out-of-order delivery and duplicates.  Example 1: Peak sustainable throughput, 1 partition, share group consumers Explore mode on my Threadripper. The idea was to see the bottleneck of a single partition, as consumers are scaled out. The rule was for p75 e2e latency to stay below 50ms. Example 2: Consumer group vs share group with 1 ms processing time The prior example was an unrealistic synthetic test where the consumer spent no time processing. This explore test added 1 ms consumer processing time per message with 300 consumers. It compared a 300 member consumer group with 300 partitions, vs a 300 member share group, with 5, 10, 25 and 50 partitions. Share groups managed the same throughput (95% of theoretical max based on 1 ms processing time and consumer count), on only 10 partitions. Consumers groups needed 300 partitions. Personally, explore and run are my bread and butter benchmark modes. For a given workload I usually start by finding the throughput limit where Kafka transitions from normal stable performance into degraded territory. I either use run mode and use live interaction to discover the performance limit, or I use explore which is slower but I can leave to run and it discovers the limit in an automated way. For latency benchmarks, once I know the limit, I can craft benchmarks that fit inside the performance envelope for that workload on the specific version of Kafka on the specific hardware I am using. The Dimster CLI has some commands that help before running benchmarks and for post-processing. Dimster resources command The resources command calculates the network and disk throughput required to service a workload. This is important in the cloud for selecting the right instances, ensuring that baseline network and disk throughput are greater than the workload’s demands. Dimster compare command Compare different runs that were executed on different hardware, different broker configurations, different broker versions etc. Dimster pivot command You can slice and dice the data any way you want based on the CSV data. However, you can also pivot the results and generate a chart with the pivot command. This compares the Nth test point across all scenarios. Dimster is easiest to use with Kubernetes. Dimster has a CLI you use from your laptop which speaks Kubernetes and leverages it to run benchmarks on any hardware, any cloud, any laptop or workstation using the exact same orchestration logic. All it needs is a properly configured k8s cluster. It could be minikube or k3d on a laptop or workstation, or AWS EKS or Google Cloud GKE or your own in-house cluster. You can tell Dimster to deploy Apache Kafka to a stateful set in the k8s cluster: Fig 3. Dimster architecture in full deploy mode Or point Dimster (deployed to k8s) at a Kafka service or in-house Kafka cluster. When testing a Kafka service, you can provision a single powerful instance for the Dimster coordinator and worker, and deploy them to a local k8s distro such as Minikube, K3d or Kind. A single worker will happily consume all the cores and memory you give it. Fig 4. Dimster architecture in external deploy mode Or run a super-slim full setup in a tiny minikube/kind/etc local k8s distro: Fig 5. Dimster deployed in a tiny local k8s cluster The workflow is the same. If you can provide a k8s cluster, then Dimster does the rest. Deployment is really simple, monitoring, gathering results, troubleshooting is all simplified via a mix of the CLI being relatively capable, and k8s providing a well-understood platform. K8s is not obligatory , you can run dimster-core directly as a Java program, and point it at a Kafka cluster already provisioned. But you lose many features such as monitoring, live-interaction, automatic gathering of logs, automatic chart and CSV generation and so on. However, you can use the post-processing command dimster chart to generate the charts of a result JSON file. Run the Java directly via the benchmark script: ./bin/benchmark -w path/to/workload file I will be publishing a blog post regularly about Dimster and what you can do with it. So stay tuned. I invite you to go and play around with Dimster , even if it's just running benchmarks on your laptop or workstation. You can get an idea of what charts get produced, what kinds of benchmarks you can run, trying out dimensional testing etc. The docs are pretty decent and should cover most of it. It’s fully featured but still a 0.X version. Myself and a Confluent colleague are the only ones who have run it thus far, so there may be bugs you encounter, if you do encounter a problem, please open an issue with repro steps. If you want to run serious benchmarks, you’ll likely need an EKS or GKE type of Kubernetes cluster. Dimster comes with a special CLI for EKS to deploy EKS with node groups for Kafka, Dimster workers/coordinator, Grafana/Prometheus, as well as storage classes for gp3.  While evaluating consumer group vs share group consumers, I’ve been running benchmarks in k3d on my beefy Threadripper 9980X workstation with 64 cores (128 threads), 256 GB RAM and an Samsung 9100 PRO 8TB SSD, which is plenty to run an entire medium sized Kafka cluster plus workers on it. I’ll be sharing some share group benchmarks tomorrow. Happy testing! Dimensional testing Shareable, self-contained results with reproducibility in mind Benchmark prep and post-processing Kubernetes as a standardized runtime Results :  The JSON result file which contains all the test point performance results. For each test point, it includes the effective workload and client configuration. It also includes the hardware and other metadata to know what the benchmark was run against. A CSV file generated from the result JSON file (to make it easy to put in a spreadsheet or run custom visualizations). Source configs : The source workload file itself, as well as any additional files such as any dedicated client config file, the broker config file, the version of Kafka, the version of the Kafka clients, and the CPU/memory/disk given to the brokers and clients. Log files : the log files of dimster-core, the benchmarking framework, and each Kafka broker. Charts : Throughput and latency charts (clickable, zoomable) generated from the result JSON file. Dashboards : Grafana dashboards converted to interactive HTML files. Run : Fixed throughput benchmarks, plus: Live-interaction . Run-mode also supports live interaction with the user. The user can change the producer rate, number of producers and consumers, message size, etc.  Availability : Optionally measure availability (producer/consumer/aggregate) during the standard run-mode benchmark. Explore : Discover the highest sustainable throughput while staying under a target end-to-end latency and percentile. Drain-backlog : Build a backlog and time how long it takes for the consumers to drain it. Optionally set a producer rate during the drain phase, such as when testing if a cluster is big enough to drain a backlog while under normal producer load. Correctness : Detects data loss, data corruption, out-of-order delivery and duplicates.

0 views

SG-IOV: Socket-Granular I/O Virtualization for SmartNIC-Based Container Networks

SG-IOV: Socket-Granular I/O Virtualization for SmartNIC-Based Container Networks Chenxingyu Zhao, Hongtao Zhang, Jaehong Min, Shengkai Lin, Wei Zhang, Kaiyuan Zhang, Ming Liu, and Arvind Krishnamurthy ASPLOS'26 SR-IOV is a PCIe feature that enables a single device to expose multiple virtual functions, each of which appears as a separate device. This can be used to securely share one hardware device among multiple virtual machines. For containerized workloads, one would think that SR-IOV could be used to expose a virtual NIC to each container. This would save CPU cycles as network virtualization would be handled by the NIC rather than software. The trouble is that SR-IOV doesn’t scale to high container counts (they top out on the order of 100 of virtual functions per physical NIC). This paper introduces Socket-Granular I/O Virtualization (SG-IOV) which enables NIC virtualization at the socket level . The authors have an implementation working on NVIDIA BlueField-3. The key assumption that SG-IOV makes is that container networking uses stream sockets (e.g., TCP) rather than datagram sockets (e.g., UDP). In other words, software running inside a container wants to reliably send a stream of bytes rather than a stream of packets. Streams are transmitted through Warp Pipes , which are simply ring buffers (comprising a base address, head and tail pointers). Warp Pipes can be stored in host memory or NIC local memory. A socket is associated with one or more dedicated Warp Pipes. A server can have thousands of Warp Pipes, because the low-level NIC hardware (which may have scalability limits) doesn’t directly interact with Warp Pipes. Updates to head and tail pointers are not performed directly via memory writes, instead they are communicated through a Cross-FIFO . A single message in a cross-FIFO contains three fields: Ring buffer ID (which Warp Pipe to update) Which pointer to update (head or tail) The new value of the head or tail pointer There are multiple implementations of the Cross-FIFO interface. For example, the authors use a PCIe-based implementation for host→NIC communication, and a RDMA based implementation for NIC→NIC communication. Each cross-FIFO uses limited NIC resources; therefore many sockets share a single cross-FIFO, which is OK because the messages they contain are coarse-grain (e.g., increment tail pointer by 16KiB). Say an application in container A on host 1 wants to send 8KiB of data to an application in container B on host 2. The flow looks like this: Host CPU (host 1): The application calls the standard socket API The payload is copied into the Warp Pipe associated with the socket A message is enqueued into a PCIe cross-FIFO, indicating that the head pointer should be incremented by 8KiB ARM CPU running on the NIC (host 1): Dequeue the message from the cross-FIFO, update the head pointer Enqueue task descriptors to the low-level NIC hardware to read the payload data (i.e., local DMA) from the Warp Pipe and store it in the correct Warp Pipe on host 2 (i.e., RDMA); note that these task descriptors can configure the NIC to perform appropriate network virtualization After the data has been transmitted, enqueue a message to the NIC on host 2 (via a RDMA-based cross-FIFO) to update the head pointer for the Warp Pipe on host 2 ARM CPU running on the NIC (host 2): Dequeue the message from the cross-FIFO Update the local head pointer Enqueue a message to the host CPU (host 2) via a PCIe-based cross-FIFO, indicating that the head pointer should be incremented Host CPU (host 2): Dequeue the message from the PCIe-based cross-FIFO Send data to the application when it calls And a similar sequence would update tail pointers. The key design principles of SG-IOV are: All state is tracked by software running on the host CPU and ARM CPUs running on the NIC All low-level NIC hardware is used in a stateless manner, and is multiplexed across many sockets A benefit of this design is that it is flexible enough to handle the loopback case efficiently. Section 8 of the paper shows that SG-IOV enables the use of hardware-accelerated network virtualization (saving host CPU cycles) while offering latency and bandwidth that is competitive with other container-based network virtualization systems. Figs. 17 and 18 show latency and bandwidth numbers for a few benchmarks: Source: https://dl.acm.org/doi/10.1145/3779212.3790218 Dangling Pointers It seems like one downside of this system is excessive memory usage. If each socket has dedicated large ring buffers, then DDIO may not be effective (see here , here , and here for other papers that discuss this problem). Subscribe now Ring buffer ID (which Warp Pipe to update) Which pointer to update (head or tail) The new value of the head or tail pointer The application calls the standard socket API The payload is copied into the Warp Pipe associated with the socket A message is enqueued into a PCIe cross-FIFO, indicating that the head pointer should be incremented by 8KiB Dequeue the message from the cross-FIFO, update the head pointer Enqueue task descriptors to the low-level NIC hardware to read the payload data (i.e., local DMA) from the Warp Pipe and store it in the correct Warp Pipe on host 2 (i.e., RDMA); note that these task descriptors can configure the NIC to perform appropriate network virtualization After the data has been transmitted, enqueue a message to the NIC on host 2 (via a RDMA-based cross-FIFO) to update the head pointer for the Warp Pipe on host 2 Dequeue the message from the cross-FIFO Update the local head pointer Enqueue a message to the host CPU (host 2) via a PCIe-based cross-FIFO, indicating that the head pointer should be incremented Dequeue the message from the PCIe-based cross-FIFO Send data to the application when it calls All state is tracked by software running on the host CPU and ARM CPUs running on the NIC All low-level NIC hardware is used in a stateless manner, and is multiplexed across many sockets

0 views
Giles's blog 1 weeks ago

10Gb/s Ethernet: using mini-heatsinks with a 10GBASE-T SFP+ module

In my last post I showed the somewhat-scary temperatures I was getting on the MikroTik 10GBASE-T SFP+ module I have plugged into , the 10Gb/s switch I have in my study. As I mentioned then, the plan was to try using some of the mini-heatsinks that people use on Raspberry Pis, to see if that would help. Here's how it went. I bought a 40-piece set of heatsinks made by the improbably-named VooGenzek on Amazon for €8 , and attached two of them like this -- see the bottom module, with the yellow cable: That was 24 hours ago, and here's a chart of temperatures from that module showing the 24 hours before and after: You can see the big drop-off in the middle of the chart; it even overshot a bit (I'm guessing because the heatsinks absorbed a bunch of heat initially when I put them on). The difference looks more dramatic than it is! See where the Y-axis starts. But given that the weather has been pretty much the same today as it was yesterday, that looks like a 3.5°C improvement. Not great, but not nothing either. In the copious discussion about the last post on Hacker News , one of the most popular comments -- from -- was that there are two generations of SFP+ modules for this kind of thing; an older one, using a Marvell chip, and the newer one using one from Broadcom. on the ServeTheHome forums made the same point. They both mentioned that a good indicator of which type a module is using is that the older ones tend to be rated up to 30 metres, while the newer ones are rated up to 100. This one is a MikroTik S+RJ10 , which definitely is one of the older ones -- the specific chip is mentioned in the docs . I'm not sure which chip the Protectli modules in my router are -- they're these modules -- but they say they're rated up to 30 metres, so I guess they're probably the older type too. Looking into switching those out might be a good next step! I probably won't do that in the short term, though, unless I start getting issues as we move into summer.

0 views
Phil Eaton 1 weeks ago

Serving files over HTTP three ways: synchronous, epoll, and io_uring

This is an external post of mine. Click here if you are not redirected.

0 views
Unsung 2 weeks ago

“We accepted this gradual bloat, but that’s not progress.”

I like the Fits on a Floppy manifesto by Matt Sephton: Software should be as small as it can be. Not as a gimmick, but as a discipline. The floppy disk is the measuring stick: 1.44 MB. If the software that ran entire businesses could fit in that space, then a modern, focused, single-purpose tool certainly can. In my own work, I have mostly focused on the web side of this equation, as this is where the situation feels the most dire: tens of megabytes dedicated to heavy frameworks, unnecessary tracking scripts, and video ads have a real negative effect on experiencing websites. Progressive loading challenges also make it harder to offer a great experience. But space considerations are starting to feel more pertinent to local software too, in an era where SSD and hard drive prices are going up, and where local LLM models start taking up more room . Also, this passage feels very Unsung, and is exactly why the tag #history exists on this blog: I don’t miss floppy disks. I miss the mindset they demanded—that every byte matters, that constraints breed creativity, and that software should be light on its footprint. If you reduce tech history to just nostalgia, it won’t be that useful. But if you look at it as inspiration , you might find some truly wonderful and meaningful stuff in there. On that note: Bonus for a nice classic domain, and a nod toward Mac’s most famous screensaver. #history #performance

0 views
Lalit Maganti 2 weeks ago

Don't answer the first question

In my work on Perfetto, a performance debugging tool, one question I get often is: “how do I split a Perfetto trace into multiple files?” Instead of answering directly, I say: “there isn’t an easy way to do that, but what’s leading you to collect traces large enough to want to split?” This is one of my golden rules at work. When a user asks me something “weird”: don’t answer the first version of the question . On the surface this might appear like I’m talking about the XY problem , but that stops one step short. It treats the user’s stated question as a puzzle to decode: figure out what they really meant, answer that, move on. I think we can go much further. Instead, the confusion that produced the wrong question is itself an opening, and the conversation it sparks is valuable to both sides. The user walks away with a better mental model of the tool. I walk away with a clearer picture of where the product confuses people. And sometimes, between us, we figure out that the product itself needs to change.

0 views
Evan Hahn 2 weeks ago

Make ZIP files smaller with ZIP Shrinker

I built ZIP Shrinker, a little browser tool to shrink ZIP files. It also works with formats that are secretly ZIPs underneath, like APK, EPUB, JAR, and many more. Try it out! At a high level, this tool (1) re-compresses every file in the ZIP archive with higher compression (2) removes all metadata (3) removes entries for directories. ZIP files are typically compressed with an algorithm called Deflate . There are a few tools that can re-compress Deflate data and make it smaller, usually by spending more time on the computation. I took one of these tools, libdeflate , and applied it to each compressed entry in the ZIP. I chose libdeflate because of its performance; alternatives like Zopfli can achieve marginally smaller results but take much longer. I created libdeflate.js , a WebAssembly wrapper for libdeflate, as part of this work. (I always relish my time working with WASM!) Each entry in a ZIP file can contain additional metadata like comments. These aren’t typically used, and if they’re there, my shrinker removes them. This usually doesn’t save too many bytes, but it doesn’t hurt. Removing directories is a slightly spicier decision. Usually, the existence of a file entry implies the existence of the directory it’s inside. For example, implies the existence of the directory. Some ZIPs include separate entries for directories, but because most extractors don’t need them, I remove those. This has the side effect of removing empty directories— let me know if that’s a problem for you. If you want to see how the whole project works, check out the full source code . I tested several ZIPs to see what this tool could do. Some anecdotal results: Not particularly scientific, but useful to see. This proof-of-concept shows that you can make ZIP files smaller without sacrificing backwards compatibility. It could be useful for sending an archive to someone, but could also be useful to reduce bandwidth and server costs. For example, if Project Gutenberg re-compressed all their EPUB books with this method, they might be able to save some money. Of course, ZIP isn’t always the most efficient format. Typically, other archives like can be smaller. But those aren’t backwards-compatible! ZIP also supports compression methods other than Deflate. They’re atypical, but you could use them to achieve a smaller result, too. Give my tool a try if you want a smaller ZIP.

0 views
Kev Quirk 2 weeks ago

Upgrading My Home Internet to Full Fibre

As many regular readers know, we live in the North Wales countryside, which means it can take time to get the latest and greatest when it comes to technology. As a result, we were previously "limited" to FTTC (fibre to the cabinet) which had a max speed of 70Mbps. As a result, we got okay internet speeds: But then I saw the ISP vans in the village, and I asked them what they were doing - "oh, we're upgrading the village to full fibre" she said. I had to have it! As soon as FTTP (fibre to the premises) was available, I placed the order with my ISP (who offered me a great deal that's only £5 per month more), and this is the result: In all honesty, I haven't noticed the difference. We didn't have any buffering issues when watching things like Netflix or Apple TV, so I'm not really sure why I upgraded in hindsight. I thought it would be this incredible difference where my internet would then be rapid, but the truth is, it's complete imperceptible. I remember when I upgraded from a 56k MODEM, to ~2Mbps broadband and it blew my mind. I was thinking this would be the same, but no. I do think the increased upload speed is going to come in handy when it comes to things like syncing my private git repos back to my Synology, but aside from that, there's not much in it. Had I paid full price (~£20 more per month) I don't think I'd have been too happy, but since I got a good deal, I'm not too bothered. Thanks for reading this post via RSS. RSS is ace, and so are you. ❤️ You can reply to this post by email , or leave a comment .

1 views

Efficient Remote Memory Ordering for Non-Coherent Systems

Efficient Remote Memory Ordering for Non-Coherent Systems Wei Siew Liew, Md Ashfaqur Rahaman, Adarsh Patil, Ryan Stutsman, and Vijay Nagarajan ASPLOS’26 It seems like every year there is a new PCIe standard which doubles bandwidth. The key takeaway from this paper is that these improvements are for the fast path , but there exist use cases which are crippled by details of the PCIe protocol. This paper describes two inefficiencies caused by the current protocol and suggests improvements to address them. From my experience, here is the canonical way for the host X64 CPU to send a request to a PCIe device and receive a response. First, the request payload is stored in host memory (which can be mapped as CPU cacheable). During this time, the device does not read any of the payload. Next, a handful of MMIO writes (to uncacheable memory) are used to point the device to the payload and kick off the work. The device stores results via posted DMA writes to host memory. After producing all results, the device uses more posted DMA writes to update control information in host memory. Finally (and optionally), the device signals an interrupt. From the perspective of the device, the interrupt is another posted DMA write to the host. Posted DMA writes are visible to the host in the order they are issued. In other words, the host is guaranteed that it will only observe the control information update after the response payload has been written. Similarly, the host will receive the interrupt after the control information is written. The problems identified by this paper are caused by a lack of ordering guarantees. Table 1 illustrates where ordering is enforced today: Source: https://dl.acm.org/doi/10.1145/3779212.3790156 W→W ordering means two writes from a device will appear to land in host memory in the order they were issued. W→R means that if a device writes to an address in host memory, and then issues a read, the read response will contain the updated data. For more details there is a great explainer on a LinkedIn post here . PCIe has provisions that allow devices to relax these ordering guarantees, but there is no way to enforce more ordering. “R→R No” means that if a device issues two DMA read requests, the read response data could appear as if the reads occurred in the wrong order. This matters for scenarios where the host CPU is actively writing to the same data structures that the device is reading from. Imagine an application that involves two data structures: An array of data: An array of flags: is only valid if is set. Carefully written software could update these data structures, ensuring that is set only after the associated data is written. The trouble is there is no efficient way for a PCIe device to pipeline reads of and . For a given , the device has no choice but to read , wait for the response, and then issue the read of . Some systems work around this by ensuring that data and metadata are stored in the same cache line. A more realistic application is a key-value store that is concurrently updated by the host CPU and read by a device (i.e., RDMA reads from a NIC). It is hard to develop such a system in a way that offers strong consistency and high performance. The solution proposed by the authors is to add semantics similar to and to PCIe. A read TLP with the bit set would not be reordered past subsequent reads. A write TLP with the bit set would not be reordered before prior writes. The other inefficient scenario described by this paper is caused by the lack of W→W MMIO ordering. The problem here is in the CPU architecture. On x86, an MMIO region can be mapped as write-combined, but the application must issue expensive instructions to ensure that writes appear in the correct order. This restricts MMIO writes to low-bandwidth scenarios. The architectural solution described by this paper is to add four explicit MMIO instructions to the ISA: MMIO-Release MMIO-Acquire MMIO-Store and MMIO-Release are store instructions. MMIO-Release has release semantics (it will not be reordered before prior stores). MMIO-Load and MMIO-Acquire are load instructions. MMIO-Acquire instructions are not reordered after subsequent loads. The authors note that RISC-V has similar instructions, but they involve the CPU stalling to implement the desired memory ordering. The solution offered by this paper instead involves a re-order buffer in the PCIe root complex. The CPU assigns sequence numbers to MMIO operations, and the root complex uses those sequence numbers to restore operations to their correct order. Fig. 6 has simulation numbers projecting speedups an RDMA-based key-value store could see if it could properly pipeline DMA reads. and are the work described by this paper. assumes additional speculative optimizations in the root complex. Source: https://dl.acm.org/doi/10.1145/3779212.3790156 Fig. 10 shows how fast a single core can perform unordered MMIO writes. The idea is that if the CPU architecture is enhanced to allow an application to express just the right amount of ordering, it could be possible for a single core to write packet data as fast as a NIC can consume it. Source: https://dl.acm.org/doi/10.1145/3779212.3790156 Dangling Pointers The paper ends with this food for thought: By establishing a high-performance baseline for non-coherent I/O, this work raises the question of whether the complexity of coherent interconnects (like CXL) is truly necessary for future host-device communication. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts. An array of data: An array of flags: MMIO-Release MMIO-Acquire

0 views
Unsung 2 weeks ago

“Nothing short of a magic trick.”

A fascinating 25-minute video from Mark Brown at Game Maker’s Toolkit about how the team building Grand Theft Auto 3 conquered the technical limitations of PlayStation 2: = 2x) and (width >= 700px)" srcset="https://unsung.aresluna.org/_media/nothing-short-of-a-magic-trick/yt1.2096w.avif" type="image/avif"> = 3x) or (width >= 700px)" srcset="https://unsung.aresluna.org/_media/nothing-short-of-a-magic-trick/yt1.1600w.avif" type="image/avif"> How do you squeeze a city that occupies over 50 megabytes into the 32MB memory of the console? You simply do what The Truman Show did , and construct the city around the player as they’re moving around : This has, as you can expect, a lot of technical and even game-design consequences, and the video goes into a lot of detail on these – including Brown rebuilding the Grand Theft Auto 3 source to visualize things better. This technique is also used in interface design, for example if you have a really long list of things that would take too much memory or GPU power to render. What the video calls “streaming” is, in the context of UI, often called “virtualization”: instead of having a full long list (or an entire world), you abstract it away – or, virtualize – into something nimbler. Some of the challenges and techniques used by Grand Theft Auto 3 apply pretty directly here, as well: On the other hand, “speedy players” and “pop in” can’t ever be solved because any UI list is random access, and slowing users down is not typically appropriate; better to make loading as pleasant as possible than introduce any roadblocks, even if figurative ones. #definitions #games #performance #youtube you can use UI skeletons as “low poly” models, in some contexts, you can guess the user is more likely to move in one direction (for example, going through fonts in a font picker), and more eagerly preload where they’re going to look next, rather than symmetrically in both directions.

0 views
The Coder Cafe 3 weeks ago

Cache Use Cases Explained

☕ Welcome to The Coder Cafe! Today, we discuss cache use cases. When we think about caching, it’s pretty frequent to focus on where it happens; for example, client-side, server-side, or in a CDN. Yet, there’s a more important question that should be answered first: What’s the use case? In this post, we will break down two common cache use cases: reducing latency and improving capacity. And we will see why the line between the two is blurrier than it seems. Get cozy, grab a coffee, and let’s begin! A Cache for Latency Latency is the time between when a request is sent and when a response is received. A cache for latency exists to reduce the average latency of a service . The classic access pattern looks like this 1 : We check the cache first. On a cache hit, we return the data directly without touching the backend. On a miss, we go to the backend, return the result, and store it in the cache for future requests. Why does this reduce latency? The cache keeps data in memory, which is significantly faster to read from than a remote database that may involve network round-trips, disk I/O, and query execution. On a hit, all of that work is skipped. In Soft vs. Hard Dependency , we introduced two kinds of dependencies: A soft dependency is a non-critical dependency for the service to operate properly. A hard dependency is a critical dependency for the service to operate properly. A cache for latency is a soft dependency . If the cache becomes unavailable, requests fall through to the backend. The system keeps working, just at a higher latency. Keep this in mind, because it’s the key difference we’ll come back to. A cache for capacity exists to serve higher throughput than the backend can handle on its own. The access pattern is identical to the latency case: cache first, then backend on a miss. So what actually makes these two different? The difference is not in the code; it’s in what the backend can absorb. In a capacity scenario, the backend would be overwhelmed if it received all the traffic directly. The cache absorbs a large portion of the requests, keeping the backend load manageable. This changes the nature of the dependency . If the cache goes down, the backend is suddenly hit with all the traffic it was previously shielded from. Whether the system survives depends on the backend’s own capacity. If the backend can scale fast enough, the cache is still a soft dependency: there will be a rough period, but the system recovers. If the backend can’t cope with the load, the cache becomes a hard dependency . Without it, the system fails . Here’s a question worth asking: if the access pattern for both types is identical, how do we know which one we have? In most cases, caches are introduced to reduce latency. But here’s what can happen over time: Our system is stable. Cache hit rates are high, backend load is low. Traffic grows. The backend load stays low because the cache is absorbing most of it. Nothing breaks. No alerts fire. Six months pass. Nothing has changed, no code, no configuration, no architecture decision. And yet the cache is no longer reducing latency. It’s keeping the backend alive. The cache didn’t change. The code didn’t change. The system grew around the cache, and the cache quietly became load-bearing . The same risk appears when a cache goes cold. For example: A migration to a new cache instance A data format change that requires purging existing entries A cache restart after maintenance Any of these can produce a large wave of cache misses in a short window. If we were running a latency cache, we would see higher latency for a while. If we were running a capacity cache, we would see a traffic spike that the backend can’t absorb. The unsettling part is that the code is identical in both cases. The difference only becomes visible at failure time . The root problem is that teams often don’t know which type of cache they’re running . They built it for latency, and that’s still how they think about it, even as the system outgrows that assumption. A few approaches help here: Periodically ask: could the backend handle the current traffic if the cache were completely removed ? Load testing without the cache, or estimating backend capacity against current traffic levels, gives you a concrete answer. Treat cache hit rate as a meaningful operational signal , not just a performance metric. A sustained drop in hit rate means the backend is absorbing more traffic than usual. If that trend continues, it’s an early warning that you may be drifting toward a capacity problem. When migrating a cache or invalidating a large portion of its data, warm the new cache before routing live traffic to it. This prevents a cold-start burst from hitting the backend all at once. Finally, once we recognize that a cache is operating as a capacity cache , we should treat it accordingly. It’s no longer optional infrastructure and it deserves proper alerting and a clear plan for what happens if it goes down. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. A cache for latency serves data from memory to reduce average response time. It is a soft dependency: if unavailable, the system degrades in latency but continues to work. A cache for capacity absorbs traffic that the backend couldn’t handle on its own. It can be a soft or a hard dependency, depending on whether the backend can absorb the load without it. Both types share the same access pattern, which makes them easy to confuse. A latency cache can silently become a capacity cache as traffic grows, without any code change. When a capacity cache goes cold or fails, the backend can be overwhelmed. Hit rate monitoring, periodic load testing, and cache warming are practical ways to manage this risk. Availability Models Safety and Liveness The PACELC Theorem The Three Types of Cache Cache stampede Even though variations exist. A Cache for Latency Latency is the time between when a request is sent and when a response is received. A cache for latency exists to reduce the average latency of a service . The classic access pattern looks like this 1 : We check the cache first. On a cache hit, we return the data directly without touching the backend. On a miss, we go to the backend, return the result, and store it in the cache for future requests. A soft dependency is a non-critical dependency for the service to operate properly. A hard dependency is a critical dependency for the service to operate properly. Our system is stable. Cache hit rates are high, backend load is low. Traffic grows. The backend load stays low because the cache is absorbing most of it. Nothing breaks. No alerts fire. Six months pass. Nothing has changed, no code, no configuration, no architecture decision. And yet the cache is no longer reducing latency. It’s keeping the backend alive. A migration to a new cache instance A data format change that requires purging existing entries A cache restart after maintenance Periodically ask: could the backend handle the current traffic if the cache were completely removed ? Load testing without the cache, or estimating backend capacity against current traffic levels, gives you a concrete answer. Treat cache hit rate as a meaningful operational signal , not just a performance metric. A sustained drop in hit rate means the backend is absorbing more traffic than usual. If that trend continues, it’s an early warning that you may be drifting toward a capacity problem. When migrating a cache or invalidating a large portion of its data, warm the new cache before routing live traffic to it. This prevents a cold-start burst from hitting the backend all at once. Finally, once we recognize that a cache is operating as a capacity cache , we should treat it accordingly. It’s no longer optional infrastructure and it deserves proper alerting and a clear plan for what happens if it goes down. A cache for latency serves data from memory to reduce average response time. It is a soft dependency: if unavailable, the system degrades in latency but continues to work. A cache for capacity absorbs traffic that the backend couldn’t handle on its own. It can be a soft or a hard dependency, depending on whether the backend can absorb the load without it. Both types share the same access pattern, which makes them easy to confuse. A latency cache can silently become a capacity cache as traffic grows, without any code change. When a capacity cache goes cold or fails, the backend can be overwhelmed. Hit rate monitoring, periodic load testing, and cache warming are practical ways to manage this risk. Availability Models Safety and Liveness The PACELC Theorem The Three Types of Cache Cache stampede

0 views