Posts in Backend (20 found)
The Coder Cafe 3 days ago

Metastable Failures in Distributed Systems

☕ Welcome to The Coder Cafe! Today, we explore one of the nastiest failure patterns in distributed systems: metastable failures. Based on the Metastable Failures in Distributed Systems whitepaper, we break down why these failures happen, why they persist, what we can do about them, and why our instinct to fix them is probably wrong. Get cozy, grab a coffee, and let’s begin! Stable, Vulnerable, Metastable Metastable failures borrow their name from physics, where metastable means something that looks stable but isn’t . To understand how a distributed system can end up in such a state, we need to look at three distinct states it can be in: Stable: The system recovers on its own after any disruption. This is what we call resilience in Resilient, Fault-tolerant, Robust, or Reliable . Vulnerable : The system looks perfectly healthy, but it's operating above its hidden capacity : the load level below which it can self-heal from any disruption. It responds fast, metrics are green, and nothing is alarming. Many production systems deliberately operate here because it's more efficient: resources are used closer to their limit. But there's no slack left . And the deeper the system operates in a vulnerable state, the smaller the trigger needed to push it over the edge. Indeed, a system just above its hidden capacity can survive large disruptions; a system near its advertised capacity can be tipped by almost anything. Metastable failure : A trigger (e.g., a network blip, a deployment, a traffic spike) pushes the system over its hidden capacity. The system is not fully broken: processes are alive, and it’s still running. But goodput collapses: it’s no longer doing any useful work. Technically up, effectively down . And unlike a regular outage, removing the trigger doesn’t fix it. Getting out requires a strong corrective push: a restart, a dramatic load reduction, a manual intervention. NOTE : If you’re not familiar with the concept of goodput, it’s the throughput of useful work completed successfully. For example, in a web application receiving 1000 requests per second but returning errors for 800 of them, the goodput is only 200 RPS. The three states of a metastable failure. A system can drift into the vulnerable state unnoticed, and a single trigger is enough to push it into the metastable state it cannot escape on its own. The most disorienting property of a metastable failure: stopping the trigger doesn’t stop the failure. To understand why, we need to talk about feedback loops. In a previous post on Systems Thinking Explained , we defined a feedback loop as: If causes , then influences . A feedback loop is exactly the mechanism that keeps a system stuck in the metastable state . There is always a sustaining effect, a feedback loop, that prevents recovery. The trigger is just what pushes the system over the edge. The loop is what keeps it there. Blaming the trigger is the natural instinct, and almost always the wrong diagnosis. Let’s discuss a concrete example to make this clear. Imagine a web application that queries a database. The database comfortably handles up to 300 QPS. The application retries any query that doesn’t respond within 1 second. The system is running at 280 QPS, healthy and fast, within the database’s capacity. Then, a transient network issue occurs for 10 seconds. When the issue is over, all the queued requests flood in at once. The database gets hit with a surge it can’t absorb: latency spikes and queries start timing out. So the application retries them. This doubles the effective load to 560 QPS. The database, already struggling, falls further behind. More timeouts. More retries. The loop is now self-sustaining: High load → Timeouts → Retries → Higher load → More timeouts → More retries The transient network issue was fixed minutes ago. Yet, the system is still completely broken. The trigger is gone; the feedback loop is not . The only way out is to dramatically cut the load or disable retries entirely. This is a metastable failure . The system was vulnerable because it was operating close to its hidden capacity . A minor, transient trigger pushed it over the edge and into a self-sustaining failure state it couldn’t escape on its own. The retry mechanism, a feature designed to improve reliability, became the very thing that prevented recovery. This is one example, but the same pattern appears with caches, connection pools, failover logic, and more. The shape is always the same: a feedback loop that turns a temporary problem into a permanent one . Two things make metastable failures particularly nasty. We can be tempted to blame the wrong thing . When an outage happens, the trigger is what’s visible and recent: a spike, a deployment, a hardware fault. It’s the obvious culprit. But the trigger only exposed the problem; it didn’t create it. The sustaining feedback loop was already there, structural and invisible. When analyzing the problem in retrospect, teams focus on the trigger; fixes address the trigger; and the system remains vulnerable to the next one. The authors of the paper observed teams declare a metastable failure “resolved” multiple times before realizing the real cause had never been touched. The feedback loop grows stronger with scale . Small-scale tests won’t reveal it. A staging environment running at 10% capacity may handle the same trigger without falling into a metastable state, because the loop isn’t strong enough at that scale to be self-sustaining. This means these failures can slip past even rigorous testing regimes and only manifest in production at full load. We defined hidden capacity earlier as the load level below which the system can self-heal from any disruption. It’s different, and always lower, than the advertised capacity. In our example, the numbers make it concrete: the advertised capacity is 300 QPS, but the hidden capacity is only 150 QPS, because retries double the load under failure. The gap between those two numbers is where vulnerability lives . Measuring the hidden capacity is not straightforward, though. One possible approach is to apply a trigger at a given load level and observe whether the system recovers on its own: If it does, we are below the hidden capacity. If it doesn’t, we are above it. We can also estimate it indirectly: in the retry example, retries double the load under failure, so the hidden capacity is roughly half the advertised capacity. Metastable failures are not bugs . We can’t write a unit test that catches them. They are emergent behaviors: properties that arise from the interaction of a system’s components under specific conditions, not logic errors in any individual component. No single piece of code is buggy, no single configuration is wrong. The failure is a consequence of how everything fits together under load. This changes how we need to think about them. The right question after an outage is not “ What failed? ” but “ What loop sustained it? ” And before an outage, the danger is not having bugs; it’s optimizing so aggressively for efficiency that we push the system deeper into the vulnerable state without realizing it . Retries, caches, failover logic, connection pools: these are all features that improve reliability in the common case. They are also, under the right conditions, the sustaining mechanisms of metastable failures. The same design decision that makes a system more resilient in normal operation can also prevent it from recovering when things go wrong. The paper describes several approaches to reduce the risk of metastable failures: Retry budgets and circuit breakers : Instead of retrying indefinitely, cap the total number of retries in flight at any given time. This directly weakens the feedback loop by limiting work amplification. LIFO scheduling under overload : Counterintuitively, switching from FIFO to LIFO when the system is overloaded allows some requests to complete within their deadline, preserving goodput instead of letting every request time out. NOTE : I already wrote a post about that approach in Adaptive LIFO . Fast error paths : Success paths are heavily optimized, but error paths often aren’t. An expensive error path (stack traces, DNS lookups, disk writes) under high failure rates can itself become a sustaining mechanism. Optimizing error paths reduces this risk. Read-through caches over look-aside caches : A read-through cache (where the cache itself fetches missing data from the database) can continue filling itself even when the application has given up on a request, steadily increasing the hit rate and helping the system recover. A look-aside cache (where the application is responsible for populating the cache) can’t. Production stress testing : Small-scale tests won’t reveal metastable failures. Testing against a portion of production traffic, with engineers ready to intervene, is the most reliable way to surface them. A note of humility from the paper: there is no systematic solution yet. These are ad-hoc mitigations developed in response to known failures. Detecting vulnerable states before they collapse remains an open problem. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. A distributed system can pass through three states: stable, vulnerable, and metastable. The vulnerable state looks healthy, but it isn’t. The threshold between stable and vulnerable is invisible. Systems can operate in the vulnerable state for months without any sign of trouble. When a trigger pushes a vulnerable system into a metastable failure, a feedback loop sustains the failure even after the trigger is gone. The trigger is not the root cause. The feedback loop is. Fixing the trigger leaves the system vulnerable to the next one. Reliability features like retries and caches can become the sustaining mechanism of a metastable failure under the right conditions. Metastable failures are emergent behaviors, not bugs. We can’t unit test for them, and optimizing for efficiency makes them more likely. Mitigations exist (retry budgets, circuit breakers, LIFO scheduling, fast error paths), but they are all ad-hoc responses to known failures. Detecting vulnerable states before they collapse remains an open problem. Resilient, Fault-tolerant, Robust, or Reliable? Adaptive LIFO Fail Open vs. Fail Closed Metastable Failures in Distributed Systems Metastability and Distributed Systems Stable, Vulnerable, Metastable Metastable failures borrow their name from physics, where metastable means something that looks stable but isn’t . To understand how a distributed system can end up in such a state, we need to look at three distinct states it can be in: Stable: The system recovers on its own after any disruption. This is what we call resilience in Resilient, Fault-tolerant, Robust, or Reliable . Vulnerable : The system looks perfectly healthy, but it's operating above its hidden capacity : the load level below which it can self-heal from any disruption. It responds fast, metrics are green, and nothing is alarming. Many production systems deliberately operate here because it's more efficient: resources are used closer to their limit. But there's no slack left . And the deeper the system operates in a vulnerable state, the smaller the trigger needed to push it over the edge. Indeed, a system just above its hidden capacity can survive large disruptions; a system near its advertised capacity can be tipped by almost anything. Metastable failure : A trigger (e.g., a network blip, a deployment, a traffic spike) pushes the system over its hidden capacity. The system is not fully broken: processes are alive, and it’s still running. But goodput collapses: it’s no longer doing any useful work. Technically up, effectively down . And unlike a regular outage, removing the trigger doesn’t fix it. Getting out requires a strong corrective push: a restart, a dramatic load reduction, a manual intervention. NOTE : If you’re not familiar with the concept of goodput, it’s the throughput of useful work completed successfully. For example, in a web application receiving 1000 requests per second but returning errors for 800 of them, the goodput is only 200 RPS. We can be tempted to blame the wrong thing . When an outage happens, the trigger is what’s visible and recent: a spike, a deployment, a hardware fault. It’s the obvious culprit. But the trigger only exposed the problem; it didn’t create it. The sustaining feedback loop was already there, structural and invisible. When analyzing the problem in retrospect, teams focus on the trigger; fixes address the trigger; and the system remains vulnerable to the next one. The authors of the paper observed teams declare a metastable failure “resolved” multiple times before realizing the real cause had never been touched. The feedback loop grows stronger with scale . Small-scale tests won’t reveal it. A staging environment running at 10% capacity may handle the same trigger without falling into a metastable state, because the loop isn’t strong enough at that scale to be self-sustaining. This means these failures can slip past even rigorous testing regimes and only manifest in production at full load. If it does, we are below the hidden capacity. If it doesn’t, we are above it. Retry budgets and circuit breakers : Instead of retrying indefinitely, cap the total number of retries in flight at any given time. This directly weakens the feedback loop by limiting work amplification. LIFO scheduling under overload : Counterintuitively, switching from FIFO to LIFO when the system is overloaded allows some requests to complete within their deadline, preserving goodput instead of letting every request time out. NOTE : I already wrote a post about that approach in Adaptive LIFO . Fast error paths : Success paths are heavily optimized, but error paths often aren’t. An expensive error path (stack traces, DNS lookups, disk writes) under high failure rates can itself become a sustaining mechanism. Optimizing error paths reduces this risk. Read-through caches over look-aside caches : A read-through cache (where the cache itself fetches missing data from the database) can continue filling itself even when the application has given up on a request, steadily increasing the hit rate and helping the system recover. A look-aside cache (where the application is responsible for populating the cache) can’t. Production stress testing : Small-scale tests won’t reveal metastable failures. Testing against a portion of production traffic, with engineers ready to intervene, is the most reliable way to surface them. A distributed system can pass through three states: stable, vulnerable, and metastable. The vulnerable state looks healthy, but it isn’t. The threshold between stable and vulnerable is invisible. Systems can operate in the vulnerable state for months without any sign of trouble. When a trigger pushes a vulnerable system into a metastable failure, a feedback loop sustains the failure even after the trigger is gone. The trigger is not the root cause. The feedback loop is. Fixing the trigger leaves the system vulnerable to the next one. Reliability features like retries and caches can become the sustaining mechanism of a metastable failure under the right conditions. Metastable failures are emergent behaviors, not bugs. We can’t unit test for them, and optimizing for efficiency makes them more likely. Mitigations exist (retry budgets, circuit breakers, LIFO scheduling, fast error paths), but they are all ad-hoc responses to known failures. Detecting vulnerable states before they collapse remains an open problem. Resilient, Fault-tolerant, Robust, or Reliable? Adaptive LIFO Fail Open vs. Fail Closed Metastable Failures in Distributed Systems Metastability and Distributed Systems

0 views

Kafka Share Groups and Parallelizing Consumption - Part 2: Producer Batches and share.acquire.mode

All tests were executed against Kafka 4.3.0 using Dimster .  In the last post we used simulated consumer processing time to reveal how important it is to set an appropriate value for to ensure the consumer parallelism that we expect. With a uniform distribution of messages over partitions, the rule of thumb was a value somewhat lower than: But there’s more to parallel consumption than . The size of producer batches also plays a role when using the default ( ). Share group members are assigned to partitions like consumer group members are, except that share group assignment allows multiple consumers to be assigned to the same partition. If the number of share consumers is less than the partition count, then each consumer will be assigned multiple partitions. If the consumer count matches or exceeds the partition count, then each consumer will be assigned one partition. Fig 1. Share consumer assignments. Left: consumer count < partition count. Right: consumer count > partition count. When a consumer is assigned only one partition, it will always be fetching from one broker. If a consumer is assigned multiple partitions, it may fetch from multiple brokers concurrently. There are two values for : The Javadoc says the following: The application chooses between the two modes using the consumer share.acquire.mode configuration property. If the application sets the property to batch_optimized or does not set it at all, the share consumer fetches records based on batch boundaries which may mean that the number of records returned may exceed the max.poll.records configuration property. The share consumer may also prefetch records and buffer them temporarily awaiting the application's next call to poll(Duration). If the application sets the property to record_limit, the share consumer fetches no more than records at a time and does not prefetch. This is slower but gives the application tighter control on how many records are fetched and when the acquisition locks begin. So why two modes?  It comes down to efficiency ( ) and consumer control ( ). First of all the sentence “ the share consumer fetches records based on batch boundaries” is correct but a little misleading. No matter what mode is used, whole batches are returned to the consumer over the wire . In other words, the data sent over the network is always based on batch boundaries as the record batch is the unit of data delivery.  What that sentence refers to is what records are acquired by the consumer and returned to the application: With , the config is a soft cap. The consumer acquires any batches (in their entirety) that are covered by the offset range determined by . These acquired batches are returned to the consumer, and the consumer returns the records of those batches to the calling application (that invoked ). With , the config is a strict cap. The consumer only acquires the records that are covered by the offset range determined by (though less if less records are available). However, the unit of data delivery is the record batch, so the consumer receives whole batches but only returns a specific offset range to the calling application. For example, in the figure below we have three consumers sending fetch requests with and . Fig 2. Three consumers fetching with batch_optimized Despite asking for only one record, each consumer acquires and receives records along batch boundaries. The result of consumer.poll(Duration) for c1 is three records, not one. If we rerun this scenario with record_limit: c1 acquires record 0 c2 acquires record 1 c3 acquires record 2 However, the batch is the unit of data delivery, so batch 1 is sent in its entirety to each consumer (the consumer internals only returns the acquired records of the batch to the application). Fig 3. Three consumers fetching with record_limit This is obviously less efficient… We just sent the same batch three times! Nonetheless, exists because sometimes that inefficiency over the wire is countered by other concerns (one of which is covered in this post). Another efficiency gain that has is that because each batch is only sent to one consumer, Kafka only needs to do share group housekeeping of the batch as a whole, not each record individually. This reduces CPU and makes metadata more compact. If we get mixed acknowledgments of the batch records (2 success, 1 reject) only then does the record tracking explode the metadata to be per-record. With , the housekeeping always tracks state per record, which is more expensive. The final difference between the modes is that in mode, a consumer can send concurrent fetches to all the brokers of its assigned partitions. This further increases the number of records that a consumer might receive as is a soft cap per broker. With , the consumer sends one fetch at a time, round-robin between the brokers of its partitions. This difference only manifests when the consumer count is less than the partition count. We’ll cover this aspect more in the next post. The main implications are that: With , the effective consumer parallelism can be impacted by the average number of records per record batch. With , the network throughput will increase in most scenarios as offset ranges are unlikely to align with batch boundaries. If the is larger than the average number of records per batch, then each batch may only be delivered twice. The network throughput can a lot if the is much smaller than the average number of records per record batch. Don’t worry if that isn’t clear yet, we’ll gather some empirical results next which should make it clearer. Let’s test this out with Dimster’s interactive mode, using the same workload as the last post. In the last post, we calculated that the maximum theoretical consumption rate for 300 consumers with a processing time of 5 ms per message would be 60,000 msg/s. By setting to 30 we reached 55,000 msg/s and then finally reached 60,000 with low end-to-end latency by adding an additional 12 consumers (2 per partition). So we use the following workload file (no dimensional stuff in this one as we’re going to use live-interaction): In this test we’re going to make the record batches bigger and see what happens to the consumption rate. First we start Dimster and ensure it’s handling the 60k msg/s. Once it has started and settled in, we see it’s coping well. If I look at the metrics, the current record batch size is around 5KB with 10 records per batch. The average fetch size is 7KB with 14 records. This means some consumers get 1 record batch per fetch and some get 2 record batches per fetch. Let’s increase the batch size. To do this we’ll drop to 1 producer, and set the linger.ms to 10 to reach the default batch.size of 16KB batches. We see that the batch size has risen to the default of 16KB, or 32 records per batch. The consumers should now, on average, receive 32 records per fetch (2 above the max.poll.records). Fig 4. The record batches sent by the producers increase from 5.5 KB to 16 KB The coordinator output shows that the consumers are still coping, as expected. With 500b records, the number of records returned per fetch will be 32 which is close enough to the max.poll.records of 30 to not impact consumption. Now let’s double batch.size to 32786. From a separate terminal window to the coordinator output, we’ll run the following: We see the batch size increase again in the dashboard. Fig 5. The record batches sent by the producers increase from 5.5 KB to 16 KB to 32 KB The coordinator output shows that the consumers are no longer keeping up! Only managing 37K msg/s with a fast growing backlog. The problem is that each partition has an inflight budget of 2000 records and each record batch contains 64 records. That allows up to 31 effective consumers per partition (2000 / 64), leaving 21 consumers starved at any point in time. This explains the 37K msgs/s: We can fix this problem in three ways: Set in the producer. Increase to create a larger inflight budget We already know the default 16KB batch size is ok. Let’s first increase the inflight budget. We’ll double the budget and see what happens. First we’ll stop the producers and remove the processing time on the consumers to drain the backlog. Next we need to update the broker config and restart the brokers. In we add: Then we’ll redeploy Kafka (again from a separate terminal window). Now we’ll start the producers again and apply the 5 ms processing time to the consumers. We’re in business! The consumers are now coping with the larger batch sizes with this increased inflight budget. This time we’ll try . First let’s walk back that inflight budget change by  1) stopping the producers, 2) commenting out the added line to our broker config, 3) redeploying Kafka. While the producers are still stopped, I’ll change the consumers to use : Then start the producers again: In the coordinator, we see that the consumers are now coping with the 60K msg/s. The reason that allows the consumers to keep up, despite the larger record batches, is that each consumer is only allocated a max of 30 records per fetch, even though each batch contains 64 records. However, each batch is now being delivered three times as 30 doesn’t align well with 64. We can see this in the Kafka client metrics. Fig 6. On the left, with the larger inflight budget and batch_optimized. The middle was when we stopped the producers to restart Kafka with the original inflight budget. The right is with record_limit and each batch being sent three times. We could make this more efficient if we increase to 32 to align with the 64 record batches. If I simply change the to 32, we don’t see much of an improvement as most offset ranges of 32 records will touch two batches. But if we stop the producers, ensure there is no backlog at all then set , the fetches will be perfectly aligned. Fig 6. On the left, with unaligned fetches with max.poll.records=32 (each batch delivered 3 times). Right: aligned fetches with max.poll.records=32 (each batch delivered 2 times). Let’s not over-index on this one case. The purpose of this post was to explain the underlying mechanics and back that up with some empirical benchmarks, sticking with the same workload example as the last post. What we’ve learned: Consumer parallelism is impacted by more than just consumer count and . It is also impacted by: Record batch sizes (determined by the producers) The inflight budget ( ) The share consumer config Record acquisition is along batch boundaries with , and record ranges with . Record batches are the unit of delivery, so can cause consumer network bandwidth to increase because fetches likely will not align on batch boundaries causing batches to be delivered at least twice (more if is much smaller than the average number of records per batch). In the next post we’re going to look a bit closer at . ps: you can run this whole scenario with two terminal windows: Window 1 - kick off the benchmark (using the workload yaml described in the post) Window 2 - wait a few minutes then run the following bash script: Happy testing! If the application sets the property to batch_optimized or does not set it at all, the share consumer fetches records based on batch boundaries which may mean that the number of records returned may exceed the max.poll.records configuration property. The share consumer may also prefetch records and buffer them temporarily awaiting the application's next call to poll(Duration). If the application sets the property to record_limit, the share consumer fetches no more than records at a time and does not prefetch. This is slower but gives the application tighter control on how many records are fetched and when the acquisition locks begin. With , the config is a soft cap. The consumer acquires any batches (in their entirety) that are covered by the offset range determined by . These acquired batches are returned to the consumer, and the consumer returns the records of those batches to the calling application (that invoked ). With , the config is a strict cap. The consumer only acquires the records that are covered by the offset range determined by (though less if less records are available). However, the unit of data delivery is the record batch, so the consumer receives whole batches but only returns a specific offset range to the calling application. c1 acquires record 0 c2 acquires record 1 c3 acquires record 2 With , the effective consumer parallelism can be impacted by the average number of records per record batch. With , the network throughput will increase in most scenarios as offset ranges are unlikely to align with batch boundaries. If the is larger than the average number of records per batch, then each batch may only be delivered twice. The network throughput can a lot if the is much smaller than the average number of records per record batch. Don’t worry if that isn’t clear yet, we’ll gather some empirical results next which should make it clearer. Set in the producer. Increase to create a larger inflight budget Consumer parallelism is impacted by more than just consumer count and . It is also impacted by: Record batch sizes (determined by the producers) The inflight budget ( ) The share consumer config Record acquisition is along batch boundaries with , and record ranges with . Record batches are the unit of delivery, so can cause consumer network bandwidth to increase because fetches likely will not align on batch boundaries causing batches to be delivered at least twice (more if is much smaller than the average number of records per batch).

0 views

Kafka Share Groups and Parallelizing Consumption — Part 1: Tuning max.poll.records

All tests were executed against Kafka 4.2.0 using Dimster (and also validated against 4.3.0).  In the last post we measured the overhead that the mechanics of share groups adds, and saw that it is pretty small. Likewise we saw that raw throughput was also comparable to consumer groups and even saw it exceed consumer group throughput on one test. In this post we’re going to simulate processing time in the consumers to make these benchmarks more realistic and show the utility of share groups (namely the ability to parallelize processing beyond the partition count). We’ll see how the following two configurations play an important role in parallelizing consumption with share groups: max.poll.records (consumer config) group.share.partition.max.record.locks (broker-side config) If we know the average processing time and the number of consumers, we can calculate the theoretical max throughput of a topic: For example: If we have 100 consumers, with an average processing time of 5 ms, then the maximum throughput will be 20,000 message/s. If we have a topic which peaks at 60K message/s and our average processing time is 5 ms. We’ll need 300 consumers to handle that. If we use consumer groups we’ll also need 300 partitions.  > Of course we could do some fancy concurrent work in the consumer to parallelize the consumer work but that comes with some downsides, principally that consumer groups track a position in the log which doesn’t map well to concurrently processing multiple positions in the log simultaneously, should the consumer encounter problems, abruptly terminate or get reassigned partitions. The ParallelConsumer does some clever tricks to handle this. With a share group we don’t need 300 partitions, we could have a handful of partitions with a group of 300 consumers and it should handle the load.  Let’s test it out with share group config defaults . Like last time, I’m going to ensure that load is even across the partitions so that load skew doesn’t pollute the results (I’ll be looking at load skew in a future post). See the last post for how I did that. We’ll use Dimster’s live interaction feature to model the workload on the fly, to see the impact of changing configurations and consumer counts. Fig 1. Dimster supports mutating the running workload. The max theoretical throughput of 300 consumers with 5 ms processing time is 60K msg/s so we’ll start with see what happens. Remember, Dimster uses named environments where commands take the format: , my environment is called localBeefy (detailed in the last post). Note: I prepend workload files with to so they don’t appear as untracked files in Git. The coordinator output shows: Straight away we see that consumption is really low at 4800 msg/s, nowhere near 60K msg/s. Let’s stop the producers and set processing time to 0 ms to allow the consumers to catch up so we can try again. From a separate terminal window: Viewing the coordinator output in the other terminal window, we wait until the backlog is drained, then slowly ramp up the producer rate to 50K msg/s. Then from the live interaction terminal: The coordinator output shows the consumers are now managing 50K msg/s. Much better than 4800 msg/s: However, it soon degrades, dropping to 20K msg/s and eventually back down to 4800 msg/s. Fig 2. The ramp up to 50k msg/s, followed by a reduction back down to 4800 msg/s What on earth is going on? The metrics tell a story. The max poll size (the number of records the consumer is returned after calling poll) is 500. Most calls to poll return few records, with p50 at 8. But the max tells us some return 500. Fig 3. The number of records returned by consumer.poll() The Kafka client metrics show that the average records per fetch response batch climbs toward 450. Fig 4. The records per fetch start low and grow larger and larger What we’re seeing is that most consumers aren’t getting very many records, but a tiny number are getting a lot. When the throughput was low and growing, the records per fetch were low (up to 10). But then the average records per batch started creeping up (while the consumption kept dropping) until the average fetch size was around 450 records. The average is high despite only few being full because most fetch requests sit idle until they can be serviced (by default up to 500 ms). Fig 5. The average fetch latency creeps up and almost reaches the default fetch.max.wait.ms of 500. It’s clear that the default of 500 is at play here. There is an interesting phenomenon here: at low producer rates, the broker does not have enough available records to fill each consumer’s , so each poll tends to return a small batch. Since many consumers are polling, processing, and acknowledging at roughly the same cadence, the available records get spread across the group. The result is an accidental fair-sharing regime : lots of consumers are active, each processing small batches, and aggregate throughput can approach the theoretical maximum. But this regime is fragile. It is not guaranteed by the broker-side allocation policy. Once enough records are available to fill large polls, the greedy allocation behavior takes over. A small number of consumers can acquire large batches, occupying the partition’s inflight record budget while the rest of the consumers sit idle. We’ll call this the greedy-capture regime , as a few consumers greedily capture the inflight window. This regime works as follows. The broker config determines how many records can be locked/inflight per partition, and defaults to 2000. With the default of 500, a single consumer can acquire 25% of that budget. At 5 ms per record, that one batch takes 2.5 seconds to process. While those records are locked, other fetches may sit idle even as new records arrive. This creates a feedback loop: larger batches consume more of the inflight budget, queued fetches wait longer, lag builds, and future fetches are more likely to be filled with large batches. Eventually the group collapses into the greedy-capture regime. We see this in the behavior above . With an inflight budget of 2000 messages and large fetches of 500 records, we only have 4 effective consumers per partition at a time. Across 6 partitions we only have 24 effective consumers each able to process 200 messages a second, resulting in an aggregate 4800 msg/s (exactly the number we’ve seen). To test whether the fair-sharing state was actually stable, I tried ramping only to 30K msg/s and holding it there. I left it for ten minutes and it remained stable. Then I restarted the consumers. Sure enough, throughput dropped back down to 4800 msg/s again. Fig 6. Fair-sharing regime collapses after a consumer restart Why go on about this accidental fair-sharing? Because the system can appear healthy under a slowly changing throughput and a moderate load because it has entered accidental fair-sharing, despite a bad choice of max.poll.records . I imagine this could trip some people up. Consumption may look fine for a long time, but suddenly degrade causing some head scratching and stress! The solution here is simple: reduce max.poll.records .  In theory we should carve up the inflight window between all consumers. So let’s take the configured and divide it by the number of consumers per partition. In our case, we are using the default of 2000. With 50 consumers per partition, we should set to . First, let’s drain the backlog. Once the backlog is drained, let’s set . This causes all 300 consumers to restart with the new config. Now we’ll attempt 60K msg/s with 5 ms processing time abruptly, no ramp up. The coordinator output shows: We’re close, about 55K msg/s consumption, however this soon drops to 45K and remains stable there. It seems that 40 was still too high as it did not account for all the overhead of the fetch/response time, the timing of commits, etc. After dropping to 30 and finally we hit 60K msg/s consumption! But the coordinator output shows that end-to-end latency is growing, little by little, it still isn’t quite keeping up. Let’s add 2 more consumers per partition (300 -> 312 consumers). The coordinator output now shows that end-to-end latency has dropped and stabilized. At this point, the benchmark has a few minutes left. We can discard all the cumulative latency histograms so we record the last minutes with this stable configuration. The final e2e latency distribution for 10 or so minutes with 312 consumers and is: Fig 7. End-to-end latency distribution of 60K msg/s, 5 ms processing time and 312 consumers Rules of thumb:  Set by taking and dividing it by the number of consumers per partition . Then set it somewhat lower to leave room for timing variance, uneven fetch timing, partition skew, and transient backlog. If you have very long processing time (over 1 second) you can even drop max.poll.records to 1 as the cost of a fetch is dwarfed by the processing time. You can also try increasing the group.share.partition.max.record.locks ( max of 10000) which will allow for a larger inflight budget and be more forgiving of a suboptimal max.poll.records. Now armed with a good rule-of-thumb, we’ll run two scenarios with Dimster’s explore limits mode , a benchmark mode for finding the highest sustainable throughput (see the last post for how it works): Fig 8. All test points achieved 57,000 msg/s while staying under p75, 100 ms end-to-end latency. All 5 workloads achieved 57K msg/s, just short of the max theoretical throughput (likely due to the latency constraint of explore mode). Adding some more consumers would be enough to reach 60K msg/s. Next, with 1 ms processing time. Fig 9. Share groups with 12+ partitions reached 95% of the theoretical max consumption throughput. Share groups with 12, 30 and 60 partitions did best, reaching 95% of the max theoretical throughput. The reason 6 partitions fared a little worse is likely due to contention over the inflight window (6 * 2000 records). The higher partition tests had a larger window across the same number of consumers. I expect the consumer groups could have gotten higher throughput, just not within the latency target of the test (100 ms, p75, based on the worst partition). First of all, I hope you see how useful live interaction using Dimster is! You can mutate a live workload to explore the impact of changing client configurations, producer rate, the number of producers and consumers, all on the fly. Once you have a topology you want to record stats for, clear the stats, set a new running time and get all the usual Dimster results. You can download the results from this blog post: tarball for the interactive session tarball of the explore limits run Regarding Kafka, the important lesson is that share groups change the parallelism bottleneck. With consumer groups, it’s the partition count. With share groups, it’s easy to think it simply comes down to the number of consumers, but it’s a little more complicated than that. Parallelism is determined by the inflight budget and the size of fetch requests .  Setting carefully might seem obvious, but I think it could trip people up for a few reasons: Defaults can come into play easily, especially for people without a lot of Kafka experience. The greedy behavior is not necessarily obvious (especially in terms of message queues in general). Synthetic benchmarks with 0 processing time will miss this (4 consumers per partition can handle whatever you throw at the partition). Only once you add processing time to a benchmark does the relationship between the inflight budget and fetch size become apparent. This greedy algorithm makes a very important configuration for share groups and the default of 500 is arguably the wrong value for share groups. It would be nice for a future version of Kafka to offer an alternative which enforces fair-sharing broker-side. I’ve posted this sentiment to the dev mailing list. Next : isn’t the only config that determines the size of consumer fetches ! In the next post we’ll look at the role of the following (in terms of how they can affect consumer fetch sizes and therefore the parallelism of share groups): Producer batch sizes. Share group consumer config :  (default, used in this post) max.poll.records (consumer config) group.share.partition.max.record.locks (broker-side config) If we have 100 consumers, with an average processing time of 5 ms, then the maximum throughput will be 20,000 message/s. If we have a topic which peaks at 60K message/s and our average processing time is 5 ms. We’ll need 300 consumers to handle that. If we use consumer groups we’ll also need 300 partitions.  Set by taking and dividing it by the number of consumers per partition . Then set it somewhat lower to leave room for timing variance, uneven fetch timing, partition skew, and transient backlog. If you have very long processing time (over 1 second) you can even drop max.poll.records to 1 as the cost of a fetch is dwarfed by the processing time. You can also try increasing the group.share.partition.max.record.locks ( max of 10000) which will allow for a larger inflight budget and be more forgiving of a suboptimal max.poll.records. tarball for the interactive session tarball of the explore limits run Defaults can come into play easily, especially for people without a lot of Kafka experience. The greedy behavior is not necessarily obvious (especially in terms of message queues in general). Synthetic benchmarks with 0 processing time will miss this (4 consumers per partition can handle whatever you throw at the partition). Only once you add processing time to a benchmark does the relationship between the inflight budget and fetch size become apparent. Producer batch sizes. Share group consumer config :  (default, used in this post)

0 views
Corrode 1 weeks ago

Migrating from Go to Rust

Out of all the migrations I help teams with, Go to Rust is a bit of an outlier. It’s not a question of “is Rust faster?” or “does Rust have types?”, Go already gets you most of the way there. The discussion is mostly about correctness guarantees , runtime tradeoffs , and developer ergonomics . A quick disclaimer before we start: this guide is heavily backend-focused . Backend services are where Go is strongest, small static binaries, a standard library focused on networking, and an ecosystem of libraries for HTTP servers, gRPC, databases, etc. That’s also where most teams considering Rust are coming from (at least the ones who reach out to me), so I think that’s the comparison that’s actually useful in practice. If you’re writing CLI tools, embedded firmware, or game engines, some of this still applies, but to be honest, I’m afraid this is not the best resource for you. For context, I’ve written about Go and Rust before: “Go vs Rust? Choose Go.” back in 2017, and later the “Rust vs Go: A Hands-On Comparison” with the Shuttle team, which walks through a small backend service in both languages. What you will learn in this article I’ll be upfront: I’m not a fan of Go. I think it’s a badly designed language, even if a very successful one. It confuses easiness with simplicity , and several of its core design tradeoffs ( everywhere, error handling as a discipline rule rather than a type, the long absence of generics) point in a direction I disagree with. That said, success matters! Go has captured a real and persistent share of working developers, hovering around 17–19% in the JetBrains Developer Ecosystem Survey. Rust is growing steadily but is still a smaller slice: Go is clearly working for a lot of people, and a guide that pretends otherwise isn’t helpful. So I’ll do my very best to be objective in this guide rather than relitigate old arguments. But you should know my priors so you can calibrate. The other prior worth disclosing: I run a Rust consultancy; of course I’m biased! More people using Rust is good for my business. But I’ve also worked in both languages professionally and shipped Go services to production. This guide is for Go developers who want an honest, side-by-side look at what changes when you move to Rust. For a deliberately opposite take, I recommend reading “Just Fucking Use Go” by Blain Smith. Holding both views in your head at once is more useful than either one alone. If you prefer to watch rather than read, here’s a video from the Shuttle article above, read and commented by the Primeagen: Go developers already have one of the cleanest toolchains in the industry. Back in the day, it started off a trend of “batteries included” toolchains that give you a single, consistent interface for building, testing, formatting, linting, and managing dependencies. I’m glad that Rust followed suit, because it’s a great model. It’s one of my favorite parts about both ecosystems. has even more built-in: The big difference is that in Go you typically reach for third-party tools ( , , , ) to fill gaps. In Rust, the first-party ecosystem covers more out of the box. Things that do require external crates (e.g. , ) install with one command and feel native, e.g. gives you right away. Both communities have converged on the same insight about formatters: a single canonical style, even an imperfect one, is worth more than the bikeshedding it eliminates. Gofmt’s style is no one’s favorite, yet gofmt is everyone’s favorite. — Rob Pike, Go Proverbs The same is true of : not everyone likes every detail, but the absence of style debates in code review is worth far more than the occasional formatting preference you’d have made differently. The headline is that Go and Rust are both compiled, statically typed, single-binary-deploy languages with strong concurrency stories. The differences are about what guarantees you get from the compiler and how much control you have over runtime behaviour . Go developers don’t usually come to Rust because Go is “too slow.” For most backend workloads, Go is plenty fast. People are generally a bit frustrated with Go’s verbose error handling, the danger of segmentation faults from pointers, and the lack of generics (for a long time) or any sophisticated type system features, such as enums or traits. Interfaces are not a worthy replacement for traits, and the Go standard library has some weird gaps, such as the lack of a type. I call it my billion-dollar mistake. It was the invention of the null reference in 1965 … This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years. — Tony Hoare, inventor of , QCon London 2009 This is the one I hear most often. You ship a Go service, it runs fine for months, and then one Tuesday at 3 a.m. a code path runs where someone forgot to check whether a pointer was , and the goroutine panics. Go’s compiler does not force you to consider the absence case. Rust’s does: You literally cannot dereference an without acknowledging the case. Whole categories of pager-duty incidents disappear. is a great tool, but it’s a runtime detector, it only finds races that actually execute during your tests. Mutating a map from two goroutines without a lock compiles fine in Go and only blows up in production under load. In Rust, sharing mutable state across threads requires types that implement and . Try to share a plain between threads and the program does not compile . You’re forced to wrap it in an , an , or use a channel. That race condition becomes a type error. 1 is fine for a while. After a few years, you notice three things: It’s worth being honest about the counter-argument here, since it came up in the Lobste.rs thread on my Shuttle article: experienced Go developers point out that and catch most of the “forgot to handle the error” cases in practice, and that explicit is easier to read than dense chains. Both points are fair, and the explicit style is a deliberate cultural value, not an accident: I think that error handling should be explicit, this should be a core value of the language. — Peter Bourgon, GoTime #91 , quoted in Dave Cheney’s Zen of Go My take is that lints are an opt-in safety net you have to remember to set up, while Rust’s is the type signature itself, there’s no way to forget. The boilerplate-vs-readability tradeoff is more genuinely subjective. The operator handles propagation; handles wrapping; and a on is exhaustively checked . Add a new variant tomorrow and the compiler shows you every place that needs updating. Go got generics in 1.18, and they’re useful, but the implementation has constraints (no methods with type parameters, GC shape stenciling, occasional surprising performance characteristics). Rust generics monomorphize, each instantiation produces specialized code with zero runtime cost. Combined with traits, this gives you real zero-cost abstractions. This matters less in handler code and more in shared infrastructure (middleware, generic repositories, decoders, parsers), where Go often pushes you back to / plus type assertions. Go’s GC is excellent, concurrent, low-pause, well-tuned for typical service workloads. But “low-pause” is not “no-pause.” Under heavy allocation, P99 latency tails are noticeably worse than a Rust equivalent that simply doesn’t allocate on the hot path. I won’t oversell this, for the vast majority of services, Go’s GC is a non-issue. But for latency-sensitive systems (trading, real-time bidding, network proxies, high-throughput ingestion), the lack of GC pauses is a genuine selling point. Go is death by a thousand paper cuts. It is a very pragmatic language and if you are willing to glance over the above issues, you can be very productive in it. But at a certain codebase size, the problems start to compound. There is no single moment when Go loses its appeal, but teams find themselves wishing for more (more safety, more control, more expressiveness) and that’s when they start looking around for alternatives. The fastest way to feel comfortable in Rust is to map patterns you already know. For a longer, fully-worked example of building the same backend service in both languages, see the Shuttle comparison , the section below focuses on the patterns that come up most often. The operator does the dance for you, including type conversion if is implemented (idiomatic with ’s ). There is no in safe Rust. References can’t be null. Pointers can be, but you almost never use raw pointers in application code. Go’s interfaces are structural, a type satisfies an interface implicitly: Rust’s traits are nominal, you implement them explicitly: The Go style is great for ad-hoc duck typing. The Rust style is great for refactoring and discoverability, you can grep for every implementer of a trait. The closest equivalent of / in Rust is , but you almost never want it. The Go community knows the cost of reaching for too: interface{} says nothing. — Rob Pike, Go Proverbs Generic functions with trait bounds ( ) cover the vast majority of cases and give you monomorphization with no runtime dispatch. Where Go pre-1.18 would have forced you back to plus a type assertion, Rust’s traits + generics let you stay specific. When you do want runtime dispatch (e.g. heterogeneous storage of different implementers), reach for or . That’s the direct Rust analog of holding an value in Go. Go’s concurrency model is famously simple: Goroutines are cheap, the runtime schedules them across OS threads, and channels ( ) are the primary coordination primitive. The Go proverb captures the philosophy: Don’t communicate by sharing memory; share memory by communicating. — Rob Pike, Go Proverbs This is the area where Go genuinely shines, several commenters in the Lobste.rs discussion made the point that goroutines “just disappear” into normal-looking blocking code, and that’s worth giving Go credit for. Rust async is more powerful, but it’s also more visible in your code. Rust uses / on top of an executor (almost always for backend services): The shape is similar. The differences: For most backend code, the day-to-day feel is similar: spawn a task, communicate via channels, use timeouts liberally. In Go, you plumb a through every blocking call: Rust has no built-in . The closest equivalent for cancellation is : For timeouts, wraps any future. For deadlines/values, you typically pass them as explicit arguments or via spans rather than a single context object. Some Go developers miss the implicit-feel of . In practice, the explicit Rust style is easier to reason about, you always know exactly what’s cancellable and what isn’t. The deeper point is that neither language gives you cancellation for free, the discipline just shows up at different layers: Go doesn’t have a way to tell a goroutine to exit. There is no stop or kill function, for good reason. If we cannot command a goroutine to stop, we must instead ask it, politely. — Dave Cheney, The Zen of Go In Go that “asking politely” is a plumbed through every call site by convention. In Rust it’s a (or a channel) plumbed through every call site, but the compiler can actually tell you when you forgot. Both languages have channels. The translation is direct: Rust’s channels distinguish sender and receiver as separate types, which makes ownership and -ness explicit at the type level. Rust’s is the equivalent of a Go value receiver; is a pointer receiver with mutation. Owned (consuming the value) has no Go analog and is occasionally very useful (typestate, builders). Go’s is a UTF-8 byte slice with copy-on-assign semantics (the header is copied, the underlying bytes are shared and immutable). Rust splits this into two types: As a rule of thumb, take in arguments, return when you produce new data. This is mostly painless once you internalize it. The vs split is a microcosm of Rust’s broader “borrow vs own” model. Go got generics in 1.18 (March 2022), thirteen years after the language shipped. They are useful, but they feel tacked on, and in practice they have most of the downsides of a generic type system without delivering the upsides you’d expect coming from Rust, Haskell, or even modern C++. This is a strong claim, so let me back it up. The most telling signal is that three years after generics landed, Go’s own standard library still mostly avoids them. still takes a closure instead of a constraint. is still typed as / . The generic helpers that do exist live in a small handful of packages: , , , and a few entries under . Compare that to Rust, where generics permeate the standard library from day one: , , , , , / , , , every collection, every smart pointer. You cannot write idiomatic Rust without using generics, because the standard library is generic. In Go, generics are an opt-in feature for library authors who really need them. In Rust, they’re the substrate everything else is built on. Rust’s generics are tied to traits, which double as the language’s mechanism for ad-hoc polymorphism, supertraits, associated types, blanket impls, and coherence. Go’s constraints are just interfaces with an extra operator for type-set membership. There are no: The practical consequence is that the moment your abstraction needs more than “a function that works for any with these few operations,” Go pushes you back to plus type assertions, code generation, or runtime reflection. Rust uses a Hindley-Milner-style inference engine that propagates type information through entire expressions, including across closures, iterator chains, and operators. You routinely write: and the compiler figures out is from the range, and is from the target. Go’s inference is much shallower. It can usually infer type parameters from function arguments, but it cannot infer from return-position context , cannot chain inference through generic builders the way Rust does, and frequently forces explicit type arguments at call sites: In Rust this is the exception; in Go it’s still common. Rust monomorphizes: every and produces specialized machine code with zero runtime dispatch. Go uses GCShape stenciling with dictionaries , where types that share a “GC shape” share the same compiled function and dispatch through a dictionary at runtime. The result is a compile-time/runtime tradeoff that often surprises people: generic Go code can be measurably slower than the equivalent hand-written non-generic version, because every method call on a type parameter goes through an indirection. There’s a well-known PlanetScale post showing exactly this. In Rust, generic code is the fast path. Reaching for (the equivalent of Go’s interface dispatch) is a deliberate choice you make when you want runtime polymorphism. This is the part that bothers me most. A good generics system removes reasons to fall back to escape hatches. In Rust, generics + traits eliminate most of what you’d otherwise need or runtime reflection for. The type system gets stronger. In Go, generics did not remove , did not remove , did not remove code generation as the dominant pattern for things like ORMs, decoders, and mocks. still uses reflection. still uses . still generates code. The places where a real generics system would shine are the same places Go reaches for runtime mechanisms it had before 1.18. Generics in Go feel additive, a new tool in the box that’s useful in narrow cases. Generics in Rust feel foundational; remove them and the language collapses. That’s the difference, and it’s why generic Go code, in my experience, doesn’t read better than the -based code it replaced; it just reads differently, with more punctuation. If you’re already opinionated in Go, the Rust ecosystem has converged to a similar level of “default picks.” For a typical backend service: + + + + + covers 90% of what you need. I want to be straightforward here. Coming from Go, you will hit a wall . The wall has a name. Go’s runtime handles memory and aliasing for you. Rust pushes that decision into the type system. The first few weeks you’ll write code that “should obviously work” and the compiler will refuse it. The patterns that bite Go developers most often: With all of these rules, the borrow checker truly sounds like a “gatekeeper” of sorts, which keeps getting in the way and is just overall frustrating to deal with. That is not the mental mindset you should have when learning Rust. The borrow checker truly uncovers real and very existing bugs in your code, and if you don’t address them, your program will deal with safety issues. So whenever you get a compiler error from , take a step back and think how your code could break. A few questions you can ask yourself: That is the mindset you need to understand the borrow checker. Humans are genuinely bad at reasoning about memory. We forget that pointers can be null, that old references can outlive the data they point to, and that multiple threads can touch the same data at the same time. We tend to have a “linear” mental model of how data flows through a program, but in reality it’s closer to a complex graph with many paths and interactions. Every condition forces you to consider what happens in both branches. Every loop forces you to consider what happens on every iteration. That is exactly the kind of reasoning the borrow checker is designed to do for you! It enforces best practices at compile time, and it can feel annoying when your own mental model disagrees with the borrow checker’s (which is the more accurate one 99% of the time). There are cases where the borrow checker is genuinely too strict, but they are rare, and as a beginner you’ll almost never run into them. I got memory management wrong plenty of times in my early days, but I approached it with a learner’s mindset , which helped me ask “what’s wrong with my code?” instead of “what’s wrong with the compiler?”, a reaction I see a lot in trainings. The good news is that once you internalize borrowing, it stops fighting you. Most experienced Rust developers will tell you the borrow checker became an ally somewhere between weeks 4 and 12. The first month is the hardest. Be honest with your team, Rust compile times are a real downgrade from Go’s. A clean release build of a medium service can take minutes in comparison to Go’s near-instantaneous compiles. Incremental builds and are reasonable and compile times have gotten much better over the years, but you’ll feel the difference. To mitigate, use in your edit loop, split into a workspace once it pays off, and keep proc-macro-heavy crates in their own crate so they only recompile when they change. See tips for faster Rust compile times for a deeper dive. Go’s “one type of function, sync everywhere, the runtime handles concurrency” is genuinely simpler than Rust’s split between and . You’ll need to think about which of your functions are async, where you , and how that interacts with traits. Async traits (stable since Rust 1.75) help a lot, but there are still rough edges (especially around with async methods). Rust’s crate ecosystem is growing and libraries are high-quality across the board, but Go has a head start in some backend-adjacent domains: Kubernetes operators, cloud-provider SDKs, database drivers for certain niche stores. Before you commit, spend a day checking that the libraries you depend on have Rust equivalents you’re willing to use. Teams I help often have to hand-roll at least one or two core libraries themselves. For example, they might have to update an abandoned crate for XML schema validation, or write their own client for a lesser-known protocol. You don’t have to rewrite everything in one go. The strategies that work best, in order of how I usually recommend them: If one specific service in your fleet is the perpetual problem child (high CPU, latency-sensitive, or constantly hit with reliability issues), rewrite just that one in Rust, behind the same API contract. This is the lowest-risk migration. Other Go services keep talking to it via HTTP/gRPC, oblivious to the underlying language. Background workers, queue consumers, ingestion pipelines, and CPU-bound batch jobs are excellent first targets. They typically have a clear input/output boundary (a queue, a topic) and no shared in-process state with the rest of the system. You can call Rust from Go via cgo, and there are good guides on how to do it . (Reach out if you’d be interested in a guide on this from me.) In practice, I rarely recommend it for backend services. The build complexity and FFI overhead usually outweigh the benefits compared to “just stand up a Rust service and put it behind a network call.” For libraries and CLI tools, it’s more viable. If you have an API gateway or reverse proxy, you can route specific endpoints to a new Rust service while the rest stays in Go. This works particularly well when one bounded context (auth, search, billing) is the right unit to migrate. The pattern is often called “strangler fig,” because the new service grows around the old one until it eventually replaces it entirely. Start with a service that has a clear boundary. Don’t pick the most central, most-deployed service in your fleet. Pick the one where the contract with the rest of the system is well-defined and the blast radius is small. Keep the same API contract. If your Go service exposes a REST API, your Rust service should too: same paths, same JSON shapes, same error envelope. The migration is invisible to clients, and you can swap traffic incrementally with a gateway. Don’t translate idioms verbatim. Resist the urge to write Go-flavoured Rust. becomes . Goroutine-per-request becomes only when you actually need it (axum already concurrently handles requests). Interfaces with one method usually become trait bounds on a generic, not . Use the compiler as a pair programmer. Rust’s compiler errors are usually pretty good. Read them slowly. They almost always tell you the right answer. The team members who struggle longest are the ones who fight the compiler instead of treating it as a collaborator. Invest in training early. I’ve seen teams try to do a Rust migration “on the side,” learning as they go. It rarely ends well. It’s a bit like training for a marathon by signing up for the race and then trying to run it without any prior training. You can do it, but it’s going to be painful and you might not finish. Block off real time for learning: a workshop, an online course , paired sessions on real code. The upfront investment pays back many times over once the team is fluent. (Hey, if you want to talk about training options, I’m happy to chat .) Not everything should be migrated. Go is excellent for: A hybrid strategy is fine and common. Many of the teams I work with end up with a polyglot backend: Go for the “boring” services, Rust for the ones where reliability and performance pay back the extra effort. Numbers vary wildly by workload, so take these as rough guidance. Not promises! But here are some ballpark numbers, based on Go-to-Rust migrations I’ve helped with: Honestly, you’re unlikely to get a 10x throughput improvement going from Go to Rust the way you might from Python. What you get is fewer “silly errors” and flatter latency tails, plus the ability to expand into other domains like embedded development or systems programming while still using the same language. That’s often the most surprising side-effect of a migration: there’s a lot of opportunity for code-sharing across teams that previously had to use different stacks. You can use Rust for everything. Going from Go to Rust is a different kind of migration than coming from Python or TypeScript . Coming from Go, you know the benefits of a statically-typed, compiled language. So you’re not trading away dynamic typing or a slow runtime, you’re trading away in exchange for a more robust codebase with fewer footguns, and a stricter compiler that catches more mistakes at compile time. There is a steeper learning curve, however. For foundational services (services that your organization relies on, that have high uptime requirements, that are critical to your business), that trade is obviously worth it. For others, Go remains the right answer. The point of a migration is to put each problem in the language that solves it best. Ready to Make the Move to Rust? I help backend teams evaluate, plan, and execute Go-to-Rust migrations. Whether you need an architecture review, training, or hands-on help porting a critical service, let’s talk about your needs . Rust’s type system doesn’t catch all data races, but types that truly can’t be shared between threads without synchronization won’t compile. You can still have logic bugs in your synchronization, but you won’t have the kind of “oh no, I forgot to lock this” that often leads to silent data corruption. ↩ Where Go and Rust overlap, and where they diverge. How Go patterns map to Rust. What you gain from the borrow checker. Where I tell people to keep Go and where Rust is worth the migration cost. How to migrate Go services incrementally. The boilerplate dilutes the actual logic of your function. Wrapping with is a discipline rule, not a compiler rule. It’s easy to drop context on the floor. Sentinel errors via / work, but the compiler doesn’t tell you when you forgot to handle a new variant. Rust async functions return s. They don’t run until awaited or spawned. The compiler tracks / across points. If you hold a non- value across an await, you get a compile error explaining exactly why. There’s no built-in goroutine-style preemption. Long CPU-bound work in an async task starves the executor; you offload to or instead. Channels ( , , ) are first-class but live in libraries, not the language. , owned, heap-allocated, growable. Equivalent to you intend to mutate. , a borrowed view into someone else’s string data. Equivalent to a Go parameter most of the time. Supertraits / constraint hierarchies. In Rust you write , and any automatically satisfies and . Go has no equivalent; you stack interface embeddings, but the constraint solver doesn’t reason about hierarchies the way Rust’s trait system does. Associated types. Rust’s has , so is a first-class thing you can name in bounds. Go’s closest equivalent is a second type parameter, which leaks into every signature. Blanket impls. In Rust, automatically gives every type a method. Go has no way to add methods to a type from outside its defining package, generic or not. Methods with their own type parameters. This is an explicit, documented non-feature in Go. You cannot write . In Rust, generic methods on generic types are routine. Long-lived references. In Go, you’d happily hold a from a map for as long as you want. In Rust, that borrow blocks mutation of the map for its whole lifetime. The fix is usually to clone, or to scope the borrow tighter. Self-referential structs. Common in Go (a struct holding both data and an iterator over it). In Rust, this requires , , or a redesign. Almost always: redesign. Sharing mutable state across goroutines. What you’d write as becomes . Slightly more verbose, much more checked. Returning references from functions. Lifetime annotations show up. They’re not as bad as their reputation, but they’re new. If a value got moved from one place to another, what would happen if the original place tried to use it again? If a value is shared across threads, what would happen if one thread modified it while another thread is using it? If a pointer is dereferenced , what would happen if it was null or dangling? When a value goes out of scope , what would happen if it was still being used somewhere else? Kubernetes-native tooling : operators, controllers, CRDs. The ecosystem is overwhelmingly in Go. CLI utilities and dev tooling : fast compiles, easy cross-compilation, simple deployment. Glue services : thin API layers, proxies, format converters. The boilerplate ratio in Rust isn’t worth it here. Anywhere your team velocity matters more than absolute correctness guarantees . CPU usage: 20–60% reduction. Less dramatic than Python-to-Rust, because Go is already efficient. The wins come from no GC and tighter loops. Memory: 30–50% reduction, mostly from the absence of GC overhead and a smaller runtime. P99 latency: significantly more consistent. Rust services tend to flatline where Go services have visible GC-induced jitter. (This has gotten much better on the Go-side ever since they introduced their low-latency GC, but the difference is still there under heavy load.) Production incidents: this is the one teams report most enthusiastically. The classes of bugs that survive and reach production (data races, nil dereferences, missed error paths) just don’t compile in Rust. Oncall rotations are typically very boring after a Rust migration. Rust’s type system doesn’t catch all data races, but types that truly can’t be shared between threads without synchronization won’t compile. You can still have logic bugs in your synchronization, but you won’t have the kind of “oh no, I forgot to lock this” that often leads to silent data corruption. ↩

0 views
Phil Eaton 1 weeks ago

Serving files over HTTP three ways: synchronous, epoll, and io_uring

This is an external post of mine. Click here if you are not redirected.

0 views
Karboosx 1 weeks ago

How to actually handle database transactions (and why your ORM fails at it)

Wrapping database calls in a simple try-catch block is just asking for trouble. Here is a practical look at how to handle transactions properly, avoid the dreaded ORM lost update, and keep your data actually safe.

0 views

Flat/Non-higher-order construction of ANF via mutable holes

This post will describe an algorithm for constructing an ANF/similar IR from a functional-like base language with no closure usage, and with no related excess space requirements. This algorithm is as far as I know not new, and has apparently been known in imperative compilation for a while. However, I admittedly have no source for this, nor have I seen it described for functional-like languages. Thanks to Ryan Brewer and rpjohnst for their inspiration on this. If you are only interested in code, please skip to Implementation. First, let us describe our input and output languages. Our input language will look like this: This is meant to echo some simple functional programming language, and contains all the elements needed to demonstrate the algorithm. Our output language will look like a lexically-scoped flat ANF-style construction, with all lets at the toplevel. The goal is to convert from our input to this language. The notable addition is the constructor in , the purpose of which will be described now. Imagine we are trying to convert the term When we arrive at the body of , we have a conundrum. We need to put the conversion of before , so that using is well-scoped. Hence, we must convert before we can continue. How do we use this conversion, though? 's content needs to be put in the field of 's , so it's not clear how to both convert before and to use 's finished conversion in . One "traditional" method of solving this is via continuations, where is passed a closure representing "what to do next", after it converts its body. We can then put 's content in that closure, which will be invoked later. This has two main flaws: This method resolves these. Recall that we left a field in our type. I will denote a hole like this: . The central idea is that our conversion of any given fragment will return: Consider this simplified example: When we convert , this signals for the conversion of . There are two steps to converting a - we must first convert the "head" ( ), and then the "body" ( ). Converting the will result in something of form Note that the hole is left to be filled later. This way, the conversion of 's head need know nothing about its environment. We represent this in code via the constructor, which contains a reference to a fragment that either may be empty ( ) or filled ( ). The conversion of the body can then proceed by first generating the expected addition, binding it to so it may be referenced later: We now need to get this fragment into the proper lexical scope. We can do this by plugging it into the hole left by the conversion of the head: Then, returning this fragment, the name bound ( ), and the reference to the inner hole, we can continue converting ; First, convert the head, using the name returned by the conversion of : Then, the body: And finally plug it into the hole returned by . We are left with a final hole we can plug with to be explicit about the end of a control flow path. One disadvantage of this strategy is that it does generate many unneeded names. Most of these can be avoided by carefully special casing certain , and the rest can be eliminated via a simple folding pass. Below is a sample implementation of this algorithm, in OCaml. I recommend toying with the conversion yourself, using or similar. In the following, I will "clean up" the output slightly, as to not make it unreadable. If we use our first example term: We can then plug the returned hole, as described. Via some inspection, we can verify this is indeed correct, and conforms to our ANF-like form, as expected. As mentioned earlier, the large amount of duplicate names can be mitigated through strategies such as special casing and folding passes. Clearly, the algorithm does not use closures, or any indirect jumps. It does not allocate any more than needed for the representation of the output program; there is no excessive closure allocation. I do not currently have a good way to test performance on real-world large inputs, but I have no reason to believe it would not perform better than an equivalent higher-order version. I believe this method should be extendable to CPS as well, although I have not tried as of the moment. The "usual" higher-order CPS conversion described in e.g. "Compiling with continuations, continued" looks quite similar to the higher-order presentation of ANF conversion, so I would be interested in whether it can be similarly massaged into a mutable form. It uses excess space. A conversion of a term will allocate ~2x the memory "needed"; it must allocate for the ANF construction itself, and for the closures needed during the construction. It is slower, due to the aforementioned closure allocation and application. The fragment itself. A name we can use to refer to what was bound in the fragment (In the conversion of addition/similar with nested content, names must be conjured.) The reference created by the hole, so that it may be filled.

0 views
Max Bernstein 2 weeks ago

Partial static single information form

In compilers, static single information form (SSI) is a common extension to static single assignment form (SSA). It was introduced by C. Scott Ananian in 1999 in his MS thesis (PDF) 1 . SSI extends your existing SSA intermediate representation by discovering facts from your existing program and reifying them as path-dependent/flow-sensitive IR nodes. That might sound complicated, but at least the basic idea is pretty natural. I talk a little bit about it in What I talk about when I talk about IRs and I’ll rehash here in more depth, starting with some motivating examples. Consider this admittedly contrived example: We should be able to learn from the comparison that in some branches in the IR, is positive. In that region, we can add a new IR instruction that attaches that knowledge right in the instruction’s type field (yay, sparseness!) and then rewrite uses of to now use . Because we’ve done that, our (imaginary) optimization rule that gets rid of on known-positive integers can kick in, and we can delete the invocation of . Yay, optimization! But a couple of questions remain, at least for me: We’ll go through them, starting with the compiler pipeline. The original SSI paper starts with (I think?) SSA form and places some number of new refinement nodes based on conditionals. I have admittedly not tried very hard, but the into-SSI algorithms look complicated and kind of heavyweight. As a reward, you get “linear” into-SSI time complexity. But I am a humble compiler engineer, and I don’t have the time to go through and load all of this into my head. Instead what I have seen done and have been doing is to take a shortcut: build partial SSI during SSA construction 2 . Most of the time this is from bytecode, but it could also be from some other non-SSA IR. In any case, this is an excellent shortcut for two reasons: This is pretty compelling. We can learn from the bytecode with a very small amount of marginal new complexity. See my implementation in ZJIT , for example. All it really does is modify the abstract interpreter state when building SSA out of , , and bytecode instructions to take into account the new refined values. This is fine for branches that are already in the user’s source program but sometimes optimization, especially of dynamic languages, adds new branches that were not there before. And sometimes these branches get added much later, long after SSA construction. What then? Can we do something similar and rely on existing infrastructure? Implicit in this “can we do it” is the assumption that your IR tracks data dependencies from use to corresponding def, but not from def to uses. Sea of Nodes (at least the Simple implementation), is an IR that tracks both directions all the time for easier rewriting. Many IRs do not do this, so we will continue assuming that there’s no “easy way out”. JIT optimization of dynamic language compilers often adds synthetic instructions to the IR that enforce pre-conditions. These guards allow optimizing happy/fast path cases in JIT code while leaving the interpreter as a fallback. For example, we might be able to optimize two back-to-back instructions (a very dynamic operation in the world of ideas, but fast when concretely implemented using object shapes) from: which is very generic and involves calling into C code that might raise an exception, to something more like: which is much faster (assuming shape stability at run-time). There’s an irritating problem, though, which is that we have a bunch of duplicate instructions littered around the IR now because our optimizer worked on each instruction individually. Kind of a “template optimizer” situation. Now we need some pass to clean up the detritus. Global value numbering (GVN) will do a good job of de-duplicating instructions. It should notice that we already have an instruction that looks like called and rewrite into . That’s great because we have de-duplicated the guard. GVN may not get everything, though; if some instructions later use , they will not get rewritten to instead use the output of these new guard instructions. To do that, we need to add some kind of pass or augment GVN with some canonicalization feature. That canonicalization would handle rewriting operands to use the “latest version” of some value, so to speak. See the canonicalization section of Chris Fallin’s excellent aegraphs blog post for more (and of course the (currently block-local) implementation in ZJIT ). Where I’m going with all of this, though, is that you may already have some dominance-based instruction rewriting mechanism in your compiler, either as part of GVN or separately! And you can use this to do a very low code into-partial-SSI in the middle of your optimizer. This means you could very well get away with inserting instructions in successor blocks of conditionals and get the into-SSI “for free”. That’s up to you. There’s a trade-off between compile-time and run-time, especially in JITs. Inserting more instructions and rewriting more times may slow down your compiler. It’s a cheap lunch, not a free one. I don’t know. I don’t have a good grasp of how this “partial SSI” compares to the “full SSI”. I don’t plan on implementing full SSI in the near future. I will note that this partial SSI approach doesn’t do two things: I can’t tell what impact this has. Like Simple, TruffleRuby is built on a Sea of Nodes IR (Graal). Chris Seaton has an excellent blog post about TruffleRuby’s use of “stamp nodes” (“Pi nodes” 3 ). The function does a lot of heavy lifting, I think because Graal tracks uses. Cinder mostly inserts instructions in the HIR builder, before into-SSA, and then lets the SSA construction take care of things. That’s where I learned this trick, actually. Here is one example of refining the type of the matched operand when building IR for pattern matching. Luau is working on something like this, but for their type checker. Chatting with someone on their team is actually part of the reason I got motivated to write this post. Android ART looks like it has HBoundType and inserts them in reference type propagation . This handles class checks, null checks, and instanceof checks. Last, I want to talk a little bit about some interesting reasoning you can do when you have two implementations of something that you can switch between. For example, JIT (+ interpreter), or aliasing and non-aliasing cases in C code, or the weirdo NULL-UB reasoning LLVM can do to C code, things like that. In ZJIT, we currently insert s opportunistically in “easy” cases when building our HIR from the interpreter bytecode. For example, if in the bytecode there is a branch that compares some value with , it will have two outgoing control-flow edges: one block where is definitely , and one block where is definitely not . In each of these control-flow edges, we can insert corresponding type refinement hints. That’s pretty standard. But we can also do weirder stuff. CRuby has a notion of heap objects vs immediate objects. Many (most?) objects are heap objects. However, integer , for example is not allocated on the heap but instead represented by a tagged bit pattern that pretends to be an address: the whole value is encoded in the pointer itself. We encode this knowledge in the HIR’s type system: “heapness” and “immediateness” each get a bit in the type lattice . We use this in the optimizer to reason about effects , among other things. We can’t know a lot of the time what type a thing is, so we pessimistically type most objects flowing through bytecode as . This type encapsulates the entire world of possible values that could go on the stack or in a local variable. On most heap objects, with only a few exceptions, you can write instance variables (fields, attributes, whatever you want to call them). You can never write an instance variable to an immediate. This means that if we observe the following pattern in the bytecode: Then after building and emitting HIR for the opcode, we can upgrade the type of from a to a . We can do this because if it weren’t a heap-allocated object, we would have left the compiled code and entered the interpreter. This is another SSI-type thing you can do in your compiler. Uhh I guess the conclusion is that you don’t have to do full SSI and partial SSI is available and not too scary? Does your compiler do this? Reader, please write in. …and optimized in 2002 (PDF), revisited in 2009 (PDF), implemented in LLVM in 2010 (PDF), investigated in 2017 for abstract compilation (PDF), and probably more. The 2009 paper by Boissinot, Brisk, Darte, and Rastello even shows that both Ananian and Singer’s papers have bugs, while perhaps unintentionally also making an excellent pun about the literature being “sparse”.  ↩ This blog post is different than the what the LLVM paper (PDF) calls partial SSI. Partial for different reasons. Maybe it’s not even single information anymore.  ↩ Today I learned that this terminology comes from the ABCD paper (PDF).  ↩ Where/when in the compiler pipeline do we insert and remove these type refinements? Do we need to refine after every conditional? Do we need to implement the whole into-SSI and out-of-SSI algorithms from all the complicated-looking papers? It lets me cleanly separate adding the type refinements (pretty straightforward) from the hard part of doing all of the operand rewriting and phi placement and marking and all manner of other nonsense. In addition to separating the concerns, the hard part is already done by SSA construction. We can actually just skip it! SSA construction handles phi placement, operand rewriting, all of it. It probably fits neatly into a naive or a Braun-style (PDF) construction. It doesn’t split variables with a new sigma node, and it generally inserts the refine node within the target block rather than above the branch (For only) It doesn’t insert new phi nodes; it just leaves both IR nodes available and, instead of re-merging, drops them …and optimized in 2002 (PDF), revisited in 2009 (PDF), implemented in LLVM in 2010 (PDF), investigated in 2017 for abstract compilation (PDF), and probably more. The 2009 paper by Boissinot, Brisk, Darte, and Rastello even shows that both Ananian and Singer’s papers have bugs, while perhaps unintentionally also making an excellent pun about the literature being “sparse”.  ↩ This blog post is different than the what the LLVM paper (PDF) calls partial SSI. Partial for different reasons. Maybe it’s not even single information anymore.  ↩ Today I learned that this terminology comes from the ABCD paper (PDF).  ↩

0 views
The Coder Cafe 3 weeks ago

Cache Use Cases Explained

☕ Welcome to The Coder Cafe! Today, we discuss cache use cases. When we think about caching, it’s pretty frequent to focus on where it happens; for example, client-side, server-side, or in a CDN. Yet, there’s a more important question that should be answered first: What’s the use case? In this post, we will break down two common cache use cases: reducing latency and improving capacity. And we will see why the line between the two is blurrier than it seems. Get cozy, grab a coffee, and let’s begin! A Cache for Latency Latency is the time between when a request is sent and when a response is received. A cache for latency exists to reduce the average latency of a service . The classic access pattern looks like this 1 : We check the cache first. On a cache hit, we return the data directly without touching the backend. On a miss, we go to the backend, return the result, and store it in the cache for future requests. Why does this reduce latency? The cache keeps data in memory, which is significantly faster to read from than a remote database that may involve network round-trips, disk I/O, and query execution. On a hit, all of that work is skipped. In Soft vs. Hard Dependency , we introduced two kinds of dependencies: A soft dependency is a non-critical dependency for the service to operate properly. A hard dependency is a critical dependency for the service to operate properly. A cache for latency is a soft dependency . If the cache becomes unavailable, requests fall through to the backend. The system keeps working, just at a higher latency. Keep this in mind, because it’s the key difference we’ll come back to. A cache for capacity exists to serve higher throughput than the backend can handle on its own. The access pattern is identical to the latency case: cache first, then backend on a miss. So what actually makes these two different? The difference is not in the code; it’s in what the backend can absorb. In a capacity scenario, the backend would be overwhelmed if it received all the traffic directly. The cache absorbs a large portion of the requests, keeping the backend load manageable. This changes the nature of the dependency . If the cache goes down, the backend is suddenly hit with all the traffic it was previously shielded from. Whether the system survives depends on the backend’s own capacity. If the backend can scale fast enough, the cache is still a soft dependency: there will be a rough period, but the system recovers. If the backend can’t cope with the load, the cache becomes a hard dependency . Without it, the system fails . Here’s a question worth asking: if the access pattern for both types is identical, how do we know which one we have? In most cases, caches are introduced to reduce latency. But here’s what can happen over time: Our system is stable. Cache hit rates are high, backend load is low. Traffic grows. The backend load stays low because the cache is absorbing most of it. Nothing breaks. No alerts fire. Six months pass. Nothing has changed, no code, no configuration, no architecture decision. And yet the cache is no longer reducing latency. It’s keeping the backend alive. The cache didn’t change. The code didn’t change. The system grew around the cache, and the cache quietly became load-bearing . The same risk appears when a cache goes cold. For example: A migration to a new cache instance A data format change that requires purging existing entries A cache restart after maintenance Any of these can produce a large wave of cache misses in a short window. If we were running a latency cache, we would see higher latency for a while. If we were running a capacity cache, we would see a traffic spike that the backend can’t absorb. The unsettling part is that the code is identical in both cases. The difference only becomes visible at failure time . The root problem is that teams often don’t know which type of cache they’re running . They built it for latency, and that’s still how they think about it, even as the system outgrows that assumption. A few approaches help here: Periodically ask: could the backend handle the current traffic if the cache were completely removed ? Load testing without the cache, or estimating backend capacity against current traffic levels, gives you a concrete answer. Treat cache hit rate as a meaningful operational signal , not just a performance metric. A sustained drop in hit rate means the backend is absorbing more traffic than usual. If that trend continues, it’s an early warning that you may be drifting toward a capacity problem. When migrating a cache or invalidating a large portion of its data, warm the new cache before routing live traffic to it. This prevents a cold-start burst from hitting the backend all at once. Finally, once we recognize that a cache is operating as a capacity cache , we should treat it accordingly. It’s no longer optional infrastructure and it deserves proper alerting and a clear plan for what happens if it goes down. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. A cache for latency serves data from memory to reduce average response time. It is a soft dependency: if unavailable, the system degrades in latency but continues to work. A cache for capacity absorbs traffic that the backend couldn’t handle on its own. It can be a soft or a hard dependency, depending on whether the backend can absorb the load without it. Both types share the same access pattern, which makes them easy to confuse. A latency cache can silently become a capacity cache as traffic grows, without any code change. When a capacity cache goes cold or fails, the backend can be overwhelmed. Hit rate monitoring, periodic load testing, and cache warming are practical ways to manage this risk. Availability Models Safety and Liveness The PACELC Theorem The Three Types of Cache Cache stampede Even though variations exist. A Cache for Latency Latency is the time between when a request is sent and when a response is received. A cache for latency exists to reduce the average latency of a service . The classic access pattern looks like this 1 : We check the cache first. On a cache hit, we return the data directly without touching the backend. On a miss, we go to the backend, return the result, and store it in the cache for future requests. A soft dependency is a non-critical dependency for the service to operate properly. A hard dependency is a critical dependency for the service to operate properly. Our system is stable. Cache hit rates are high, backend load is low. Traffic grows. The backend load stays low because the cache is absorbing most of it. Nothing breaks. No alerts fire. Six months pass. Nothing has changed, no code, no configuration, no architecture decision. And yet the cache is no longer reducing latency. It’s keeping the backend alive. A migration to a new cache instance A data format change that requires purging existing entries A cache restart after maintenance Periodically ask: could the backend handle the current traffic if the cache were completely removed ? Load testing without the cache, or estimating backend capacity against current traffic levels, gives you a concrete answer. Treat cache hit rate as a meaningful operational signal , not just a performance metric. A sustained drop in hit rate means the backend is absorbing more traffic than usual. If that trend continues, it’s an early warning that you may be drifting toward a capacity problem. When migrating a cache or invalidating a large portion of its data, warm the new cache before routing live traffic to it. This prevents a cold-start burst from hitting the backend all at once. Finally, once we recognize that a cache is operating as a capacity cache , we should treat it accordingly. It’s no longer optional infrastructure and it deserves proper alerting and a clear plan for what happens if it goes down. A cache for latency serves data from memory to reduce average response time. It is a soft dependency: if unavailable, the system degrades in latency but continues to work. A cache for capacity absorbs traffic that the backend couldn’t handle on its own. It can be a soft or a hard dependency, depending on whether the backend can absorb the load without it. Both types share the same access pattern, which makes them easy to confuse. A latency cache can silently become a capacity cache as traffic grows, without any code change. When a capacity cache goes cold or fails, the backend can be overwhelmed. Hit rate monitoring, periodic load testing, and cache warming are practical ways to manage this risk. Availability Models Safety and Liveness The PACELC Theorem The Three Types of Cache Cache stampede

0 views
daniel.haxx.se 1 months ago

Inspired

The picture was taken by mr Nasser and shared on social media In appendix A of the book Root cause: Stories and lessons from two decades of Backend Engineering Bugs , author Hussein Nasser has these wonderful words to say about me: Daniel Stenberg is a Swedish engineer and the creator of curl (cURL), one of the most widely used tools and libraries for fetching content over various protocols. I’ve always admired Daniel’s work, reading his blogs and watching his talks on YouTube. He is one of the engineers who inspired me to start my own YouTube channel and teach backend engineering. It warms my heart to read this. Words like this give me energy and motivation. My work has meaning.

0 views
Phil Eaton 1 months ago

Branimir Lambov from IBM on Cassandra

This is an external post of mine. Click here if you are not redirected.

0 views
Zak Knill 1 months ago

All your agents are going async

Agents used to be a thing you talked to synchronously. Now they’re a thing that runs in the background while you work. When you make that change, the transport breaks. For most of the time LLMs have been around, you use them by opening a chat-style window and typing a prompt. The LLM streams the response back token-by-token. It’s how ChatGPT, claude.ai, and Claude Code work. It’s also how the demos work for basically every AI SDK or AI Library. It’s easy to think that LLM chatbots are the ‘art of the possible’ for AI right now. But that’s not the case.

0 views
Ankur Sethi 1 months ago

A broken 404 template in Django can swallow your backtraces

I recently migrated this website from Astro to Wagtail . The reason why I did it is a story for another day. In this post, I want to talk about a bug that took me far too long to figure out. In his (verifiably incorrect) post about making chai , Abhigyan linked to my own (verifiably correct) post on the topic . While linking to my post, he accidentally omitted the trailing slash from the URL. This shouldn't have been a problem. By default, Django automatically redirects a URL without a trailing slash to the same URL with the trailing slash appended, provided the original URL returns a . For example, if you try to access the following URL on my website: Django automatically performs a redirect to: This is the default behavior, controlled by the setting . However, when Abhigyan linked to my (verifiably correct) post about making chai, my server returned a error instead. I'd never have discovered this error myself, but Shubh pointed it out to me on the IndieWebClub chat last week. Thanks Shubh! I started investigating the issue by checking the Gunicorn logs on my VPS. I was hoping they would contain a backtrace that would help me pinpoint the exact problem, but the logs only printed the string whenever the broken URL was accessed. I ran my app with production settings inside a Docker container to see if I could trigger the same behavior. And sure enough, the Dockerized app produced the same error with the same mysterious in the Gunicorn logs. My first instinct was that I had somehow messed up my logging configuration. I'd surely introduced a bug in some Python code somewhere, and my logging configuration was failing to log the backtrace because of a misconfiguration. But tweaking Django's setting didn't change anything. I could see backtraces from the exceptions I inserted at random points in my code, but accessing a URL without a trailing slash would still only produce the string in the logs. After a lot of head scratching, reading the docs, and yelling at Claude, I wondered if something in my template could be responsible for the error. My template was fairly complex, loading and calling several template tags, inheriting from a chain of templates, rendering a few s, and concatenating assets using django-compressor . I started by deleting everything from and reducing it to a single tag. Sure enough, this fixed the issue! Then I slowly added some of the code back until I found the one custom template tag that was throwing an exception, but only when called in the context of a 404 page. Fixing the tag and redeploying fixed the issue for good. But what about the logs? An error in my 404 template not only caused my server to return a 500, but also suppressed any backtraces that might have helped me diagnose the issue. That's weird, right? I might be wrong, but I believe the sequence of events that can lead to this issue is as follows: The lessons I learned from this frustrating scenario were: Somebody accesses a URL without a trailing slash. Django tries to find that URL in its . Since this is a Wagtail installation, it also tries to find a page in the URLs known to Wagtail. All the URLs in my have trailing slashes. Wagtail also appends trailing slashes to all its URLs when is true. So trying to access a page without a trailing slash returns a 404. You would expect Django's redirect logic to kick in at this point, trying to append a trailing slash to the original URL and performing a redirect. But that's not what happens! The redirect logic lives in , which can only perform the redirect after the entire handling chain has finished running. This means regardless of what happens, Django will always render your template when an unknown URL is accessed. Yes, even if redirecting to the same URL with a trailing slash produces a known, correct URL! This means if your template errors out, doesn't even get a chance to run. Django encounters an unknown URL, tries to render the template, fails, and turns the into a . When this happens, Django only logs the , not the template failing to render. This happens even if you're logging template rendering errors in your logging configuration . From what I can tell, there is no way to get Django to log an error in without creating a custom view, manually catching errors, logging the caught errors, and re-raising them so that Django can turn them into s. Always render your and pages in unit tests to make sure they can never error out. Keep your error pages as simple as possible. Ideally, they should only contain HTML and inlined CSS, nothing more.

0 views
Max Bernstein 1 months ago

Value numbering

Welcome back to compiler land. Today we’re going to talk about value numbering , which is like SSA, but more. Static single assignment (SSA) gives names to values: every expression has a name, and each name corresponds to exactly one expression. It transforms programs like this: where the variable is assigned more than once in the program text, into programs like this: where each assignment to has been replaced with an assignment to a new fresh name. It’s great because it makes clear the differences between the two expressions. Though they textually look similar, they compute different values. The first computes 1 and the second computes 2. In this example, it is not possible to substitute in a variable and re-use the value of , because the s are different. But what if we see two “textually” identical instructions in SSA? That sounds much more promising than non-SSA because the transformation into SSA form has removed (much of) the statefulness of it all. When can we re-use the result? Identifying instructions that are known at compile-time to always produce the same value at run-time is called value numbering . To understand value numbering, let’s extend the above IR snippet with two more instructions, v3 and v4. In this new snippet, v3 looks the same as v1: adding v0 and 1. Assuming our addition operation is some ideal mathematical addition, we can absolutely re-use v1; no need to compute the addition again. We can rewrite the IR to something like: This is kind of similar to the destructive union-find representation that JavaScriptCore and a couple other compilers use, where the optimizer doesn’t eagerly re-write all uses but instead leaves a little breadcrumb / instruction 1 . We could then run our copy propagation pass (“union-find cleanup”?) and get: Great. But how does this happen? How does an optimizer identify reusable instruction candidates that are “textually identical”? Generally, there is no actual text in the IR . One popular solution is to compute a hash of each instruction. Then any instructions with the same hash (that also compare equal, in case of collisions) are considered equivalent. This is called hash-consing . When trying to figure all this out, I read through a couple of different implementations. I particularly like the Maxine VM implementation. For example, here is the (hashing) and functions for most binary operations, slightly modified for clarity: The rest of the value numbering implementation assumes that if a function returns 0, it does not wish to be considered for value numbering. Why might an instruction opt-out of value numbering? An instruction might opt out of value numbering if it is not “pure”. Some instructions are not pure. Purity is in the eye of the beholder, but in general it means that an instruction does not interact with the state of the outside world, except for trivial computation on its operands. (What does it mean to de-duplicate/cache/reuse ?) A load from an array object is also not a pure operation 2 . The load operation implicitly relies on the state of the memory. Also, even if the array was known-constant, in some runtime systems, the load might raise an exception. Changing the source location where an exception is raised is generally frowned upon. Languages such as Java often have requirements about where exceptions are raised codified in their specifications. We’ll work only on pure operations for now, but we’ll come back to this later. We do often want to optimize impure operations as well! We’ll start off with the simplest form of value numbering, which operates only on linear sequences of instructions, like basic blocks or traces. Let’s build a small implementation of local value numbering (LVN). We’ll start with straight-line code—no branches or anything tricky. Most compiler optimizations on control-flow graphs (CFGs) iterate over the instructions “top to bottom” 3 and it seems like we can do the same thing here too. From what we’ve seen so far optimizing our made-up IR snippet, we can do something like this: The find-and-replace, remember, is not a literal find-and-replace, but instead something like: (if you have been following along with the toy optimizer series) This several-line function (as long as you already have a hash map and a union-find available to you) is enough to build local value numbering! And real compilers are built this way, too. If you don’t believe me, take a look at this slightly edited snippet from Maxine’s value numbering implementation. It has all of the components we just talked about: iterating over instructions, map lookup, and some substitution. This alone will get you pretty far. Code generators of all shapes tend to leave messy repeated computations all over their generated code and this will make short work of them. Sometimes, though, your computations are spread across control flow—over multiple basic blocks. What do you do then? Computing value numbers for an entire function is called global value numbering (GVN) and it requires dealing with control flow (if, loops, etc). I don’t just mean that for an entire function, we run local value numbering block-by-block. Global value numbering implies that expressions can be de-duplicated and shared across blocks. Let’s tackle control flow case by case. First is the simple case from above: one block. In this case, we can go top to bottom with our value numbering and do alright. The second case is also reasonable to handle: one block flowing into another. In this case, we can still go top to bottom. We just have to find a way to iterate over the blocks. If we’re not going to share value maps between blocks, the order doesn’t matter. But since the point of global value numbering is to share values, we have to iterate them in topological order (reverse post order (RPO)). This ensures that predecessors get visited before successors. If you have , we have to visit first and then . Because of how SSA works and how CFGs work, the second block can “look up” into the first block and use the values from it. To get global value numbering working, we have to copy ’s value map before we start processing so we can re-use the instructions. Maybe something like: Then the expressions can accrue across blocks. can re-use the already-computed from because it is still in the map. …but this breaks as soon as you have control-flow splits. Consider the following shape graph: We’re going to iterate over that graph in one of two orders: A B C or A C B. In either case, we’re going to be adding all this stuff into the value map from one block (say, B) that is not actually available to its sibling block (say, C). When I say “not available”, I mean “would not have been computed before”. This is because we execute either A then B or A then C. There’s no world in which we execute B then C. But alright, look at a third case where there is such a world: a control-flow join. In this diagram, we have two predecessor blocks B and C each flowing into D. In this diagram, B always flows into D and also C always flows into D. So the iterator order is fine, right? Well, still no. We have the same sibling problem as before. B and C still can’t share value maps. We also have a weird question when we enter D: where did we come from? If we came from B, we can re-use expressions from B. If we came from C, we can re-use expressions from C. But we cannot in general know which predecessor block we came from. The only block we know for sure that we executed before D is A. This means we can re-use A’s value map in D because we can guarantee that all execution paths that enter D have previously gone through A. This relationship is called a dominator relationship and this is the key to one style of global value numbering that we’re going to talk about in this post. A block can always use the value map from any other block that dominates it. For completeness’ sake, in the diamond diagram, A dominates each of B and C, too. We can compute dominators a couple of ways 4 , but that’s a little bit out of scope for this blog post. If we assume that we have dominator information available in our CFG, we can use that for global value numbering. And that’s just what—you guessed it—Maxine VM does. It iterates over all blocks in reverse post-order, doing local value numbering, threading through value maps from dominator blocks. In this case, their method gets the immediate dominator : the “closest” dominator block of all the blocks that dominate the current one. And that’s it! That’s the core of Maxine’s GVN implementation . I love how short it is. For not very much code, you can remove a lot of duplicate pure SSA instructions. This does still work with loops, but with some caveats. From p7 of Briggs GVN : The φ-functions require special treatment. Before the compiler can analyze the φ-functions in a block, it must previously have assigned value numbers to all of the inputs. This is not possible in all cases; specifically, any φ-function input whose value flows along a back edge (with respect to the dominator tree) cannot have a value number. If any of the parameters of a φ-function have not been assigned a value number, then the compiler cannot analyze the φ-function, and it must assign a unique, new value number to the result. It also talks about eliminating useless phis, which is optional, but would the strengthen global value numbering pass: it makes more information transparent. But what if we want to handle impure instructions? Languages such as Java allow for reading fields from the / object within methods as if the field were a variable name. This makes code like the following common: Each of these reference to and is an implicit reference to or , which is semantically a field load off an object. You can see it in the bytecode (thanks, Matt Godbolt): When straightforwardly building an SSA IR from the JVM bytecode for this method, you will end up with a bunch of IR that looks like this: Pretty much the same as the bytecode. Even though no code in the middle could modify the field (which would require a re-load), we still have a duplicate load. Bummer. I don’t want to re-hash this too much but it’s possible to fold Load and store forwarding into your GVN implementation by either: See, there’s nothing fundamentally stopping you from tracking the state of your heap at compile-time across blocks. You just have to do a little more bookkeeping. In our dominator-based GVN implementation, for example, you can: Not so bad. Maxine doesn’t do global memory tracking, but they do a limited form of load-store forwarding while building their HIR from bytecode: see GraphBuilder which uses the MemoryMap to help track this stuff. At least they would not have the same duplicate instructions in the example above! We’ve now looked at one kind of value numbering and one implementation of it. What else is out there? Apparently, you can get better results by having a unified hash table (p9 of Briggs GVN ) of expressions, not limiting the value map to dominator-available expressions. Not 100% on how this works yet. They note: Using a unified hash-table has one important algorithmic consequence. Replacements cannot be performed on-line because the table no longer reflects availability. Which is the first time that it occurred to me that hash-based value numbering with dominators was an approximation of available expression analysis. There’s also a totally different kind of value numbering called value partitioning (p12 of Briggs GVN ). See also a nice blog post about this by Allen Wang from the Cornell compiler course . I think this mostly replaces the hashing bit, and you still need some other thing for the available expressions bit. Ben Titzer and Seth Goldstein have some good slides from CMU . Where they talk about the worklist dataflow approach. Apparently this is slower but gets you more available expressions than just looking to dominator blocks. I wonder how much it differs from dominator+unified hash table. While Maxine uses hash table cloning to copy value maps from dominator blocks, there are also compilers such as Cranelift that use scoped hash maps to track this information more efficiently. (Though Amanieu notes that you may not need a scoped hash map and instead can tag values in your value map with the block they came from, ignoring non-dominating values with a quick check. The dominance check makes sense but I haven’t internalized how this affects the set of available expressions yet.) You may be wondering if this kind of algorithm even helps at all in a dynamic language JIT context. Surely everything is too dynamic, right? Actually, no! The JIT hopes to eliminate a lot of method calls and dynamic behaviors, replacing them with guards, assumptions, and simpler operations. These strength reductions often leave behind a lot of repeated instructions. Just the other day, Kokubun filed a value-numbering-like PR to clean up some of the waste. ART has a recent blog post about speeding up GVN. Go forth and give your values more numbers. There’s been an ongoing discussion with Phil Zucker on SSI, GVN, acyclic egraphs, and scoped union-find. TODO summarize Commutativity; canonicalization Seeding alternative representations into the GVN Aegraphs and union-find during GVN https://github.com/bytecodealliance/rfcs/blob/main/accepted/cranelift-egraph.md https://github.com/bytecodealliance/wasmtime/issues/9049 https://github.com/bytecodealliance/wasmtime/issues/4371 Writing this post is roughly the time when I realized that the whole time I was wondering why Cinder did not use union-find for rewriting, it actually did! Optimizing instruction by replacing with followed by copy propagation is equivalent to union-find.  ↩ In some forms of SSA, like heap-array SSA or sea of nodes, it’s possible to more easily de-duplicate loads because the memory representation has been folded into (modeled in) the IR.  ↩ The order is a little more complicated than that: reverse post-order (RPO). And there’s a paper called “A Simple Algorithm for Global Data Flow Analysis Problems” that I don’t yet have a PDF for that claims that RPO is optimal for solving dataflow problems.  ↩ There’s the iterative dataflow way (described in the Cooper paper (PDF)), Lengauer-Tarjan (PDF), the Engineered Algorithm (PDF), hybrid/Semi-NCA approach (PDF), …  ↩ initialize a map from instruction numbers to instruction pointers for each instruction if wants to participate in value numbering if ’s value number is already in the map, replace all pointers to in the rest of the program with the corresponding value from the map otherwise, add to the map doing load-store forwarding as part of local value numbering and clearing memory information from the value map at the end of each block, or keeping track of effects across blocks track heap write effects for each block at the start of each block B, union all of the “kill” sets for every block back to its immediate dominator finally, remove the stuff that got killed from the dominator’s value map V8 Hydrogen Writing this post is roughly the time when I realized that the whole time I was wondering why Cinder did not use union-find for rewriting, it actually did! Optimizing instruction by replacing with followed by copy propagation is equivalent to union-find.  ↩ In some forms of SSA, like heap-array SSA or sea of nodes, it’s possible to more easily de-duplicate loads because the memory representation has been folded into (modeled in) the IR.  ↩ The order is a little more complicated than that: reverse post-order (RPO). And there’s a paper called “A Simple Algorithm for Global Data Flow Analysis Problems” that I don’t yet have a PDF for that claims that RPO is optimal for solving dataflow problems.  ↩ There’s the iterative dataflow way (described in the Cooper paper (PDF)), Lengauer-Tarjan (PDF), the Engineered Algorithm (PDF), hybrid/Semi-NCA approach (PDF), …  ↩

0 views
Evan Schwartz 1 months ago

Scour - March Update

Hi friends, In March, Scour scoured 813,588 posts from 24,029 feeds (7,131 were newly added) and 488 new users signed up. Welcome! Here's what's new in the product: Scour now does a better job of ensuring that your feed draws from a mix of sources and that no single interest or group of interests dominates. I had made a number of changes along these lines in the past, but they were fiddly and the diversification mechanism wasn't working that well. Under the hood, Scour now does a first pass to score how similar articles are to your interests and then has a separate step for selecting posts for your feed while keeping it diverse on a number of different dimensions. Content from websites and groups of interests you tend to like and/or click on more are now given slightly more room in your feed. Conversely, websites and groups of interests you tend to dislike or not click on will be given a bit less space. For Scour, I'm always trying to think of how to show you more content you'll find interesting -- without trapping you in a small filter bubble (you can read about my ranking philosophy in the docs). After a number of iterations, I landed on a design that I'm happy with. I hope this strikes a good balance between making sure you see articles from your favorite sources, while still leaving room for the serendipity of finding a great new source that you didn't know existed. After you click an article, Scour now explicitly asks you for your reaction. These reactions help tune your feed slightly , and they help me improve the ranking algorithm over time. Before, the reaction buttons were below every post but that made them a bit hard to hit intentionally and easy to touch accidentally. If you want to react to an article without reading it first, you can also find them in the More Options ( ) menu. Thanks to Shane Sveller for pointing out that the reaction buttons were too small on mobile! Scour now supports exact keyword matching, in addition to using vector embeddings for semantic similarity. Articles that are similar to one of your interests but don't use the exact words or phrases from your interest definition will be ranked lower. Right now this applies to interests marked as "Specific" or "Normal" (this is also automatically determined when interests are created). This should cut down on the number of articles you see that are mis-categorized or clearly off-topic. Thanks to Alex Miller and an anonymous user for prompting this, and thanks to Alex, JackJackson, mhsid, snuggles, and anders_no for all the Off-Topic reports! Sometimes, I see an article on Hacker News or elsewhere and wonder why didn't this show up in my Scour feed. You can now paste links into the Why didn't I see this? page, and it will give you a bit of an explanation. You can also report that so I can look into it more and continue to improve the ranking algorithm over time. Here were some of my favorite posts that I found on Scour in March: Happy Scouring! P.S. If you use a coding agent like Claude Code, I also wrote up A Rave Review of Superpowers , a plugin that makes me much more productive. For anyone building products, this is a good reminder to make sure you're trying out and experiencing the bad parts of your product: Bored of eating your own dogfood? Try smelling your own farts! . This was a brief, interesting history and technical overview of document formats, from to and and why Markdown "won": Markdown Ate The World . A reminder that any user-generated input, including repo branch names, can be malicious: OpenAI Codex: How a Branch Name Stole GitHub Tokens . This is a very detailed and informative visual essay explaining how quantization (compression) for large language models works: Quantization from the ground up . I'm not currently using Turso (the Rust rewrite of SQLite), but I think what they're doing is interesting. Including this experimental version that speaks the Postgres SQL dialect: pgmicro . And because I like making -- and eating -- sour sourdough: How To Make Sourdough Bread More (Or Less) Sour .

0 views
W. Jason Gilmore 2 months ago

Troubleshooting Your Claude MCP Configuration

These days I add MCP support for pretty much every software product I build, including most recently IterOps and SecurityBot.dev . Creating the MCP server is very easy because I build all of my SaaS products using Laravel, and Laravel offers native MCP support . What's less clear is how to configure the MCP client to talk to the MCP server. This is because many MCP servers use to call the MCP server URL. This is easy enough, however if you're running NVM to assist with handling Node version discrepancies across multiple projects, then you might need to explicitly define the npx path inside the file, like this: If you're using Laravel Herd and the MCP client is crashing once Claude loads, it might be because you're using Herd's locally generated SSL certificates. The mcp-remote package doesn't like this and will complain about the certificate not being signed. You can tell mcp-remote to ignore this by adding the environment variable:

0 views
Simon Willison 2 months ago

Experimenting with Starlette 1.0 with Claude skills

Starlette 1.0 is out ! This is a really big deal. I think Starlette may be the Python framework with the most usage compared to its relatively low brand recognition because Starlette is the foundation of FastAPI , which has attracted a huge amount of buzz that seems to have overshadowed Starlette itself. Kim Christie started working on Starlette in 2018 and it quickly became my favorite out of the new breed of Python ASGI frameworks. The only reason I didn't use it as the basis for my own Datasette project was that it didn't yet promise stability, and I was determined to provide a stable API for Datasette's own plugins... albeit I still haven't been brave enough to ship my own 1.0 release (after 26 alphas and counting)! Then in September 2025 Marcelo Trylesinski announced that Starlette and Uvicorn were transferring to their GitHub account , in recognition of their many years of contributions and to make it easier for them to receive sponsorship against those projects. The 1.0 version has a few breaking changes compared to the 0.x series, described in the release notes for 1.0.0rc1 that came out in February. The most notable of these is a change to how code runs on startup and shutdown. Previously that was handled by and parameters, but the new system uses a neat lifespan mechanism instead based around an async context manager : If you haven't tried Starlette before it feels to me like an asyncio-native cross between Flask and Django, unsurprising since creator Kim Christie is also responsible for Django REST Framework. Crucially, this means you can write most apps as a single Python file, Flask style. This makes it really easy for LLMs to spit out a working Starlette app from a single prompt. There's just one problem there: if 1.0 breaks compatibility with the Starlette code that the models have been trained on, how can we have them generate code that works with 1.0? I decided to see if I could get this working with a Skill . Regular Claude Chat on claude.ai has skills, and one of those default skills is the skill-creator skill . This means Claude knows how to build its own skills. So I started a chat session and told it: Clone Starlette from GitHub - it just had its 1.0 release. Build a skill markdown document for this release which includes code examples of every feature. I didn't even tell it where to find the repo, Starlette is widely enough known that I expected it could find it on its own. It ran which is actually the old repository name, but GitHub handles redirects automatically so this worked just fine. The resulting skill document looked very thorough to me... and then I noticed a new button at the top I hadn't seen before labelled "Copy to your skills". So I clicked it: And now my regular Claude chat has access to that skill! I started a new conversation and prompted: Build a task management app with Starlette, it should have projects and tasks and comments and labels And Claude did exactly that, producing a simple GitHub Issues clone using Starlette 1.0, a SQLite database (via aiosqlite ) and a Jinja2 template. Claude even tested the app manually like this: For all of the buzz about Claude Code, it's easy to overlook that Claude itself counts as a coding agent now, fully able to both write and then test the code that it is writing. Here's what the resulting app looked like. The code is here in my research repository . You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
devansh 2 months ago

Four Vulnerabilities in Parse Server

Parse Server is one of those projects that sits quietly beneath a lot of production infrastructure. It powers the backend of a meaningful number of mobile and web applications, particularly those that started on Parse's original hosted platform before it shut down in 2017 and needed somewhere to migrate. Currently the project has over 21,000+ stars on GitHub I recently spent some time auditing its codebase and found four security vulnerabilities. Three of them share a common root, a fundamental gap between what is documented to do and what the server actually enforces. The fourth is an independent issue in the social authentication adapters that is arguably more severe, a JWT validation bypass that allows an attacker to authenticate as any user on a target server using a token issued for an entirely different application. The Parse Server team was responsive throughout and coordinated fixes promptly. All four issues have been patched. Parse Server is an open-source Node.js backend framework that provides a complete application backend out of the box, a database abstraction layer (typically over MongoDB or PostgreSQL), a REST and GraphQL API, user authentication, file storage, push notifications, Cloud Code for serverless functions, and a real-time event system. It is primarily used as the backend for mobile applications and is the open-source successor to Parse's original hosted backend-as-a-service platform. Parse Server authenticates API requests using one of several key types. The grants full administrative access to all data, bypassing all object-level and class-level permission checks. It is intended for trusted server-side operations only. Parse Server also exposes a option. Per its documentation, this key grants master-level read access, it can query any data, bypass ACLs for reading, and perform administrative reads, but is explicitly intended to deny all write operations. It is the kind of credential you might hand to an analytics service, a monitoring agent, or a read-only admin dashboard, enough power to see everything, but no ability to change anything. That contract is what three of these four vulnerabilities break. The implementation checks whether a request carries master-level credentials by testing a single flag — — on the auth object. The problem is that authentication sets both and , and a large number of route handlers only check the former. The flag is set but never consulted, which means the read-only restriction exists in concept but not in enforcement. Cloud Hooks are server-side webhooks that fire when specific Parse Server events occur — object creation, deletion, user signup, and so on. Cloud Jobs are scheduled or manually triggered background tasks that can execute arbitrary Cloud Code functions. Both are powerful primitives: Cloud Hooks can exfiltrate any data passing through the server's event stream, and Cloud Jobs can execute arbitrary logic on demand. The routes that manage Cloud Hooks and Cloud Jobs — creating new hooks, modifying existing ones, deleting them, and triggering job execution — are all guarded by master key access checks. Those checks verify only that the requesting credential has . Because satisfies that condition, a caller holding only the read-only credential can fully manage the Cloud Hook lifecycle and trigger Cloud Jobs at will. The practical impact is data exfiltration via Cloud Hook. An attacker who knows the can register a new Cloud Hook pointing to an external endpoint they control, then watch as every matching Parse Server event — user signups, object writes, session creation — is delivered to them in real time. The read-only key, intended to allow passive observation, can be turned into an active wiretap on the entire application's event stream. The fix adds explicit rejection checks to the Cloud Hook and Cloud Job handlers. Parse Server's Files API exposes endpoints for uploading and deleting files — and . Both routes are guarded by , a middleware that checks whether the incoming request has master-level credentials. Like the Cloud Hooks routes, this check only tests and never consults . The root cause traces through three locations in the codebase. In at lines 267–278, the read-only auth object is constructed with . In at lines 107–113, the delete route applies as its only guard. At lines 586–602 of the same file, the delete handler calls through to without any additional read-only check in the call chain. The consequence is that a caller with only can upload arbitrary files to the server's storage backend or permanently delete any existing file by name. The upload vector is primarily an integrity concern — poisoning stored assets. The deletion vector is a high-availability concern — an attacker can destroy application data (user avatars, documents, media) that may not have backups, and depending on how the application is structured, deletion of certain files could cause cascading application failures. The fix adds rejection to both the file upload and file delete handlers. This is the most impactful of the three issues. The endpoint is a privileged administrative route intended for master-key workflows — it accepts a parameter and returns a valid, usable session token for that user. The design intent is to allow administrators to impersonate users for debugging or support purposes. It is the digital equivalent of a master key that can open any door. The route's handler, , is located in at lines 339–345 and is mounted as at lines 706–708. The guard condition rejects requests where is false. Because produces an auth object where is true — and because there is no check anywhere in the handler or its middleware chain — the read-only credential passes the gate and the endpoint returns a fully usable for any provided. That session token is not a read-only token. It is a normal user session token, indistinguishable from one obtained by logging in with a password. It grants full read and write access to everything that user's ACL and role memberships permit. An attacker with the and knowledge of any user's object ID can silently mint a session as that user and then act as them with complete write access — modifying their data, making purchases, changing their email address, deleting their account, or doing anything else the application allows its users to do. There is no workaround other than removing from the deployment or upgrading. The fix is a single guard added to that rejects the request when is true. This vulnerability is independent of the theme and is the most severe of the four. It sits in Parse Server's social authentication layer — specifically in the adapters that validate identity tokens for Sign in with Google, Sign in with Apple, and Facebook Login. When a user authenticates via one of these providers, the client receives a JSON Web Token signed by the provider. Parse Server's authentication adapters are supposed to verify this token, they check the signature, the expiry, and critically, the audience claim — the field that specifies which application the token was issued for. Audience validation is what prevents a token issued for one application from being used to authenticate against a different application. Without it, a validly signed token from any Google, Apple, or Facebook application in the world can be used to authenticate against any Parse Server that trusts the same provider. The vulnerability arises from how the adapters handle missing configuration. For the Google and Apple adapters, the audience is passed to JWT verification via the configuration option. When is not set, the adapters do not reject the configuration as incomplete — they silently skip audience validation entirely. The JWT is verified for signature and expiry only, and any valid Google or Apple token from any app will be accepted. For Facebook Limited Login, the situation is worse, the vulnerability exists regardless of configuration. The Facebook adapter validates as the expected audience for the Standard Login (Graph API) flow. However, the Limited Login path — which uses JWTs rather than Graph API tokens — never passes to JWT verification at all. The code path simply does not include the audience parameter in the verification call, meaning no configuration value, however correct, can prevent the bypass on the Limited Login path. The attack is straightforward. An attacker creates or uses any existing Google, Apple, or Facebook application they control, signs in to obtain a legitimately signed JWT, and then presents that token to a vulnerable Parse Server's authentication endpoint. Because audience validation is skipped, the token passes verification. Combined with the ability to specify which Parse Server user account to associate the token with, this becomes full pre-authentication account takeover for any user on the server — with no credentials, no brute force, and no interaction from the victim. The fix enforces (Google/Apple) and (Facebook) as mandatory configuration and passes them correctly to JWT verification for both the Standard Login and Limited Login paths on all three adapters. What is Parse Server? The readOnlyMasterKey Contract Vulnerabilities CVE-2026-29182 Cloud Hooks and Cloud Jobs bypass readOnlyMasterKey CVE-2026-30228 File Creation and Deletion bypass readOnlyMasterKey CVE-2026-30229 /loginAs allows readOnlyMasterKey to gain full access as any user CVE-2026-30863 JWT Audience Validation Bypass in Google, Apple, and Facebook Adapters Disclosure Timeline CVE-2026-29182: GHSA-vc89-5g3r-cmhh — Fixed in 8.6.4 , 9.4.1-alpha.3 CVE-2026-30228: GHSA-xfh7-phr7-gr2x — Fixed in 8.6.5 , 9.5.0-alpha.3 CVE-2026-30229: GHSA-79wj-8rqv-jvp5 — Fixed in 8.6.6 , 9.5.0-alpha.4 CVE-2026-30863: GHSA-x6fw-778m-wr9v — Fixed in 8.6.10 , 9.5.0-alpha.11 Parse Server repository: github.com/parse-community/parse-server

0 views

Binding port 0 to avoid port collisions

It's common to spin up a server in a test so that you can do full end-to-end requests of it. It's a very important sort of test, to make sure things work all together. Most of the work I do is in complex web backends, and there's so much risk of not having all the request processing and middleware and setup exactly the same in a mock test... you must do at least some end-to-end tests or you're making a gamble that's going to bite you. And this is great, but you quickly run into a problem: port collisions! This can happen when you run multiple tests at once and all of them start a separate server, and whoops, two have picked the same port. Or it can happen if something else running on your development machine happens to be running on the port you chose. It's annoying when it happens, too, because it's often hard to reproduce. So... how do we fix that? You read the title [1] , so you know where we're going, but let's go there together. There are a few potential solutions to this. Perhaps the most obvious is binding to a port you choose randomly. This will work a lot of the time, but it's going to be flaky. You can drive down the probability of collision, but it's going to happen sometimes. Side note, I think the only thing worse than a test that fails 10% of the time is one that fails 1% of the time. It's not flaky enough to drive urgency for anyone to fix it, but it's flaky enough that in a team context, you will run into this on a daily basis. Ask me how I know. How often you get a collision depends on a lot of factors. How many times do you bind a port in the range? How many other services might bind something in that range? How likely are two things to run concurrently? As a simple example, let's say we pick a random port in the range 9000-9999, and you have 4 concurrent tests that will overlap. If you uniformly sample from this range, then you will have a 1/1000 chance of a collision from the second test, a 2/1000 chance from the third, and a 3/1000 chance from the fourth. Our probability of having no collision is . That means that we have a 0.6% chance of a collision. This isn't horrible, but it's not great! We could also have each test increment the port it picks by 1. I've done this before, and it avoids one set of problems from collisions, but it makes a new problem. Now you're sweeping across the entire range starting from the first port. If you have anything else running on your system that binds in that range, you'll run into a collision! And if you run your entire test suite in parallel, you're much more likely to have a problem now, since they all start at the same port. The problem we've had all along is that we don't have full information. If we know the system state and all the currently open ports, then binding to one that's not in use is an easy problem. And you know who knows all that info? The kernel does. And it turns out, this is something we can ask the kernel for. We can just say "please give me a nice unused port" and it will! There's a range of ports that the kernel uses for this. It varies by system, but it's not usually very relevant what the particular range is. On my system, I can find the range by checking . My ephemeral port range is from 32768 to 60999. I'm curious why the range stops there instead of going all the way up, so that's a future investigation. To get an ephemeral port on Linux systems, you bind or listen on port 0 . Then the kernel will hand you back a port in the ephemeral range. And you know that it's available, since the kernel is keeping track. It's possible to have an issue here if the full range of ports has been exhausted but, you know what, if you hit that limit, you probably have other problems [2] . The only thing is that if you've bound to an unknown port, how do you send requests to it? We can get the port we've bound to by another syscall, . This lets us find out what address a socket is bound to, and then we can do something with that information. For tests, that means that you'll need to find a way to communicate this port from the listener to the requester. If they're in the same process, I like to do this by either injecting in the listener or returning the address. If you're doing something like postgres or redis on an ephemeral port, then you'd probably have to find the port from its output, which is tedious but doable. Here's an example from a web app I'm working on. This is how a simple test looks. We launch the web server, binding to port 0, and get the address back. Then we can send requests to that address! And inside , the relevant two lines are: ...where in our case. That's all we have to do, and we'll get a much more reliable test setup. I think suspenseful titles can be fun, improve storytelling, and drive attention. But sometimes you really need a clear, honest, spoiler of a title. Giving away the answer is great when you're giving information that people might want to quickly internalize. ↩ If you do run into this, I'm very curious to hear about the circumstances. It's the kind of problem that I'd love to look at and work on. It's kind of messy, and you know that there's something very interesting that led to it being this way. ↩ I think suspenseful titles can be fun, improve storytelling, and drive attention. But sometimes you really need a clear, honest, spoiler of a title. Giving away the answer is great when you're giving information that people might want to quickly internalize. ↩ If you do run into this, I'm very curious to hear about the circumstances. It's the kind of problem that I'd love to look at and work on. It's kind of messy, and you know that there's something very interesting that led to it being this way. ↩

0 views