Metastable Failures in Distributed Systems
☕ Welcome to The Coder Cafe! Today, we explore one of the nastiest failure patterns in distributed systems: metastable failures. Based on the Metastable Failures in Distributed Systems whitepaper, we break down why these failures happen, why they persist, what we can do about them, and why our instinct to fix them is probably wrong. Get cozy, grab a coffee, and let’s begin!

Stable, Vulnerable, Metastable

Metastable failures borrow their name from physics, where metastable means something that looks stable but isn’t. To understand how a distributed system can end up in such a state, we need to look at three distinct states it can be in:

Stable: The system recovers on its own after any disruption. This is what we call resilience in Resilient, Fault-tolerant, Robust, or Reliable.

Vulnerable: The system looks perfectly healthy, but it's operating above its hidden capacity: the load level below which it can self-heal from any disruption. It responds fast, metrics are green, and nothing is alarming. Many production systems deliberately operate here because it's more efficient: resources are used closer to their limit. But there's no slack left. And the deeper the system operates in a vulnerable state, the smaller the trigger needed to push it over the edge. Indeed, a system just above its hidden capacity can survive large disruptions; a system near its advertised capacity can be tipped by almost anything.

Metastable failure: A trigger (e.g., a network blip, a deployment, a traffic spike) pushes the system over its hidden capacity. The system is not fully broken: processes are alive, and it’s still running. But goodput collapses: it’s no longer doing any useful work. Technically up, effectively down. And unlike a regular outage, removing the trigger doesn’t fix it. Getting out requires a strong corrective push: a restart, a dramatic load reduction, a manual intervention.
NOTE: If you’re not familiar with the concept of goodput, it’s the throughput of useful work completed successfully. For example, in a web application receiving 1000 requests per second but returning errors for 800 of them, the goodput is only 200 RPS.

Figure: The three states of a metastable failure. A system can drift into the vulnerable state unnoticed, and a single trigger is enough to push it into the metastable state it cannot escape on its own.

The most disorienting property of a metastable failure: stopping the trigger doesn’t stop the failure. To understand why, we need to talk about feedback loops. In a previous post on Systems Thinking Explained, we defined a feedback loop as:

If A causes B, then B influences A.

A feedback loop is exactly the mechanism that keeps a system stuck in the metastable state. There is always a sustaining effect, a feedback loop, that prevents recovery. The trigger is just what pushes the system over the edge. The loop is what keeps it there. Blaming the trigger is the natural instinct, and almost always the wrong diagnosis.

Let’s discuss a concrete example to make this clear. Imagine a web application that queries a database. The database comfortably handles up to 300 QPS. The application retries any query that doesn’t respond within 1 second. The system is running at 280 QPS, healthy and fast, within the database’s capacity.

Then, a transient network issue occurs for 10 seconds. When the issue is over, all the queued requests flood in at once. The database gets hit with a surge it can’t absorb: latency spikes and queries start timing out. So the application retries them. This doubles the effective load to 560 QPS. The database, already struggling, falls further behind. More timeouts. More retries. The loop is now self-sustaining:

High load → Timeouts → Retries → Higher load → More timeouts → More retries

The transient network issue was fixed minutes ago. Yet, the system is still completely broken.
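To make the loop tangible, here is a minimal simulation sketch of the example above. It is a toy model built on assumptions of mine, not something taken from the paper: time advances in 1-second ticks, the database serves its queue in strict FIFO order without dropping queries whose caller has already timed out, each timed-out query is retried exactly once, and the function name `simulate` and its parameters are purely illustrative.

```python
from collections import deque


def simulate(base_qps, capacity=300, trigger=(10, 20), duration=300, retry=True):
    """Return the goodput (queries answered within the 1 s timeout) for each second."""
    queue = deque()    # FIFO of [issue_second, count, is_first_attempt]
    retries_due = 0    # first attempts that timed out during the previous second
    goodput = []
    for t in range(duration):
        # Offered load this second: retries of last second's timeouts + new queries.
        if retries_due:
            queue.append([t, retries_due, False])
        queue.append([t, base_qps, True])

        # Assumption: during the trigger window the database serves nothing.
        budget = 0 if trigger[0] <= t < trigger[1] else capacity
        fresh = fresh_first = 0
        while budget and queue:
            batch = queue[0]
            served = min(budget, batch[1])
            batch[1] -= served
            budget -= served
            if batch[0] == t:
                # Issued this second, so answered in time: this is useful work.
                fresh += served
                if batch[2]:
                    fresh_first += served
            # Older batches are stale: the caller gave up, the work is wasted.
            if batch[1] == 0:
                queue.popleft()

        goodput.append(fresh)
        # Every first attempt that missed its deadline is retried once next second.
        retries_due = base_qps - fresh_first if retry else 0
    return goodput


# Goodput before the trigger, shortly after it ends, and at the end of the run.
for label, run in [
    ("280 QPS, retries on ", simulate(280)),
    ("280 QPS, retries off", simulate(280, retry=False)),
    ("140 QPS, retries on ", simulate(140)),
]:
    print(label, run[5], run[25], run[-1])
```

In this toy model, the 280 QPS run with retries never gets its goodput back after the 10-second trigger, whereas the same load without retries, or half the load with retries, eventually works through the backlog and recovers on its own.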
The trigger is gone; the feedback loop is not. The only way out is to dramatically cut the load or disable retries entirely.

This is a metastable failure. The system was vulnerable because it was operating above its hidden capacity. A minor, transient trigger pushed it over the edge and into a self-sustaining failure state it couldn’t escape on its own. The retry mechanism, a feature designed to improve reliability, became the very thing that prevented recovery.

This is one example, but the same pattern appears with caches, connection pools, failover logic, and more. The shape is always the same: a feedback loop that turns a temporary problem into a permanent one.

Two things make metastable failures particularly nasty.

We can be tempted to blame the wrong thing. When an outage happens, the trigger is what’s visible and recent: a spike, a deployment, a hardware fault. It’s the obvious culprit. But the trigger only exposed the problem; it didn’t create it. The sustaining feedback loop was already there, structural and invisible. When analyzing the problem in retrospect, teams focus on the trigger; fixes address the trigger; and the system remains vulnerable to the next one. The authors of the paper observed teams declare a metastable failure “resolved” multiple times before realizing the real cause had never been touched.

The feedback loop grows stronger with scale. Small-scale tests won’t reveal it. A staging environment running at 10% capacity may handle the same trigger without falling into a metastable state, because the loop isn’t strong enough at that scale to be self-sustaining. This means these failures can slip past even rigorous testing regimes and only manifest in production at full load.

We defined hidden capacity earlier as the load level below which the system can self-heal from any disruption. It’s different from, and always lower than, the advertised capacity.
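One rough way to write that relationship down, using notation that is mine rather than the paper's: let C_adv be the advertised capacity, L the steady offered load, and a > 1 the work amplification factor while a failure lasts (for instance, how many attempts each request ends up costing once retries kick in). A backlog can only drain if the amplified load still fits within what the system can serve:

```latex
a \cdot L < C_{\text{adv}}
\quad\Longrightarrow\quad
L < \frac{C_{\text{adv}}}{a} =: C_{\text{hidden}},
\qquad
a > 1 \;\Rightarrow\; C_{\text{hidden}} < C_{\text{adv}}
```

This is only a first-order estimate: it treats amplification as a single constant and ignores wasted work on stale requests, so a real system's hidden capacity may well sit lower.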
In our example, the numbers make it concrete: the advertised capacity is 300 QPS, but the hidden capacity is only 150 QPS, because retries double the load under failure. The gap between those two numbers is where vulnerability lives.

Measuring the hidden capacity is not straightforward, though. One possible approach is to apply a trigger at a given load level and observe whether the system recovers on its own:

If it does, we are below the hidden capacity.

If it doesn’t, we are above it.

We can also estimate it indirectly: in the retry example, retries double the load under failure, so the hidden capacity is roughly half the advertised capacity.

Metastable failures are not bugs. We can’t write a unit test that catches them. They are emergent behaviors: properties that arise from the interaction of a system’s components under specific conditions, not logic errors in any individual component. No single piece of code is buggy, no single configuration is wrong. The failure is a consequence of how everything fits together under load.

This changes how we need to think about them. The right question after an outage is not “What failed?” but “What loop sustained it?” And before an outage, the danger is not having bugs; it’s optimizing so aggressively for efficiency that we push the system deeper into the vulnerable state without realizing it.

Retries, caches, failover logic, connection pools: these are all features that improve reliability in the common case. They are also, under the right conditions, the sustaining mechanisms of metastable failures. The same design decision that makes a system more resilient in normal operation can also prevent it from recovering when things go wrong.

The paper describes several approaches to reduce the risk of metastable failures:

Retry budgets and circuit breakers: Instead of retrying indefinitely, cap the total number of retries in flight at any given time. This directly weakens the feedback loop by limiting work amplification.
LIFO scheduling under overload: Counterintuitively, switching from FIFO to LIFO when the system is overloaded allows some requests to complete within their deadline, preserving goodput instead of letting every request time out.

NOTE: I already wrote a post about that approach in Adaptive LIFO.

Fast error paths: Success paths are heavily optimized, but error paths often aren’t. An expensive error path (stack traces, DNS lookups, disk writes) under high failure rates can itself become a sustaining mechanism. Optimizing error paths reduces this risk.

Read-through caches over look-aside caches: A read-through cache (where the cache itself fetches missing data from the database) can continue filling itself even when the application has given up on a request, steadily increasing the hit rate and helping the system recover. A look-aside cache (where the application is responsible for populating the cache) can’t.

Production stress testing: Small-scale tests won’t reveal metastable failures. Testing against a portion of production traffic, with engineers ready to intervene, is the most reliable way to surface them.

A note of humility from the paper: there is no systematic solution yet. These are ad-hoc mitigations developed in response to known failures. Detecting vulnerable states before they collapse remains an open problem.

A distributed system can pass through three states: stable, vulnerable, and metastable. The vulnerable state looks healthy, but it isn’t.

The threshold between stable and vulnerable is invisible. Systems can operate in the vulnerable state for months without any sign of trouble.

When a trigger pushes a vulnerable system into a metastable failure, a feedback loop sustains the failure even after the trigger is gone.
The trigger is not the root cause. The feedback loop is. Fixing the trigger leaves the system vulnerable to the next one.

Reliability features like retries and caches can become the sustaining mechanism of a metastable failure under the right conditions.

Metastable failures are emergent behaviors, not bugs. We can’t unit test for them, and optimizing for efficiency makes them more likely.

Mitigations exist (retry budgets, circuit breakers, LIFO scheduling, fast error paths), but they are all ad-hoc responses to known failures. Detecting vulnerable states before they collapse remains an open problem.

Resilient, Fault-tolerant, Robust, or Reliable?
Adaptive LIFO
Fail Open vs. Fail Closed
Metastable Failures in Distributed Systems
Metastability and Distributed Systems