How Systems Really Fail, Part I
The hidden physics of distributed outages, metastable cascades, and the assumptions that silently destroy systems at scale.
Intro
There is a version of distributed systems that exists in textbooks, RFCs, and the architecture diagrams that get drawn during the first month of a new project.
In that version, failures are discrete events: a node dies, a network partitions, a disk fills. Each event has a name. Each name has a mitigation. The mitigations compose. The composition is correct.
Then there is the version of distributed systems that exists in production at 03:47 UTC, when the on-call engineer is staring at a dashboard that shows everything green except for a customer-impact metric that has been climbing for nine minutes.
The runbook does not apply because it was written for the system that existed before the migration, the last engineer who understood the offending subsystem left the company seven months ago, and the only documentation is a Confluence page from 2023 that contradicts itself in the third paragraph.
This series is about the second version.
It is not about how to design distributed systems. There are good books for that. It is about what happens to those designs after they meet reality: after the load grows by a factor of fifty, after three reorgs change the ownership of half the services, after the configuration file that was supposed to be immutable acquires a small permissions change on a Tuesday morning in November.
It is about the failure modes that emerge not from broken components but from the interaction between working ones. It is about why debugging at scale is not a technical activity but an epistemic one.
And it is about the design decisions, often made years before the outage, that determine whether a system has a fighting chance when the failure arrives.
Five essays. Each stands alone. They share a thesis: the gap between how engineers reason about systems and how systems actually behave is not a knowledge problem. It is a structural property of complexity.
The faster you accept this, the better your systems will be.
The Composition Problem
On Monday, 17 November 2025, an engineer at Cloudflare merged a change to a permissions policy on the company’s ClickHouse database clusters.
The change was part of a long-running effort to migrate distributed queries from a shared system account to per-user authentication, so that query limits and access grants could be evaluated at finer granularity. It was the right kind of change. Reviewed, staged, rolled out gradually across cluster nodes, exactly as a careful operator would do it.
At 11:05 UTC the following morning, the rollout reached a critical threshold. Twenty-three minutes later, the internet broke.
At 11:28 UTC, Cloudflare’s network, which fronts roughly 20% of the websites on the public internet, began returning HTTP 5xx errors at scale. ChatGPT failed. X failed. Spotify, Discord, Canva, Figma, 1Password, Trello.
The outage lasted until 14:30 UTC for core traffic, with full restoration at 17:06 UTC. Matthew Prince, Cloudflare’s CEO, would later describe it as the worst outage since 2019. Estimated revenue loss across the affected ecosystem ran into the hundreds of millions of dollars.
The chain of causation, once it was understood, fits in a few paragraphs.
Cloudflare’s Bot Management module runs inside its core proxy (a system called FL, with a newer version FL2). The module scores every request as bot-or-human using a machine-learning model.
That model takes as input a “feature configuration file”, a list of per-request features, which is regenerated every five minutes by a query against a ClickHouse cluster. The regeneration query reads from system.columns, ClickHouse’s metadata table:
```sql
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
```

Note what is not in this query: a filter on the database name. The query implicitly assumed that system.columns would only return columns from the default database, because before the permissions migration users only had visibility into default.
ClickHouse’s distributed table engine actually stores shards in an underlying physical schema named r0. The new permissions policy granted explicit access to r0. After the change, the same query returned columns from both default and r0, roughly doubling the row count.
That row count was used directly to construct the feature file. The file had previously contained around 60 features. It now contained more than 200.
Downstream, in the Rust code that loaded the file into the FL2 proxy, there was a preallocated array sized for a hard ceiling of exactly 200 features: a performance optimisation so that runtime feature lookups would never allocate.
When the oversized file arrived, the load path returned Err(_). The calling code, written under the assumption that this could not happen, called .unwrap() on the Result.
The worker thread panicked with the now-public string:
```
thread fl2_worker_thread panicked: called `Result::unwrap()` on an `Err` value
```

Every request routed through that worker returned 5xx.
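The shape of this failure reproduces in a dozen lines. The sketch below is not Cloudflare's code; the names and the exact limit are stand-ins. The control flow, a correct capacity check returning Err and a caller that unwraps because the error "cannot happen", is the published mechanism.

```rust
const MAX_FEATURES: usize = 200; // hard ceiling, sized for "can never happen"

#[derive(Debug)]
struct TooManyFeatures {
    got: usize,
}

// The loader correctly refuses oversized input rather than corrupting state.
fn load_features(lines: &[&str]) -> Result<Vec<String>, TooManyFeatures> {
    if lines.len() > MAX_FEATURES {
        return Err(TooManyFeatures { got: lines.len() });
    }
    Ok(lines.iter().map(|s| s.to_string()).collect())
}

fn main() {
    // Upstream, the metadata query silently doubled: ~60 features became 200+.
    let oversized: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();
    let refs: Vec<&str> = oversized.iter().map(String::as_str).collect();

    // The caller was written under the assumption that Err was unreachable.
    // This line is the entire outage: a correct bounds check, an incorrect caller.
    let _features = load_features(&refs).unwrap(); // panics: TooManyFeatures { got: 260 }
}
```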
The damage was amplified by a second-order property. The permissions change was being rolled out gradually across the ClickHouse cluster, so for nearly an hour only some cluster nodes returned the duplicated result.
The feature file regenerated every five minutes, and whether the run hit an upgraded node or a non-upgraded node was effectively random.
The file therefore alternated, every five minutes, between “good” and “bad,” and the proxy fleet oscillated between recovery and failure on a five-minute cycle.
From the dashboards, this looked exactly like an active DDoS attack, an external adversary probing the network with intermittent pressure.
The incident commander spent the first two hours of the outage investigating that hypothesis, because the signature of the failure mimicked a known threat.
Read this again. Notice what is not in it.
There was no bug in the database. The new permissions behaviour was correct ClickHouse semantics. There was no bug in the query; it executed exactly as written.
There was no bug in the feature file format; it stored what it was given. There was no bug in the Rust proxy: its bounds check correctly refused to process malformed input rather than corrupting state.
There was no bug in the deployment process either: gradual rollout to a database cluster is exactly how you mitigate rollout risk. Every component, examined in isolation, behaved as designed, as documented, as code-reviewed.
The outage existed in the spaces between the components. It existed in an unwritten assumption: that the cardinality of the metadata query was bounded by the schema layout.
It was present in the gap between the team that owned the permissions migration and the team that owned the feature pipeline.
It even existed in the asymmetry between data Cloudflare treated as “trusted” (internally generated configuration) and data it treated as “untrusted” (everything from outside).
The failure was not a property of any component. It was a property of the system.
This is the central, uncomfortable fact about distributed systems: their failure modes are not documented because they cannot be documented.
They emerge from the composition of components, and the space of possible compositions grows faster than anyone can enumerate it.
Why decomposition breaks down
Software engineering is almost entirely built on decomposition.
You take a hard problem, split it into smaller problems, solve each, and compose the solutions. The discipline assumes (implicitly, almost religiously) that the behaviour of the whole can be derived from the behaviour of the parts.
This is the foundation of modular design, encapsulation, microservices, contracts, type systems. It is what allows ten thousand engineers to build a system no one of them understands in full.
The assumption is wrong, or rather: it holds only within a regime, and the regime ends somewhere around the scale where a system has enough components, enough state, and enough concurrency that the interactions between components become a richer source of behaviour than the components themselves.
The formal version of this argument is older than computer science. Herbert Simon, in The Architecture of Complexity (Proc. Am. Phil. Soc., 1962), distinguished between decomposable and nearly-decomposable systems.
In a decomposable system, interactions between subsystems are negligible compared to interactions within them, and the whole behaves like the sum of independent parts.
In a nearly-decomposable system, this is approximately true on short timescales but not on long ones: the weak inter-subsystem couplings accumulate into qualitatively different behaviour.
Simon’s claim, which has held up for sixty years across biology, economics, and engineering, is that all real systems of significant size are nearly-decomposable, not decomposable.
Distributed systems are an extreme case. The components have clean interfaces and look decomposable on a diagram.
But the interactions are mediated by shared resources (networks, clocks, storage, control planes) and those shared resources transmit perturbations between components in ways the diagram does not show.
A change in one component changes the load profile on the shared network, which changes the queueing behaviour at a different component, which changes the timing of its responses, which changes the retry behaviour of yet another component. The composition is opaque because the couplings are invisible.
Distributed systems theory has known a version of this for forty years. Fischer, Lynch, and Paterson (JACM 1985) proved that consensus is impossible in a purely asynchronous system with even one faulty process, a result that, properly understood, is not about consensus algorithms but about the impossibility of producing globally consistent system behaviour from locally correct components under partial failure.
Brewer’s CAP conjecture (PODC 2000) and the Gilbert-Lynch proof (ACM SIGACT News, 2002) formalised the same point at the level of state.
Lamport’s “Time, Clocks, and the Ordering of Events in a Distributed System” (CACM 1978) showed that there is no observer-independent simultaneity in a distributed system without explicit synchronisation, meaning every “global view” of the system is a stitched-together fiction.
The classical literature focused on discrete failures: a node dies, a message is lost, a clock drifts. The modern failures are stranger.
They are failures of coupling: moments when two pieces of working software, communicating through an interface both implement correctly, produce a behaviour neither would produce alone.
The Cloudflare incident is one. The DynamoDB DNS race condition that took down AWS US-EAST-1 on 19–20 October 2025 is a more elaborate example of the same pattern, and it is worth reconstructing mechanically because it shows how thoroughly the composition can betray its components.
The AWS DynamoDB cascade, mechanically
DynamoDB’s regional endpoint, dynamodb.us-east-1.amazonaws.com, is served by an internal DNS management system that exists because DynamoDB runs on hundreds of thousands of load balancers, and the DNS records pointing clients at those load balancers must be updated continuously as capacity is added, removed, and rebalanced.
The system has two logical components. The DNS Planner monitors load-balancer health and produces “DNS plans”, versioned snapshots of which load balancers receive which fraction of regional traffic.
The DNS Enactor reads plans and applies them to Route 53, AWS’s DNS service. For availability, three Enactors run in parallel, one per availability zone. They operate concurrently and independently; no distributed lock, no leader election, no coordination protocol.
The system was designed this way deliberately, so a single Enactor crashing mid-run would not stall propagation; the other two would simply pick up subsequent plans and continue.
To prevent stale plans from overwriting newer ones, each Enactor performs a freshness check before applying a plan.
To prevent unbounded growth of historical plans, each Enactor also performs a cleanup pass after applying its current plan, deleting plans significantly older than the current one.
The freshness check happens once, at the start of the application phase. The cleanup happens once, at the end.
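In sketch form, each Enactor run looks something like the following. The code is hypothetical (AWS has not published the implementation; the stubs stand in for Route 53 calls), but it shows where the check and the cleanup sit relative to the long apply phase.

```rust
// Hypothetical sketch of one Enactor run; stubs stand in for Route 53 calls.
fn apply_to_route53(_record: &str, _plan: u64) { /* one DNS update */ }
fn delete_plans_older_than(_plan: u64) { /* plan-store cleanup */ }

fn enactor_run(plan: u64, newest_applied_plan: u64, records: &[&str]) {
    // Freshness check: once, at the START of the run. Its result can be
    // minutes stale by the time the last record below is written.
    if plan <= newest_applied_plan {
        return;
    }
    // Apply phase: one update per record. A slow run stretches this window,
    // and a faster sibling Enactor can interleave a newer plan inside it.
    for record in records {
        apply_to_route53(record, plan);
    }
    // Cleanup: once, at the END. Deletes plans much older than this one --
    // even one a slower sibling has just written into live records.
    delete_plans_older_than(plan);
}

fn main() {
    enactor_run(42, 41, &["dynamodb.us-east-1.amazonaws.com"]);
}
```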
This is, again, the kind of design that gets praised in code review. Independent. Stateless. Fault-tolerant. Each component does one well-bounded job.
Now consider what actually happened. At 23:48 PDT on 19 October (06:48 UTC on 20 October), Enactor A read plan #N−1 from the Planner and began applying it to Route 53.
For reasons AWS’s post-mortem describes as “unusual delays”, likely network-mediated queueing inside Route 53’s control plane, Enactor A’s update run took longer than normal.
In the meantime, the Planner produced plan #N. Enactor B picked up plan #N, performed its freshness check (newer than the currently applied plan: pass), and began its own update run.
Enactor B finished first, applying #N to Route 53. It then began its cleanup pass, scanning for plans significantly older than #N and deleting them.
By the time Enactor A finished its delayed run and went to apply the last few records of plan #N−1, Enactor B had already applied #N to those same records.
Enactor A’s freshness check, made at the start of its run, had not detected this; the check was made when #N−1 was still the freshest plan, and that result was now stale. Enactor A overwrote those records with #N−1.
Now Enactor B’s cleanup pass arrived at plan #N−1. By Enactor B’s bookkeeping, #N−1 was significantly older than #N. Enactor B deleted plan #N−1. But Enactor A had just applied #N−1 to the regional endpoint records.
The records now pointed at a plan that did not exist. Route 53 dutifully served what it had: an empty answer set for dynamodb.us-east-1.amazonaws.com.
This is the worst possible DNS response. It is not NXDOMAIN, which clients treat as transient and retry. It is NOERROR with an empty ANSWER section; semantically “this name exists, intentionally, with zero addresses.” Compliant clients stop. There is no answer to retry.
Within seconds, every system inside and outside AWS that wanted to talk to DynamoDB in us-east-1 began failing to resolve its address. From the DynamoDB control plane’s view, the service was healthy: load balancers up, storage reachable, request handlers idle.
From Route 53’s view, the service was healthy: DNS was returning valid authoritative responses. From clients’ view, the service had ceased to exist. Three different frames of reference, three different “states” of the same service, all simultaneously true within their own frame. The mismatch between them was the outage.
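The race compresses into a runnable toy. Everything below is invented for illustration: a one-record DNS map, integer plan IDs. The real Enactors write many records over minutes, which is what opens the window.

```rust
use std::collections::BTreeMap;

type PlanId = u64;

fn main() {
    let mut dns: BTreeMap<&str, PlanId> = BTreeMap::new();
    let mut plans: Vec<PlanId> = vec![1]; // plan #N-1 exists

    // Enactor A: freshness check, once, at the start of its run.
    let a_plan: PlanId = 1;
    assert!(dns.get("dynamodb.us-east-1").map_or(true, |&p| a_plan > p));
    // ... A's run stalls here ("unusual delays") ...

    // Meanwhile the Planner produces plan #N, and Enactor B runs to completion.
    plans.push(2);
    let b_plan: PlanId = 2;
    dns.insert("dynamodb.us-east-1", b_plan); // B applies plan #N
    plans.retain(|&p| p >= b_plan); // B's cleanup deletes "significantly older" plans

    // Enactor A resumes. Its freshness check is minutes stale; nothing re-checks.
    dns.insert("dynamodb.us-east-1", a_plan); // A overwrites with plan #N-1

    // The live record now references a plan the cleanup has already deleted.
    let live = dns["dynamodb.us-east-1"];
    assert!(!plans.contains(&live));
    println!("record -> plan {live}, which no longer exists anywhere");
}
```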
It took manual intervention from on-call engineers to identify the empty record, repair Route 53 by hand, and re-enable normal automation. DynamoDB DNS recovered in approximately three hours.
The cascade that followed lasted ten more hours, and is the second composition failure embedded inside the first. EC2’s DropletWorkflow Manager (DWFM), the system that maintains operational leases on the physical hypervisors hosting customer EC2 instances, stores its lease state in DynamoDB.
While DynamoDB was unreachable, DWFM could not renew leases. Existing leases expired silently. When DynamoDB recovered, DWFM woke up to discover that essentially every hypervisor in the region needed a fresh lease, and tried to issue them all at once.
The lease-renewal subsystem entered what AWS’s post-mortem calls “congestive collapse”, a regime where throughput of useful work approaches zero because the system is spending all its time servicing retries of work that has already timed out.
Network Load Balancer health checks began failing en masse. New EC2 launches were impossible. The region was effectively down for production workloads until late that evening. Every design decision in this chain was defensible. Three Enactors instead of one, for availability.
Freshness check, to prevent old plans winning. Cleanup pass, to prevent unbounded growth. No distributed lock, to avoid coordination overhead and tolerate Enactor failures. DWFM storing state in DynamoDB, because what else would you use for a high-availability lease manager.
Each decision is the textbook answer to a specific risk.
The composition of all those textbook answers produced fifteen hours of regional unavailability and an industry-wide impact measured in hundreds of millions of dollars.
Why documentation cannot close the gap
The instinct, after this kind of incident, is to write better documentation. Add the failure mode to the runbook. Update the architecture diagram. Note the implicit assumption in a comment. Surely, next time, we will know.
We will not. The reason is not laziness; it is combinatorial.
Consider a system with N components, each with a small number of internal states, dependencies, and inputs. The number of pairwise interactions grows as O(N²).
The number of trajectories (sequences of states the system can traverse) grows much faster: for any reasonable model of state and concurrency, at least exponential in N. By the time N is in the low thousands (a serious production system), the trajectory space is unbounded for practical purposes.
Documentation is a linear medium. It can describe a finite number of states, interactions, and failure modes. The space of actual failure modes is not finite in any meaningful sense.
What documentation actually captures, in practice, is the failure modes that have already happened; the ones recovered from, written up, discussed in architecture review.
This is useful, but it is fundamentally backward-looking. The next outage is, almost by definition, the one not yet documented. It lives in some currently-undocumented region of the trajectory space, which the system will enter for the first time when some perturbation pushes it there.
This is not an indictment of documentation. Runbooks save lives. Post-mortems compound institutional knowledge. The point is that no quantity of documentation, however thorough, can close the gap between the system as designed and the system as composed.
The gap is structural. It only widens with scale.
The pattern beneath the patterns
If you read enough post-mortems (Dan Luu’s catalogue on GitHub remains the best free education in this material) a pattern emerges.
The triggers vary wildly: a permissions change, a DNS update, a config push, a deploy, a hardware failure, a thundering herd. The shape of the failure is often the same.
Nathan Bronson and his collaborators, in a 2021 HotOS paper, gave this shape a name: metastable failure. The framing has become foundational, and is worth restating precisely because it is the closest the field has come to a formal theory of why composition produces outages.
A metastable failure occurs in an open system with an uncontrolled load source. The system has at least two operating regimes: a stable regime, in which a transient perturbation decays back to equilibrium, and a metastable failure regime.
In the failure regime, the system is functioning (consuming CPU, processing messages, producing output) but its useful throughput, what the paper terms goodput, has collapsed.
The system transitions between regimes via a trigger: a load spike, a deploy, a partial failure, a configuration change.
What keeps the system in the failure regime, even after the trigger is removed, is a sustaining effect: a positive feedback loop, usually involving work amplification, in which the system’s response to its own degraded state increases the load on itself further.
The canonical example, paraphrased from the paper:
A web tier calls a database tier through a connection pool. Database latency is normally well below the client’s request timeout. A brief perturbation (a network blip, a slow GC pause) causes some requests to exceed the timeout.
The client retries. The retry is a new request, added to the existing load. Database queue depths grow. Latency increases, pushing more requests past the timeout. More retries fire. Each timed-out request still consumed full database work to compute its answer, but no client ever saw it; that work was wasted.
The system is now processing 3× its normal request volume (originals plus retries), succeeding in completing them all, but every client is timing out before the answer arrives. Goodput is zero. Throughput is at saturation. The trigger (the original network blip) is long gone. The retry storm is sustaining the failure regime on its own.
The key insight is that the root cause of a metastable failure is the sustaining loop, not the trigger. Triggers are infinitely various and mostly cannot be prevented.
Sustaining loops are finite and identifiable, and if you eliminate them, the same trigger fails to produce the same outcome.
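In code, eliminating the loop does not mean eliminating retries; it means bounding them so they cannot amplify. One common shape is a client-side retry budget. The sketch below is illustrative: the names, the 10% figure, and the windowless counters are simplifications, not a prescription.

```rust
/// A minimal retry budget: retries are permitted only up to a fixed fraction
/// of observed first-attempt traffic. Production versions decay these counters
/// over a time window; this sketch keeps plain totals for clarity.
struct RetryBudget {
    requests: u64,
    retries: u64,
    percent: u64, // max retries per 100 requests
}

impl RetryBudget {
    fn on_request(&mut self) {
        self.requests += 1;
    }

    /// Returns true if a retry is within budget. Under a retry storm this
    /// returns false for most callers -- which is the point: the loop that
    /// turns "briefly slow" into "3x load forever" is opened at the client.
    fn try_retry(&mut self) -> bool {
        if self.retries * 100 < self.requests * self.percent {
            self.retries += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = RetryBudget { requests: 0, retries: 0, percent: 10 };
    for _ in 0..1000 {
        budget.on_request();
    }
    // 1000 recent requests and a 10% budget: 100 retries pass, 400 fail fast
    // instead of joining a queue that is already past saturation.
    let allowed = (0..500).filter(|_| budget.try_retry()).count();
    println!("retries allowed: {allowed} of 500 attempted"); // 100 of 500
}
```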
A follow-up paper, Metastable Failures in the Wild (Huang et al., OSDI 2022), examined 22 publicly disclosed incidents at 11 major organisations and concluded that at least 4 of the previous 15 major AWS outages fit the metastable pattern.
The October 2025 DynamoDB incident makes 5. The EC2 cascade after DynamoDB recovered is the metastable pattern in textbook form: the trigger (DynamoDB DNS being empty) was resolved in three hours; the sustaining loop (every hypervisor in the region simultaneously demanding lease renewal from a system that could not handle the surge) took ten more hours to break, and only broke when AWS manually rate-limited the work.
Marc Brooker, a principal engineer at AWS who has written extensively on this material, has pointed out that the appropriate intellectual framework here is not algorithms-and-data-structures but control theory and dynamical systems.
A metastable failure is, in dynamical-systems terms, a system with two stable attractors, where the perturbation required to push the system from the desirable attractor into the undesirable one is much smaller than the perturbation required to push it back.
The state-space geometry is asymmetric. Most production engineers have never thought about their systems this way, because computer science is taught around discrete models. The systems are continuous and dynamical, whether we model them that way or not.
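The asymmetric two-attractor geometry is easy to exhibit in a toy model. The dynamics below are invented for illustration and model no particular production system: a queue served at fixed capacity, steady offered load, and retries generated by backlog.

```rust
fn main() {
    // Invented toy dynamics: a queue served at fixed capacity, with offered
    // load plus retries generated by backlog (work that waited re-enters).
    let step = |queue: f64, extra: f64| -> f64 {
        let capacity = 100.0;
        let offered = 80.0;
        let retries = 0.9 * (queue - 20.0).max(0.0); // backlog re-enters as load
        (queue + offered + extra + retries - capacity).max(0.0)
    };

    let run = |spike: f64| -> f64 {
        let mut q = 0.0;
        for t in 0..50 {
            q = step(q, if t == 10 { spike } else { 0.0 });
        }
        q
    };

    // A 30-unit blip builds a small backlog, which drains: stable attractor.
    println!("queue after 30-unit blip: {:.1}", run(30.0)); // ~0.0

    // A 70-unit blip puts the queue past the ridge between the attractors:
    // retries now exceed spare capacity and the backlog grows without bound,
    // long after the trigger is gone. Getting back means shedding far more
    // load than the blip ever added -- the asymmetry described above.
    println!("queue after 70-unit blip: {:.2e}", run(70.0)); // astronomically large
}
```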
Invariants and the cardinality contract
The implication is not that distributed systems are unbuildable; they obviously are buildable.
The true implication is that the mental model under which most distributed systems get built (components compose, contracts compose, correctness composes) is wrong in a way that matters for production behaviour.
The discipline that replaces this mental model is the explicit enforcement of invariants at every component boundary, including internal ones.
An invariant, in this context, is a property of a value that the consumer’s correctness depends on, but that the producer is not contractually obligated to maintain. The Cloudflare feature file had at least three such invariants, none enforced by any check at the boundary:
1. A cardinality bound. The Rust consumer required n_features ≤ 200. The ClickHouse query had no LIMIT, no WHERE on database, and no schema constraint preventing growth.
2. A schema invariant. The consumer assumed columns came from default only. The query implicitly assumed the same via the permissions model. Neither stated the invariant in code.
3. A monotonicity invariant. A doubling of feature count between two consecutive runs is, on its face, anomalous. No alarm fired on that delta.
Each invariant was true for years. Each became false silently when an upstream change reshaped the world. The boundary between producer and consumer had no formal contract; the contract lived in the heads of engineers, some of whom had left the company.
The discipline that prevents this is not “validation” in the loose sense. It is the explicit, in-code, enforced declaration of every cardinality, ordering, schema, and freshness constraint that the consumer relies on, with explicit handling of violation: typically degradation to last-known-good rather than panic.
The Rust idiom for this is the difference between .unwrap() and explicit pattern matching on Result; the operational idiom is the difference between trusting upstream data and treating every input as adversarial regardless of source.
The cost of the former is a few additional lines per consumer boundary. The cost of the latter is, occasionally, six hours of global downtime.
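Concretely, the few additional lines might look like the sketch below. This is a reconstruction, not Cloudflare's code; the file shape, names, and thresholds are hypothetical. It states the three invariants from the list above at the boundary and degrades to last-known-good instead of panicking.

```rust
const MAX_FEATURES: usize = 200; // the consumer's real capacity, stated in code

#[derive(Debug)]
enum ConfigError {
    TooMany(usize),
    WrongSchema(String),
    SuspiciousDelta { old: usize, new: usize },
}

/// Validate a freshly generated feature list against the invariants the
/// consumer actually relies on. `previous` is the last-known-good config.
fn validate(
    features: Vec<(String, String)>, // (database, feature name): hypothetical shape
    previous: &[(String, String)],
) -> Result<Vec<(String, String)>, ConfigError> {
    // Invariant 1: cardinality bound, the consumer's preallocated capacity.
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooMany(features.len()));
    }
    // Invariant 2: schema -- every feature must come from `default`.
    if let Some((db, _)) = features.iter().find(|(db, _)| db.as_str() != "default") {
        return Err(ConfigError::WrongSchema(db.clone()));
    }
    // Invariant 3: continuity -- a 2x jump between consecutive five-minute
    // runs is anomalous even if it fits in the array.
    if !previous.is_empty() && features.len() > previous.len() * 2 {
        return Err(ConfigError::SuspiciousDelta { old: previous.len(), new: features.len() });
    }
    Ok(features)
}

fn main() {
    let last_known_good = vec![("default".to_string(), "feature_a".to_string())];

    // A run that hits an upgraded ClickHouse node: rows from `default` AND `r0`.
    let bad: Vec<_> = ["default", "r0"]
        .iter()
        .flat_map(|db| (0..60).map(move |i| (db.to_string(), format!("f{i}"))))
        .collect();

    // Explicit handling of violation: alarm and keep serving the old config.
    let active = match validate(bad, &last_known_good) {
        Ok(fresh) => fresh,
        Err(e) => {
            eprintln!("rejecting feature file ({e:?}); keeping last known good");
            last_known_good.clone()
        }
    };
    println!("serving {} features", active.len());
}
```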
Sustaining loops and characteristic metrics
The second formal property the Cloudflare and DynamoDB incidents share is the presence of sustaining loops: control loops whose response to system degradation increases the load on the system rather than decreasing it.
The discipline for finding these before they fire is to enumerate every feedback loop in the system and classify each one’s stability properties.
A feedback loop is stable if, when perturbed from equilibrium by a small amount ε, it returns to equilibrium with error decaying as some function f(t,ε) that approaches zero.
A feedback loop is sustaining if the same perturbation produces error that grows or stays bounded away from zero.
The distinction is mathematically standard (Lyapunov stability) but is almost never applied to production systems, because engineers do not model their systems as dynamical systems.
The catalogue of loops in any non-trivial production system:
retry policies (timeout → retry → load → timeout amplification);
autoscaling (latency → scale-up → cold-start latency → more scale-up);
lease renewal (load → renewal delay → lease expiry → mass renewal storm);
connection pooling (failure → reconnect → handshake load → failure);
cache warming (cold cache → DB load → DB slow → cache cannot warm);
health checks (slow response → marked unhealthy → traffic shifted to fewer hosts → those hosts slower).
Each is a control loop. Each can be classified. The classification is rarely written down.
The observability counterpart of this classification is what Bronson calls characteristic metrics: observations of the loop state itself, not of the loop’s inputs or outputs.
Queue depth is a loop-state observable; request rate is not. Retry rate is a loop-state observable; error rate is not. Lease renewal latency is a loop-state observable; lease expiry rate is not.
The relationship between loop-state metrics and incident causality is direct: when a sustaining loop activates, its characteristic metric crosses out of its historical operating envelope before the user-facing symptom appears.
Instrumenting characteristic metrics is the difference between detecting a metastable failure during its inflation phase (when mitigation is cheap) and detecting it after it has saturated (when mitigation requires load-shedding the user-facing service).
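Instrumenting one requires no exotic tooling. In the sketch below, the window size and the 3x envelope are invented placeholders for what would, in production, be a learned band; the structure is what matters: observe the loop-state variable against its own history, and alarm on departure from the envelope rather than on user-facing errors.

```rust
use std::collections::VecDeque;

/// Tracks a loop-state observable (queue depth, retry rate, renewal latency)
/// against its own recent history, and flags departure from that envelope.
struct CharacteristicMetric {
    window: VecDeque<f64>,
    capacity: usize,
}

impl CharacteristicMetric {
    fn new(capacity: usize) -> Self {
        Self { window: VecDeque::with_capacity(capacity), capacity }
    }

    /// Record an observation; returns true if it sits outside the historical
    /// operating envelope (here: > 3x the recent mean, a crude stand-in for a
    /// learned band). This fires during the inflation phase of a sustaining
    /// loop, before user-facing error rates move.
    fn observe(&mut self, value: f64) -> bool {
        let breached = match self.mean() {
            Some(m) if self.window.len() >= self.capacity / 2 => value > 3.0 * m.max(1e-9),
            _ => false, // not enough history to define an envelope yet
        };
        if self.window.len() == self.capacity {
            self.window.pop_front();
        }
        self.window.push_back(value);
        breached
    }

    fn mean(&self) -> Option<f64> {
        if self.window.is_empty() {
            return None;
        }
        Some(self.window.iter().sum::<f64>() / self.window.len() as f64)
    }
}

fn main() {
    let mut retry_rate = CharacteristicMetric::new(60);
    // Normal operation: the retry rate hovers around 2/s.
    for _ in 0..60 {
        assert!(!retry_rate.observe(2.0));
    }
    // A sustaining loop starts inflating: retries climb well before 5xx does.
    assert!(retry_rate.observe(9.0));
    println!("characteristic metric left its envelope: page now, while it is cheap");
}
```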
The diagnostic question
The compressed form of the entire discipline reduces to a single question, asked of every component boundary in the system: what am I assuming about my input that is not enforced by a check in this code?
Every such unenforced assumption is a future incident. The space of unenforced assumptions is large but finite, and it can be enumerated. Most engineering organisations have never done this enumeration.
The ones that have produce systems that fail in less catastrophic ways; not because they fail less often, but because the failures that occur are caught at the boundary where the assumption was violated, rather than three layers downstream after corruption has propagated.
The system you have is not the system you designed. The system you have is the composition. The composition is opaque, and the opacity is permanent, but the opacity at every individual boundary is not permanent.
Each boundary is a place where assumptions can be made explicit and enforced. The discipline of composition-aware engineering is not to make the whole transparent.
It is to make every boundary honest about what it requires from its neighbours, and to refuse to operate when those requirements are not met.
This is what separates the systems that fail loudly at the seams from the systems that fail catastrophically in the centre.
One more thing…
Modern systems rarely fail because of a single broken component.
They fail because interactions between correct components create behaviours nobody explicitly designed for. The same thing happens in high-performance GPU systems.
Most CUDA optimisation is not about isolated tricks. It is about understanding how kernels, memory hierarchies, scheduling, communication, and throughput constraints interact under load.
I wrote a deep guide on CUDA from exactly this perspective: systems-level performance engineering, bottlenecks, hidden coupling, and why many “optimisations” simply move the problem elsewhere.