Fault Tolerance & Resilience
Master the patterns that keep distributed systems running under failure. Learn circuit breakers, retries with backoff, bulkheads, chaos engineering, and graceful degradation.
In distributed systems, failure is not an exception—it's a certainty. Networks partition, disks fail, processes crash, and memory leaks. The question isn't whether your system will fail, but how gracefully it will fail and how quickly it will recover.
As a staff engineer who's seen production incidents cascade from a single timeout to a full outage, I can tell you: the difference between a resilient system and a fragile one comes down to defensive programming. Every call to a remote service must assume it will fail, and your system must be designed to handle that failure gracefully.
The Failure Modes
Before we dive into patterns, let's understand what we're defending against: network partitions, crashed or hung processes, failing disks, leaking memory, and overloaded dependencies.
Why Cascading Failures Happen
The key insight: in a distributed system, resources are finite. When one service slows down, its callers hold onto resources (threads, connections, memory) while they wait for responses. As those resources run out, the callers start failing too, and the failure propagates upstream until a single slow dependency becomes a system-wide outage.
Circuit Breaker
The circuit breaker pattern prevents cascading failures by "failing fast" when a downstream service is unhealthy.
The Three States
| State | Behavior | Transitions |
|---|---|---|
| Closed | Normal operation, requests pass through | Opens after failure threshold |
| Open | Fail fast, requests immediately rejected | Transitions to half-open after timeout |
| Half-Open | Allow limited requests to test recovery | Closes on success, reopens on failure |
How It Works
| Phase | What Happens |
|---|---|
| Monitor | Track failures for each downstream service |
| Trip | When failures exceed threshold, open the circuit |
| Fallback | When circuit is open, return degraded response instead of waiting |
| Reset | After timeout, allow test requests to check recovery |
Key Configuration
| Parameter | Description |
|---|---|
| Failure threshold | Number of failures before opening (e.g., 5) |
| Success threshold | Successes needed to close from half-open (e.g., 2) |
| Timeout duration | Time to wait before attempting reset (e.g., 60s) |
| Half-open max calls | Limit concurrent test requests |
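Here's a minimal sketch of how these parameters might fit together in code. The `CircuitBreaker` class and its method names are illustrative rather than taken from any particular library, and it omits details such as the half-open call limit and thread safety:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected immediately."""

class CircuitBreaker:
    """Minimal illustrative circuit breaker (not production-ready)."""

    def __init__(self, failure_threshold=5, success_threshold=2, reset_timeout=60.0):
        self.failure_threshold = failure_threshold   # failures before opening
        self.success_threshold = success_threshold   # successes to close from half-open
        self.reset_timeout = reset_timeout           # seconds to wait before half-open
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # After the reset timeout, allow a test request (half-open).
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"
                self.successes = 0
            else:
                raise CircuitOpenError("failing fast: downstream marked unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
            self.failures = 0

    def _on_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0
```

In practice you'd give each downstream dependency its own breaker instance and pair the open-circuit error with one of the fallback strategies described later in this article.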
Retry with Exponential Backoff
When a call fails, retries can often succeed—especially for transient failures like network timeouts or 503 Service Unavailable.
The Problem with Fixed Retries
If every caller retries immediately and on the same schedule, you create a "thundering herd" that overloads the service just as it is trying to recover.
Exponential Backoff with Jitter
| Strategy | Description | Why It Helps |
|---|---|---|
| Exponential backoff | Delay doubles with each retry (1s, 2s, 4s, 8s...) | Gives service time to recover |
| Jitter | Add randomness to delay | Prevents synchronized retries |
| Max delay cap | Limit maximum delay (e.g., 30s) | Prevents excessive waiting |
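A small sketch of what exponential backoff with full jitter might look like; the function name and default values are illustrative, not from any specific library:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() with exponential backoff and full jitter (illustrative sketch)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Exponential backoff: 1s, 2s, 4s, 8s... capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount in [0, delay] so callers
            # don't retry in lockstep and re-create the thundering herd.
            time.sleep(random.uniform(0, delay))
```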
What Errors Should You Retry?
| Error Type | Retry? | Reason |
|---|---|---|
| Network timeout | Yes | Often transient |
| HTTP 503 Service Unavailable | Yes | Server overloaded |
| HTTP 429 Rate Limited | Yes (with longer delay) | Back off and retry |
| HTTP 500 Internal Server Error | Maybe | Depends on idempotency |
| HTTP 400 Bad Request | No | Won't fix with retry |
| HTTP 404 Not Found | No | Resource doesn't exist |
| Auth failure (401) | No | Credentials invalid |
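As a sketch, the table above might translate into a small decision helper like this; the function and its parameters are hypothetical:

```python
def should_retry(status_code=None, is_timeout=False, idempotent=False):
    """Decide whether a failed call is worth retrying (mirrors the table above)."""
    if is_timeout:
        return True              # network timeouts are usually transient
    if status_code in (503, 429):
        return True              # overloaded or rate-limited: back off and retry
    if status_code == 500:
        return idempotent        # retry 500s only when the request is idempotent
    return False                 # 400, 401, 404, ... won't be fixed by retrying
```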
Bulkhead Pattern
The bulkhead pattern isolates failures by limiting resource consumption per service, just as a ship's bulkheads keep one flooded compartment from sinking the entire vessel.
When to Use Bulkheads
| Scenario | Without Bulkhead | With Bulkhead |
|---|---|---|
| Batch job runs slow | Exhausts shared pool, affects users | Only batch pool fills up |
| One service hangs | All callers wait | Only that service's pool fills |
| Health checks pile up | Adds load during outage | Dedicated pool, no impact |
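One simple way to approximate bulkheads in application code is a dedicated, bounded worker pool per dependency. The pool names and sizes below are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# One small, dedicated pool per dependency: if the payments service hangs,
# only its 4 workers block; search and batch traffic keep their own capacity.
pools = {
    "payments": ThreadPoolExecutor(max_workers=4),
    "search":   ThreadPoolExecutor(max_workers=8),
    "batch":    ThreadPoolExecutor(max_workers=2),
}

def submit(dependency, fn, *args, **kwargs):
    """Run a call on the pool reserved for its dependency (bulkhead isolation)."""
    return pools[dependency].submit(fn, *args, **kwargs)
```

The same idea applies to connection pools and semaphores: the point is that no single dependency can consume more than its allotted share of shared resources.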
Timeout Pattern
Never wait forever. Every remote call must have a timeout. Without timeouts, a slow service can consume all your threads waiting for a response.
Setting Appropriate Timeouts
| Approach | How It Works |
|---|---|
| Fixed timeout | Set absolute maximum (e.g., 5 seconds) |
| Adaptive timeout | Based on observed latency (e.g., P99 × 2) |
| Per-operation timeout | Different limits for different operations |
Timeout Guidelines by Operation Type
| Operation Type | Example Timeout | Rationale |
|---|---|---|
| Health check | 1 second | Should be instant |
| Simple query | 3 seconds | Database lookup |
| Complex computation | 10 seconds | Report generation |
| File upload | 60 seconds | Large data transfer |
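As an illustration, per-operation timeout budgets can be expressed as a simple lookup. This sketch assumes the third-party `requests` package, and the operation names and values are only examples:

```python
import requests  # assumes the `requests` package is installed

# Per-operation timeouts: (connect timeout, read timeout) in seconds.
TIMEOUTS = {
    "health_check": (0.5, 1.0),
    "simple_query": (1.0, 3.0),
    "report":       (1.0, 10.0),
    "file_upload":  (1.0, 60.0),
}

def call(operation, url, **kwargs):
    """Issue a GET with the timeout budget for this operation type."""
    return requests.get(url, timeout=TIMEOUTS[operation], **kwargs)
```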
Fallback Patterns
When a service fails, provide a degraded but functional experience rather than an error.
Fallback Strategies
| Level | Strategy | Latency | Freshness |
|---|---|---|---|
| L1 | Call primary service | Medium | Fresh |
| L2 | Serve stale cache | Low | Stale |
| L3 | Return default data | Very low | Static |
| L4 | Return graceful error | Minimal | N/A |
Multi-Level Fallback Example
| Service | Fallback 1 | Fallback 2 | Fallback 3 |
|---|---|---|---|
| Recommendations | Recent cache | Popular items | Empty list |
| User profile | Stale cache | Default profile | Minimal profile |
| Product catalog | Stale cache | Featured products | Error message |
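A sketch of the recommendations row from the table above; the `client` and `cache` objects and the `allow_stale` flag are hypothetical interfaces, shown only to make the fallback order concrete:

```python
def get_recommendations(user_id, client, cache):
    """Walk the fallback levels for the recommendations example (illustrative)."""
    try:
        return client.fetch_recommendations(user_id)       # L1: primary service
    except Exception:
        pass
    cached = cache.get(f"recs:{user_id}", allow_stale=True)
    if cached is not None:
        return cached                                       # L2: stale per-user cache
    popular = cache.get("recs:popular", allow_stale=True)
    if popular is not None:
        return popular                                      # L3: default popular items
    return []                                               # L4: graceful empty list
```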
Chaos Engineering
You can't know if your system is resilient until you test it under failure. Chaos engineering is the practice of deliberately injecting failures to discover weaknesses.
Common Failure Scenarios to Test
| Scenario | What to Inject | What to Measure |
|---|---|---|
| Network partition | Latency or dropped packets | Fallback activation |
| Service crash | Kill instance | Failover time |
| Database failure | Connection timeout | Read from replica |
| High load | CPU/memory pressure | Rate limiting |
| Dependency failure | Service unavailable | Graceful degradation |
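The tools listed below inject these failures at the infrastructure level, but the core idea can be illustrated with a tiny in-process wrapper. This is a hypothetical sketch intended for test environments, not a substitute for a real chaos tool:

```python
import random
import time

def chaotic(fn, failure_rate=0.1, latency_rate=0.1, latency_s=2.0):
    """Wrap a client call with random failure and latency injection (tests only)."""
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < failure_rate:
            raise ConnectionError("chaos: injected dependency failure")
        if roll < failure_rate + latency_rate:
            time.sleep(latency_s)              # chaos: injected slow response
        return fn(*args, **kwargs)
    return wrapper
```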
Chaos Engineering Tools
| Tool | Platform | Key Features |
|---|---|---|
| Chaos Mesh | Kubernetes | Pod, network, IO failures |
| Gremlin | Multi-platform | Attack library, safety checks |
| Litmus | Kubernetes | Workflow automation |
| AWS Fault Injection Simulator | AWS | Built-in integration |
Graceful Degradation Strategy
A comprehensive resilience strategy combines all of these patterns into layered defenses:
Layered Defense
| Layer | Pattern | Purpose |
|---|---|---|
| 1. Timeout | Every remote call | Fail fast, don't wait forever |
| 2. Retry | Transient failures | Recover from temporary issues |
| 3. Circuit Breaker | Repeated failures | Stop hammering unhealthy service |
| 4. Bulkhead | Resource isolation | Prevent cascade to other services |
| 5. Fallback | Service unavailable | Provide degraded response |
| 6. Fallback | Circuit open | Use cached/stale data |
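To make the layering concrete, here is a hedged sketch that composes the patterns in roughly this order. It assumes the `CircuitBreaker` and `retry_with_backoff` sketches from earlier in this article, and `client`, `cache`, and their methods are hypothetical:

```python
def get_profile(user_id, client, breaker, cache):
    """Layered defense: timeout + retry + circuit breaker, then fallbacks."""
    try:
        # Layers 1-3: a per-call timeout, retries for transient errors,
        # and a circuit breaker to stop hammering an unhealthy service.
        return retry_with_backoff(
            lambda: breaker.call(client.fetch_profile, user_id, timeout=3.0),
            max_attempts=3,
        )
    except Exception:
        # Layers 5-6: fall back to stale cache, then to a minimal default profile.
        return cache.get(f"profile:{user_id}", allow_stale=True) or {"name": "Guest"}
```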
Summary: The Resilience Toolkit
| Pattern | Purpose | When to Use |
|---|---|---|
| Timeout | Prevent resource exhaustion | Always, on every remote call |
| Retry | Handle transient failures | Network errors, 503, timeouts |
| Circuit Breaker | Stop cascading failures | Services with known failure rates |
| Bulkhead | Isolate resource consumption | Multiple service types with different criticality |
| Fallback | Provide degraded functionality | Non-critical paths, caching strategy |
| Chaos Engineering | Test resilience before production | Continuous improvement |
Resilience is not about preventing all failures—it's about ensuring failures don't cascade and that users experience graceful degradation rather than complete outage. Start with timeouts (the simplest, highest impact), then add circuit breakers for known weak points, and use chaos engineering to validate your assumptions.
Phase 3 Summary: You've now completed the Distributed Systems phase. We covered the fallacies that trip up most architects, the saga pattern for handling distributed transactions, service discovery patterns, caching strategies, distributed data store internals, and resilience patterns. These are the fundamentals that separate engineers who build toy systems from those who design production-grade distributed architectures.
In the next phase, we'll dive into Core Algorithms & Data Structures—the hidden engines behind every distributed system.