Distributed Systems

Fault Tolerance & Resilience

Master the patterns that keep distributed systems running under failure. Learn circuit breakers, retries with backoff, bulkheads, chaos engineering, and graceful degradation.

In distributed systems, failure is not an exception—it's a certainty. Networks partition, disks fail, processes crash, and memory leaks accumulate. The question isn't whether your system will fail, but how gracefully it will fail and how quickly it will recover.

As a staff engineer who's seen production incidents cascade from a single timeout to a full outage, I can tell you: the difference between a resilient system and a fragile one comes down to defensive programming. Every call to a remote service must assume it will fail, and your system must be designed to handle that failure gracefully.

The Failure Modes

Before we dive into patterns, let's understand what we're defending against and why failures spread.

Why Cascading Failures Happen

The key insight: in a distributed system, resources are finite. When one service slows down, its callers hold on to resources (threads, connections, memory) while waiting for responses. As those pools drain, the callers themselves start failing, and the failure propagates upstream.


Circuit Breaker

The circuit breaker pattern prevents cascading failures by "failing fast" when a downstream service is unhealthy.

The Three States

| State | Behavior | Transitions |
|---|---|---|
| Closed | Normal operation, requests pass through | Opens after failure threshold |
| Open | Fail fast, requests immediately rejected | Transitions to half-open after timeout |
| Half-Open | Allow limited requests to test recovery | Closes on success, reopens on failure |

How It Works

| Phase | What Happens |
|---|---|
| Monitor | Track failures for each downstream service |
| Trip | When failures exceed the threshold, open the circuit |
| Fallback | When the circuit is open, return a degraded response instead of waiting |
| Reset | After a timeout, allow test requests to check for recovery |

Key Configuration

| Parameter | Description |
|---|---|
| Failure threshold | Number of failures before opening (e.g., 5) |
| Success threshold | Successes needed to close from half-open (e.g., 2) |
| Timeout duration | Time to wait before attempting reset (e.g., 60s) |
| Half-open max calls | Limit on concurrent test requests |
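
These tables map fairly directly onto a small state machine. Below is a minimal sketch in Python, assuming synchronous calls from a single thread; the class name, defaults, and the RuntimeError raised while the circuit is open are illustrative choices, and the half-open max-calls limit is omitted for brevity.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: Closed -> Open -> Half-Open -> Closed."""

    def __init__(self, failure_threshold=5, success_threshold=2, reset_timeout=60.0):
        self.failure_threshold = failure_threshold   # failures before opening
        self.success_threshold = success_threshold   # successes to close from half-open
        self.reset_timeout = reset_timeout           # seconds to wait before half-open
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"             # allow test traffic
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"                # dependency has recovered
                self.failures = 0
        else:
            self.failures = 0

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()                             # reopen immediately
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = time.monotonic()
```

Production-grade implementations (resilience4j on the JVM, Polly in .NET, pybreaker in Python, among others) add thread safety, half-open concurrency limits, and metrics on top of this basic state machine.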

Retry with Exponential Backoff

When a call fails, retries can often succeed—especially for transient failures like network timeouts or 503 Service Unavailable.

The Problem with Fixed Retries

If every client retries at the same moment, you create a "thundering herd" that overloads the service just as it tries to recover.

Exponential Backoff with Jitter

| Strategy | Description | Why It Helps |
|---|---|---|
| Exponential backoff | Delay doubles with each retry (1s, 2s, 4s, 8s...) | Gives the service time to recover |
| Jitter | Add randomness to each delay | Prevents synchronized retries |
| Max delay cap | Limit the maximum delay (e.g., 30s) | Prevents excessive waiting |
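
Here is a sketch of these three strategies combined, using "full jitter" (each delay is drawn uniformly between zero and the exponential cap); the base delay, cap, and attempt count are arbitrary example values.

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter backoff: random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                        # out of attempts, surface the error
            time.sleep(backoff_delay(attempt))
```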

What Errors Should You Retry?

| Error Type | Retry? | Reason |
|---|---|---|
| Network timeout | Yes | Often transient |
| HTTP 503 Service Unavailable | Yes | Server overloaded |
| HTTP 429 Too Many Requests | Yes (with a longer delay) | Back off and retry |
| HTTP 500 Internal Server Error | Maybe | Depends on idempotency |
| HTTP 400 Bad Request | No | Won't be fixed by retrying |
| HTTP 404 Not Found | No | Resource doesn't exist |
| HTTP 401 Unauthorized | No | Credentials are invalid |
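
A small predicate can encode this table so a retry loop only re-attempts errors that might actually succeed; the status-code sets and the `idempotent` flag below are illustrative simplifications.

```python
RETRYABLE_STATUS = {429, 502, 503, 504}   # transient server-side / rate-limit errors
PERMANENT_STATUS = {400, 401, 403, 404}   # retrying won't help

def should_retry(status_code, idempotent=False):
    """Decide whether an HTTP response is worth retrying."""
    if status_code in RETRYABLE_STATUS:
        return True
    if status_code == 500:
        return idempotent                  # only retry 500s for idempotent requests
    return False
```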

Bulkhead Pattern

The bulkhead pattern isolates failures by limiting resource consumption per service, just as a ship's bulkheads keep a single flooded compartment from sinking the entire ship.

When to Use Bulkheads

| Scenario | Without Bulkhead | With Bulkhead |
|---|---|---|
| Batch job runs slow | Exhausts the shared pool, affects users | Only the batch pool fills up |
| One service hangs | All callers wait | Only that service's pool fills |
| Health checks pile up | Adds load during an outage | Dedicated pool, no impact |
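
In application code, a bulkhead is often just a bounded semaphore or a dedicated thread pool per dependency. The sketch below takes the semaphore approach and rejects excess calls rather than queuing them; the dependency names and limits are made up for illustration.

```python
import threading

class Bulkhead:
    """Bounded concurrency per dependency; excess calls are rejected, not queued."""

    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()

# One bulkhead per downstream dependency (sizes are illustrative).
bulkheads = {
    "recommendations": Bulkhead(10),
    "reporting": Bulkhead(4),
}
```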

Timeout Pattern

Never wait forever. Every remote call must have a timeout. Without timeouts, a slow service can consume all your threads waiting for a response.

Setting Appropriate Timeouts

| Approach | How It Works |
|---|---|
| Fixed timeout | Set an absolute maximum (e.g., 5 seconds) |
| Adaptive timeout | Base the limit on observed latency (e.g., P99 × 2) |
| Per-operation timeout | Different limits for different operations |

Timeout Guidelines by Operation Type

| Operation Type | Example Timeout | Rationale |
|---|---|---|
| Health check | 1 second | Should be instant |
| Simple query | 3 seconds | Database lookup |
| Complex computation | 10 seconds | Report generation |
| File upload | 60 seconds | Large data transfer |
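
As a sketch, per-operation limits like those above might be kept in one place and applied to every outbound call. This example uses the `requests` library's (connect, read) timeout tuple; the URL, operation names, and values are placeholders.

```python
import requests

# Per-operation (connect, read) timeouts in seconds; values are illustrative.
TIMEOUTS = {
    "health_check": (0.5, 1),
    "simple_query": (1, 3),
    "report": (1, 10),
    "file_upload": (3, 60),
}

def fetch_profile(user_id):
    # Never issue a remote call without an explicit timeout.
    resp = requests.get(
        f"https://user-service.internal/profiles/{user_id}",  # placeholder URL
        timeout=TIMEOUTS["simple_query"],
    )
    resp.raise_for_status()
    return resp.json()
```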

Fallback Patterns

When a service fails, provide a degraded but functional experience rather than an error.

Fallback Strategies

| Level | Strategy | Latency | Freshness |
|---|---|---|---|
| L1 | Call the primary service | Medium | Fresh |
| L2 | Serve stale cache | Low | Stale |
| L3 | Return default data | Very low | Static |
| L4 | Return a graceful error | Minimal | N/A |

Multi-Level Fallback Example

| Service | Fallback 1 | Fallback 2 | Fallback 3 |
|---|---|---|---|
| Recommendations | Recent cache | Popular items | Empty list |
| User profile | Stale cache | Default profile | Minimal profile |
| Product catalog | Stale cache | Featured products | Error message |
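
The Recommendations row could be wired up roughly as follows; `fetch_recommendations`, the `cache` client (with `get`/`set` methods), and `POPULAR_ITEMS` are hypothetical stand-ins for the primary call, a cache, and a precomputed default.

```python
import logging

log = logging.getLogger(__name__)

POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # precomputed, static default

def get_recommendations(user_id, fetch_recommendations, cache):
    # L1: primary service (fresh data, medium latency).
    try:
        recs = fetch_recommendations(user_id)
        cache.set(f"recs:{user_id}", recs)
        return recs
    except Exception as exc:
        log.warning("recommendation service failed: %s", exc)

    # L2: stale cache (low latency, possibly out of date).
    cached = cache.get(f"recs:{user_id}")
    if cached is not None:
        return cached

    # L3: static default (popular items); an empty list would be the last resort.
    return POPULAR_ITEMS
```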

Chaos Engineering

You can't know if your system is resilient until you test it under failure. Chaos engineering is the practice of deliberately injecting failures to discover weaknesses.

Common Failure Scenarios to Test

| Scenario | What to Inject | What to Measure |
|---|---|---|
| Network partition | Latency or dropped packets | Fallback activation |
| Service crash | Kill an instance | Failover time |
| Database failure | Connection timeouts | Reads from a replica |
| High load | CPU/memory pressure | Rate limiting |
| Dependency failure | Service unavailable | Graceful degradation |
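
Dedicated tools (next table) inject these faults at the infrastructure level, but the core idea can be sketched in application code as a wrapper that randomly adds latency or raises errors; the probabilities and delay below are arbitrary, and something like this belongs only in test environments.

```python
import random
import time

def chaotic(fn, latency_prob=0.1, added_latency=2.0, error_prob=0.05):
    """Wrap a dependency call with random latency and failure injection."""
    def wrapper(*args, **kwargs):
        if random.random() < latency_prob:
            time.sleep(added_latency)  # simulate a slow network or GC pause
        if random.random() < error_prob:
            raise ConnectionError("injected failure (chaos experiment)")
        return fn(*args, **kwargs)
    return wrapper
```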

Chaos Engineering Tools

| Tool | Platform | Key Features |
|---|---|---|
| Chaos Mesh | Kubernetes | Pod, network, and IO failures |
| Gremlin | Multi-platform | Attack library, safety checks |
| Litmus | Kubernetes | Workflow automation |
| AWS Fault Injection Simulator | AWS | Built-in AWS integration |

Graceful Degradation Strategy

Combining all patterns into a comprehensive resilience strategy:

Layered Defense

| Layer | When It Applies | Purpose |
|---|---|---|
| 1. Timeout | Every remote call | Fail fast, don't wait forever |
| 2. Retry | Transient failures | Recover from temporary issues |
| 3. Circuit Breaker | Repeated failures | Stop hammering an unhealthy service |
| 4. Bulkhead | Each dependency's resource pool | Prevent cascade to other services |
| 5. Fallback | Service unavailable | Provide a degraded response |
| 6. Fallback | Circuit open | Use cached/stale data |
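
As a rough illustration of how the layers compose, the sketch below reuses the `CircuitBreaker`, `retry`, `Bulkhead`, and `POPULAR_ITEMS` helpers from the earlier examples and assumes a hypothetical `fetch_recommendations(user_id)` function whose HTTP call already carries its own timeout; the nesting order shown is one reasonable choice, not the only one.

```python
# Reuses CircuitBreaker, retry, Bulkhead, and POPULAR_ITEMS from the sketches
# above; fetch_recommendations(user_id) is a hypothetical HTTP call that sets
# its own timeout (layer 1).
breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)
recommendations_pool = Bulkhead(max_concurrent=10)

def get_recommendations_resilient(user_id):
    def guarded_call():
        # Layer 3: circuit breaker fails fast once the dependency is known-bad.
        # Layer 2: retry (with backoff and jitter) recovers from transient errors.
        return breaker.call(lambda: retry(lambda: fetch_recommendations(user_id)))

    try:
        # Layer 4: bulkhead caps how many callers can be stuck in this path at once.
        return recommendations_pool.call(guarded_call)
    except Exception:
        # Layers 5-6: degrade to a static default instead of surfacing an error.
        return POPULAR_ITEMS
```

Note that putting retry inside the breaker means one full retry cycle counts as a single breaker failure; swapping the order makes the breaker count every attempt, which trips it sooner.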

Summary: The Resilience Toolkit

| Pattern | Purpose | When to Use |
|---|---|---|
| Timeout | Prevent resource exhaustion | Always, on every remote call |
| Retry | Handle transient failures | Network errors, 503s, timeouts |
| Circuit Breaker | Stop cascading failures | Services with known failure rates |
| Bulkhead | Isolate resource consumption | Multiple service types with different criticality |
| Fallback | Provide degraded functionality | Non-critical paths, caching strategy |
| Chaos Engineering | Test resilience before production | Continuous improvement |

Resilience is not about preventing all failures—it's about ensuring failures don't cascade and that users experience graceful degradation rather than a complete outage. Start with timeouts (the simplest and highest-impact pattern), then add circuit breakers for known weak points, and use chaos engineering to validate your assumptions.


Phase 3 Summary: You've now completed the Distributed Systems phase. We covered the fallacies that trip up most architects, the saga pattern for handling distributed transactions, service discovery patterns, caching strategies, distributed data store internals, and resilience patterns. These are the fundamentals that separate engineers who build toy systems from those who design production-grade distributed architectures.

In the next phase, we'll dive into Core Algorithms & Data Structures—the hidden engines behind every distributed system.