Fault Tolerance & Resilience
Master the patterns that keep distributed systems running under failure. Learn circuit breakers, retries with backoff, bulkheads, chaos engineering, and graceful degradation.
In distributed systems, failure is not an exception—it's a certainty. Networks partition, disks fail, processes crash, and memory leaks. The question isn't whether your system will fail, but how gracefully it will fail and how quickly it will recover.
As a staff engineer who's seen production incidents cascade from a single timeout to a full outage, I can tell you: the difference between a resilient system and a fragile one comes down to defensive programming. Every call to a remote service must assume it will fail, and your system must be designed to handle that failure gracefully.
The Failure Modes
Before we dive into patterns, let's understand what we're defending against: network partitions, crashed or hung processes, failing disks, leaking memory, and overloaded dependencies.
Why Cascading Failures Happen
The key insight: in a distributed system, resources are finite. When one service slows down, its callers hold onto resources (threads, connections, memory) while they wait for responses. As those resources run out, the callers start failing too, and the failure propagates upstream until a single slow dependency becomes a system-wide outage.
Circuit Breaker
The circuit breaker pattern prevents cascading failures by "failing fast" when a downstream service is unhealthy.
The Three States
| State | Behavior | Transitions |
|---|---|---|
| Closed | Normal operation, requests pass through | Opens after failure threshold |
| Open | Fail fast, requests immediately rejected | Transitions to half-open after timeout |
| Half-Open | Allow limited requests to test recovery | Closes on success, reopens on failure |
How It Works
| Phase | What Happens |
|---|---|
| Monitor | Track failures for each downstream service |
| Trip | When failures exceed threshold, open the circuit |
| Fallback | When circuit is open, return degraded response instead of waiting |
| Reset | After timeout, allow test requests to check recovery |
Key Configuration
| Parameter | Description |
|---|---|
| Failure threshold | Number of failures before opening (e.g., 5) |
| Success threshold | Successes needed to close from half-open (e.g., 2) |
| Timeout duration | Time to wait before attempting reset (e.g., 60s) |
| Half-open max calls | Limit concurrent test requests |
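Here's a minimal sketch of how these parameters might fit together in code. The `CircuitBreaker` class and its method names are illustrative rather than taken from any particular library, and it omits details such as the half-open call limit and thread safety:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected immediately."""

class CircuitBreaker:
    """Minimal illustrative circuit breaker (not production-ready)."""

    def __init__(self, failure_threshold=5, success_threshold=2, reset_timeout=60.0):
        self.failure_threshold = failure_threshold   # failures before opening
        self.success_threshold = success_threshold   # successes to close from half-open
        self.reset_timeout = reset_timeout           # seconds to wait before half-open
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # After the reset timeout, allow a test request (half-open).
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"
                self.successes = 0
            else:
                raise CircuitOpenError("failing fast: downstream marked unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
            self.failures = 0

    def _on_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0
```

In practice you'd give each downstream dependency its own breaker instance and pair the open-circuit error with one of the fallback strategies described later in this article.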
Retry with Exponential Backoff
When a call fails, retries can often succeed—especially for transient failures like network timeouts or 503 Service Unavailable.
The Problem with Fixed Retries
If every caller retries immediately and on the same schedule, you create a "thundering herd" that overloads the service just as it is trying to recover.
Exponential Backoff with Jitter
| Strategy | Description | Why It Helps |
|---|---|---|
| Exponential backoff | Delay doubles with each retry (1s, 2s, 4s, 8s...) | Gives service time to recover |
| Jitter | Add randomness to delay | Prevents synchronized retries |
| Max delay cap | Limit maximum delay (e.g., 30s) | Prevents excessive waiting |
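A small sketch of what exponential backoff with full jitter might look like; the function name and default values are illustrative, not from any specific library:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() with exponential backoff and full jitter (illustrative sketch)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Exponential backoff: 1s, 2s, 4s, 8s... capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount in [0, delay] so callers
            # don't retry in lockstep and re-create the thundering herd.
            time.sleep(random.uniform(0, delay))
```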
What Errors Should You Retry?
| Error Type | Retry? | Reason |
|---|---|---|
| Network timeout | Yes | Often transient |
| HTTP 503 Service Unavailable | Yes | Server overloaded |
| HTTP 429 Rate Limited | Yes (with longer delay) | Back off and retry |
| HTTP 500 Internal Server Error | Maybe | Depends on idempotency |
| HTTP 400 Bad Request | No | Won't fix with retry |
| HTTP 404 Not Found | No | Resource doesn't exist |
| Auth failure (401) | No | Credentials invalid |
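As a sketch, the table above might translate into a small decision helper like this; the function and its parameters are hypothetical:

```python
def should_retry(status_code=None, is_timeout=False, idempotent=False):
    """Decide whether a failed call is worth retrying (mirrors the table above)."""
    if is_timeout:
        return True              # network timeouts are usually transient
    if status_code in (503, 429):
        return True              # overloaded or rate-limited: back off and retry
    if status_code == 500:
        return idempotent        # retry 500s only when the request is idempotent
    return False                 # 400, 401, 404, ... won't be fixed by retrying
```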
Bulkhead Pattern
The bulkhead pattern isolates failures by limiting resource consumption per service, just as a ship's bulkheads keep one flooded compartment from sinking the entire vessel.
When to Use Bulkheads
| Scenario | Without Bulkhead | With Bulkhead |
|---|---|---|
| Batch job runs slow | Exhausts shared pool, affects users | Only batch pool fills up |
| One service hangs | All callers wait | Only that service's pool fills |
| Health checks pile up | Adds load during outage | Dedicated pool, no impact |
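One simple way to approximate bulkheads in application code is a dedicated, bounded worker pool per dependency. The pool names and sizes below are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# One small, dedicated pool per dependency: if the payments service hangs,
# only its 4 workers block; search and batch traffic keep their own capacity.
pools = {
    "payments": ThreadPoolExecutor(max_workers=4),
    "search":   ThreadPoolExecutor(max_workers=8),
    "batch":    ThreadPoolExecutor(max_workers=2),
}

def submit(dependency, fn, *args, **kwargs):
    """Run a call on the pool reserved for its dependency (bulkhead isolation)."""
    return pools[dependency].submit(fn, *args, **kwargs)
```

The same idea applies to connection pools and semaphores: the point is that no single dependency can consume more than its allotted share of shared resources.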
Timeout Pattern
Never wait forever. Every remote call must have a timeout. Without timeouts, a slow service can consume all your threads waiting for a response.
Setting Appropriate Timeouts
| Approach | How It Works |
|---|---|
| Fixed timeout | Set absolute maximum (e.g., 5 seconds) |
| Adaptive timeout | Based on observed latency (e.g., P99 × 2) |
| Per-operation timeout | Different limits for different operations |
Timeout Guidelines by Operation Type
| Operation Type | Example Timeout | Rationale |
|---|---|---|
| Health check | 1 second | Should be instant |
| Simple query | 3 seconds | Database lookup |
| Complex computation | 10 seconds | Report generation |
| File upload | 60 seconds | Large data transfer |
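As an illustration, per-operation timeout budgets can be expressed as a simple lookup. This sketch assumes the third-party `requests` package, and the operation names and values are only examples:

```python
import requests  # assumes the `requests` package is installed

# Per-operation timeouts: (connect timeout, read timeout) in seconds.
TIMEOUTS = {
    "health_check": (0.5, 1.0),
    "simple_query": (1.0, 3.0),
    "report":       (1.0, 10.0),
    "file_upload":  (1.0, 60.0),
}

def call(operation, url, **kwargs):
    """Issue a GET with the timeout budget for this operation type."""
    return requests.get(url, timeout=TIMEOUTS[operation], **kwargs)
```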
Fallback Patterns
When a service fails, provide a degraded but functional experience rather than an error.
Fallback Strategies
| Level | Strategy | Latency | Freshness |
|---|---|---|---|
| L1 | Call primary service | Medium | Fresh |
| L2 | Serve stale cache | Low | Stale |
| L3 | Return default data | Very low | Static |
| L4 | Return graceful error | Minimal | N/A |
Multi-Level Fallback Example
| Service | Fallback 1 | Fallback 2 | Fallback 3 |
|---|---|---|---|
| Recommendations | Recent cache | Popular items | Empty list |
| User profile | Stale cache | Default profile | Minimal profile |
| Product catalog | Stale cache | Featured products | Error message |
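A sketch of the recommendations row from the table above; the `client` and `cache` objects and the `allow_stale` flag are hypothetical interfaces, shown only to make the fallback order concrete:

```python
def get_recommendations(user_id, client, cache):
    """Walk the fallback levels for the recommendations example (illustrative)."""
    try:
        return client.fetch_recommendations(user_id)       # L1: primary service
    except Exception:
        pass
    cached = cache.get(f"recs:{user_id}", allow_stale=True)
    if cached is not None:
        return cached                                       # L2: stale per-user cache
    popular = cache.get("recs:popular", allow_stale=True)
    if popular is not None:
        return popular                                      # L3: default popular items
    return []                                               # L4: graceful empty list
```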
Chaos Engineering
You can't know if your system is resilient until you test it under failure. Chaos engineering is the practice of deliberately injecting failures to discover weaknesses.
Common Failure Scenarios to Test
| Scenario | What to Inject | What to Measure |
|---|---|---|
| Network partition | Latency or dropped packets | Fallback activation |
| Service crash | Kill instance | Failover time |
| Database failure | Connection timeout | Read from replica |
| High load | CPU/memory pressure | Rate limiting |
| Dependency failure | Service unavailable | Graceful degradation |
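The tools listed below inject these failures at the infrastructure level, but the core idea can be illustrated with a tiny in-process wrapper. This is a hypothetical sketch intended for test environments, not a substitute for a real chaos tool:

```python
import random
import time

def chaotic(fn, failure_rate=0.1, latency_rate=0.1, latency_s=2.0):
    """Wrap a client call with random failure and latency injection (tests only)."""
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < failure_rate:
            raise ConnectionError("chaos: injected dependency failure")
        if roll < failure_rate + latency_rate:
            time.sleep(latency_s)              # chaos: injected slow response
        return fn(*args, **kwargs)
    return wrapper
```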
Chaos Engineering Tools
| Tool | Platform | Key Features |
|---|---|---|
| Chaos Mesh | Kubernetes | Pod, network, IO failures |
| Gremlin | Multi-platform | Attack library, safety checks |
| Litmus | Kubernetes | Workflow automation |
| AWS Fault Injection Simulator | AWS | Built-in integration |
Graceful Degradation Strategy
A comprehensive resilience strategy combines all of these patterns into layered defenses:
Layered Defense
| Layer | Pattern | Purpose |
|---|---|---|
| 1. Timeout | Every remote call | Fail fast, don't wait forever |
| 2. Retry | Transient failures | Recover from temporary issues |
| 3. Circuit Breaker | Repeated failures | Stop hammering unhealthy service |
| 4. Bulkhead | Resource isolation | Prevent cascade to other services |
| 5. Fallback | Service unavailable | Provide degraded response |
| 6. Fallback | Circuit open | Use cached/stale data |
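To make the layering concrete, here is a hedged sketch that composes the patterns in roughly this order. It assumes the `CircuitBreaker` and `retry_with_backoff` sketches from earlier in this article, and `client`, `cache`, and their methods are hypothetical:

```python
def get_profile(user_id, client, breaker, cache):
    """Layered defense: timeout + retry + circuit breaker, then fallbacks."""
    try:
        # Layers 1-3: a per-call timeout, retries for transient errors,
        # and a circuit breaker to stop hammering an unhealthy service.
        return retry_with_backoff(
            lambda: breaker.call(client.fetch_profile, user_id, timeout=3.0),
            max_attempts=3,
        )
    except Exception:
        # Layers 5-6: fall back to stale cache, then to a minimal default profile.
        return cache.get(f"profile:{user_id}", allow_stale=True) or {"name": "Guest"}
```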
Summary: The Resilience Toolkit
| Pattern | Purpose | When to Use |
|---|---|---|
| Timeout | Prevent resource exhaustion | Always, on every remote call |
| Retry | Handle transient failures | Network errors, 503, timeouts |
| Circuit Breaker | Stop cascading failures | Services with known failure rates |
| Bulkhead | Isolate resource consumption | Multiple service types with different criticality |
| Fallback | Provide degraded functionality | Non-critical paths, caching strategy |
| Chaos Engineering | Test resilience before production | Continuous improvement |
Resilience is not about preventing all failures—it's about ensuring failures don't cascade and that users experience graceful degradation rather than complete outage. Start with timeouts (the simplest, highest impact), then add circuit breakers for known weak points, and use chaos engineering to validate your assumptions.
Phase 3 Summary: You've now completed the Distributed Systems phase. We covered the fallacies that trip up most architects, the saga pattern for handling distributed transactions, service discovery patterns, caching strategies, distributed data store internals, and resilience patterns. These are the fundamentals that separate engineers who build toy systems from those who design production-grade distributed architectures.
In the next phase, we'll dive into Core Algorithms & Data Structures—the hidden engines behind every distributed system.