Reliability Patterns: Circuit Breakers, Retries, Fallbacks, and Graceful Degradation
Learn how to build reliable systems that handle failures gracefully, covering circuit breakers, retry strategies, timeouts, bulkheads, and graceful degradation patterns.
Why Reliability Matters
Every system fails eventually. The question is whether your system degrades gracefully or collapses completely. Reliability patterns help you build systems that survive failures.
Key insight: Reliability ≠ availability. A system can be available but unreliable (returning errors). Aim for both: high availability AND reliable responses.
Failure Modes
Types of Failures
| Type | Description | Example |
|---|---|---|
| Transient | Temporary, retry might work | Network timeout |
| Permanent | Won't recover without intervention | Disk full |
| Byzantine | Unpredictable, malicious | Corrupted data |
| Partial | Some requests work | Single server down |
Timeouts
Always set timeouts. Without them, a slow service can cause your entire system to hang.
Setting Timeouts
import httpx
# Set reasonable timeouts
response = httpx.get(
"https://api.example.com/data",
timeout=httpx.Timeout(5.0, connect=1.0)
)
# Don't do this - requests has no default timeout, so a hung
# server blocks the caller forever:
# response = requests.get(url)  # Bad: infinite wait!
| Service Type | Timeout Recommendation |
|---|---|
| Fast cache (Redis) | 10-50ms |
| Database | 100-500ms |
| External API | 1-5s |
| Long-running job | Async + polling |
Retry Patterns
When to Retry
Retry only transient failures (timeouts, connection resets, 5xx responses), and only for idempotent operations, so a duplicate attempt can't corrupt state.
Exponential Backoff
import random
import time

class RetryableError(Exception):
    """Raised for transient failures that are safe to retry."""

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries - 1:
                raise  # out of attempts; let the caller handle it
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus a
            # random spread so clients don't all retry in lockstep
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay / 2)
            time.sleep(delay + jitter)
| Attempt | Base Delay | With Jitter |
|---|---|---|
| 1 | 1s | 1.0-1.5s |
| 2 | 2s | 2.0-3.0s |
| 3 | 4s | 4.0-6.0s |
| 4 | 8s | 8.0-12.0s |
Don't retry blindly: Only retry transient errors (timeouts, 5xx). Don't retry 4xx errors, which are client mistakes that will fail again (429 Too Many Requests is the exception; retry it after backing off). Cap total retries with a retry budget so a widespread outage doesn't trigger a thundering herd.
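A quick usage sketch, assuming the retry_with_backoff helper and RetryableError defined above (the endpoint is hypothetical):
import httpx

def fetch_data():
    response = httpx.get("https://api.example.com/data", timeout=5.0)
    if response.status_code >= 500:
        raise RetryableError(f"server error: {response.status_code}")
    response.raise_for_status()  # 4xx errors propagate and are not retried
    return response.json()

data = retry_with_backoff(fetch_data, max_retries=3, base_delay=1)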
Circuit Breaker Pattern
Prevent cascading failures by stopping calls to a failing service.
States
| State | Behavior |
|---|---|
| Closed | Requests flow normally; failures are counted |
| Open | Requests fail fast without calling the service |
| Half-open | After a cooldown, one trial call tests whether the service recovered |
Implementation
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds to stay open before probing
        self.state = "closed"
        self.last_failure_time = None
    def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                # Cooldown elapsed: allow one trial call through
                self.state = "half_open"
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise
    def on_success(self):
        self.failure_count = 0
        self.state = "closed"
    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"
Circuit Breaker in Action
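A minimal usage sketch wrapping a hypothetical downstream call with the CircuitBreaker above:
import httpx

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def call_payment_service():
    # Hypothetical downstream call; swap in your own client code
    response = httpx.get("https://payments.internal/status", timeout=2.0)
    response.raise_for_status()
    return response

try:
    result = breaker.call(call_payment_service)
except CircuitOpenError:
    result = None  # fail fast: the downstream is known to be unhealthy
except httpx.HTTPError:
    result = None  # recorded as a failure by the breaker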
Bulkhead Pattern
Isolate components so that failure in one doesn't affect others.
| Pattern | Purpose |
|---|---|
| Connection pool per service | One slow DB doesn't block others |
| Separate thread pools | Isolate computation |
| Dedicated instances | Complete isolation |
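As a minimal sketch, separate thread pools per dependency keep a slow reports service from starving user lookups (the pool sizes and the fetch_user_from_db and build_report helpers are illustrative):
from concurrent.futures import ThreadPoolExecutor

# One pool per dependency: exhausting one cannot starve the other
user_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="users")
report_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="reports")

def get_user(user_id):
    # Even if every report worker is stuck, user lookups still have threads
    return user_pool.submit(fetch_user_from_db, user_id).result(timeout=0.5)

def get_report(report_id):
    return report_pool.submit(build_report, report_id).result(timeout=5.0)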
Fallback and Graceful Degradation
Fallback Strategies
Prefer fallbacks in order of usefulness: a backup source, then cached (possibly stale) data, then a generic default.
Code Example
# ml_service, cache, and popular_products are app-specific clients
def get_product_recommendations(user_id):
    try:
        # Try the primary (ML) service first
        return ml_service.get_recommendations(user_id)
    except MLServiceError:
        # Fallback 1: cached recommendations (possibly stale)
        cached = cache.get(f"recs:{user_id}")
        if cached:
            return cached
        # Fallback 2: non-personalized popular products
        return popular_products()
Graceful Degradation Matrix
| Component Down | Degradation |
|---|---|
| Recommendations | Show popular items |
| Reviews | Hide review section |
| Search | Show cached results |
| Payments | Queue for later |
| Analytics | Batch process later |
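For example, the "queue for later" row might look like this in miniature (payment_gateway and GatewayUnavailableError are hypothetical, and a real system would use a durable queue such as Kafka or SQS so buffered work survives restarts):
import queue

pending_payments = queue.Queue()  # in-process buffer for the sketch

def charge(payment):
    try:
        return payment_gateway.charge(payment)
    except GatewayUnavailableError:
        # Degrade: accept the order now, settle the charge later
        pending_payments.put(payment)
        return {"status": "pending"}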
Health Checks
Health checks let orchestrators and load balancers decide when to route traffic to a service and when to restart it.
Health Check Levels
| Level | Checks | Use Case |
|---|---|---|
| Liveness | Is the process alive? | Kubernetes liveness probe |
| Readiness | Can handle traffic? | Kubernetes readiness probe |
| Startup | Is initialization complete? | Kubernetes startup probe |
Health Check Implementation
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def health():
    # check_* helpers are app-specific; each should return a bool
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "external_api": check_external_api(),
    }
    all_healthy = all(checks.values())
    # FastAPI won't read a status code from a returned tuple;
    # use JSONResponse to send 503 when unhealthy
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={
            "status": "healthy" if all_healthy else "unhealthy",
            "checks": checks,
        },
    )
Chaos Engineering
Test your reliability by deliberately breaking things.
| Chaos Tool | What It Does |
|---|---|
| Chaos Monkey | Kills random instances |
| Litmus | Runs chaos experiments in Kubernetes |
| Gremlin | Launches targeted failure attacks |
| AWS Fault Injection Simulator | Injects faults into AWS infrastructure |
Start small: Test in staging first, begin with non-critical services, and have rollback plans ready. Only then move to production, with proper monitoring in place.
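Even without a dedicated tool, you can probe your own reliability patterns with a toy fault injector (a sketch only; failure_rate, the injected latency, and the wrapped fetch_data are illustrative):
import random
import time

def inject_faults(func, failure_rate=0.1, max_extra_latency=2.0):
    # Wrap a callable so it randomly fails or slows down
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise RuntimeError("chaos: injected failure")
        time.sleep(random.uniform(0, max_extra_latency))  # injected latency
        return func(*args, **kwargs)
    return wrapper

# Do your retries, timeouts, and fallbacks actually fire?
flaky_fetch = inject_faults(fetch_data, failure_rate=0.3)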
Reliability Checklist
| Pattern | When to Use |
|---|---|
| Timeouts | Always - prevent infinite waits |
| Retries | Transient failures, idempotent operations |
| Circuit Breaker | Calls to external services that might fail |
| Bulkhead | Isolate critical resources |
| Fallback | Degrade gracefully when primary fails |
| Health Checks | Help orchestrators manage your service |
| Chaos Engineering | Test your reliability assumptions |
What to Remember for Interviews
- Timeouts are essential: Always set them; infinite waits are never acceptable
- Retry with backoff: Exponential backoff + jitter prevents thundering herd
- Circuit breakers: Protect against cascading failures
- Graceful degradation: Have fallback strategies for every critical path
- Test for failure: Use chaos engineering to verify reliability
Practice: For any system design, ask: "What happens if X fails?" Then implement the appropriate reliability pattern.