Reliability Patterns: Circuit Breakers, Retries, Fallbacks, and Graceful Degradation
Learn how to build reliable systems that handle failures gracefully, covering circuit breakers, retry strategies, timeouts, bulkheads, and graceful degradation patterns.
Why Reliability Matters
Every system fails eventually. The question is whether your system degrades gracefully or collapses completely. Reliability patterns help you build systems that survive failures.
Key insight: Reliability ≠ availability. A system can be available but unreliable (returning errors). Aim for both: high availability AND reliable responses.
Failure Modes
Types of Failures
| Type | Description | Example |
|---|---|---|
| Transient | Temporary, retry might work | Network timeout |
| Permanent | Won't recover without intervention | Disk full |
| Byzantine | Unpredictable, malicious | Corrupted data |
| Partial | Some requests work | Single server down |
Timeouts
Always set timeouts. Without them, a slow service can cause your entire system to hang.
Setting Timeouts
import httpx
# Set reasonable timeouts
response = httpx.get(
"https://api.example.com/data",
timeout=httpx.Timeout(5.0, connect=1.0)
)
# Don't do this - requests has no default timeout, so a hung
# server blocks the caller forever:
# response = requests.get(url)  # Bad: infinite wait!
| Service Type | Timeout Recommendation |
|---|---|
| Fast cache (Redis) | 10-50ms |
| Database | 100-500ms |
| External API | 1-5s |
| Long-running job | Async + polling |
Retry Patterns
When to Retry
Retry only transient failures (timeouts, connection resets, 5xx responses), and only for idempotent operations, so a duplicate attempt can't corrupt state.
Exponential Backoff
import random
import time

class RetryableError(Exception):
    """Raised for transient failures that are safe to retry."""

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries - 1:
                raise  # out of attempts; let the caller handle it
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus a
            # random spread so clients don't all retry in lockstep
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay / 2)
            time.sleep(delay + jitter)
| Attempt | Base Delay | With Jitter |
|---|---|---|
| 1 | 1s | 1.0-1.5s |
| 2 | 2s | 2.0-3.0s |
| 3 | 4s | 4.0-6.0s |
| 4 | 8s | 8.0-12.0s |
Don't retry blindly: Only retry transient errors (timeouts, 5xx). Don't retry 4xx errors, which are client mistakes that will fail again (429 Too Many Requests is the exception; retry it after backing off). Cap total retries with a retry budget so a widespread outage doesn't trigger a thundering herd.
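A quick usage sketch, assuming the retry_with_backoff helper and RetryableError defined above (the endpoint is hypothetical):
import httpx

def fetch_data():
    response = httpx.get("https://api.example.com/data", timeout=5.0)
    if response.status_code >= 500:
        raise RetryableError(f"server error: {response.status_code}")
    response.raise_for_status()  # 4xx errors propagate and are not retried
    return response.json()

data = retry_with_backoff(fetch_data, max_retries=3, base_delay=1)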
Circuit Breaker Pattern
Prevent cascading failures by stopping calls to a failing service.
States
| State | Behavior |
|---|---|
| Closed | Requests flow normally; failures are counted |
| Open | Requests fail fast without calling the service |
| Half-open | After a cooldown, one trial call tests whether the service recovered |
Implementation
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds to stay open before probing
        self.state = "closed"
        self.last_failure_time = None
    def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                # Cooldown elapsed: allow one trial call through
                self.state = "half_open"
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise
    def on_success(self):
        self.failure_count = 0
        self.state = "closed"
    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"
Circuit Breaker in Action
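A minimal usage sketch wrapping a hypothetical downstream call with the CircuitBreaker above:
import httpx

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def call_payment_service():
    # Hypothetical downstream call; swap in your own client code
    response = httpx.get("https://payments.internal/status", timeout=2.0)
    response.raise_for_status()
    return response

try:
    result = breaker.call(call_payment_service)
except CircuitOpenError:
    result = None  # fail fast: the downstream is known to be unhealthy
except httpx.HTTPError:
    result = None  # recorded as a failure by the breaker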
Bulkhead Pattern
Isolate components so that failure in one doesn't affect others.
| Pattern | Purpose |
|---|---|
| Connection pool per service | One slow DB doesn't block others |
| Separate thread pools | Isolate computation |
| Dedicated instances | Complete isolation |
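As a minimal sketch, separate thread pools per dependency keep a slow reports service from starving user lookups (the pool sizes and the fetch_user_from_db and build_report helpers are illustrative):
from concurrent.futures import ThreadPoolExecutor

# One pool per dependency: exhausting one cannot starve the other
user_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="users")
report_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="reports")

def get_user(user_id):
    # Even if every report worker is stuck, user lookups still have threads
    return user_pool.submit(fetch_user_from_db, user_id).result(timeout=0.5)

def get_report(report_id):
    return report_pool.submit(build_report, report_id).result(timeout=5.0)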
Fallback and Graceful Degradation
Fallback Strategies
Prefer fallbacks in order of usefulness: a backup source, then cached (possibly stale) data, then a generic default.
Code Example
# ml_service, cache, and popular_products are app-specific clients
def get_product_recommendations(user_id):
    try:
        # Try the primary (ML) service first
        return ml_service.get_recommendations(user_id)
    except MLServiceError:
        # Fallback 1: cached recommendations (possibly stale)
        cached = cache.get(f"recs:{user_id}")
        if cached:
            return cached
        # Fallback 2: non-personalized popular products
        return popular_products()
Graceful Degradation Matrix
| Component Down | Degradation |
|---|---|
| Recommendations | Show popular items |
| Reviews | Hide review section |
| Search | Show cached results |
| Payments | Queue for later |
| Analytics | Batch process later |
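For example, the "queue for later" row might look like this in miniature (payment_gateway and GatewayUnavailableError are hypothetical, and a real system would use a durable queue such as Kafka or SQS so buffered work survives restarts):
import queue

pending_payments = queue.Queue()  # in-process buffer for the sketch

def charge(payment):
    try:
        return payment_gateway.charge(payment)
    except GatewayUnavailableError:
        # Degrade: accept the order now, settle the charge later
        pending_payments.put(payment)
        return {"status": "pending"}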
Health Checks
Health checks let orchestrators and load balancers decide when to route traffic to a service and when to restart it.
Health Check Levels
| Level | Checks | Use Case |
|---|---|---|
| Liveness | Is the process alive? | Kubernetes liveness probe |
| Readiness | Can handle traffic? | Kubernetes readiness probe |
| Startup | Is initialization complete? | Kubernetes startup probe |
Health Check Implementation
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def health():
    # check_* helpers are app-specific; each should return a bool
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "external_api": check_external_api(),
    }
    all_healthy = all(checks.values())
    # FastAPI won't read a status code from a returned tuple;
    # use JSONResponse to send 503 when unhealthy
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={
            "status": "healthy" if all_healthy else "unhealthy",
            "checks": checks,
        },
    )
Chaos Engineering
Test your reliability by deliberately breaking things.
| Chaos Tool | What It Does |
|---|---|
| Chaos Monkey | Kills random instances |
| Litmus | Runs chaos experiments in Kubernetes |
| Gremlin | Launches targeted failure attacks |
| AWS Fault Injection Simulator | Injects faults into AWS infrastructure |
Start small: Test in staging first, begin with non-critical services, and have rollback plans ready. Only then move to production, with proper monitoring in place.
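Even without a dedicated tool, you can probe your own reliability patterns with a toy fault injector (a sketch only; failure_rate, the injected latency, and the wrapped fetch_data are illustrative):
import random
import time

def inject_faults(func, failure_rate=0.1, max_extra_latency=2.0):
    # Wrap a callable so it randomly fails or slows down
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise RuntimeError("chaos: injected failure")
        time.sleep(random.uniform(0, max_extra_latency))  # injected latency
        return func(*args, **kwargs)
    return wrapper

# Do your retries, timeouts, and fallbacks actually fire?
flaky_fetch = inject_faults(fetch_data, failure_rate=0.3)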
Reliability Checklist
| Pattern | When to Use |
|---|---|
| Timeouts | Always - prevent infinite waits |
| Retries | Transient failures, idempotent operations |
| Circuit Breaker | Calls to external services that might fail |
| Bulkhead | Isolate critical resources |
| Fallback | Degrade gracefully when primary fails |
| Health Checks | Help orchestrators manage your service |
| Chaos Engineering | Test your reliability assumptions |
What to Remember for Interviews
- Timeouts are essential: Always set them; infinite waits are never acceptable
- Retry with backoff: Exponential backoff + jitter prevents thundering herd
- Circuit breakers: Protect against cascading failures
- Graceful degradation: Have fallback strategies for every critical path
- Test for failure: Use chaos engineering to verify reliability
Practice: For any system design, ask: "What happens if X fails?" Then implement the appropriate reliability pattern.