Security & Reliability

Reliability Patterns: Circuit Breakers, Retries, Fallbacks, and Graceful Degradation

Learn how to build reliable systems that handle failures gracefully. Covers circuit breakers, retry strategies, timeouts, bulkheads, and graceful degradation patterns.

20 min read · Tags: reliability, circuit breaker, retry, fallback, resilience, fault tolerance

Why Reliability Matters

Every system fails eventually. The question is whether your system degrades gracefully or collapses completely. Reliability patterns help you build systems that survive failures.

Key insight: Reliability ≠ availability. A system can be available but unreliable (returning errors). Aim for both: high availability AND reliable responses.


Failure Modes

Types of Failures

Type | Description | Example
Transient | Temporary; a retry might work | Network timeout
Permanent | Won't recover without intervention | Disk full
Byzantine | Unpredictable or malicious behavior | Corrupted data
Partial | Some requests work, others fail | Single server down

Timeouts

Always set timeouts. Without them, a slow service can cause your entire system to hang.

Setting Timeouts

python
import httpx

# Set explicit timeouts: 5s overall, 1s to establish the connection
response = httpx.get(
    "https://api.example.com/data",
    timeout=httpx.Timeout(5.0, connect=1.0)
)

# Don't do this - with the requests library there is no default timeout,
# so the call can hang forever on a dead server:
# response = requests.get(url)  # Bad!

Service Type | Timeout Recommendation
Fast cache (Redis) | 10-50ms
Database | 100-500ms
External API | 1-5s
Long-running job | Async + polling
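
One way to apply these budgets is a pre-configured client per dependency. A minimal sketch with httpx, assuming each dependency is fronted by HTTP; the client names and values are illustrative:

python
import httpx

# One client per dependency, each with its own timeout budget
cache_client = httpx.Client(timeout=httpx.Timeout(0.05))                 # ~50ms
db_client = httpx.Client(timeout=httpx.Timeout(0.5))                     # ~500ms
partner_client = httpx.Client(timeout=httpx.Timeout(5.0, connect=1.0))   # 1-5s

response = partner_client.get("https://api.example.com/data")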

Retry Patterns

When to Retry

Retry only failures that are transient (timeouts, dropped connections, 5xx responses), and only operations that are idempotent, so a duplicate attempt can't cause a double write.

Exponential Backoff

python
import time
import random

class RetryableError(Exception):
    """Stand-in for whatever transient errors you treat as retryable."""

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries - 1:
                raise  # out of retries - let the failure propagate

            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay / 2)
            time.sleep(delay + jitter)

Attempt | Base Delay | With Jitter
1 | 1s | 1.0-1.5s
2 | 2s | 2.0-3.0s
3 | 4s | 4.0-6.0s
4 | 8s | 8.0-12.0s
⚠️ Don't retry blindly: only retry transient errors (timeouts, 5xx responses). Never retry 4xx errors - they signal a client mistake that will fail again. Implement retry budgets to prevent a thundering herd of simultaneous retries.
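
A sketch of that classification step; the helper is our own, using exception types that exist in httpx:

python
import httpx

# Transient transport failures - worth retrying
RETRYABLE_EXCEPTIONS = (httpx.TimeoutException, httpx.ConnectError)

def is_retryable(exc=None, status_code=None):
    if exc is not None:
        return isinstance(exc, RETRYABLE_EXCEPTIONS)
    # 5xx may recover on its own; 4xx is a client mistake and will fail again
    return status_code is not None and status_code >= 500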


Circuit Breaker Pattern

Prevent cascading failures by stopping calls to a failing service.

States

State | Behavior
closed | Requests pass through; failures are counted
open | Calls are rejected immediately until a cool-down period elapses
half_open | One trial request is allowed; success closes the circuit, failure reopens it

Implementation

python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds to wait before probing again
        self.state = "closed"
        self.last_failure_time = None

    def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                # Cool-down elapsed: let one probe request through
                self.state = "half_open"
            else:
                raise CircuitOpenError("Circuit is open")

        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "closed"

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.failure_count >= self.failure_threshold:
            self.state = "open"

Circuit Breaker in Action
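
A minimal usage sketch of the class above; flaky_payment_service is a hypothetical outbound call:

python
breaker = CircuitBreaker(failure_threshold=3, timeout=30)

def charge(order_id):
    try:
        # Route the call through the breaker so repeated failures trip it
        return breaker.call(lambda: flaky_payment_service(order_id))
    except CircuitOpenError:
        # Dependency is known to be down - degrade instead of waiting on it
        return {"status": "queued", "order_id": order_id}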


Bulkhead Pattern

Isolate components so that failure in one doesn't affect others.

Pattern | Purpose
Connection pool per service | One slow DB doesn't block others
Separate thread pools | Isolate computation
Dedicated instances | Complete isolation
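
A minimal async sketch of the first row's idea, where run_db_query and run_search are hypothetical client calls: a semaphore per dependency caps its concurrency, so one slow dependency can't exhaust capacity reserved for the others:

python
import asyncio

# Separate concurrency budgets per dependency (illustrative sizes)
DB_SLOTS = asyncio.Semaphore(20)
SEARCH_SLOTS = asyncio.Semaphore(5)

async def query_db(sql):
    async with DB_SLOTS:  # slow search traffic can't consume these slots
        return await run_db_query(sql)  # hypothetical driver call

async def query_search(q):
    async with SEARCH_SLOTS:  # search saturates only its own bulkhead
        return await run_search(q)  # hypothetical client call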

Fallback and Graceful Degradation

Fallback Strategies

Chain fallbacks from most to least specific: try the primary service first, then a cached copy of its last result, then a generic default that works for any user.

Code Example

python
# ml_service, MLServiceError, cache, and popular_products stand in for
# your own clients and exceptions
def get_product_recommendations(user_id):
    try:
        # Try the primary service
        return ml_service.get_recommendations(user_id)
    except MLServiceError:
        # Fallback 1: last known good result from cache
        cached = cache.get(f"recs:{user_id}")
        if cached:
            return cached

        # Fallback 2: generic popular products - works for any user
        return popular_products()

Graceful Degradation Matrix

Component Down | Degradation
Recommendations | Show popular items
Reviews | Hide review section
Search | Show cached results
Payments | Queue for later
Analytics | Batch process later
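
A sketch of the "Reviews" row, with product_service, reviews_service, and render as hypothetical stand-ins: catch the optional component's failure at the call site and serve the page without it:

python
def get_product_page(product_id):
    product = product_service.get(product_id)  # core data - let failures propagate

    try:
        reviews = reviews_service.get(product_id)  # optional component
    except Exception:
        reviews = None  # degrade: hide the review section

    return render(product=product, reviews=reviews)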

Health Checks

Health checks let orchestration systems know when to route traffic to a service and when to restart it.

Health Check Levels

Level | Checks | Use Case
Liveness | Is the process alive? | Kubernetes liveness probe
Readiness | Can it handle traffic? | Kubernetes readiness probe
Startup | Is initialization complete? | Kubernetes startup probe

Health Check Implementation

python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def health():
    # check_database/check_cache/check_external_api are your own probes,
    # each returning True when the dependency is reachable
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "external_api": check_external_api(),
    }

    all_healthy = all(checks.values())

    # Return 503 when unhealthy so load balancers stop routing traffic here
    # (FastAPI ignores a bare tuple - set the status code explicitly)
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={
            "status": "healthy" if all_healthy else "unhealthy",
            "checks": checks,
        },
    )

Chaos Engineering

Test your reliability by deliberately breaking things.

Chaos Tool | What It Does
Chaos Monkey | Kills random instances
Litmus | Kubernetes-native chaos experiments
Gremlin | Targeted failure injection
AWS Fault Injection Simulator | Cloud infrastructure faults

Start small: Test in staging first. Have rollback plans. Start with non-critical services. Gradually move to production with proper monitoring.


Reliability Checklist

Pattern | When to Use
Timeouts | Always - prevent infinite waits
Retries | Transient failures, idempotent operations
Circuit Breaker | Calls to external services that might fail
Bulkhead | Isolate critical resources
Fallback | Degrade gracefully when the primary fails
Health Checks | Help orchestrators manage your service
Chaos Engineering | Test your reliability assumptions

What to Remember for Interviews

  1. Timeouts are essential: Always set them; infinite waits are never acceptable
  2. Retry with backoff: Exponential backoff + jitter prevents thundering herd
  3. Circuit breakers: Protect against cascading failures
  4. Graceful degradation: Have fallback strategies for every critical path
  5. Test for failure: Use chaos engineering to verify reliability

Practice: For any system design, ask: "What happens if X fails?" Then implement the appropriate reliability pattern.