Distributed Transactions & Sagas: Managing Consistency Across Services
Learn why traditional ACID transactions don't work in distributed systems, master the Saga pattern (choreography vs orchestration), implement compensating transactions, and understand idempotency for exactly-once processing.
Introduction: The Problem We're Solving
Let me start with a scenario I've encountered more times than I can count:
You're building an e-commerce platform. When a customer places an order, you need to:
- Reserve inventory in the Inventory Service
- Process payment via the Payment Service
- Create the order in the Order Service
- Trigger fulfillment in the Fulfillment Service
Simple, right? Each service has its own database. They're independent microservices. But now you need to ensure that all four operations either complete successfully or all roll back. If payment fails, you need to release the inventory. If fulfillment fails, you need to refund the payment.
This is the distributed transaction problem, and it's one of the hardest problems in distributed systems.
Why this matters: Every payment system, every e-commerce platform, every system that coordinates across multiple services faces this problem. Understanding how to solve it—or more accurately, how to work around it—separates good system designers from great ones.
Why Traditional ACID Transactions Don't Work
The Database Transaction Model
In a single database, transactions are straightforward. The database guarantees ACID properties:
- Atomicity: All operations succeed or all fail together
- Consistency: The database moves from one valid state to another
- Isolation: Concurrent transactions don't interfere with each other
- Durability: Once committed, the transaction survives crashes
The Distributed Problem
When your transaction spans multiple databases on different machines, possibly in different data centers, ACID breaks down:
- There's no single database to commit. Each service has its own database.
- Network can fail. Halfway through your transaction, the network might partition.
- Clocks differ. Without synchronized clocks, "what happened first" is ambiguous.
- Different administrators. Each database is managed by a different team, possibly with different SLAs.
Why Two-Phase Commit (2PC) Isn't the Answer
You might have heard of Two-Phase Commit (2PC). It's the traditional solution for distributed transactions. Let me explain why it doesn't scale.
How 2PC Works
Phase 1 (Prepare): The transaction manager asks all participants to prepare (lock their resources, be ready to commit).
Phase 2 (Commit): If all participants say "READY," the transaction manager tells them all to commit. If any says "ABORT," everyone rolls back.
Why 2PC Has Problems
| Problem | Impact |
|---|---|
| Blocking | If coordinator crashes after participants prepare, they must wait indefinitely |
| Coordinator is SPOF | Transaction cannot complete without coordinator |
| All-or-nothing | If any participant fails to prepare, entire transaction fails |
| Latency | Adds 3-4 round trips to every transaction |
| Doesn't handle all failures | In async networks, can end in inconsistent state |
Practical advice from experience: I've worked with systems that tried to use 2PC. They all eventually abandoned it because the latency, complexity, and failure modes weren't worth it. The industry consensus is: don't use 2PC for high-throughput distributed systems.
The Saga Pattern: An Alternative Approach
The Saga pattern, introduced in 1987 by Hector Garcia-Molina and Kenneth Salem, takes a fundamentally different approach.
The Core Idea
Instead of maintaining ACID properties across distributed services, a saga breaks a distributed transaction into a series of local transactions, each with a corresponding compensating transaction to undo its effects.
If any local transaction fails, the compensating transactions for all previously completed local transactions are executed to roll back the changes.
Key Properties of Sagas
- No distributed locks. Each service handles its own data with local ACID transactions.
- Eventual consistency. If compensation completes, the system returns to a consistent state.
- Asynchronous. Services communicate via events; there's no blocking coordinator.
- Compensatable. Every operation must have a defined undo operation.
Trade-offs vs ACID Transactions
| Aspect | ACID Transaction | Saga |
|---|---|---|
| Consistency | Strong (immediate) | Eventual (via compensation) |
| Isolation | Full isolation | Limited (each step visible) |
| Latency | Higher (coordination overhead) | Lower (local transactions) |
| Availability | Lower (distributed locks) | Higher (no distributed locks) |
| Complexity | Database handles it | Application must define compensations |
Two Approaches to Sagas: Choreography vs Orchestration
There are two ways to implement sagas, and understanding when to use each is crucial.
Choreography: Services Communicate via Events
In choreography, each service publishes events when its local transaction completes, and subscribes to events to know what to do next.
When to Use Choreography
| Use When | Avoid When |
|---|---|
| Few steps (2-5) | Many steps (>5) |
| Linear, sequential flow | Complex branching logic |
| Small teams | Multiple teams coordinating |
| Simple monitoring needs | Need comprehensive visibility |
| Quick iteration | Need sophisticated recovery |
Pros
- Loose coupling: Services don't know about each other
- Simple to start: No central coordinator
- Good for simple workflows: When steps are few and linear
Cons
- Hidden dependencies: Compensation logic can spread across services
- Difficult to track: No single place to see the saga state
- Testing complexity: Hard to test end-to-end saga behavior
Orchestration: A Central Coordinator Manages the Saga
In orchestration, there's a dedicated saga orchestrator that tells each service what to do and handles compensation logic in one place.
When to Use Orchestration
| Use When | Avoid When |
|---|---|
| Many steps (over 5) | Few steps (under 5) |
| Complex branching | Linear, simple flows |
| Multiple teams | Small, co-located teams |
| Comprehensive monitoring needed | Basic monitoring OK |
| Sophisticated recovery needed | Simple recovery OK |
Pros
- Clear ownership: Saga logic lives in one place
- Easy to test: Test the orchestrator in isolation
- Easy to extend: Add new steps to the orchestrator
- Better visibility: One place to see saga state
Cons
- Central point of failure: If orchestrator dies, need recovery mechanism
- More infrastructure: Need to build/configure an orchestrator
- Potential tight coupling: Orchestrator knows about all services
Choosing Between Choreography and Orchestration
| Factor | Choreography | Orchestration |
|---|---|---|
| Number of steps | Few (under 5) | Many (over 5) |
| Workflow complexity | Linear, sequential | Branching, conditional |
| Team structure | Small, co-located | Multiple teams |
| Monitoring needs | Basic | Comprehensive |
| Recovery requirements | Simple | Sophisticated |
My rule of thumb: If your saga has more than 4-5 steps, or if the workflow is complex with branches, use orchestration. For simple linear workflows with 2-3 steps, choreography is cleaner.
Idempotency: The Key to Exactly-Once Semantics
Here's a truth I've learned through painful experience: in distributed systems, every operation can be executed more than once.
Why?
- Network timeouts → client retries → operation executes twice
- Consumer crashes → message redelivered → operation executes twice
- Circuit breaker opens → retry after recovery → operation executes twice
The solution? Make every operation idempotent. An idempotent operation produces the same result whether executed once or multiple times.
Patterns for Idempotency
1. Idempotency Keys
Each request includes a unique key. The service checks if that key has been processed before:
Flow:
- Client sends request with unique idempotency key
- Service checks database for existing record with that key
- If found, return existing result (don't process again)
- If not found, process request and store result with key
2. Database Constraints
Use unique constraints to prevent duplicate operations:
| Constraint | Prevents |
|---|---|
| Unique index on idempotency_key | Duplicate payments |
| Unique index on order_id | Duplicate orders |
| Unique constraint per entity | One operation per entity |
3. State-Based Idempotency
Design operations to be naturally idempotent:
- GET operations: Always safe, never changes state
- PUT/replace operations: Replace state, not append
- DELETE operations: Already deleted returns success
Idempotency in Sagas
Every saga step should be idempotent. Here's why:
Without idempotency: The retry could reserve inventory twice, process payment twice, etc.
With idempotency: Even with retries and duplicates, the end result is the same.
Handling Failures in Sagas
This is where things get tricky, and where I've seen many systems struggle.
Types of Failures
- Transient failures: Network blip, brief service unavailability. Retry usually fixes these.
- Permanent failures: Business logic rejection (e.g., insufficient inventory). Requires compensation.
- Unknown failures: We don't know if the operation succeeded or failed. This is the hardest case.
The Challenge of Unknown Failures
Strategies for Handling Unknown Failures
1. Make Operations Explicitly Idempotent
Always design your operations so that executing them twice produces the same result as executing once.
2. Use the "Tell, Don't Ask" Pattern
Instead of asking "did this succeed?" and then deciding, just tell services what to do, and have them record what they've done.
3. Implement Compensating Transaction Idempotency
Your compensating transactions must also be idempotent:
| Operation | Idempotency Strategy |
|---|---|
| Reserve inventory | Check if already reserved by order_id |
| Process payment | Store payment_id, return existing on retry |
| Refund | Check if already refunded, return existing |
| Send email | Check if already sent to idempotency key |
Advanced Saga Patterns
Nested Sagas
Just like you can nest database transactions, you can nest sagas. A nested saga is where one saga step is itself a saga.
Saga with Outbox Pattern
To ensure reliable event publishing, use the transactional outbox pattern:
The Problem: When a service writes to its database and then publishes an event, if it crashes after writing but before publishing, the event is lost.
The Solution: Write both the business data AND an outbox entry in the same local transaction. A separate process polls the outbox and publishes events.
Compensating Transaction Challenges
Challenge 1: Non-Compensatable Operations
Some operations can't be compensated:
- Sending an email
- Charging a credit card
- Publishing a public tweet
Solutions:
- Make them the last step. Only execute if all previous steps are committed.
- Make them idempotent. Store the result and return it on retry.
- Accept the limitation. Document that certain operations can't be rolled back.
- Use a human process. For critical operations, involve a human for reconciliation.
Challenge 2: Partial Compensation Failures
What if compensation itself fails?
Solutions:
- Retry with backoff. Most compensation failures are transient.
- Dead letter queue. After N failed retries, move to DLQ for manual intervention.
- Saga state persistence. Store saga state in durable storage so you can recover.
- Human intervention. Some failures require human operators to resolve.
What to Remember for Interviews
- Why 2PC doesn't scale: Understand the blocking, coordinator SPOF, and latency issues.
- Saga vs ACID: Know the trade-offs: strong consistency vs eventual consistency.
- Choreography vs Orchestration: Be able to explain when to use each, with examples.
- Idempotency: This is critical in practice. Know multiple patterns for achieving it.
- Compensation challenges: Understand partial failures and how to handle them.
- Real-world systems: Be familiar with how Saga is used in Amazon, Uber, and other companies.
Interview tip: When designing any system with multiple services, proactively address the consistency problem. Say "we'll use the Saga pattern with compensating transactions" and explain why ACID transactions across services aren't feasible. This shows deep understanding.
Further Reading
-
"Saga Pattern" — Microservices.io
The definitive patterns reference for sagas. -
"Implementing the Saga Pattern" — AWS Architecture Blog
Practical guide with AWS-specific implementation details. -
"Life beyond Distributed Transactions" — Pat Helland
Seminal paper on why distributed transactions don't work and what to do instead. -
"Amazon's Dynamo Paper"
How Amazon handles eventual consistency in practice.
Final thought: After decades in this field, I've come to view the Saga pattern not as a compromise but as a fundamental shift in how we think about consistency. Strong consistency is expensive and available. Eventual consistency, handled carefully with sagas, is cheap and highly available. The key is understanding when each trade-off is appropriate.