Distributed Systems

Distributed Transactions & Sagas: Managing Consistency Across Services

Learn why traditional ACID transactions don't work in distributed systems, master the Saga pattern (choreography vs orchestration), implement compensating transactions, and understand idempotency for exactly-once processing.

32 min readdistributed transactionssaga2PCcompensating transactionseventual consistencyidempotency

Introduction: The Problem We're Solving

Let me start with a scenario I've encountered more times than I can count:

You're building an e-commerce platform. When a customer places an order, you need to:

  1. Reserve inventory in the Inventory Service
  2. Process payment via the Payment Service
  3. Create the order in the Order Service
  4. Trigger fulfillment in the Fulfillment Service

Simple, right? Each service has its own database. They're independent microservices. But now you need to ensure that all four operations either complete successfully or all roll back. If payment fails, you need to release the inventory. If fulfillment fails, you need to refund the payment.

This is the distributed transaction problem, and it's one of the hardest problems in distributed systems.

Why this matters: Every payment system, every e-commerce platform, every system that coordinates across multiple services faces this problem. Understanding how to solve it—or more accurately, how to work around it—separates good system designers from great ones.


Why Traditional ACID Transactions Don't Work

The Database Transaction Model

In a single database, transactions are straightforward. The database guarantees ACID properties:

  • Atomicity: All operations succeed or all fail together
  • Consistency: The database moves from one valid state to another
  • Isolation: Concurrent transactions don't interfere with each other
  • Durability: Once committed, the transaction survives crashes

The Distributed Problem

When your transaction spans multiple databases on different machines, possibly in different data centers, ACID breaks down:

  1. There's no single database to commit. Each service has its own database.
  2. Network can fail. Halfway through your transaction, the network might partition.
  3. Clocks differ. Without synchronized clocks, "what happened first" is ambiguous.
  4. Different administrators. Each database is managed by a different team, possibly with different SLAs.

Why Two-Phase Commit (2PC) Isn't the Answer

You might have heard of Two-Phase Commit (2PC). It's the traditional solution for distributed transactions. Let me explain why it doesn't scale.

How 2PC Works

Phase 1 (Prepare): The transaction manager asks all participants to prepare (lock their resources, be ready to commit).

Phase 2 (Commit): If all participants say "READY," the transaction manager tells them all to commit. If any says "ABORT," everyone rolls back.

Why 2PC Has Problems

ProblemImpact
BlockingIf coordinator crashes after participants prepare, they must wait indefinitely
Coordinator is SPOFTransaction cannot complete without coordinator
All-or-nothingIf any participant fails to prepare, entire transaction fails
LatencyAdds 3-4 round trips to every transaction
Doesn't handle all failuresIn async networks, can end in inconsistent state
⚠️

Practical advice from experience: I've worked with systems that tried to use 2PC. They all eventually abandoned it because the latency, complexity, and failure modes weren't worth it. The industry consensus is: don't use 2PC for high-throughput distributed systems.


The Saga Pattern: An Alternative Approach

The Saga pattern, introduced in 1987 by Hector Garcia-Molina and Kenneth Salem, takes a fundamentally different approach.

The Core Idea

Instead of maintaining ACID properties across distributed services, a saga breaks a distributed transaction into a series of local transactions, each with a corresponding compensating transaction to undo its effects.

If any local transaction fails, the compensating transactions for all previously completed local transactions are executed to roll back the changes.

Key Properties of Sagas

  1. No distributed locks. Each service handles its own data with local ACID transactions.
  2. Eventual consistency. If compensation completes, the system returns to a consistent state.
  3. Asynchronous. Services communicate via events; there's no blocking coordinator.
  4. Compensatable. Every operation must have a defined undo operation.

Trade-offs vs ACID Transactions

AspectACID TransactionSaga
ConsistencyStrong (immediate)Eventual (via compensation)
IsolationFull isolationLimited (each step visible)
LatencyHigher (coordination overhead)Lower (local transactions)
AvailabilityLower (distributed locks)Higher (no distributed locks)
ComplexityDatabase handles itApplication must define compensations

Two Approaches to Sagas: Choreography vs Orchestration

There are two ways to implement sagas, and understanding when to use each is crucial.

Choreography: Services Communicate via Events

In choreography, each service publishes events when its local transaction completes, and subscribes to events to know what to do next.

When to Use Choreography

Use WhenAvoid When
Few steps (2-5)Many steps (>5)
Linear, sequential flowComplex branching logic
Small teamsMultiple teams coordinating
Simple monitoring needsNeed comprehensive visibility
Quick iterationNeed sophisticated recovery

Pros

  • Loose coupling: Services don't know about each other
  • Simple to start: No central coordinator
  • Good for simple workflows: When steps are few and linear

Cons

  • Hidden dependencies: Compensation logic can spread across services
  • Difficult to track: No single place to see the saga state
  • Testing complexity: Hard to test end-to-end saga behavior

Orchestration: A Central Coordinator Manages the Saga

In orchestration, there's a dedicated saga orchestrator that tells each service what to do and handles compensation logic in one place.

When to Use Orchestration

Use WhenAvoid When
Many steps (over 5)Few steps (under 5)
Complex branchingLinear, simple flows
Multiple teamsSmall, co-located teams
Comprehensive monitoring neededBasic monitoring OK
Sophisticated recovery neededSimple recovery OK

Pros

  • Clear ownership: Saga logic lives in one place
  • Easy to test: Test the orchestrator in isolation
  • Easy to extend: Add new steps to the orchestrator
  • Better visibility: One place to see saga state

Cons

  • Central point of failure: If orchestrator dies, need recovery mechanism
  • More infrastructure: Need to build/configure an orchestrator
  • Potential tight coupling: Orchestrator knows about all services

Choosing Between Choreography and Orchestration

FactorChoreographyOrchestration
Number of stepsFew (under 5)Many (over 5)
Workflow complexityLinear, sequentialBranching, conditional
Team structureSmall, co-locatedMultiple teams
Monitoring needsBasicComprehensive
Recovery requirementsSimpleSophisticated

My rule of thumb: If your saga has more than 4-5 steps, or if the workflow is complex with branches, use orchestration. For simple linear workflows with 2-3 steps, choreography is cleaner.


Idempotency: The Key to Exactly-Once Semantics

Here's a truth I've learned through painful experience: in distributed systems, every operation can be executed more than once.

Why?

  • Network timeouts → client retries → operation executes twice
  • Consumer crashes → message redelivered → operation executes twice
  • Circuit breaker opens → retry after recovery → operation executes twice

The solution? Make every operation idempotent. An idempotent operation produces the same result whether executed once or multiple times.

Patterns for Idempotency

1. Idempotency Keys

Each request includes a unique key. The service checks if that key has been processed before:

Flow:

  1. Client sends request with unique idempotency key
  2. Service checks database for existing record with that key
  3. If found, return existing result (don't process again)
  4. If not found, process request and store result with key

2. Database Constraints

Use unique constraints to prevent duplicate operations:

ConstraintPrevents
Unique index on idempotency_keyDuplicate payments
Unique index on order_idDuplicate orders
Unique constraint per entityOne operation per entity

3. State-Based Idempotency

Design operations to be naturally idempotent:

  • GET operations: Always safe, never changes state
  • PUT/replace operations: Replace state, not append
  • DELETE operations: Already deleted returns success

Idempotency in Sagas

Every saga step should be idempotent. Here's why:

Without idempotency: The retry could reserve inventory twice, process payment twice, etc.

With idempotency: Even with retries and duplicates, the end result is the same.


Handling Failures in Sagas

This is where things get tricky, and where I've seen many systems struggle.

Types of Failures

  1. Transient failures: Network blip, brief service unavailability. Retry usually fixes these.
  2. Permanent failures: Business logic rejection (e.g., insufficient inventory). Requires compensation.
  3. Unknown failures: We don't know if the operation succeeded or failed. This is the hardest case.

The Challenge of Unknown Failures

Strategies for Handling Unknown Failures

1. Make Operations Explicitly Idempotent

Always design your operations so that executing them twice produces the same result as executing once.

2. Use the "Tell, Don't Ask" Pattern

Instead of asking "did this succeed?" and then deciding, just tell services what to do, and have them record what they've done.

3. Implement Compensating Transaction Idempotency

Your compensating transactions must also be idempotent:

OperationIdempotency Strategy
Reserve inventoryCheck if already reserved by order_id
Process paymentStore payment_id, return existing on retry
RefundCheck if already refunded, return existing
Send emailCheck if already sent to idempotency key

Advanced Saga Patterns

Nested Sagas

Just like you can nest database transactions, you can nest sagas. A nested saga is where one saga step is itself a saga.

Saga with Outbox Pattern

To ensure reliable event publishing, use the transactional outbox pattern:

The Problem: When a service writes to its database and then publishes an event, if it crashes after writing but before publishing, the event is lost.

The Solution: Write both the business data AND an outbox entry in the same local transaction. A separate process polls the outbox and publishes events.

Compensating Transaction Challenges

Challenge 1: Non-Compensatable Operations

Some operations can't be compensated:

  • Sending an email
  • Charging a credit card
  • Publishing a public tweet

Solutions:

  1. Make them the last step. Only execute if all previous steps are committed.
  2. Make them idempotent. Store the result and return it on retry.
  3. Accept the limitation. Document that certain operations can't be rolled back.
  4. Use a human process. For critical operations, involve a human for reconciliation.

Challenge 2: Partial Compensation Failures

What if compensation itself fails?

Solutions:

  1. Retry with backoff. Most compensation failures are transient.
  2. Dead letter queue. After N failed retries, move to DLQ for manual intervention.
  3. Saga state persistence. Store saga state in durable storage so you can recover.
  4. Human intervention. Some failures require human operators to resolve.

What to Remember for Interviews

  1. Why 2PC doesn't scale: Understand the blocking, coordinator SPOF, and latency issues.
  2. Saga vs ACID: Know the trade-offs: strong consistency vs eventual consistency.
  3. Choreography vs Orchestration: Be able to explain when to use each, with examples.
  4. Idempotency: This is critical in practice. Know multiple patterns for achieving it.
  5. Compensation challenges: Understand partial failures and how to handle them.
  6. Real-world systems: Be familiar with how Saga is used in Amazon, Uber, and other companies.

Interview tip: When designing any system with multiple services, proactively address the consistency problem. Say "we'll use the Saga pattern with compensating transactions" and explain why ACID transactions across services aren't feasible. This shows deep understanding.


Further Reading

  1. "Saga Pattern" — Microservices.io
    The definitive patterns reference for sagas.

  2. "Implementing the Saga Pattern" — AWS Architecture Blog
    Practical guide with AWS-specific implementation details.

  3. "Life beyond Distributed Transactions" — Pat Helland
    Seminal paper on why distributed transactions don't work and what to do instead.

  4. "Amazon's Dynamo Paper"
    How Amazon handles eventual consistency in practice.

💡

Final thought: After decades in this field, I've come to view the Saga pattern not as a compromise but as a fundamental shift in how we think about consistency. Strong consistency is expensive and available. Eventual consistency, handled carefully with sagas, is cheap and highly available. The key is understanding when each trade-off is appropriate.