Distributed Transactions & Sagas: Managing Consistency Across Services

Learn why traditional ACID transactions don't work in distributed systems, master the Saga pattern (choreography vs orchestration), implement compensating transactions, and understand idempotency for exactly-once processing.

distributed transactionssaga2PCcompensating transactionseventual consistencyidempotency

Introduction: The Problem We're Solving

Let me start with a scenario I've encountered more times than I can count:

You're building an e-commerce platform. When a customer places an order, you need to:

Reserve inventory in the Inventory Service
Process payment via the Payment Service
Create the order in the Order Service
Trigger fulfillment in the Fulfillment Service

Simple, right? Each service has its own database. They're independent microservices. But now you need to ensure that all four operations either complete successfully or all roll back. If payment fails, you need to release the inventory. If fulfillment fails, you need to refund the payment.

This is the distributed transaction problem, and it's one of the hardest problems in distributed systems.

✅

Why this matters: Every payment system, every e-commerce platform, every system that coordinates across multiple services faces this problem. Understanding how to solve it—or more accurately, how to work around it—separates good system designers from great ones.

Why Traditional ACID Transactions Don't Work

The Database Transaction Model

In a single database, transactions are straightforward. The database guarantees ACID properties:

Atomicity: All operations succeed or all fail together
Consistency: The database moves from one valid state to another
Isolation: Concurrent transactions don't interfere with each other
Durability: Once committed, the transaction survives crashes

The Distributed Problem

When your transaction spans multiple databases on different machines, possibly in different data centers, ACID breaks down:

There's no single database to commit. Each service has its own database.
Network can fail. Halfway through your transaction, the network might partition.
Clocks differ. Without synchronized clocks, "what happened first" is ambiguous.
Different administrators. Each database is managed by a different team, possibly with different SLAs.

Why Two-Phase Commit (2PC) Isn't the Answer

You might have heard of Two-Phase Commit (2PC). It's the traditional solution for distributed transactions. Let me explain why it doesn't scale.

How 2PC Works

Phase 1 (Prepare): The transaction manager asks all participants to prepare (lock their resources, be ready to commit).

Phase 2 (Commit): If all participants say "READY," the transaction manager tells them all to commit. If any says "ABORT," everyone rolls back.

Why 2PC Has Problems

Problem	Impact
Blocking	If coordinator crashes after participants prepare, they must wait indefinitely
Coordinator is SPOF	Transaction cannot complete without coordinator
All-or-nothing	If any participant fails to prepare, entire transaction fails
Latency	Adds 3-4 round trips to every transaction
Doesn't handle all failures	In async networks, can end in inconsistent state

⚠️

Practical advice from experience: I've worked with systems that tried to use 2PC. They all eventually abandoned it because the latency, complexity, and failure modes weren't worth it. The industry consensus is: don't use 2PC for high-throughput distributed systems.

The Saga Pattern: An Alternative Approach

The Saga pattern, introduced in 1987 by Hector Garcia-Molina and Kenneth Salem, takes a fundamentally different approach.

The Core Idea

Instead of maintaining ACID properties across distributed services, a saga breaks a distributed transaction into a series of local transactions, each with a corresponding compensating transaction to undo its effects.

If any local transaction fails, the compensating transactions for all previously completed local transactions are executed to roll back the changes.

Key Properties of Sagas

No distributed locks. Each service handles its own data with local ACID transactions.
Eventual consistency. If compensation completes, the system returns to a consistent state.
Asynchronous. Services communicate via events; there's no blocking coordinator.
Compensatable. Every operation must have a defined undo operation.

Trade-offs vs ACID Transactions

Aspect	ACID Transaction	Saga
Consistency	Strong (immediate)	Eventual (via compensation)
Isolation	Full isolation	Limited (each step visible)
Latency	Higher (coordination overhead)	Lower (local transactions)
Availability	Lower (distributed locks)	Higher (no distributed locks)
Complexity	Database handles it	Application must define compensations

Two Approaches to Sagas: Choreography vs Orchestration

There are two ways to implement sagas, and understanding when to use each is crucial.

Choreography: Services Communicate via Events

In choreography, each service publishes events when its local transaction completes, and subscribes to events to know what to do next.

When to Use Choreography

Use When	Avoid When
Few steps (2-5)	Many steps (>5)
Linear, sequential flow	Complex branching logic
Small teams	Multiple teams coordinating
Simple monitoring needs	Need comprehensive visibility
Quick iteration	Need sophisticated recovery

Pros

Loose coupling: Services don't know about each other
Simple to start: No central coordinator
Good for simple workflows: When steps are few and linear

Cons

Hidden dependencies: Compensation logic can spread across services
Difficult to track: No single place to see the saga state
Testing complexity: Hard to test end-to-end saga behavior

Orchestration: A Central Coordinator Manages the Saga

In orchestration, there's a dedicated saga orchestrator that tells each service what to do and handles compensation logic in one place.

When to Use Orchestration

Use When	Avoid When
Many steps (over 5)	Few steps (under 5)
Complex branching	Linear, simple flows
Multiple teams	Small, co-located teams
Comprehensive monitoring needed	Basic monitoring OK
Sophisticated recovery needed	Simple recovery OK

Pros

Clear ownership: Saga logic lives in one place
Easy to test: Test the orchestrator in isolation
Easy to extend: Add new steps to the orchestrator
Better visibility: One place to see saga state

Cons

Central point of failure: If orchestrator dies, need recovery mechanism
More infrastructure: Need to build/configure an orchestrator
Potential tight coupling: Orchestrator knows about all services

Choosing Between Choreography and Orchestration

Factor	Choreography	Orchestration
Number of steps	Few (under 5)	Many (over 5)
Workflow complexity	Linear, sequential	Branching, conditional
Team structure	Small, co-located	Multiple teams
Monitoring needs	Basic	Comprehensive
Recovery requirements	Simple	Sophisticated

✅

My rule of thumb: If your saga has more than 4-5 steps, or if the workflow is complex with branches, use orchestration. For simple linear workflows with 2-3 steps, choreography is cleaner.

Idempotency: The Key to Exactly-Once Semantics

Here's a truth I've learned through painful experience: in distributed systems, every operation can be executed more than once.

Why?

Network timeouts → client retries → operation executes twice
Consumer crashes → message redelivered → operation executes twice
Circuit breaker opens → retry after recovery → operation executes twice

The solution? Make every operation idempotent. An idempotent operation produces the same result whether executed once or multiple times.

Patterns for Idempotency

1. Idempotency Keys

Each request includes a unique key. The service checks if that key has been processed before:

Flow:

Client sends request with unique idempotency key
Service checks database for existing record with that key
If found, return existing result (don't process again)
If not found, process request and store result with key

2. Database Constraints

Use unique constraints to prevent duplicate operations:

Constraint	Prevents
Unique index on idempotency_key	Duplicate payments
Unique index on order_id	Duplicate orders
Unique constraint per entity	One operation per entity

3. State-Based Idempotency

Design operations to be naturally idempotent:

GET operations: Always safe, never changes state
PUT/replace operations: Replace state, not append
DELETE operations: Already deleted returns success

Idempotency in Sagas

Every saga step should be idempotent. Here's why:

Without idempotency: The retry could reserve inventory twice, process payment twice, etc.

With idempotency: Even with retries and duplicates, the end result is the same.

Handling Failures in Sagas

This is where things get tricky, and where I've seen many systems struggle.

Types of Failures

Transient failures: Network blip, brief service unavailability. Retry usually fixes these.
Permanent failures: Business logic rejection (e.g., insufficient inventory). Requires compensation.
Unknown failures: We don't know if the operation succeeded or failed. This is the hardest case.

The Challenge of Unknown Failures

Strategies for Handling Unknown Failures

1. Make Operations Explicitly Idempotent

Always design your operations so that executing them twice produces the same result as executing once.

2. Use the "Tell, Don't Ask" Pattern

Instead of asking "did this succeed?" and then deciding, just tell services what to do, and have them record what they've done.

3. Implement Compensating Transaction Idempotency

Your compensating transactions must also be idempotent:

Operation	Idempotency Strategy
Reserve inventory	Check if already reserved by order_id
Process payment	Store payment_id, return existing on retry
Refund	Check if already refunded, return existing
Send email	Check if already sent to idempotency key

Advanced Saga Patterns

Nested Sagas

Just like you can nest database transactions, you can nest sagas. A nested saga is where one saga step is itself a saga.

Saga with Outbox Pattern

To ensure reliable event publishing, use the transactional outbox pattern:

The Problem: When a service writes to its database and then publishes an event, if it crashes after writing but before publishing, the event is lost.

The Solution: Write both the business data AND an outbox entry in the same local transaction. A separate process polls the outbox and publishes events.

Compensating Transaction Challenges

Challenge 1: Non-Compensatable Operations

Some operations can't be compensated:

Sending an email
Charging a credit card
Publishing a public tweet

Solutions:

Make them the last step. Only execute if all previous steps are committed.
Make them idempotent. Store the result and return it on retry.
Accept the limitation. Document that certain operations can't be rolled back.
Use a human process. For critical operations, involve a human for reconciliation.

Challenge 2: Partial Compensation Failures

What if compensation itself fails?

Solutions:

Retry with backoff. Most compensation failures are transient.
Dead letter queue. After N failed retries, move to DLQ for manual intervention.
Saga state persistence. Store saga state in durable storage so you can recover.
Human intervention. Some failures require human operators to resolve.

What to Remember for Interviews

Why 2PC doesn't scale: Understand the blocking, coordinator SPOF, and latency issues.
Saga vs ACID: Know the trade-offs: strong consistency vs eventual consistency.
Choreography vs Orchestration: Be able to explain when to use each, with examples.
Idempotency: This is critical in practice. Know multiple patterns for achieving it.
Compensation challenges: Understand partial failures and how to handle them.
Real-world systems: Be familiar with how Saga is used in Amazon, Uber, and other companies.

✅

Interview tip: When designing any system with multiple services, proactively address the consistency problem. Say "we'll use the Saga pattern with compensating transactions" and explain why ACID transactions across services aren't feasible. This shows deep understanding.

Distributed Transactions & Sagas: Managing Consistency Across Services

Introduction: The Problem We're Solving

Why Traditional ACID Transactions Don't Work

The Database Transaction Model

The Distributed Problem

Why Two-Phase Commit (2PC) Isn't the Answer

How 2PC Works

Why 2PC Has Problems

The Saga Pattern: An Alternative Approach

The Core Idea

Key Properties of Sagas

Trade-offs vs ACID Transactions

Two Approaches to Sagas: Choreography vs Orchestration

Choreography: Services Communicate via Events

When to Use Choreography

Pros

Cons

Orchestration: A Central Coordinator Manages the Saga

When to Use Orchestration

Pros

Cons

Choosing Between Choreography and Orchestration

Idempotency: The Key to Exactly-Once Semantics

Patterns for Idempotency

1. Idempotency Keys

2. Database Constraints

3. State-Based Idempotency

Idempotency in Sagas

Handling Failures in Sagas

Types of Failures

The Challenge of Unknown Failures

Strategies for Handling Unknown Failures

1. Make Operations Explicitly Idempotent

2. Use the "Tell, Don't Ask" Pattern

3. Implement Compensating Transaction Idempotency

Advanced Saga Patterns

Nested Sagas

Saga with Outbox Pattern

Compensating Transaction Challenges

Challenge 1: Non-Compensatable Operations

Challenge 2: Partial Compensation Failures

What to Remember for Interviews

Further Reading