Core Algorithms

Paxos, Two-Phase Commit, and Distributed Consensus Protocols

Understand Paxos phases, Two-Phase Commit (2PC), and Three-Phase Commit (3PC). Learn when to use strong consistency vs eventual consistency in distributed systems.

Paxos2PC3PCdistributed-transactionsconsensuscoordinator

Introduction: Why We Need Distributed Agreement

I've spent years building systems that span multiple machines, and I've learned one hard truth: getting distributed components to agree on anything is fundamentally harder than it looks. When a single machine processes a transaction, it either commits or aborts—there's no ambiguity. But when you have five machines, each with its own view of the world, and the network between them can fail at any moment, agreement becomes a deep problem.

This is where distributed agreement protocols come in. They provide a formal mechanism for multiple participants to reach consensus despite failures. But not all protocols are equal—each makes different trade-offs between safety, liveness, and performance.

The landscape: Two-Phase Commit (2PC) is the simplest but most fragile. Three-Phase Commit (3PC) improves availability at the cost of complexity. Paxos is the gold standard for correctness but is notoriously hard to implement. Raft, which we covered in the previous article, is the practical alternative to Paxos.


Two-Phase Commit (2PC)

Two-Phase Commit is the oldest and most intuitive distributed transaction protocol. It coordinates multiple participants (databases, services, or resource managers) to either all commit or all abort a transaction.

The Participants

RoleResponsibility
CoordinatorManages the transaction lifecycle, decides commit or abort
ParticipantsExecute the transaction locally and report readiness

Phase 1: Prepare

The coordinator asks every participant if they can commit the transaction. Each participant writes its transaction changes to durable storage (but does not commit) and responds.

Phase 2: Commit or Abort

If all participants respond READY, the coordinator broadcasts COMMIT. If any participant responds ABORT or times out, the coordinator broadcasts ABORT.

What Can Go Wrong

The Blocking Problem

The most famous flaw of 2PC is the blocking problem. If the coordinator crashes after sending PREPARE but before sending COMMIT, participants are stuck holding locks, waiting indefinitely.

Failure ScenarioImpact
Coordinator crashes after Phase 1Participants block, holding locks indefinitely
Participant crashes in Phase 1Coordinator aborts, other participants release locks
Participant crashes in Phase 2Coordinator keeps retrying COMMIT
Network partitionParticipant appears crashed to coordinator

Coordinator Recovery

If the coordinator crashes, it must recover and consult its transaction log to determine the fate of each transaction. This means 2PC requires all coordinators to have durable storage. Even with recovery, participants may block for an unbounded time.

When 2PC Is Acceptable

Use CaseReason
Small number of participants (2-3)Less chance of failure, faster agreement
Short-lived transactionsMinimizes lock duration
Same datacenterLower latency, fewer network partitions
Low throughputDoesn't need to scale to thousands of transactions/second
⚠️

My experience: I've seen teams try to use 2PC in high-throughput, geo-distributed systems. It never ends well. The blocking problem, single point of failure, and latency overhead make it unsuitable for modern cloud-native architectures. If you need distributed transactions at scale, use the Saga pattern instead.


Three-Phase Commit (3PC)

Three-Phase Commit extends 2PC with a pre-commit phase designed to eliminate the blocking problem. It was proposed by Dale Skeen and Michael Stonebraker in 1981.

The Three Phases

How 3PC Avoids Blocking

In 2PC, participants enter an uncertain state after acknowledging PREPARE—they don't know if the coordinator will COMMIT or ABORT. In 3PC, the pre-commit phase gives participants additional information: they know that all other participants said YES. If the coordinator crashes now, participants can independently decide to commit because they know everyone was ready.

However, 3PC's improvement is limited. Under network partitions, 3PC can still block or even violate consistency.

Why 3PC Is Rarely Used

ProblemExplanation
Network partitions still break itIf the network splits, participants in different partitions may make different decisions
More messagesExtra phase adds latency and message overhead
Complex recoveryRequires timeouts and consensus among participants on recovery
Still not partition-tolerant3PC sacrifices partition tolerance for availability under certain failures

The fundamental issue is the CAP theorem: you cannot have both consistency and availability under network partitions. 3PC improves availability compared to 2PC in some scenarios but still cannot handle all partition scenarios safely.

Practical reality: In my career, I've never seen 3PC used in production at scale. The extra complexity doesn't buy enough benefit over 2PC, and the safety guarantees are still insufficient for most distributed systems. Sagas and Raft are far more practical.


Paxos: The Gold Standard

Paxos, introduced by Leslie Lamport in 1989, is the foundational consensus algorithm. It is mathematically proven correct under asynchronous network assumptions and crash failures.

Classic Paxos

Classic Paxos involves three roles:

RoleResponsibility
ProposerProposes values, drives the consensus process
AcceptorVotes on proposals, stores accepted values
LearnerLearns which value was chosen (not involved in consensus)

Phase 1: Prepare and Promise

The proposer picks a proposal number n and sends Prepare(n) to all acceptors. Each acceptor promises to:

  • Never accept a proposal numbered less than n
  • If it has already accepted a proposal, return the accepted value and its proposal number

If a majority responds, the proposer moves to Phase 2.

Phase 2: Accept and Accepted

The proposer sends Accept(n, v) where n is the proposal number and v is the value selected from Phase 1 responses. Each acceptor accepts if n is greater than or equal to any proposal number it has promised.

Why Paxos Is Correct

Paxos guarantees safety (at most one value is chosen) through the quorum intersection property. Any two majorities intersect—they share at least one acceptor. This means:

  1. If a value is chosen in Phase 2, at least one acceptor in any future Phase 1 majority will know about it
  2. The proposer will discover the chosen value and propagate it
  3. Conflicting values cannot both be chosen because they would need overlapping majorities with different promises

Multi-Paxos: Making Paxos Practical

Running Classic Paxos for every consensus decision is expensive—each round requires two phases and multiple message delays. Multi-Paxos optimizes by establishing a distinguished proposer (a leader) that can skip Phase 1 for subsequent rounds.

ProtocolMessage Delays per DecisionPhase 1 Cost
Classic Paxos2 round trips (prepare + accept)Every decision
Multi-Paxos with leader1 round trip (accept only)One-time leader election

The Cost of Understanding

Despite its mathematical elegance, Paxos is famously hard to implement. Lamport's original paper was written as a parable about a mythical Greek parliament. Multiple attempts to implement Paxos have resulted in subtle bugs.

⚠️

From the trenches: The Google Chubby team, who successfully implemented Paxos in production, noted that "there are significant gaps between the description of the Paxos algorithm and the needs of a real-world system." They had to extend Paxos for leader election, reconfiguration, and failure detection—concerns that Raft addresses directly in its design.


Comparison: 2PC vs 3PC vs Paxos vs Raft

Property2PC3PCPaxosRaft
Consensus typeAtomic commitmentAtomic commitmentValue consensusLog consensus
Fault toleranceCrash of coordinatorMore robustTolerates minority failuresTolerates minority failures
BlockingYes (coordinator crash)No (in most cases)NoNo
Message rounds2 (prepare + commit)3 (can + pre + do)2 (prepare + accept)2 (request + replicate)
Leader requiredYes (coordinator)Yes (coordinator)Optional (Multi-Paxos)Yes (required)
Network partition safeNoPartialYes (if leader in majority)Yes (if leader in majority)
ComplexityLowMediumHighMedium
Implementation difficultyLowMediumVery highModerate
PerformanceGood (small scale)Moderate (more messages)Good (with leader)Good
Production usageLegacy systems, small scaleRarely usedGoogle Chubby, Spanneretcd, Consul, CockroachDB

When to Use Each Protocol

Use 2PC When

  • You have 2-3 participants in the same datacenter
  • Transactions are short-lived and low-throughput
  • You can accept the blocking risk for simplicity
  • Legacy system integration requires it

Consider 3PC When

  • You need non-blocking atomic commitment
  • Network partitions are unlikely (single datacenter)
  • You want theoretical improvement over 2PC
  • (In practice, this is almost never the answer)

Use Paxos When

  • You need mathematically proven correctness
  • You are building a fundamental infrastructure component
  • You have the expertise to implement it correctly
  • You need maximum flexibility in protocol behavior

Use Raft When

  • You need a practical, implementable consensus algorithm
  • You want strong consistency with leader-driven replication
  • You need cluster membership changes
  • You want debuggable, understandable system behavior

Use the Saga Pattern When

  • You need distributed transactions at scale
  • Eventual consistency is acceptable
  • You can define compensating transactions
  • You prefer availability over strong consistency

My rule of thumb: For 90% of distributed systems, Raft is the right answer for strong consistency. For the remaining 10%, consider Saga for eventual consistency at scale. Paxos is the right choice only if you're building infrastructure like Google's Chubby or Spanner. 2PC is a legacy pattern that should be avoided in new systems.


Real-World Usage

Google Chubby (Paxos)

Google Chubby is a distributed lock service that uses Paxos for consensus. It provides coarse-grained locking for Google's distributed systems. Chubby's Paxos implementation predates Raft and demonstrated that Paxos could work in production—but at significant engineering cost.

Google Spanner (Paxos)

Spanner, Google's globally distributed SQL database, uses Paxos for consensus at the tablet level. Each tablet has a Paxos group with replicas across data centers. Spanner is notable for using TrueTime (synchronized clocks with bounded error) to provide external consistency.

Apache ZooKeeper (Zab)

ZooKeeper uses the Zab (ZooKeeper Atomic Broadcast) protocol, which is conceptually similar to Raft. It provides ordered log replication with leader-based consensus. Zab was developed independently and predates Raft.

etcd (Raft)

etcd is a distributed key-value store that uses Raft for consensus. It is the foundation of Kubernetes. etcd's Raft implementation is the reference implementation for many other systems.


Key Takeaways

  1. 2PC is the simplest but most fragile protocol. The blocking problem (coordinator crash leaving participants locked) makes it unsuitable for production distributed systems at scale.

  2. 3PC improves on 2PC by adding a pre-commit phase, but it still fails under network partitions and is rarely used in practice.

  3. Paxos is mathematically correct and provably safe under asynchronous network assumptions. But its subtlety and complexity make it notoriously hard to implement correctly.

  4. Multi-Paxos improves efficiency by establishing a leader that can skip the Prepare phase for subsequent rounds, achieving performance comparable to Raft.

  5. Raft is the practical alternative to Paxos—it provides equivalent safety guarantees while being far easier to understand and implement.

  6. Choose your protocol based on your system's trade-off requirements: Raft for strong consistency with moderate complexity, Saga for eventual consistency at scale, 2PC only for legacy integration, and Paxos only if you have the expertise and need mathematical provability.

Interview tip: When discussing distributed transactions in a system design interview, show that you understand the trade-offs between consensus protocols. Explain why 2PC blocks, how Paxos guarantees safety through quorum intersection, and why Raft is the practical choice for most systems. Demonstrating awareness of these trade-offs signals deep distributed systems knowledge.