Raft Consensus Algorithm: Leader Election, Log Replication, and Safety

Master Raft - the consensus algorithm behind etcd, Consul, and CockroachDB. Understand leader election, term-based log replication, commit indices, and how Raft prevents split-brain.

Raftconsensusleader-electionlog-replicationetcdsplit-brainquorum

Introduction: The Consensus Problem

I've been building distributed systems for over a decade, and if there's one problem that consistently humbles engineers, it's consensus. How do you get multiple machines to agree on something when the network can fail, machines can crash, and messages can be delayed or lost?

At first glance, this seems straightforward. Just have everyone vote, right? But consider this scenario: three servers receive a write request. One says yes, one says no, and one never responds. Did the write succeed? What if two servers accept but a third crashes before acknowledging? What if the network splits and each side of the partition elects its own leader?

This is the distributed consensus problem, and it is far more subtle than it appears. Raft solves it with a design philosophy centered on understandability.

✅

Why consensus matters: Without consensus, distributed systems can enter split-brain states where different clients see different data. This causes data loss, corruption, and hard-to-debug inconsistencies. Consensus is the foundation of strongly consistent replicated systems.

Why Raft Exists

Before Raft, the dominant consensus algorithm was Paxos, introduced by Leslie Lamport in 1989. Despite being correct, Paxos is notoriously difficult to understand and implement. The Raft paper, published by Diego Ongaro and John Ousterhout in 2014, explicitly targeted understandability as its primary design goal.

Raft's Design Goals

Goal	Description
Understandability	The algorithm must be easy to understand and reason about
Complete foundation	Provide a complete foundation for system implementation
Safe under all conditions	Guarantee safety even under failures and network partitions
Efficient for common cases	Optimize for the common case where the leader is stable

Raft achieves understandability through decomposition and state space reduction. It breaks consensus into three independent subproblems: leader election, log replication, and safety. In contrast, Paxos intertwines these concerns.

Raft vs Paxos at a Glance

Aspect	Raft	Paxos
Approach	Strong leader with all operations through leader	Multiple proposers can contend
Subproblems	Decomposed into leader election, log replication, safety	Single unified protocol
Understandability	Designed to be taught and learned	Notoriously opaque
Leader stability	Leader holds position until failure	Leader can be challenged at any time
Cluster changes	Joint consensus approach	More complex reconfiguration
Production use	etcd, Consul, CockroachDB, TiDB	Google Chubby, Spanner

The Raft State Machine

Every Raft server is in one of three states at any time:

State	Role
Follower	Passive state, accepts log entries from leader
Candidate	Initiates election, requests votes from peers
Leader	Handles all client requests, manages log replication

All servers start as followers. If a follower receives no communication from a leader within a certain time (the election timeout), it transitions to candidate and starts an election.

Terms: Raft's Logical Clock

Raft divides time into terms, which act as monotonically increasing logical clocks. Terms are critical to Raft's safety guarantees.

Property	Description
Monotonic	Terms only increase, never decrease or wrap around
Election marker	Each term begins with an election
Exactly one leader	At most one leader is elected per term
No leader possible	A term may end with no leader (split vote)

Each server stores its current term number. When servers communicate, they exchange term numbers. If a server receives a request with a higher term than its own, it updates its term and becomes a follower. If a server receives a request with a lower term, it rejects it.

⚠️

Clock skew warning: Terms are logical clocks, not wall clocks. They solve distributed coordination without requiring synchronized physical clocks. Relying on physical clocks would introduce vulnerabilities to clock drift.

Leader Election

When a follower detects no communication from the leader within its election timeout, it transitions to candidate state and begins an election.

The Election Process

Key Mechanisms

Randomized election timeouts are what make Raft work in practice. Each server picks a random timeout between 150ms and 300ms. This ensures that typically only one server's timeout expires first, preventing split votes.

Mechanism	Purpose
Randomized timeout	Prevents split votes, ensures fast convergence
First vote per term	Each server grants only one vote per term
Majority required	Candidate needs votes from majority of cluster
Higher term wins	Candidate defers to higher-term candidate
Self-vote	Candidate always votes for itself first

What Happens During a Split Vote

If multiple followers become candidates simultaneously, no one gets a majority. Each candidate's election times out, and they start a new election with a higher term. The randomness ensures that eventually one candidate's timeout expires first.

Log Replication

Once a leader is elected, all client requests go through the leader. The leader is the single source of truth.

The Replication Flow

Log Structure

The log is an ordered sequence of entries. Each entry contains:

Field	Description
Index	Position in the log (monotonically increasing)
Term	The term when the entry was created
Command	The operation to apply to the state machine

text

Log:
Index:  1    2    3    4    5
       +-----+-----+-----+-----+
Term:  |  1  |  1  |  2  |  3  |  3  |
       +-----+-----+-----+-----+-----+
Cmd:   | x=1 | y=2 | x=3 | x=5 | z=1 |
       +-----+-----+-----+-----+-----+
                  ^
             commitIndex = 3

Commit Index

An entry is committed when it has been replicated to a majority of the cluster. Once committed, the entry is safe - no future term can overwrite it. The leader tracks the highest committed index and includes it in every AppendEntries RPC.

State	Meaning
Appended to leader's log	Leader has the entry
Replicated to majority	Entry is committed, can be applied to state machine
Applied to state machine	Entry has been executed, visible to clients

Log Matching Property

This is the most important invariant in Raft. It guarantees:

If two entries in different logs have the same index and term, they store the same command
If two entries in different logs have the same index and term, all preceding entries are identical

Raft enforces this by including prevLogIndex and prevLogTerm in every AppendEntries RPC. If a follower's log doesn't have a matching entry at prevLogIndex with term prevLogTerm, it rejects the append.

When a follower rejects an AppendEntries (log mismatch), the leader decrements nextIndex for that follower and retries. The follower removes conflicting entries and adopts the leader's log. This is how Raft ensures all logs converge to the leader's.

✅

Key insight: The leader never overwrites or deletes its own log entries. It only adds new entries. If a follower's log diverges, the leader forces the follower to adopt its log. This is why Raft is called a leader-based algorithm.

Safety Properties

Raft guarantees five safety properties. Understanding these is essential to understanding why Raft is correct.

1. Election Safety

At most one leader can be elected in any given term. This is guaranteed because a candidate needs a majority of votes, and only one candidate can get a majority in a single term.

2. Leader Append-Only

A leader never overwrites or deletes entries in its log. It only appends new entries. This is fundamental to the Leader Completeness property.

3. Log Matching

As described above: if two logs have an entry with the same index and term, they are identical for all preceding entries.

4. Leader Completeness

If a log entry is committed in a given term, it will be present in the logs of all future leaders. This is the most subtle property. It is guaranteed because:

A candidate must contact a majority of servers to win an election
At least one server in the majority has the committed entry (by the quorum intersection property)
The voter denies its vote if its log is more up-to-date than the candidate's

5. State Machine Safety

If a server has applied a log entry at a given index to its state machine, no other server will apply a different entry at the same index. This is guaranteed because committed entries are immutable (Leader Completeness) and all servers apply entries in the same order.

Cluster Membership Changes

Adding or removing nodes from a Raft cluster is one of the trickiest operations. Raft uses joint consensus to transition safely between configurations.

The Problem

Changing cluster membership mid-operation is dangerous. Consider: the old configuration has 3 nodes, and you're adding a 4th. Without care, you could have two leaders - one elected under the 3-node configuration and one under the 4-node.

Joint Consensus

Raft handles this with a two-phase approach:

Phase	Configuration	Decision Rule
Current	Cold only	Majority of Cold
Joint	Cold + Cnew combined	Majority of both Cold AND Cnew
New	Cnew only	Majority of Cnew

During the joint consensus phase, a leader must get approval from majorities of both configurations. This prevents split-brain: any decision requires both old and new nodes to agree.

⚠️

Practical note: Adding nodes to a Raft cluster should be done carefully. New nodes with empty logs need time to catch up before they can participate in elections. Most implementations (like etcd) require explicit commands to add and remove members.

Network Partitions and Split-Brain Prevention

What happens when the network divides a Raft cluster? This is where Raft's safety guarantees shine.

In a 5-node cluster, if the network splits into a partition of 3 nodes and a partition of 2:

Majority partition (3 nodes): The existing leader continues serving. It can still commit entries because it has a majority of the cluster.

Minority partition (2 nodes): The followers become candidates, but they can never get a majority. No leader is elected. The minority partition becomes unavailable for writes.

When the network heals, the minority partition's followers rejoin and accept the leader's log. No data is lost, and no split-brain occurs.

Condition	Result
Leader in majority partition	Continues serving, commits entries
Minority partition without majority	Cannot elect leader, rejects writes
Network heals	Minority nodes catch up with leader

Real-World Usage

etcd

etcd is a distributed key-value store that uses Raft as its consensus engine. It is the backing store for Kubernetes. Every Kubernetes cluster runs etcd, making Raft indirectly responsible for orchestrating millions of containers.

Feature	How Raft Enables It
Strong consistency	All writes go through Raft leader
Leader election	Automatic failover in 150-300ms
Cluster membership	`etcdctl member add/remove` triggers Raft configuration change

Consul

HashiCorp Consul uses Raft for its consensus protocol. Consul's key-value store and service catalog both depend on Raft for consistency across Consul server nodes.

CockroachDB

CockroachDB uses Raft at the range level. Each range (a contiguous segment of data) is a Raft group with replicas on different nodes. This gives CockroachDB both strong consistency and high availability.

TiDB

TiKV, the storage layer of TiDB, uses Raft for replication at the region level, similar to CockroachDB's approach.

Raft vs Paxos vs Zab

Aspect	Raft	Paxos	Zab
Primary goal	Understandability	Correctness	Ordered log replication
Leader mechanism	Election with randomized timeouts	Distinguished proposer	Leader election phase
Log replication	Strong leader, uni-directional	Multi-phase, bi-directional	Leader-based, similar to Raft
Membership changes	Joint consensus	Harder to implement	Similar approach
Commit condition	Log entry on majority of servers	Value chosen when majority accepts	Entry committed when majority acknowledges
Performance	Good (leader handles all)	Good with Multi-Paxos	Good for ZooKeeper workloads
Implementation complexity	Moderate	High	Moderate
Used by	etcd, Consul, CockroachDB	Google Chubby, Spanner	Apache ZooKeeper

Key Takeaways

Raft decomposes consensus into three independent subproblems: leader election, log replication, and safety. This makes it far more understandable than Paxos.
Terms are logical clocks that monotonically increase. They prevent stale leaders from causing chaos and underpin all of Raft's safety guarantees.
Randomized election timeouts solve the split-vote problem elegantly without requiring coordinated timing.
The log matching property ensures that once entries are committed, they are immutable and guaranteed to be present in all future leaders.
Raft prevents split-brain because only a partition containing a majority of nodes can elect a leader and commit entries. Minority partitions become unavailable but never diverge.
In production, Raft powers etcd (Kubernetes), Consul (service discovery), CockroachDB (distributed SQL), and TiDB (distributed MySQL-compatible database).

✅

Interview tip: When asked about consensus in a system design interview, reach for Raft. Explain the three subproblems, how leader election works with randomized timeouts, and the log matching property. Show that you understand not just how Raft works, but why its design choices make it practical to implement.

Write-Ahead Log (WAL): The Foundation of Crash Recovery and Durability

Paxos, Two-Phase Commit, and Distributed Consensus Protocols