Raft Consensus Algorithm: Leader Election, Log Replication, and Safety
Master Raft - the consensus algorithm behind etcd, Consul, and CockroachDB. Understand leader election, term-based log replication, commit indices, and how Raft prevents split-brain.
Introduction: The Consensus Problem
I've been building distributed systems for over a decade, and if there's one problem that consistently humbles engineers, it's consensus. How do you get multiple machines to agree on something when the network can fail, machines can crash, and messages can be delayed or lost?
At first glance, this seems straightforward. Just have everyone vote, right? But consider this scenario: three servers receive a write request. One says yes, one says no, and one never responds. Did the write succeed? What if two servers accept but a third crashes before acknowledging? What if the network splits and each side of the partition elects its own leader?
This is the distributed consensus problem, and it is far more subtle than it appears. Raft solves it with a design philosophy centered on understandability.
Why consensus matters: Without consensus, distributed systems can enter split-brain states where different clients see different data. This causes data loss, corruption, and hard-to-debug inconsistencies. Consensus is the foundation of strongly consistent replicated systems.
Why Raft Exists
Before Raft, the dominant consensus algorithm was Paxos, introduced by Leslie Lamport in 1989. Despite being correct, Paxos is notoriously difficult to understand and implement. The Raft paper, published by Diego Ongaro and John Ousterhout in 2014, explicitly targeted understandability as its primary design goal.
Raft's Design Goals
| Goal | Description |
|---|---|
| Understandability | The algorithm must be easy to understand and reason about |
| Complete foundation | Provide a complete foundation for system implementation |
| Safe under all conditions | Guarantee safety even under failures and network partitions |
| Efficient for common cases | Optimize for the common case where the leader is stable |
Raft achieves understandability through decomposition and state space reduction. It breaks consensus into three independent subproblems: leader election, log replication, and safety. In contrast, Paxos intertwines these concerns.
Raft vs Paxos at a Glance
| Aspect | Raft | Paxos |
|---|---|---|
| Approach | Strong leader with all operations through leader | Multiple proposers can contend |
| Subproblems | Decomposed into leader election, log replication, safety | Single unified protocol |
| Understandability | Designed to be taught and learned | Notoriously opaque |
| Leader stability | Leader holds position until failure | Leader can be challenged at any time |
| Cluster changes | Joint consensus approach | More complex reconfiguration |
| Production use | etcd, Consul, CockroachDB, TiDB | Google Chubby, Spanner |
The Raft State Machine
Every Raft server is in one of three states at any time:
| State | Role |
|---|---|
| Follower | Passive state, accepts log entries from leader |
| Candidate | Initiates election, requests votes from peers |
| Leader | Handles all client requests, manages log replication |
All servers start as followers. If a follower receives no communication from a leader within a certain time (the election timeout), it transitions to candidate and starts an election.
Terms: Raft's Logical Clock
Raft divides time into terms, which act as monotonically increasing logical clocks. Terms are critical to Raft's safety guarantees.
| Property | Description |
|---|---|
| Monotonic | Terms only increase, never decrease or wrap around |
| Election marker | Each term begins with an election |
| Exactly one leader | At most one leader is elected per term |
| No leader possible | A term may end with no leader (split vote) |
Each server stores its current term number. When servers communicate, they exchange term numbers. If a server receives a request with a higher term than its own, it updates its term and becomes a follower. If a server receives a request with a lower term, it rejects it.
Clock skew warning: Terms are logical clocks, not wall clocks. They solve distributed coordination without requiring synchronized physical clocks. Relying on physical clocks would introduce vulnerabilities to clock drift.
Leader Election
When a follower detects no communication from the leader within its election timeout, it transitions to candidate state and begins an election.
The Election Process
Key Mechanisms
Randomized election timeouts are what make Raft work in practice. Each server picks a random timeout between 150ms and 300ms. This ensures that typically only one server's timeout expires first, preventing split votes.
| Mechanism | Purpose |
|---|---|
| Randomized timeout | Prevents split votes, ensures fast convergence |
| First vote per term | Each server grants only one vote per term |
| Majority required | Candidate needs votes from majority of cluster |
| Higher term wins | Candidate defers to higher-term candidate |
| Self-vote | Candidate always votes for itself first |
What Happens During a Split Vote
If multiple followers become candidates simultaneously, no one gets a majority. Each candidate's election times out, and they start a new election with a higher term. The randomness ensures that eventually one candidate's timeout expires first.
Log Replication
Once a leader is elected, all client requests go through the leader. The leader is the single source of truth.
The Replication Flow
Log Structure
The log is an ordered sequence of entries. Each entry contains:
| Field | Description |
|---|---|
| Index | Position in the log (monotonically increasing) |
| Term | The term when the entry was created |
| Command | The operation to apply to the state machine |
Log:
Index: 1 2 3 4 5
+-----+-----+-----+-----+
Term: | 1 | 1 | 2 | 3 | 3 |
+-----+-----+-----+-----+-----+
Cmd: | x=1 | y=2 | x=3 | x=5 | z=1 |
+-----+-----+-----+-----+-----+
^
commitIndex = 3
Commit Index
An entry is committed when it has been replicated to a majority of the cluster. Once committed, the entry is safe - no future term can overwrite it. The leader tracks the highest committed index and includes it in every AppendEntries RPC.
| State | Meaning |
|---|---|
| Appended to leader's log | Leader has the entry |
| Replicated to majority | Entry is committed, can be applied to state machine |
| Applied to state machine | Entry has been executed, visible to clients |
Log Matching Property
This is the most important invariant in Raft. It guarantees:
- If two entries in different logs have the same index and term, they store the same command
- If two entries in different logs have the same index and term, all preceding entries are identical
Raft enforces this by including prevLogIndex and prevLogTerm in every AppendEntries RPC. If a follower's log doesn't have a matching entry at prevLogIndex with term prevLogTerm, it rejects the append.
When a follower rejects an AppendEntries (log mismatch), the leader decrements nextIndex for that follower and retries. The follower removes conflicting entries and adopts the leader's log. This is how Raft ensures all logs converge to the leader's.
Key insight: The leader never overwrites or deletes its own log entries. It only adds new entries. If a follower's log diverges, the leader forces the follower to adopt its log. This is why Raft is called a leader-based algorithm.
Safety Properties
Raft guarantees five safety properties. Understanding these is essential to understanding why Raft is correct.
1. Election Safety
At most one leader can be elected in any given term. This is guaranteed because a candidate needs a majority of votes, and only one candidate can get a majority in a single term.
2. Leader Append-Only
A leader never overwrites or deletes entries in its log. It only appends new entries. This is fundamental to the Leader Completeness property.
3. Log Matching
As described above: if two logs have an entry with the same index and term, they are identical for all preceding entries.
4. Leader Completeness
If a log entry is committed in a given term, it will be present in the logs of all future leaders. This is the most subtle property. It is guaranteed because:
- A candidate must contact a majority of servers to win an election
- At least one server in the majority has the committed entry (by the quorum intersection property)
- The voter denies its vote if its log is more up-to-date than the candidate's
5. State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will apply a different entry at the same index. This is guaranteed because committed entries are immutable (Leader Completeness) and all servers apply entries in the same order.
Cluster Membership Changes
Adding or removing nodes from a Raft cluster is one of the trickiest operations. Raft uses joint consensus to transition safely between configurations.
The Problem
Changing cluster membership mid-operation is dangerous. Consider: the old configuration has 3 nodes, and you're adding a 4th. Without care, you could have two leaders - one elected under the 3-node configuration and one under the 4-node.
Joint Consensus
Raft handles this with a two-phase approach:
| Phase | Configuration | Decision Rule |
|---|---|---|
| Current | Cold only | Majority of Cold |
| Joint | Cold + Cnew combined | Majority of both Cold AND Cnew |
| New | Cnew only | Majority of Cnew |
During the joint consensus phase, a leader must get approval from majorities of both configurations. This prevents split-brain: any decision requires both old and new nodes to agree.
Practical note: Adding nodes to a Raft cluster should be done carefully. New nodes with empty logs need time to catch up before they can participate in elections. Most implementations (like etcd) require explicit commands to add and remove members.
Network Partitions and Split-Brain Prevention
What happens when the network divides a Raft cluster? This is where Raft's safety guarantees shine.
In a 5-node cluster, if the network splits into a partition of 3 nodes and a partition of 2:
Majority partition (3 nodes): The existing leader continues serving. It can still commit entries because it has a majority of the cluster.
Minority partition (2 nodes): The followers become candidates, but they can never get a majority. No leader is elected. The minority partition becomes unavailable for writes.
When the network heals, the minority partition's followers rejoin and accept the leader's log. No data is lost, and no split-brain occurs.
| Condition | Result |
|---|---|
| Leader in majority partition | Continues serving, commits entries |
| Minority partition without majority | Cannot elect leader, rejects writes |
| Network heals | Minority nodes catch up with leader |
Real-World Usage
etcd
etcd is a distributed key-value store that uses Raft as its consensus engine. It is the backing store for Kubernetes. Every Kubernetes cluster runs etcd, making Raft indirectly responsible for orchestrating millions of containers.
| Feature | How Raft Enables It |
|---|---|
| Strong consistency | All writes go through Raft leader |
| Leader election | Automatic failover in 150-300ms |
| Cluster membership | etcdctl member add/remove triggers Raft configuration change |
Consul
HashiCorp Consul uses Raft for its consensus protocol. Consul's key-value store and service catalog both depend on Raft for consistency across Consul server nodes.
CockroachDB
CockroachDB uses Raft at the range level. Each range (a contiguous segment of data) is a Raft group with replicas on different nodes. This gives CockroachDB both strong consistency and high availability.
TiDB
TiKV, the storage layer of TiDB, uses Raft for replication at the region level, similar to CockroachDB's approach.
Raft vs Paxos vs Zab
| Aspect | Raft | Paxos | Zab |
|---|---|---|---|
| Primary goal | Understandability | Correctness | Ordered log replication |
| Leader mechanism | Election with randomized timeouts | Distinguished proposer | Leader election phase |
| Log replication | Strong leader, uni-directional | Multi-phase, bi-directional | Leader-based, similar to Raft |
| Membership changes | Joint consensus | Harder to implement | Similar approach |
| Commit condition | Log entry on majority of servers | Value chosen when majority accepts | Entry committed when majority acknowledges |
| Performance | Good (leader handles all) | Good with Multi-Paxos | Good for ZooKeeper workloads |
| Implementation complexity | Moderate | High | Moderate |
| Used by | etcd, Consul, CockroachDB | Google Chubby, Spanner | Apache ZooKeeper |
Key Takeaways
-
Raft decomposes consensus into three independent subproblems: leader election, log replication, and safety. This makes it far more understandable than Paxos.
-
Terms are logical clocks that monotonically increase. They prevent stale leaders from causing chaos and underpin all of Raft's safety guarantees.
-
Randomized election timeouts solve the split-vote problem elegantly without requiring coordinated timing.
-
The log matching property ensures that once entries are committed, they are immutable and guaranteed to be present in all future leaders.
-
Raft prevents split-brain because only a partition containing a majority of nodes can elect a leader and commit entries. Minority partitions become unavailable but never diverge.
-
In production, Raft powers etcd (Kubernetes), Consul (service discovery), CockroachDB (distributed SQL), and TiDB (distributed MySQL-compatible database).
Interview tip: When asked about consensus in a system design interview, reach for Raft. Explain the three subproblems, how leader election works with randomized timeouts, and the log matching property. Show that you understand not just how Raft works, but why its design choices make it practical to implement.