Distributed Data Stores
Master the architecture behind Dynamo, Cassandra, and Riak. Learn consistent hashing, quorum systems, vector clocks, and conflict-free replicated data types.
When you need to store petabytes of data across hundreds of machines while staying available during network partitions and machine failures, traditional databases won't cut it. You need distributed data stores—systems designed from the ground up for the realities of distributed computing.
This tutorial covers the architecture of systems like Amazon Dynamo, Apache Cassandra, and Riak. These aren't academic exercises: the Dynamo paper's design directly influenced Cassandra, Riak, and Voldemort, and understanding it will make you a better distributed systems engineer.
Why Distributed Data Stores?
Before diving into the architecture, let's understand the problem:
| Challenge | Traditional DB | Distributed Store |
|---|---|---|
| Scale | Vertical only | Horizontal |
| Failure | Single point of failure | Expected and tolerated |
| Availability | Often CP (consistency over availability) | AP (availability over consistency) |
| Geography | Single DC | Multi-region |
| Partition tolerance | None needed | Built-in |
The CAP theorem states you can have at most two of: Consistency, Availability, and Partition tolerance. Since network partitions can't be prevented in practice, the real choice is between consistency and availability. Dynamo-style stores choose availability and partition tolerance (AP), trading strong consistency for always-on availability. They achieve "eventual consistency" through the mechanisms covered below.
Consistent Hashing
The foundation of distributed data stores. Consistent hashing minimizes data movement when nodes join or leave.
The Basic Problem
Traditional hashing: `node = hash(key) % num_nodes`
This works until you add a node. Then num_nodes changes and nearly all keys remap: on average n/(n+1) of them, about 91% when growing from 10 to 11 nodes. For a system with millions of keys, this causes massive cache invalidation and rebalancing storms.
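A quick simulation makes the cost concrete (a minimal sketch; the key format and node counts are arbitrary):

```python
# How many keys change owners under modulo hashing when a 10-node
# cluster grows to 11 nodes?
import hashlib

def owner(key: str, num_nodes: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"key-{i}" for i in range(100_000)]
moved = sum(owner(k, 10) != owner(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # ~91%
```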
The Hash Ring Solution
Consistent hashing maps both keys and nodes onto a circular hash space (a ring):
Key placement: Walk clockwise from the key's hash position until you find a node. That node owns the key.
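Here's a minimal ring sketch in Python (the names are illustrative, not any particular library's API); a binary search over the sorted node positions implements the clockwise walk:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string onto the ring's integer hash space."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # The ring is just a sorted list of (position, node) pairs.
        self._ring = sorted((ring_hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node."""
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        # Wrapping past the last node lands on the first one.
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # stable owner for this key
```

Adding a node now inserts one position into the ring, and only the keys between that position and its predecessor move.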
Virtual Nodes for Load Balancing
A node hashed to a single ring position gets an uneven share of the keyspace, and physical nodes differ in capacity (a big server vs a small one). Virtual nodes solve both problems: each physical node claims many positions on the ring, with more powerful machines assigned more of them:
Replication strategy: When a key hashes to position P, replicate to the next N distinct physical nodes clockwise (skipping additional virtual nodes that belong to a machine already chosen).
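A sketch combining virtual nodes with replica placement (names again illustrative): each physical node claims ring positions in proportion to a weight, and the preference list walks clockwise until it has N distinct physical nodes:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class VNodeRing:
    def __init__(self, weights, n_replicas=3):
        # weights: physical node -> number of virtual nodes it claims.
        self.n_replicas = n_replicas
        self._ring = sorted(
            (ring_hash(f"{node}#vnode{i}"), node)
            for node, weight in weights.items()
            for i in range(weight)
        )
        self._positions = [pos for pos, _ in self._ring]

    def preference_list(self, key: str):
        """The next N *distinct physical* nodes clockwise from the key."""
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        owners = []
        for step in range(len(self._ring)):
            node = self._ring[(idx + step) % len(self._ring)][1]
            if node not in owners:   # skip vnodes of already-chosen machines
                owners.append(node)
            if len(owners) == self.n_replicas:
                break
        return owners

# The big machine claims twice as many vnodes, so roughly twice the keyspace.
ring = VNodeRing({"big-1": 128, "small-1": 64, "small-2": 64})
print(ring.preference_list("cart:1001"))  # 3 distinct physical replicas
```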
Quorum Systems
Quorum determines how many nodes must acknowledge reads and writes:
| Configuration | R | W | N | Latency | Optimized for |
|---|---|---|---|---|---|
| Strong write (write-all) | 1 | N | 3 | Writes slow, reads fast | Read-heavy workloads |
| Strong read (read-all) | N | 1 | 3 | Writes fast, reads slow | Write-heavy workloads |
| Balanced | N/2+1 | N/2+1 | 3 | Moderate for both | Mixed workloads |
The key invariant: if R + W > N, every read quorum overlaps every write quorum, so a read always reaches at least one replica holding the latest acknowledged write.
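As a toy illustration of why the overlap matters, here's a coordinator-side quorum read (assuming each replica response carries a version number; the names are made up for the sketch):

```python
# With R + W > N, any R responses must include at least one replica
# that participated in the last successful write quorum.
def quorum_read(responses, R):
    """responses: (version, value) pairs gathered from replicas."""
    if len(responses) < R:
        raise TimeoutError("read quorum not reached")
    # The highest version among R responses is the latest acknowledged write.
    return max(responses[:R], key=lambda rv: rv[0])

N, R, W = 3, 2, 2
assert R + W > N                       # quorums are guaranteed to intersect
replies = [(3, "blue"), (2, "green")]  # only R of the N replicas answered
print(quorum_read(replies, R))         # -> (3, 'blue')
```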
Versioning and Vector Clocks
In distributed systems, you can't just use "last write wins" because clocks are unreliable and network partitions cause concurrent writes. Vector clocks track causal history.
The Problem with Timestamps
When the network partitions, both sides can make conflicting changes. Timestamps alone can't resolve this.
Vector Clock Solution
Each version has a vector clock—a map of node IDs to sequence numbers:
| Concept | Description |
|---|---|
| Vector clock | Map of node_id → counter |
| Increment | Node increments its own counter when making a change |
| Merge | For each node, take the max counter |
| Happens-before | Version A happened-before B if every counter in A is ≤ the matching counter in B (and the clocks differ) |
| Conflict | Two versions conflict if neither happened-before the other |
Version History
Merge process (reproduced in the sketch below):
- v3 (clock {A:3}) and v4 (clock {A:2, B:1}) are concurrent; merging them produces v5 with clock {A:3, B:1} (per-node max)
- v4 happened-before v5: every counter in {A:2, B:1} is ≤ the matching counter in {A:3, B:1}
- v6 (clock {A:4}) is concurrent with v5: it leads on A (4 > 3) but lacks v5's B:1 → CONFLICT!
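A minimal vector clock sketch that reproduces this walk-through (node IDs A and B and the version labels come from the example above):

```python
class VectorClock:
    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, node_id):
        """A node bumps its own counter when it makes a change."""
        self.counters[node_id] = self.counters.get(node_id, 0) + 1

    def merge(self, other):
        """For each node, take the max of the two counters."""
        nodes = self.counters.keys() | other.counters.keys()
        return VectorClock({n: max(self.counters.get(n, 0),
                                   other.counters.get(n, 0)) for n in nodes})

    def happened_before(self, other):
        """True if every counter here is <= the other's (missing = 0)."""
        return (self.counters != other.counters and
                all(c <= other.counters.get(n, 0)
                    for n, c in self.counters.items()))

    def concurrent_with(self, other):
        return (not self.happened_before(other) and
                not other.happened_before(self))

v3 = VectorClock({"A": 3})
v4 = VectorClock({"A": 2, "B": 1})
v5 = v3.merge(v4)                 # counters: {'A': 3, 'B': 1}
v6 = VectorClock({"A": 4})
print(v4.happened_before(v5))     # True
print(v6.concurrent_with(v5))     # True -> conflict
```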
Conflict Resolution Strategies
| Strategy | How It Works | Trade-off |
|---|---|---|
| Last Write Wins (LWW) | Use wall-clock timestamps | Simple, but silently discards concurrent writes |
| Application-level merge | Let the app decide how to combine versions (sketched below) | Best for domain-specific logic |
| Return all versions | Let client decide | Maximum flexibility |
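Application-level merge is the strategy the Dynamo paper describes for Amazon's shopping cart: when a read returns conflicting siblings, the application unions the carts instead of discarding either side. A toy version:

```python
def merge_carts(siblings):
    """siblings: conflicting cart versions, each a set of item ids."""
    merged = set()
    for cart in siblings:
        merged |= cart  # union never loses an added item
    return merged

print(merge_carts([{"book", "pen"}, {"book", "mug"}]))
# -> {'book', 'pen', 'mug'}
```

The known cost of this particular merge is that a deleted item can resurface if another sibling still contains it.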
Hinted Handoff
When a node is temporarily down, you don't want to fail writes. Hinted handoff stores writes for later delivery (see the sketch after this list):
Process:
- Write to target node fails (node down)
- Write to next healthy node succeeds
- Store a hint: "when the target node comes back, deliver this write"
- When target node recovers, deliver the hinted write
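A toy sketch of that flow; the node names and in-memory stores are assumptions for illustration, not any system's real API:

```python
stores = {"A": {}, "B": {}, "C": {}}  # each node's local key-value store
up = {"A": False, "B": True, "C": True}
hints = []                            # (intended_node, key, value)

def write(key, value, target, fallback):
    if up[target]:
        stores[target][key] = value
    else:
        stores[fallback][key] = value         # don't fail the write
        hints.append((target, key, value))    # remember who it was meant for

def on_recovery(node):
    """Deliver hinted writes once the intended node comes back."""
    up[node] = True
    for intended, key, value in [h for h in hints if h[0] == node]:
        stores[intended][key] = value
        hints.remove((intended, key, value))

write("cart:7", ["book"], target="A", fallback="B")  # A is down
on_recovery("A")
print(stores["A"])  # {'cart:7': ['book']} -- the hint was delivered
```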
Anti-Entropy and Merkle Trees
Even with quorum, replicas can diverge. Anti-entropy is the process of repairing divergence between replicas.
Merkle Trees
A Merkle tree is a hash tree where each leaf is the hash of a data block:
How it works:
- Each leaf is the hash of a data range
- Each internal node is the hash of its children
- Comparing root hashes tells if trees are identical
- Comparing child hashes identifies divergent ranges
Anti-Entropy Protocol
Process (see the sketch after this list):
- Compare root hashes of Merkle trees
- If different, compare child hashes
- Recursively narrow down to divergent range
- Sync only the divergent keys
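A combined sketch of the tree and the comparison, assuming a power-of-two number of key ranges and SHA-256 hashing (both simplifications):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    """Leaves hash key ranges; each parent hashes its two children."""
    level = [h(b) for b in blocks]   # assumes len(blocks) is a power of two
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                      # tree[-1][0] is the root hash

def divergent_ranges(t1, t2, level=None, idx=0):
    """Compare top-down, recursing only into subtrees whose hashes differ."""
    if level is None:
        level = len(t1) - 1          # start at the root
    if t1[level][idx] == t2[level][idx]:
        return []                    # identical subtree: prune it
    if level == 0:
        return [idx]                 # found a divergent leaf range
    return (divergent_ranges(t1, t2, level - 1, 2 * idx) +
            divergent_ranges(t1, t2, level - 1, 2 * idx + 1))

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=old", b"k3=v3", b"k4=v4"]
print(divergent_ranges(build_tree(replica_a), build_tree(replica_b)))
# -> [1]: only that key range needs to be synced
```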
CRDTs: Conflict-Free Replicated Data Types
For certain data types, you can define merge operations that are always correct, regardless of the order of application.
Why CRDTs Matter
Traditional conflict resolution requires application knowledge. CRDTs provide mathematically proven conflict-free merging at the data structure level.
Common CRDTs
| CRDT Type | Description | Example Use |
|---|---|---|
| Grow-only Counter | Can only increment, never decrement | Page view counts |
| LWW-Register | Last write wins | Configuration values |
| PN-Counter | Can increment and decrement | Vote counts, account balances |
| OR-Set | Add/remove with conflict-free merge | Tags, bookmarks, shopping cart |
How CRDTs Work
| Type | Merge Rule |
|---|---|
| G-Counter | For each node, take the max of the two counters (sketched below) |
| LWW-Register | If the other timestamp is newer, take the other value |
| PN-Counter | Merge the increment and decrement G-Counters separately (per-node max); value = increments − decrements |
| OR-Set | Union both add and remove sets; an element is present if at least one of its tagged adds has no matching remove |
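A minimal G-Counter sketch shows why these merges are safe: each node increments only its own slot, and a per-node max is commutative, associative, and idempotent, so replicas converge no matter how merges interleave:

```python
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> that node's local increment count

    def increment(self, amount=1):
        """A node may only bump its own slot."""
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        """Per-node max: safe to apply in any order, any number of times."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("A"), GCounter("B")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)                          # replicas exchange state in any order
assert a.value() == b.value() == 5  # converged without coordination
```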
Sloppy Quorum and Hinted Replicas
In production, some nodes may be unavailable. Sloppy quorum uses the next N healthy nodes instead of strict N successors:
Key insight: Sloppy quorum improves availability but means your data might not be on the expected nodes. Hinted handoff ensures eventual delivery to the target node.
Summary: Dynamo-Style Architecture
| Component | Purpose | Key Algorithm |
|---|---|---|
| Consistent Hashing | Node selection | Hash ring with virtual nodes |
| Replication | Durability | N replicas per key |
| Quorum | Consistency | R + W > N |
| Vector Clocks | Version tracking | Causal history |
| Hinted Handoff | Availability during failures | Store and deliver later |
| Anti-Entropy | Repair divergence | Merkle trees |
| CRDTs | Conflict-free data types | Mathematically proven merge |
These patterns are used in production systems you interact with daily. Cassandra uses consistent hashing and quorum. Riak uses vector clocks and CRDTs. Redis Cluster partitions with fixed hash slots, a related approach. Understanding the internals helps you tune these systems for your workload.
In the next tutorial, we'll explore Fault Tolerance and Resilience—the patterns that keep distributed systems running when components fail.