Distributed Data Stores
Master the architecture behind Dynamo, Cassandra, and Riak. Learn consistent hashing, quorum systems, vector clocks, and conflict-free replicated data types.
When you need to store petabytes of data across hundreds of machines while staying available during network partitions and machine failures, traditional databases won't cut it. You need distributed data stores—systems designed from the ground up for the realities of distributed computing.
This tutorial covers the architecture of systems like Amazon Dynamo, Apache Cassandra, and Riak. These aren't academic exercises: the Dynamo paper's design directly influenced Cassandra, Riak, and Voldemort, and understanding it will make you a better distributed systems engineer.
Why Distributed Data Stores?
Before diving into the architecture, let's understand the problem:
| Challenge | Traditional DB | Distributed Store |
|---|---|---|
| Scale | Vertical only | Horizontal |
| Failure | Single point of failure | Expected and tolerated |
| Availability | Often CP (consistency over availability) | AP (availability over consistency) |
| Geography | Single DC | Multi-region |
| Partition tolerance | None needed | Built-in |
The CAP theorem states you can have at most two of: Consistency, Availability, and Partition tolerance. Since network partitions can't be prevented in practice, the real choice is between consistency and availability. Dynamo-style stores choose availability and partition tolerance (AP), trading strong consistency for always-on availability. They achieve "eventual consistency" through the mechanisms covered below.
Consistent Hashing
The foundation of distributed data stores. Consistent hashing minimizes data movement when nodes join or leave.
The Basic Problem
Traditional hashing: `node = hash(key) % num_nodes`
This works until you add a node. Then num_nodes changes and nearly all keys remap: on average n/(n+1) of them, about 91% when growing from 10 to 11 nodes. For a system with millions of keys, this causes massive cache invalidation and rebalancing storms.
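A quick simulation makes the cost concrete (a minimal sketch; the key format and node counts are arbitrary):

```python
# How many keys change owners under modulo hashing when a 10-node
# cluster grows to 11 nodes?
import hashlib

def owner(key: str, num_nodes: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"key-{i}" for i in range(100_000)]
moved = sum(owner(k, 10) != owner(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # ~91%
```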
The Hash Ring Solution
Consistent hashing maps both keys and nodes onto a circular hash space (a ring):
Key placement: Walk clockwise from the key's hash position until you find a node. That node owns the key.
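Here's a minimal ring sketch in Python (the names are illustrative, not any particular library's API); a binary search over the sorted node positions implements the clockwise walk:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string onto the ring's integer hash space."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # The ring is just a sorted list of (position, node) pairs.
        self._ring = sorted((ring_hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node."""
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        # Wrapping past the last node lands on the first one.
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # stable owner for this key
```

Adding a node now inserts one position into the ring, and only the keys between that position and its predecessor move.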
Virtual Nodes for Load Balancing
A node hashed to a single ring position gets an uneven share of the keyspace, and physical nodes differ in capacity (a big server vs a small one). Virtual nodes solve both problems: each physical node claims many positions on the ring, with more powerful machines assigned more of them:
Replication strategy: When a key hashes to position P, replicate to the next N distinct physical nodes clockwise (skipping additional virtual nodes that belong to a machine already chosen).
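A sketch combining virtual nodes with replica placement (names again illustrative): each physical node claims ring positions in proportion to a weight, and the preference list walks clockwise until it has N distinct physical nodes:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class VNodeRing:
    def __init__(self, weights, n_replicas=3):
        # weights: physical node -> number of virtual nodes it claims.
        self.n_replicas = n_replicas
        self._ring = sorted(
            (ring_hash(f"{node}#vnode{i}"), node)
            for node, weight in weights.items()
            for i in range(weight)
        )
        self._positions = [pos for pos, _ in self._ring]

    def preference_list(self, key: str):
        """The next N *distinct physical* nodes clockwise from the key."""
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        owners = []
        for step in range(len(self._ring)):
            node = self._ring[(idx + step) % len(self._ring)][1]
            if node not in owners:   # skip vnodes of already-chosen machines
                owners.append(node)
            if len(owners) == self.n_replicas:
                break
        return owners

# The big machine claims twice as many vnodes, so roughly twice the keyspace.
ring = VNodeRing({"big-1": 128, "small-1": 64, "small-2": 64})
print(ring.preference_list("cart:1001"))  # 3 distinct physical replicas
```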
Quorum Systems
Quorum determines how many nodes must acknowledge reads and writes:
| Configuration | R | W | N | Latency | Optimized for |
|---|---|---|---|---|---|
| Strong write (write-all) | 1 | N | 3 | Writes slow, reads fast | Read-heavy workloads |
| Strong read (read-all) | N | 1 | 3 | Writes fast, reads slow | Write-heavy workloads |
| Balanced | N/2+1 | N/2+1 | 3 | Moderate for both | Mixed workloads |
The key invariant: if R + W > N, every read quorum overlaps every write quorum, so a read always reaches at least one replica holding the latest acknowledged write.
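As a toy illustration of why the overlap matters, here's a coordinator-side quorum read (assuming each replica response carries a version number; the names are made up for the sketch):

```python
# With R + W > N, any R responses must include at least one replica
# that participated in the last successful write quorum.
def quorum_read(responses, R):
    """responses: (version, value) pairs gathered from replicas."""
    if len(responses) < R:
        raise TimeoutError("read quorum not reached")
    # The highest version among R responses is the latest acknowledged write.
    return max(responses[:R], key=lambda rv: rv[0])

N, R, W = 3, 2, 2
assert R + W > N                       # quorums are guaranteed to intersect
replies = [(3, "blue"), (2, "green")]  # only R of the N replicas answered
print(quorum_read(replies, R))         # -> (3, 'blue')
```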
Versioning and Vector Clocks
In distributed systems, you can't just use "last write wins" because clocks are unreliable and network partitions cause concurrent writes. Vector clocks track causal history.
The Problem with Timestamps
When the network partitions, both sides can make conflicting changes. Timestamps alone can't resolve this.
Vector Clock Solution
Each version has a vector clock—a map of node IDs to sequence numbers:
| Concept | Description |
|---|---|
| Vector clock | Map of node_id → counter |
| Increment | Node increments its own counter when making a change |
| Merge | For each node, take the max counter |
| Happens-before | Version A happened-before B if every counter in A is ≤ the matching counter in B (and the clocks differ) |
| Conflict | Two versions conflict if neither happened-before the other |
Version History
Merge process (reproduced in the sketch below):
- v3 (clock {A:3}) and v4 (clock {A:2, B:1}) are concurrent; merging them produces v5 with clock {A:3, B:1} (per-node max)
- v4 happened-before v5: every counter in {A:2, B:1} is ≤ the matching counter in {A:3, B:1}
- v6 (clock {A:4}) is concurrent with v5: it leads on A (4 > 3) but lacks v5's B:1 → CONFLICT!
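A minimal vector clock sketch that reproduces this walk-through (node IDs A and B and the version labels come from the example above):

```python
class VectorClock:
    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, node_id):
        """A node bumps its own counter when it makes a change."""
        self.counters[node_id] = self.counters.get(node_id, 0) + 1

    def merge(self, other):
        """For each node, take the max of the two counters."""
        nodes = self.counters.keys() | other.counters.keys()
        return VectorClock({n: max(self.counters.get(n, 0),
                                   other.counters.get(n, 0)) for n in nodes})

    def happened_before(self, other):
        """True if every counter here is <= the other's (missing = 0)."""
        return (self.counters != other.counters and
                all(c <= other.counters.get(n, 0)
                    for n, c in self.counters.items()))

    def concurrent_with(self, other):
        return (not self.happened_before(other) and
                not other.happened_before(self))

v3 = VectorClock({"A": 3})
v4 = VectorClock({"A": 2, "B": 1})
v5 = v3.merge(v4)                 # counters: {'A': 3, 'B': 1}
v6 = VectorClock({"A": 4})
print(v4.happened_before(v5))     # True
print(v6.concurrent_with(v5))     # True -> conflict
```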
Conflict Resolution Strategies
| Strategy | How It Works | Trade-off |
|---|---|---|
| Last Write Wins (LWW) | Use wall-clock timestamps | Simple, but silently discards concurrent writes |
| Application-level merge | Let the app decide how to combine versions (sketched below) | Best for domain-specific logic |
| Return all versions | Let client decide | Maximum flexibility |
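Application-level merge is the strategy the Dynamo paper describes for Amazon's shopping cart: when a read returns conflicting siblings, the application unions the carts instead of discarding either side. A toy version:

```python
def merge_carts(siblings):
    """siblings: conflicting cart versions, each a set of item ids."""
    merged = set()
    for cart in siblings:
        merged |= cart  # union never loses an added item
    return merged

print(merge_carts([{"book", "pen"}, {"book", "mug"}]))
# -> {'book', 'pen', 'mug'}
```

The known cost of this particular merge is that a deleted item can resurface if another sibling still contains it.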
Hinted Handoff
When a node is temporarily down, you don't want to fail writes. Hinted handoff stores writes for later delivery (see the sketch after this list):
Process:
- Write to target node fails (node down)
- Write to next healthy node succeeds
- Store a hint: "when the target node comes back, deliver this write"
- When target node recovers, deliver the hinted write
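A toy sketch of that flow; the node names and in-memory stores are assumptions for illustration, not any system's real API:

```python
stores = {"A": {}, "B": {}, "C": {}}  # each node's local key-value store
up = {"A": False, "B": True, "C": True}
hints = []                            # (intended_node, key, value)

def write(key, value, target, fallback):
    if up[target]:
        stores[target][key] = value
    else:
        stores[fallback][key] = value         # don't fail the write
        hints.append((target, key, value))    # remember who it was meant for

def on_recovery(node):
    """Deliver hinted writes once the intended node comes back."""
    up[node] = True
    for intended, key, value in [h for h in hints if h[0] == node]:
        stores[intended][key] = value
        hints.remove((intended, key, value))

write("cart:7", ["book"], target="A", fallback="B")  # A is down
on_recovery("A")
print(stores["A"])  # {'cart:7': ['book']} -- the hint was delivered
```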
Anti-Entropy and Merkle Trees
Even with quorum, replicas can diverge. Anti-entropy is the process of repairing divergence between replicas.
Merkle Trees
A Merkle tree is a hash tree where each leaf is the hash of a data block:
How it works:
- Each leaf is the hash of a data range
- Each internal node is the hash of its children
- Comparing root hashes tells if trees are identical
- Comparing child hashes identifies divergent ranges
Anti-Entropy Protocol
Process (see the sketch after this list):
- Compare root hashes of Merkle trees
- If different, compare child hashes
- Recursively narrow down to divergent range
- Sync only the divergent keys
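A combined sketch of the tree and the comparison, assuming a power-of-two number of key ranges and SHA-256 hashing (both simplifications):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    """Leaves hash key ranges; each parent hashes its two children."""
    level = [h(b) for b in blocks]   # assumes len(blocks) is a power of two
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                      # tree[-1][0] is the root hash

def divergent_ranges(t1, t2, level=None, idx=0):
    """Compare top-down, recursing only into subtrees whose hashes differ."""
    if level is None:
        level = len(t1) - 1          # start at the root
    if t1[level][idx] == t2[level][idx]:
        return []                    # identical subtree: prune it
    if level == 0:
        return [idx]                 # found a divergent leaf range
    return (divergent_ranges(t1, t2, level - 1, 2 * idx) +
            divergent_ranges(t1, t2, level - 1, 2 * idx + 1))

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=old", b"k3=v3", b"k4=v4"]
print(divergent_ranges(build_tree(replica_a), build_tree(replica_b)))
# -> [1]: only that key range needs to be synced
```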
CRDTs: Conflict-Free Replicated Data Types
For certain data types, you can define merge operations that are always correct, regardless of the order of application.
Why CRDTs Matter
Traditional conflict resolution requires application knowledge. CRDTs provide mathematically proven conflict-free merging at the data structure level.
Common CRDTs
| CRDT Type | Description | Example Use |
|---|---|---|
| Grow-only Counter | Can only increment, never decrement | Page view counts |
| LWW-Register | Last write wins | Configuration values |
| PN-Counter | Can increment and decrement | Vote counts, account balances |
| OR-Set | Add/remove with conflict-free merge | Tags, bookmarks, shopping cart |
How CRDTs Work
| Type | Merge Rule |
|---|---|
| G-Counter | For each node, take the max of the two counters (sketched below) |
| LWW-Register | If the other timestamp is newer, take the other value |
| PN-Counter | Merge the increment and decrement G-Counters separately (per-node max); value = increments − decrements |
| OR-Set | Union both add and remove sets; an element is present if at least one of its tagged adds has no matching remove |
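A minimal G-Counter sketch shows why these merges are safe: each node increments only its own slot, and a per-node max is commutative, associative, and idempotent, so replicas converge no matter how merges interleave:

```python
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> that node's local increment count

    def increment(self, amount=1):
        """A node may only bump its own slot."""
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        """Per-node max: safe to apply in any order, any number of times."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("A"), GCounter("B")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)                          # replicas exchange state in any order
assert a.value() == b.value() == 5  # converged without coordination
```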
Sloppy Quorum and Hinted Replicas
In production, some nodes may be unavailable. Sloppy quorum uses the next N healthy nodes instead of strict N successors:
Key insight: Sloppy quorum improves availability but means your data might not be on the expected nodes. Hinted handoff ensures eventual delivery to the target node.
Summary: Dynamo-Style Architecture
| Component | Purpose | Key Algorithm |
|---|---|---|
| Consistent Hashing | Node selection | Hash ring with virtual nodes |
| Replication | Durability | N replicas per key |
| Quorum | Consistency | R + W > N |
| Vector Clocks | Version tracking | Causal history |
| Hinted Handoff | Availability during failures | Store and deliver later |
| Anti-Entropy | Repair divergence | Merkle trees |
| CRDTs | Conflict-free data types | Mathematically proven merge |
These patterns are used in production systems you interact with daily. Cassandra uses consistent hashing and quorum. Riak uses vector clocks and CRDTs. Redis Cluster partitions with fixed hash slots, a related approach. Understanding the internals helps you tune these systems for your workload.
In the next tutorial, we'll explore Fault Tolerance and Resilience—the patterns that keep distributed systems running when components fail.