Distributed Systems

Distributed Data Stores

Master the architecture behind Dynamo, Cassandra, and Riak. Learn consistent hashing, quorum systems, vector clocks, and conflict-free replicated data types.



When you need to store petabytes of data across hundreds of machines while staying available during network partitions and machine failures, traditional databases won't cut it. You need distributed data stores—systems designed from the ground up for the realities of distributed computing.

This tutorial covers the architecture of systems like Amazon Dynamo, Apache Cassandra, and Riak. These aren't academic exercises—Dynamo powered Amazon's shopping cart service, and its paper influenced Cassandra, Riak, and Voldemort. Understanding their design will make you a better distributed systems engineer.

Why Distributed Data Stores?

Before diving into the architecture, let's understand the problem:

| Challenge | Traditional DB | Distributed Store |
| --- | --- | --- |
| Scale | Vertical only | Horizontal |
| Failure | Single point of failure | Designed for failure |
| Availability | Typically CP (favors consistency over availability) | AP (favors availability over consistency) |
| Geography | Single data center | Multi-region |
| Partition tolerance | Not needed | Built-in |
💡

The CAP theorem states you can have at most two of: Consistency, Availability, and Partition tolerance. Dynamo-style stores choose availability and partition tolerance (AP), trading strong consistency for always-on availability. They achieve "eventual consistency" through the mechanisms covered in the rest of this tutorial.


Consistent Hashing

The foundation of distributed data stores. Consistent hashing minimizes data movement when nodes join or leave.

The Basic Problem

Traditional hashing: node = hash(key) % num_nodes

This works until you add a node. Then num_nodes changes and nearly every key maps to a different node: growing from n to n+1 nodes remaps roughly n/(n+1) of all keys. For a system with millions of keys, this causes massive cache invalidation and rebalancing storms.
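To see the scale of the problem, here's a minimal sketch (MD5 and made-up user keys, purely for illustration) that counts how many keys change owners when a 4-node cluster grows to 5:

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Naive placement: hash the key, then take it modulo the node count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change owner going from 4 to 5 nodes")
# Prints roughly 80%: with modulo placement, growing from n to n+1 nodes
# remaps about n/(n+1) of all keys.
```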

The Hash Ring Solution

Consistent hashing maps both keys and nodes onto a circular hash space (a ring):

Key placement: Walk clockwise from the key's hash position until you find a node. That node owns the key.
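A minimal ring sketch in Python (the MD5-based hash and node names are illustrative choices, not any particular system's implementation):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string onto the ring's hash space."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # The ring is just a sorted list of (position, node) pairs.
        self._ring = sorted((ring_hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def owner(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node."""
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))   # whichever node sits clockwise of user:42's hash
```

When a node joins or leaves, only the keys between it and its predecessor on the ring change owners; everything else stays put.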

Virtual Nodes for Load Balancing

Physical nodes can have different capacities (a big server vs a small one), and placing each node at a single ring position spreads load unevenly. Virtual nodes solve both problems: every physical node is placed at many positions on the ring, and more powerful machines are assigned more virtual nodes.

Replication strategy: When a key hashes to position P, replicate it to the next N distinct physical nodes clockwise, skipping virtual nodes that belong to a machine already in the list (see the sketch below).
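Extending the same sketch with virtual nodes and a preference list (the capacities and node names are made up; real systems choose vnode counts per machine based on capacity):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class VNodeRing:
    def __init__(self, capacities):
        # Each physical node is placed at `capacity` positions on the ring,
        # so bigger machines own proportionally more of the key space.
        points = []
        for node, capacity in capacities.items():
            for i in range(capacity):
                points.append((ring_hash(f"{node}#vnode{i}"), node))
        self._ring = sorted(points)
        self._positions = [pos for pos, _ in self._ring]

    def preference_list(self, key, n):
        """The next n *distinct* physical nodes clockwise from the key.
        n must not exceed the number of physical nodes."""
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        replicas = []
        while len(replicas) < n:
            node = self._ring[idx % len(self._ring)][1]
            if node not in replicas:
                replicas.append(node)
            idx += 1
        return replicas

ring = VNodeRing({"big-node": 200, "small-node": 100, "medium-node": 150})
print(ring.preference_list("cart:1234", n=3))   # three distinct physical nodes
```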

Quorum Systems

Quorum determines how many nodes must acknowledge reads and writes:

| Configuration | R | W | N | Latency profile | Optimized for |
| --- | --- | --- | --- | --- | --- |
| Strong write (W = N) | 1 | N | 3 | Writes slow, reads fast | Reads |
| Strong read (R = N) | N | 1 | 3 | Writes fast, reads slow | Writes |
| Balanced quorum | N/2 + 1 | N/2 + 1 | 3 | Moderate on both; quorums always intersect | Mixed workloads |

The key invariant: if R + W > N, every read quorum overlaps every write quorum, so a read always contacts at least one replica holding the latest acknowledged write (in particular, you can read your own writes).
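A toy simulation of the invariant (not a real client library; replica state is just an in-memory list) with N = 3, R = 2, W = 2:

```python
import random

N, R, W = 3, 2, 2   # R + W > N, so every read set overlaps every write set

replicas = [{"value": None, "version": 0} for _ in range(N)]

def write(value, version):
    """A write succeeds once W replicas acknowledge it."""
    for i in random.sample(range(N), W):
        replicas[i] = {"value": value, "version": version}

def read():
    """Read from R replicas and return the freshest version seen."""
    sampled = [replicas[i] for i in random.sample(range(N), R)]
    return max(sampled, key=lambda r: r["version"])["value"]

write("cart-v2", version=2)
print(read())   # always "cart-v2": any 2 read replicas overlap the 2 written ones
```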


Versioning and Vector Clocks

In distributed systems, you can't just use "last write wins" because clocks are unreliable and network partitions cause concurrent writes. Vector clocks track causal history.

The Problem with Timestamps

When the network partitions, both sides can make conflicting changes. Timestamps alone can't resolve this.

Vector Clock Solution

Each version has a vector clock—a map of node IDs to sequence numbers:

| Concept | Description |
| --- | --- |
| Vector clock | Map of node_id → counter |
| Increment | A node increments its own counter when it makes a change |
| Merge | For each node, take the max of the two counters |
| Happens-before | Version A happened-before B if all of A's counters are ≤ B's (and at least one is strictly smaller) |
| Conflict | Two versions conflict if neither happened-before the other |

Version History

Merge process:

  • Merging v3 (clock A:3) with v4 (clock A:2, B:1) takes the per-node max, giving v5 with clock A:3, B:1
  • v5 happened-after both v3 and v4, because every counter in v5 is at least as large as the corresponding counter in each of them
  • v6 (clock A:4) is concurrent with v5: A:4 > A:3, but v6 is missing v5's B:1, so neither version dominates → CONFLICT!
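The compare-and-merge rules fit in a few lines; this sketch replays the example above using plain dictionaries as the clocks:

```python
def dominates(a: dict, b: dict) -> bool:
    """True if clock a happened-after (or equals) clock b: every counter
    in b is matched or exceeded in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock dominates the other: the versions conflict."""
    return not dominates(a, b) and not dominates(b, a)

def merge(a: dict, b: dict) -> dict:
    """Merged clock: per-node maximum of both counters."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

v3 = {"A": 3}
v4 = {"A": 2, "B": 1}
v5 = merge(v3, v4)
v6 = {"A": 4}

print(v5)                    # {'A': 3, 'B': 1}
print(dominates(v5, v4))     # True: v5 happened-after v4
print(concurrent(v6, v5))    # True: A:4 > A:3 but v6 lacks B:1, so it's a conflict
```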

Conflict Resolution Strategies

| Strategy | How It Works | Trade-off |
| --- | --- | --- |
| Last Write Wins (LWW) | Use wall-clock timestamps | Simple, but silently loses data |
| Application-level merge | Let the application decide how to combine versions | Best for domain-specific logic |
| Return all versions | Hand all conflicting siblings back to the client | Maximum flexibility, but pushes the work onto clients |
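As a concrete example of application-level merging, the shopping-cart approach described in the Dynamo paper unions conflicting siblings (the item names here are made up):

```python
def merge_cart_siblings(siblings):
    """Application-level merge: keep every item seen in any sibling.
    Trade-off: an item deleted concurrently with an add can reappear."""
    merged = set()
    for cart in siblings:
        merged |= cart
    return merged

# Two replicas accepted writes independently during a partition.
sibling_1 = {"book", "headphones"}
sibling_2 = {"book", "coffee"}
print(merge_cart_siblings([sibling_1, sibling_2]))   # book, headphones, coffee (in some order)
```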

Hinted Handoff

When a node is temporarily down, you don't want to fail writes. Hinted handoff stores writes for later delivery:

Process:

  1. Write to target node fails (node down)
  2. Write to next healthy node succeeds
  3. Store hint: "when Node A comes back, deliver this write"
  4. When target node recovers, deliver the hinted write
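A sketch of the coordinator-side bookkeeping (the in-memory hints and toy Replica objects are assumptions; production systems persist hints durably):

```python
from collections import defaultdict

class Replica:
    def __init__(self):
        self.alive = True
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

class Coordinator:
    def __init__(self, nodes):
        self.nodes = nodes                 # node name -> Replica
        self.hints = defaultdict(list)     # intended node -> pending writes

    def write(self, intended, fallback, key, value):
        if self.nodes[intended].alive:
            self.nodes[intended].put(key, value)
        else:
            # Steps 2-3: write to a healthy node, store a hint for the target.
            self.nodes[fallback].put(key, value)
            self.hints[intended].append((key, value))

    def on_recovery(self, name):
        # Step 4: replay the hinted writes to the recovered node.
        self.nodes[name].alive = True
        for key, value in self.hints.pop(name, []):
            self.nodes[name].put(key, value)

nodes = {"A": Replica(), "B": Replica()}
coord = Coordinator(nodes)
nodes["A"].alive = False
coord.write("A", "B", "cart:1", "book")   # A is down; B takes the write plus a hint
coord.on_recovery("A")
print(nodes["A"].data)                    # {'cart:1': 'book'}
```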

Anti-Entropy and Merkle Trees

Even with quorum, replicas can diverge. Anti-entropy is the process of repairing divergence between replicas.

Merkle Trees

A Merkle tree is a hash tree where each leaf is the hash of a data block:

How it works:

  • Each leaf is the hash of a data range
  • Each internal node is the hash of its children
  • Comparing root hashes tells if trees are identical
  • Comparing child hashes identifies divergent ranges

Anti-Entropy Protocol

Process:

  1. Compare root hashes of Merkle trees
  2. If different, compare child hashes
  3. Recursively narrow down to divergent range
  4. Sync only the divergent keys
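A minimal sketch of both pieces: building a Merkle tree over fixed key ranges and recursing only into subtrees whose hashes differ (SHA-256 and the four-range split are illustrative assumptions):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(ranges):
    """Bottom-up levels: levels[0] holds leaf hashes, levels[-1] the root hash."""
    level = [h(repr(sorted(r.items())).encode()) for r in ranges]
    levels = [level]
    while len(level) > 1:
        padded = level + [level[-1]] if len(level) % 2 else level
        level = [h(padded[i] + padded[i + 1]) for i in range(0, len(padded), 2)]
        levels.append(level)
    return levels

def divergent_leaves(a, b, level=None, idx=0):
    """Compare hashes top-down, recursing only into subtrees that differ."""
    if level is None:
        level = len(a) - 1                   # start at the root
    if a[level][idx] == b[level][idx]:
        return []                            # subtree identical: nothing to sync
    if level == 0:
        return [idx]                         # a divergent leaf range
    out = []
    for child in (2 * idx, 2 * idx + 1):
        if child < len(a[level - 1]):
            out += divergent_leaves(a, b, level - 1, child)
    return out

# Both replicas split the key space into the same four ranges; range 2 diverged.
replica_1 = [{"k1": 1}, {"k2": 2}, {"k3": 3}, {"k4": 4}]
replica_2 = [{"k1": 1}, {"k2": 2}, {"k3": 99}, {"k4": 4}]
print(divergent_leaves(build_tree(replica_1), build_tree(replica_2)))   # [2]
```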

CRDTs: Conflict-Free Replicated Data Types

For certain data types, you can define merge operations that are commutative, associative, and idempotent, so replicas converge to the same state regardless of the order in which updates are applied or how many times they are delivered.

Why CRDTs Matter

Traditional conflict resolution requires application knowledge. CRDTs provide mathematically proven conflict-free merging at the data structure level.

Common CRDTs

| CRDT Type | Description | Example Use |
| --- | --- | --- |
| G-Counter (grow-only) | Can only increment, never decrement | Page view counts |
| LWW-Register | Last write wins | Configuration values |
| PN-Counter | Can increment and decrement | Vote counts, account balances |
| OR-Set | Add/remove with conflict-free merge | Tags, bookmarks, shopping carts |

How CRDTs Work

| Type | Merge Rule |
| --- | --- |
| G-Counter | For each node, take the max of the two counters |
| LWW-Register | If the other replica's timestamp is newer, take its value |
| PN-Counter | Two G-Counters (one for increments, one for decrements), each merged by per-node max; value = increments − decrements |
| OR-Set | Union the tagged adds and removes; an element is present if some add tag has no matching remove (adds win over concurrent removes) |
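A sketch of the simplest entry in the table, a G-Counter (node IDs are illustrative; real CRDT libraries track more metadata):

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by per-node maximum."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}       # node id -> increments observed from that node

    def increment(self, amount: int = 1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter"):
        # Merge rule from the table: for each node, take the max counter.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas count page views independently, then merge in either order.
a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)
b.increment(5)
a.merge(b)
b.merge(a)
print(a.value(), b.value())   # 8 8  (both replicas converge to the same total)
```

A PN-Counter is just two of these, one for increments and one for decrements, merged independently.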

Sloppy Quorum and Hinted Replicas

In production, some nodes may be unavailable. Sloppy quorum uses the next N healthy nodes instead of strict N successors:

Key insight: Sloppy quorum improves availability but means your data might not be on the expected nodes. Hinted handoff ensures eventual delivery to the target node.
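A small sketch of the difference (the ring layout and the set of downed nodes are made up):

```python
def strict_preference_list(ring, start, n):
    """Strict quorum: always the first n successors, healthy or not."""
    return [ring[(start + i) % len(ring)] for i in range(n)]

def sloppy_preference_list(ring, start, n, down):
    """Sloppy quorum: skip unhealthy nodes and keep walking until n healthy ones."""
    chosen, i = [], 0
    while len(chosen) < n and i < len(ring):
        node = ring[(start + i) % len(ring)]
        if node not in down:
            chosen.append(node)
        i += 1
    return chosen

ring = ["A", "B", "C", "D", "E"]
print(strict_preference_list(ring, start=1, n=3))               # ['B', 'C', 'D']
print(sloppy_preference_list(ring, start=1, n=3, down={"C"}))   # ['B', 'D', 'E']
```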


Summary: Dynamo-Style Architecture

| Component | Purpose | Key Algorithm |
| --- | --- | --- |
| Consistent hashing | Node selection | Hash ring with virtual nodes |
| Replication | Durability | N replicas per key |
| Quorum | Consistency | R + W > N |
| Vector clocks | Version tracking | Causal history |
| Hinted handoff | Availability during failures | Store and deliver later |
| Anti-entropy | Repair divergence | Merkle trees |
| CRDTs | Conflict-free data types | Mathematically proven merges |
💡

These patterns are used in production systems you interact with daily. Cassandra uses consistent hashing and quorum. Riak uses vector clocks and CRDTs. Redis Cluster uses consistent hashing (with hash slots, a variant). Understanding the internals helps you tune these systems for your workload.

In the next tutorial, we'll explore Fault Tolerance and Resilience—the patterns that keep distributed systems running when components fail.