Core Building Blocks

Data Replication: Strategies, Consistency Models, and Conflict Resolution

Learn how to replicate data across multiple nodes for availability and fault tolerance, covering synchronous vs asynchronous replication, consistency models, and conflict resolution strategies.

22 min read · Tags: data replication, consistency, availability, fault tolerance, system design

Why Data Replication Matters

Replicating data across multiple nodes is fundamental to building highly available and fault-tolerant systems. By maintaining copies of data, we can survive node failures, serve users from geographically closer locations, and scale read throughput.

CAP theorem reminder: When a network partition occurs, a replicated system must choose between consistency and availability. Understanding this trade-off is key to choosing the right replication strategy.

However, replication introduces complexity: ensuring consistency between replicas, handling network partitions, and resolving conflicts when replicas diverge.


Synchronous vs Asynchronous Replication

Synchronous Replication

  • How it works: A write is not considered complete until it has been written to all replicas (or a quorum of replicas).
  • Pros: Strong consistency; all replicas are identical at all times.
  • Cons: Write latency is limited by the slowest replica; system blocks if any replica is unavailable.
  • Use case: Financial systems where strong consistency is required (e.g., banking transactions).

Asynchronous Replication

  • How it works: A write is considered complete once written to the primary; replicas are updated later.
  • Pros: Low write latency (only waits for primary); system continues if replicas are temporarily unavailable.
  • Cons: Replicas may lag behind (replication lag); risk of data loss if primary fails before replicating.
  • Use case: Most web applications where low latency is prioritized over strong consistency.
⚠️

Replication lag: In asynchronous replication, replicas may temporarily serve stale reads. Monitor lag and consider techniques like read-after-write consistency for critical operations.
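
The difference between the two modes can be sketched in a few lines. This is a minimal Python model (all class names are illustrative, not a real database API): the synchronous write blocks on every replica before returning, while the asynchronous write returns after the local commit and ships changes in the background.

```python
import threading
import time

class Replica:
    def __init__(self, name, delay=0.0):
        self.name = name
        self.delay = delay          # simulated network latency
        self.data = {}

    def apply(self, key, value):
        time.sleep(self.delay)
        self.data[key] = value

class Primary:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write_sync(self, key, value):
        # Synchronous: the write completes only after every replica applies it,
        # so latency is bounded by the slowest replica.
        self.data[key] = value
        for r in self.replicas:
            r.apply(key, value)

    def write_async(self, key, value):
        # Asynchronous: commit locally, ship to replicas in the background.
        # Replicas may briefly lag; a primary crash here can lose the write.
        self.data[key] = value
        for r in self.replicas:
            threading.Thread(target=r.apply, args=(key, value)).start()

primary = Primary([Replica("r1", delay=0.05), Replica("r2", delay=0.2)])
primary.write_sync("balance", 100)   # returns once all replicas have the value
primary.write_async("balance", 150)  # returns immediately; replicas catch up later
```

Note how `write_sync` pays the latency of the slowest replica (0.2s here) on every call, which is exactly the cost the "Cons" bullet above describes.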


Replication Architectures

Leader-Based (Master-Slave) Replication

  • One node (leader) handles all writes; replicas (followers) replicate from the leader.
  • Writes: Go to the leader only.
  • Reads: Can be served by leader or followers (with potential staleness).
  • Failover: On leader failure, a follower is promoted to leader.
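
The write path, read path, and failover step above can be sketched together. This is a simplified Python model (names like `LeaderBasedCluster` are invented for illustration; replication here is instant, whereas real followers lag):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

class LeaderBasedCluster:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers

    def write(self, key, value):
        # All writes go through the leader, then replicate to followers.
        if not self.leader.alive:
            self.promote_follower()
        self.leader.data[key] = value
        for f in self.followers:
            f.data[key] = value     # simplified: instant, in-order replication

    def read(self, key, from_follower=False):
        # Follower reads scale throughput, but in a real system may be stale.
        node = self.followers[0] if from_follower and self.followers else self.leader
        return node.data.get(key)

    def promote_follower(self):
        # Failover: the first live follower becomes the new leader.
        self.followers = [f for f in self.followers if f.alive]
        self.leader = self.followers.pop(0)

cluster = LeaderBasedCluster(Node("n1"), [Node("n2"), Node("n3")])
cluster.write("user:1", "alice")
cluster.leader.alive = False        # simulate a leader crash
cluster.write("user:1", "bob")      # triggers promotion of n2 to leader
```

Real failover is much harder than this sketch: detecting the crash, choosing the most up-to-date follower, and fencing off the old leader are where most of the complexity lives.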

Multi-Leader Replication

  • Multiple nodes can accept writes; changes are replicated between leaders.
  • Writes: Can go to any leader.
  • Reads: Can be served by any node.
  • Conflict resolution: Required when the same data is modified concurrently on different leaders.
  • Use case: Multi-datacenter setups where low-latency writes are needed locally.

Leaderless (Dynamo-style) Replication

  • No designated leader; each node can accept writes and coordinates with others.
  • Writes: Sent to multiple nodes (typically a quorum).
  • Reads: Read from multiple nodes and return the most recent value (often using vector clocks or timestamps).
  • Conflict resolution: Built-in mechanisms like last-write-wins or application-specific handlers.
  • Use case: High-availability systems like Amazon DynamoDB, Apache Cassandra.
💡

Quorum: In leaderless systems, a write succeeds once acknowledged by W of the N replicas, and a read succeeds once answered by R of them. If W + R > N, every read quorum overlaps every write quorum, so at least one node in any read set holds the most recent successful write.
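
The overlap guarantee can be demonstrated with a toy quorum store. This is a hedged sketch (replicas are plain dicts, versions are integers, and the read set is chosen deterministically for simplicity; real systems pick nodes by hashing and handle failures):

```python
def quorum_write(nodes, key, value, version, w):
    # Write to nodes until W acknowledgments are collected.
    acks = 0
    for node in nodes:
        node[key] = (version, value)
        acks += 1
        if acks >= w:
            break
    return acks >= w

def quorum_read(nodes, key, r):
    # Read from R nodes and keep the value with the highest version.
    replies = sorted((node.get(key, (0, None)) for node in nodes[:r]),
                     reverse=True)
    return replies[0][1]

nodes = [{}, {}, {}]                     # N = 3 replicas
quorum_write(nodes, "cart", "book", version=1, w=2)
print(quorum_read(nodes, "cart", r=2))   # → book
```

With N=3, W=2, R=2 we have W + R = 4 > 3, so the two nodes a read contacts must include at least one of the two nodes the write reached. With W=1, R=1 the read could land entirely on the node the write skipped.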


Consistency Models in Replicated Systems

Strong Consistency

  • After a write completes, all subsequent reads (from any node) return that value.
  • Achieved with synchronous replication or quorum reads/writes in leaderless systems.
  • Cost: Higher latency, lower availability during partitions.

Eventual Consistency

  • If no new updates are made, eventually all accesses will return the last updated value.
  • Common in asynchronous replication and leaderless systems with low quorum requirements.
  • Cost: Temporary inconsistencies; requires conflict resolution.

Causal Consistency

  • Preserves causal relationships: if event A causally precedes event B, then everyone who sees B must also have seen A.
  • Stronger than eventual but weaker than strong consistency.
  • Use case: Social media feeds where seeing a comment before its parent post would be confusing.

Read-After-Write Consistency

  • A client that performs a write will always see that write in subsequent reads (but other clients might not).
  • Often implemented by routing reads to the leader or using session guarantees.
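
One common implementation of this guarantee is a session-aware router: clients who wrote recently read from the leader, everyone else may read from a replica. A minimal Python sketch (the `SessionRouter` class and `max_lag` heuristic are assumptions for illustration, not a specific product's API):

```python
import time

class SessionRouter:
    """Route a client's reads to the leader until replicas have plausibly
    caught up with that client's last write (a simple session guarantee)."""

    def __init__(self, leader, replica, max_lag=0.5):
        self.leader = leader
        self.replica = replica
        self.max_lag = max_lag        # assumed upper bound on replication lag
        self.last_write_at = {}       # client_id -> timestamp of last write

    def write(self, client_id, key, value):
        self.leader[key] = value
        self.last_write_at[client_id] = time.monotonic()

    def read(self, client_id, key):
        wrote_at = self.last_write_at.get(client_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.max_lag:
            return self.leader.get(key)   # recent writer: read your own write
        return self.replica.get(key)      # everyone else may see stale data

router = SessionRouter(leader={}, replica={})
router.write("alice", "bio", "hello")
router.read("alice", "bio")   # "hello" from the leader, before replication
router.read("bob", "bio")     # None from the still-stale replica
```

This illustrates the "but other clients might not" caveat directly: alice sees her write immediately, while bob reads the lagging replica.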

Client-centric consistency: Consider what consistency guarantees your clients actually need. Not all clients require strong consistency; many can tolerate brief staleness.


Conflict Resolution in Multi-Leader and Leaderless Systems

When the same data is updated concurrently on different nodes, conflicts arise. Strategies include:

Last-Write-Wins (LWW)

  • Uses timestamps to determine which write is "last."
  • Simple but can lose updates if clocks are not synchronized (clock skew).
  • Mitigation: Use hybrid logical clocks or synchronized clocks (e.g., Google's TrueTime).
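
LWW itself is tiny, which is part of its appeal. A sketch of the merge rule, using a (timestamp, node_id, value) tuple so ties break deterministically (the tuple layout is an assumption for illustration):

```python
def lww_resolve(a, b):
    """Last-write-wins: keep the version with the later timestamp.
    Tie-break on node_id so every replica picks the same winner."""
    # Each version is (timestamp, node_id, value); Python tuple comparison
    # orders by timestamp first, then node_id.
    return max(a, b)

v1 = (1700000000.0, "node-a", "blue")
v2 = (1700000005.0, "node-b", "green")
lww_resolve(v1, v2)   # v2 wins; v1's update to "blue" is silently discarded
```

The discarded update is the danger flagged below: nothing records that "blue" ever existed, and skewed clocks can make the "later" timestamp the wrong one.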

Vector Clocks

  • Track causality between events; concurrent updates are detected and must be resolved.
  • More accurate than timestamps but increases metadata overhead.
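
A vector clock is a per-node counter map, and the comparison rule is what detects concurrency. A minimal sketch of that comparison:

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # A causally precedes B: B can safely overwrite A
    if b_le_a:
        return "after"
    return "concurrent"       # neither saw the other: a true conflict

compare({"n1": 2, "n2": 1}, {"n1": 3, "n2": 1})  # "before"
compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 3})  # "concurrent"
```

The "metadata overhead" cost mentioned above is visible here: each value carries one counter per node that ever wrote it, which grows with cluster size and churn.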

Application-Specific Conflict Resolution

  • Let the application decide how to merge conflicting updates (e.g., merge shopping carts, take max counter).
  • Requires semantic understanding of the data.
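
The shopping-cart example is the classic case: merge by union, keeping the larger quantity per item, so no added item is ever lost. A hedged sketch (carts modeled as item-to-quantity dicts, an assumption for illustration):

```python
def merge_carts(cart_a, cart_b):
    # Merge conflicting carts: union of items, max quantity seen per item.
    # This "never lose an add" rule is a semantic choice about cart data.
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

merge_carts({"book": 1, "pen": 2}, {"pen": 3, "mug": 1})
# {'book': 1, 'pen': 3, 'mug': 1}
```

Note the trade-off this rule makes: because it only ever adds, an item deleted on one replica can reappear after a merge, which is why deletes often need extra metadata (tombstones).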

Conflict-Free Replicated Data Types (CRDTs)

  • Data structures designed to automatically merge concurrent updates correctly.
  • Examples: G-Counter (grow-only counter), LWW-Element-Set, OR-Set.
  • Use case: Collaborative editing, counters, sets where you can define merge semantics.
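
The G-Counter mentioned above is the simplest CRDT and shows the core idea: each node increments only its own slot, and merge takes the per-node maximum, so merges are commutative, associative, and idempotent. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT: replicas converge no matter the merge order."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}              # node_id -> increments seen at that node

    def increment(self, by=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + by

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-node max: applying the same merge twice changes nothing.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()      # replica a sees 2
b.increment()                     # replica b sees 1
a.merge(b); b.merge(a)            # exchange state in any order
a.value(), b.value()              # both replicas now agree on 3
```

Because merge is a max, a counter like this cannot decrement; supporting decrements requires a second slot map (the PN-Counter variant).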
🚨

Silent conflicts: Some systems (like basic LWW) may silently drop updates. Always understand your system's conflict resolution and monitor for conflicts.


Practical Considerations

Replication Lag Monitoring

  • Track the time difference between when a write is committed on the primary and when it appears on replicas.
  • Set alerts for excessive lag that might affect user experience.
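
The metric itself is simple to compute once you can compare the primary's latest commit timestamp with the replica's last applied one. A hedged sketch (the function name and threshold are illustrative; real databases usually expose this via their own replication-status views):

```python
import time

def replication_lag(primary_commit_ts, replica_applied_ts):
    # Lag = how far the replica's last applied write trails the
    # primary's latest commit, in seconds (clamped at zero).
    return max(0.0, primary_commit_ts - replica_applied_ts)

ALERT_THRESHOLD = 5.0   # seconds; tune to what your users can tolerate

lag = replication_lag(primary_commit_ts=time.time(),
                      replica_applied_ts=time.time() - 12.0)
if lag > ALERT_THRESHOLD:
    print(f"replication lag {lag:.1f}s exceeds {ALERT_THRESHOLD}s threshold")
```

Measuring both timestamps with the same clock source matters here; comparing clocks on different machines reintroduces the clock-skew problems discussed under LWW.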

Read Replica Routing

  • Decide whether to route reads to replicas (for scalability) or only to the primary (for consistency).
  • Consider using sticky sessions or session guarantees for read-after-write consistency.

Backup and Restore

  • Replication is not a backup! Replicas can propagate corruptions or accidental deletes.
  • Maintain independent backups for disaster recovery.

Network Partitions

  • Understand how your system behaves during a network split (e.g., does it continue with risk of divergence?).
  • Test partition scenarios to ensure your conflict resolution and recovery mechanisms work.
💡

Geo-replication: Replicating across data centers adds latency but improves disaster recovery and serves users closer to them. Consider asynchronous replication with conflict resolution for geo-distributed systems.


What to Remember for Interviews

  1. Replication types: Know synchronous vs asynchronous, leader-based vs multi-leader vs leaderless.
  2. Consistency models: Understand strong, eventual, causal, and read-after-write consistency.
  3. Trade-offs: Recognize the latency, availability, and consistency implications of each choice.
  4. Conflict resolution: Be familiar with LWW, vector clocks, application-specific resolution, and CRDTs.
  5. Practical concerns: Know to monitor replication lag, understand that replication ≠ backup, and test network partitions.

Practice: Design a user profile service for a global social network. You need low-latency reads and writes worldwide, can tolerate brief inconsistencies, and must survive data center failures. What replication strategy would you choose and why?