/ˌrɛplɪˈkeɪʃən/

noun … “Copy data across nodes to ensure reliability.”

Replication is the process of creating and maintaining multiple copies of data across different nodes in a Distributed System. Its purpose is to enhance Availability, fault tolerance, and performance by allowing data to remain accessible even if some nodes fail. Replication is fundamental to distributed databases, file systems, and cloud storage platforms.

Key characteristics of Replication include:

  • Redundancy: multiple copies of the same data exist on different nodes to prevent data loss.
  • Consistency: replication strategies define how and when updates propagate, balancing between strong consistency and eventual consistency.
  • Durability: replicated data ensures that committed writes are not lost even if a node crashes.
  • Performance: replication can improve read throughput by allowing multiple nodes to serve requests concurrently.
  • Coordination: algorithms like Paxos and Raft are often used to ensure replicated logs remain consistent across nodes.

Workflow example: In a replicated key-value store, a client writes a new value. The leader node appends the value to its log and replicates it to follower nodes. Once a majority acknowledges, the value is committed and applied locally and on followers. Reads can be served from any node, and failed nodes can catch up with the latest state once they recover.

-- Simplified replication example
nodes = ["Node1", "Node2", "Node3"]
value = 100
leader = "Node1"
leader_log.append(value)
for node in nodes {
    if node != leader:
        replicate(node, value)
}
commit(value)
-- All nodes now contain the value 100

Conceptually, Replication is like distributing multiple copies of a book to several libraries. Even if one library is closed or destroyed, readers can still access the same content elsewhere.

See Distributed Systems, Consensus, Raft, Paxos, Availability.