Database Replication
Database Replication¶
Difficulty: βββ (Intermediate)¶
Overview¶
Database replication is a technique that copies identical data to multiple nodes to improve availability, fault tolerance, and read performance. In this document, you will learn about various replication strategies, consistency guarantee mechanisms, and failure recovery methods.
Table of Contents¶
- Concept and Purpose of Replication
- Single-Leader Replication
- Multi-Leader Replication
- Leaderless Replication
- Synchronous/Asynchronous Replication
- Replication Lag and Consistency Issues
- Failure Recovery and Leader Election
- Quorum and Consistency Levels
- Practice Problems
- Next Steps
- References
1. Concept and Purpose of Replication¶
Why Replication is Needed¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Main Purposes of Replication β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. High Availability β
β βββββββββββ βββββββββββ β
β β Primary β ββXββ> β Replica β β When Primary fails, β
β β (Down) β β (Active)β Replica maintains β
β βββββββββββ βββββββββββ service β
β β
β 2. Read Scalability β
β βββββββββββ β
β β Primary ββββββ¬ββββ> Replica 1 ββββ Read β
β β (Write) β βββββ> Replica 2 ββββ Read β
β βββββββββββ βββββ> Replica 3 ββββ Read β
β β
β 3. Geographical Distribution β
β β
β Seoul ββββββββββββββ> Tokyo ββββββββββββββ> US-West β
β [Primary] [Replica] [Replica] β
β Latency: 0ms Latency: ~30ms Latency: ~100ms β
β β
β 4. Data Protection β
β - Hardware failure protection β
β - Data center failure protection β
β - Disaster Recovery β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Types of Replication Architecture¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Replication Architecture Comparison β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Single-Leader β
β ββββββββββ β
β β Leader β ββWriteββ> β
β βββββ¬βββββ β
β β Replicate β
β βββββ΄ββββββββββββ β
β βΌ βΌ βΌ β
β Follower Follower Follower β
β β
β 2. Multi-Leader β
β ββββββββββ ββββββββββ ββββββββββ β
β βLeader 1β<βββ>βLeader 2β<βββ>βLeader 3β β
β βββββ¬βββββ βββββ¬βββββ βββββ¬βββββ β
β β β β β
β Follower Follower Follower β
β β
β 3. Leaderless β
β ββββββββ ββββββββ ββββββββ β
β βNode 1β βNode 2β βNode 3β β
β ββββββββ ββββββββ ββββββββ β
β β² β² β² β
β ββββββββββββ΄βββββββββββ β
β All nodes are equal β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
2. Single-Leader Replication¶
How It Works¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Single-Leader Replication β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Client β
β β β
β β Write Request β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββ β
β β Leader (Primary) β β
β β βββββββββββββββββββββββββββββββββββ β β
β β β Transaction Log β β β
β β β [1] INSERT INTO users... β β β
β β β [2] UPDATE orders... β β β
β β β [3] DELETE from cart... β β β
β β βββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββ¬ββββββββββββββββββββββββ β
β β β
β βββββββββββΌββββββββββ β
β β Replication Log β β
β βΌ βΌ βΌ β
β βββββββββββββ βββββββββββββ βββββββββββββ β
β β Follower1 β β Follower2 β β Follower3 β β
β β (Replica) β β (Replica) β β (Replica) β β
β βββββββ¬ββββββ βββββββ¬ββββββ βββββββ¬ββββββ β
β β β β β
β βββββββββββββββ΄ββββββββββββββ β
β β β
β Read Requests β
β β² β
β Clients β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Replication Methods¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Replication Methods Comparison β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Statement-Based Replication β
β βββββββββββββββββββββββββββββββββββββββββββ β
β β Leader: INSERT INTO users VALUES (...) β β
β β β β β
β β βΌ β β
β β Follower: INSERT INTO users VALUES (...) β β SQL replay β
β βββββββββββββββββββββββββββββββββββββββββββ β
β Problem: Non-deterministic functions like NOW(), RAND() β
β β
β 2. Write-Ahead Log (WAL) Shipping β
β βββββββββββββββββββββββββββββββββββββββββββ β
β β Leader WAL: β β
β β [Page 5, Offset 120, Data: 0x45AB...] β β
β β β β β
β β βΌ β β
β β Follower: Apply identically byte by byteβ β Low-level β
β βββββββββββββββββββββββββββββββββββββββββββ replication β
β Problem: Version compatibility (downtime on upgrade) β
β β
β 3. Row-Based (Logical) Replication β
β βββββββββββββββββββββββββββββββββββββββββββ β
β β Change Log: β β
β β {table: "users", β β
β β op: "INSERT", β β
β β new_row: {id:1, name:"Kim"}} β β Logical β
β β β β changes β
β β βΌ β β
β β Follower: Apply row data β β
β βββββββββββββββββββββββββββββββββββββββββββ β
β Advantages: Version compatible, flexible replication β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
PostgreSQL Streaming Replication Example¶
-- Primary configuration (postgresql.conf)
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
-- Primary: Create replication slot
SELECT * FROM pg_create_physical_replication_slot('replica1_slot');
-- Standby configuration (postgresql.conf)
primary_conninfo = 'host=primary-host port=5432 user=replicator'
primary_slot_name = 'replica1_slot'
-- Check replication status
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag
FROM pg_stat_replication;
MySQL Replication Configuration Example¶
-- Master configuration (my.cnf)
-- server-id=1
-- log_bin=mysql-bin
-- binlog_format=ROW
-- Master: Create replication user
CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
SHOW MASTER STATUS;
-- Slave configuration
CHANGE MASTER TO
MASTER_HOST='master-host',
MASTER_USER='repl',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=154;
START SLAVE;
SHOW SLAVE STATUS\G
3. Multi-Leader Replication¶
Architecture¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Multi-Leader Replication β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Data Center A Data Center B β
β βββββββββββββββββββ βββββββββββββββββββ β
β β Leader A β <ββββ> β Leader B β β
β β βββββββββββββ β β βββββββββββββ β β
β β β users: 1 β β Sync β β users: 1 β β β
β β β orders: 5 β β <ββββ> β β orders: 5 β β β
β β βββββββββββββ β β βββββββββββββ β β
β β β β β β β β
β β βββββ΄ββββ β β βββββ΄ββββ β β
β β βΌ βΌ β β βΌ βΌ β β
β β Follower Followerβ β Follower Followerβ β
β βββββββββββββββββββ βββββββββββββββββββ β
β β² β² β
β β β β
β [Clients A] [Clients B] β
β Latency: ~1ms Latency: ~1ms β
β β
β Advantages: β
β - Low write latency at each data center β
β - Can continue writing at other location on DC failure β
β β
β Disadvantages: β
β - Conflict resolution needed β
β - Increased complexity β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Write Conflict Problem¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Write Conflict Scenario β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Time ββββββββββββββββββββββββββββββββββββββββββββββββ> β
β β
β Leader A: β
β T1: UPDATE users SET name='Kim' WHERE id=1 β
β β β
β βΌ (replicate) β
β β
β Leader B: β
β T1: UPDATE users SET name='Lee' WHERE id=1 β
β β β
β βΌ (replicate) β
β β
β Result: β
β βββββββββββββββ βββββββββββββββ β
β β Leader A β β Leader B β β
β β name='Lee'? β β β name='Kim'? β Conflict! β
β βββββββββββββββ βββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Conflict Resolution Strategies¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Conflict Resolution Strategies β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Last Write Wins (LWW) β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β Write A: {value: "Kim", timestamp: 100} β β
β β Write B: {value: "Lee", timestamp: 105} β β
β β β β
β β Result: "Lee" (timestamp 105 > 100) β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β Problem: Possible data loss β
β β
β 2. Version Vector (Vector Clock) β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β Node A: {value: "Kim", version: [A:1, B:0]} β β
β β Node B: {value: "Lee", version: [A:0, B:1]} β β
β β β β
β β Detect concurrent writes β Merge needed β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β 3. CRDT (Conflict-free Replicated Data Type) β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β G-Counter: Maintain counter per node β β
β β Node A: +3 β β
β β Node B: +2 β β
β β Total: 3 + 2 = 5 (automatic merge) β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β 4. Custom Resolution Logic β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β Example: Shopping cart conflict β β
β β Cart A: [Item1, Item2] β β
β β Cart B: [Item1, Item3] β β
β β Merge: [Item1, Item2, Item3] (Union) β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Replication Topologies¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Replication Topology Types β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Circular β
β βββββ β
β β A β βββββββββββ β
β βββ¬ββ β β
β β βββΌββ β
β β β B β β
β β βββ¬ββ β
β βββΌββ β β
β β D β <ββββββββββ β
β βββ¬ββ β β
β βββ> βββββ <βββ β
β β C β β
β βββββ β
β Risk of entire replication stopping on failure β
β β
β 2. Star β
β βββββ β
β β B β β
β βββ¬ββ β
β βββββ β βββββ β
β β A βββΌββ C β β
β βββββ β βββββ β
β βββΌββ β
β βHubβ β Vulnerable when central node fails β
β βββββ β
β β
β 3. All-to-All β
β βββββ βββββ β
β β A β βββββ B β β
β βββ¬ββ βββ¬ββ β
β β β² β± β β
β β β³ β β
β β β± β² β β
β βββΌββ βββΌββ β
β β D β βββββ C β β
β βββββ βββββ β
β High fault tolerance, complex management β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
4. Leaderless Replication¶
Dynamo-Style Replication¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Leaderless Replication β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Characteristics: β
β - All nodes can read/write β
β - Client sends requests to multiple nodes simultaneously β
β - Used in Amazon Dynamo, Cassandra, Riak, etc. β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Client β β
β β β β β
β β Write: key=X, value=V β β
β β ββββββββββββββββΌβββββββββββββββ β β
β β βΌ βΌ βΌ β β
β β ββββββββ ββββββββ ββββββββ β β
β β βNode 1β βNode 2β βNode 3β β β
β β β X=V β β X=V β β(Down)β β β
β β β β β β β β β β β β β
β β ββββββββ ββββββββ ββββββββ β β
β β β β
β β N=3 (total replicas), W=2 (required write successes) β β
β β 2 nodes succeeded β Write successful! β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Read Repair and Anti-Entropy¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Data Repair Mechanisms β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Read Repair (Repair on Read) β
β β
β Client β
β β Read key=X β
β ββββββββββββββββββββββββββββββββββββββ β
β βΌ βΌ βΌ β
β ββββββββ ββββββββ ββββββββ β
β βNode 1β βNode 2β βNode 3β β
β βX=V2 β βX=V2 β βX=V1 β β Stale version β
β βver:2 β βver:2 β βver:1 β β
β ββββββββ ββββββββ ββββββββ β
β β β β β
β ββββββββββββββββ΄ββββββββββββββββββββββ β
β β β
β βΌ β
β Client: Compare versions β
β Return latest version V2 β
β Request V2 update to Node 3 βββ Read Repair β
β β
β 2. Anti-Entropy Process (Background Sync) β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Background Anti-Entropy Process β β
β β β β
β β Node 1 Node 2 Node 3 β β
β β ββββββ ββββββ ββββββ β β
β β β A β β A β β A β β β
β β β B β <ββββ> β B β <ββββ> β B β β β
β β β C β Merkle β C β Merkle β C* β β Mismatch β β
β β β D β Tree β D β Tree β D β β β
β β ββββββ Compare ββββββ Compare ββββββ β β
β β β β
β β Efficient difference detection and sync via Merkle β β
β β Tree β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Sloppy Quorum and Hinted Handoff¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Sloppy Quorum & Hinted Handoff β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Strict Quorum: Read/write only from designated nodes β
β β
β Sloppy Quorum: Other nodes substitute when some fail β
β β
β Normal situation: β
β ββββββββ ββββββββ ββββββββ β
β βNode 1β βNode 2β βNode 3β β Home nodes for key X β
β ββββββββ ββββββββ ββββββββ β
β β
β When Node 3 fails: β
β ββββββββ ββββββββ ββββββββ ββββββββ β
β βNode 1β βNode 2β β(Down)β βNode 4β β
β β X=V β β X=V β β β β X=V* β β Store with Hint β
β ββββββββ ββββββββ ββββββββ ββββββββ β
β β
β *Hinted Handoff: β
β ββββββββββββββββββββββββββββββββββββββββββββ β
β β Data stored in Node 4: β β
β β { β β
β β key: "X", β β
β β value: "V", β β
β β hint: "Node 3" β Original target node β β
β β } β β
β β β β
β β When Node 3 recovers β Transfer then β β
β β delete β β
β ββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
5. Synchronous/Asynchronous Replication¶
Synchronous Replication¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Synchronous Replication β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Client Leader Follower 1 Follower 2 β
β β β β β β
β β Write X=V β β β β
β βββββββββββββββ>β β β β
β β β Replicate β β β
β β ββββββββββββββββββ>β β β
β β ββββββββββββββββββββββββββββββββββ>β β
β β β β β β
β β β ACK β β β
β β β<ββββββββββββββββββ β β
β β β<ββββββββββββββββββββββββββββββββββ β
β β β β β β
β β ACK β (After all replicated) β β
β β<βββββββββββββββ β β β
β β β β β β
β β
β Advantages: β
β - Strong consistency guaranteed β
β - No data loss β
β β
β Disadvantages: β
β - Slow write latency (wait for all replicas) β
β - Cannot write when replica fails β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Asynchronous Replication¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Asynchronous Replication β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Client Leader Follower 1 Follower 2 β
β β β β β β
β β Write X=V β β β β
β βββββββββββββββ>β β β β
β β β β β β
β β ACK β (Immediate response) β β
β β<βββββββββββββββ β β β
β β β β β β
β β β Replicate (background) β β
β β ββββββββββββββββββ>β β β
β β ββββββββββββββββββββββββββββββββββ>β β
β β β β β β
β β
β Advantages: β
β - Fast write response β
β - Replica failure doesn't affect writes β
β β
β Disadvantages: β
β - Possible data loss on leader failure β
β - Read consistency issues (stale read) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Semi-Synchronous Replication¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Semi-Synchronous Replication β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Replicate synchronously to at least 1 replica, async for rest β
β β
β Client Leader Sync Follower Async Followers β
β β β β β β
β β Write X=V β β β β
β βββββββββββββββ>β β β β
β β β Replicate β β β
β β ββββββββββββββββ>β β β
β β β ACK β β β
β β β<ββββββββββββββββ β β
β β ACK β β β β
β β<βββββββββββββββ β β β
β β β Replicate (async) β β
β β ββββββββββββββββββββββββββββββββ>β β
β β β β β β
β β
β MySQL Semi-Sync Configuration: β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β -- Master β β
β β SET GLOBAL rpl_semi_sync_master_enabled = 1; β β
β β SET GLOBAL rpl_semi_sync_master_timeout = 10000; -- 10s β β
β β β β
β β -- Slave β β
β β SET GLOBAL rpl_semi_sync_slave_enabled = 1; β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
6. Replication Lag and Consistency Issues¶
Replication Lag¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Replication Lag Scenario β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Time ββββββββββββββββββββββββββββββββββββββββββββββββ> β
β β
β T0 T1 T2 T3 T4 T5 β
β β β β β β β β
β Leader: Write β β β β β
β X=1 β β β β β
β β β β β β β
β β βΌ β β β β
β Follower: X=1 β β β β
β (delay) β β β β
β β β β β
β Client A: Read from Leader β β β
β X=1 β β β β β
β β β β β
β Client B: Read from Followerβ β
β X=??? β At T2, β
β Follower not yet replicated β
β β Stale Read! β
β β
β Replication Lag = T1(Leader Write) ~ T2(Follower Apply) time β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Consistency Guarantee Levels¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Consistency Guarantee Levels β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Read-Your-Writes β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β User A: Write profile photo β β
β β β β β
β β βΌ β β
β β User A: Read profile β Should see new photo β β
β β β β
β β Solutions: β β
β β - Always read own changes from Leader β β
β β - Or timestamp-based read position β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β 2. Monotonic Reads β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β T1: Read from Follower A β X=2 β β
β β T2: Read from Follower B β X=1 β Problem! β β
β β β β
β β Appears as if time went backwards β β
β β β β
β β Solutions: β β
β β - Read from same replica per user β β
β β - Version-based reads β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β 3. Consistent Prefix Reads (Causality Preservation) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Chat: β β
β β A: "How's the weather in Seoul?" (T1) β β
β β B: "It's sunny!" (T2) β β
β β β β
β β With replication lag: β β
β β What C sees: β β
β β B: "It's sunny!" β T2 arrives first β β
β β A: "How's the weather in Seoul?" β T1 arrives β β
β β later β β
β β β β
β β Solutions: β β
β β - Store causally related data in same partition β β
β β - Causality tracking β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Replication Lag Monitoring¶
-- PostgreSQL replication lag check
SELECT
client_addr,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds
FROM pg_stat_replication;
-- MySQL replication lag check
SHOW SLAVE STATUS\G
-- Check Seconds_Behind_Master value
-- Lag-based routing logic (application)
/*
def get_connection(query_type, max_lag_seconds=5):
if query_type == 'write':
return master_connection
for replica in replicas:
if replica.lag_seconds < max_lag_seconds:
return replica.connection
# When all replicas have high lag, read from master
return master_connection
*/
7. Failure Recovery and Leader Election¶
Follower Failure Recovery¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Follower Failure Recovery β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Catch-up Recovery β
β β
β Normal state: β
β Leader: [1] [2] [3] [4] [5] [6] β
β Follower: [1] [2] [3] β
β β² β
β βββ Remembers last applied position β
β β
β After Follower recovers: β
β Leader: [1] [2] [3] [4] [5] [6] [7] [8] β
β Follower: [1] [2] [3] β [4] [5] [6] [7] [8] β
β β β
β Catch-up β
β β
β 2. Recovery Process β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β 1. Follower restarts β β
β β 2. Check last applied LSN/binlog position β β
β β 3. Request logs after that position from Leader β β
β β 4. Catch up while applying logs β β
β β 5. Resume real-time replication β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Leader Failure and Failover¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Leader Failure Recovery (Failover) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Failover Process: β
β β
β Step 1: Failure Detection β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ββββββββββ Heartbeat failure β β
β β β Leader β ββββXββββ (3 consecutive) β β
β β β (Down) β β β
β β ββββββββββ Timeout: 30 seconds β β
β β β β
β β Followers: "No Leader response!" β Start Failover β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Step 2: New Leader Election β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Follower A: LSN = 1000 β β
β β Follower B: LSN = 1050 β Most up-to-date β New Leader β β
β β Follower C: LSN = 980 β β
β β β β
β β Election criteria: β β
β β 1. Node with most up-to-date data β β
β β 2. Node priority β β
β β 3. Consensus algorithm (Raft, Paxos) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Step 3: Client Reconnection β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Clients βββββ (old) Leader: 10.0.0.1 β β
β β β² β β
β β β²βββ (new) Leader: 10.0.0.2 β β
β β β β
β β Methods: β β
β β - DNS update β β
β β - Virtual IP (VIP) transfer β β
β β - Proxy/Load Balancer configuration change β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Failover Problems¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Failover Problems β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Data Loss (with async replication) β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Old Leader: [1][2][3][4][5] β β
β β β Not replicated β β
β β New Leader: [1][2][3][4] β β
β β β β
β β Write [5] is lost! β β
β β β β
β β When Old Leader recovers: What about [5]? β β
β β β Typically discarded (conflict prevention) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β 2. Split Brain (Two Leaders problem) β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Network Partition β β
β β β β
β β ββββββββββ β³ ββββββββββ β β
β β βOld β βNew β β β
β β βLeader β Network split βLeader β β β
β β ββββββββββ ββββββββββ β β
β β β² β² β β
β β Clients A Clients B β β
β β (continue writing) (continue writing) β β
β β β β
β β β Data inconsistency occurs! β β
β β β β
β β Solutions: β β
β β - Fencing (STONITH: Shoot The Other Node In Head)β β
β β - Force shutdown Old Leader β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β 3. Timeout Configuration Dilemma β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Timeout too short: β β
β β - Unnecessary failover on network delay/load β β
β β - System instability β β
β β β β
β β Timeout too long: β β
β β - Delayed recovery on actual failure β β
β β - Increased service downtime β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Leader Election Algorithm¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Raft Leader Election β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β States: Follower, Candidate, Leader β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β ββββββββββββ Election βββββββββββββ β β
β β β β Timeout β β β β
β β β Follower β βββββββββββββββββ>β Candidate β β β
β β β β β β β β
β β ββββββββββββ βββββββ¬ββββββ β β
β β β² β β β
β β β Discovers β Receives β β
β β β current β majority β β
β β β leader β votes β β
β β β βΌ β β
β β β ββββββββββββ β β
β β βββββββββββββββββββββββββββ Leader β β β
β β β β β β
β β ββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Election Process: β
β 1. Follower becomes Candidate if no heartbeat received β
β 2. Increment Term number β
β 3. Vote for self + RequestVote to other nodes β
β 4. Leader elected when majority votes received β
β 5. Leader sends periodic heartbeats β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Example: 5-node cluster β β
β β β β
β β Node A (Candidate, Term 5) β β
β β β β β
β β βββ RequestVote ββ> Node B: Vote YES β β
β β βββ RequestVote ββ> Node C: Vote YES β β
β β βββ RequestVote ββ> Node D: Vote NO (already voted) β β
β β βββ RequestVote ββ> Node E: Vote YES β β
β β β β
β β 3/5 = Majority β Node A is Leader! β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
8. Quorum and Consistency Levels¶
Quorum Concept¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Quorum Concept β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β N = Number of replicas (Total Replicas) β
β W = Number of responses needed for write success (Write Quorum)β
β R = Number of responses needed for read success (Read Quorum) β
β β
β Strong consistency guarantee condition: R + W > N β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β When N=3, W=2, R=2: β β
β β β β
β β On write: β β
β β [Node1: X=V] [Node2: X=V] [Node3: ?] β β
β β β β (pending) β β
β β β β
β β On read: β β
β β [Node1: X=V] [Node2: ?] [Node3: X=V] β β
β β β (skip) β β β
β β β β
β β R(2) + W(2) = 4 > N(3) β β
β β β At least 1 node has the latest write! β β
β β β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββ β
β Write Set β β Read Set β
β (W=2) β βββββββββββ β (R=2) β
β βββββββββββ β β Node 1 β β βββββββββββ β
β β Node 1 ββββββΌββββ (overlap)βββββΌβββββ Node 1 β β
β β Node 2 β β βββββββββββ β β Node 3 β β
β βββββββββββ β (at least 1) β βββββββββββ β
β βββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Quorum Configuration Characteristics¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Quorum Configuration Strategies β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Based on N = 3 replicas β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Config 1: W=1, R=3 (Write priority) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Write: Only 1 node needs to succeed β Fast write β β β
β β β Read: Must read all 3 β Slow read β β β
β β β β β β
β β β Use case: Log collection, IoT data β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Config 2: W=3, R=1 (Read priority) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Write: All 3 must succeed β Slow write β β β
β β β Read: Only 1 needed β Fast read β β β
β β β β β β
β β β Use case: Catalogs, reference data β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Config 3: W=2, R=2 (Balanced) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Write: 2 nodes must succeed β β β
β β β Read: Read from 2 nodes β β β
β β β β β β
β β β Common balanced configuration β β β
β β β Tolerates 1 node failure β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Fault tolerance: min(N-W, N-R) node failures tolerated β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Cassandra Consistency Levels¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Cassandra Consistency Levels β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Write Consistency Levels: β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ANY : 1 node (including hints) β Lowest β β
β β consistency β β
β β ONE : 1 replica β β
β β TWO : 2 replicas β β
β β THREE : 3 replicas β β
β β QUORUM : (RF/2)+1 replicas β β
β β LOCAL_QUORUM: (RF/2)+1 in local DC β β
β β EACH_QUORUM : (RF/2)+1 in each DC β β
β β ALL : All replicas β Highest consistency β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Example: RF=3 cluster β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β QUORUM calculation: (3/2)+1 = 2 β β
β β β β
β β Write CL=QUORUM: 2 node ACKs required β β
β β Read CL=QUORUM: Read from 2 nodes β β
β β β β
β β R(2) + W(2) = 4 > N(3) β Strong consistency! β β
β β β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Code Example: β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β -- CQL β β
β β CONSISTENCY QUORUM; β β
β β INSERT INTO users (id, name) VALUES (1, 'Kim'); β β
β β β β
β β -- Or per-query configuration β β
β β SELECT * FROM users WHERE id = 1 β β
β β USING CONSISTENCY LOCAL_QUORUM; β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Consistency vs Availability Trade-off¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Consistency vs Availability Trade-off β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Consistency β
β β² β
β β β
β ALL ββββββββββββ€ β
β β β
β QUORUM βββββββββββ€ Balance point β
β β (Recommended) β
β TWO ββββββββββ€ β
β β β
β ONE βββββββββ€ β
β β β
β ANY ββββββββΌβββββββββββββββββββββββββββββββββ> β
β Availability β
β β
β Recommended settings by scenario: β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Financial transactions: W=ALL, R=ALL (Consistency first) β β
β β Social feeds: W=ONE, R=ONE (Availability first)β β
β β E-commerce: W=QUORUM, R=QUORUM (Balanced) β β
β β Analytics data: W=ONE, R=ONE (Eventual β β
β β consistency OK) β β
β β β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
9. Practice Problems¶
Practice 1: Choosing Replication Strategy¶
Choose the appropriate replication strategy for the following scenarios and explain your reasoning.
Scenarios:
1. Social media service with global users
2. Bank system processing financial transactions
3. Real-time chat application
4. Log collection and analysis system
Options:
A. Single-leader + Synchronous replication
B. Single-leader + Asynchronous replication
C. Multi-leader replication
D. Leaderless replication
Sample Answer:
1. Social Media: C (Multi-leader)
- Geographically distributed users
- Low latency needed in each region
- Slight consistency delay acceptable
2. Bank System: A (Single-leader + Sync)
- Strong consistency required
- Data loss unacceptable
- Write latency tolerable
3. Real-time Chat: B (Single-leader + Async)
- Fast response needed
- Message order important
- Slight data loss acceptable
4. Log Collection: D (Leaderless)
- High write throughput needed
- Eventual consistency sufficient
- High availability important
Practice 2: Quorum Calculation¶
Problem:
In an N=5 replica cluster:
1. What W, R values for strong consistency with max availability?
2. With W=3, R=2, how many node failures tolerated?
3. How to maximize write availability while maintaining strong consistency?
Sample Answer:
1. W=3, R=3 (or W=2, R=4, etc.)
- R + W > N (3+3=6 > 5) β
- Write: Tolerates 2 node failures
- Read: Tolerates 2 node failures
2. W=3, R=2 analysis:
- R + W = 5 = N (Not strong consistency!)
- Strictly speaking R + W > N needed for strong consistency
- Write availability: 5-3 = 2 node failures tolerated
- Read availability: 5-2 = 3 node failures tolerated
3. Maximize write availability:
- W=1, R=5 (Write: 4 failures tolerated)
- However, read requires all nodes (low availability)
Practice 3: Failover Scenario¶
Problem:
In a MySQL master-slave configuration using asynchronous replication,
when master fails:
1. Slave A: binlog position 1000
2. Slave B: binlog position 950
3. Master last binlog position: 1020
Answer the following:
1. Which slave should be promoted to new master?
2. Maximum how many transactions could be lost?
3. How should the old master be handled when recovered?
Sample Answer:
1. Promote Slave A to new master
- Position 1000 > 950
- Has most up-to-date data
2. Maximum lost transactions:
- 1020 - 1000 = 20 transactions could be lost
3. When old master recovers:
- Reconfigure as slave
- Discard position 1000 ~ 1020 data (conflict prevention)
- Start replication from new master
Practice 4: Conflict Resolution Design¶
Problem:
In a multi-leader e-commerce system, when updating
the same product's inventory simultaneously:
Leader A (Seoul): inventory = 100 - 5 = 95
Leader B (Tokyo): inventory = 100 - 3 = 97
How can this conflict be resolved?
Propose several methods and explain pros/cons.
Sample Answer:
Method 1: LWW (Last Write Wins)
- Apply write with later timestamp
- Pros: Simple implementation
- Cons: Data loss (5 or 3 sales missing)
Method 2: CRDT (Counter)
- Record sales per leader: A: -5, B: -3
- Final inventory = 100 - 5 - 3 = 92
- Pros: No data loss
- Cons: Inventory could go negative
Method 3: Distributed Lock (Not recommended)
- Acquire distributed lock before write
- Cons: Increased latency, complexity
Method 4: Route to Single Leader
- Inventory writes only to specific leader
- Pros: Prevents conflicts at source
- Cons: Potential overload on that leader
Recommended: CRDT + Inventory threshold alerts
- Alert/adjust when negative inventory occurs
Practice 5: Interview Question¶
Interviewer: "How would you implement Read-Your-Writes consistency?"
Requirements:
- User must always be able to read their own writes
- Asynchronous replication environment
- Multiple replicas present
Propose a design approach.
Sample Answer:
Method 1: Read from Leader
- User reads recently written data from Leader
- Implementation: Use Leader for certain time (e.g., 1 min) after write
Method 2: Timestamp-based
- Client remembers last write timestamp
- Include timestamp in read request
- Wait until replica is caught up to that timestamp
Method 3: Replication Position-based
- Return LSN (Log Sequence Number) after write
- Read from replica that has applied at least that LSN
Implementation Example (Method 2):
```python
class Client:
def __init__(self):
self.last_write_timestamp = 0
def write(self, data):
result = leader.write(data)
self.last_write_timestamp = result.timestamp
def read(self, key):
return db.read(
key,
min_timestamp=self.last_write_timestamp
)
10. Next Steps¶
Based on what you learned in this document, study the following topics:
- Distributed Transactions - 2PC, Saga Pattern
- Consensus Algorithms - Paxos, Raft, ZAB
- CDC (Change Data Capture) - Debezium
- Event Sourcing and CQRS - Alternative approaches to replication
Learning Path:
[Current] Database Replication
β
βββ> Distributed Transactions (2PC, Saga)
β
βββ> Consensus Algorithms (Raft, Paxos)
β
βββ> CDC & Event Sourcing
11. References¶
Essential Books¶
- Designing Data-Intensive Applications - Martin Kleppmann
- Chapter 5: Replication
-
Chapter 9: Consistency and Consensus
-
Database Internals - Alex Petrov
- Part II: Distributed Systems
Database-Specific Documentation¶
- PostgreSQL Streaming Replication
-
https://www.postgresql.org/docs/current/warm-standby.html
-
MySQL Replication
-
https://dev.mysql.com/doc/refman/8.0/en/replication.html
-
Cassandra Documentation
- https://cassandra.apache.org/doc/latest/cassandra/architecture/dynamo.html
Papers¶
- Dynamo: Amazon's Highly Available Key-value Store (2007)
- In Search of an Understandable Consensus Algorithm (Raft) (2014)
- Chain Replication for Supporting High Throughput and Availability (2004)
Summary¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Key Concepts Summary β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Replication Types: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Single-leader: Write only to leader, read from β β
β β followers too β β
β β Multi-leader: Write to multiple nodes, conflict β β
β β resolution needed β β
β β Leaderless: All nodes equal, quorum for consistency β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Sync/Async: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Sync: Strong consistency, slow writes, can't write on β β
β β failure β β
β β Async: Fast writes, possible data loss β β
β β Semi-sync: At least 1 sync + rest async (balanced) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Quorum: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β R + W > N guarantees strong consistency β β
β β Fault tolerance = min(N-W, N-R) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Interview Key Points: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β 1. Trade-offs by replication type β β
β β 2. Failover process and problems β β
β β 3. Quorum calculation and consistency levels β β
β β 4. Split Brain prevention strategies β β
β β 5. Read-Your-Writes implementation methods β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ