Consensus Algorithms
Consensus Algorithms¶
Difficulty: ββββ
Overview¶
Consensus in distributed systems is about multiple nodes agreeing on a single value. In this chapter, we'll learn about the definition of the consensus problem, Paxos and Raft algorithms, Byzantine Fault Tolerance, and practical applications with ZooKeeper and etcd.
Table of Contents¶
- Defining the Consensus Problem
- Paxos Algorithm
- Raft Algorithm
- Byzantine Fault Tolerance
- Practical Applications
- Practice Problems
1. Defining the Consensus Problem¶
What is Consensus?¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Consensus Problem β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Goal: N nodes agree on a single value β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Node 1: "A" β β
β β Node 2: "B" ββββΊ All nodes: "A" (or "B") β β
β β Node 3: "A" β β
β β Node 4: "C" β β
β β Node 5: (failed) β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Use Cases: β
β - Leader election: Who is the leader? β
β - Distributed locks: Who acquired the lock? β
β - Distributed DB: What order to commit transactions? β
β - State machine replication: What command to execute? β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Properties of Consensus¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Required Properties of Consensus Algorithms β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Safety - Nothing bad happens β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Agreement: β β
β β - All correct nodes agree on the same value β β
β β - Two nodes must not decide on different values β β
β β β β
β β Validity: β β
β β - The decided value must be one that some node proposed β β
β β - Cannot select a value no one proposed β β
β β β β
β β Integrity: β β
β β - Each node decides at most once β β
β β - Once decided, cannot change β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β 2. Liveness - Good things eventually happen β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Termination: β β
β β - All correct nodes eventually decide on a value β β
β β - Don't wait forever β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β FLP Impossibility (1985): β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β It is impossible to guarantee both Safety and Liveness in β β
β β an asynchronous system while tolerating even a single β β
β β node failure. β β
β β β β
β β Real system responses: β β
β β - Sacrifice some Liveness (retry after timeout) β β
β β - Add synchrony assumptions (partial synchrony) β β
β β - Safety is always guaranteed β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Failure Models¶
| Model | Description | Required Nodes |
|---|---|---|
| Crash Failure | Node stops, can recover | 2f + 1 |
| Omission Failure | Messages can be lost | 2f + 1 |
| Byzantine Failure | Malicious/arbitrary behavior | 3f + 1 |
2. Paxos Algorithm¶
Paxos Overview¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Paxos Algorithm β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Invented by Leslie Lamport in 1989 (published 1998) β
β The foundational algorithm for distributed consensus β
β β
β Roles: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Proposer β β
β β - Proposes values β β
β β - Handles client requests β β
β β β β
β β Acceptor β β
β β - Votes on proposals β β
β β - Decision made when majority accepts β β
β β - Stores state β β
β β β β
β β Learner β β
β β - Learns the decided value β β
β β - Read-only replica β β
β β β β
β β (In practice, one node can play multiple roles) β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Basic Paxos: Two Phases¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Basic Paxos: Phase 1 - Prepare β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Proposer Acceptors (majority needed) β
β β A1 A2 A3 β
β β β β β β
β ββββPrepare(n=1)βββββΊβ β β β
β ββββPrepare(n=1)βββββββββββββΊβ β β
β ββββPrepare(n=1)βββββββββββββββββββββΊβ β
β β β β β β
β β β β β β
β ββββPromise(n=1)βββββ β β β
β ββββPromise(n=1)βββββββββββββ β β
β ββββPromise(n=1)βββββββββββββββββββββ β
β β β β β β
β β
β Prepare(n): "I intend to propose with proposal number n" β
β Promise(n): "I will ignore proposals smaller than n" β
β + return any previously accepted value β
β β
β Acceptor Rules: β
β - If received n > existing promised_n, send Promise β
β - Otherwise ignore/reject β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Basic Paxos: Phase 2 - Accept β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β After receiving majority Promises: β
β β
β Proposer Acceptors β
β β A1 A2 A3 β
β β β β β β
β βββAccept(n=1,v="X")ββΊβ β β β
β βββAccept(n=1,v="X")ββββββββββΊβ β β
β βββAccept(n=1,v="X")ββββββββββββββββββΊβ β
β β β β β β
β ββββAccepted(n=1)βββββ β β β
β ββββAccepted(n=1)βββββββββββββ β β
β ββββAccepted(n=1)βββββββββββββββββββββ β
β β β β β β
β βΌ β
β Majority Accepted β Value "X" decided! β
β β
β Accept(n, v): "Please accept value v with proposal number n" β
β Accepted(n): "Accepted" β
β β
β Value Selection Rule: β
β - If a previous value was received in Promise, use the one with β
β the highest n β
β - Otherwise propose your own value β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Paxos Conflict Resolution¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Paxos: Concurrent Proposal Handling β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Scenario: Two Proposers propose simultaneously β
β β
β Proposer1 Proposer2 Acceptors β
β β β A1 A2 A3 β
β β β β β β β
β ββPrepare(1)ββΊβ β β β β
β β ββPrepare(2)ββΊβ β β β
β β β β β β β
β βββPromise(1)ββΌββββββββββββββ β β A1: n=1 β
β β βββPromise(2)ββββββββββ β A2: n=2 β
β β βββPromise(2)ββββββββββββββββββ A3: n=2 β
β β β β β β β
β β β β β β β
β ββAccept(1,X)βΊβ β β β β
β β β X NACK! (n=1 < promised n=2) β
β β β β β β β
β β ββAccept(2,Y)βΊβ β β β
β β βββAcceptedββββββββββββ β β
β β βββAcceptedββββββββββββββββββββ β
β β β β β β β
β β βΌ β β β β
β β Value "Y" decided β β β β
β β β β β β β
β β β β β β β
β Proposer1 needs to retry with higher n β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Multi-Paxos¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Multi-Paxos β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Basic Paxos Problem: Prepare needed every time (2 RTT) β
β β
β Multi-Paxos Optimization: β
β - Elect a stable leader β
β - Leader decides consecutive values β
β - Skip Prepare phase (1 RTT) β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Leader Acceptors β β
β β β A1 A2 A3 β β
β β β β β β β β
β β βββPrepare(n=1)βββββΊβ β (only once initially) β β
β β βββPromise(n=1)ββββββ β β β
β β β β β β β β
β β β β β β β β
β β βββAccept(1,v1)βββββΊβ β Log Entry 1 β β
β β βββAcceptedββββββββββ β β β
β β β β β β β β
β β βββAccept(1,v2)βββββΊβ β Log Entry 2 β β
β β βββAcceptedββββββββββ β β β
β β β β β β β β
β β βββAccept(1,v3)βββββΊβ β Log Entry 3 β β
β β βββAcceptedββββββββββ β β β
β β β β β β β β
β β β β
β β While leader is stable, only Accept repeats (1 RTT) β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
3. Raft Algorithm¶
Raft Overview¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Raft Algorithm β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β "In Search of an Understandable Consensus Algorithm" (2014) β
β Diego Ongaro, John Ousterhout β
β β
β Design Goal: An understandable consensus algorithm β
β - Same performance/safety as Paxos β
β - Much easier to understand β
β β
β Key Concept Separation: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β 1. Leader Election β β
β β 2. Log Replication β β
β β 3. Safety β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Node States: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Timeout Majority vote β β
β β ββββββββββ ββββββββΊ ββββββββββββββ βββββββΊ ββββββββββ β β
β β βFollowerβ β Candidate β β Leader β β β
β β ββββββββββ ββββββββ ββββββββββββββ βββββββ ββββββββββ β β
β β Higher Term Timeout Higher Term β β
β β (re-election) β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Term¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Raft: Term β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Term: Logical time unit β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Term 1 β Term 2 β Term 3 β Term 4 β β β
β β βββββββββ β βββββββββ β βββββ β βββββββββββ β β β
β β βElectionβ β βElectionβ β βElecβ β βElection β β β β
β β β β β β β β βfailβ β β β β β β
β β βNode 1 β β βNode 3 β β β β β β Node 2 β β β β
β β βLeader β β βLeader β β β β β β Leader β β β β
β β βββββββββ β βββββββββ β βββββ β βββββββββββ β β β
β β β β β β β β
β β ββββββββββββΌββββββββββββββΌβββββββββββΌβββββββββββββββΌββββΊ β β
β β Time β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Term Rules: β
β - At most one leader per Term β
β - Term starts: new election begins β
β - On discovering higher Term, immediately become Follower β
β - Ignore messages from older Terms β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Leader Election¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Raft: Leader Election β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Step 1: Election Timeout β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Node A (Follower) β β
β β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββΊ β β
β β β No heartbeat (150-300ms) β β β
β β β βΌ β β
β β β Timeout! β β
β β β β Become Candidate β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Step 2: Request Vote β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Node A (Candidate) Node B Node C β β
β β term: 2 term: 1 term: 1 β β
β β β β β β β
β β βββ RequestVote(term=2) βββββΊβ β β β
β β βββ RequestVote(term=2) βββββββββββββββββββΊβ β β
β β β β β β β
β β ββββ VoteGranted βββββββββββββ β β β
β β ββββ VoteGranted βββββββββββββββββββββββββββ β β
β β β β β β β
β β βΌ β β
β β Majority (2/3) achieved β Leader elected! β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Voting Rules: β
β - Vote at most once per Term β
β - Only vote if requester's log is at least as up-to-date β
β - Reset timer after voting β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Log Replication¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Raft: Log Replication β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Leader's Log: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Index: 1 2 3 4 5 6 β β
β β βββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ¬ββββββ β β
β β Term: β 1 β 1 β 2 β 2 β 2 β 3 β β β
β β βββββββΌββββββΌββββββΌββββββΌββββββΌββββββ€ β β
β β Cmd: β x=1 β y=2 β x=3 β z=1 β y=4 β x=5 β β β
β β βββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ΄ββββββ β β
β β β² β β
β β commitIndex β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Replication Process: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Leader Follower 1 Follower 2 β β
β β β β β β β
β β β AppendEntries β β β β
β β β (term, prevLogIdx, β β β β
β β β prevLogTerm, β β β β
β β β entries[]) β β β β
β β ββββββββββββββββββββββΊβ β β β
β β ββββββββββββββββββββββββββββββββββββββΊ β β β
β β β β β β β
β β ββββ Success ββββββββββ β β β
β β ββββ Success βββββββββββββββββββββββββ β β β
β β β β β β β
β β β Majority replicated β β β
β β β β Increment commitIndex β β β
β β β β Apply to state machine β β β
β β β β β β β
β β β Next AppendEntries includes commitIndex β β
β β β β Followers also commit β β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Log Consistency¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Raft: Log Consistency β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Log Matching Property: β
β - Same index, same term β same command β
β - Same index, same term β all previous entries identical β
β β
β On Inconsistency: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Leader: [1,1] [1,2] [2,3] [3,4] [3,5] [3,6] β β
β β Follower: [1,1] [1,2] [2,3] [2,4] [2,5] β β
β β β² β β
β β Divergence point β β
β β β β
β β Resolution: Leader overwrites Follower log to match β β
β β β β
β β 1. AppendEntries(prevLogIdx=3, prevLogTerm=2) fails β β
β β 2. Decrement nextIndex and retry β β
β β 3. Find match point (index=3) β β
β β 4. Overwrite from index 4 with Leader's log β β
β β β β
β β Result: β β
β β Follower: [1,1] [1,2] [2,3] [3,4] [3,5] [3,6] β β
β β (identical to Leader) β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Raft vs Paxos¶
| Property | Raft | Paxos |
|---|---|---|
| Understandability | High | Low |
| Leader Role | Clear (strong leader) | Flexible |
| Log Order | Sequential | Gaps possible |
| Implementation Complexity | Low | High |
| Paper Clarity | Detailed | Abstract |
| Use Cases | etcd, CockroachDB | Chubby, Spanner |
4. Byzantine Fault Tolerance¶
Byzantine Generals Problem¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Byzantine Generals Problem β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Situation: N generals must decide to attack/retreat β
β Some generals may be traitors β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β ββββββββββββ β β
β β β General 1β β β
β β β (traitor)β β β
β β ββββββ¬ββββββ β β
β β β β β
β β "Attack" β "Retreat" β β
β β βΌ βΌ β β
β β ββββββββββββ ββββββββββββ β β
β β β General 2β β General 3β β β
β β β (loyal) β β (loyal) β β β
β β ββββββββββββ ββββββββββββ β β
β β β β
β β Gen 2: "#1 said attack" / Gen 3: "#1 said retreat" β β
β β β Loyal generals can make different decisions! β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Solution: With 3f + 1 generals, can tolerate f traitors β
β Example: 4 generals β tolerate 1 traitor β
β 7 generals β tolerate 2 traitors β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
PBFT (Practical Byzantine Fault Tolerance)¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PBFT Algorithm β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 3-Phase Protocol: Pre-Prepare β Prepare β Commit β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Client Primary Replica 1 Replica 2 Replica 3 β β
β β β β β β β β β
β β βββRequestβββΊβ β β β β β
β β β β β β β β β
β β β ββPre-PrepareββΊβ β β β β
β β β ββPre-PrepareβββββββββββββββΊβ β β β
β β β ββPre-PrepareβββββββββββββββββββββββββββΊβ β β
β β β β β β β β β
β β β ββββPrepareβββ β β β β
β β β βββββPrepareββΊβββPrepareβββ β β β
β β β β βββββPrepareβββΊββPrepareβββ β β
β β β β β βββββPrepareββΊβ β β
β β β β β β β β β
β β β βββββCommitβββ β β β β
β β β βββββCommitβββΊββββCommitβββ β β β
β β β β βββββCommitββββΊβββCommitβββ β β
β β β β β βββββCommitβββΊβ β β
β β β β β β β β β
β β ββββReplyββββ β β β β β
β β ββββReplyββββββββββββββββ β β β β
β β ββββReplyββββββββββββββββββββββββββββ β β β β
β β ββββReplyββββββββββββββββββββββββββββββββββββββββ β β β
β β β β β β β β β
β β β β
β β Client accepts result when receiving f+1 identical Replies β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Complexity: O(nΒ²) messages β
β Limitation: Low scalability (typically ~20 nodes) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
BFT Applications¶
| Domain | Examples |
|---|---|
| Blockchain | Tendermint, Hyperledger Fabric |
| Aviation/Space | Flight control systems |
| Finance | High-availability trading systems |
5. Practical Applications¶
ZooKeeper (ZAB)¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ZooKeeper / ZAB β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ZAB (ZooKeeper Atomic Broadcast): β
β - Paxos variant β
β - Leader-based β
β - Order guaranteed β
β β
β Data Model: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β / (root) β β
β β βββ /config Configuration data β β
β β β βββ /config/db {"host": "localhost"} β β
β β βββ /locks Distributed locks β β
β β β βββ /locks/job1 (ephemeral) β β
β β βββ /leader Leader election β β
β β βββ /leader/node-1 (ephemeral + sequential) β β
β β β β
β β Node Types: β β
β β - Persistent: Remains until explicitly deleted β β
β β - Ephemeral: Auto-deleted on session end β β
β β - Sequential: Auto-incrementing number assigned β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Usage: β
β - Kafka: Broker coordination, topic metadata β
β - HBase: Master election, region assignment β
β - Hadoop: HA NameNode β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
etcd (Raft)¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β etcd β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Features: β
β - Raft consensus algorithm β
β - Key-value store β
β - gRPC API β
β - Watch capability β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β Kubernetes Control Plane β β
β β β β
β β βββββββββββββββ βββββββββββββββββββββββββββ β β
β β β API Server ββββββββββΊβ etcd Cluster β β β
β β β β β β β β
β β β β β βββββββ βββββββ ββββββββ β β
β β β - Pods β β βetcd1β βetcd2β βetcd3ββ β β
β β β - Services β β βRaft β βRaft β βRaft ββ β β
β β β - ConfigMapsβ β βββββββ βββββββ ββββββββ β β
β β βββββββββββββββ βββββββββββββββββββββββββββ β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Usage Examples: β
β ``` β
β # Store key β
β etcdctl put /config/db/host "localhost" β
β β
β # Get key β
β etcdctl get /config/db/host β
β β
β # Watch (detect changes) β
β etcdctl watch /config/db/ β
β β
β # Check leader β
β etcdctl endpoint status β
β ``` β
β β
β Applications: β
β - Kubernetes: Cluster state storage β
β - Service Discovery: Consul alternative β
β - Feature Flags: Dynamic configuration β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Consul¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Consul β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Features: β
β - Raft consensus algorithm β
β - Service discovery β
β - Health checking β
β - KV store β
β - Multi-datacenter β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β DC1 (Primary) DC2 (Secondary) β β
β β βββββββββββββββββββ βββββββββββββββββββ β β
β β β Consul Servers ββββWANβββββΊβ Consul Servers β β β
β β β (3 nodes, Raft)β Gossip β (3 nodes, Raft)β β β
β β ββββββββββ¬βββββββββ ββββββββββ¬βββββββββ β β
β β β β β β
β β ββββββββ΄βββββββ ββββββββ΄βββββββ β β
β β β β β β β β
β β ββββ΄βββ ββββ΄βββ ββββ΄βββ ββββ΄βββ β β
β β βAgentβ βAgentβ βAgentβ βAgentβ β β
β β βSvc Aβ βSvc Bβ βSvc Aβ βSvc Cβ β β
β β βββββββ βββββββ βββββββ βββββββ β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Key Features: β
β - DNS-based service discovery: service.consul β
β - Distributed locks: Consul Sessions β
β - Leader election: consul lock β
β - Configuration management: Consul KV + Template β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Tool Comparison¶
| Property | ZooKeeper | etcd | Consul |
|---|---|---|---|
| Consensus Algorithm | ZAB (Paxos variant) | Raft | Raft |
| Data Model | Hierarchical (filesystem) | Flat KV | Flat KV |
| Watch | O | O | O |
| Service Discovery | Manual implementation | Manual implementation | Built-in |
| Health Check | X | X | Built-in |
| Multi-DC | X | X | O |
| Language | Java | Go | Go |
| Primary Use Case | Hadoop, Kafka | Kubernetes | HashiCorp ecosystem |
6. Practice Problems¶
Exercise 1: Paxos Scenario¶
Analyze the following scenario in a Paxos system with 5 Acceptors: - Proposer 1 proposes "A" with n=1 - Proposer 2 proposes "B" with n=2 - Messages get reordered due to network delay - What value gets selected in the end?
Exercise 2: Raft Log Recovery¶
Explain the log consistency recovery process in Raft for the following situation:
Leader: [1,1] [2,2] [2,3] [3,4] [3,5]
Follower: [1,1] [2,2] [2,3] [2,4]
Exercise 3: Consensus System Design¶
Design a distributed configuration management system with the following requirements: - Deployed across 3 datacenters - Continue service even with one DC failure - Configuration changes propagate within milliseconds - Optimize read performance
Next Steps¶
In 17_Design_Example_1.md, let's practice designing URL shorteners, Pastebin, and Rate Limiters!
References¶
- "Paxos Made Simple" - Leslie Lamport
- "In Search of an Understandable Consensus Algorithm (Raft)" - Diego Ongaro
- "Practical Byzantine Fault Tolerance" - Castro, Liskov
- etcd Documentation: etcd.io
- ZooKeeper Documentation: zookeeper.apache.org
- Consul Documentation: consul.io
- Raft Visualization: raft.github.io