Distributed Systems: CAP & Consensus
Distributed systems power modern scalable, resilient applications. But they introduce fundamental challenges: how to remain consistent, available, and fault-tolerant despite failures and network partitions. Two cornerstone ideas help us reason about these systems: the CAP theorem and consensus algorithms like Paxos and Raft.
1. The CAP Theorem
Definition: A distributed system can guarantee at most two of the following three properties at the same time; in practice, when a partition occurs you must choose between consistency and availability:
- Consistency (C) – Every read sees the most recent write; all nodes appear to hold the same data.
- Availability (A) – Every request receives a (non-error) response, even if some nodes fail.
- Partition Tolerance (P) – The system keeps operating even when messages between nodes are lost or delayed.
Trade-offs
- CP (Consistency + Partition tolerance): Prioritize correctness, may sacrifice availability. Example: ZooKeeper.
- AP (Availability + Partition tolerance): Prioritize serving requests, may allow stale reads. Examples: DynamoDB, Cassandra.
- CA (Consistency + Availability): Rarely practical in a distributed setting, because network partitions can never be ruled out, so partition tolerance cannot be dropped.
👉 Interview Tip: Be ready to reason about trade-offs: “For a payment system I’d choose CP, for a social feed AP is acceptable.”
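One common way this trade-off shows up in practice is in quorum configuration. The toy sketch below is purely illustrative (the `KVStore`, `write_quorum`, and `read_quorum` names are made up, not any real database's API): overlapping read/write quorums give CP-style behaviour, while quorums of one give AP-style behaviour.

```python
class Replica:
    def __init__(self):
        self.data = {}       # key -> (version, value)
        self.alive = True    # flip to False to simulate a failed or partitioned node

class KVStore:
    """Toy replicated store: quorum sizes decide CP-leaning vs AP-leaning behaviour."""
    def __init__(self, n=3, write_quorum=2, read_quorum=2):
        # write_quorum + read_quorum > n  -> every read overlaps the latest write (CP-leaning:
        #   requests are rejected when a quorum is unreachable).
        # write_quorum = read_quorum = 1  -> requests almost always succeed, but reads may be
        #   stale (AP-leaning).
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = write_quorum, read_quorum
        self.version = 0     # global counter standing in for real versioning

    def write(self, key, value):
        self.version += 1
        acks = 0
        for rep in self.replicas:
            if rep.alive:
                rep.data[key] = (self.version, value)
                acks += 1
        if acks < self.w:
            raise RuntimeError("write rejected: write quorum unreachable")

    def read(self, key):
        answers = [rep.data[key] for rep in self.replicas
                   if rep.alive and key in rep.data]
        if len(answers) < self.r:
            raise RuntimeError("read rejected: read quorum unreachable")
        return max(answers)[1]   # newest version wins

store = KVStore(n=3, write_quorum=2, read_quorum=2)   # w + r > n: CP-leaning
store.write("balance", 100)
store.replicas[0].alive = False                        # one node partitioned away
print(store.read("balance"))                           # 100: still consistent
```

With `write_quorum = read_quorum = 1`, the same code would keep answering requests even with two nodes down, but reads could return stale values: the AP side of the trade-off.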
2. Consensus in Distributed Systems
Distributed systems often need their nodes to agree on a single value: Who is the leader? What is the next committed transaction? This agreement process is called consensus.
Two famous protocols are Paxos (theoretically foundational but complex) and Raft (widely adopted in practice because it is easier to understand).
Paxos (Leslie Lamport, first published in 1998)
Goal: Ensure a cluster of nodes agrees on one value, even if some nodes crash or messages are lost.
Roles:
- Proposers → suggest values.
- Acceptors → vote and decide.
- Learners → learn the chosen value.
Phases (simplified)
- Prepare Phase: A proposer sends a proposal with a unique number. Acceptors promise to ignore any proposal with a lower number and report any value they have already accepted.
- Accept Phase: If a majority promises, the proposer asks the acceptors to accept its value (or, if any acceptor reported a previously accepted value, the highest-numbered one).
- Learning Phase: Once a majority accepts, the value is chosen and propagated to learners.
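To make the two phases concrete, here is a minimal single-decree sketch run entirely in-memory: one proposer function, three acceptor objects, no networking or failures. The `Acceptor` and `propose` names are illustrative, not from any library.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised so far
        self.accepted = None        # (number, value) accepted so far, if any

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered lower than n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2: accept unless we already promised a higher number.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    majority = len(acceptors) // 2 + 1

    # Prepare phase: gather promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) < majority:
        return None
    # If any acceptor already accepted a value, we must re-propose the
    # highest-numbered one instead of our own.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]

    # Accept phase: the value is chosen once a majority accepts it.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="leader=node-A"))   # -> leader=node-A
print(propose(acceptors, n=2, value="leader=node-B"))   # -> leader=node-A (already chosen)
```

Note how the second proposer, despite using a higher number, is forced to re-propose the value that was already chosen; that is the safety guarantee in action.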
Why it matters
- Tolerates node crashes and message loss.
- Guarantees safety (no conflicting decisions).
- But is difficult to implement correctly.
👉 Interview Tip: Mention majority quorum and two-phase agreement.
Raft (2014, Ongaro & Ousterhout)
Raft was designed to be easier to understand while solving the same consensus problem.
Roles:
- Leader – central authority for client requests.
- Followers – passive, replicate leader’s log.
- Candidates – temporary role during leader elections.
Phases
- Leader Election: Followers become candidates if they stop hearing from a leader within an election timeout. A candidate that gathers a majority of votes becomes the leader (see the first sketch below).
- Log Replication: The leader appends client commands to its log and replicates them to followers. Once a majority acknowledges an entry, it is committed (see the second sketch below).
- Safety via Terms: Each election starts a new term number; a deposed or outdated leader cannot override decisions made in a newer term.
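A rough illustration of the election phase, simulated in-memory with no timeouts or networking (`Node`, `request_vote`, and `run_election` are made-up names, not any library's API):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.current_term = 0
        self.voted_for = {}          # term -> candidate this node voted for

    def request_vote(self, term, candidate):
        # Grant at most one vote per term, and never for a term we have passed.
        if term < self.current_term:
            return False
        self.current_term = term
        if term not in self.voted_for:
            self.voted_for[term] = candidate
            return True
        return self.voted_for[term] == candidate

def run_election(candidate, peers):
    # A follower that times out becomes a candidate: it increments its term,
    # votes for itself, and asks the other nodes for votes.
    candidate.current_term += 1
    term = candidate.current_term
    candidate.voted_for[term] = candidate.name
    votes = 1 + sum(p.request_vote(term, candidate.name) for p in peers)
    majority = (1 + len(peers)) // 2 + 1
    return votes >= majority        # True: candidate becomes leader for this term

a, b, c = Node("A"), Node("B"), Node("C")
print(run_election(a, [b, c]))      # True: A wins term 1
print(run_election(b, [a, c]))      # True: B wins term 2 (a newer term supersedes A)
```

Granting at most one vote per term is what prevents two leaders from being elected in the same term.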
Why it matters
- Simpler mental model than Paxos.
- Widely used in systems such as etcd (Kubernetes' backing store), Consul, and Nomad.
- Provides both consensus and log replication.
👉 Interview Tip: Highlight leader-driven replication and simplicity over Paxos.
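A matching sketch of leader-driven replication with the term check, again an in-memory illustration with a toy `Node` class rather than etcd's or any real implementation's API:

```python
class Node:
    def __init__(self):
        self.current_term = 0
        self.log = []                       # list of (term, command) entries

    def append_entries(self, leader_term, entry):
        # Safety via terms: refuse entries from a leader with an older term.
        if leader_term < self.current_term:
            return False
        self.current_term = leader_term
        self.log.append(entry)
        return True

def replicate(leader, followers, term, command):
    # The leader appends locally, then replicates; an entry is committed once
    # a majority of the whole cluster (leader included) has stored it.
    entry = (term, command)
    leader.current_term = term
    leader.log.append(entry)
    acks = 1 + sum(f.append_entries(term, entry) for f in followers)
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority

leader, followers = Node(), [Node(), Node()]
print(replicate(leader, followers, term=1, command="SET x=1"))   # True: committed

followers[0].current_term = 2        # the cluster has moved on to a newer term
print(followers[0].append_entries(1, (1, "SET x=99")))           # False: stale leader rejected
```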
Paxos vs Raft (Comparison)
| Feature | Paxos | Raft |
|---|---|---|
| Understandability | Harder, academic | Easier, designed for teaching |
| Roles | Proposer, Acceptor, Learner | Leader, Follower, Candidate |
| Decision making | Multiple proposers may race | Single leader coordinates |
| Log replication | Not explicit in basic Paxos | Built in, core to the protocol |
| Usage in practice | Chubby, Spanner (ZooKeeper uses the Paxos-like ZAB protocol) | etcd, Consul, Kubernetes control plane |
3. Key Takeaways
- CAP Theorem → you can't have all three; reason about the trade-offs for each use case.
- Consensus → majority agreement ensures safety, but limits availability during partitions.
- Paxos → correct but complex, foundational in theory.
- Raft → leader-based, simpler, widely used in practice.
4. Practice Question
“Design a leader election mechanism for a distributed system that tolerates partitions. Which consensus protocol would you choose and why?”
- For theory-heavy discussion: Paxos (quorum, two-phase).
- For practical engineering: Raft (leader election, log replication, used in Kubernetes).
5. Interview Answer Example
“In a distributed key-value store, I’d use Raft. A leader handles client writes, replicates logs to followers, and commits entries once a majority acknowledge. This ensures consistency and fault tolerance, at the cost of temporary unavailability during partitions.”
Further Reading
- Designing Data-Intensive Applications – Martin Kleppmann
- Paxos Made Simple – Leslie Lamport
- In Search of an Understandable Consensus Algorithm (Raft paper) – Ongaro & Ousterhout
- etcd & Kubernetes documentation (real-world Raft usage)