DISTRIBUTED SYSTEMS FUNDAMENTALS PROJECTS
Distributed Systems Fundamentals — Project Recommendations
Goal
After completing these projects, you will understand the core tradeoffs and invariants that govern distributed systems—why consensus is hard, why consistency is always a choice, how failures distort reality, and how data is kept coherent across machines.
You will internalize:
- Consensus mechanics (Paxos, Raft) as log agreement and leader-driven ordering
- Consistency models (CAP, linearizability, eventual) as precise behavioral contracts
- Distributed transactions (2PC, Sagas) as coordination under failure
- Replication and sharding as latency, availability, and scalability levers
- Failure modes (partitions, clock skew, Byzantine-like behavior) as the default, not the edge
This is not about memorizing buzzwords. It is about learning to reason from first principles:
- “What is the system’s single source of truth at any moment?”
- “Which invariants must never be violated, even under partition?”
- “What happens when messages are delayed, duplicated, or lost?”
By the end of this sprint, you will be able to design, defend, and debug distributed systems with clear mental models and measurable guarantees.
Foundational Concepts: The Six Pillars
1. The Network Is the Computer (And It Lies)
Core idea: A distributed system is many machines cooperating to act like one, connected by communication that can fail at any time.
Process A ----message----> Process B
    ^       \            /       ^
    |        delay/dup/drop      |
    |          (unknown)         |
    +----- time & ordering ------+
Why it matters: Every project assumes the network can be late, lossy, or reordered. Your designs must tolerate that reality.
2. Time Is a Guess (Clocks Are Not Truth)
Real Time:    |----t1----t2----t3----t4----|
Node A clock: |--t1----t2--t3-----t4--|
Node B clock: |----t1--t2----t3--t4--|
Same event, different timestamps.
Why it matters: Causal order and consistency depend on imperfect clocks. Logical time (Lamport clocks, vector clocks) captures ordering without trusting wall-clock time.
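To make this concrete, here is a minimal Lamport clock sketch in Go. The type and method names are illustrative, not from any particular library: the counter ticks on local events and jumps forward whenever a larger timestamp arrives on a message.

```go
package main

import "fmt"

// LamportClock is a minimal logical clock: if event a causally precedes
// event b, then timestamp(a) < timestamp(b). The converse does not hold.
type LamportClock struct{ counter uint64 }

// Tick advances the clock for a local event and returns the new timestamp.
func (c *LamportClock) Tick() uint64 {
	c.counter++
	return c.counter
}

// Observe merges a timestamp received on a message, then ticks.
func (c *LamportClock) Observe(remote uint64) uint64 {
	if remote > c.counter {
		c.counter = remote
	}
	return c.Tick()
}

func main() {
	var a, b LamportClock
	t1 := a.Tick()          // A does local work
	t2 := a.Tick()          // A sends a message stamped t2
	t3 := b.Observe(t2)     // B receives it: its clock jumps past t2
	fmt.Println(t1, t2, t3) // 1 2 3
}
```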
3. Consensus: Agreeing on One History
Clients -> Leader -> Replicated Log
log:
1: set x=1
2: set x=2
3: set y=7
All replicas must see the SAME order.
Why it matters: Consensus is the foundation for correctness in replicated systems: leader election, log replication, and state machine safety.
4. Consistency Models Are Contracts
Linearizable: all clients see same order
Eventual: clients may diverge, then converge
Causal: cause-before-effect preserved
Why it matters: Consistency is a guarantee you choose. The model defines what anomalies you allow.
5. Transactions Across Machines Are Risky
2PC:
Coordinator -> Prepare -> Participants
Coordinator -> Commit/Abort
Failure mid-flight = uncertainty
Why it matters: Distributed transactions require coordination and can block under failure. Sagas trade atomicity for compensation.
6. Replication and Sharding Shape Everything
Replication: same data, multiple copies
Sharding: different data, different nodes
Replication = availability, read scale
Sharding = write scale, data locality
Why it matters: These are the levers for scale, but they change failure behavior and consistency.
Core Concept Analysis
| Concept Cluster | What You Must Internalize |
|---|---|
| Failures | Delay, loss, and partitions are normal; handle them explicitly. |
| Time & Order | Ordering is negotiated, not observed; clocks drift. |
| Consensus | Safety means one history; liveness means progress. |
| Consistency Models | A model is a contract for what anomalies clients see. |
| Distributed Transactions | Coordination buys stronger guarantees but also adds blocking and failure risk. |
| Replication & Sharding | Scale comes with new invariants and tradeoffs. |
Deep Dive Reading by Concept
This section maps each concept to specific chapters for deeper understanding.
Concept 1: Failures and Time
| Concept | Book & Chapter |
|---|---|
| Failure modes | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 8: “The Trouble with Distributed Systems” |
| Time and clocks | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 8: “The Trouble with Distributed Systems” |
Concept 2: Consensus
| Concept | Book & Chapter |
|---|---|
| Consensus fundamentals | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 9: “Consistency and Consensus” |
| Raft overview | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 9: “Consistency and Consensus” |
Concept 3: Consistency Models
| Concept | Book & Chapter |
|---|---|
| Linearizability, CAP | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 9: “Consistency and Consensus” |
| Eventual consistency | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 5: “Replication” |
Concept 4: Transactions
| Concept | Book & Chapter |
|---|---|
| 2PC and distributed transactions | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 7: “Transactions” |
| Sagas and compensations | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 7: “Transactions” |
Concept 5: Replication and Sharding
| Concept | Book & Chapter |
|---|---|
| Replication models | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 5: “Replication” |
| Partitioning/sharding | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 6: “Partitioning” |
Essential Reading Order
- Foundation (Week 1):
- Designing Data-Intensive Applications Ch. 8 (failures, time)
- Designing Data-Intensive Applications Ch. 5 (replication)
- Coordination (Week 2):
- Designing Data-Intensive Applications Ch. 9 (consensus, consistency)
- Transactions (Week 3):
- Designing Data-Intensive Applications Ch. 7 (2PC, Sagas)
- Scaling (Week 4):
- Designing Data-Intensive Applications Ch. 6 (sharding)
Project List
Projects are ordered from core mental models to complex coordination systems.
Project 1: Failure Injection Playground
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Failure Modeling
- Software or Tool: Fault injection harness
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A small multi-node simulator that injects delay, loss, duplication, and reordering into message passing.
Why it teaches distributed systems: You will see how simple protocols fail under real network conditions.
Core challenges you’ll face:
- Designing a controllable network model (maps to failure modes)
- Reproducing heisenbugs (maps to time and ordering)
- Observing invariants breaking under stress (maps to system correctness)
Key Concepts
- Failure models: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Message ordering: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- System invariants: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Difficulty: Intermediate
- Time estimate: 1-2 weeks
- Prerequisites: Basic networking, concurrency primitives, basic logging and CLI usage
Real World Outcome
You will run a simulator with 3-5 nodes and toggle faults. You will observe cases like duplicate messages causing double-apply, or partitions causing stale reads. You will see a timeline log and a failure report that summarizes which invariants broke.
Example Output:
$ ./fault_playground --nodes 5 --loss 0.2 --delay 200ms
[time=00:00.120] link A->C DROPPED message id=17
[time=00:00.350] node C applied write x=9 twice (DUPLICATE)
[time=00:00.401] invariant violated: "x monotonic" at node C
Summary:
- dropped: 12
- delayed: 47
- duplicated: 5
- invariants broken: 2
The Core Question You’re Answering
“What exactly can go wrong in a distributed system, and how do failures distort correctness?”
This project forces you to confront the reality that “correct” algorithms can still fail if you don’t model the network accurately.
Concepts You Must Understand First
Stop and research these before coding:
- Failure Models
- What does “message loss” vs “message delay” mean?
- Why is duplication a separate class of failure?
- Book Reference: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Invariants
- What makes a replicated state machine safe?
- How do you detect invariant violations at runtime?
- Book Reference: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Fault Injection
- How will you deterministically reproduce a failure?
- How will you toggle fault modes per link?
- Observation
- What events must be logged to prove a violation?
- How will you summarize outcomes across runs?
Thinking Exercise
Message Timeline Sketch
Before coding, draw a 3-node timeline where a write is sent, dropped, and retried.
Node A: send(write x=5) -----X (dropped)
Node A: retry(write x=5) ---------->
Node B: apply(write x=5)
Questions while sketching:
- When should the client consider the write “done”?
- How can duplicates be detected without global time?
- Which invariant proves correctness?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What failure modes do you need to assume in distributed systems?”
- “Why does message delay matter even if messages eventually arrive?”
- “How do you design for idempotency?”
- “What is the difference between safety and liveness?”
- “How do you test distributed systems deterministically?”
Hints in Layers
Hint 1 (Starting Point): Start with a single, configurable link that can delay or drop messages.
Hint 2 (Next Level): Make a matrix of links so each pair can have its own fault profile.
Hint 3 (Technical Details): Use a seeded random generator so the same failure run can be replayed.
Hint 4 (Tools/Debugging): Log a per-message ID and use it to correlate drops and retries.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Failure modes | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
| Testing nondeterminism | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Model your network as a queue with policies (drop, delay, duplicate). Build your event log first so you can observe behavior before adding algorithm complexity.
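As a starting point, one possible shape for such a faulty link in Go is sketched below. Everything here (`Msg`, `Delivery`, `FaultyLink`, the probability fields) is a hypothetical design, not a prescribed one; the key idea is the seeded `*rand.Rand`, which makes a failure run replayable.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Msg is a network message with an ID so drops and retries can be correlated.
type Msg struct {
	ID      int
	Payload string
}

// Delivery is one arriving copy of a message plus its simulated delay in ticks.
type Delivery struct {
	Msg   Msg
	Delay int
}

// FaultyLink applies drop/duplicate/delay policies to one directed link.
type FaultyLink struct {
	rng      *rand.Rand
	dropProb float64
	dupProb  float64
	maxDelay int
}

// Deliver returns the copies of m that arrive; an empty result means a drop.
func (l *FaultyLink) Deliver(m Msg) []Delivery {
	if l.rng.Float64() < l.dropProb {
		return nil // dropped
	}
	out := []Delivery{{m, l.rng.Intn(l.maxDelay + 1)}}
	if l.rng.Float64() < l.dupProb {
		out = append(out, Delivery{m, l.rng.Intn(l.maxDelay + 1)}) // duplicated
	}
	return out
}

func main() {
	link := &FaultyLink{rng: rand.New(rand.NewSource(42)), dropProb: 0.2, dupProb: 0.1, maxDelay: 5}
	for i := 1; i <= 5; i++ {
		copies := link.Deliver(Msg{ID: i, Payload: "set x=9"})
		fmt.Printf("msg %d -> %d deliveries\n", i, len(copies))
	}
}
```

Because the generator is seeded, rerunning the program reproduces exactly the same drops and duplicates, which is what makes invariant violations debuggable.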
Learning milestones:
- You can replay a run and get the same failure sequence
- You can cause a specific invariant to break on demand
- You can explain why the failure happened without guessing
Project 2: Raft Log Replication Simulator
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Consensus
- Software or Tool: Event-driven simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A Raft simulator that shows leader election and log replication under failures.
Why it teaches distributed systems: Raft makes consensus concrete: log agreement, leader stability, and term-based safety.
Core challenges you’ll face:
- Modeling leader election and terms (maps to consensus safety)
- Handling log inconsistencies (maps to replication correctness)
- Simulating timeouts and retries (maps to failure behavior)
Key Concepts
- Consensus safety: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Leader election: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Log replication: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1 month+
- Prerequisites: Familiarity with state machines, message passing, and basic concurrency
Real World Outcome
You will run a simulation and watch a leader get elected, commands appended, and replicas converge. Then you will trigger a partition and see a new leader emerge with correct log reconciliation.
Example Output:
$ ./raft_sim --nodes 5 --partition A,C
[t=01.2] Leader elected: node C (term 4)
[t=02.0] Append entry: "set x=9" index=12
[t=02.4] Replication: 4/5 nodes confirmed
[t=03.1] Partition detected: {A,C} isolated
[t=05.2] New leader: node D (term 5)
[t=06.7] Log fix: node C truncated uncommitted entries after index=12
The Core Question You’re Answering
“How do multiple nodes agree on a single ordered log despite failures?”
Raft shows that order is negotiated, not assumed, and that safety depends on strict rules about terms and majorities.
Concepts You Must Understand First
Stop and research these before coding:
- Quorums
- Why is majority voting required for safety?
- What breaks if two leaders exist at once?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Log Replication
- What does it mean for logs to be “consistent”?
- How does a follower detect divergence?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Leader Election
- What happens if two candidates time out together?
- How do you prevent old leaders from overwriting new logs?
- Replication
- How does the leader decide which entries are committed?
- How do followers roll back conflicting entries?
Thinking Exercise
Term Timeline
Sketch two leaders elected in different partitions and reconcile after healing.
Partition 1 (minority): Leader A, term 3, appends entries 8..10 but cannot commit them
Partition 2 (majority): Leader D, term 4, commits entries 8..9
Rejoin: which log entries survive?
Questions while sketching:
- Which term wins and why?
- Which entries are safe to keep?
- What invariant guarantees safety?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does Raft use terms?”
- “How does Raft prevent split-brain?”
- “What is the difference between committed and applied entries?”
- “How does leader election handle clock drift?”
- “Why is quorum required for writes?”
Hints in Layers
Hint 1 (Starting Point): Implement election timeouts and a basic vote request/response flow.
Hint 2 (Next Level): Add a replicated log with append entries but no failure handling.
Hint 3 (Technical Details): Track terms per log index and roll back on mismatch.
Hint 4 (Tools/Debugging): Log term transitions and committed index movement per node.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Consensus | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9 |
| Replication | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5 |
Implementation Hints: Build the system as a deterministic event simulator first. Only when logs converge reliably should you introduce fault injection.
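A useful core to get right early is the follower-side consistency check. The sketch below is a simplified, 0-indexed take on Raft's AppendEntries handling (the real protocol also carries the leader's commit index, term checks on the RPC itself, and 1-based indexes); function and type names are illustrative.

```go
package main

import "fmt"

// Entry is one log entry; Term records the leader term that created it.
type Entry struct {
	Term int
	Cmd  string
}

// appendEntries sketches the follower-side consistency check: accept new
// entries only if the log matches the leader at prevIndex/prevTerm, and
// truncate any conflicting suffix before appending.
func appendEntries(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	if prevIndex >= 0 {
		if prevIndex >= len(log) || log[prevIndex].Term != prevTerm {
			return log, false // reject; leader retries with an earlier prevIndex
		}
	}
	insertAt := prevIndex + 1
	for i, e := range entries {
		idx := insertAt + i
		if idx < len(log) && log[idx].Term != e.Term {
			log = log[:idx] // conflict: drop this entry and everything after it
		}
		if idx >= len(log) {
			log = append(log, e)
		}
	}
	return log, true
}

func main() {
	follower := []Entry{{1, "set x=1"}, {1, "set x=2"}, {2, "set x=3"}} // diverged at index 2
	leaderTail := []Entry{{3, "set y=7"}}
	updated, ok := appendEntries(follower, 1, 1, leaderTail)
	fmt.Println(ok, updated) // true: conflicting entry replaced by the leader's
}
```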
Learning milestones:
- Leader election produces a stable leader
- Log entries replicate and commit across majority
- Partitions heal without violating log safety
Project 3: Paxos Message Trace Explorer
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Consensus
- Software or Tool: Protocol trace visualizer
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A tool that replays Paxos phases (prepare/accept) from message traces and shows agreement.
Why it teaches distributed systems: Paxos is the archetype of consensus. Tracing its messages clarifies why it guarantees safety.
Core challenges you’ll face:
- Modeling proposal numbers (maps to safety)
- Handling competing proposers (maps to liveness)
- Showing chosen values under partial failure (maps to correctness)
Key Concepts
- Paxos safety: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Majority quorum: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Failure handling: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Difficulty: Expert
- Time estimate: 1 month+
- Prerequisites: Strong grasp of quorum logic, message ordering, and event tracing
Real World Outcome
You will load a recorded sequence of Paxos messages and see a step-by-step timeline of proposals, promises, accepts, and the final chosen value.
Example Output:
$ ./paxos_trace --file traces/competing_proposers.log
Step 1: P1 sends PREPARE n=5 to A,B,C
Step 2: A,B promise n=5
Step 3: P2 sends PREPARE n=6 to B,C,D
Step 4: B,C promise n=6 (P1 now preempted)
Step 5: P2 sends ACCEPT n=6 value=V2
Chosen value: V2 (quorum B,C,D)
The Core Question You’re Answering
“How does Paxos guarantee a single chosen value even with competing leaders?”
Tracing the protocol reveals how quorums and proposal numbers enforce safety.
Concepts You Must Understand First
Stop and research these before coding:
- Proposal Numbers
- Why must proposal numbers be totally ordered?
- What does “promised” mean in Paxos?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Quorum Intersection
- Why do two quorums always overlap?
- How does that guarantee safety?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Trace Format
- What minimal fields are required to prove correctness?
- How will you show preemption and retries?
- Visualization
- How will you highlight the chosen value?
- How will you show quorum overlap?
Thinking Exercise
Competing Proposers
Draw two proposers racing with different proposal numbers.
P1: n=5 -> A,B
P2: n=6 -> B,C
Which proposer wins and why?
Questions while sketching:
- Which promises invalidate earlier proposals?
- What value must be accepted if a promise includes prior acceptance?
- Where does safety come from?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does Paxos require quorum intersection?”
- “What happens if a proposer retries with a higher number?”
- “How does Paxos avoid choosing two values?”
- “Why is Paxos considered hard to implement?”
- “How is Raft easier than Paxos?”
Hints in Layers
Hint 1 (Starting Point): Define a trace format that captures proposer, acceptor, message type, and number.
Hint 2 (Next Level): Replay in strict timestamp order so you can reproduce outcomes.
Hint 3 (Technical Details): Track each acceptor’s promised number and last accepted value.
Hint 4 (Tools/Debugging): Highlight the quorum that chooses the final value in your output.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Paxos safety | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9 |
| Failure models | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Focus on trace replay and visualization first. Paxos logic becomes simpler when you can inspect each acceptor’s state after every step.
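The heart of the replay is each acceptor's state transition. A minimal single-decree acceptor might look like the following Go sketch (names and signatures are illustrative): it refuses prepares at or below its current promise, and reports any previously accepted value so a new proposer is forced to reuse it.

```go
package main

import "fmt"

// Acceptor holds the two pieces of state Paxos requires to be durable:
// the highest proposal number promised, and the last accepted proposal.
type Acceptor struct {
	promised      int
	acceptedNum   int
	acceptedValue string
	hasAccepted   bool
}

// Prepare handles phase 1a: promise n if it beats every prior promise, and
// report any previously accepted value so the proposer must adopt it.
func (a *Acceptor) Prepare(n int) (ok bool, prevNum int, prevVal string, hasPrev bool) {
	if n <= a.promised {
		return false, 0, "", false // already promised an equal or higher number
	}
	a.promised = n
	return true, a.acceptedNum, a.acceptedValue, a.hasAccepted
}

// Accept handles phase 2a: accept (n, v) unless a higher number was promised.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised = n
	a.acceptedNum = n
	a.acceptedValue = v
	a.hasAccepted = true
	return true
}

func main() {
	acc := &Acceptor{}
	fmt.Println(acc.Prepare(5))      // promise n=5
	fmt.Println(acc.Prepare(6))      // promise n=6, preempting n=5
	fmt.Println(acc.Accept(5, "V1")) // false: 5 < promised 6
	fmt.Println(acc.Accept(6, "V2")) // true: V2 accepted
}
```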
Learning milestones:
- You can explain every promise and accept
- You can show why only one value wins
- You can identify why Paxos stalls under contention
Project 4: Consistency Model Explorer
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Consistency Models
- Software or Tool: Scenario explorer
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A tool that simulates reads/writes under linearizable, causal, and eventual consistency.
Why it teaches distributed systems: You will see how different guarantees change observable client behavior.
Core challenges you’ll face:
- Modeling client histories (maps to consistency semantics)
- Defining allowed anomalies (maps to CAP tradeoffs)
- Rendering timelines (maps to reasoning about order)
Key Concepts
- Linearizability: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Eventual consistency: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- CAP tradeoffs: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Basic understanding of read/write histories and concurrency
Real World Outcome
You will run scenarios with two clients and two replicas, then toggle the consistency model and observe the allowed read results.
Example Output:
$ ./consistency_explorer --model eventual
Client A: write x=1
Client B: read x -> 0 (stale)
Client B: read x -> 1 (after convergence)
Allowed anomaly: stale read (eventual consistency)
The Core Question You’re Answering
“What behaviors are allowed by different consistency guarantees?”
This project forces you to stop thinking of consistency as a single on/off switch and start treating each model as its own contract.
Concepts You Must Understand First
Stop and research these before coding:
- Histories and Order
- What is a history in distributed systems?
- What does it mean for a history to be linearizable?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- CAP Theorem
- What is sacrificed under partition?
- Why is availability defined per request?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Scenario Encoding
- How will you specify client operations and timing?
- How will you show visibility of writes?
- Model Rules
- What outputs are forbidden vs allowed?
- How do you detect anomalies?
Thinking Exercise
Two Clients, Two Replicas
Sketch a timeline where A writes x=1, B reads before replication.
Time:  1        2        3        4
A:     W(x=1)
B:              R(x=?)
Replica 2 receives the write at time 4
Questions while sketching:
- What does linearizability require B to see?
- What does eventual allow?
- What does causal guarantee?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is the difference between linearizability and eventual consistency?”
- “How does CAP relate to consistency guarantees?”
- “What anomalies are acceptable under eventual consistency?”
- “What is causal consistency?”
- “Why is strong consistency expensive?”
Hints in Layers
Hint 1 (Starting Point): Define a simple operation log of reads/writes with timestamps.
Hint 2 (Next Level): Implement one model (eventual) before adding others.
Hint 3 (Technical Details): Use a visibility matrix per replica to decide read results.
Hint 4 (Tools/Debugging): Generate the same scenario across models to compare outcomes side by side.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Consistency models | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9 |
| Replication & staleness | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5 |
Implementation Hints: Build a tiny “history” format and a validator that checks if the history violates the rules of a given model.
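One very small validator to start with checks a necessary condition for linearizability on a single key: a read must not return a value older than a write that had already completed before the read began. The Go sketch below assumes integer values written in increasing order; it is deliberately not a full linearizability checker, and all names are illustrative.

```go
package main

import "fmt"

// Op is one client operation on a single key, with wall-clock start/end times.
type Op struct {
	Client string
	Kind   string // "write" or "read"
	Value  int    // value written, or value observed by the read
	Start  int
	End    int
}

// staleReads flags reads that return a value older than some write that
// finished before the read started. That anomaly is forbidden under
// linearizability but allowed under eventual consistency.
func staleReads(history []Op) []Op {
	var anomalies []Op
	for _, r := range history {
		if r.Kind != "read" {
			continue
		}
		latestCompleted := -1
		for _, w := range history {
			if w.Kind == "write" && w.End < r.Start && w.Value > latestCompleted {
				latestCompleted = w.Value
			}
		}
		if r.Value < latestCompleted {
			anomalies = append(anomalies, r)
		}
	}
	return anomalies
}

func main() {
	history := []Op{
		{Client: "A", Kind: "write", Value: 1, Start: 1, End: 2},
		{Client: "B", Kind: "read", Value: 0, Start: 3, End: 4}, // stale: the write finished at t=2
	}
	for _, a := range staleReads(history) {
		fmt.Printf("stale read by %s: saw %d\n", a.Client, a.Value)
	}
}
```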
Learning milestones:
- You can explain why a history is or is not linearizable
- You can produce a valid eventual history with anomalies
- You can map CAP tradeoffs to observable behavior
Project 5: Quorum Read/Write Tuner
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Replication
- Software or Tool: Quorum calculator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A simulator that explores how different read/write quorum sizes affect consistency and availability.
Why it teaches distributed systems: It makes quorum math tangible: N, R, W and their consequences.
Core challenges you’ll face:
- Modeling quorum intersections (maps to consistency guarantees)
- Handling node failures (maps to availability)
- Showing staleness vs latency tradeoffs (maps to replication choices)
Key Concepts
- Quorum consistency: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Replication lag: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Availability tradeoffs: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Difficulty: Intermediate
- Time estimate: Weekend
- Prerequisites: Basic distributed systems terminology, simple probability reasoning
Real World Outcome
You will input N, R, W values and see if the system can guarantee read-after-write consistency under different failure rates.
Example Output:
$ ./quorum_tuner --nodes 5 --R 2 --W 3 --fail 0.2
R+W = 5 = N (read and write quorums are not guaranteed to overlap)
Expected availability: 0.80 for writes, 0.96 for reads
Risk: stale read possible if async replication is slow
The Core Question You’re Answering
“How do quorum sizes trade off availability and consistency?”
This teaches you why majority is not just tradition; it is a correctness constraint.
Concepts You Must Understand First
Stop and research these before coding:
- Quorum Rules
- Why does R+W > N matter?
- How do you compute read availability?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Replication Lag
- Why can a read quorum still be stale?
- How does leader-based replication help?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Metrics
- How will you quantify availability per request?
- How will you present staleness risk?
- Simulation
- Will you use Monte Carlo runs or deterministic math?
- How will you model failure probability?
Thinking Exercise
Quorum Math
On paper, compute if R=2, W=2, N=3 guarantees consistency.
If R+W = 4 > 3, quorum intersection exists.
What happens if one node is down?
Questions while sketching:
- Does intersection always hold under failure?
- How do failures affect availability vs correctness?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does R+W > N matter?”
- “Can quorum systems still return stale reads?”
- “What does availability mean in CAP terms?”
- “How do you tune quorum sizes?”
- “What is sloppy quorum and hinted handoff?”
Hints in Layers
Hint 1 (Starting Point): Compute availability analytically for fixed N, R, W.
Hint 2 (Next Level): Add a simple simulator to estimate staleness frequency.
Hint 3 (Technical Details): Model replication delay as a distribution, not a constant.
Hint 4 (Tools/Debugging): Plot read/write success rates across failure probabilities.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Quorum systems | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5 |
| Availability tradeoffs | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Start with the algebraic rules, then add stochastic simulation to show how delays cause stale reads even when quorum math looks safe.
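The algebraic part is a short computation. Assuming independent node failures (a simplification), read or write availability is a binomial tail probability, as in this sketch; the N, R, W values mirror the example output above and everything else is illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// binom returns the binomial coefficient C(n, k).
func binom(n, k int) float64 {
	if k < 0 || k > n {
		return 0
	}
	res := 1.0
	for i := 0; i < k; i++ {
		res = res * float64(n-i) / float64(i+1)
	}
	return res
}

// quorumAvailability is the probability that at least q of n nodes are up,
// assuming each node is up independently with probability pUp.
func quorumAvailability(n, q int, pUp float64) float64 {
	total := 0.0
	for k := q; k <= n; k++ {
		total += binom(n, k) * math.Pow(pUp, float64(k)) * math.Pow(1-pUp, float64(n-k))
	}
	return total
}

func main() {
	n, r, w := 5, 2, 3
	pUp := 0.8 // per-node probability of being reachable (i.e. --fail 0.2)
	fmt.Printf("R+W > N: %v\n", r+w > n)
	fmt.Printf("read availability (R=%d):  %.3f\n", r, quorumAvailability(n, r, pUp))
	fmt.Printf("write availability (W=%d): %.3f\n", w, quorumAvailability(n, w, pUp))
}
```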
Learning milestones:
- You can compute quorum properties by hand
- You can show availability changes with failure rate
- You can explain why quorum is not a full consistency guarantee
Project 6: Sharding Strategy Simulator
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Sharding
- Software or Tool: Partitioning simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A simulator that compares hash-based, range-based, and consistent hashing sharding strategies.
Why it teaches distributed systems: It reveals how shard choice impacts balance, hotspots, and resharding cost.
Core challenges you’ll face:
- Modeling key distributions (maps to load balance)
- Simulating node joins/leaves (maps to resharding)
- Measuring hotspot frequency (maps to latency spikes)
Key Concepts
- Partitioning strategies: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
- Rebalancing costs: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
- Hotspot detection: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Understanding of hashing, keyspace partitioning, basic statistics
Real World Outcome
You will input a key distribution and number of nodes, then see shard balance and data movement when nodes join or leave.
Example Output:
$ ./shard_sim --strategy consistent --nodes 4 --keys zipf
Balance report:
max shard load: 1.8x average
moved keys on join: 12%
Hotspots detected: shard 2
The Core Question You’re Answering
“How does the choice of sharding strategy shape performance and operational cost?”
Sharding is not just about scalability; it affects resilience and developer experience.
Concepts You Must Understand First
Stop and research these before coding:
- Partitioning Strategies
- What is range partitioning vs hash partitioning?
- Why does consistent hashing reduce rebalancing?
- Book Reference: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
- Skewed Workloads
- Why do hotspots appear in real data?
- How do you measure imbalance?
- Book Reference: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Key Distribution
- How will you model uniform vs Zipf distributions?
- How will you visualize imbalance?
- Resharding
- How will you measure data movement cost?
- How will you simulate node failure?
Thinking Exercise
Shard Movement
Sketch what happens when you add a node to a 4-node hash ring.
Before: [A][B][C][D]
After: [A][B][E][C][D]
Which keys move and why?
Questions while sketching:
- Which strategy minimizes movement?
- How do hotspots persist even with hashing?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is consistent hashing and why is it used?”
- “How does range-based sharding create hotspots?”
- “What are the tradeoffs between sharding and replication?”
- “How do you reshard without downtime?”
- “What is a shard rebalancing strategy?”
Hints in Layers
Hint 1 (Starting Point): Generate keys and assign them to shards by each strategy.
Hint 2 (Next Level): Compute load variance per shard to measure balance.
Hint 3 (Technical Details): Model rebalancing as the fraction of keys that move when a node joins.
Hint 4 (Tools/Debugging): Visualize shard load with simple histograms.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Partitioning | “Designing Data-Intensive Applications” by Kleppmann | Ch. 6 |
| Rebalancing | “Designing Data-Intensive Applications” by Kleppmann | Ch. 6 |
Implementation Hints: Start with deterministic key assignments so you can test balance. Then add randomized distributions to mimic reality.
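For the consistent-hashing strategy, a bare-bones ring without virtual nodes might look like the sketch below (node and key names are illustrative; real implementations add many virtual nodes per server to smooth the balance). Measuring how many keys change owner when a node joins gives you the rebalancing-cost metric directly.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring: each node owns the arc of key
// hashes up to its own position, moving clockwise.
type Ring struct {
	points []uint32          // sorted hash positions
	owner  map[uint32]string // hash position -> node name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes []string) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		p := hashKey(n)
		r.points = append(r.points, p)
		r.owner[p] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Owner returns the first node clockwise from the key's hash position.
func (r *Ring) Owner(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	before := NewRing([]string{"A", "B", "C", "D"})
	after := NewRing([]string{"A", "B", "C", "D", "E"})
	moved := 0
	for i := 0; i < 10000; i++ {
		key := fmt.Sprintf("key-%d", i)
		if before.Owner(key) != after.Owner(key) {
			moved++
		}
	}
	fmt.Printf("keys moved when E joins: %.1f%%\n", float64(moved)/100)
}
```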
Learning milestones:
- You can measure shard balance for each strategy
- You can quantify rebalancing cost on node join
- You can explain hotspots in real workloads
Project 7: Two-Phase Commit Coordinator Simulator
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Java, Python, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Transactions
- Software or Tool: Transaction simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A 2PC coordinator/participant simulator with failure injection and timeout handling.
Why it teaches distributed systems: 2PC is the canonical example of coordination and blocking under failure.
Core challenges you’ll face:
- Modeling prepare/commit phases (maps to atomicity)
- Handling coordinator crashes (maps to blocking)
- Visualizing uncertain outcomes (maps to recovery)
Key Concepts
- Two-phase commit: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Failure handling: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Atomicity vs availability: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Understanding of transactions, logging, and failure recovery
Real World Outcome
You will simulate transactions across multiple participants and watch outcomes under failures, including the classic “in-doubt” state.
Example Output:
$ ./twopc_sim --participants 3 --crash coordinator@prepare
Phase 1: PREPARE sent to P1,P2,P3
P1: YES
P2: YES
P3: NO RESPONSE (timeout)
Coordinator crashed before decision
Result: transaction IN-DOUBT at P1,P2
The Core Question You’re Answering
“Why does 2PC block, and what is the cost of atomicity under failure?”
This project shows why distributed transactions are expensive and fragile.
Concepts You Must Understand First
Stop and research these before coding:
- Atomicity Across Nodes
- What does it mean to commit atomically across machines?
- Why is the coordinator a single point of failure?
- Book Reference: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Durable Logs
- What does a participant need to log to recover?
- Why does logging create in-doubt states?
- Book Reference: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Failure Simulation
- What happens if the coordinator dies after prepare?
- How do participants decide in-doubt outcomes?
- Recovery
- How will you show state recovery on restart?
- What data must be persisted?
Thinking Exercise
Failure Windows
Sketch the transaction timeline and mark where crashes cause blocking.
Prepare -> Votes -> Decision -> Commit
Crash points: after votes but before decision
Questions while sketching:
- When is the transaction irreversible?
- Why can participants be stuck?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does 2PC block under coordinator failure?”
- “What is an in-doubt transaction?”
- “How does 3PC attempt to fix 2PC?”
- “Why do modern systems avoid 2PC?”
- “What is the role of write-ahead logging in 2PC?”
Hints in Layers
Hint 1 (Starting Point): Implement the prepare and commit phases without failure handling.
Hint 2 (Next Level): Add timeouts and failure injection at each step.
Hint 3 (Technical Details): Track participant states: INIT, PREPARED, COMMITTED, ABORTED.
Hint 4 (Tools/Debugging): Record a timeline so you can replay and explain each outcome.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Distributed transactions | “Designing Data-Intensive Applications” by Kleppmann | Ch. 7 |
| Failure behavior | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Model each participant as a finite state machine. Crashes freeze the state; recovery reads the last logged state.
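A sketch of that state machine in Go, with a crash injected between vote collection and the decision, could look like the following. All names are illustrative, and durable logging is omitted; in a real system PREPARED must be logged before the participant replies YES.

```go
package main

import "fmt"

// State is a 2PC participant state. PREPARED is the dangerous one: the
// participant has voted YES and must wait for the coordinator's decision.
type State int

const (
	Init State = iota
	Prepared
	Committed
	Aborted
)

var stateNames = [...]string{"INIT", "PREPARED", "COMMITTED", "ABORTED"}

func (s State) String() string { return stateNames[s] }

type Participant struct {
	Name  string
	State State
}

// Prepare records the participant's vote for phase 1.
func (p *Participant) Prepare(voteYes bool) bool {
	if voteYes {
		p.State = Prepared
		return true
	}
	p.State = Aborted
	return false
}

// runTwoPhaseCommit drives phase 1 (prepare) and phase 2 (decision).
// crashBeforeDecision simulates the coordinator dying after collecting votes,
// which leaves every PREPARED participant blocked in-doubt.
func runTwoPhaseCommit(parts []*Participant, votes []bool, crashBeforeDecision bool) {
	allYes := true
	for i, p := range parts {
		if !p.Prepare(votes[i]) {
			allYes = false
		}
	}
	if crashBeforeDecision {
		fmt.Println("coordinator crashed before the decision: PREPARED participants are in-doubt")
		return
	}
	for _, p := range parts {
		if p.State != Prepared {
			continue
		}
		if allYes {
			p.State = Committed
		} else {
			p.State = Aborted
		}
	}
}

func main() {
	parts := []*Participant{{Name: "P1"}, {Name: "P2"}, {Name: "P3"}}
	runTwoPhaseCommit(parts, []bool{true, true, true}, true)
	for _, p := range parts {
		fmt.Printf("%s: %s\n", p.Name, p.State)
	}
}
```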
Learning milestones:
- You can explain in-doubt states clearly
- You can simulate coordinator crashes deterministically
- You can articulate why 2PC is risky in practice
Project 8: Saga Orchestrator Simulator
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Java, Python, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Transactions
- Software or Tool: Workflow orchestrator simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A saga orchestrator that executes steps and compensations under failure.
Why it teaches distributed systems: It highlights how Sagas trade atomicity for availability and resilience.
Core challenges you’ll face:
- Designing compensating actions (maps to consistency)
- Handling partial failure (maps to recovery semantics)
- Managing idempotency (maps to reliability)
Key Concepts
- Sagas: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Idempotency: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Compensating actions: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Understanding of transactions, state machines, and failure handling
Real World Outcome
You will run a workflow (e.g., order, payment, inventory) and watch it succeed or roll back via compensations when a step fails.
Example Output:
$ ./saga_sim --scenario order_fail_payment
Step 1: Reserve inventory -> OK
Step 2: Charge payment -> FAILED
Compensation: Release inventory -> OK
Saga outcome: FAILED (compensated)
The Core Question You’re Answering
“How do you keep a multi-step workflow consistent without global locks?”
Sagas show that consistency can be maintained through explicit compensation rather than global atomicity.
Concepts You Must Understand First
Stop and research these before coding:
- Compensations
- What does it mean to undo a step?
- When is compensation impossible?
- Book Reference: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Idempotency
- Why must retries be safe?
- How do you detect duplicate execution?
- Book Reference: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Workflow Modeling
- How will you describe steps and compensations?
- How will you record progress for recovery?
- Failure Semantics
- When do you retry vs compensate?
- How will you handle partial failures?
Thinking Exercise
Compensation Map
Draw a 3-step saga and list the compensation for each step.
Reserve -> Charge -> Ship
Compensate: Release <- Refund <- Recall
Questions while sketching:
- Which steps are not reversible?
- How does compensation affect user experience?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a Saga and how does it differ from 2PC?”
- “Why is idempotency critical in Sagas?”
- “What happens if a compensation fails?”
- “When would you choose Sagas over 2PC?”
- “How do you guarantee eventual consistency in a saga?”
Hints in Layers
Hint 1 (Starting Point): Define a workflow with explicit forward and compensation steps.
Hint 2 (Next Level): Add a progress log so you can resume after crash.
Hint 3 (Technical Details): Include idempotency keys for each step.
Hint 4 (Tools/Debugging): Print a timeline showing when compensation was triggered.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Sagas | “Designing Data-Intensive Applications” by Kleppmann | Ch. 7 |
| Failures and retries | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Think of Sagas as durable workflows with explicit rollback plans. Focus on failure cases first, then add the happy path.
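One possible shape for the orchestrator core is sketched below, ignoring durability and idempotency keys for brevity (both matter in a real system). The step names mirror the example scenario above; everything else is illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a forward action with the compensation that undoes it.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// runSaga executes steps in order. On the first failure it runs the
// compensations of all previously completed steps, in reverse order.
func runSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Action(); err != nil {
			fmt.Printf("step %q failed: %v; compensating\n", s.Name, err)
			for j := i - 1; j >= 0; j-- {
				if cerr := steps[j].Compensate(); cerr != nil {
					// A failed compensation needs retries or manual repair.
					fmt.Printf("compensation %q failed: %v\n", steps[j].Name, cerr)
				}
			}
			return err
		}
	}
	return nil
}

func main() {
	steps := []Step{
		{"reserve inventory", func() error { fmt.Println("reserved"); return nil },
			func() error { fmt.Println("released"); return nil }},
		{"charge payment", func() error { return errors.New("card declined") },
			func() error { fmt.Println("refunded"); return nil }},
		{"ship order", func() error { fmt.Println("shipped"); return nil },
			func() error { fmt.Println("recalled"); return nil }},
	}
	if err := runSaga(steps); err != nil {
		fmt.Println("saga outcome: FAILED (compensated)")
	}
}
```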
Learning milestones:
- You can design correct compensations
- You can recover a saga after a crash
- You can explain why Sagas are eventually consistent
Project 9: Replication Lag Observatory
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Replication
- Software or Tool: Lag analyzer
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A tool that simulates leader-follower replication and reports lag and staleness windows.
Why it teaches distributed systems: Replication lag explains why eventual consistency appears in real systems.
Core challenges you’ll face:
- Modeling write propagation (maps to replication)
- Measuring staleness windows (maps to consistency)
- Handling follower failure (maps to availability)
Key Concepts
- Replication lag: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Leader/follower roles: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Failure impact: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Difficulty: Intermediate
- Time estimate: Weekend
- Prerequisites: Basic replication concepts and simple metrics
Real World Outcome
You will see how lag varies with write rate, follower speed, and failure events, including a measurable “staleness window” for reads.
Example Output:
$ ./replication_lag --followers 3 --write-rate 50/s
Leader commit latency: 12ms
Follower lag: [45ms, 120ms, 310ms]
Staleness window (p95): 290ms
The Core Question You’re Answering
“How long can replicas be stale, and what controls that window?”
This reveals that eventual consistency is often just replication lag in disguise.
Concepts You Must Understand First
Stop and research these before coding:
- Replication Models
- What is leader-based replication?
- Why do followers lag?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Staleness
- How do you quantify staleness?
- Why is p95 more meaningful than average?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Lag Measurement
- What metric captures “freshness” best?
- How will you track leader commit vs follower apply time?
- Workload Modeling
- How will you simulate bursty writes?
- How will you represent slow followers?
Thinking Exercise
Lag Window
Sketch a timeline where leader commits at t=10 and follower applies at t=80.
Leader commit: t=10
Follower apply: t=80
Staleness window: 70
Questions while sketching:
- What does a read at t=30 see?
- How does lag affect user experience?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What causes replication lag?”
- “Why do followers fall behind under load?”
- “How do you measure staleness in production?”
- “What is the difference between synchronous and asynchronous replication?”
- “How do you reduce lag without killing throughput?”
Hints in Layers
Hint 1 (Starting Point): Model leader commits and follower applies with a queue.
Hint 2 (Next Level): Add randomized delays per follower.
Hint 3 (Technical Details): Track per-write commit and apply timestamps.
Hint 4 (Tools/Debugging): Summarize lag as percentiles rather than averages.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Replication lag | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5 |
| Failure impacts | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Keep the model simple: a leader log and per-follower apply offsets. The gap between them is your lag signal.
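Turning those two sets of timestamps into a lag signal is only a few lines. The sketch below uses a simple nearest-rank-style percentile and hard-coded example timestamps; all names and numbers are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// lagPercentile computes the p-th percentile of (apply - commit) across all
// writes and all followers. commitTimes[i] is when the leader committed write
// i; applyTimes[f][i] is when follower f applied it.
func lagPercentile(commitTimes []int, applyTimes [][]int, p float64) int {
	var lags []int
	for _, follower := range applyTimes {
		for i, applied := range follower {
			lags = append(lags, applied-commitTimes[i])
		}
	}
	sort.Ints(lags)
	idx := int(p * float64(len(lags))) // simple nearest-rank style index
	if idx >= len(lags) {
		idx = len(lags) - 1
	}
	return lags[idx]
}

func main() {
	commit := []int{10, 20, 30, 40}
	apply := [][]int{
		{15, 26, 38, 55},   // fast follower
		{80, 95, 130, 160}, // slow follower: large staleness window
	}
	fmt.Printf("p95 lag: %d ms\n", lagPercentile(commit, apply, 0.95))
	fmt.Printf("p50 lag: %d ms\n", lagPercentile(commit, apply, 0.50))
}
```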
Learning milestones:
- You can measure lag under different loads
- You can explain staleness windows to a user
- You can reason about sync vs async tradeoffs
Project 10: Partition Recovery Playbook
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Failure Modes
- Software or Tool: Recovery simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A scenario-driven tool that simulates network partitions and recovery strategies.
Why it teaches distributed systems: Partitions are the defining failure mode; recovery is where consistency breaks or holds.
Core challenges you’ll face:
- Modeling split-brain writes (maps to CAP)
- Choosing reconciliation rules (maps to consistency)
- Measuring data loss or conflict (maps to correctness)
Key Concepts
- Network partitions: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- CAP tradeoffs: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Conflict resolution: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Understanding of replication, consistency, and conflict resolution
Real World Outcome
You will simulate a partitioned cluster, allow both sides to accept writes, then recover and observe conflicts and resolution outcomes.
Example Output:
$ ./partition_playbook --accept-writes both
Partition: A,B | C,D
Writes during partition: x=1 on A, x=2 on C
Recovery: conflict detected
Resolution: last-write-wins -> x=2
The Core Question You’re Answering
“What should your system do when partitions end and histories disagree?”
This forces you to decide between losing data, merging, or rejecting writes.
Concepts You Must Understand First
Stop and research these before coding:
- Split-Brain
- What happens if both sides accept writes?
- Why does this violate strong consistency?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Conflict Resolution
- What are common merge strategies?
- Why is last-write-wins dangerous?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Policy Choice
- Will you allow writes in both partitions?
- How will you track divergent histories?
- Recovery
- How will you compare versions on rejoin?
- What is your reconciliation strategy?
Thinking Exercise
Divergent Histories
Draw two partitions that write conflicting updates.
Partition A: x=1, x=3
Partition B: x=2
Rejoin: which values survive and why?
Questions while sketching:
- What guarantees are lost during partition?
- Which consistency model does your decision align with?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a split-brain scenario?”
- “How does CAP influence partition handling?”
- “What are common conflict resolution strategies?”
- “Why is last-write-wins risky?”
- “How do CRDTs avoid conflicts?”
Hints in Layers
Hint 1 (Starting Point): Create a simple model with two partitions and a shared keyspace.
Hint 2 (Next Level): Add version vectors or logical timestamps for conflict detection.
Hint 3 (Technical Details): Define a reconciliation policy and apply it to divergent histories.
Hint 4 (Tools/Debugging): Print a “before/after” state on recovery so differences are visible.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Partitions | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
| Consistency tradeoffs | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9 |
Implementation Hints: Start with a single key and two partitions, then expand to multiple keys and resolution strategies.
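Version-vector comparison is the piece that detects divergence. A minimal sketch is shown below (illustrative names; real systems track a vector per key and per replica): if neither vector dominates the other, the histories conflict and your reconciliation policy has to decide what survives.

```go
package main

import "fmt"

// VersionVector counts the writes each replica has issued for a key.
type VersionVector map[string]int

// compare reports whether a dominates b, b dominates a, or neither dominates,
// which is exactly the split-brain conflict case.
func compare(a, b VersionVector) string {
	aGreater, bGreater := false, false
	for _, node := range allKeys(a, b) {
		if a[node] > b[node] {
			aGreater = true
		}
		if b[node] > a[node] {
			bGreater = true
		}
	}
	switch {
	case aGreater && bGreater:
		return "conflict"
	case aGreater:
		return "a dominates"
	case bGreater:
		return "b dominates"
	default:
		return "equal"
	}
}

// allKeys returns the union of the replica names in both vectors.
func allKeys(a, b VersionVector) []string {
	seen := map[string]bool{}
	var out []string
	for k := range a {
		if !seen[k] {
			seen[k] = true
			out = append(out, k)
		}
	}
	for k := range b {
		if !seen[k] {
			seen[k] = true
			out = append(out, k)
		}
	}
	return out
}

func main() {
	// During the partition, side {A,B} and side {C,D} both wrote x.
	left := VersionVector{"A": 2}         // x written twice via A
	right := VersionVector{"C": 1}        // x written once via C
	fmt.Println(compare(left, right))     // conflict: histories must be reconciled
}
```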
Learning milestones:
- You can reproduce split-brain and recovery
- You can compare reconciliation strategies
- You can explain CAP implications of each choice
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| Failure Injection Playground | Intermediate | 1-2 weeks | High | High |
| Raft Log Replication Simulator | Advanced | 1 month+ | Very High | High |
| Paxos Message Trace Explorer | Expert | 1 month+ | Very High | Medium |
| Consistency Model Explorer | Advanced | 1-2 weeks | High | High |
| Quorum Read/Write Tuner | Intermediate | Weekend | Medium | Medium |
| Sharding Strategy Simulator | Advanced | 1-2 weeks | High | High |
| Two-Phase Commit Coordinator Simulator | Advanced | 1-2 weeks | High | Medium |
| Saga Orchestrator Simulator | Advanced | 1-2 weeks | High | Medium |
| Replication Lag Observatory | Intermediate | Weekend | Medium | Medium |
| Partition Recovery Playbook | Advanced | 1-2 weeks | High | High |
Recommendation
Start with Failure Injection Playground to internalize failure modes. Then do Consistency Model Explorer and Quorum Read/Write Tuner to build the consistency intuition. After that, choose Raft Log Replication Simulator if you want deep consensus knowledge, or Sharding Strategy Simulator if you want scalability design instincts.
Final Overall Project: Mini Distributed Database Control Plane
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems Architecture
- Software or Tool: Control plane simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A unified simulator that combines Raft-based metadata consensus, sharded data placement, replication policies, and recovery from partitions.
Why it teaches distributed systems: It forces you to integrate consensus, replication, sharding, and failure handling into a coherent system.
Core challenges you’ll face:
- Coordinating metadata changes with consensus (maps to safety)
- Rebalancing shards while serving reads (maps to availability)
- Handling failures and reconciliation (maps to consistency models)
Summary
You now have a complete sprint of projects that move from failure modeling to full distributed system design. Each project targets a distinct pillar (consensus, consistency, transactions, replication, sharding, failure handling), and together they build the mental models required to design and defend real-world distributed systems.