DISTRIBUTED SYSTEMS FUNDAMENTALS PROJECTS
Distributed Systems Fundamentals — Project Recommendations
Goal
After completing these projects, you will understand the core tradeoffs and invariants that govern distributed systems—why consensus is hard, why consistency is always a choice, how failures distort reality, and how data is kept coherent across machines.
You will internalize:
- Consensus mechanics (Paxos, Raft) as log agreement and leader-driven ordering
- Consistency models (CAP, linearizability, eventual) as precise behavioral contracts
- Distributed transactions (2PC, Sagas) as coordination under failure
- Replication and sharding as latency, availability, and scalability levers
- Failure modes (partitions, clock skew, Byzantine-like behavior) as the default, not the edge
This is not about memorizing buzzwords. It is about learning to reason from first principles:
- “What is the system’s single source of truth at any moment?”
- “Which invariants must never be violated, even under partition?”
- “What happens when messages are delayed, duplicated, or lost?”
By the end of this sprint, you will be able to design, defend, and debug distributed systems with clear mental models and measurable guarantees.
Foundational Concepts: The Six Pillars
1. The Network Is the Computer (And It Lies)
Core idea: A distributed system is many machines cooperating to act like one, connected by communication that can fail at any time.
Process A ----message----> Process B
    ^       \            /       ^
    |        delay/dup/drop      |
    |          (unknown)         |
    +----- time & ordering ------+
Why it matters: Every project assumes the network can be late, lossy, or reordered. Your designs must tolerate that reality.
2. Time Is a Guess (Clocks Are Not Truth)
Real Time:    |----t1----t2----t3----t4----|
Node A clock: |--t1----t2--t3-----t4--|
Node B clock: |----t1--t2----t3--t4--|
Same event, different timestamps.
Why it matters: Causal order and consistency depend on imperfect clocks. Logical time (Lamport clocks, vector clocks) captures ordering without trusting wall-clock time.
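To make this concrete, here is a minimal Lamport clock sketch in Go. The type and method names are illustrative, not from any particular library: the counter ticks on local events and jumps forward whenever a larger timestamp arrives on a message.

```go
package main

import "fmt"

// LamportClock is a minimal logical clock: if event a causally precedes
// event b, then timestamp(a) < timestamp(b). The converse does not hold.
type LamportClock struct{ counter uint64 }

// Tick advances the clock for a local event and returns the new timestamp.
func (c *LamportClock) Tick() uint64 {
	c.counter++
	return c.counter
}

// Observe merges a timestamp received on a message, then ticks.
func (c *LamportClock) Observe(remote uint64) uint64 {
	if remote > c.counter {
		c.counter = remote
	}
	return c.Tick()
}

func main() {
	var a, b LamportClock
	t1 := a.Tick()          // A does local work
	t2 := a.Tick()          // A sends a message stamped t2
	t3 := b.Observe(t2)     // B receives it: its clock jumps past t2
	fmt.Println(t1, t2, t3) // 1 2 3
}
```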
3. Consensus: Agreeing on One History
Clients -> Leader -> Replicated Log
log:
1: set x=1
2: set x=2
3: set y=7
All replicas must see the SAME order.
Why it matters: Consensus is the foundation for correctness in replicated systems: leader election, log replication, and state machine safety.
4. Consistency Models Are Contracts
Linearizable: all clients see same order
Eventual: clients may diverge, then converge
Causal: cause-before-effect preserved
Why it matters: Consistency is a guarantee you choose. The model defines what anomalies you allow.
5. Transactions Across Machines Are Risky
2PC:
Coordinator -> Prepare -> Participants
Coordinator -> Commit/Abort
Failure mid-flight = uncertainty
Why it matters: Distributed transactions require coordination and can block under failure. Sagas trade atomicity for compensation.
6. Replication and Sharding Shape Everything
Replication: same data, multiple copies
Sharding: different data, different nodes
Replication = availability, read scale
Sharding = write scale, data locality
Why it matters: These are the levers for scale, but they change failure behavior and consistency.
Core Concept Analysis
| Concept Cluster | What You Must Internalize |
|---|---|
| Failures | Delay, loss, and partitions are normal; handle them explicitly. |
| Time & Order | Ordering is negotiated, not observed; clocks drift. |
| Consensus | Safety means one history; liveness means progress. |
| Consistency Models | A model is a contract for what anomalies clients see. |
| Distributed Transactions | Coordination buys stronger guarantees but also adds blocking and failure risk. |
| Replication & Sharding | Scale comes with new invariants and tradeoffs. |
Deep Dive Reading by Concept
This section maps each concept to specific chapters for deeper understanding.
Concept 1: Failures and Time
| Concept | Book & Chapter |
|---|---|
| Failure modes | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 8: “The Trouble with Distributed Systems” |
| Time and clocks | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 8: “The Trouble with Distributed Systems” |
Concept 2: Consensus
| Concept | Book & Chapter |
|---|---|
| Consensus fundamentals | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 9: “Consistency and Consensus” |
| Raft overview | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 9: “Consistency and Consensus” |
Concept 3: Consistency Models
| Concept | Book & Chapter |
|---|---|
| Linearizability, CAP | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 9: “Consistency and Consensus” |
| Eventual consistency | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 5: “Replication” |
Concept 4: Transactions
| Concept | Book & Chapter |
|---|---|
| 2PC and distributed transactions | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 7: “Transactions” |
| Sagas and compensations | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 7: “Transactions” |
Concept 5: Replication and Sharding
| Concept | Book & Chapter |
|---|---|
| Replication models | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 5: “Replication” |
| Partitioning/sharding | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 6: “Partitioning” |
Essential Reading Order
- Foundation (Week 1):
- Designing Data-Intensive Applications Ch. 8 (failures, time)
- Designing Data-Intensive Applications Ch. 5 (replication)
- Coordination (Week 2):
- Designing Data-Intensive Applications Ch. 9 (consensus, consistency)
- Transactions (Week 3):
- Designing Data-Intensive Applications Ch. 7 (2PC, Sagas)
- Scaling (Week 4):
- Designing Data-Intensive Applications Ch. 6 (sharding)
Project List
Projects are ordered from core mental models to complex coordination systems.
Project 1: Failure Injection Playground
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Failure Modeling
- Software or Tool: Fault injection harness
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A small multi-node simulator that injects delay, loss, duplication, and reordering into message passing.
Why it teaches distributed systems: You will see how simple protocols fail under real network conditions.
Core challenges you’ll face:
- Designing a controllable network model (maps to failure modes)
- Reproducing heisenbugs (maps to time and ordering)
- Observing invariants breaking under stress (maps to system correctness)
Key Concepts
- Failure models: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Message ordering: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- System invariants: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Difficulty: Intermediate
- Time estimate: 1-2 weeks
- Prerequisites: Basic networking, concurrency primitives, basic logging and CLI usage
Real World Outcome
You will run a simulator with 3-5 nodes and toggle faults. You will observe cases like duplicate messages causing double-apply, or partitions causing stale reads. You will see a timeline log and a failure report that summarizes which invariants broke.
Example Output:
$ ./fault_playground --nodes 5 --loss 0.2 --delay 200ms
[time=00:00.120] link A->C DROPPED message id=17
[time=00:00.350] node C applied write x=9 twice (DUPLICATE)
[time=00:00.401] invariant violated: "x monotonic" at node C
Summary:
- dropped: 12
- delayed: 47
- duplicated: 5
- invariants broken: 2
The Core Question You’re Answering
“What exactly can go wrong in a distributed system, and how do failures distort correctness?”
This project forces you to confront the reality that “correct” algorithms can still fail if you don’t model the network accurately.
Concepts You Must Understand First
Stop and research these before coding:
- Failure Models
- What does “message loss” vs “message delay” mean?
- Why is duplication a separate class of failure?
- Book Reference: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Invariants
- What makes a replicated state machine safe?
- How do you detect invariant violations at runtime?
- Book Reference: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Fault Injection
- How will you deterministically reproduce a failure?
- How will you toggle fault modes per link?
- Observation
- What events must be logged to prove a violation?
- How will you summarize outcomes across runs?
Thinking Exercise
Message Timeline Sketch
Before coding, draw a 3-node timeline where a write is sent, dropped, and retried.
Node A: send(write x=5) -----X (dropped)
Node A: retry(write x=5) ---------->
Node B: apply(write x=5)
Questions while sketching:
- When should the client consider the write “done”?
- How can duplicates be detected without global time?
- Which invariant proves correctness?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What failure modes do you need to assume in distributed systems?”
- “Why does message delay matter even if messages eventually arrive?”
- “How do you design for idempotency?”
- “What is the difference between safety and liveness?”
- “How do you test distributed systems deterministically?”
Hints in Layers
Hint 1 (Starting Point): Start with a single, configurable link that can delay or drop messages.
Hint 2 (Next Level): Make a matrix of links so each pair can have its own fault profile.
Hint 3 (Technical Details): Use a seeded random generator so the same failure run can be replayed.
Hint 4 (Tools/Debugging): Log a per-message ID and use it to correlate drops and retries.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Failure modes | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
| Testing nondeterminism | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Model your network as a queue with policies (drop, delay, duplicate). Build your event log first so you can observe behavior before adding algorithm complexity.
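As a starting point, one possible shape for such a faulty link in Go is sketched below. Everything here (`Msg`, `Delivery`, `FaultyLink`, the probability fields) is a hypothetical design, not a prescribed one; the key idea is the seeded `*rand.Rand`, which makes a failure run replayable.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Msg is a network message with an ID so drops and retries can be correlated.
type Msg struct {
	ID      int
	Payload string
}

// Delivery is one arriving copy of a message plus its simulated delay in ticks.
type Delivery struct {
	Msg   Msg
	Delay int
}

// FaultyLink applies drop/duplicate/delay policies to one directed link.
type FaultyLink struct {
	rng      *rand.Rand
	dropProb float64
	dupProb  float64
	maxDelay int
}

// Deliver returns the copies of m that arrive; an empty result means a drop.
func (l *FaultyLink) Deliver(m Msg) []Delivery {
	if l.rng.Float64() < l.dropProb {
		return nil // dropped
	}
	out := []Delivery{{m, l.rng.Intn(l.maxDelay + 1)}}
	if l.rng.Float64() < l.dupProb {
		out = append(out, Delivery{m, l.rng.Intn(l.maxDelay + 1)}) // duplicated
	}
	return out
}

func main() {
	link := &FaultyLink{rng: rand.New(rand.NewSource(42)), dropProb: 0.2, dupProb: 0.1, maxDelay: 5}
	for i := 1; i <= 5; i++ {
		copies := link.Deliver(Msg{ID: i, Payload: "set x=9"})
		fmt.Printf("msg %d -> %d deliveries\n", i, len(copies))
	}
}
```

Because the generator is seeded, rerunning the program reproduces exactly the same drops and duplicates, which is what makes invariant violations debuggable.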
Learning milestones:
- You can replay a run and get the same failure sequence
- You can cause a specific invariant to break on demand
- You can explain why the failure happened without guessing
Project 2: Raft Log Replication Simulator
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Consensus
- Software or Tool: Event-driven simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A Raft simulator that shows leader election and log replication under failures.
Why it teaches distributed systems: Raft makes consensus concrete: log agreement, leader stability, and term-based safety.
Core challenges you’ll face:
- Modeling leader election and terms (maps to consensus safety)
- Handling log inconsistencies (maps to replication correctness)
- Simulating timeouts and retries (maps to failure behavior)
Key Concepts
- Consensus safety: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Leader election: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Log replication: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1 month+
- Prerequisites: Familiarity with state machines, message passing, and basic concurrency
Real World Outcome
You will run a simulation and watch a leader get elected, commands appended, and replicas converge. Then you will trigger a partition and see a new leader emerge with correct log reconciliation.
Example Output:
$ ./raft_sim --nodes 5 --partition A,C
[t=01.2] Leader elected: node C (term 4)
[t=02.0] Append entry: "set x=9" index=12
[t=02.4] Replication: 4/5 nodes confirmed
[t=03.1] Partition detected: {A,C} isolated
[t=05.2] New leader: node D (term 5)
[t=06.7] Log fix: node C truncated uncommitted entries after index=12
The Core Question You’re Answering
“How do multiple nodes agree on a single ordered log despite failures?”
Raft shows that order is negotiated, not assumed, and that safety depends on strict rules about terms and majorities.
Concepts You Must Understand First
Stop and research these before coding:
- Quorums
- Why is majority voting required for safety?
- What breaks if two leaders exist at once?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Log Replication
- What does it mean for logs to be “consistent”?
- How does a follower detect divergence?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Leader Election
- What happens if two candidates time out together?
- How do you prevent old leaders from overwriting new logs?
- Replication
- How does the leader decide which entries are committed?
- How do followers roll back conflicting entries?
Thinking Exercise
Term Timeline
Sketch two leaders elected in different partitions and reconcile after healing.
Partition 1 (minority): Leader A, term 3, appends entries 8..10 but cannot commit them
Partition 2 (majority): Leader D, term 4, commits entries 8..9
Rejoin: which log entries survive?
Questions while sketching:
- Which term wins and why?
- Which entries are safe to keep?
- What invariant guarantees safety?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does Raft use terms?”
- “How does Raft prevent split-brain?”
- “What is the difference between committed and applied entries?”
- “How does leader election handle clock drift?”
- “Why is quorum required for writes?”
Hints in Layers
Hint 1 (Starting Point): Implement election timeouts and a basic vote request/response flow.
Hint 2 (Next Level): Add a replicated log with append entries but no failure handling.
Hint 3 (Technical Details): Track terms per log index and roll back on mismatch.
Hint 4 (Tools/Debugging): Log term transitions and committed index movement per node.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Consensus | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9 |
| Replication | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5 |
Implementation Hints: Build the system as a deterministic event simulator first. Only when logs converge reliably should you introduce fault injection.
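A useful core to get right early is the follower-side consistency check. The sketch below is a simplified, 0-indexed take on Raft's AppendEntries handling (the real protocol also carries the leader's commit index, term checks on the RPC itself, and 1-based indexes); function and type names are illustrative.

```go
package main

import "fmt"

// Entry is one log entry; Term records the leader term that created it.
type Entry struct {
	Term int
	Cmd  string
}

// appendEntries sketches the follower-side consistency check: accept new
// entries only if the log matches the leader at prevIndex/prevTerm, and
// truncate any conflicting suffix before appending.
func appendEntries(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	if prevIndex >= 0 {
		if prevIndex >= len(log) || log[prevIndex].Term != prevTerm {
			return log, false // reject; leader retries with an earlier prevIndex
		}
	}
	insertAt := prevIndex + 1
	for i, e := range entries {
		idx := insertAt + i
		if idx < len(log) && log[idx].Term != e.Term {
			log = log[:idx] // conflict: drop this entry and everything after it
		}
		if idx >= len(log) {
			log = append(log, e)
		}
	}
	return log, true
}

func main() {
	follower := []Entry{{1, "set x=1"}, {1, "set x=2"}, {2, "set x=3"}} // diverged at index 2
	leaderTail := []Entry{{3, "set y=7"}}
	updated, ok := appendEntries(follower, 1, 1, leaderTail)
	fmt.Println(ok, updated) // true: conflicting entry replaced by the leader's
}
```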
Learning milestones:
- Leader election produces a stable leader
- Log entries replicate and commit across majority
- Partitions heal without violating log safety
Project 3: Paxos Message Trace Explorer
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Consensus
- Software or Tool: Protocol trace visualizer
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A tool that replays Paxos phases (prepare/accept) from message traces and shows agreement.
Why it teaches distributed systems: Paxos is the archetype of consensus. Tracing its messages clarifies why it guarantees safety.
Core challenges you’ll face:
- Modeling proposal numbers (maps to safety)
- Handling competing proposers (maps to liveness)
- Showing chosen values under partial failure (maps to correctness)
Key Concepts
- Paxos safety: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Majority quorum: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Failure handling: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Difficulty: Expert
- Time estimate: 1 month+
- Prerequisites: Strong grasp of quorum logic, message ordering, and event tracing
Real World Outcome
You will load a recorded sequence of Paxos messages and see a step-by-step timeline of proposals, promises, accepts, and the final chosen value.
Example Output:
$ ./paxos_trace --file traces/competing_proposers.log
Step 1: P1 sends PREPARE n=5 to A,B,C
Step 2: A,B promise n=5
Step 3: P2 sends PREPARE n=6 to B,C,D
Step 4: B,C promise n=6 (P1 now preempted)
Step 5: P2 sends ACCEPT n=6 value=V2
Chosen value: V2 (quorum B,C,D)
The Core Question You’re Answering
“How does Paxos guarantee a single chosen value even with competing leaders?”
Tracing the protocol reveals how quorums and proposal numbers enforce safety.
Concepts You Must Understand First
Stop and research these before coding:
- Proposal Numbers
- Why must proposal numbers be totally ordered?
- What does “promised” mean in Paxos?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Quorum Intersection
- Why do two quorums always overlap?
- How does that guarantee safety?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Trace Format
- What minimal fields are required to prove correctness?
- How will you show preemption and retries?
- Visualization
- How will you highlight the chosen value?
- How will you show quorum overlap?
Thinking Exercise
Competing Proposers
Draw two proposers racing with different proposal numbers.
P1: n=5 -> A,B
P2: n=6 -> B,C
Which proposer wins and why?
Questions while sketching:
- Which promises invalidate earlier proposals?
- What value must be accepted if a promise includes prior acceptance?
- Where does safety come from?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does Paxos require quorum intersection?”
- “What happens if a proposer retries with a higher number?”
- “How does Paxos avoid choosing two values?”
- “Why is Paxos considered hard to implement?”
- “How is Raft easier than Paxos?”
Hints in Layers
Hint 1 (Starting Point): Define a trace format that captures proposer, acceptor, message type, and number.
Hint 2 (Next Level): Replay in strict timestamp order so you can reproduce outcomes.
Hint 3 (Technical Details): Track each acceptor’s promised number and last accepted value.
Hint 4 (Tools/Debugging): Highlight the quorum that chooses the final value in your output.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Paxos safety | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9 |
| Failure models | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Focus on trace replay and visualization first. Paxos logic becomes simpler when you can inspect each acceptor’s state after every step.
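The heart of the replay is each acceptor's state transition. A minimal single-decree acceptor might look like the following Go sketch (names and signatures are illustrative): it refuses prepares at or below its current promise, and reports any previously accepted value so a new proposer is forced to reuse it.

```go
package main

import "fmt"

// Acceptor holds the two pieces of state Paxos requires to be durable:
// the highest proposal number promised, and the last accepted proposal.
type Acceptor struct {
	promised      int
	acceptedNum   int
	acceptedValue string
	hasAccepted   bool
}

// Prepare handles phase 1a: promise n if it beats every prior promise, and
// report any previously accepted value so the proposer must adopt it.
func (a *Acceptor) Prepare(n int) (ok bool, prevNum int, prevVal string, hasPrev bool) {
	if n <= a.promised {
		return false, 0, "", false // already promised an equal or higher number
	}
	a.promised = n
	return true, a.acceptedNum, a.acceptedValue, a.hasAccepted
}

// Accept handles phase 2a: accept (n, v) unless a higher number was promised.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised = n
	a.acceptedNum = n
	a.acceptedValue = v
	a.hasAccepted = true
	return true
}

func main() {
	acc := &Acceptor{}
	fmt.Println(acc.Prepare(5))      // promise n=5
	fmt.Println(acc.Prepare(6))      // promise n=6, preempting n=5
	fmt.Println(acc.Accept(5, "V1")) // false: 5 < promised 6
	fmt.Println(acc.Accept(6, "V2")) // true: V2 accepted
}
```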
Learning milestones:
- You can explain every promise and accept
- You can show why only one value wins
- You can identify why Paxos stalls under contention
Project 4: Consistency Model Explorer
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Consistency Models
- Software or Tool: Scenario explorer
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A tool that simulates reads/writes under linearizable, causal, and eventual consistency.
Why it teaches distributed systems: You will see how different guarantees change observable client behavior.
Core challenges you’ll face:
- Modeling client histories (maps to consistency semantics)
- Defining allowed anomalies (maps to CAP tradeoffs)
- Rendering timelines (maps to reasoning about order)
Key Concepts
- Linearizability: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Eventual consistency: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- CAP tradeoffs: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Basic understanding of read/write histories and concurrency
Real World Outcome
You will run scenarios with two clients and two replicas, then toggle the consistency model and observe the allowed read results.
Example Output:
$ ./consistency_explorer --model eventual
Client A: write x=1
Client B: read x -> 0 (stale)
Client B: read x -> 1 (after convergence)
Allowed anomaly: stale read (eventual consistency)
The Core Question You’re Answering
“What behaviors are allowed by different consistency guarantees?”
This project forces you to stop thinking of consistency as a single on/off switch and start treating each model as its own contract.
Concepts You Must Understand First
Stop and research these before coding:
- Histories and Order
- What is a history in distributed systems?
- What does it mean for a history to be linearizable?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- CAP Theorem
- What is sacrificed under partition?
- Why is availability defined per request?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Scenario Encoding
- How will you specify client operations and timing?
- How will you show visibility of writes?
- Model Rules
- What outputs are forbidden vs allowed?
- How do you detect anomalies?
Thinking Exercise
Two Clients, Two Replicas
Sketch a timeline where A writes x=1, B reads before replication.
Time:  1        2        3        4
A:     W(x=1)
B:              R(x=?)
Replica 2 receives the write at time 4
Questions while sketching:
- What does linearizability require B to see?
- What does eventual allow?
- What does causal guarantee?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is the difference between linearizability and eventual consistency?”
- “How does CAP relate to consistency guarantees?”
- “What anomalies are acceptable under eventual consistency?”
- “What is causal consistency?”
- “Why is strong consistency expensive?”
Hints in Layers
Hint 1 (Starting Point): Define a simple operation log of reads/writes with timestamps.
Hint 2 (Next Level): Implement one model (eventual) before adding others.
Hint 3 (Technical Details): Use a visibility matrix per replica to decide read results.
Hint 4 (Tools/Debugging): Generate the same scenario across models to compare outcomes side by side.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Consistency models | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9 |
| Replication & staleness | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5 |
Implementation Hints: Build a tiny “history” format and a validator that checks if the history violates the rules of a given model.
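One very small validator to start with checks a necessary condition for linearizability on a single key: a read must not return a value older than a write that had already completed before the read began. The Go sketch below assumes integer values written in increasing order; it is deliberately not a full linearizability checker, and all names are illustrative.

```go
package main

import "fmt"

// Op is one client operation on a single key, with wall-clock start/end times.
type Op struct {
	Client string
	Kind   string // "write" or "read"
	Value  int    // value written, or value observed by the read
	Start  int
	End    int
}

// staleReads flags reads that return a value older than some write that
// finished before the read started. That anomaly is forbidden under
// linearizability but allowed under eventual consistency.
func staleReads(history []Op) []Op {
	var anomalies []Op
	for _, r := range history {
		if r.Kind != "read" {
			continue
		}
		latestCompleted := -1
		for _, w := range history {
			if w.Kind == "write" && w.End < r.Start && w.Value > latestCompleted {
				latestCompleted = w.Value
			}
		}
		if r.Value < latestCompleted {
			anomalies = append(anomalies, r)
		}
	}
	return anomalies
}

func main() {
	history := []Op{
		{Client: "A", Kind: "write", Value: 1, Start: 1, End: 2},
		{Client: "B", Kind: "read", Value: 0, Start: 3, End: 4}, // stale: the write finished at t=2
	}
	for _, a := range staleReads(history) {
		fmt.Printf("stale read by %s: saw %d\n", a.Client, a.Value)
	}
}
```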
Learning milestones:
- You can explain why a history is or is not linearizable
- You can produce a valid eventual history with anomalies
- You can map CAP tradeoffs to observable behavior
Project 5: Quorum Read/Write Tuner
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Replication
- Software or Tool: Quorum calculator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A simulator that explores how different read/write quorum sizes affect consistency and availability.
Why it teaches distributed systems: It makes quorum math tangible: N, R, W and their consequences.
Core challenges you’ll face:
- Modeling quorum intersections (maps to consistency guarantees)
- Handling node failures (maps to availability)
- Showing staleness vs latency tradeoffs (maps to replication choices)
Key Concepts
- Quorum consistency: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Replication lag: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Availability tradeoffs: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Difficulty: Intermediate
- Time estimate: Weekend
- Prerequisites: Basic distributed systems terminology, simple probability reasoning
Real World Outcome
You will input N, R, W values and see if the system can guarantee read-after-write consistency under different failure rates.
Example Output:
$ ./quorum_tuner --nodes 5 --R 2 --W 3 --fail 0.2
R+W = 5 = N (read and write quorums are not guaranteed to overlap)
Expected availability: 0.80 for writes, 0.96 for reads
Risk: stale read possible if async replication is slow
The Core Question You’re Answering
“How do quorum sizes trade off availability and consistency?”
This teaches you why majority is not just tradition; it is a correctness constraint.
Concepts You Must Understand First
Stop and research these before coding:
- Quorum Rules
- Why does R+W > N matter?
- How do you compute read availability?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Replication Lag
- Why can a read quorum still be stale?
- How does leader-based replication help?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Metrics
- How will you quantify availability per request?
- How will you present staleness risk?
- Simulation
- Will you use Monte Carlo runs or deterministic math?
- How will you model failure probability?
Thinking Exercise
Quorum Math
On paper, compute if R=2, W=2, N=3 guarantees consistency.
If R+W = 4 > 3, quorum intersection exists.
What happens if one node is down?
Questions while sketching:
- Does intersection always hold under failure?
- How do failures affect availability vs correctness?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does R+W > N matter?”
- “Can quorum systems still return stale reads?”
- “What does availability mean in CAP terms?”
- “How do you tune quorum sizes?”
- “What is sloppy quorum and hinted handoff?”
Hints in Layers
Hint 1 (Starting Point): Compute availability analytically for fixed N, R, W.
Hint 2 (Next Level): Add a simple simulator to estimate staleness frequency.
Hint 3 (Technical Details): Model replication delay as a distribution, not a constant.
Hint 4 (Tools/Debugging): Plot read/write success rates across failure probabilities.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Quorum systems | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5 |
| Availability tradeoffs | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Start with the algebraic rules, then add stochastic simulation to show how delays cause stale reads even when quorum math looks safe.
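The algebraic part is a short computation. Assuming independent node failures (a simplification), read or write availability is a binomial tail probability, as in this sketch; the N, R, W values mirror the example output above and everything else is illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// binom returns the binomial coefficient C(n, k).
func binom(n, k int) float64 {
	if k < 0 || k > n {
		return 0
	}
	res := 1.0
	for i := 0; i < k; i++ {
		res = res * float64(n-i) / float64(i+1)
	}
	return res
}

// quorumAvailability is the probability that at least q of n nodes are up,
// assuming each node is up independently with probability pUp.
func quorumAvailability(n, q int, pUp float64) float64 {
	total := 0.0
	for k := q; k <= n; k++ {
		total += binom(n, k) * math.Pow(pUp, float64(k)) * math.Pow(1-pUp, float64(n-k))
	}
	return total
}

func main() {
	n, r, w := 5, 2, 3
	pUp := 0.8 // per-node probability of being reachable (i.e. --fail 0.2)
	fmt.Printf("R+W > N: %v\n", r+w > n)
	fmt.Printf("read availability (R=%d):  %.3f\n", r, quorumAvailability(n, r, pUp))
	fmt.Printf("write availability (W=%d): %.3f\n", w, quorumAvailability(n, w, pUp))
}
```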
Learning milestones:
- You can compute quorum properties by hand
- You can show availability changes with failure rate
- You can explain why quorum is not a full consistency guarantee
Project 6: Sharding Strategy Simulator
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Sharding
- Software or Tool: Partitioning simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A simulator that compares hash-based, range-based, and consistent hashing sharding strategies.
Why it teaches distributed systems: It reveals how shard choice impacts balance, hotspots, and resharding cost.
Core challenges you’ll face:
- Modeling key distributions (maps to load balance)
- Simulating node joins/leaves (maps to resharding)
- Measuring hotspot frequency (maps to latency spikes)
Key Concepts
- Partitioning strategies: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
- Rebalancing costs: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
- Hotspot detection: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Understanding of hashing, keyspace partitioning, basic statistics
Real World Outcome
You will input a key distribution and number of nodes, then see shard balance and data movement when nodes join or leave.
Example Output:
$ ./shard_sim --strategy consistent --nodes 4 --keys zipf
Balance report:
max shard load: 1.8x average
moved keys on join: 12%
Hotspots detected: shard 2
The Core Question You’re Answering
“How does the choice of sharding strategy shape performance and operational cost?”
Sharding is not just about scalability; it affects resilience and developer experience.
Concepts You Must Understand First
Stop and research these before coding:
- Partitioning Strategies
- What is range partitioning vs hash partitioning?
- Why does consistent hashing reduce rebalancing?
- Book Reference: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
- Skewed Workloads
- Why do hotspots appear in real data?
- How do you measure imbalance?
- Book Reference: “Designing Data-Intensive Applications” Ch. 6 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Key Distribution
- How will you model uniform vs Zipf distributions?
- How will you visualize imbalance?
- Resharding
- How will you measure data movement cost?
- How will you simulate node failure?
Thinking Exercise
Shard Movement
Sketch what happens when you add a node to a 4-node hash ring.
Before: [A][B][C][D]
After: [A][B][E][C][D]
Which keys move and why?
Questions while sketching:
- Which strategy minimizes movement?
- How do hotspots persist even with hashing?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is consistent hashing and why is it used?”
- “How does range-based sharding create hotspots?”
- “What are the tradeoffs between sharding and replication?”
- “How do you reshard without downtime?”
- “What is a shard rebalancing strategy?”
Hints in Layers
Hint 1 (Starting Point): Generate keys and assign them to shards by each strategy.
Hint 2 (Next Level): Compute load variance per shard to measure balance.
Hint 3 (Technical Details): Model rebalancing as the fraction of keys that move when a node joins.
Hint 4 (Tools/Debugging): Visualize shard load with simple histograms.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Partitioning | “Designing Data-Intensive Applications” by Kleppmann | Ch. 6 |
| Rebalancing | “Designing Data-Intensive Applications” by Kleppmann | Ch. 6 |
Implementation Hints: Start with deterministic key assignments so you can test balance. Then add randomized distributions to mimic reality.
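For the consistent-hashing strategy, a bare-bones ring without virtual nodes might look like the sketch below (node and key names are illustrative; real implementations add many virtual nodes per server to smooth the balance). Measuring how many keys change owner when a node joins gives you the rebalancing-cost metric directly.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring: each node owns the arc of key
// hashes up to its own position, moving clockwise.
type Ring struct {
	points []uint32          // sorted hash positions
	owner  map[uint32]string // hash position -> node name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes []string) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		p := hashKey(n)
		r.points = append(r.points, p)
		r.owner[p] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Owner returns the first node clockwise from the key's hash position.
func (r *Ring) Owner(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	before := NewRing([]string{"A", "B", "C", "D"})
	after := NewRing([]string{"A", "B", "C", "D", "E"})
	moved := 0
	for i := 0; i < 10000; i++ {
		key := fmt.Sprintf("key-%d", i)
		if before.Owner(key) != after.Owner(key) {
			moved++
		}
	}
	fmt.Printf("keys moved when E joins: %.1f%%\n", float64(moved)/100)
}
```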
Learning milestones:
- You can measure shard balance for each strategy
- You can quantify rebalancing cost on node join
- You can explain hotspots in real workloads
Project 7: Two-Phase Commit Coordinator Simulator
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Java, Python, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Transactions
- Software or Tool: Transaction simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A 2PC coordinator/participant simulator with failure injection and timeout handling.
Why it teaches distributed systems: 2PC is the canonical example of coordination and blocking under failure.
Core challenges you’ll face:
- Modeling prepare/commit phases (maps to atomicity)
- Handling coordinator crashes (maps to blocking)
- Visualizing uncertain outcomes (maps to recovery)
Key Concepts
- Two-phase commit: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Failure handling: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Atomicity vs availability: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Understanding of transactions, logging, and failure recovery
Real World Outcome
You will simulate transactions across multiple participants and watch outcomes under failures, including the classic “in-doubt” state.
Example Output:
$ ./twopc_sim --participants 3 --crash coordinator@prepare
Phase 1: PREPARE sent to P1,P2,P3
P1: YES
P2: YES
P3: NO RESPONSE (timeout)
Coordinator crashed before decision
Result: transaction IN-DOUBT at P1,P2
The Core Question You’re Answering
“Why does 2PC block, and what is the cost of atomicity under failure?”
This project shows why distributed transactions are expensive and fragile.
Concepts You Must Understand First
Stop and research these before coding:
- Atomicity Across Nodes
- What does it mean to commit atomically across machines?
- Why is the coordinator a single point of failure?
- Book Reference: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Durable Logs
- What does a participant need to log to recover?
- Why does logging create in-doubt states?
- Book Reference: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Failure Simulation
- What happens if the coordinator dies after prepare?
- How do participants decide in-doubt outcomes?
- Recovery
- How will you show state recovery on restart?
- What data must be persisted?
Thinking Exercise
Failure Windows
Sketch the transaction timeline and mark where crashes cause blocking.
Prepare -> Votes -> Decision -> Commit
Crash points: after votes but before decision
Questions while sketching:
- When is the transaction irreversible?
- Why can participants be stuck?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does 2PC block under coordinator failure?”
- “What is an in-doubt transaction?”
- “How does 3PC attempt to fix 2PC?”
- “Why do modern systems avoid 2PC?”
- “What is the role of write-ahead logging in 2PC?”
Hints in Layers
Hint 1 (Starting Point): Implement the prepare and commit phases without failure handling.
Hint 2 (Next Level): Add timeouts and failure injection at each step.
Hint 3 (Technical Details): Track participant states: INIT, PREPARED, COMMITTED, ABORTED.
Hint 4 (Tools/Debugging): Record a timeline so you can replay and explain each outcome.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Distributed transactions | “Designing Data-Intensive Applications” by Kleppmann | Ch. 7 |
| Failure behavior | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Model each participant as a finite state machine. Crashes freeze the state; recovery reads the last logged state.
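A sketch of that state machine in Go, with a crash injected between vote collection and the decision, could look like the following. All names are illustrative, and durable logging is omitted; in a real system PREPARED must be logged before the participant replies YES.

```go
package main

import "fmt"

// State is a 2PC participant state. PREPARED is the dangerous one: the
// participant has voted YES and must wait for the coordinator's decision.
type State int

const (
	Init State = iota
	Prepared
	Committed
	Aborted
)

var stateNames = [...]string{"INIT", "PREPARED", "COMMITTED", "ABORTED"}

func (s State) String() string { return stateNames[s] }

type Participant struct {
	Name  string
	State State
}

// Prepare records the participant's vote for phase 1.
func (p *Participant) Prepare(voteYes bool) bool {
	if voteYes {
		p.State = Prepared
		return true
	}
	p.State = Aborted
	return false
}

// runTwoPhaseCommit drives phase 1 (prepare) and phase 2 (decision).
// crashBeforeDecision simulates the coordinator dying after collecting votes,
// which leaves every PREPARED participant blocked in-doubt.
func runTwoPhaseCommit(parts []*Participant, votes []bool, crashBeforeDecision bool) {
	allYes := true
	for i, p := range parts {
		if !p.Prepare(votes[i]) {
			allYes = false
		}
	}
	if crashBeforeDecision {
		fmt.Println("coordinator crashed before the decision: PREPARED participants are in-doubt")
		return
	}
	for _, p := range parts {
		if p.State != Prepared {
			continue
		}
		if allYes {
			p.State = Committed
		} else {
			p.State = Aborted
		}
	}
}

func main() {
	parts := []*Participant{{Name: "P1"}, {Name: "P2"}, {Name: "P3"}}
	runTwoPhaseCommit(parts, []bool{true, true, true}, true)
	for _, p := range parts {
		fmt.Printf("%s: %s\n", p.Name, p.State)
	}
}
```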
Learning milestones:
- You can explain in-doubt states clearly
- You can simulate coordinator crashes deterministically
- You can articulate why 2PC is risky in practice
Project 8: Saga Orchestrator Simulator
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Java, Python, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Transactions
- Software or Tool: Workflow orchestrator simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A saga orchestrator that executes steps and compensations under failure.
Why it teaches distributed systems: It highlights how Sagas trade atomicity for availability and resilience.
Core challenges you’ll face:
- Designing compensating actions (maps to consistency)
- Handling partial failure (maps to recovery semantics)
- Managing idempotency (maps to reliability)
Key Concepts
- Sagas: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Idempotency: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Compensating actions: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Understanding of transactions, state machines, and failure handling
Real World Outcome
You will run a workflow (e.g., order, payment, inventory) and watch it succeed or roll back via compensations when a step fails.
Example Output:
$ ./saga_sim --scenario order_fail_payment
Step 1: Reserve inventory -> OK
Step 2: Charge payment -> FAILED
Compensation: Release inventory -> OK
Saga outcome: FAILED (compensated)
The Core Question You’re Answering
“How do you keep a multi-step workflow consistent without global locks?”
Sagas show that consistency can be maintained through explicit compensation rather than global atomicity.
Concepts You Must Understand First
Stop and research these before coding:
- Compensations
- What does it mean to undo a step?
- When is compensation impossible?
- Book Reference: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
- Idempotency
- Why must retries be safe?
- How do you detect duplicate execution?
- Book Reference: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Workflow Modeling
- How will you describe steps and compensations?
- How will you record progress for recovery?
- Failure Semantics
- When do you retry vs compensate?
- How will you handle partial failures?
Thinking Exercise
Compensation Map
Draw a 3-step saga and list the compensation for each step.
Reserve -> Charge -> Ship
Compensate: Release <- Refund <- Recall
Questions while sketching:
- Which steps are not reversible?
- How does compensation affect user experience?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a Saga and how does it differ from 2PC?”
- “Why is idempotency critical in Sagas?”
- “What happens if a compensation fails?”
- “When would you choose Sagas over 2PC?”
- “How do you guarantee eventual consistency in a saga?”
Hints in Layers
Hint 1 (Starting Point): Define a workflow with explicit forward and compensation steps.
Hint 2 (Next Level): Add a progress log so you can resume after crash.
Hint 3 (Technical Details): Include idempotency keys for each step.
Hint 4 (Tools/Debugging): Print a timeline showing when compensation was triggered.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Sagas | “Designing Data-Intensive Applications” by Kleppmann | Ch. 7 |
| Failures and retries | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Think of Sagas as durable workflows with explicit rollback plans. Focus on failure cases first, then add the happy path.
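One possible shape for the orchestrator core is sketched below, ignoring durability and idempotency keys for brevity (both matter in a real system). The step names mirror the example scenario above; everything else is illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a forward action with the compensation that undoes it.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// runSaga executes steps in order. On the first failure it runs the
// compensations of all previously completed steps, in reverse order.
func runSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Action(); err != nil {
			fmt.Printf("step %q failed: %v; compensating\n", s.Name, err)
			for j := i - 1; j >= 0; j-- {
				if cerr := steps[j].Compensate(); cerr != nil {
					// A failed compensation needs retries or manual repair.
					fmt.Printf("compensation %q failed: %v\n", steps[j].Name, cerr)
				}
			}
			return err
		}
	}
	return nil
}

func main() {
	steps := []Step{
		{"reserve inventory", func() error { fmt.Println("reserved"); return nil },
			func() error { fmt.Println("released"); return nil }},
		{"charge payment", func() error { return errors.New("card declined") },
			func() error { fmt.Println("refunded"); return nil }},
		{"ship order", func() error { fmt.Println("shipped"); return nil },
			func() error { fmt.Println("recalled"); return nil }},
	}
	if err := runSaga(steps); err != nil {
		fmt.Println("saga outcome: FAILED (compensated)")
	}
}
```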
Learning milestones:
- You can design correct compensations
- You can recover a saga after a crash
- You can explain why Sagas are eventually consistent
Project 9: Replication Lag Observatory
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Replication
- Software or Tool: Lag analyzer
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A tool that simulates leader-follower replication and reports lag and staleness windows.
Why it teaches distributed systems: Replication lag explains why eventual consistency appears in real systems.
Core challenges you’ll face:
- Modeling write propagation (maps to replication)
- Measuring staleness windows (maps to consistency)
- Handling follower failure (maps to availability)
Key Concepts
- Replication lag: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Leader/follower roles: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Failure impact: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- Difficulty: Intermediate
- Time estimate: Weekend
- Prerequisites: Basic replication concepts and simple metrics
Real World Outcome
You will see how lag varies with write rate, follower speed, and failure events, including a measurable “staleness window” for reads.
Example Output:
$ ./replication_lag --followers 3 --write-rate 50/s
Leader commit latency: 12ms
Follower lag: [45ms, 120ms, 310ms]
Staleness window (p95): 290ms
The Core Question You’re Answering
“How long can replicas be stale, and what controls that window?”
This reveals that eventual consistency is often just replication lag in disguise.
Concepts You Must Understand First
Stop and research these before coding:
- Replication Models
- What is leader-based replication?
- Why do followers lag?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Staleness
- How do you quantify staleness?
- Why is p95 more meaningful than average?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Lag Measurement
- What metric captures “freshness” best?
- How will you track leader commit vs follower apply time?
- Workload Modeling
- How will you simulate bursty writes?
- How will you represent slow followers?
Thinking Exercise
Lag Window
Sketch a timeline where leader commits at t=10 and follower applies at t=80.
Leader commit: t=10
Follower apply: t=80
Staleness window: 70
Questions while sketching:
- What does a read at t=30 see?
- How does lag affect user experience?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What causes replication lag?”
- “Why do followers fall behind under load?”
- “How do you measure staleness in production?”
- “What is the difference between synchronous and asynchronous replication?”
- “How do you reduce lag without killing throughput?”
Hints in Layers
Hint 1 (Starting Point): Model leader commits and follower applies with a queue.
Hint 2 (Next Level): Add randomized delays per follower.
Hint 3 (Technical Details): Track per-write commit and apply timestamps.
Hint 4 (Tools/Debugging): Summarize lag as percentiles rather than averages.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Replication lag | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5 |
| Failure impacts | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
Implementation Hints: Keep the model simple: a leader log and per-follower apply offsets. The gap between them is your lag signal.
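Turning those two sets of timestamps into a lag signal is only a few lines. The sketch below uses a simple nearest-rank-style percentile and hard-coded example timestamps; all names and numbers are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// lagPercentile computes the p-th percentile of (apply - commit) across all
// writes and all followers. commitTimes[i] is when the leader committed write
// i; applyTimes[f][i] is when follower f applied it.
func lagPercentile(commitTimes []int, applyTimes [][]int, p float64) int {
	var lags []int
	for _, follower := range applyTimes {
		for i, applied := range follower {
			lags = append(lags, applied-commitTimes[i])
		}
	}
	sort.Ints(lags)
	idx := int(p * float64(len(lags))) // simple nearest-rank style index
	if idx >= len(lags) {
		idx = len(lags) - 1
	}
	return lags[idx]
}

func main() {
	commit := []int{10, 20, 30, 40}
	apply := [][]int{
		{15, 26, 38, 55},   // fast follower
		{80, 95, 130, 160}, // slow follower: large staleness window
	}
	fmt.Printf("p95 lag: %d ms\n", lagPercentile(commit, apply, 0.95))
	fmt.Printf("p50 lag: %d ms\n", lagPercentile(commit, apply, 0.50))
}
```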
Learning milestones:
- You can measure lag under different loads
- You can explain staleness windows to a user
- You can reason about sync vs async tradeoffs
Project 10: Partition Recovery Playbook
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Failure Modes
- Software or Tool: Recovery simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A scenario-driven tool that simulates network partitions and recovery strategies.
Why it teaches distributed systems: Partitions are the defining failure mode; recovery is where consistency breaks or holds.
Core challenges you’ll face:
- Modeling split-brain writes (maps to CAP)
- Choosing reconciliation rules (maps to consistency)
- Measuring data loss or conflict (maps to correctness)
Key Concepts
- Network partitions: “Designing Data-Intensive Applications” Ch. 8 - Kleppmann
- CAP tradeoffs: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Conflict resolution: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
- Difficulty: Advanced
- Time estimate: 1-2 weeks
- Prerequisites: Understanding of replication, consistency, and conflict resolution
Real World Outcome
You will simulate a partitioned cluster, allow both sides to accept writes, then recover and observe conflicts and resolution outcomes.
Example Output:
$ ./partition_playbook --accept-writes both
Partition: A,B | C,D
Writes during partition: x=1 on A, x=2 on C
Recovery: conflict detected
Resolution: last-write-wins -> x=2
The Core Question You’re Answering
“What should your system do when partitions end and histories disagree?”
This forces you to decide between losing data, merging, or rejecting writes.
Concepts You Must Understand First
Stop and research these before coding:
- Split-Brain
- What happens if both sides accept writes?
- Why does this violate strong consistency?
- Book Reference: “Designing Data-Intensive Applications” Ch. 9 - Kleppmann
- Conflict Resolution
- What are common merge strategies?
- Why is last-write-wins dangerous?
- Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
Questions to Guide Your Design
Before implementing, think through these:
- Policy Choice
- Will you allow writes in both partitions?
- How will you track divergent histories?
- Recovery
- How will you compare versions on rejoin?
- What is your reconciliation strategy?
Thinking Exercise
Divergent Histories
Draw two partitions that write conflicting updates.
Partition A: x=1, x=3
Partition B: x=2
Rejoin: which values survive and why?
Questions while sketching:
- What guarantees are lost during partition?
- Which consistency model does your decision align with?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a split-brain scenario?”
- “How does CAP influence partition handling?”
- “What are common conflict resolution strategies?”
- “Why is last-write-wins risky?”
- “How do CRDTs avoid conflicts?”
Hints in Layers
Hint 1 (Starting Point): Create a simple model with two partitions and a shared keyspace.
Hint 2 (Next Level): Add version vectors or logical timestamps for conflict detection.
Hint 3 (Technical Details): Define a reconciliation policy and apply it to divergent histories.
Hint 4 (Tools/Debugging): Print a “before/after” state on recovery so differences are visible.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Partitions | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
| Consistency tradeoffs | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9 |
Implementation Hints: Start with a single key and two partitions, then expand to multiple keys and resolution strategies.
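Version-vector comparison is the piece that detects divergence. A minimal sketch is shown below (illustrative names; real systems track a vector per key and per replica): if neither vector dominates the other, the histories conflict and your reconciliation policy has to decide what survives.

```go
package main

import "fmt"

// VersionVector counts the writes each replica has issued for a key.
type VersionVector map[string]int

// compare reports whether a dominates b, b dominates a, or neither dominates,
// which is exactly the split-brain conflict case.
func compare(a, b VersionVector) string {
	aGreater, bGreater := false, false
	for _, node := range allKeys(a, b) {
		if a[node] > b[node] {
			aGreater = true
		}
		if b[node] > a[node] {
			bGreater = true
		}
	}
	switch {
	case aGreater && bGreater:
		return "conflict"
	case aGreater:
		return "a dominates"
	case bGreater:
		return "b dominates"
	default:
		return "equal"
	}
}

// allKeys returns the union of the replica names in both vectors.
func allKeys(a, b VersionVector) []string {
	seen := map[string]bool{}
	var out []string
	for k := range a {
		if !seen[k] {
			seen[k] = true
			out = append(out, k)
		}
	}
	for k := range b {
		if !seen[k] {
			seen[k] = true
			out = append(out, k)
		}
	}
	return out
}

func main() {
	// During the partition, side {A,B} and side {C,D} both wrote x.
	left := VersionVector{"A": 2}         // x written twice via A
	right := VersionVector{"C": 1}        // x written once via C
	fmt.Println(compare(left, right))     // conflict: histories must be reconciled
}
```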
Learning milestones:
- You can reproduce split-brain and recovery
- You can compare reconciliation strategies
- You can explain CAP implications of each choice
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| Failure Injection Playground | Intermediate | 1-2 weeks | High | High |
| Raft Log Replication Simulator | Advanced | 1 month+ | Very High | High |
| Paxos Message Trace Explorer | Expert | 1 month+ | Very High | Medium |
| Consistency Model Explorer | Advanced | 1-2 weeks | High | High |
| Quorum Read/Write Tuner | Intermediate | Weekend | Medium | Medium |
| Sharding Strategy Simulator | Advanced | 1-2 weeks | High | High |
| Two-Phase Commit Coordinator Simulator | Advanced | 1-2 weeks | High | Medium |
| Saga Orchestrator Simulator | Advanced | 1-2 weeks | High | Medium |
| Replication Lag Observatory | Intermediate | Weekend | Medium | Medium |
| Partition Recovery Playbook | Advanced | 1-2 weeks | High | High |
Recommendation
Start with Failure Injection Playground to internalize failure modes. Then do Consistency Model Explorer and Quorum Read/Write Tuner to build the consistency intuition. After that, choose Raft Log Replication Simulator if you want deep consensus knowledge, or Sharding Strategy Simulator if you want scalability design instincts.
Final Overall Project: Mini Distributed Database Control Plane
- File: SPRINT_2_DISTRIBUTED_SYSTEMS_FUNDAMENTALS_PROJECTS.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems Architecture
- Software or Tool: Control plane simulator
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A unified simulator that combines Raft-based metadata consensus, sharded data placement, replication policies, and recovery from partitions.
Why it teaches distributed systems: It forces you to integrate consensus, replication, sharding, and failure handling into a coherent system.
Core challenges you’ll face:
- Coordinating metadata changes with consensus (maps to safety)
- Rebalancing shards while serving reads (maps to availability)
- Handling failures and reconciliation (maps to consistency models)
Summary
You now have a complete sprint of projects that move from failure modeling to full distributed system design. Each project targets a distinct pillar (consensus, consistency, transactions, replication, sharding, failure handling), and together they build the mental models required to design and defend real-world distributed systems.