Learn Formal Verification (TLA+): From Zero to Model Master

Goal: Deeply understand formal verification using TLA+—learning how to think in states, transitions, and temporal logic to mathematically prove that distributed systems work as intended. You will move beyond “testing” into “proving,” identifying race conditions that would take years to find in production by modeling algorithms like Paxos and Raft from first principles.

Why Formal Verification & TLA+ Matters

In 1996, the Ariane 5 rocket exploded 40 seconds after launch due to a software bug. In 2012, Knight Capital Group lost $440 million in 45 minutes. These weren’t just “bugs”; they were failures of logic in complex systems.

Distributed systems are notoriously hard to test. The number of possible interleavings of events (messages arriving late, nodes failing, clocks drifting) is astronomical. Standard unit testing and integration testing are like shining a flashlight in a dark forest—you see what’s directly in front of you, but you miss the wolf lurking in the shadows.

TLA+ (Temporal Logic of Actions), created by Leslie Lamport (Turing Award winner), allows you to:

Exhaustively explore all states: Find that 1-in-a-billion race condition on your laptop.
Design before coding: Catch fundamental architectural flaws before a single line of C or Rust is written.
Communicate clearly: Provide a mathematical specification that serves as the “source of truth” for how a system should behave.

The Mental Shift: From Code to States

Most developers think in sequences of instructions. TLA+ forces you to think in sets of possible states and legal transitions.

Standard Code Thinking:          TLA+ State Thinking:
1. x = 1                        [ x: 0, y: 0 ] --(Action A)--> [ x: 1, y: 0 ]
2. y = x + 1                         |                              |
3. print y                      (Action B)                     (Action B)
                                     ↓                              ↓
                                [ x: 0, y: 1 ] --(Action A)--> [ x: 1, y: 1 ]

Core Concept Analysis

1. The World as a State Machine

In TLA+, every system is a state machine. A state is a mapping of variables to values. An action is a formula describing the relationship between the current state and the next state (often denoted with a prime, e.g., x' = x + 1).

Initial State (Init):
  x = 0

Next State Relation (Next):
  Increment ≜ x < 5 ∧ x' = x + 1
  Reset     ≜ x = 5 ∧ x' = 0

Spec ≜ Init ∧ □[Next]_x

2. Safety vs. Liveness

This is the most critical distinction in formal verification.

Safety: “Something bad never happens.” (e.g., Two processes are never in the critical section at the same time). If safety is violated, you can point to a specific moment (a trace) where it happened.
Liveness: “Something good eventually happens.” (e.g., A request eventually gets a response). If liveness is violated, you can’t point to a single state; you have to show an infinite loop where the good thing never occurs.

Safety (Bad never happens):      Liveness (Good eventually happens):
   [OK] -> [OK] -> [BAD!]          [WAIT] -> [WAIT] -> [DONE]
          (Finite trace)           (Infinite cycle of [WAIT])

3. Temporal Logic (The “Box” and “Diamond”)

TLA+ uses operators to describe behavior over time:

□P (Always P): P is true in every state of the behavior.
◇P (Eventually P): P is true in at least one state of the behavior.
□◇P (Infinitely Often P): P is true over and over again.
◇□P (Eventually Always P): The system eventually settles into a state where P remains true.

4. Nondeterminism

In TLA+, you don’t say when an action happens. You say what can happen. The model checker (TLC) then tries every possible order of events. This is how it finds race conditions: it tries the weird interleavings you never thought of.

Concept Summary Table

Concept Cluster	What You Need to Internalize
State Spaces	A system is a collection of variables. Every possible combination of values is a potential state.
Actions & Primes	An action is a predicate $P(s, s’)$. It defines if a transition from $s$ to $s’$ is valid.
Invariants	Boolean expressions that must be true in every reachable state. This is how you prove safety.
Temporal Formulas	Logic that describes the history of states. Used to define liveness and complex behaviors.
Refinement	Proving that a complex “low-level” model correctly implements a simpler “high-level” spec.

Deep Dive Reading by Concept

Foundations

Concept	Book & Chapter
The TLA+ Philosophy	“Specifying Systems” by Leslie Lamport — Ch. 1: “A Little Caching”
Logic & Sets	“Specifying Systems” by Leslie Lamport — Ch. 16: “The Logic of TLA”
Thinking in States	“Practical TLA+” by Hillel Wayne — Ch. 1: “An Introduction”

Distributed Systems Patterns

Concept	Book & Chapter
Consensus & Paxos	“Specifying Systems” by Leslie Lamport — Ch. 14: “The Paxos Consensus Algorithm”
Logical Clocks	“Distributed Systems: Principles and Paradigms” by Tanenbaum — Ch. 6.2: “Logical Clocks”
2PC and Commit	“Transaction Processing” by Jim Gray — Ch. 7: “Transaction Models”

Essential Reading Order

The Basics (Week 1):
- Practical TLA+ (Wayne) Ch. 1-3. Focus on PlusCal (a C-like language that translates to TLA+).
- Specifying Systems (Lamport) Ch. 1 & 2. Understand the raw TLA+ syntax.
Safety & Invariants (Week 2):
- Practical TLA+ Ch. 4. Learning to write effective invariants.
- Hillel Wayne’s blog: “How to find a bug with TLA+”.

Project 1: The Die Hard Water Jug (The “Hello World” of Model Checking)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+ / PlusCal
Alternative Languages: None (This is about formal specs)
Coolness Level: Level 2: Practical but Forgettable
Business Potential: 1. The “Resume Gold”
Difficulty: Level 1: Beginner
Knowledge Area: State Space Search / Logic
Software or Tool: TLA+ Toolbox / VS Code TLA+ Extension
Main Book: “Specifying Systems” by Leslie Lamport

What you’ll build: A mathematical model of the famous puzzle from Die Hard with a Vengeance: You have a 3-gallon jug and a 5-gallon jug. How do you get exactly 4 gallons?

Why it teaches TLA+: This project strips away “distributed systems” complexity to focus on the core mechanic: defining initial states, defining legal moves (actions), and letting the model checker find the solution for you by trying to “prove” it’s impossible.

Core challenges you’ll face:

Defining the State → Variables for each jug’s content.
Defining Actions → Fill, Empty, and Pour (with logic to prevent overflow).
The “Invariant” Trick → You don’t tell TLC how to solve it; you tell it that “The big jug never has 4 gallons.” TLC will then try to find a way to break that rule, providing you with the exact steps to solve the puzzle.

Key Concepts:

State Variables: “Specifying Systems” Ch. 2.1
Next-State Relation: “Practical TLA+” Ch. 2
Error Traces: “Specifying Systems” Ch. 3.2

Difficulty: Beginner Time estimate: 2 hours Prerequisites: Basic logic (AND, OR, NOT).

Real World Outcome

You will have a .tla file. When you run the model checker (TLC), it will “fail” and provide you with a Trace. This trace is the step-by-step solution to the puzzle.

Example Output (TLC Trace):

1: <Initial predicate>
   bigJug = 0, smallJug = 0
2: <Action FillSmall>
   bigJug = 0, smallJug = 3
3: <Action PourSmallToBig>
   bigJug = 3, smallJug = 0
4: <Action FillSmall>
   bigJug = 3, smallJug = 3
5: <Action PourSmallToBig>
   bigJug = 5, smallJug = 1
6: <Action EmptyBig>
   bigJug = 0, smallJug = 1
7: <Action PourSmallToBig>
   bigJug = 1, smallJug = 0
8: <Action FillSmall>
   bigJug = 1, smallJug = 3
9: <Action PourSmallToBig>
   bigJug = 4, smallJug = 0  <-- INVARIANT VIOLATED (GOAL REACHED!)

The Core Question You’re Answering

“Can I describe a problem’s rules so precisely that a computer can find the solution for me?”

Before you write code, realize that most bugs happen because we don’t understand the “rules of the game” we are playing.

Project 2: The One-Bit Register (Understanding Atomic Steps)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 3: Genuinely Clever
Business Potential: 1. The “Resume Gold”
Difficulty: Level 1: Beginner
Knowledge Area: Hardware Logic / Atomicity
Software or Tool: TLA+ Toolbox
Main Book: “Specifying Systems” by Leslie Lamport

What you’ll build: A specification of a single-bit memory register that can be written by one process and read by another. You’ll model “Safe,” “Regular,” and “Atomic” registers.

Why it teaches TLA+: It introduces the concept of Interleaving. What happens if a Read occurs at the same time as a Write? TLA+ forces you to break a single “Write(1)” operation into “start write” and “end write” to see the chaos in the middle.

Core challenges you’ll face:

Modeling Concurrency → How to represent two processes running independently.
Safety Properties → Proving that a Read never returns a value that was never written.
Nondeterminism → Allowing a Read during a Write to return either 0 or 1.

Key Concepts:

Interleaving: “Specifying Systems” Ch. 2.4
Safety: “Practical TLA+” Ch. 4
Atomic Registers: Lamport’s “On Interprocess Communication”

Difficulty: Beginner Time estimate: 1 weekend Prerequisites: Project 1 completed.

Real World Outcome

You’ll create a model where you allow “fuzzy” reads. You will discover that a “Safe” register is actually quite dangerous—it can return any value if a read and write overlap. You will prove this by seeing TLC generate a trace where val = 42 even though you only ever wrote 0 or 1.

Project 3: Peterson’s Algorithm (The Race Condition Hunter)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: PlusCal (C-syntax)
Coolness Level: Level 4: Hardcore Tech Flex
Difficulty: Level 2: Intermediate
Knowledge Area: Concurrency / Mutual Exclusion
Software or Tool: TLA+ Toolbox
Main Book: “Practical TLA+” by Hillel Wayne

What you’ll build: A model of a classic mutual exclusion algorithm. Two processes want to enter a “critical section.” You must prove they never enter at the same time and that nobody gets stuck forever.

Why it teaches TLA+: This is your first encounter with Liveness. It’s easy to make a system “safe” by just making it do nothing. Peterson’s algorithm is clever because it guarantees someone eventually gets in.

Core challenges you’ll face:

Translation → Writing in PlusCal and seeing how it translates to the underlying TLA+ state machine.
Mutual Exclusion Invariant → \A p1, p2 \in Processes : p1 /= p2 => ~(pc[p1] = "CS" /\ pc[p2] = "CS").
Fairness → Why you need wf_ (Weak Fairness) to prove that a process won’t just sit on its hands forever.

Key Concepts:

Mutual Exclusion: “Operating Systems: Three Easy Pieces” Ch. 28
PlusCal Labels: “Practical TLA+” Ch. 3 (Labels define what is “atomic”)
Liveness & Fairness: “Specifying Systems” Ch. 8

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Understanding of basic multithreading concepts (locks/semaphores).

Real World Outcome

You will deliberately remove a single line from the algorithm (e.g., the line where a process “gives up” its turn). You will then watch TLC find a Deadlock.

Example Output:

Error: Deadlock reached.
Trace:
1. P1: wants to enter, sets flag[1]=true
2. P2: wants to enter, sets flag[2]=true
3. P1: waits for P2...
4. P2: waits for P1...
Result: Both processes wait forever. System halted.

Project 4: Two-Phase Commit (Distributed Atomicity)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 3: Genuinely Clever
Business Potential: 3. The “Service & Support” Model
Difficulty: Level 3: Advanced
Knowledge Area: Distributed Transactions
Software or Tool: TLC Model Checker
Main Book: “Transaction Processing” by Jim Gray

What you’ll build: A model of a Transaction Coordinator and several Resource Managers. You will model the “Prepare” and “Commit” phases.

Why it teaches TLA+: You’ll learn how to model Unreliable Networks. In TLA+, you don’t use sockets. You use a variable called msgs (a set) and an action that “loses” a message by simply not defining its transition.

Core challenges you’ll face:

Modeling Failures → What happens if the Coordinator crashes after sending Commit to half the nodes?
The Blocking Problem → Proving that 2PC can lead to a state where nodes are stuck waiting forever if the coordinator disappears.
Invariants → Proving that either all nodes commit or none do.

Key Concepts:

Atomic Commit: “Distributed Systems” (Tanenbaum) Ch. 8.5
Failure Models: “Specifying Systems” Ch. 14.3
Network Invariants: “Designing Data-Intensive Applications” Ch. 9

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Project 3, knowledge of distributed databases.

Real World Outcome

You will prove that 2PC is a blocking protocol. You’ll create a liveness property saying “Eventually every node reaches a decision,” and TLC will show you a “Stuttering” trace where the system just stops because the coordinator crashed and the nodes are too “polite” to move on.

Project 5: Basic Paxos (The Holy Grail of Consensus)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 5: Pure Magic
Business Potential: 5. The “Industry Disruptor”
Difficulty: Level 4: Expert
Knowledge Area: Distributed Consensus
Software or Tool: TLA+ Toolbox
Main Book: “Specifying Systems” by Leslie Lamport

What you’ll build: The core Paxos algorithm (Proposers, Acceptors, Learners). You’ll model Phase 1a, 1b, 2a, and 2b.

Why it teaches TLA+: Paxos is notoriously difficult to understand but mathematically elegant. Writing it in TLA+ is often easier than writing it in code because you can ignore the “how” (network libraries) and focus on the “what” (the logic of majorities).

Core challenges you’ll face:

Quorums → Defining a set of majorities and proving their intersection properties.
Ballot Numbers → Managing monotonically increasing sequence numbers.
Safety Proof → Proving that only one value is ever chosen, even if multiple proposers try at once.

Key Concepts:

Paxos Made Simple: Leslie Lamport’s famous paper.
Quorums: “Distributed Systems” Ch. 6.3
Consistency Models: “Designing Data-Intensive Applications” Ch. 9

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Deep understanding of 2PC and quorums.

The Core Question You’re Answering

“How can a group of unreliable nodes agree on a single value without a leader?”

This is the foundation of modern cloud computing. Every time you use S3, DynamoDB, or Spanner, Paxos (or its variants) is running in the background.

Project 6: Vector Clocks & Causality

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: PlusCal
Coolness Level: Level 4: Hardcore Tech Flex
Difficulty: Level 3: Advanced
Knowledge Area: Distributed Systems / Logical Time
Software or Tool: TLA+ Toolbox
Main Book: “Distributed Systems: Concepts and Design” by Coulouris

What you’ll build: A system of nodes passing messages, where each node maintains a vector clock. You will model “happened-before” relationships.

Why it teaches TLA+: It teaches you how to model Metadata. You’ll see how tiny pieces of information attached to messages can reconstruct the “order” of events in a world where physical clocks can’t be trusted.

Core challenges you’ll face:

Partial Ordering → Understanding that some events have no order (concurrency).
Conflict Detection → Modeling two nodes updating the same key and using vector clocks to prove they are in conflict.
Causality Invariant → Proving that if event A caused event B, the clock for A is always “less than” the clock for B.

Key Concepts:

Lamport Timestamps: “Time, Clocks, and the Ordering of Events in a Distributed System” (Lamport, 1978).
Vector Clocks: “Designing Data-Intensive Applications” Ch. 5 (Detecting Concurrent Writes).

Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 3.

Thinking Exercise

The Paradox of Time

Imagine two nodes, A and B.

A writes x=1 (Clock: [A:1, B:0])
B writes x=2 (Clock: [A:0, B:1])
A and B exchange messages.

Questions to ask yourself:

Does “A happened before B”?
How does the system know which value to keep?
What if a third node C sees x=2 before it sees x=1?

Model this in TLA+. If your invariant is “The value of X is always the same on all nodes,” watch TLC destroy your model and show you why you need conflict resolution (LWW or CRDTs).

Project 7: Raft Leader Election

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 5: Pure Magic
Business Potential: 1. The “Resume Gold”
Difficulty: Level 4: Expert
Knowledge Area: Distributed Systems / Consensus
Software or Tool: TLA+ Toolbox
Main Book: “Raft: In Search of an Understandable Consensus Algorithm” (Ongaro & Ousterhout)

What you’ll build: A model of Raft’s election mechanism, including terms, votes, and heartbeat timeouts.

Why it teaches TLA+: Raft was designed to be “understandable,” but its edge cases are still subtle. You’ll learn how to model Timeouts without using a clock. In TLA+, a timeout is just an action that could happen if a certain amount of time has “passed” (abstractly represented).

Core challenges you’ll face:

Leader Uniqueness → Proving that for any given term, there is at most one leader.
Split Brain → Modeling what happens when network partitions occur.
Term Management → Ensuring nodes correctly update their terms when they see a higher one.

Key Concepts:

Election Safety: “Raft Paper” Section 5.2
Liveness (Election Progress): Why randomized timeouts are needed (modeled as nondeterminism in TLA+).

Difficulty: Expert Time estimate: 2 weeks Prerequisites: Project 5 (Paxos).

Real World Outcome

You’ll run a model with 5 nodes. You’ll introduce a network partition where 2 nodes are separated from the other 3. You’ll watch as the 3 nodes elect a leader, while the 2 nodes keep incrementing their terms but fail to elect anyone. Then, you’ll “heal” the partition and see the old leader step down.

Project 8: Raft Log Replication (The Commit Mystery)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 5: Pure Magic
Difficulty: Level 5: Master
Knowledge Area: Replicated State Machines
Software or Tool: TLC Model Checker
Main Book: “Raft: In Search of an Understandable Consensus Algorithm”

What you’ll build: The log replication part of Raft. You’ll model how a leader sends log entries and how followers commit them.

Why it teaches TLA+: This is where the most subtle bugs in Raft live. You will model the Log Matching Property. You’ll use TLA+ to prove that if two logs have an entry with the same index and term, then the logs are identical up to that index.

Core challenges you’ll face:

Commit Index → Understanding the exact condition under which a leader can mark an entry as committed.
Safety during Leader Change → Proving that a new leader never overwrites a committed entry.
Refinement → Proving that this complex log-based system “refines” a simple atomic variable (the “State Machine”).

Key Concepts:

Log Completeness: Raft Paper Section 5.4.1
State Machine Safety: Raft Paper Section 5.4.3

Difficulty: Master Time estimate: 3-4 weeks Prerequisites: Project 7.

Project 9: Distributed Lock Manager (Chubby Lite)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: PlusCal
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 4. The “Open Core” Infrastructure
Difficulty: Level 3: Advanced
Knowledge Area: Distributed Coordination
Software or Tool: TLA+ Toolbox
Main Book: “The Chubby lock service for loosely-coupled distributed systems” (Mike Burrows)

What you’ll build: A central lock server that grants leases to clients. Clients must renew leases or lose the lock.

Why it teaches TLA+: It teaches you to model Session Expiration. You’ll discover the “Fencing Token” problem: what if a client thinks it has the lock, but its lease expired while it was “paused” by GC?

Core challenges you’ll face:

Lease Expiration → Representing the passage of time relative to actions.
Clock Drift → Modeling what happens if the client’s clock is faster than the server’s.
Fencing → Implementing and proving that fencing tokens prevent old clients from corrupting data.

Key Concepts:

Leases: “Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency”.
Fencing Tokens: “Designing Data-Intensive Applications” Ch. 8 (The Pause).

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Project 4 (2PC).

Real World Outcome

You will find a bug where a client C1 acquires a lock, goes into a GC pause, the lock expires, C2 acquires the lock and writes to a database, and then C1 wakes up and also writes to the database. TLC will show you this trace, and you’ll “fix” it by adding a version check (fencing token).

Project 10: Consistent Hashing & Virtual Nodes

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: PlusCal
Coolness Level: Level 3: Genuinely Clever
Business Potential: 4. The “Open Core” Infrastructure
Difficulty: Level 2: Intermediate
Knowledge Area: Distributed Hash Tables (DHT)
Software or Tool: TLA+ Toolbox
Main Book: “Dynamo: Amazon’s Highly Available Key-value Store”

What you’ll build: A model of a ring-based consistent hashing system where nodes can join and leave.

Why it teaches TLA+: It teaches you to model Data Migration. When a node joins, who is responsible for which keys? TLA+ helps you prove that at no point is a key “lost” during the handoff.

Core challenges you’ll face:

Ring Structure → Representing the circular nature of the hash space.
Membership Changes → Proving that as nodes join/leave, the “responsibility” for a key transitions safely.
Virtual Nodes → Modeling load balancing and proving it reduces variance.

Key Concepts:

Consistent Hashing: Karger et al. (1997).
Virtual Nodes: Dynamo paper Section 4.2.

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Basic understanding of hash functions.

Project 11: Gossip Protocol (Information Dissemination)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: PlusCal
Coolness Level: Level 3: Genuinely Clever
Difficulty: Level 3: Advanced
Knowledge Area: P2P Networking
Software or Tool: TLA+ Toolbox
Main Book: “Epidemic Algorithms for Replicated Database Maintenance” (Demers et al.)

What you’ll build: A system where nodes randomly pick neighbors to share “news” (updates). You’ll model anti-entropy and rumor mongering.

Why it teaches TLA+: It teaches you to model Probabilistic behavior using nondeterminism. You’ll prove that while you can’t say when a node gets an update, you can prove that eventually all nodes will have it (liveness).

Core challenges you’ll face:

Eventually Consistent → Proving that if updates stop, all nodes eventually agree.
Network Partitions → Proving that gossip eventually heals once the partition is gone.
State Bloat → Modeling how nodes “forget” old news and proving it doesn’t lead to inconsistencies.

Project 12: Distributed Garbage Collection (Reference Counting)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 4: Hardcore Tech Flex
Difficulty: Level 4: Expert
Knowledge Area: Memory Management / Distributed Systems
Software or Tool: TLC Model Checker
Main Book: “The Garbage Collection Handbook” by Richard Jones

What you’ll build: A model of “Weighted Reference Counting” or “Generation-Based” distributed GC.

Why it teaches TLA+: Distributed GC is one of the hardest things to get right. You’ll find bugs where a “Delete” message arrives before an “Add” message due to network reordering, leading to a “Useless Object” being kept forever or a “Live Object” being deleted.

Project 13: Practical Byzantine Fault Tolerance (PBFT)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 5: Pure Magic
Business Potential: 5. The “Industry Disruptor”
Difficulty: Level 5: Master
Knowledge Area: Cryptography / Distributed Systems
Software or Tool: TLA+ Toolbox
Main Book: “Practical Byzantine Fault Tolerance” (Castro & Liskov, 1999)

What you’ll build: A model of PBFT where nodes can be “evil” (Byzantine).

Why it teaches TLA+: You’ll learn how to model Adversaries. In TLA+, an “evil” node is just an action that can do anything—send fake messages, skip steps, or lie about its state. You’ll prove that the system remains safe as long as less than 1/3 of nodes are evil.

Project 14: Cache Coherency (MESI Protocol)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 4: Hardcore Tech Flex
Difficulty: Level 3: Advanced
Knowledge Area: Computer Architecture
Software or Tool: TLA+ Toolbox
Main Book: “A Primer on Memory Consistency and Cache Coherence” (Sorin et al.)

What you’ll build: A model of the MESI (Modified, Exclusive, Shared, Invalid) protocol used in multi-core CPUs.

Why it teaches TLA+: It bridges the gap between hardware and software. You’ll see how CPUs keep their local caches in sync.

Project 15: TLA+ in Production (The AWS S3 Challenge)

File: LEARN_FORMAL_VERIFICATION_TLAPLUS.md
Main Specification Language: TLA+
Coolness Level: Level 5: Pure Magic
Business Potential: 1. The “Resume Gold”
Difficulty: Level 5: Master
Knowledge Area: Cloud Infrastructure
Software or Tool: TLA+ Toolbox

What you’ll build: A high-level spec of a distributed storage system (like S3) that must handle concurrent PUTs, GETs, and node failures while maintaining “Read-after-Write” consistency.

Why it teaches TLA+: This is a “Real World” application. You’ll apply everything: quorums, versions, failures, and refinement.

Project Comparison Table

Project	Difficulty	Time	Depth of Understanding	Fun Factor
Die Hard Jug	Level 1	2h	★☆☆☆☆	★★★☆☆
Peterson’s Alg	Level 2	1w	★★☆☆☆	★★★★☆
2-Phase Commit	Level 3	2w	★★★☆☆	★★★★☆
Basic Paxos	Level 4	3w	★★★★☆	★★★★★
Raft Election	Level 4	2w	★★★★☆	★★★★★
Raft Replication	Level 5	4w	★★★★★	★★★★★
PBFT	Level 5	1m	★★★★★	★★★★★

Recommendation

Start with Project 1 (Die Hard Jug). It takes 2 hours and will give you the “Eureka!” moment when you see TLC find the solution for you. It builds the confidence needed to tackle the math in later projects.

Final Overall Project: The “Everything-Proof” Key-Value Store

The Challenge: Build a complete TLA+ specification for a distributed, replicated key-value store that:

Uses Raft for consensus.
Supports Consistent Hashing for sharding.
Implements Leases for high-performance reads.
Handles Byzantine node failures (optional stretch goal).

Why this is the final boss: This requires you to compose different specifications. You’ll learn how to treat “Raft” as a black box that provides “Consensus” and build your application logic on top of it.

Summary

This learning path covers Formal Verification through 15 hands-on projects. Here’s the complete list:

#	Project Name	Main Language	Difficulty	Time Estimate
1	Die Hard Jug	TLA+	Beginner	2 Hours
2	One-Bit Register	TLA+	Beginner	Weekend
3	Peterson’s Algorithm	PlusCal	Intermediate	1 Week
4	Two-Phase Commit	TLA+	Advanced	2 Weeks
5	Basic Paxos	TLA+	Expert	3 Weeks
6	Vector Clocks	PlusCal	Advanced	1 Week
7	Raft Election	TLA+	Expert	2 Weeks
8	Raft Replication	TLA+	Master	4 Weeks
9	Distributed Lock Mgr	PlusCal	Advanced	2 Weeks
10	Consistent Hashing	PlusCal	Intermediate	1 Week
11	Gossip Protocol	PlusCal	Advanced	1 Week
12	Dist. Garbage Collection	TLA+	Expert	2 Weeks
13	PBFT	TLA+	Master	1 Month
14	Cache Coherency (MESI)	TLA+	Advanced	1 Week
15	S3 Storage Spec	TLA+	Master	1 Month

Recommended Learning Path

For beginners: Start with projects #1, #2, #3, and #10. For intermediate: Jump to projects #4, #6, #9, and #11. For advanced: Focus on projects #5, #7, #8, #13, and #15.

Expected Outcomes

After completing these projects, you will:

Think in states and transitions rather than just code.
Understand the deep mathematical roots of Consensus.
Be able to find race conditions in your design before you code them.
Speak the language of Safety and Liveness properties.
Use the TLA+ Toolbox to verify your most critical system designs.

You’ll have built 15 working specifications that demonstrate deep understanding of distributed systems from first principles.