Project 12: WAN Netsplit Recovery Drill
Simulate cross-network partitions and implement deterministic state reconciliation.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4 |
| Time Estimate | 24-36 hours |
| Main Programming Language | Erlang or Elixir |
| Alternative Programming Languages | Gleam |
| Coolness Level | Level 5 |
| Business Potential | Level 4 |
| Prerequisites | Distributed Erlang, supervision, state modeling |
| Key Topics | netsplit handling, reconciliation, idempotent recovery |
1. Learning Objectives
By completing this project, you will:
- Model WAN partitions as expected system events.
- Build repeatable partition/heal drills.
- Implement deterministic merge policies for conflicting state.
- Validate that recovery converges after reconnect.
2. All Theory Needed (Per-Concept Breakdown)
Partition Semantics and Reconciliation
Fundamentals A netsplit is a connectivity break between nodes that can leave both sides of a system making progress independently. This creates conflicting state once communication resumes. Recovery is not just reconnecting transport; it is restoring application-level consistency. The core requirement is an explicit reconciliation policy.
Deep Dive into the concept In distributed BEAM systems, a partition appears as node disconnections and monitor events. Local workloads often continue, which is good for availability but dangerous for consistency when mutable state diverges. Designing for partitions starts with classification: which state can tolerate eventual consistency and which state needs stronger guarantees.
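On the BEAM, a process that subscribes with `:net_kernel.monitor_nodes/1` receives `{:nodeup, node}` and `{:nodedown, node}` messages. The sketch below shows one possible way to turn those messages into a "which peers are unreachable" view. `PartitionWatcher` and `down_nodes/1` are hypothetical names chosen for illustration, not part of OTP or of this project's required design.

```elixir
defmodule PartitionWatcher do
  @moduledoc """
  Tracks which peer nodes are currently seen as down.
  Hypothetical module; only :net_kernel.monitor_nodes/1 is a real OTP call.
  """
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @doc "Returns the sorted list of nodes currently seen as down."
  def down_nodes(pid), do: GenServer.call(pid, :down_nodes)

  @impl true
  def init(:ok) do
    # Subscribe this process to node up/down messages.
    # The return value is ignored in this sketch.
    _ = :net_kernel.monitor_nodes(true)
    {:ok, MapSet.new()}
  end

  @impl true
  def handle_info({:nodedown, node}, down), do: {:noreply, MapSet.put(down, node)}
  def handle_info({:nodeup, node}, down), do: {:noreply, MapSet.delete(down, node)}
  def handle_info(_other, down), do: {:noreply, down}

  @impl true
  def handle_call(:down_nodes, _from, down) do
    {:reply, down |> MapSet.to_list() |> Enum.sort(), down}
  end
end
```

A node-down message only says that transport to a peer was lost. Deciding which state is allowed to keep changing while the peer is away remains an application-level choice.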
A practical pattern is to mark every mutable record with version or causality metadata. During partition, each side applies updates locally and records decision context. On heal, reconciliation workers compare divergent records, apply merge policy, and emit audit logs. These workers should be supervised and restart-safe because reconciliation itself can fail.
Idempotency is the anchor property. A good merge pipeline can replay the same reconciliation inputs multiple times and still reach the same final state. Without idempotency, retries create oscillation. Another key property is monotonic progress: each merge step should advance a version frontier rather than bouncing between states.
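To make the idempotency and monotonic-progress properties concrete, here is a minimal sketch of a replay-safe merge. The record shape (`:version`, `:site`, `:value`) and the module name `IdempotentMerge` are assumptions for illustration. The sketch also assumes a site never reuses a version number for two different values.

```elixir
defmodule IdempotentMerge do
  @moduledoc "Sketch of a merge that is safe to replay. The record shape is an assumption."

  # A record is a map with :version (integer), :site (atom) and :value (any term).
  # Highest version wins; ties break on the site name so every node picks the same copy.
  def merge(a, b) do
    if {a.version, a.site} >= {b.version, b.site}, do: a, else: b
  end

  # Fold any number of copies, in any order, into a single winner.
  def reconcile([first | rest]), do: Enum.reduce(rest, first, &merge/2)
end

a = %{version: 7, site: :us_east, value: "online"}
b = %{version: 6, site: :eu_west, value: "offline"}

winner = IdempotentMerge.reconcile([a, b])
# Replaying the same inputs, duplicated and reordered, yields the same winner.
replayed = IdempotentMerge.reconcile([b, a, b, a])
true = winner == replayed
```

Because the winner is chosen by comparing `{version, site}` pairs, the merged version is never lower than any input version, which is one simple way to keep the version frontier from moving backwards.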
Operationally, you need deterministic drills. Random partitions are useful for chaos testing, but deterministic scripts are mandatory for learning and regression. Define partition scenarios by site pairs, duration, and expected post-heal outcomes. Keep a golden dataset with known conflicts so you can verify convergence.
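A deterministic scenario can be captured as plain data. The following is a sketch only: the field names (`:id`, `:sites`, `:duration_ms`, `:writes`, `:expected`) and the `DrillScenario` module are hypothetical, picked to mirror the "site pairs, duration, and expected post-heal outcomes" described above.

```elixir
defmodule DrillScenario do
  @moduledoc "Hypothetical scenario manifest validator; field names are illustrative."

  @required [:id, :sites, :duration_ms, :writes, :expected]

  def validate(scenario) when is_map(scenario) do
    missing = Enum.reject(@required, &Map.has_key?(scenario, &1))

    cond do
      missing != [] ->
        {:error, {:missing_fields, missing}}

      not match?({_, _}, scenario.sites) ->
        {:error, :sites_must_be_a_pair}

      not (is_integer(scenario.duration_ms) and scenario.duration_ms > 0) ->
        {:error, :bad_duration}

      true ->
        {:ok, scenario}
    end
  end
end

# One golden scenario with a known conflict and a known expected outcome.
scenario = %{
  id: "split-001",
  sites: {"us-east", "eu-west"},
  duration_ms: 30_000,
  writes: [
    {"us-east", "user42", %{status: "online"}},
    {"eu-west", "user42", %{status: "offline"}}
  ],
  expected: %{"user42" => %{status: "online"}}
}

{:ok, _valid} = DrillScenario.validate(scenario)
```

Keeping the expected outcome inside the manifest lets a regression run compare the post-heal state against it without any human judgment.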
Failure handling should separate concerns: transport recovery, state replay, conflict merge, and client notification. If these are tangled, debugging becomes slow. Keep each stage observable with explicit counters (records compared, merged, rejected, retried).
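The counters named above can start as a small pure data structure before being wired into any metrics system. This is a minimal sketch; `ReconcileStats` and its functions are hypothetical names.

```elixir
defmodule ReconcileStats do
  @moduledoc "Hypothetical per-run counters for the reconciliation stage."

  @stages [:compared, :merged, :rejected, :retried]

  def new, do: Map.new(@stages, &{&1, 0})

  # Unknown counter names fail the guard, so typos surface immediately.
  def bump(stats, stage, n \\ 1) when stage in @stages do
    Map.update!(stats, stage, &(&1 + n))
  end

  def format(stats), do: Enum.map_join(@stages, " ", &"#{&1}=#{stats[&1]}")
end

stats =
  ReconcileStats.new()
  |> ReconcileStats.bump(:compared, 14)
  |> ReconcileStats.bump(:merged, 14)

IO.puts(ReconcileStats.format(stats))
```

Rejecting unknown counter names at the function boundary keeps the counter set explicit, which helps when each stage is debugged in isolation.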
How this fits into the projects This project builds directly on the Project 11 topology and prepares you for the traffic resilience work in Project 13.
Definitions & key terms
- Netsplit: Connectivity break that divides a cluster into partitions.
- Convergence: All partitions reach the same final state.
- Idempotent merge: Applying the merge logic multiple times yields the same result.
- Reconciliation audit: Record of conflict decisions and their outcomes.
Mental model diagram
    [Partition A writes]         [Partition B writes]
             |                            |
             +----------- heal -----------+
                          |
                  [Reconcile Worker]
                          |
                  [Converged State]
How it works (step-by-step, with invariants and failure modes)
- Trigger partition between sites.
- Apply controlled conflicting writes.
- Heal transport.
- Run reconciliation workflow.
- Invariant: replaying reconciliation does not change final result.
- Failure modes: non-idempotent merge, missing causality metadata, replay gaps.
Minimal concrete example
record user42:
us-east version=7 status=online
eu-west version=6 status=offline
merge_policy: choose highest version, keep audit trail
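The example above can be sketched in Elixir as follows. The map shape, the `VersionMerge` module, and the audit entry fields are assumptions for illustration. The tie-break on site name is an added assumption, since the example itself does not say what happens when versions are equal.

```elixir
defmodule VersionMerge do
  @moduledoc "Hypothetical highest-version merge that also returns an audit entry."

  # Returns {winner, audit_entry} for one key and its per-site copies.
  def merge(key, copies) do
    # Highest version wins; equal versions fall back to the site name.
    winner = Enum.max_by(copies, fn c -> {c.version, c.site} end)

    discarded =
      copies
      |> Enum.reject(&(&1 == winner))
      |> Enum.map(&{&1.site, &1.version})
      |> Enum.sort()

    audit = %{
      key: key,
      policy: :highest_version,
      winner: {winner.site, winner.version},
      discarded: discarded
    }

    {winner, audit}
  end
end

copies = [
  %{site: "us-east", version: 7, status: "online"},
  %{site: "eu-west", version: 6, status: "offline"}
]

{winner, audit} = VersionMerge.merge("user42", copies)
# winner is the us-east copy; audit.discarded lists the eu-west copy at version 6.
```

Returning the audit entry alongside the winner keeps the decision and its justification together, so the caller cannot persist one without the other.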
Common misconceptions
- “Reconnect means solved.” Transport heal is only step one.
- “Last write wins is always enough.” It can lose critical business intent.
Check-your-understanding questions
- Why is idempotency critical in reconciliation?
- What signals indicate partition healing is complete?
- How do you verify true convergence across sites?
Check-your-understanding answers
- Retries are unavoidable; non-idempotent merges corrupt state.
- Transport is restored and reconciliation queues are drained.
- All replicas report identical state and version frontier.
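The last answer can be checked mechanically. The sketch below compares per-site snapshots key by key and reports which keys still differ. It assumes each site can export its state as a map of `key => record`; `ConvergenceCheck` is a hypothetical name.

```elixir
defmodule ConvergenceCheck do
  @moduledoc "Hypothetical record-parity check across site snapshots."

  # snapshots: %{site_name => %{key => record}}
  def check(snapshots) when map_size(snapshots) > 0 do
    states = Map.values(snapshots)
    keys = states |> Enum.flat_map(&Map.keys/1) |> Enum.uniq() |> Enum.sort()

    # A key has diverged if the sites do not all hold the same record for it.
    # A record missing on one site counts as a difference.
    diverged =
      Enum.filter(keys, fn key ->
        states |> Enum.map(&Map.get(&1, key)) |> Enum.uniq() |> length() > 1
      end)

    if diverged == [], do: :converged, else: {:diverged, diverged}
  end
end

east = %{"user42" => %{version: 7, status: "online"}}
west = %{"user42" => %{version: 6, status: "offline"}}

{:diverged, ["user42"]} = ConvergenceCheck.check(%{"us-east" => east, "eu-west" => west})
:converged = ConvergenceCheck.check(%{"us-east" => east, "eu-west" => east})
```

Since records carry their version, identical records imply an identical version frontier. Queue drain still has to be confirmed separately before trusting a `:converged` result.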
Real-world applications
- Multi-region session systems
- Presence and routing state recovery
- Operational failover workflows
Where you’ll apply it
- Core implementation in Project 12
- Follow-on resilience in Project 13
References
- Distributed Erlang documentation
- OTP supervision principles
- Release and recovery operational guides
Key insights Partition tolerance requires explicit policy, not optimistic assumptions.
Summary Netsplit recovery is a design problem spanning transport, state, and operations. Deterministic drills and idempotent reconciliation are the foundation.
Homework/Exercises to practice the concept
- Define two merge policies for the same dataset and compare outcomes.
- Build a checklist for verifying post-heal convergence.
Solutions to the homework/exercises
- Contrast strict-authoritative-site vs version-based merge.
- Validate queue drain, record parity, and deterministic replay.
3. Project Specification
3.1 What You Will Build
A drill toolkit that can isolate sites, trigger controlled conflicts, heal links, and verify converged state.
3.2 Functional Requirements
- Partition Control: Start and stop site-level partition scenarios.
- Conflict Injection: Apply deterministic conflicting updates.
- Reconciliation Engine: Merge divergent state and emit audit records.
3.3 Non-Functional Requirements
- Performance: Reconciliation completes within a defined SLO for the test dataset.
- Reliability: Repeated drills with the same scenario produce equivalent outcomes.
- Usability: Operators can reproduce a scenario from documented commands.
3.4 Example Usage / Output
$ netsplitctl isolate us-east eu-west
$ netsplitctl heal us-east eu-west
3.5 Data Formats / Schemas / Protocols
- Conflict record shape with version metadata
- Reconciliation audit event schema
- Drill scenario manifest format
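As a starting point for the first two formats, the structs below show one possible shape. Every field name here is an assumption rather than a prescribed schema. `@enforce_keys` is used so that a record without version metadata cannot be built by accident.

```elixir
defmodule ConflictRecord do
  @moduledoc "Hypothetical shape for one site's copy of a conflicting record."
  @enforce_keys [:key, :site, :version, :value]
  defstruct [:key, :site, :version, :value]
end

defmodule AuditEvent do
  @moduledoc "Hypothetical shape for one reconciliation decision."
  @enforce_keys [:scenario_id, :key, :policy, :winner]
  defstruct [:scenario_id, :key, :policy, :winner, losers: []]

  # Builds an audit event from the winning record and the copies it replaced.
  def new(scenario_id, policy, %ConflictRecord{} = winner, losers) do
    %__MODULE__{
      scenario_id: scenario_id,
      key: winner.key,
      policy: policy,
      winner: {winner.site, winner.version},
      losers: Enum.map(losers, &{&1.site, &1.version})
    }
  end
end
```

Storing only `{site, version}` pairs for the losing copies keeps audit events small. Whether the discarded values themselves must be retained depends on your recovery requirements.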
3.6 Edge Cases
- Reconnect during ongoing writes
- Repeated heal/isolate flapping
- Partial audit log loss
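For the flapping case, one option is to hold reconciliation back until the link has stayed up for a quiet period. The sketch below is a pure function over a link event history. `HealGate`, the event shape, and the quiet-period parameter are all hypothetical.

```elixir
defmodule HealGate do
  @moduledoc "Hypothetical gate that delays reconciliation until the link stops flapping."

  # events: chronological list of {timestamp_ms, :up | :down}.
  # Returns {:up, ts} with the start of the current unbroken "up" run, or :down.
  def up_since(events) do
    events
    |> Enum.reverse()
    |> Enum.take_while(fn {_ts, state} -> state == :up end)
    |> List.last()
    |> case do
      nil -> :down
      {ts, :up} -> {:up, ts}
    end
  end

  # True only when the link has been continuously up for at least quiet_ms.
  def ready?(events, now_ms, quiet_ms) do
    case up_since(events) do
      {:up, ts} -> now_ms - ts >= quiet_ms
      :down -> false
    end
  end
end

events = [{0, :up}, {1_000, :down}, {1_500, :up}, {1_600, :down}, {2_000, :up}]

false = HealGate.ready?(events, 4_000, 5_000)
true = HealGate.ready?(events, 7_000, 5_000)
```

A gate like this trades recovery latency for stability. The quiet period is a tuning decision that a drill can exercise with scripted flapping.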
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
- Start two site clusters from Project 11.
- Run `netsplitctl isolate` and apply scripted writes.
- Run `netsplitctl heal` and verify convergence.
3.7.2 Golden Path Demo (Deterministic)
One user record diverges during partition, then converges with documented merge rule.
3.7.3 If CLI: exact terminal transcript
$ netsplitctl isolate us-east eu-west
partition active
$ reconcilectl status
conflicts=14 merged=14 pending=0
4. Solution Architecture
4.1 High-Level Design
[Drill Controller] -> [Partition Injector] -> [Reconcile Workers] -> [Audit Store]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Drill Controller | Orchestrate scenarios | Deterministic scenario IDs |
| Reconcile Worker | Merge conflicts | Idempotent merge contracts |
| Audit Store | Persist decisions | Queryability and replay support |
4.3 Data Structures (No Full Code)
- Conflict queue entries
- Versioned state records
- Reconciliation audit entries
4.4 Algorithm Overview
Key Algorithm: Conflict Reconciliation
- Load divergent records.
- Apply merge policy.
- Persist decision and new version.
- Re-check until no conflicts remain.
Complexity Analysis:
- Time: O(C) per reconciliation pass, where C is the number of conflicting records
- Space: O(C)
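The four steps above can be sketched as a small in-memory loop. This is an illustration under stated assumptions, not a full implementation: sites are modeled as nested maps, the merge policy is "highest version, then site name", the winning record keeps its existing version rather than minting a new one, and `ReconcileLoop` with its `max_passes` bound is a hypothetical design.

```elixir
defmodule ReconcileLoop do
  @moduledoc """
  Sketch of the reconcile loop over in-memory site states.
  sites: %{site_name => %{key => %{version: integer, value: term}}}
  """

  def run(sites, max_passes \\ 3), do: loop(sites, [], 0, max_passes)

  # Step 4: re-check until no conflicts remain, with a bound on passes.
  defp loop(sites, audit, pass, max_passes) do
    case conflicting_keys(sites) do
      [] ->
        {:ok, sites, Enum.reverse(audit)}

      _keys when pass >= max_passes ->
        {:error, :not_converging}

      keys ->
        {sites, audit} = Enum.reduce(keys, {sites, audit}, &resolve/2)
        loop(sites, audit, pass + 1, max_passes)
    end
  end

  # Step 1: load divergent records (keys whose copies differ between sites).
  defp conflicting_keys(sites) do
    states = Map.values(sites)

    states
    |> Enum.flat_map(&Map.keys/1)
    |> Enum.uniq()
    |> Enum.sort()
    |> Enum.filter(fn key ->
      states |> Enum.map(&Map.get(&1, key)) |> Enum.uniq() |> length() > 1
    end)
  end

  # Steps 2 and 3: apply the merge policy, then record the decision.
  defp resolve(key, {sites, audit}) do
    {win_site, win_rec} =
      sites
      |> Enum.flat_map(fn {site, state} ->
        case Map.fetch(state, key) do
          {:ok, rec} -> [{site, rec}]
          :error -> []
        end
      end)
      |> Enum.max_by(fn {site, rec} -> {rec.version, site} end)

    # Every site receives the winning copy of this key.
    sites = Map.new(sites, fn {site, state} -> {site, Map.put(state, key, win_rec)} end)
    {sites, [%{key: key, winner: win_site, version: win_rec.version} | audit]}
  end
end
```

Running `ReconcileLoop.run/1` a second time on its own output finds no conflicts and returns an empty audit list, which is the replay invariant from the theory section expressed as a testable property.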
5. Implementation Guide
5.1 Development Environment Setup
# use standard cluster test environment from P11
5.2 Project Structure
project-root/
├── lib/
│ ├── drill_controller.ex
│ ├── partition_injector.ex
│ ├── reconcile_worker.ex
│ └── audit_store.ex
├── test/
└── README.md
5.3 The Core Question You’re Answering
“Can my distributed service prove that recovery is correct after a real partition, rather than merely reconnecting transport?”
5.4 Concepts You Must Understand First
- Node monitor semantics
- Idempotent merge design
- Supervisor restart behavior under flapping conditions
5.5 Questions to Guide Your Design
- What merge policy is explicit and explainable?
- Which recovery steps can safely retry?
- How will you prove convergence objectively?