Project 12: WAN Netsplit Recovery Drill

Simulate cross-network partitions and implement deterministic state reconciliation.

Quick Reference

Attribute	Value
Difficulty	Level 4
Time Estimate	24-36 hours
Main Programming Language	Erlang or Elixir
Alternative Programming Languages	Gleam
Coolness Level	Level 5
Business Potential	Level 4
Prerequisites	Distributed Erlang, supervision, state modeling
Key Topics	netsplit handling, reconciliation, idempotent recovery

1. Learning Objectives

By completing this project, you will:

Model WAN partitions as expected system events.
Build repeatable partition/heal drills.
Implement deterministic merge policies for conflicting state.
Validate that recovery converges after reconnect.

2. All Theory Needed (Per-Concept Breakdown)

Partition Semantics and Reconciliation

Fundamentals

A netsplit is a connectivity break between nodes that can leave both sides of a system making progress independently. This creates conflicting state once communication resumes. Recovery is not just reconnecting transport; it is restoring application-level consistency. The core requirement is an explicit reconciliation policy.

Deep dive into the concept

In distributed BEAM systems, a partition appears as node disconnections and monitor events. Local workloads often continue, which is good for availability but dangerous for consistency when mutable state diverges. Designing for partitions starts with classification: which state can tolerate eventual consistency and which state needs stronger guarantees.
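One way to turn node events into an explicit partition view is sketched below. It assumes a process has subscribed with `:net_kernel.monitor_nodes(true)`, which delivers `{:nodeup, node}` and `{:nodedown, node}` messages. The `PartitionTracker` module and the node names are hypothetical, and events are applied as plain data so the logic can be tested without a live cluster.

```elixir
defmodule PartitionTracker do
  @moduledoc """
  Hypothetical sketch: tracks which expected peers are currently unreachable.
  In a real node, events would arrive as messages after calling
  :net_kernel.monitor_nodes(true); here they are applied as plain data.
  """

  # State: the peers we expect to see, and the subset currently down.
  def new(expected_peers) do
    %{expected: MapSet.new(expected_peers), down: MapSet.new()}
  end

  # A peer we care about dropped: record it as partitioned.
  def apply_event(state, {:nodedown, node}) do
    if MapSet.member?(state.expected, node) do
      %{state | down: MapSet.put(state.down, node)}
    else
      state
    end
  end

  # A peer came back: transport is healed, reconciliation is still pending.
  def apply_event(state, {:nodeup, node}) do
    %{state | down: MapSet.delete(state.down, node)}
  end

  def partitioned?(state), do: MapSet.size(state.down) > 0
end

state =
  PartitionTracker.new([:"app@us-east", :"app@eu-west"])
  |> PartitionTracker.apply_event({:nodedown, :"app@eu-west"})

IO.inspect(PartitionTracker.partitioned?(state))
```

Keeping the state transition pure means the same function can sit behind a GenServer `handle_info/2` callback in the real system.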

A practical pattern is to mark every mutable record with version or causality metadata. During partition, each side applies updates locally and records decision context. On heal, reconciliation workers compare divergent records, apply merge policy, and emit audit logs. These workers should be supervised and restart-safe because reconciliation itself can fail.
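A vector clock is one common form of causality metadata. The sketch below assumes each record carries a map of site name to update counter; the `VClock` module is hypothetical and is only meant to show how a reconciliation worker could tell an ordinary stale copy from a true conflict.

```elixir
defmodule VClock do
  @moduledoc "Hypothetical vector clock: a map of site => update counter."

  # Record one local update at `site`.
  def increment(clock, site), do: Map.update(clock, site, 1, &(&1 + 1))

  # Classify how clock `a` relates to clock `b`.
  def compare(a, b) do
    sites = Enum.uniq(Map.keys(a) ++ Map.keys(b))
    a_le_b = Enum.all?(sites, fn s -> Map.get(a, s, 0) <= Map.get(b, s, 0) end)
    b_le_a = Enum.all?(sites, fn s -> Map.get(b, s, 0) <= Map.get(a, s, 0) end)

    cond do
      a_le_b and b_le_a -> :equal
      a_le_b -> :before
      b_le_a -> :after
      true -> :concurrent
    end
  end
end

base = %{"us-east" => 3, "eu-west" => 2}
# Each side applies one local update during the partition.
east = VClock.increment(base, "us-east")
west = VClock.increment(base, "eu-west")

IO.inspect(VClock.compare(base, east))
IO.inspect(VClock.compare(east, west))
```

Only the `:concurrent` case needs a merge policy; `:before` and `:after` can be resolved by taking the newer copy.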

Idempotency is the anchor property. A good merge pipeline can replay the same reconciliation inputs multiple times and still reach the same final state. Without idempotency, retries create oscillation. Another key property is monotonic progress: each merge step should advance a version frontier rather than bouncing between states.
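The two properties can be combined in a single apply step, sketched below under the assumption that every merge decision carries an integer version. The `FrontierStore` module and its record shape are hypothetical. A decision is written only when it advances the stored version, so replaying the same decisions leaves the store unchanged.

```elixir
defmodule FrontierStore do
  @moduledoc """
  Hypothetical sketch of an idempotent, monotonic apply step.
  A decision is only written when it advances the stored version.
  """

  # store: %{key => %{version: integer, value: term}}
  def apply_decision(store, %{key: key, version: version, value: value}) do
    case Map.get(store, key) do
      %{version: current} when current >= version ->
        # Already at or past this version: replay is a no-op.
        store

      _ ->
        Map.put(store, key, %{version: version, value: value})
    end
  end

  def apply_all(store, decisions) do
    Enum.reduce(decisions, store, &apply_decision(&2, &1))
  end
end

decisions = [
  %{key: "user42", version: 8, value: "online"},
  %{key: "user7", version: 3, value: "offline"}
]

once = FrontierStore.apply_all(%{}, decisions)
twice = FrontierStore.apply_all(once, decisions)

# Replaying the same inputs must not change the result.
IO.inspect(once == twice)
```

Note that this sketch treats two decisions with the same version as the same decision; a real pipeline would need a deterministic tie-break if that assumption does not hold.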

Operationally, you need deterministic drills. Random partitions are useful for chaos testing, but deterministic scripts are mandatory for learning and regression. Define partition scenarios by site pairs, duration, and expected post-heal outcomes. Keep a golden dataset with known conflicts so you can verify convergence.
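A deterministic scenario can be written down as plain data and validated before the drill starts. The manifest fields below (`id`, `sites`, `duration_ms`, `writes`, `expected`) and the `DrillScenario` module are assumptions for illustration, not a format defined by this project.

```elixir
defmodule DrillScenario do
  @moduledoc "Hypothetical drill manifest: plain data plus a validation step."

  @required [:id, :sites, :duration_ms, :writes, :expected]

  # Returns :ok or {:error, reasons} so a bad manifest fails before the drill starts.
  def validate(scenario) do
    missing = Enum.reject(@required, &Map.has_key?(scenario, &1))

    errors =
      Enum.map(missing, &{:missing, &1}) ++
        site_errors(scenario) ++ duration_errors(scenario)

    if errors == [], do: :ok, else: {:error, errors}
  end

  defp site_errors(%{sites: {a, b}}) when a != b, do: []
  defp site_errors(%{sites: _}), do: [:sites_must_be_a_pair_of_distinct_sites]
  defp site_errors(_), do: []

  defp duration_errors(%{duration_ms: d}) when is_integer(d) and d > 0, do: []
  defp duration_errors(%{duration_ms: _}), do: [:duration_must_be_positive]
  defp duration_errors(_), do: []
end

scenario = %{
  id: "drill-001",
  sites: {"us-east", "eu-west"},
  duration_ms: 30_000,
  # Scripted conflicting writes, applied on each side of the split.
  writes: [
    {"us-east", "user42", %{status: "online"}},
    {"eu-west", "user42", %{status: "offline"}}
  ],
  # Golden post-heal outcome used to verify convergence.
  expected: %{"user42" => %{status: "online"}}
}

IO.inspect(DrillScenario.validate(scenario))
```

Because the manifest is data, the same scenario can be checked into the repository and replayed as a regression test.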

Failure handling should separate concerns: transport recovery, state replay, conflict merge, and client notification. If these are tangled, debugging becomes slow. Keep each stage observable with explicit counters (records compared, merged, rejected, retried).
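The counters named above could be kept with Erlang's `:counters` module, as in the sketch below. The `ReconcileStats` wrapper and the fixed counter order are hypothetical choices; only `:counters.new/2`, `:counters.add/3`, and `:counters.get/2` come from OTP.

```elixir
defmodule ReconcileStats do
  @moduledoc "Hypothetical per-stage counters backed by Erlang's :counters module."

  @names [:compared, :merged, :rejected, :retried]

  def new, do: :counters.new(length(@names), [])

  def bump(ref, name, by \\ 1), do: :counters.add(ref, index(name), by)

  # Snapshot as a map, e.g. for a status command or a log line.
  def snapshot(ref) do
    Map.new(@names, fn name -> {name, :counters.get(ref, index(name))} end)
  end

  # :counters indexes are 1-based.
  defp index(name), do: Enum.find_index(@names, &(&1 == name)) + 1
end

stats = ReconcileStats.new()
ReconcileStats.bump(stats, :compared, 14)
ReconcileStats.bump(stats, :merged, 14)

IO.inspect(ReconcileStats.snapshot(stats))
```

A counter reference can be shared between processes, so each pipeline stage can bump its own counter without going through a central process.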

How this fits into the projects

This project builds directly on the Project 11 topology and prepares you for Project 13 traffic resilience.

Definitions & key terms

Netsplit: Connectivity break that divides a cluster into isolated partitions.
Convergence: All partitions reach the same final state.
Idempotent merge: Applying the merge logic multiple times yields the same result as applying it once.
Reconciliation audit: Record of conflict decisions and their outcomes.

Mental model diagram

[Partition A writes]       [Partition B writes]
         |                           |
         +----------- heal ----------+
                     |
              [Reconcile Worker]
                     |
               [Converged State]

How it works (step-by-step, with invariants and failure modes)

Trigger partition between sites.
Apply controlled conflicting writes.
Heal transport.
Run reconciliation workflow.
Invariant: replaying reconciliation does not change final result.
Failure modes: non-idempotent merge, missing causality metadata, replay gaps.

Minimal concrete example

record user42:
  us-east version=7 status=online
  eu-west version=6 status=offline
merge_policy: choose highest version, keep audit trail
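The policy in this example can be sketched as a small function that returns both the winning record and an audit entry. The `VersionMerge` module, the audit fields, and the tie-break on site name are assumptions added for illustration.

```elixir
defmodule VersionMerge do
  @moduledoc """
  Hypothetical highest-version merge that returns the winner plus an audit entry.
  Ties are broken by site name so both sides pick the same winner.
  """

  def merge(key, a, b) do
    # Sorting on {version, site} makes the choice independent of argument order.
    [loser, winner] = Enum.sort_by([a, b], &{&1.version, &1.site})

    audit = %{
      key: key,
      policy: :highest_version,
      kept: {winner.site, winner.version},
      discarded: {loser.site, loser.version}
    }

    {winner, audit}
  end
end

east = %{site: "us-east", version: 7, status: "online"}
west = %{site: "eu-west", version: 6, status: "offline"}

{winner, audit} = VersionMerge.merge("user42", east, west)
IO.inspect(winner.status)
IO.inspect(audit)
```

Recording the discarded side in the audit entry is what lets an operator later explain why `status=offline` disappeared.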

Common misconceptions

“Reconnect means solved.” Transport heal is only step one.
“Last write wins is always enough.” It can lose critical business intent.
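The second misconception is easier to see with data. In the sketch below, each site changes a different field during the split. The `MergeCompare` module, the `route` field, and the integer logical timestamps are hypothetical; whole-record last-write-wins drops one site's change, while a per-field merge keeps both.

```elixir
defmodule MergeCompare do
  @moduledoc "Hypothetical comparison of whole-record LWW and per-field merge."

  # Whole-record last-write-wins: the later record replaces the other entirely.
  def lww(a, b), do: Enum.max_by([a, b], & &1.updated_at)

  # Per-field merge: each field carries its own {value, updated_at} pair,
  # and the later write wins field by field (ties fall back to term order).
  def fieldwise(a, b) do
    Map.merge(a, b, fn _field, {_, ta} = va, {_, tb} = vb ->
      if {tb, vb} > {ta, va}, do: vb, else: va
    end)
  end
end

# us-east changed status at t=105; eu-west changed route at t=110.
east = %{updated_at: 105, status: "online", route: "gw-1"}
west = %{updated_at: 110, status: "offline", route: "gw-2"}

# LWW keeps the eu-west record whole, so the us-east status change is lost.
IO.inspect(MergeCompare.lww(east, west))

east_fields = %{status: {"online", 105}, route: {"gw-1", 90}}
west_fields = %{status: {"offline", 90}, route: {"gw-2", 110}}

# The per-field merge keeps the newest value of each field.
IO.inspect(MergeCompare.fieldwise(east_fields, west_fields))
```

Per-field metadata costs more storage, so it is a trade-off to weigh per record type rather than a default.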

Check-your-understanding questions

Why is idempotency critical in reconciliation?
What signals indicate partition healing is complete?
How do you verify true convergence across sites?

Check-your-understanding answers

Retries are unavoidable, so a merge that is not idempotent can apply the same change twice and corrupt state.
Transport is restored and the reconciliation queues are drained.
All replicas report identical state and the same version frontier.

Real-world applications

Multi-region session systems
Presence and routing state recovery
Operational failover workflows

Where you’ll apply it

Core implementation in Project 12
Follow-on resilience in Project 13

References

Distributed Erlang documentation
OTP supervision principles
Release and recovery operational guides

Key insights

Partition tolerance requires explicit policy, not optimistic assumptions.

Summary

Netsplit recovery is a design problem spanning transport, state, and operations. Deterministic drills and idempotent reconciliation are the foundation.

Homework/Exercises to practice the concept

Define two merge policies for the same dataset and compare outcomes.
Build a checklist for verifying post-heal convergence.

Solutions to the homework/exercises

Contrast strict-authoritative-site vs version-based merge.
Validate queue drain, record parity, and deterministic replay.
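The checklist from the second solution can be expressed as a function that returns the names of the failed checks. This is a sketch under assumptions: the `ConvergenceCheck` module and its input shape are hypothetical, and `replayed` stands for the replica state produced by running reconciliation a second time.

```elixir
defmodule ConvergenceCheck do
  @moduledoc "Hypothetical post-heal checklist; returns the list of failed checks."

  # replicas: %{site => %{key => record}}
  # pending:  number of conflicts still queued
  # replayed: replica state after running reconciliation a second time
  def run(%{replicas: replicas, pending: pending, replayed: replayed}) do
    [
      {:queue_drained, pending == 0},
      {:record_parity, parity?(replicas)},
      {:deterministic_replay, replayed == replicas}
    ]
    |> Enum.reject(fn {_name, ok?} -> ok? end)
    |> Enum.map(fn {name, _ok?} -> name end)
  end

  # All sites hold the same records when only one distinct state remains.
  defp parity?(replicas) do
    distinct = replicas |> Map.values() |> Enum.uniq()
    length(distinct) <= 1
  end
end

state = %{"user42" => %{version: 8, status: "online"}}
replicas = %{"us-east" => state, "eu-west" => state}

IO.inspect(ConvergenceCheck.run(%{replicas: replicas, pending: 0, replayed: replicas}))
```

An empty list means every check passed; anything else names the stage to investigate first.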

3. Project Specification

3.1 What You Will Build

A drill toolkit that can isolate sites, trigger controlled conflicts, heal links, and verify converged state.

3.2 Functional Requirements

Partition Control: Start and stop site-level partition scenarios.
Conflict Injection: Apply deterministic conflicting updates.
Reconciliation Engine: Merge divergent state and emit audit records.

3.3 Non-Functional Requirements

Performance: Reconciliation completes within defined SLO for test dataset.
Reliability: Repeated drills produce equivalent outcomes.
Usability: Operators can reproduce scenario from documented commands.

3.4 Example Usage / Output

$ netsplitctl isolate us-east eu-west
$ netsplitctl heal us-east eu-west

3.5 Data Formats / Schemas / Protocols

Conflict record shape with version metadata
Reconciliation audit event schema
Drill scenario manifest format
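As a starting point for the first two formats, the structs below show one possible shape. Every field name here is an assumption; the only point being illustrated is that `@enforce_keys` makes an incomplete conflict record or audit event fail at construction time instead of during a merge.

```elixir
defmodule ConflictRecord do
  @moduledoc "Hypothetical conflict record: one key with its divergent versions."
  @enforce_keys [:key, :candidates]
  # candidates: list of %{site: String.t(), version: integer, value: term}
  defstruct [:key, :candidates, detected_at: nil]
end

defmodule AuditEvent do
  @moduledoc "Hypothetical reconciliation audit event."
  @enforce_keys [:scenario_id, :key, :policy, :kept, :discarded]
  defstruct [:scenario_id, :key, :policy, :kept, :discarded, attempt: 1]
end

conflict =
  struct!(ConflictRecord,
    key: "user42",
    candidates: [
      %{site: "us-east", version: 7, value: "online"},
      %{site: "eu-west", version: 6, value: "offline"}
    ]
  )

event =
  struct!(AuditEvent,
    scenario_id: "drill-001",
    key: conflict.key,
    policy: :highest_version,
    kept: {"us-east", 7},
    discarded: {"eu-west", 6}
  )

IO.inspect(event)
```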

3.6 Edge Cases

Reconnect during ongoing writes
Repeated heal/isolate flapping
Partial audit log loss

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

Start two site clusters from Project 11.
Run netsplitctl isolate and apply scripted writes.
Run netsplitctl heal and verify convergence.

3.7.2 Golden Path Demo (Deterministic)

One user record diverges during partition, then converges with documented merge rule.

3.7.3 Exact Terminal Transcript (CLI)

$ netsplitctl isolate us-east eu-west
partition active

$ reconcilectl status
conflicts=14 merged=14 pending=0
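A drill script can assert on the status line instead of relying on a human to read it. The sketch below assumes the `key=value` output format shown in the transcript above; the `StatusLine` module is hypothetical.

```elixir
defmodule StatusLine do
  @moduledoc "Hypothetical parser for the key=value status line shown in the transcript."

  def parse(line) do
    line
    |> String.split()
    |> Map.new(fn pair ->
      [key, value] = String.split(pair, "=", parts: 2)
      {key, String.to_integer(value)}
    end)
  end

  # The drill passes when every conflict was merged and nothing is pending.
  def converged?(%{"conflicts" => c, "merged" => m, "pending" => 0}) when c == m, do: true
  def converged?(_), do: false
end

status = StatusLine.parse("conflicts=14 merged=14 pending=0")
IO.inspect(StatusLine.converged?(status))
```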

4. Solution Architecture

4.1 High-Level Design

[Drill Controller] -> [Partition Injector] -> [Reconcile Workers] -> [Audit Store]

4.2 Key Components

Component	Responsibility	Key Decisions
Drill Controller	Orchestrate scenarios	Deterministic scenario IDs
Reconcile Worker	Merge conflicts	Idempotent merge contracts
Audit Store	Persist decisions	Queryability and replay support

4.3 Data Structures (No Full Code)

Conflict queue entries
Versioned state records
Reconciliation audit entries

4.4 Algorithm Overview

Key Algorithm: Conflict Reconciliation

Load divergent records.
Apply merge policy.
Persist decision and new version.
Re-check until no conflicts remain.

Complexity Analysis:

Time: O(C) per reconciliation pass, where C is the number of conflicting records
Space: O(C)
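The four steps can be sketched over two in-memory replicas. This is a minimal illustration, not the project's implementation: the `ReconcileLoop` module, the record shape, and the choice to bump the winner's version by one are all assumptions. The merge work is proportional to the number of conflicting keys, although this simple version scans every key to find them.

```elixir
defmodule ReconcileLoop do
  @moduledoc """
  Hypothetical reconciliation loop over two in-memory replicas.
  Each replica is a map of key => %{version: integer, value: term}.
  """

  # Step 1: keys whose records differ between the two replicas.
  def divergent(a, b) do
    (Map.keys(a) ++ Map.keys(b))
    |> Enum.uniq()
    |> Enum.filter(fn k -> Map.get(a, k) != Map.get(b, k) end)
    |> Enum.sort()
  end

  # Steps 2-4: merge each conflict, persist to both sides, then re-check.
  def run(a, b, audit \\ []) do
    case divergent(a, b) do
      [] ->
        {a, b, Enum.reverse(audit)}

      keys ->
        {a2, b2, audit2} =
          Enum.reduce(keys, {a, b, audit}, fn k, {ra, rb, log} ->
            merged = pick(Map.get(ra, k), Map.get(rb, k))
            {Map.put(ra, k, merged), Map.put(rb, k, merged), [{k, merged.version} | log]}
          end)

        run(a2, b2, audit2)
    end
  end

  # A record missing on one side is copied from the other.
  defp pick(nil, r), do: r
  defp pick(r, nil), do: r

  # Highest version wins and is persisted under a new version.
  defp pick(x, y) do
    winner = Enum.max_by([x, y], &{&1.version, &1.value})
    %{winner | version: winner.version + 1}
  end
end

east = %{"user42" => %{version: 7, value: "online"}}
west = %{"user42" => %{version: 6, value: "offline"}, "user7" => %{version: 1, value: "idle"}}

{east2, west2, audit} = ReconcileLoop.run(east, west)
IO.inspect(east2 == west2)
IO.inspect(audit)
```

Running the loop again on the converged replicas finds no divergent keys, which is the replay invariant the drill is meant to verify.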

5. Implementation Guide

5.1 Development Environment Setup

# Use the standard cluster test environment from Project 11

5.2 Project Structure

project-root/
├── lib/
│   ├── drill_controller.ex
│   ├── partition_injector.ex
│   ├── reconcile_worker.ex
│   └── audit_store.ex
├── test/
└── README.md

5.3 The Core Question You’re Answering

“Can my distributed service prove recovery correctness after a real partition, rather than just reconnecting transport?”

5.4 Concepts You Must Understand First

Node monitor semantics
Idempotent merge design
Supervisor restart behavior under flapping conditions

5.5 Questions to Guide Your Design

What merge policy is explicit and explainable?
Which recovery steps can safely retry?
How will you prove convergence objectively?