Project 12: WAN Netsplit Recovery Drill

Simulate cross-network partitions and implement deterministic state reconciliation.

Quick Reference

Attribute                          Value
Difficulty                         Level 4
Time Estimate                      24-36 hours
Main Programming Language          Erlang or Elixir
Alternative Programming Languages  Gleam
Coolness Level                     Level 5
Business Potential                 Level 4
Prerequisites                      Distributed Erlang, supervision, state modeling
Key Topics                         netsplit handling, reconciliation, idempotent recovery

1. Learning Objectives

By completing this project, you will:

  1. Model WAN partitions as expected system events.
  2. Build repeatable partition/heal drills.
  3. Implement deterministic merge policies for conflicting state.
  4. Validate that recovery converges after reconnect.

2. All Theory Needed (Per-Concept Breakdown)

Partition Semantics and Reconciliation

Fundamentals

A netsplit is a connectivity break between nodes that can leave both sides of a system making progress independently. This creates conflicting state once communication resumes. Recovery is not just reconnecting transport; it is restoring application-level consistency. The core requirement is an explicit reconciliation policy.

Deep Dive into the concept

In distributed BEAM systems, a partition appears as node disconnections and monitor events. Local workloads often continue, which is good for availability but dangerous for consistency when mutable state diverges. Designing for partitions starts with classification: which state can tolerate eventual consistency and which state needs stronger guarantees.
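
To make the "node disconnections and monitor events" concrete: a process that calls :net_kernel.monitor_nodes(true) receives {:nodedown, node} and {:nodeup, node} messages. The sketch below is one possible way to fold those messages into a small state value. The module name PartitionTracker, its functions, and the node name are hypothetical, and the event handling is kept pure so it can be exercised without a running cluster.

```elixir
defmodule PartitionTracker do
  @moduledoc "Hypothetical helper: tracks which peer nodes are currently unreachable."

  # Assumption: in a GenServer, init/1 would call :net_kernel.monitor_nodes(true)
  # so the process receives {:nodeup, node} and {:nodedown, node} messages, and
  # handle_info/2 would pass each message to handle_event/2 below.

  def new, do: %{down: MapSet.new(), heals: 0}

  # A peer dropped out of view: remember it as unreachable.
  def handle_event(state, {:nodedown, peer}) do
    %{state | down: MapSet.put(state.down, peer)}
  end

  def handle_event(state, {:nodeup, peer}) do
    if MapSet.member?(state.down, peer) do
      # A peer we saw go down is back: transport healed, reconciliation is now due.
      %{state | down: MapSet.delete(state.down, peer), heals: state.heals + 1}
    else
      # First sighting of this peer, nothing to heal.
      state
    end
  end

  def partitioned?(state), do: MapSet.size(state.down) > 0
end

state =
  PartitionTracker.new()
  |> PartitionTracker.handle_event({:nodedown, :"drill@eu-west"})

IO.inspect(PartitionTracker.partitioned?(state))
```

Counting heals separately from nodeup events is a design choice in this sketch: it lets the caller start reconciliation only when a previously lost peer returns.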

A practical pattern is to mark every mutable record with version or causality metadata. During partition, each side applies updates locally and records decision context. On heal, reconciliation workers compare divergent records, apply merge policy, and emit audit logs. These workers should be supervised and restart-safe because reconciliation itself can fail.

Idempotency is the anchor property. A good merge pipeline can replay the same reconciliation inputs multiple times and still reach the same final state. Without idempotency, retries create oscillation. Another key property is monotonic progress: each merge step should advance a version frontier rather than bouncing between states.
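
The two properties above can be checked mechanically. The following is a minimal sketch, assuming a record shape of %{key, version, site, value} and a hypothetical ReplaySafeStore module; neither comes from a real library. It keeps the copy with the highest {version, site} pair per key, so replaying a batch is a no-op and the stored version for a key never moves backwards.

```elixir
defmodule ReplaySafeStore do
  # Assumed record shape: %{key: term, version: integer, site: String.t(), value: term}.

  @doc "Applies a batch of updates; the highest {version, site} pair wins per key."
  def apply_batch(store, updates) do
    Enum.reduce(updates, store, fn update, acc ->
      Map.update(acc, update.key, update, &newer(&1, update))
    end)
  end

  @doc "Version frontier: the stored version for every key."
  def frontier(store), do: Map.new(store, fn {key, rec} -> {key, rec.version} end)

  # Ties on version break on site name so every node picks the same winner.
  defp newer(a, b) do
    if {a.version, a.site} >= {b.version, b.site}, do: a, else: b
  end
end

updates = [
  %{key: "user42", version: 7, site: "us-east", value: :online},
  %{key: "user42", version: 6, site: "eu-west", value: :offline}
]

once = ReplaySafeStore.apply_batch(%{}, updates)
twice = ReplaySafeStore.apply_batch(once, updates)
reversed = ReplaySafeStore.apply_batch(%{}, Enum.reverse(updates))

IO.inspect(once == twice, label: "replay is a no-op")
IO.inspect(once == reversed, label: "input order does not matter")
```

Because the winner is a maximum over a total order, the merge is idempotent and order-independent by construction. Policies that combine fields from both copies need their own argument for these properties.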

Operationally, you need deterministic drills. Random partitions are useful for chaos testing, but deterministic scripts are mandatory for learning and regression. Define partition scenarios by site pairs, duration, and expected post-heal outcomes. Keep a golden dataset with known conflicts so you can verify convergence.
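
One way to express a deterministic scenario and its golden outcome is plain data plus a comparison function. This is a sketch only: the scenario fields (id, sites, duration_ms, expected), the scenario ID, and the GoldenCheck module are assumptions made for illustration, not a format defined by this project.

```elixir
defmodule GoldenCheck do
  @doc """
  Compares post-heal state against the expected outcomes of a scenario.
  Returns :ok or {:mismatch, [{key, expected, actual}]} sorted by key.
  """
  def verify(scenario, actual_state) do
    mismatches =
      scenario.expected
      |> Enum.map(fn {key, expected} -> {key, expected, Map.get(actual_state, key)} end)
      |> Enum.reject(fn {_key, expected, actual} -> expected == actual end)
      |> Enum.sort()

    if mismatches == [], do: :ok, else: {:mismatch, mismatches}
  end
end

# Hypothetical scenario: one site pair, a fixed duration, one known conflict.
scenario = %{
  id: "split-us-east-eu-west-01",
  sites: {"us-east", "eu-west"},
  duration_ms: 30_000,
  expected: %{"user42" => %{version: 7, status: :online}}
}

post_heal_state = %{"user42" => %{version: 7, status: :online}}
IO.inspect(GoldenCheck.verify(scenario, post_heal_state))
```

Returning the full list of mismatches, rather than a boolean, makes a failed regression run easier to diagnose.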

Failure handling should separate concerns: transport recovery, state replay, conflict merge, and client notification. If these are tangled, debugging becomes slow. Keep each stage observable with explicit counters (records compared, merged, rejected, retried).
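
The four counters named above can start as a plain map before any metrics library is involved. A minimal sketch, assuming a hypothetical DrillCounters module; the stage names are taken from the paragraph above.

```elixir
defmodule DrillCounters do
  @stages [:compared, :merged, :rejected, :retried]

  @doc "All counters start at zero."
  def new, do: Map.new(@stages, &{&1, 0})

  @doc "Increments one counter; unknown stage names fail fast."
  def bump(counters, stage, by \\ 1) when stage in @stages do
    Map.update!(counters, stage, &(&1 + by))
  end
end

counters =
  DrillCounters.new()
  |> DrillCounters.bump(:compared, 14)
  |> DrillCounters.bump(:merged, 14)
  |> DrillCounters.bump(:retried)

IO.inspect(counters)
```

The guard rejects misspelled stage names immediately, which keeps drill reports comparable across runs.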

How this fits into the projects

This project builds directly on the Project 11 topology and prepares you for Project 13 traffic resilience.

Definitions & key terms

  • Netsplit: Connectivity break that splits a cluster into partitions that cannot reach each other.
  • Convergence: All partitions reach the same final state.
  • Idempotent merge: Applying merge logic multiple times yields the same result.
  • Reconciliation audit: Record of conflict decisions and outcomes.

Mental model diagram

[Partition A writes]       [Partition B writes]
         |                           |
         +----------- heal ----------+
                     |
              [Reconcile Worker]
                     |
               [Converged State]

How it works (step-by-step, with invariants and failure modes)

  1. Trigger partition between sites.
  2. Apply controlled conflicting writes.
  3. Heal transport.
  4. Run reconciliation workflow.
  5. Invariant: replaying reconciliation does not change final result.
  6. Failure modes: non-idempotent merge, missing causality metadata, replay gaps.

Minimal concrete example

record user42:
  us-east version=7 status=online
  eu-west version=6 status=offline
merge_policy: choose highest version, keep audit trail
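
The same example can be written as a small function that returns both the winner and an audit entry. This is a sketch under assumptions: the HighestVersionPolicy module and the audit field names are hypothetical, and only the user42 values come from the example above.

```elixir
defmodule HighestVersionPolicy do
  @doc "Returns {winner, audit_entry} for two divergent copies of one record."
  def resolve(key, a, b) do
    {winner, loser} = if a.version >= b.version, do: {a, b}, else: {b, a}

    # The audit entry records which copy was kept and which was discarded.
    audit = %{
      key: key,
      policy: :highest_version,
      kept: {winner.site, winner.version},
      discarded: {loser.site, loser.version}
    }

    {winner, audit}
  end
end

us_east = %{site: "us-east", version: 7, status: :online}
eu_west = %{site: "eu-west", version: 6, status: :offline}

{winner, audit} = HighestVersionPolicy.resolve("user42", us_east, eu_west)
IO.inspect(winner.status)  # prints :online
IO.inspect(audit)
```

When both versions are equal, this sketch keeps the first argument, which is not order-independent. A real policy needs an explicit tie-break rule, for example on site name.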

Common misconceptions

  • “Reconnect means solved.” Transport heal is only step one.
  • “Last write wins is always enough.” It can lose critical business intent.

Check-your-understanding questions

  1. Why is idempotency critical in reconciliation?
  2. What signals indicate partition healing is complete?
  3. How do you verify true convergence across sites?

Check-your-understanding answers

  1. Retries are unavoidable, so a non-idempotent merge can corrupt state or oscillate when replayed.
  2. Transport is restored and reconciliation queues are drained.
  3. All replicas report identical state and version frontier.
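
Answer 3 can be turned into a check that compares a compact fingerprint per replica instead of shipping full state between sites. A minimal sketch, assuming each replica's state is a map of key to a record carrying a version field; the ConvergenceCheck module is hypothetical. The digest relies on :erlang.term_to_binary/1 producing the same bytes for equal terms, which is assumed to hold here because all replicas run the same OTP release.

```elixir
defmodule ConvergenceCheck do
  @doc "Summarises one replica as {state_digest, version_frontier}."
  def fingerprint(state) do
    # Sort first so the digest does not depend on map iteration order.
    digest = state |> Enum.sort() |> :erlang.term_to_binary() |> :erlang.md5()
    frontier = Map.new(state, fn {key, rec} -> {key, rec.version} end)
    {digest, frontier}
  end

  @doc "True when every replica reports the same digest and frontier."
  def converged?(replica_states) do
    fingerprints = Enum.map(replica_states, &fingerprint/1)
    length(Enum.uniq(fingerprints)) <= 1
  end
end

us_east = %{"user42" => %{version: 7, status: :online}}
eu_west = %{"user42" => %{version: 7, status: :online}}

IO.inspect(ConvergenceCheck.converged?([us_east, eu_west]))
```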

Real-world applications

  • Multi-region session systems
  • Presence and routing state recovery
  • Operational failover workflows

Where you’ll apply it

  • Core implementation in Project 12
  • Follow-on resilience in Project 13

References

  • Distributed Erlang documentation
  • OTP supervision principles
  • Release and recovery operational guides

Key insights

Partition tolerance requires explicit policy, not optimistic assumptions.

Summary

Netsplit recovery is a design problem spanning transport, state, and operations. Deterministic drills and idempotent reconciliation are the foundation.

Homework/Exercises to practice the concept

  1. Define two merge policies for the same dataset and compare outcomes.
  2. Build a checklist for verifying post-heal convergence.

Solutions to the homework/exercises

  1. Contrast strict-authoritative-site vs version-based merge.
  2. Validate queue drain, record parity, and deterministic replay.
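
For exercise 1, the two policies can be run side by side on one dataset to see where they disagree. The sketch below is illustrative only: the PolicyComparison module is hypothetical, and the cart9 record is invented to give the dataset a second conflict.

```elixir
defmodule PolicyComparison do
  # Version-based policy: highest version wins, ties break on site name.
  def highest_version(a, b) do
    if {a.version, a.site} >= {b.version, b.site}, do: a, else: b
  end

  # Strict-authoritative-site policy: the named site always wins a conflict.
  def authoritative_site(authority) do
    fn a, b ->
      cond do
        a.site == authority -> a
        b.site == authority -> b
        true -> highest_version(a, b)
      end
    end
  end

  # Merges two site states; the policy is only consulted for conflicting keys.
  def reconcile(left, right, policy) do
    Map.merge(left, right, fn _key, a, b -> policy.(a, b) end)
  end

  def differing_keys(outcome_a, outcome_b) do
    outcome_a
    |> Map.keys()
    |> Enum.filter(fn key -> outcome_a[key] != outcome_b[key] end)
    |> Enum.sort()
  end
end

us_east = %{
  "user42" => %{site: "us-east", version: 7, status: :online},
  # cart9 is a made-up record for this comparison.
  "cart9" => %{site: "us-east", version: 3, status: :open}
}

eu_west = %{
  "user42" => %{site: "eu-west", version: 6, status: :offline},
  "cart9" => %{site: "eu-west", version: 5, status: :paid}
}

by_version = PolicyComparison.reconcile(us_east, eu_west, &PolicyComparison.highest_version/2)
by_site = PolicyComparison.reconcile(us_east, eu_west, PolicyComparison.authoritative_site("eu-west"))

IO.inspect(PolicyComparison.differing_keys(by_version, by_site))  # prints ["user42"]
```

The policies agree on cart9 because the authoritative site also holds the highest version there. They disagree on user42, where the authoritative site holds the older copy.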

3. Project Specification

3.1 What You Will Build

A drill toolkit that can isolate sites, trigger controlled conflicts, heal links, and verify converged state.

3.2 Functional Requirements

  1. Partition Control: Start and stop site-level partition scenarios.
  2. Conflict Injection: Apply deterministic conflicting updates.
  3. Reconciliation Engine: Merge divergent state and emit audit records.

3.3 Non-Functional Requirements

  • Performance: Reconciliation completes within the SLO defined for the test dataset.
  • Reliability: Repeated drills produce equivalent outcomes.
  • Usability: Operators can reproduce each scenario from the documented commands.

3.4 Example Usage / Output

$ netsplitctl isolate us-east eu-west
$ netsplitctl heal us-east eu-west

3.5 Data Formats / Schemas / Protocols

  • Conflict record shape with version metadata
  • Reconciliation audit event schema
  • Drill scenario manifest format
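
As a starting point, the three shapes could be modelled as structs with enforced keys. Every module and field name below is an assumption made for illustration; the project does not prescribe a schema.

```elixir
defmodule Drill.ConflictRecord do
  # One site's copy of a record, with the version metadata needed for merging.
  @enforce_keys [:key, :site, :version, :value]
  defstruct [:key, :site, :version, :value]
end

defmodule Drill.AuditEvent do
  # One reconciliation decision: which copy was kept and which was discarded.
  @enforce_keys [:scenario_id, :key, :policy, :kept, :discarded]
  defstruct [:scenario_id, :key, :policy, :kept, :discarded]
end

defmodule Drill.ScenarioManifest do
  # One drill: which sites to split, for how long, and what to expect after heal.
  @enforce_keys [:id, :sites, :duration_ms]
  defstruct [:id, :sites, :duration_ms, writes: [], expected: %{}]
end

# struct!/2 raises ArgumentError when an enforced key is missing, which is
# useful when manifests are loaded from files at runtime.
record =
  struct!(Drill.ConflictRecord,
    key: "user42",
    site: "us-east",
    version: 7,
    value: %{status: :online}
  )

IO.inspect(record)
```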

3.6 Edge Cases

  • Reconnect during ongoing writes
  • Repeated heal/isolate flapping
  • Partial audit log loss

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • Start two site clusters from Project 11.
  • Run netsplitctl isolate and apply scripted writes.
  • Run netsplitctl heal and verify convergence.

3.7.2 Golden Path Demo (Deterministic)

One user record diverges during the partition, then converges under the documented merge rule.

3.7.3 If CLI: exact terminal transcript

$ netsplitctl isolate us-east eu-west
partition active

$ reconcilectl status
conflicts=14 merged=14 pending=0

4. Solution Architecture

4.1 High-Level Design

[Drill Controller] -> [Partition Injector] -> [Reconcile Workers] -> [Audit Store]

4.2 Key Components

Component         Responsibility         Key Decisions
Drill Controller  Orchestrate scenarios  Deterministic scenario IDs
Reconcile Worker  Merge conflicts        Idempotent merge contracts
Audit Store       Persist decisions      Queryability and replay support

4.3 Data Structures (No Full Code)

  • Conflict queue entries
  • Versioned state records
  • Reconciliation audit entries

4.4 Algorithm Overview

Key Algorithm: Conflict Reconciliation

  1. Load divergent records.
  2. Apply merge policy.
  3. Persist decision and new version.
  4. Re-check until no conflicts remain.

Complexity Analysis:

  • Time: O(C) where C is number of conflicting records
  • Space: O(C)
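
The four steps can be sketched as a recursive loop over in-memory site state. This is an illustration under assumptions, not a reference implementation: the ReconcileLoop module and the data shape (site name to key to %{version, value}) are hypothetical, the policy is "highest version, ties broken by site name", and the winning version is stored as the new version at every site. For clarity the sketch re-scans all keys on each pass, so it does more work than the O(C) bound above.

```elixir
defmodule ReconcileLoop do
  @doc """
  sites: %{site_name => %{key => %{version: integer, value: term}}}
  Returns {converged_sites, audit_log}.
  """
  def run(sites, audit \\ []) do
    case divergent_keys(sites) do
      # 4. Re-check: stop once no conflicts remain.
      [] ->
        {sites, Enum.reverse(audit)}

      [key | _rest] ->
        {winner_site, winner} = pick_winner(sites, key)
        # 3. Persist the decision: every site stores the winning copy.
        merged = Map.new(sites, fn {site, state} -> {site, Map.put(state, key, winner)} end)
        entry = %{key: key, kept_from: winner_site, version: winner.version}
        run(merged, [entry | audit])
    end
  end

  # 1. Load divergent records: keys whose copies differ (or are missing) across sites.
  defp divergent_keys(sites) do
    states = Map.values(sites)

    states
    |> Enum.flat_map(&Map.keys/1)
    |> Enum.uniq()
    |> Enum.sort()
    |> Enum.filter(fn key ->
      copies = states |> Enum.map(&Map.get(&1, key)) |> Enum.uniq()
      length(copies) > 1
    end)
  end

  # 2. Apply merge policy: highest version wins, ties break on site name.
  defp pick_winner(sites, key) do
    sites
    |> Enum.filter(fn {_site, state} -> Map.has_key?(state, key) end)
    |> Enum.map(fn {site, state} -> {site, Map.fetch!(state, key)} end)
    |> Enum.max_by(fn {site, rec} -> {rec.version, site} end)
  end
end

sites = %{
  "us-east" => %{"user42" => %{version: 7, value: :online}},
  "eu-west" => %{"user42" => %{version: 6, value: :offline}}
}

{converged, audit} = ReconcileLoop.run(sites)
IO.inspect(converged["eu-west"]["user42"].value)  # prints :online
IO.inspect(audit)
```

Running the loop again on its own output finds no divergent keys and returns an empty audit log.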

5. Implementation Guide

5.1 Development Environment Setup

# use standard cluster test environment from P11

5.2 Project Structure

project-root/
├── lib/
│   ├── drill_controller.ex
│   ├── partition_injector.ex
│   ├── reconcile_worker.ex
│   └── audit_store.ex
├── test/
└── README.md

5.3 The Core Question You’re Answering

“Can my distributed service prove that recovery is correct after a real partition, rather than merely reconnecting transport?”

5.4 Concepts You Must Understand First

  1. Node monitor semantics
  2. Idempotent merge design
  3. Supervisor restart behavior under flapping conditions

5.5 Questions to Guide Your Design

  1. What merge policy is explicit and explainable?
  2. Which recovery steps can safely retry?
  3. How will you prove convergence objectively?