Project 11: Multi-Network Cluster Formation Lab

Build one logical BEAM cluster from nodes spread across isolated networks.

Quick Reference

Difficulty: Level 3
Time Estimate: 18-28 hours
Main Programming Language: Elixir or Erlang
Alternative Programming Languages: Gleam
Coolness Level: Level 5
Business Potential: Level 4
Prerequisites: Distributed Erlang, OTP supervision, basic networking
Key Topics: node naming, secure distribution, cross-network topology

1. Learning Objectives

By completing this project, you will:

  1. Form a BEAM cluster across multiple network segments.
  2. Apply secure distribution practices for node-to-node communication.
  3. Observe and diagnose node topology and cross-site failures.
  4. Design supervision boundaries for site-level failures.

2. All Theory Needed (Per-Concept Breakdown)

Multi-Network Distributed Erlang

Fundamentals Distributed Erlang lets processes on different nodes exchange messages almost like local messaging, but network boundaries still matter. Node naming, cookies, DNS or host mapping, and routing policies become part of system correctness. In multi-network deployments, you must design explicit connectivity paths rather than relying on accidental full mesh. Secure distribution is essential because a compromised distribution channel can expose the entire node.

Deep Dive into the concept In a single subnet lab, distributed Erlang often feels effortless. As soon as nodes span multiple networks, however, the implicit assumptions break: names might not resolve, routes might be asymmetric, and firewalls might block required ports. The first principle is that distributed runtime semantics depend on transport correctness. If transport is unstable, runtime-level links and monitors produce noisy failure signals.

Design your topology in layers. Keep local site meshes tight and explicit, then connect sites through gateway nodes. This avoids combinatorial growth in cross-site connections and limits blast radius when WAN links degrade. Site gateways should be supervised as critical infrastructure components, while local service workers remain independent.
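Supervising a gateway link as critical infrastructure can be sketched as a small GenServer that monitors node up/down events and reconnects after WAN loss. This is a sketch, not the project's required design: the module name, peer argument, and the fixed 5-second retry are assumptions, and production code would add backoff and jitter.

```elixir
defmodule GatewayMonitor do
  @moduledoc "Sketch: keep one cross-site link alive; names and timings illustrative."
  use GenServer

  # peer is the remote gateway, e.g. :"gateway_eu@10.1.1.1" (illustrative name)
  def start_link(peer), do: GenServer.start_link(__MODULE__, peer, name: __MODULE__)

  @impl true
  def init(peer) do
    # Subscribe to {:nodeup, n} / {:nodedown, n} messages from the runtime.
    :net_kernel.monitor_nodes(true)
    {:ok, %{peer: peer}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state) do
    # Returns true | false | :ignored; failures are retried via :nodedown handling.
    Node.connect(state.peer)
    {:noreply, state}
  end

  @impl true
  def handle_info({:nodedown, node}, %{peer: node} = state) do
    # WAN link lost: schedule a reconnect; local site services keep running.
    Process.send_after(self(), :reconnect, 5_000)
    {:noreply, state}
  end

  def handle_info(:reconnect, state), do: handle_continue(:connect, state)
  def handle_info(_msg, state), do: {:noreply, state}
end
```

The key design point matches the text: only the gateway process cares about the WAN link, so a cross-site outage never restarts local workers.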

Secure distribution changes the operational posture from “trusted private network” to “authenticated and encrypted channels.” TLS certificates, trust roots, and rotation schedules become part of cluster maintenance. You should document certificate ownership and rotation windows as first-class runbook items.
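Concretely, Erlang/OTP ships a TLS distribution transport that is enabled with two VM flags plus an options file of Erlang terms. The flags (`-proto_dist inet_tls`, `-ssl_dist_optfile`) are standard; the file paths and certificate names below are assumptions for illustration.

```
## vm.args: switch node-to-node distribution to TLS
-proto_dist inet_tls
-ssl_dist_optfile /etc/beam/ssl_dist.conf

%% /etc/beam/ssl_dist.conf: Erlang term file (paths illustrative)
[{server, [{certfile,   "/etc/beam/certs/node.pem"},
           {cacertfile, "/etc/beam/certs/ca.pem"},
           {verify, verify_peer},
           {fail_if_no_peer_cert, true}]},
 {client, [{certfile,   "/etc/beam/certs/node.pem"},
           {cacertfile, "/etc/beam/certs/ca.pem"},
           {verify, verify_peer}]}].
```

With `verify_peer` and `fail_if_no_peer_cert` set, both ends must present certificates from the shared trust root, which is what makes rotation schedules a real operational concern.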

Observability is as important as connectivity. You need a topology command that reports node membership, inter-site links, and connection health. Without this, failures become guesswork. A useful invariant is: every cross-site connection should be discoverable through one command and validated with deterministic probes.

Common failures include cookie mismatch, inconsistent node names, DNS drift, and blocked distribution ports. Treat these as configuration drift incidents and standardize preflight checks before starting nodes.
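A preflight check can be sketched as a pure function over a node's startup configuration. The rules below (longname shape, non-empty cookie, positive distribution port) come from the failure list above; the module name and config fields are assumptions, and a real check would also probe DNS and port reachability.

```elixir
defmodule Preflight do
  @moduledoc "Sketch of pre-start configuration checks; field names illustrative."

  # A longname must look like name@fully.qualified.host (or name@ip).
  def valid_longname?(name) when is_binary(name) do
    case String.split(name, "@") do
      [n, host] -> n != "" and String.contains?(host, ".")
      _ -> false
    end
  end

  # Run all checks; return :ok or the list of failed check names.
  def check(config) do
    [
      {:longname, valid_longname?(config.name)},
      {:cookie_set, config.cookie not in [nil, ""]},
      {:dist_port, is_integer(config.dist_port) and config.dist_port > 0}
    ]
    |> Enum.reject(fn {_check, ok?} -> ok? end)
    |> case do
      [] -> :ok
      failures -> {:error, Enum.map(failures, &elem(&1, 0))}
    end
  end
end
```

Running this before `Node.start/2` turns a confusing runtime connection failure into an explicit configuration-drift report.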

How this fits into the projects This concept is central to this project and directly prepares you for P12 and P13.

Definitions & key terms

  • Node longname: A fully-qualified node identity used for cross-host clustering.
  • Distribution cookie: Shared secret used for node authentication.
  • Gateway node: Node dedicated to inter-site connectivity.
  • Topology contract: Explicit list of allowed node-to-node links.

Mental model diagram

[Site A mesh] --- [Gateway A] ====WAN==== [Gateway B] --- [Site B mesh]
    |                  |                     |                 |
 local workers      monitored link      monitored link     local workers

How it works (step-by-step, with invariants and failure modes)

  1. Start local site nodes with stable names and shared cookie policy.
  2. Establish local mesh links.
  3. Bring up gateway nodes with inter-site routes.
  4. Validate cross-site node connectivity.
  5. Invariant: local services keep running if WAN fails.
  6. Failure modes: node flapping, partial connectivity, stale routes.

Minimal concrete example

# Elixir sketch; node names are illustrative, and all nodes must share a cookie.
site_a = [:"edge_us_1@10.0.1.11", :"edge_us_2@10.0.1.12"]
site_b = [:"edge_eu_1@10.1.1.11", :"edge_eu_2@10.1.1.12"]
Enum.each(site_a, &Node.connect/1)    # local mesh, site A
Enum.each(site_b, &Node.connect/1)    # local mesh, site B
Node.connect(:"gateway_eu@10.1.1.1")  # run on gateway_us: the only cross-site link

Common misconceptions

  • “If two nodes can ping once, distribution is healthy.” Health requires sustained checks.
  • “Private network means secure enough.” Distribution still needs explicit hardening.

Check-your-understanding questions

  1. Why should cross-site links be centralized through gateways?
  2. What breaks when longnames are inconsistent across nodes?
  3. Why is topology observability mandatory for operations?

Check-your-understanding answers

  1. It limits blast radius and simplifies failure diagnosis.
  2. Nodes fail to authenticate or route correctly.
  3. Without visibility, partitions and drift are hard to detect.

Real-world applications

  • Multi-region internal services
  • Private edge deployments
  • High-availability cluster control planes

Where you’ll apply it

  • Project 11 as the core implementation
  • Project 12 for partition drills
  • Project 13 for cross-site event routing

References

  • Distributed Erlang documentation
  • SSL distribution documentation
  • OTP supervision principles

Key insights Topology is architecture, not an implementation detail.

Summary A multi-network BEAM cluster succeeds when connectivity, security, and observability are designed together.

Homework/Exercises to practice the concept

  1. Draw a three-site topology with explicit allowed links.
  2. Define a preflight checklist for starting a new node.

Solutions to the homework/exercises

  1. Use per-site meshes and gateway-only inter-site links.
  2. Validate naming, cookie, certs, ports, and reachability.

3. Project Specification

3.1 What You Will Build

A cluster bootstrap toolkit with commands to show topology, validate links, and confirm cross-network process reachability.

3.2 Functional Requirements

  1. Topology Discovery: Show site, node, and link state.
  2. Connectivity Validation: Probe cross-site messaging paths.
  3. Security Mode Awareness: Report whether secure distribution is enabled.

3.3 Non-Functional Requirements

  • Performance: Health checks complete quickly enough for operator use.
  • Reliability: False positives/negatives are minimized.
  • Usability: Output is explicit and actionable.

3.4 Example Usage / Output

$ clusterctl topology
site=us-east ...

3.5 Data Formats / Schemas / Protocols

  • Node identity records
  • Site grouping metadata
  • Link health events

3.6 Edge Cases

  • One-way connectivity
  • Gateway failover
  • Cookie mismatch during rolling restart

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • Start nodes per site with your environment bootstrap script.
  • Run clusterctl topology and clusterctl ping probes.

3.7.2 Golden Path Demo (Deterministic)

Two sites join, probes succeed, topology reports all expected links.

3.7.3 If CLI: exact terminal transcript

$ clusterctl topology
site=us-east nodes=[edge-us-1,edge-us-2]
site=eu-west nodes=[edge-eu-1,edge-eu-2]
links=8 secure=true
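A transcript in that shape could come from a small rendering function. This is a sketch only: the `Reporter` module name and its inputs (ordered site tuples, a link count, a secure flag) are assumptions, not part of the toolkit specification.

```elixir
defmodule Reporter do
  @moduledoc "Sketch: render topology lines in the clusterctl output shape."

  # sites: ordered list of {site_name, [node_name]} tuples
  def render(sites, link_count, secure?) do
    site_lines =
      Enum.map(sites, fn {site, nodes} ->
        "site=#{site} nodes=[#{Enum.join(nodes, ",")}]"
      end)

    site_lines ++ ["links=#{link_count} secure=#{secure?}"]
  end
end
```

Keeping output in a stable `key=value` shape gives operators something both human-readable and machine-parseable, which the non-functional requirements ask for.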

4. Solution Architecture

4.1 High-Level Design

[Discovery] -> [Topology Model] -> [Health Prober] -> [CLI Reporter]

4.2 Key Components

Discovery: builds the node graph. Key decision: active probing vs. passive events.
Health Prober: validates the messaging path. Key decision: probe interval and timeout.
Reporter: renders operator output. Key decision: human-readable plus machine-parseable format.

4.3 Data Structures (No Full Code)

  • Site map keyed by region
  • Node metadata with connectivity state
  • Link status records with timestamp
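The record shapes above can be sketched as Elixir structs. Field names and the status atoms are assumptions chosen for illustration; the spec deliberately leaves the exact schema to you.

```elixir
defmodule Topo do
  @moduledoc "Sketch of topology model records; field names illustrative."

  defmodule NodeInfo do
    # state: :unknown | :up | :down
    defstruct [:name, :site, state: :unknown]
  end

  defmodule Link do
    # status: :unprobed | :ok | {:error, reason}; checked_at is a timestamp
    defstruct [:from, :to, status: :unprobed, checked_at: nil]
  end
end

# Site map keyed by region (node names illustrative):
sites = %{
  "us-east" => [%Topo.NodeInfo{name: :"edge-us-1@10.0.1.11", site: "us-east"}],
  "eu-west" => [%Topo.NodeInfo{name: :"edge-eu-1@10.1.1.11", site: "eu-west"}]
}
```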

4.4 Algorithm Overview

Key Algorithm: Topology Validation Loop

  1. Enumerate expected links from topology contract.
  2. Probe each link.
  3. Mark success/failure and summarize by site.

Complexity Analysis:

  • Time: O(L) where L is number of expected links
  • Space: O(N + L)
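The validation loop above can be sketched as a pure function with the probe injected, which keeps the O(L) loop deterministic and testable without a live cluster. The module name and result shape are assumptions.

```elixir
defmodule Validator do
  @moduledoc "Sketch of the topology validation loop; probe function is injected."

  # contract: list of {from, to} expected links
  # probe: fun/2 returning :ok | {:error, reason} for one link
  def run(contract, probe) do
    results = Enum.map(contract, fn {from, to} -> {{from, to}, probe.(from, to)} end)

    %{
      ok: for({link, :ok} <- results, do: link),
      failed: for({link, {:error, reason}} <- results, do: {link, reason})
    }
  end
end
```

In the real toolkit the probe would send a round-trip message across the link; in tests it can be a stub, so healthy and degraded summaries are both reproducible.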

5. Implementation Guide

5.1 Development Environment Setup

# use your normal mix/rebar3 environment, e.g.:
mix new clusterctl --sup

5.2 Project Structure

project-root/
├── lib/
│   ├── topology.ex
│   ├── probe.ex
│   └── cli.ex
├── test/
└── README.md

5.3 The Core Question You’re Answering

“How do I keep distributed-node operations deterministic when infrastructure is non-deterministic?”

5.4 Concepts You Must Understand First

  1. Node naming and distribution identity
  2. Secure distribution channels
  3. Supervisor boundaries for infrastructure processes

5.5 Questions to Guide Your Design

  1. Which links are mandatory and which are optional?
  2. What exact signals define healthy vs degraded state?
  3. How do you expose failures so operators can act fast?