Project 11: Multi-Network Cluster Formation Lab

Build one logical BEAM cluster from nodes spread across isolated networks.

Quick Reference

Difficulty: Level 3
Time Estimate: 18-28 hours
Main Programming Language: Elixir or Erlang
Alternative Programming Languages: Gleam
Coolness Level: Level 5
Business Potential: Level 4
Prerequisites: Distributed Erlang, OTP supervision, basic networking
Key Topics: node naming, secure distribution, cross-network topology

1. Learning Objectives

By completing this project, you will:

  1. Form a BEAM cluster across multiple network segments.
  2. Apply secure distribution practices for node-to-node communication.
  3. Observe and diagnose node topology and cross-site failures.
  4. Design supervision boundaries for site-level failures.

2. All Theory Needed (Per-Concept Breakdown)

Multi-Network Distributed Erlang

Fundamentals Distributed Erlang lets processes on different nodes exchange messages almost like local messaging, but network boundaries still matter. Node naming, cookies, DNS or host mapping, and routing policies become part of system correctness. In multi-network deployments, you must design explicit connectivity paths rather than relying on accidental full mesh. Secure distribution is essential because a compromised distribution channel can expose the entire node.

Deep Dive into the concept In a single subnet lab, distributed Erlang often feels effortless. As soon as nodes span multiple networks, however, the implicit assumptions break: names might not resolve, routes might be asymmetric, and firewalls might block required ports. The first principle is that distributed runtime semantics depend on transport correctness. If transport is unstable, runtime-level links and monitors produce noisy failure signals.

Design your topology in layers. Keep local site meshes tight and explicit, then connect sites through gateway nodes. This avoids combinatorial growth in cross-site connections and limits blast radius when WAN links degrade. Site gateways should be supervised as critical infrastructure components, while local service workers remain independent.
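Supervising a gateway link as critical infrastructure can be sketched as a small GenServer that monitors node up/down events and reconnects after WAN loss. This is a sketch, not the project's required design: the module name, peer argument, and the fixed 5-second retry are assumptions, and production code would add backoff and jitter.

```elixir
defmodule GatewayMonitor do
  @moduledoc "Sketch: keep one cross-site link alive; names and timings illustrative."
  use GenServer

  # peer is the remote gateway, e.g. :"gateway_eu@10.1.1.1" (illustrative name)
  def start_link(peer), do: GenServer.start_link(__MODULE__, peer, name: __MODULE__)

  @impl true
  def init(peer) do
    # Subscribe to {:nodeup, n} / {:nodedown, n} messages from the runtime.
    :net_kernel.monitor_nodes(true)
    {:ok, %{peer: peer}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state) do
    # Returns true | false | :ignored; failures are retried via :nodedown handling.
    Node.connect(state.peer)
    {:noreply, state}
  end

  @impl true
  def handle_info({:nodedown, node}, %{peer: node} = state) do
    # WAN link lost: schedule a reconnect; local site services keep running.
    Process.send_after(self(), :reconnect, 5_000)
    {:noreply, state}
  end

  def handle_info(:reconnect, state), do: handle_continue(:connect, state)
  def handle_info(_msg, state), do: {:noreply, state}
end
```

The key design point matches the text: only the gateway process cares about the WAN link, so a cross-site outage never restarts local workers.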

Secure distribution changes the operational posture from “trusted private network” to “authenticated and encrypted channels.” TLS certificates, trust roots, and rotation schedules become part of cluster maintenance. You should document certificate ownership and rotation windows as first-class runbook items.
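Concretely, Erlang/OTP ships a TLS distribution transport that is enabled with two VM flags plus an options file of Erlang terms. The flags (`-proto_dist inet_tls`, `-ssl_dist_optfile`) are standard; the file paths and certificate names below are assumptions for illustration.

```
## vm.args: switch node-to-node distribution to TLS
-proto_dist inet_tls
-ssl_dist_optfile /etc/beam/ssl_dist.conf

%% /etc/beam/ssl_dist.conf: Erlang term file (paths illustrative)
[{server, [{certfile,   "/etc/beam/certs/node.pem"},
           {cacertfile, "/etc/beam/certs/ca.pem"},
           {verify, verify_peer},
           {fail_if_no_peer_cert, true}]},
 {client, [{certfile,   "/etc/beam/certs/node.pem"},
           {cacertfile, "/etc/beam/certs/ca.pem"},
           {verify, verify_peer}]}].
```

With `verify_peer` and `fail_if_no_peer_cert` set, both ends must present certificates from the shared trust root, which is what makes rotation schedules a real operational concern.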

Observability is as important as connectivity. You need a topology command that reports node membership, inter-site links, and connection health. Without this, failures become guesswork. A useful invariant is: every cross-site connection should be discoverable through one command and validated with deterministic probes.

Common failures include cookie mismatch, inconsistent node names, DNS drift, and blocked distribution ports. Treat these as configuration drift incidents and standardize preflight checks before starting nodes.
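A preflight check can be sketched as a pure function over a node's startup configuration. The rules below (longname shape, non-empty cookie, positive distribution port) come from the failure list above; the module name and config fields are assumptions, and a real check would also probe DNS and port reachability.

```elixir
defmodule Preflight do
  @moduledoc "Sketch of pre-start configuration checks; field names illustrative."

  # A longname must look like name@fully.qualified.host (or name@ip).
  def valid_longname?(name) when is_binary(name) do
    case String.split(name, "@") do
      [n, host] -> n != "" and String.contains?(host, ".")
      _ -> false
    end
  end

  # Run all checks; return :ok or the list of failed check names.
  def check(config) do
    [
      {:longname, valid_longname?(config.name)},
      {:cookie_set, config.cookie not in [nil, ""]},
      {:dist_port, is_integer(config.dist_port) and config.dist_port > 0}
    ]
    |> Enum.reject(fn {_check, ok?} -> ok? end)
    |> case do
      [] -> :ok
      failures -> {:error, Enum.map(failures, &elem(&1, 0))}
    end
  end
end
```

Running this before `Node.start/2` turns a confusing runtime connection failure into an explicit configuration-drift report.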

How this fits into the projects This concept is central to this project and directly prepares you for P12 and P13.

Definitions & key terms

  • Node longname: A fully-qualified node identity used for cross-host clustering.
  • Distribution cookie: Shared secret used for node authentication.
  • Gateway node: Node dedicated to inter-site connectivity.
  • Topology contract: Explicit list of allowed node-to-node links.

Mental model diagram

[Site A mesh] --- [Gateway A] ====WAN==== [Gateway B] --- [Site B mesh]
    |                  |                     |                 |
 local workers      monitored link      monitored link     local workers

How it works (step-by-step, with invariants and failure modes)

  1. Start local site nodes with stable names and shared cookie policy.
  2. Establish local mesh links.
  3. Bring up gateway nodes with inter-site routes.
  4. Validate cross-site node connectivity.
  5. Invariant: local services keep running if WAN fails.
  6. Failure modes: node flapping, partial connectivity, stale routes.

Minimal concrete example

# Elixir sketch; node names are illustrative, and all nodes must share a cookie.
site_a = [:"edge_us_1@10.0.1.11", :"edge_us_2@10.0.1.12"]
site_b = [:"edge_eu_1@10.1.1.11", :"edge_eu_2@10.1.1.12"]
Enum.each(site_a, &Node.connect/1)    # local mesh, site A
Enum.each(site_b, &Node.connect/1)    # local mesh, site B
Node.connect(:"gateway_eu@10.1.1.1")  # run on gateway_us: the only cross-site link

Common misconceptions

  • “If two nodes can ping once, distribution is healthy.” Health requires sustained checks.
  • “Private network means secure enough.” Distribution still needs explicit hardening.

Check-your-understanding questions

  1. Why should cross-site links be centralized through gateways?
  2. What breaks when longnames are inconsistent across nodes?
  3. Why is topology observability mandatory for operations?

Check-your-understanding answers

  1. It limits blast radius and simplifies failure diagnosis.
  2. Nodes fail to authenticate or route correctly.
  3. Without visibility, partitions and drift are hard to detect.

Real-world applications

  • Multi-region internal services
  • Private edge deployments
  • High-availability cluster control planes

Where you’ll apply it

  • Project 11 as the core implementation
  • Project 12 for partition drills
  • Project 13 for cross-site event routing

References

  • Distributed Erlang documentation
  • SSL distribution documentation
  • OTP supervision principles

Key insights Topology is architecture, not an implementation detail.

Summary A multi-network BEAM cluster succeeds when connectivity, security, and observability are designed together.

Homework/Exercises to practice the concept

  1. Draw a three-site topology with explicit allowed links.
  2. Define a preflight checklist for starting a new node.

Solutions to the homework/exercises

  1. Use per-site meshes and gateway-only inter-site links.
  2. Validate naming, cookie, certs, ports, and reachability.

3. Project Specification

3.1 What You Will Build

A cluster bootstrap toolkit with commands to show topology, validate links, and confirm cross-network process reachability.

3.2 Functional Requirements

  1. Topology Discovery: Show site, node, and link state.
  2. Connectivity Validation: Probe cross-site messaging paths.
  3. Security Mode Awareness: Report whether secure distribution is enabled.

3.3 Non-Functional Requirements

  • Performance: Health checks complete quickly enough for operator use.
  • Reliability: False positives/negatives are minimized.
  • Usability: Output is explicit and actionable.

3.4 Example Usage / Output

$ clusterctl topology
site=us-east ...

3.5 Data Formats / Schemas / Protocols

  • Node identity records
  • Site grouping metadata
  • Link health events

3.6 Edge Cases

  • One-way connectivity
  • Gateway failover
  • Cookie mismatch during rolling restart

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • Start nodes per site with your environment bootstrap script.
  • Run clusterctl topology and clusterctl ping probes.

3.7.2 Golden Path Demo (Deterministic)

Two sites join, probes succeed, topology reports all expected links.

3.7.3 If CLI: exact terminal transcript

$ clusterctl topology
site=us-east nodes=[edge-us-1,edge-us-2]
site=eu-west nodes=[edge-eu-1,edge-eu-2]
links=8 secure=true
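A transcript in that shape could come from a small rendering function. This is a sketch only: the `Reporter` module name and its inputs (ordered site tuples, a link count, a secure flag) are assumptions, not part of the toolkit specification.

```elixir
defmodule Reporter do
  @moduledoc "Sketch: render topology lines in the clusterctl output shape."

  # sites: ordered list of {site_name, [node_name]} tuples
  def render(sites, link_count, secure?) do
    site_lines =
      Enum.map(sites, fn {site, nodes} ->
        "site=#{site} nodes=[#{Enum.join(nodes, ",")}]"
      end)

    site_lines ++ ["links=#{link_count} secure=#{secure?}"]
  end
end
```

Keeping output in a stable `key=value` shape gives operators something both human-readable and machine-parseable, which the non-functional requirements ask for.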

4. Solution Architecture

4.1 High-Level Design

[Discovery] -> [Topology Model] -> [Health Prober] -> [CLI Reporter]

4.2 Key Components

Discovery: builds the node graph. Key decision: active probing vs. passive events.
Health Prober: validates the messaging path. Key decision: probe interval and timeout.
Reporter: renders operator output. Key decision: human-readable plus machine-parseable format.

4.3 Data Structures (No Full Code)

  • Site map keyed by region
  • Node metadata with connectivity state
  • Link status records with timestamp
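The record shapes above can be sketched as Elixir structs. Field names and the status atoms are assumptions chosen for illustration; the spec deliberately leaves the exact schema to you.

```elixir
defmodule Topo do
  @moduledoc "Sketch of topology model records; field names illustrative."

  defmodule NodeInfo do
    # state: :unknown | :up | :down
    defstruct [:name, :site, state: :unknown]
  end

  defmodule Link do
    # status: :unprobed | :ok | {:error, reason}; checked_at is a timestamp
    defstruct [:from, :to, status: :unprobed, checked_at: nil]
  end
end

# Site map keyed by region (node names illustrative):
sites = %{
  "us-east" => [%Topo.NodeInfo{name: :"edge-us-1@10.0.1.11", site: "us-east"}],
  "eu-west" => [%Topo.NodeInfo{name: :"edge-eu-1@10.1.1.11", site: "eu-west"}]
}
```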

4.4 Algorithm Overview

Key Algorithm: Topology Validation Loop

  1. Enumerate expected links from topology contract.
  2. Probe each link.
  3. Mark success/failure and summarize by site.

Complexity Analysis:

  • Time: O(L) where L is number of expected links
  • Space: O(N + L)
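The validation loop above can be sketched as a pure function with the probe injected, which keeps the O(L) loop deterministic and testable without a live cluster. The module name and result shape are assumptions.

```elixir
defmodule Validator do
  @moduledoc "Sketch of the topology validation loop; probe function is injected."

  # contract: list of {from, to} expected links
  # probe: fun/2 returning :ok | {:error, reason} for one link
  def run(contract, probe) do
    results = Enum.map(contract, fn {from, to} -> {{from, to}, probe.(from, to)} end)

    %{
      ok: for({link, :ok} <- results, do: link),
      failed: for({link, {:error, reason}} <- results, do: {link, reason})
    }
  end
end
```

In the real toolkit the probe would send a round-trip message across the link; in tests it can be a stub, so healthy and degraded summaries are both reproducible.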

5. Implementation Guide

5.1 Development Environment Setup

# use your normal mix/rebar3 environment, e.g.:
mix new clusterctl --sup

5.2 Project Structure

project-root/
├── lib/
│   ├── topology.ex
│   ├── probe.ex
│   └── cli.ex
├── test/
└── README.md

5.3 The Core Question You’re Answering

“How do I keep distributed-node operations deterministic when infrastructure is non-deterministic?”

5.4 Concepts You Must Understand First

  1. Node naming and distribution identity
  2. Secure distribution channels
  3. Supervisor boundaries for infrastructure processes

5.5 Questions to Guide Your Design

  1. Which links are mandatory and which are optional?
  2. What exact signals define healthy vs degraded state?
  3. How do you expose failures so operators can act fast?